Optimization

Selecting the optimal score() coefficients in a complex ruleset is tricky, but it can have a huge effect on accuracy. Manual tweaking, supported by a laboriously constructed test harness, becomes untenable with more than a few coefficients, so we recommend letting the machine figure them out.

The Annealer

To prepare your ruleset for automatic optimization, replace its scoring and other scaling constants with JS variables, and wrap it in a function which takes a series of values and plops them into those variables:

/**
 * @return A ruleset for extracting the paragraph with the most textual
 * content.
 */
function tunedRuleset(coeffLength = 3,
                      coeffParagraphTag = 2,
                      coeffLinkDensity = 5) {
    // If you use the best coefficients you've found so far as default
    // arg values, it makes tunedRuleset() a convenient production API
    // while allowing for future optimization runs.

    function scoreByLength(fnode) {
        const length = inlineTextLength(fnode.element) * coeffLength;
        return {score: length};
    }

    return ruleset(
        rule(dom('p,div,li,code,blockquote,pre,h1,h2,h3,h4,h5,h6'),
             props(scoreByLength).type('paragraphish')),
        rule(type('paragraphish'),
             score(fnode => (1 - linkDensity(fnode)) * coeffLinkDensity)),
        rule(dom('p'),
             score(coeffParagraphTag).type('paragraphish')),
        rule(type('paragraphish').max(),
             out('longestParagraph'))
    );
}

Fathom provides a numerical optimizer based on simulated annealing. This algorithm is a particularly good fit for the staircase functions that commonly come up when measuring the efficacy of Fathom output [1]. Continuous methods like Powell’s tend to sit in the middle of a stair, look to the right, look to the left, see no improvement, and declare premature victory, while annealing has enough randomness to shake itself out of such local minima.

Fathom’s optimizer is exposed as an abstract class in fathom-web/optimizers:

class Annealer()

Abstract base for simulated annealing runs

This works for fitness functions which are staircase functions, made of vertical falloffs and flat horizontal regions, where continuous numerical optimization methods get stuck. It starts off looking far afield for global minima and gradually shifts its focus to the best local one as time progresses.

More technically, we look around at random for changes that reduce the value of the cost function. Occasionally, a random change that increases cost is incorporated. The chance of incorporating a cost-increasing change lessens as the algorithim progresses.

anneal()

Iterate over a variety of random solutions for a finite time, and return the best we come up with.

Returns:Array.<number> – Coefficients we arrived at
initialSolution()
Returns:Array.<number> – Coefficients to begin the random walk from. The quality of this solution is not very important.
randomTransition(coeffs)
Returns:Array.<number> – Coefficients randomly changed slightly from the passed-in ones
solutionCost(coeffs)
Returns:number – A cost estimate for the passed-in solution, on an arbitrary scale. Lower signifies a better solution.

The last 3 methods are yours to fill in. Don’t try to be clever about them; it’s better to come up with something that runs quickly, because the annealer runs many thousand iterations by default. These simple parametrizations are not unreasonable for the toy ruleset above:

const {Annealer} = require('../optimizers');

class LongestParagraphTuner extends Annealer {
    initialSolution() {
        return [1, 1, 1];
    }

    /** Nudge a random coefficient in a random direction by 0.5. */
    randomTransition(coeffs) {
        const ret = coeffs.slice();  // Make a copy.
        ret[Math.floor(Math.random() * coeffs.length)] += Math.floor(Math.random() * 2) ? -.5 : .5;
        return ret;
    }

    /**
     * Loop through a list of documents whose longest paragraphs (or whatever
     * you're looking for) we already know, run the ruleset over them, and bump
     * up the cost each time it comes up with something different.
     */
    solutionCost(coeffs) {
        let cost = 0;
        for (let [doc, knownBestParagraph] of aListOfKnownGoodSolutions) {
            if (tunedRuleset(...coeffs)
                    .against(doc)
                    .get('longestParagraph')
                    .textContent !== knownBestParagraph) {
                cost += 1;
            }
        }
        return cost;
    }
}

Then all you have to do is run the annealer, and go have a sandwich:

const annealer = new LongestParagraphTuner();
const coeffs = annealer.anneal();
console.log('Tuned coefficients:', coeffs);

For a more complex, real-world example, see readability.js in the examples folder.

The Corpus Framework

Of course, this leaves open the question of how to loop over a collected corpus, the aListOfKnownGoodSolutions above. Fortunately, Fathom also includes a hierarchy of inversion-of-control classes that help frame your answer to this: Run(), Corpus(), and Sample().

We assume that your corpus is a folder of folders on disk, one inner folder for each member, or sample, of the corpus. The folder’s name is considered the name of the sample.

theCorpus
    sampleOne
        source.html
        ...
    sampleTwo
        source.html
        ...
    sampleThree
        source.html
        ...
    ...

Each sample folder contains these files:

source.html
The HTML of the captured page, UTF-8 encoded. Ideally, this should “hold still”, in that it shouldn’t have any JS that will run and mutate the page when you’re trying to examine it while designing your rules. JS will not, of course, be executed as the optimization harness itself runs over your corpus.
whateverOtherFilesYouLike.foo
You can add additional inputs if you like, like known-correct answers or captured metadata that is not expressed in source.html alone. You pull these into properties of your Sample() by overriding its constructor.

The job of your annealer’s solutionCost method is then to do a run over all the samples in your corpus, which can be represented like this:

// Load the corpus from disk.
constructor() {
    super();
    this.corpus = new LongestParagraphCorpus();
}

// Do a run over it, and see how the current coefficients determined by the
// annealer scored.
solutionCost(coeffs) {
    return new LongestParagraphRun(this.corpus, coeffs).score();
}

It’s up to you to implement the subclasses of Corpus() and Run(), as described below.

Corpus Framework Reference

class Run(corpus, coeffs)

A run of a ruleset over an entire supervised corpus of pages

Builds up a total score and reports it at the end.

Run ruleset against every document in the corpus, and make the final score ready for retrieval by calling score() or humanScore().

Arguments:
  • corpus (Corpus) – The documents over which to run the ruleset
  • coeffs (Array.<Number>|undefined) – The coefficients by which to parametrize the ruleset
score()

Return the score for the optimizer to minimize. This should read from scoreParts and return something as efficiently as possible, because the optimizer runs a lot of iterations.

humanScore()

Return a human-readable score, for cosmetic reporting. Like score(), it should read from scoreParts.

class Corpus()

A reusable, caching representation of a group of samples

This solves the problem of jsdom leaking on repeated instantiation and of the performance penalty inherent in re-parsing sample data.

On construct, this loops across the folders inside baseFolder(), caching each as a Sample().

baseFolder()
Returns:String – The path to the folder in which samples live.
sampleFromPath(sampleDirPath)
Returns:Sample – A new Sample() subclass representing the sample embodied by the folder sampleDirPath
class Sample(sampleDir)

One item in a corpus of training or testing data

This assumes a folder surrounds the sample and contains a source.html file containing markup we want to make use of. This lands in doc. name contains the name of the folder. Override the constructor to pull in additional information you’re interested in.

Arguments:
  • sampleDir (String) – Path to the folder representing the sample, containing source.html and other files at your discretion
doc

The DOM of the HTML document I represent

name

The name of the folder this sample came from

[1]This assumes a cost function which has sudden jumps, like when measuring the hard-edged inclusion or exclusion of nodes in the final output. If you can contrive a smoother one which, for instance, examines scores on nodes before they pass through hard-edged thresholds like max(), you may be able to take advantage of a continuous optimization method to get better or quicker results.