fathom train

Compute optimal numerical parameters for a Fathom ruleset.

The usual invocation is something like this:

fathom train samples/training --validation-set samples/validation --ruleset rulesets.js --trainee new

The first argument is a directory of labeled training pages. For backward compatibility, it can also be a JSON file of vectors from FathomFox’s Vectorizer.

To see graphs of the loss functions, install TensorBoard, then run tensorboard --logdir runs/. The graphs will tell you whether you need to adjust the --learning-rate.

Definitions of terms used in output:

pruned
Said of a node that was prematurely eliminated from consideration because it did not match the selector of any dom() call in the ruleset.
target
A “right answer”: a labeled, positive DOM node, one that should be recognized.

fathom train [OPTIONS] TRAINING_SET_FOLDER

Options

-r, --ruleset <ruleset>

The rulesets.js file containing your rules. The file must have no imports except from fathom-web, so pre-bundle if necessary.

--trainee <trainee>

The trainee ID of the ruleset you want to train. Usually, this is the same as the type you are training for.

--training-cache <training_cache>

Where to cache training vectors to speed future training runs. Any existing file will be overwritten. [default: vectors/training_yourTraineeId.json next to your ruleset]

--validation-cache <validation_cache>

Where to cache validation vectors to speed future training runs. Any existing file will be overwritten. [default: vectors/validation_yourTraineeId.json next to your ruleset]
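
The default cache locations described above follow a simple pattern. The sketch below shows how such a path could be derived; default_cache_path is a hypothetical helper written for illustration, not a function from Fathom, and it only encodes the pattern stated in the option text.

```python
from pathlib import Path


def default_cache_path(ruleset_path, trainee_id, kind):
    """Build the documented default cache location (hypothetical helper).

    kind is "training" or "validation". The result is
    vectors/<kind>_<trainee_id>.json in the folder that holds the ruleset.
    """
    return Path(ruleset_path).parent / "vectors" / f"{kind}_{trainee_id}.json"


# Example: a ruleset at project/rulesets.js with trainee ID "new".
print(default_cache_path("project/rulesets.js", "new", "training"))
print(default_cache_path("project/rulesets.js", "new", "validation"))
```

Because any existing file at these locations is overwritten, pass --training-cache or --validation-cache explicitly if you want to keep vectors from an earlier run.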

--delay <delay>

Number of seconds to wait for a page to load before vectorizing it [default: 5]

--tabs <tabs>

Number of concurrent browser tabs to use while vectorizing [default: 16]

--show-browser

Show browser window while vectorizing. (Browser runs in headless mode by default.)

-a, --validation-set <validation_set>

Either a folder of validation pages or a JSON file made manually by FathomFox’s Vectorizer. Validation pages are used to avoid overfitting.

-s, --stop-early, --no-early-stopping

Stop 1 iteration before validation loss begins to rise, to avoid overfitting. Before using this, check the TensorBoard graphs to make sure validation loss is monotonically decreasing. [default: True]
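
To illustrate the idea, here is a minimal sketch of one way to read "stop 1 iteration before validation loss begins to rise". It is not Fathom's actual implementation; early_stop_iteration is a hypothetical name, and the loss values are made up for the example.

```python
def early_stop_iteration(validation_losses):
    """Return the index of the last iteration before validation loss first rises.

    Illustrative only: assumes one validation loss value per iteration.
    """
    if not validation_losses:
        raise ValueError("need at least one validation loss value")
    for i in range(1, len(validation_losses)):
        if validation_losses[i] > validation_losses[i - 1]:
            # Loss rose at iteration i, so keep the state from iteration i - 1.
            return i - 1
    # Loss never rose: train through the final iteration.
    return len(validation_losses) - 1


# Made-up losses: the first rise happens at index 3, so stop at index 2.
print(early_stop_iteration([0.9, 0.7, 0.6, 0.65, 0.5]))
```

The same sketch shows why the loss curve should be checked first: with a noisy curve such as [0.9, 0.95, 0.4], this rule would stop at index 0 and miss the much lower loss that follows.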

-l, --learning-rate <learning_rate>

The learning rate to start from [default: 1.0]

-i, --iterations <iterations>

The number of training iterations to run through [default: 1000]

-p, --pos-weight <pos_weight>

The weighting factor given to all positive samples by the loss function. Raise this to increase recall at the expense of precision. See: https://pytorch.org/docs/stable/nn.html#bcewithlogitsloss
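
As a rough illustration of what the weighting does, the sketch below computes binary cross-entropy on a raw logit for a single sample, with the positive term scaled by pos_weight as in the PyTorch loss linked above. It is a plain-Python stand-in for explanation, not the code Fathom runs; bce_with_logits is a name chosen for this example.

```python
import math


def bce_with_logits(logit, label, pos_weight=1.0):
    """Binary cross-entropy on a raw logit; positive samples are scaled by pos_weight."""
    # log(sigmoid(x)), computed in a numerically stable way for either sign of x.
    if logit >= 0:
        log_sig = -math.log1p(math.exp(-logit))
    else:
        log_sig = logit - math.log1p(math.exp(logit))
    # log(1 - sigmoid(x)) equals log(sigmoid(x)) - x.
    log_one_minus_sig = log_sig - logit
    return -(pos_weight * label * log_sig + (1 - label) * log_one_minus_sig)


# A missed positive (label 1, low logit) costs three times as much with pos_weight=3.
print(bce_with_logits(-2.0, 1, pos_weight=1.0))
print(bce_with_logits(-2.0, 1, pos_weight=3.0))
# Negative samples are unaffected by pos_weight.
print(bce_with_logits(-2.0, 0, pos_weight=3.0))
```

Since only the positive term is scaled, raising the weight makes missed targets more expensive than false alarms, which is why recall tends to rise while precision falls.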

-c, --comment <comment>

Additional comment to append to the TensorBoard run name, for display in the web UI

-q, --quiet

Hide per-tag diagnostics that may help with ruleset debugging.

-y, --layer <layers>

Add a hidden layer of the given size. You can specify more than one, and they will be connected in the given order. EXPERIMENTAL.
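
To picture what "connected in the given order" means, the sketch below lists the input and output width of each connection for a given sequence of hidden layer sizes. It assumes a single output value, which is a guess consistent with a binary classifier rather than something this page states; layer_shapes and the feature count are hypothetical.

```python
def layer_shapes(num_features, hidden_sizes):
    """List (inputs, outputs) for each connection, in order.

    Assumption: the network ends in a single output value.
    """
    sizes = [num_features, *hidden_sizes, 1]
    return list(zip(sizes, sizes[1:]))


# Five input features with hidden layers of size 10 and then 3.
print(layer_shapes(5, [10, 3]))
# No hidden layers: inputs connect directly to the output.
print(layer_shapes(5, []))
```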

-x, --exclude <exclude>

Exclude a rule while training. This helps with before-and-after tests to see if a rule is effective.

Arguments

TRAINING_SET_FOLDER

Required argument