fathom train

Compute optimal numerical parameters for a Fathom ruleset.

The usual invocation is something like this:

fathom train samples/training --validation-set samples/validation --ruleset rulesets.js --trainee new

The first argument is a directory of labeled training pages. It can also be, for backward compatibility, a JSON file of vectors from FathomFox’s Vectorizer.

To see graphs of loss functions, install TensorBoard, then run tensorboard --logdir runs/. These will tell you whether you need to adjust the --learning-rate.

Definitions of terms used in output:

pruned
    Said of a node that was prematurely eliminated from consideration
    because it did not match the selector of any dom() call in the
    ruleset

target
    A “right answer”: a labeled, positive DOM node, one that should be
    recognized


-r, --ruleset <ruleset>

The rulesets.js file containing your rules. The file must have no imports except from fathom-web, so pre-bundle if necessary.

--trainee <trainee>

The trainee ID of the ruleset you want to train. Usually, this is the same as the type you are training for.

--training-cache <training_cache>

Where to cache training vectors to speed future training runs. Any existing file will be overwritten. [default: vectors/training_yourTraineeId.json next to your ruleset]

--validation-cache <validation_cache>

Where to cache validation vectors to speed future training runs. Any existing file will be overwritten. [default: vectors/validation_yourTraineeId.json next to your ruleset]

--delay <delay>

Number of seconds to wait for a page to load before vectorizing it [default: 5]

--tabs <tabs>

Number of concurrent browser tabs to use while vectorizing [default: 16]


--show-browser

Show browser window while vectorizing. (Browser runs in headless mode by default.)

-a, --validation-set <validation_set>

Either a folder of validation pages or a JSON file made manually by FathomFox’s Vectorizer. Validation pages are used to avoid overfitting.

-s, --stop-early / --no-early-stopping

Stop 1 iteration before validation loss begins to rise, to avoid overfitting. Before using this, check TensorBoard graphs to make sure validation loss is monotonically decreasing. [default: True]
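
The stopping rule can be sketched in a few lines. This is an illustrative reimplementation of the behavior described above, not the trainer's actual code:

```python
def stop_early_index(validation_losses):
    """Return the index of the iteration to stop at: 1 iteration
    before validation loss first begins to rise.

    Illustrative sketch of the --stop-early rule, not Fathom's code.
    """
    for i in range(1, len(validation_losses)):
        if validation_losses[i] > validation_losses[i - 1]:
            return i - 1  # stop 1 iteration before the rise
    return len(validation_losses) - 1  # loss never rose; keep the last iteration

# Validation loss falls, then rises at index 3, so we stop at index 2:
print(stop_early_index([0.9, 0.7, 0.6, 0.65, 0.5]))  # → 2
```

If validation loss is noisy rather than smoothly falling, the first uptick may be spurious, which is why the docs above suggest eyeballing the TensorBoard curves before relying on this flag.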

-l, --learning-rate <learning_rate>

The learning rate to start from [default: 1.0]

-i, --iterations <iterations>

The number of training iterations to run through [default: 1000]

-p, --pos-weight <pos_weight>

The weighting factor given to all positive samples by the loss function. Raise this to increase recall at the expense of precision. See: https://pytorch.org/docs/stable/nn.html#bcewithlogitsloss
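
For intuition, here is a small pure-Python sketch of the weighted binary cross-entropy formula documented for BCEWithLogitsLoss. The function name is hypothetical and this is not Fathom's code:

```python
import math

def weighted_bce_with_logits(logit, label, pos_weight=1.0):
    """Binary cross-entropy on a raw logit, with positive samples
    up-weighted by pos_weight, following the formula in PyTorch's
    BCEWithLogitsLoss docs. Illustrative sketch, not Fathom's code."""
    p = 1.0 / (1.0 + math.exp(-logit))  # sigmoid
    return -(pos_weight * label * math.log(p)
             + (1.0 - label) * math.log(1.0 - p))

# A missed positive (logit -2, label 1) costs 3x more with pos_weight=3,
# nudging training toward recall; negative samples are unaffected.
print(weighted_bce_with_logits(-2.0, 1.0))                 # ≈ 2.127
print(weighted_bce_with_logits(-2.0, 1.0, pos_weight=3))   # ≈ 6.381
```

Because only the positive term is scaled, raising --pos-weight makes false negatives more expensive relative to false positives, which is exactly the recall-for-precision trade described above.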

-c, --comment <comment>

Additional comment to append to the TensorBoard run name, for display in the web UI

-q, --quiet

Hide per-tag diagnostics that may help with ruleset debugging.

-y, --layer <layers>

Add a hidden layer of the given size. You can specify more than one, and they will be connected in the given order. EXPERIMENTAL.
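
To see how repeated --layer flags chain together, here is a hypothetical helper computing the weight-matrix shapes such a stack implies, assuming the network ends in a single output unit that yields the confidence. This is an illustration, not the trainer's code:

```python
def layer_shapes(n_rules, hidden_sizes):
    """Weight-matrix shapes implied by a stack of --layer flags:
    each hidden layer connects to the previous one in the order
    given, and a final single unit produces the confidence.
    Hypothetical helper for illustration, not Fathom's code."""
    sizes = [n_rules] + list(hidden_sizes) + [1]
    return [(sizes[i], sizes[i + 1]) for i in range(len(sizes) - 1)]

# fathom train ... -y 10 -y 5, with a ruleset emitting 4 rule scores:
print(layer_shapes(4, [10, 5]))  # → [(4, 10), (10, 5), (5, 1)]
```

With no --layer flags, the stack collapses to a single (n_rules, 1) matrix, i.e. the plain linear model.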

-x, --exclude <exclude>

Exclude a rule while training. This helps with before-and-after tests to see if a rule is effective.



TRAINING_SET

Required argument