Compute optimal numerical parameters for a Fathom ruleset.
The usual invocation is something like this:
fathom train samples/training --validation-set samples/validation --ruleset rulesets.js --trainee new
The first argument is a directory of labeled training pages. It can also be, for backward compatibility, a JSON file of vectors from FathomFox’s Vectorizer.
To see graphs of loss functions, install TensorBoard, then run
tensorboard --logdir runs/. These will tell you whether you need to
Definitions of terms used in output:
dom() call in the
fathom train [OPTIONS] TRAINING_SET_FOLDER
The rulesets.js file containing your rules. The file must have no imports except from fathom-web, so pre-bundle if necessary.
The trainee ID of the ruleset you want to train. Usually, this is the same as the type you are training for.
Where to cache training vectors to speed future training runs. Any existing file will be overwritten. [default: vectors/training_yourTraineeId.json next to your ruleset]
Where to cache validation vectors to speed future training runs. Any existing file will be overwritten. [default: vectors/validation_yourTraineeId.json next to your ruleset]
Number of seconds to wait for a page to load before vectorizing it [default: 5]
Number of concurrent browser tabs to use while vectorizing [default: 16]
Show browser window while vectorizing. (Browser runs in headless mode by default.)
Either a folder of validation pages or a JSON file made manually by FathomFox’s Vectorizer. Validation pages are used to avoid overfitting.
Stop one iteration before validation loss begins to rise, to avoid overfitting. Before using this, check the TensorBoard graphs to make sure validation loss is monotonically decreasing. [default: True]
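As an illustration of the stopping rule described above, here is a minimal sketch. It is not Fathom's actual implementation; the function name early_stop_index and the sample loss values are hypothetical.

```python
def early_stop_index(validation_losses):
    """Return the index of the last iteration before validation loss first rises.

    If the loss never rises, return the index of the final iteration.
    """
    if not validation_losses:
        raise ValueError("need at least one validation loss")
    for i in range(1, len(validation_losses)):
        # The first increase marks where overfitting may have begun.
        if validation_losses[i] > validation_losses[i - 1]:
            return i - 1
    return len(validation_losses) - 1


# Hypothetical per-iteration validation losses, made up for illustration.
losses = [0.90, 0.62, 0.48, 0.45, 0.47, 0.44]
print(early_stop_index(losses))  # 3
```

In this made-up series the loss dips again after the first rise, so a rule of this kind stops at iteration 3 and never reaches the lower value at iteration 5. That is why the graphs are worth checking before relying on early stopping.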
The learning rate to start from [default: 1.0]
The number of training iterations to run through [default: 1000]
The weighting factor given to all positive samples by the loss function. Raise this to increase recall at the expense of precision. See: https://pytorch.org/docs/stable/nn.html#bcewithlogitsloss
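To make the effect of the positive weight concrete, the following pure-Python sketch computes binary cross-entropy on logits with a weight applied to positive samples, following the formula documented for PyTorch's BCEWithLogitsLoss. The function name and the sample values are hypothetical, and Fathom itself may compute the loss differently.

```python
import math


def weighted_bce_with_logits(logits, labels, pos_weight=1.0):
    """Mean binary cross-entropy on raw logits, with positives scaled by pos_weight."""
    total = 0.0
    for x, y in zip(logits, labels):
        # Numerically stable log(sigmoid(x)).
        if x >= 0:
            log_p = -math.log1p(math.exp(-x))
        else:
            log_p = x - math.log1p(math.exp(x))
        # log(1 - sigmoid(x)) equals log(sigmoid(x)) - x.
        log_not_p = log_p - x
        total -= pos_weight * y * log_p + (1 - y) * log_not_p
    return total / len(logits)


# Hypothetical samples: a missed positive (logit -2.0) and a true negative.
logits = [-2.0, -3.0]
labels = [1, 0]
plain = weighted_bce_with_logits(logits, labels, pos_weight=1.0)
raised = weighted_bce_with_logits(logits, labels, pos_weight=3.0)
print(raised > plain)  # True
```

With a larger weight, a missed positive costs more than before while errors on negatives cost the same, so training is pushed toward finding more positives, at the risk of more false positives.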
Additional comment to append to the TensorBoard run name, for display in the web UI
Hide per-tag diagnostics that may help with ruleset debugging.
Add a hidden layer of the given size. You can specify more than one, and they will be connected in the given order. EXPERIMENTAL.
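The sketch below shows what "connected in the given order" means in terms of layer shapes. It assumes one input per rule feature and a single output value, which is an assumption about the model rather than something stated here; layer_shapes is a hypothetical helper, not part of Fathom.

```python
def layer_shapes(num_features, hidden_sizes):
    """Return (inputs, outputs) for each layer, chaining hidden layers in order."""
    # Assumption: the network ends in a single output value.
    sizes = [num_features, *hidden_sizes, 1]
    return list(zip(sizes, sizes[1:]))


print(layer_shapes(5, []))      # [(5, 1)]
print(layer_shapes(5, [8, 4]))  # [(5, 8), (8, 4), (4, 1)]
```

With no hidden layers the features map straight to the output. Each hidden layer given adds one link to the chain, in the order specified.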
Exclude a rule while training. This helps with before-and-after tests to see if a rule is effective.