fathom train

Compute optimal numerical parameters for a Fathom ruleset.

The usual invocation is something like this:

fathom train samples/training --validation-set samples/validation --ruleset rulesets.js --trainee new

The first argument is a directory of labeled training pages. It can also be, for backward compatibility, a JSON file of vectors from FathomFox’s Vectorizer.

To see graphs of loss functions, install TensorBoard, then run tensorboard --logdir runs/. These will tell you whether you need to adjust the --learning-rate.

Definitions of terms used in output:

pruned
    Said of a node that was prematurely eliminated from consideration
    because it did not match the selector of any dom() call in the
    ruleset

target
    A “right answer”: a labeled, positive DOM node, one that should be
    recognized


-r, --ruleset <ruleset>

The rulesets.js file containing your rules. The file must have no imports except from fathom-web, so pre-bundle if necessary.

--trainee <trainee>

The trainee ID of the ruleset you want to train. Usually, this is the same as the type you are training for.

--training-cache <training_cache>

Where to cache training vectors to speed future training runs. Any existing file will be overwritten. [default: vectors/training_yourTraineeId.json next to your ruleset]

--validation-cache <validation_cache>

Where to cache validation vectors to speed future training runs. Any existing file will be overwritten. [default: vectors/validation_yourTraineeId.json next to your ruleset]

--delay <delay>

Number of seconds to wait for a page to load before vectorizing it [default: 5]

--tabs <tabs>

Number of concurrent browser tabs to use while vectorizing [default: 16]


--show-browser

Show browser window while vectorizing. (Browser runs in headless mode by default.)

-a, --validation-set <validation_set>

Either a folder of validation pages or a JSON file made manually by FathomFox’s Vectorizer. Validation pages are used to avoid overfitting.

-s, --stop-early / --no-early-stopping

Stop 1 iteration before validation loss begins to rise, to avoid overfitting. Before using this, check TensorBoard graphs to make sure validation loss is monotonically decreasing. [default: True]
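
The stopping rule can be sketched in a few lines. This is an illustrative reimplementation of the behavior described above, not the trainer's actual code:

```python
def stop_early_index(validation_losses):
    """Return the index of the iteration to stop at: 1 iteration
    before validation loss first begins to rise.

    Illustrative sketch of the --stop-early rule, not Fathom's code.
    """
    for i in range(1, len(validation_losses)):
        if validation_losses[i] > validation_losses[i - 1]:
            return i - 1  # stop 1 iteration before the rise
    return len(validation_losses) - 1  # loss never rose; keep the last iteration

# Validation loss falls, then rises at index 3, so we stop at index 2:
print(stop_early_index([0.9, 0.7, 0.6, 0.65, 0.5]))  # → 2
```

If validation loss is noisy rather than smoothly falling, the first uptick may be spurious, which is why the docs above suggest eyeballing the TensorBoard curves before relying on this flag.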

-l, --learning-rate <learning_rate>

The learning rate to start from [default: 1.0]

-i, --iterations <iterations>

The number of training iterations to run through [default: 1000]

-p, --pos-weight <pos_weight>

The weighting factor given to all positive samples by the loss function. Raise this to increase recall at the expense of precision. See: https://pytorch.org/docs/stable/nn.html#bcewithlogitsloss
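
For intuition, here is a small pure-Python sketch of the weighted binary cross-entropy formula documented for BCEWithLogitsLoss. The function name is hypothetical and this is not Fathom's code:

```python
import math

def weighted_bce_with_logits(logit, label, pos_weight=1.0):
    """Binary cross-entropy on a raw logit, with positive samples
    up-weighted by pos_weight, following the formula in PyTorch's
    BCEWithLogitsLoss docs. Illustrative sketch, not Fathom's code."""
    p = 1.0 / (1.0 + math.exp(-logit))  # sigmoid
    return -(pos_weight * label * math.log(p)
             + (1.0 - label) * math.log(1.0 - p))

# A missed positive (logit -2, label 1) costs 3x more with pos_weight=3,
# nudging training toward recall; negative samples are unaffected.
print(weighted_bce_with_logits(-2.0, 1.0))                 # ≈ 2.127
print(weighted_bce_with_logits(-2.0, 1.0, pos_weight=3))   # ≈ 6.381
```

Because only the positive term is scaled, raising --pos-weight makes false negatives more expensive relative to false positives, which is exactly the recall-for-precision trade described above.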

-c, --comment <comment>

Additional comment to append to the TensorBoard run name, for display in the web UI

-q, --quiet

Hide per-tag diagnostics that may help with ruleset debugging.

-y, --layer <layers>

Add a hidden layer of the given size. You can specify more than one, and they will be connected in the given order. EXPERIMENTAL.
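
To see how repeated --layer flags chain together, here is a hypothetical helper computing the weight-matrix shapes such a stack implies, assuming the network ends in a single output unit that yields the confidence. This is an illustration, not the trainer's code:

```python
def layer_shapes(n_rules, hidden_sizes):
    """Weight-matrix shapes implied by a stack of --layer flags:
    each hidden layer connects to the previous one in the order
    given, and a final single unit produces the confidence.
    Hypothetical helper for illustration, not Fathom's code."""
    sizes = [n_rules] + list(hidden_sizes) + [1]
    return [(sizes[i], sizes[i + 1]) for i in range(len(sizes) - 1)]

# fathom train ... -y 10 -y 5, with a ruleset emitting 4 rule scores:
print(layer_shapes(4, [10, 5]))  # → [(4, 10), (10, 5), (5, 1)]
```

With no --layer flags, the stack collapses to a single (n_rules, 1) matrix, i.e. the plain linear model.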

-x, --exclude <exclude>

Exclude a rule while training. This helps with before-and-after tests to see if a rule is effective.



TRAINING_SET

Required argument