fathom train
Compute optimal numerical parameters for a Fathom ruleset.
The usual invocation is something like this:
fathom train samples/training --validation-set samples/validation --ruleset rulesets.js --trainee new
The first argument is a directory of labeled training pages. It can also be, for backward compatibility, a JSON file of vectors from FathomFox’s Vectorizer.
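For the vector-file form, the invocation looks the same with vector files swapped in for the sample directories (the filenames here are illustrative):

fathom train vectors/training_new.json --validation-set vectors/validation_new.json --ruleset rulesets.js --trainee new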
To see graphs of loss functions, install TensorBoard, then run tensorboard --logdir runs/. These will tell you whether you need to adjust the --learning-rate.
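For example (installing TensorBoard with pip is just one option; any install method works):

pip install tensorboard
tensorboard --logdir runs/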
fathom train [OPTIONS] TRAINING_SET_FOLDER
Options
-r, --ruleset <ruleset>
    The rulesets.js file containing your rules. The file must have no imports except from fathom-web, so pre-bundle if necessary. (A rough sketch of such a file appears at the end of this page.)
--trainee <trainee>
    The trainee ID of the ruleset you want to train. Usually, this is the same as the type you are training for.

--training-cache <training_cache>
    Where to cache training vectors to speed future training runs. Any existing file will be overwritten. [default: vectors/training_yourTraineeId.json next to your ruleset]

--validation-cache <validation_cache>
    Where to cache validation vectors to speed future training runs. Any existing file will be overwritten. [default: vectors/validation_yourTraineeId.json next to your ruleset]

--delay <delay>
    Number of seconds to wait for a page to load before vectorizing it [default: 5]

--tabs <tabs>
    Number of concurrent browser tabs to use while vectorizing [default: 16]

--show-browser
    Show browser window while vectorizing. (Browser runs in headless mode by default.)

-a, --validation-set <validation_set>
    Either a folder of validation pages or a JSON file made manually by FathomFox's Vectorizer. Validation pages are used to avoid overfitting.
-s, --stop-early / --no-early-stopping
    Stop 1 iteration before validation loss begins to rise, to avoid overfitting. Before using this, check TensorBoard graphs to make sure validation loss is monotonically decreasing. [default: True]
-l, --learning-rate <learning_rate>
    The learning rate to start from [default: 1.0]

-i, --iterations <iterations>
    The number of training iterations to run through [default: 1000]

-p, --pos-weight <pos_weight>
    The weighting factor given to all positive samples by the loss function. Raise this to increase recall at the expense of precision. See: https://pytorch.org/docs/stable/nn.html#bcewithlogitsloss
-c, --comment <comment>
    Additional comment to append to the TensorBoard run name, for display in the web UI
-q, --quiet
    Hide per-tag diagnostics that may help with ruleset debugging.
-y, --layer <layers>
    Add a hidden layer of the given size. You can specify more than one, and they will be connected in the given order. EXPERIMENTAL.

-x, --exclude <exclude>
    Exclude a rule while training. This helps with before-and-after tests to see if a rule is effective.
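Putting several of the options above together, a run that weights positive samples more heavily, adds two hidden layers, and excludes one rule for a before-and-after comparison might look like this (the rule name and values are purely illustrative):

fathom train samples/training --validation-set samples/validation --ruleset rulesets.js --trainee new --pos-weight 3 --layer 10 --layer 5 --exclude someRule --comment "pos-weight 3, two hidden layers"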
Arguments
TRAINING_SET_FOLDER
    Required argument
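For reference, the rulesets.js file passed to --ruleset must import only from fathom-web and must export the trainee named by --trainee. The sketch below shows the rough shape such a file takes under FathomFox's trainee conventions; the feature function, rule names, and coefficient values are placeholders, and the exact trainee fields should be checked against the Fathom documentation for your version:

import {dom, out, rule, ruleset, score, type} from 'fathom-web';

// Hypothetical scoring callback: favors wide elements.
function isWide(fnode) {
    return fnode.element.clientWidth > 400 ? 1 : 0;
}

const trainees = new Map();
trainees.set('new', {
    // Coefficients the trainer optimizes; keys must match rule names.
    coeffs: new Map([['wide', 1]]),
    viewportSize: {width: 1100, height: 900},
    vectorType: 'new',
    rulesetMaker: () => ruleset([
        rule(dom('div'), type('new')),
        rule(type('new'), score(isWide), {name: 'wide'}),
        rule(type('new'), out('new')),
    ]),
});

export default trainees;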