Rules and Rulesets

Almost everything on this page is a top-level object in the Fathom library and can be imported like this:

const {
  dom,
  element,
  out,
  rule,
  ruleset
} = require('fathom-web');

Rulesets

The most important Fathom object is the ruleset, an unordered collection of rules. The plain old Ruleset() is what you typically construct, via the ruleset convenience function:

ruleset(rules, coeffs, biases)

A shortcut for creating a new Ruleset(), for symmetry with rule().

class Ruleset(rules, coeffs, biases)

An unbound ruleset. When you bind it by calling against(), the resulting BoundRuleset() will be immutable.

Arguments
  • rules (Array) – Rules returned from rule()

  • coeffs (Map) – A map of rule names to numerical weights, typically returned by the trainer. Example: [['someRuleName', 5.04], ...]. If not given, coefficients default to 1.

  • biases (object) – A map of type names to neural-net biases. These enable accurate confidence estimates. Example: [['someType', -2.08], ...]. If absent, biases default to 0.
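
As a rough illustration of the defaults described above, the following sketch models coeffs and biases as plain Map objects built from the example pairs. getOrDefault is a hypothetical helper, not part of Fathom. It only shows how a missing rule name falls back to 1 and a missing type falls back to 0.

```javascript
// Plain Maps built from [name, value] pairs, mirroring the examples above.
// These are stand-ins for illustration, not Fathom objects.
const coeffs = new Map([['someRuleName', 5.04]]);
const biases = new Map([['someType', -2.08]]);

// Hypothetical helper: return the stored value, or the fallback if absent.
function getOrDefault(map, key, fallback) {
  return map.has(key) ? map.get(key) : fallback;
}

const trainedCoeff = getOrDefault(coeffs, 'someRuleName', 1);  // present in the map
const defaultCoeff = getOrDefault(coeffs, 'untrainedRule', 1); // absent: default of 1
const defaultBias = getOrDefault(biases, 'otherType', 0);      // absent: default of 0
```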

Ruleset.against(doc)

Commit this ruleset to running against a specific DOM tree or subtree.

When run against a subtree, the root of the subtree is not considered as a possible match.

This doesn’t actually modify the Ruleset but rather returns a fresh BoundRuleset(), which contains caches and other stateful, per-DOM bric-a-brac.

Ruleset.rules()

Return all the rules (both inward and outward) that make up this ruleset.

From this, you can construct another ruleset like this one but with your own rules added.

Once you have a Ruleset(), call Ruleset.against() to get back a BoundRuleset(), which is specific to a given DOM tree. From that, you pull answers.

class BoundRuleset(inRules, outRules)

A ruleset that is earmarked to analyze a certain DOM.

Carries a cache of rule results on that DOM. Typically comes from against().

Arguments
  • inRules (Array) – Non-out() rules

  • outRules (Map) – Output key -> out() rule

BoundRuleset.get(thing)

Return an array of zero or more fnodes.

Arguments
  • thing (string|Lhs|Node) – Can be (1) A string which matches up with an “out” rule in the ruleset. If the out rule uses through(), the results of through’s callback (which might not be fnodes) will be returned. (2) An arbitrary LHS which we calculate and return the results of. (3) A DOM node, for which we will return the corresponding fnode. Results are cached for cases (1) and (3).

BoundRuleset.setCoeffsAndBiases(coeffs, biases)

Change my coefficients and biases after construction.

Arguments
  • coeffs – See the Ruleset() constructor.

  • biases – See the Ruleset() constructor.

Rules

These are the control structures which govern the flow of scores, types, and notes through a ruleset. You construct a rule by calling rule() and passing it a left-hand side and a right-hand side:

rule(lhs, rhs, options)

Construct and return the proper type of rule class based on the inwardness/outwardness of the RHS.

Arguments
  • lhs (Lhs) – The left-hand side of the rule

  • rhs (Rhs) – The right-hand side of the rule

  • options (object) – Other, optional information about the rule. Currently, the only recognized option is name, which points to a string that uniquely identifies this rule in a ruleset. The name correlates this rule with one of the coefficients passed into ruleset(). If no name is given, an identifier is assigned based on the index of this rule in the ruleset, but that is, of course, brittle.
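
To see why index-based identifiers are brittle, consider the sketch below. The naming scheme (rule0, rule1, and so on) is purely hypothetical and is not the format Fathom uses. The point is only that adding a rule shifts every index-derived name, so trained coefficients would no longer line up with the rules they were trained for.

```javascript
// Hypothetical naming scheme for illustration: use the explicit name if
// given, otherwise fall back to a name derived from the rule's index.
function effectiveNames(rules) {
  return rules.map((r, index) => (r.name !== undefined ? r.name : `rule${index}`));
}

const unnamedA = {};  // stand-ins for rules constructed without a name
const unnamedB = {};
const named = {name: 'hasTitleTag'};

const before = effectiveNames([named, unnamedA, unnamedB]);
// Inserting one new unnamed rule at the front shifts every index-based name.
const after = effectiveNames([{}, named, unnamedA, unnamedB]);
```

Here unnamedA is called rule1 before the insertion and rule2 afterward, while the explicitly named rule keeps its identifier.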

Left-hand Sides

Left-hand sides are currently a few special forms which select nodes to be fed to right-hand sides.

dom(selector)

Take nodes that match a given DOM selector. Example: dom('meta[property="og:title"]')

Every ruleset has at least one dom() or element() rule, as that is where nodes begin to flow into the system. If run against a subtree of a document, the root of the subtree is not considered as a possible match.

element(selector)

Take a single given node if it matches a given DOM selector, without looking through its descendants or ancestors. Otherwise, take no nodes. Example: element('input')

This is useful for applications in which you want Fathom to classify an element the user has selected, rather than scanning the whole page for candidates.

type(theType)

Take nodes that have the given type. Example: type('titley')

max()

Of the nodes selected by a type call to the left, constrain the LHS to return only the max-scoring one. If there is a tie, more than 1 node will be returned. Example: type('titley').max()
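
The tie behavior can be pictured with a small stand-alone sketch. maxScoring and the node objects are hypothetical stand-ins, not Fathom code. They only mirror the selection described for type('titley').max().

```javascript
// Stand-in for max(): keep every node whose score equals the highest score.
function maxScoring(nodes) {
  if (nodes.length === 0) {
    return [];
  }
  const best = Math.max(...nodes.map(n => n.score));
  return nodes.filter(n => n.score === best);
}

const winners = maxScoring([
  {id: 'a', score: 0.9},
  {id: 'b', score: 0.4},
  {id: 'c', score: 0.9},  // ties with 'a', so both are returned
]);
const winnerIds = winners.map(n => n.id);
```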

bestCluster(options)

Take the nodes selected by a type call to the left, group them into clusters, and return the nodes in the cluster that has the highest total score (on the relevant type).

Nodes come out in arbitrary order, so, if you plan to emit them, consider using .out('whatever').allThrough(domSort). See domSort().

If multiple clusters have equally high scores, return an arbitrary one, because Fathom has no way to represent arrays of arrays in rulesets.

Arguments
  • options (Object) – The same depth costs taken by distance(), plus splittingDistance, which is the distance beyond which 2 clusters will be considered separate. splittingDistance, if omitted, defaults to 3.

and(typeCall[, typeCall, ...])

Pull nodes that conform to multiple conditions at once.

For example: and(type('title'), type('english'))

Caveats: and() supports only simple type calls as arguments for now, and it may fire off more rules as prerequisites than strictly necessary. not() and or() don’t exist yet, but you can express or() the long way around by having 2 rules with identical RHSs.
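
As a loose analogy for the selection semantics, the sketch below filters plain objects by the types they carry. havingAll and havingAny are hypothetical helpers, not Fathom functions. The second one only approximates the "2 rules with identical RHSs" workaround, in which a node matching both types would be handled by both rules.

```javascript
// Stand-in fnodes: each has a name and the set of types it has been given.
const nodes = [
  {name: 'h1', types: new Set(['title', 'english'])},
  {name: 'h2', types: new Set(['title'])},
  {name: 'p', types: new Set(['english'])},
];

// Analogue of and(type('title'), type('english')): nodes having every type.
function havingAll(candidates, ...types) {
  return candidates.filter(n => types.every(t => n.types.has(t)));
}

// Analogue of two rules feeding the same RHS: nodes having at least one type.
function havingAny(candidates, ...types) {
  return candidates.filter(n => types.some(t => n.types.has(t)));
}

const both = havingAll(nodes, 'title', 'english').map(n => n.name);
const either = havingAny(nodes, 'title', 'english').map(n => n.name);
```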

nearest(typeCallA, typeCallB[, distance=euclidean])

Experimental. For each fnode from typeCallA, find the closest node from typeCallB, and attach it as a note. The note is attached to the type specified by the RHS, defaulting to the type of typeCallA. If no nodes are emitted from typeCallB, do nothing.

For example…

nearest(type('image'), type('price'))

The score of typeCallA can be added to the new type’s score by using conserveScore() (though this routine has since been removed):

rule(nearest(type('image'), type('price')),
     type('imageWithPrice').score(2).conserveScore())

Caveats: nearest() supports only simple type calls as typeCallA and typeCallB for now.

Arguments
  • distance (function) – A function that takes 2 fnodes and returns a numerical distance between them. Included options are distance(), which is a weighted topological distance, and euclidean(), which is a spatial distance.
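
The pairing logic can be sketched without Fathom at all. In the example below, the nodes are plain objects with hypothetical center coordinates, pointDistance is a simplified straight-line metric rather than Fathom's euclidean(), and nearestPairs is an illustrative helper. It shows the closest-match selection and the "do nothing" case when the second set is empty.

```javascript
// Stand-in fnodes with hypothetical center coordinates.
const images = [{id: 'img1', x: 0, y: 0}, {id: 'img2', x: 100, y: 100}];
const prices = [{id: 'priceA', x: 10, y: 0}, {id: 'priceB', x: 90, y: 100}];

// Straight-line distance between two points: a simplified spatial metric.
function pointDistance(a, b) {
  return Math.hypot(a.x - b.x, a.y - b.y);
}

// For each node in `as`, find the closest node in `bs` and record it as a note.
// If `bs` is empty, nothing is emitted.
function nearestPairs(as, bs, distance = pointDistance) {
  if (bs.length === 0) {
    return [];
  }
  return as.map(a => {
    let best = bs[0];
    for (const b of bs) {
      if (distance(a, b) < distance(a, best)) {
        best = b;
      }
    }
    return {node: a.id, note: best.id};
  });
}

const pairs = nearestPairs(images, prices);
```

Passing a different function as the third argument swaps the metric, which parallels the optional distance argument above.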

when(predicate)

Prune nodes from consideration early in run execution, before scoring is done.

Reserve this for where you are sure it is always correct or when performance demands it. It is generally preferable to use score() and let the trainer determine the relative significance of each rule. Human intuition as to what is important is often wrong: for example, one might assume that a music player website would include the word “play”, but this does not hold once you include sites in other languages.

Can be chained after type() or dom().

Example: dom('p').when(isVisible)

Arguments
  • predicate (function) – Accepts a fnode and returns a boolean
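
The performance motivation can be illustrated with a stand-alone sketch: pruning first means the scoring work runs only for nodes that survive the predicate. All names here (visibleStandIn, expensiveScore, the node objects) are hypothetical and are not Fathom APIs.

```javascript
// Stand-in candidates, as if selected by dom('p').
const candidates = [
  {tag: 'p', visible: true},
  {tag: 'p', visible: false},
  {tag: 'p', visible: true},
];

// Hypothetical predicate: accepts a node and returns a boolean.
const visibleStandIn = node => node.visible;

// Count how often the (pretend-expensive) scoring step runs.
let scoringCalls = 0;
function expensiveScore(node) {
  scoringCalls += 1;
  return node.visible ? 1 : 0;
}

// Prune first, then score only the survivors.
const scored = candidates
  .filter(visibleStandIn)
  .map(node => ({...node, score: expensiveScore(node)}));
```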

Right-hand Sides

A right-hand side takes the nodes chosen by the left-hand side and mutates them. Spelling-wise, a RHS is a strung-together series of calls like this:

type('smoo').props(someCallback).type('whee').score(2)

To facilitate factoring out repetition in right-hand sides, calls layer together like sheets of transparent acetate: if there are repeats, as with type in the above example, the rightmost takes precedence and the left becomes useless. Similarly, if props(), which can return multiple properties of a fact (element, note, score, and type), is missing any of these properties, we continue searching to the left for anything that provides them (excepting other props() calls; if you want that, write a combinator, and use it to combine the 2 functions you want). To prevent this, return all properties explicitly from your props callback, even if they are no-ops (like {score: 1, note: undefined, type: undefined}). Aside from this layering precedence, the order of calls does not matter.
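
One way to picture the layering is the simplified model below. It is only a sketch of the rules as stated here, not Fathom's implementation: each call in type('smoo').props(someCallback).type('whee').score(2) is recorded as a hypothetical {kind, provides} object, and resolveFact walks from right to left, taking each property from the first call that provides it and ignoring every props() call other than the rightmost.

```javascript
// Hypothetical record of the calls, left to right. The props() entry stands
// in for whatever someCallback returned; here it supplies only a note.
const calls = [
  {kind: 'type', provides: {type: 'smoo'}},
  {kind: 'props', provides: {note: 'from callback'}},
  {kind: 'type', provides: {type: 'whee'}},
  {kind: 'score', provides: {score: 2}},
];

// Resolve each property from the rightmost call that provides it.
function resolveFact(recordedCalls) {
  const fact = {};
  let sawProps = false;
  for (let i = recordedCalls.length - 1; i >= 0; i--) {
    const call = recordedCalls[i];
    if (call.kind === 'props') {
      if (sawProps) {
        continue;  // only the rightmost props() is consulted
      }
      sawProps = true;
    }
    for (const key of Object.keys(call.provides)) {
      // A key that is present counts as provided, even if its value is undefined.
      if (!(key in fact)) {
        fact[key] = call.provides[key];
      }
    }
  }
  return fact;
}

const fact = resolveFact(calls);
```

In this model the leftward type('smoo') contributes nothing, because type('whee') already supplied the type.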

A good practice is to use more declarative calls—score(), note(), and type()—as much as possible and save props() for when you need it. The query planner can get more out of the more specialized calls without you having to tack on verbose hints like atMost() or typeIn().

atMost(score)

Declare that the maximum returned subscore is such and such, which helps the optimizer plan efficiently. This doesn’t force it to be true; it merely throws an error at runtime if it isn’t. To lift an atMost constraint, call atMost() (with no args). The reason atMost and typeIn apply until explicitly cleared is so that, if someone used them for safety reasons on a lexically distant rule you are extending, you won’t stomp on their constraint and break their invariants accidentally.
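
The "declares but does not force" behavior amounts to a runtime assertion. The sketch below uses a hypothetical checkSubscore helper, not a Fathom function, in which an undefined limit stands for a lifted constraint.

```javascript
// Hypothetical checker: a `limit` of undefined means no constraint is in force.
function checkSubscore(subscore, limit) {
  if (limit !== undefined && subscore > limit) {
    throw new Error(`Subscore ${subscore} exceeds the declared maximum of ${limit}.`);
  }
  return subscore;  // the value itself is never clamped or altered
}

const withinLimit = checkSubscore(0.5, 1);       // passes through unchanged
const unconstrained = checkSubscore(2, undefined);  // constraint lifted
```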

props(callback)

Determine any of type, note, score, and element using a callback. This overrides any previous call to props and, depending on what properties of the callback’s return value are filled out, may override the effects of other previous calls as well.

The callback should return…

  • An optional subscore

  • A type (required on dom(...) rules, defaulting to the input one on type(...) rules)

  • Optional notes

  • An element, defaulting to the input one. Overriding the default enables a callback to walk around the tree and say things about nodes other than the input one.

For example…

function callback(fnode) {
    return [{score: 3,
             element: fnode.element,  // unnecessary, since this is the default
             type: 'texty',
             note: {suspicious: true}}];
}

If you use props, Fathom cannot look inside your callback to see what type you are emitting, so you must declare your output types with typeIn() or set a single static type with type. Fathom will complain if you don’t. (You can still opt not to return any type if the node turns out not to be a good match, even if you declare a typeIn().)

note(callback)

Whatever the callback returns (even undefined) becomes the note of the fact. This overrides any previous call to note.

Since every node can have multiple, independent notes (one for each type), this applies to the type explicitly set by the RHS or, if none, to the type named by the type call on the LHS. If the LHS has none because it’s a dom(…) LHS, an error is raised.

When you query for fnodes of a certain type, you can expect to find notes of any form you specified on any RHS with that type. If no note is specified, it will be undefined. However, if two RHSs emit a given type, one adding a note and the other not adding one (or adding an undefined one), the meaningful note overrides the undefined one. This allows elaboration on a RHS’s score (for example) without needing to repeat note logic.

Indeed, undefined is not considered a note. So, though notes cannot in general be overwritten, a note that is undefined can. Symmetrically, an undefined returned from a note() or props() or the like will quietly decline to overwrite an existing defined note, where any other value would cause an error. Rationale: letting undefined be a valid note value would mean you couldn’t shadow a leftward note in a RHS without introducing a new singleton value to serve as a “no value” flag. It’s not worth the complexity and the potential differences between the (internal) fact and fnode note value semantics.
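
These overwrite rules can be summarized in a few lines. mergeNote below is a hypothetical helper written only to restate the semantics described above. It is not how Fathom stores notes internally.

```javascript
// Combine an existing note with an incoming one under the rules above.
function mergeNote(existing, incoming) {
  if (incoming === undefined) {
    return existing;  // undefined quietly declines to overwrite
  }
  if (existing === undefined) {
    return incoming;  // an undefined note can be replaced
  }
  throw new Error('A defined note cannot be overwritten.');
}

const filledIn = mergeNote(undefined, {suspicious: true});
const kept = mergeNote('original', undefined);
```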

Best practice: any rule adding a type should apply the same note. If only one rule of several type-foo-emitting ones did, it should be made to emit a different type instead so downstream rules can explicitly state that they require the note to be there. Otherwise, there is nothing to guarantee the note-adding rule will run before the note-needing one.

out(key)

Expose the output of this rule’s LHS as a “final result” to the surrounding program. It will be available by calling get() on the ruleset and passing the key. You can run each node through a callback function first by adding through(), or you can run the entire set of nodes through a callback function by adding allThrough().

If you are not using through() or allThrough(), you can omit the call to out() and simply specify the key as the RHS of the rule. For example: rule(type('titley').max(), out('title')) can be written as rule(type('titley').max(), 'title').
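
The difference between the two callback styles is per-node versus whole-collection processing. The sketch below mimics that with plain arrays. The node objects, their position field, and byPosition are hypothetical stand-ins (byPosition plays the role that a DOM-order sort might play), not Fathom APIs.

```javascript
// Stand-in fnodes: `position` is a hypothetical document-order index.
const emitted = [
  {position: 2, note: 'Second heading'},
  {position: 1, note: 'First'},
];

// Analogue of through(callback): the callback sees one node at a time.
const titleLengths = emitted.map(fnode => fnode.note.length);

// Analogue of allThrough(callback): the callback sees the whole collection,
// so it can reorder or otherwise reshape the result.
const byPosition = nodes => [...nodes].sort((a, b) => a.position - b.position);
const sortedTitles = byPosition(emitted).map(fnode => fnode.note);
```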

through(callback)

Append .through to out() to run each fnode emitted from the LHS through an arbitrary function before returning it to the containing program. Example:

out('titleLengths').through(fnode => fnode.noteFor('title').length)

allThrough(callback)

Append .allThrough to out() to run the entire iterable of emitted fnodes through an arbitrary function before returning them to the containing program. Example:

out('sortedTitles').allThrough(domSort)

score(scoreOrCallback)

Affect the confidence with which the input node should be considered a member of a type.

The parameter is generally between 0 and 1 (inclusive), with 0 meaning the node does not have the “smell” this rule checks for and 1 meaning it does. The range between 0 and 1 is available to represent “fuzzy” confidences. If you have an unbounded range to compress down to [0, 1], consider using sigmoid() or a scaling thereof.

Since every node can have multiple, independent scores (one for each type), this applies to the type explicitly set by the RHS or, if none, to the type named by the type call on the LHS. If the LHS has none because it’s a dom(...) LHS, an error is raised.

Arguments
  • scoreOrCallback (number|function) – Can either be a static number, generally 0 to 1 inclusive, or else a callback which takes the fnode and returns such a number. If the callback returns a boolean, it is cast to a number.
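
To make the compression idea concrete, here is a minimal sketch that squashes an unbounded word count into the 0 to 1 range. logistic is a local stand-in with the standard logistic definition rather than Fathom's own sigmoid(), and wordCountScore, with its centering at 20 words and scale of 5, is an arbitrary, hypothetical choice.

```javascript
// Local logistic function for illustration only.
function logistic(x) {
  return 1 / (1 + Math.exp(-x));
}

// Hypothetical score callback body: more words push the score toward 1.
// Subtracting 20 centers the curve so that 20 words maps to 0.5;
// dividing by 5 controls how quickly the score saturates.
function wordCountScore(wordCount) {
  return logistic((wordCount - 20) / 5);
}

const shortText = wordCountScore(0);
const midText = wordCountScore(20);
const longText = wordCountScore(1000);

// A boolean result converts to a number the usual JavaScript way.
const castTrue = Number(true);
const castFalse = Number(false);
```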

type(theType)

Set the type applied to fnodes processed by this RHS.

typeIn(type[, type, ...])

Constrain this rule to emit 1 of a set of given types. Pass no args to lift a previous typeIn constraint, as you might do when basing a RHS on a common value to factor out repetition.

typeIn is mostly a hint for the query planner when you’re emitting types dynamically from props calls—in fact, an error will be raised if props is used without a typeIn or type to constrain it—but it also checks conformance at runtime to ensure validity.
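
The runtime conformance check can be pictured as follows. checkEmittedType is a hypothetical helper, not a Fathom function. It assumes, as noted in the props() section, that a callback may decline to return any type even when a set of types has been declared.

```javascript
// Verify that a dynamically emitted type is one of the declared types.
function checkEmittedType(emittedType, declaredTypes) {
  if (emittedType === undefined) {
    return undefined;  // opting not to emit a type is allowed
  }
  if (!declaredTypes.includes(emittedType)) {
    throw new Error(`Type "${emittedType}" was not declared: ${declaredTypes.join(', ')}.`);
  }
  return emittedType;
}

const declared = ['texty', 'titley'];
const accepted = checkEmittedType('texty', declared);
const declined = checkEmittedType(undefined, declared);
```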