Clustering

Fathom provides a flexible clustering algorithm, useful for finding nodes that are bunched together spatially or according to some other metric. By default, it groups nodes based on their proximity and ancestry. It is documented here as top-level functions but is also available directly within rulesets as bestCluster(), which has the advantage of letting you direct its results to further rules.

The clustering routines hang off a clusters object in the top-level Fathom module. To import them, do something like this:

const {
  clusters: { distance },
} = require('fathom-web');

This will result in a top-level distance symbol.

Note

Clustering is computationally expensive (at least O(n^2)). It is powerful, but it should be used only when more efficient alternatives are exhausted.

clusters(fnodes, splittingDistance, getDistance)

Partition the given nodes into one or more clusters by position in the DOM tree.

This implements an agglomerative clustering. It uses single linkage, since we’re talking about adjacency here more than Euclidean proximity: the clusters we’re talking about in the DOM will tend to be adjacent, not overlapping. We haven’t tried other linkage criteria yet.

In a later release, we may consider score or notes.

Arguments
  • fnodes (Array.<Fnode>|Array.<Node>) – fnodes or DOM nodes to group into clusters

  • splittingDistance (number) – The closest-nodes distance() beyond which we will not attempt to unify 2 clusters. Make this larger to make larger clusters.

  • getDistance (function) – A function that returns some notion of numerical distance between 2 nodes. Default: distance()

Returns

Array – An Array of Arrays, with each Array containing all the nodes in one cluster. Note that neither the clusters nor the nodes are in any particular order. You may find domSort() helpful to remedy the latter.

Example:

const {clusters} = require('fathom-web/clusters');
theClusters = clusters(anArrayOfNodes, 4);

In the above, 4 is the distance beyond which Fathom will decide nodes belong in separate clusters. Turn it up to more aggressively invite nearby nodes into a cluster. Turn it down to keep clusters smaller. The output looks like a list of lists, with each list representing a cluster:

[[nodeA, nodeB, nodeC],
 [nodeD]]

Various factors influence the measured distance between nodes. The first is the obvious one: topological distance, the number of steps along the DOM tree from one node to another.

The second is structural similarity. In the following, the divs a and b are farther apart…

<center>
    <div id="a">
    </div>
</center>
<div>
    <div id="b">
    </div>
</div>

…than they would be if the center tag were a div as well:

<div>
    <div id="a">
    </div>
</div>
<div>
    <div id="b">
    </div>
</div>

Third is depth disparity. Nodes are considered farther from each other if they are not the same distance from the root.

Finally is the presence of “stride” nodes, which are siblings or siblings-of-ancestors that lie between 2 nodes. (These are the nodes that would appear between the 2 nodes in a straightforward rendering of the page.) Each stride node makes it less likely that the 2 nodes will be together in a cluster.

The costs for each factor can be customized by wrapping distance() in an arrow function and passing it as the third param.

Note

clusters() can actually cluster anything, not just DOM nodes. All you need to do is pass in a suitable distance function as the getDistance param.

distance(fnodeA, fnodeB, {differentDepthCost = 2, differentTagCost = 2, sameTagCost = 1, strideCost = 1, additionalCost = (fnodeA, fnodeB) => 0})

Return a topological distance between 2 DOM nodes or fnodes weighted according to the similarity of their ancestry in the DOM. For instance, if one node is situated inside <div><span><b><theNode> and the other node is at <differentDiv><span><b><otherNode>, they are considered close to each other for clustering purposes. This is useful for picking out nodes which have similar purposes.

Return Number.MAX_VALUE if one of the nodes contains the other.

This is largely an implementation detail of clusters(), but you can call it yourself if you wish to implement your own clustering. Takes O(n log n) time.

Note that the default costs may change; pass them in explicitly if they are important to you.

Arguments
  • fnodeA (Node|Fnode) –

  • fnodeB (Node|Fnode) –

  • differentDepthCost (number) – Cost for each level deeper one node is than the other below their common ancestor

  • differentTagCost (number) – Cost for a level below the common ancestor where tagNames differ

  • sameTagCost (number) – Cost for a level below the common ancestor where tagNames are the same

  • strideCost (number) – Cost for each stride node between A and B. Stride nodes are siblings or siblings-of-ancestors that lie between the 2 nodes. These interposed nodes make it less likely that the 2 nodes should be together in a cluster.

  • additionalCost (function) – Return an additional cost, given 2 fnodes or nodes.

euclidean(fnodeA, fnodeB)

Return the spatial distance between 2 fnodes or elements, assuming a rendered page.

Specifically, return the distance in pixels between the centers of fnodeA.element.getBoundingClientRect() and fnodeB.element.getBoundingClientRect().