Filtering Data
Table of Contents
Overview
Data is filtered out of production streams at almost every stage of the pipeline. The following outlines each stage and both the data currently filtered and the data that could be filtered. This should help answer two classes of questions:
- Did my data get filtered out?
- We've uncovered spurious data being ingested, how should we handle that?
Note: JSON Schema filtering is our primary method of filtering out bad data. That should be used before any other methods of dropping data from the pipeline.
Stages
Where - Which stage of the pipeline this filtering occurs in
What - What happens to the data when filtered here
When - Which situations this filtering should be, and is, used in
How - What kind of data can be filtered at this stage
Edge filtering
Where: Filtered by nginx, currently we use it to filter out non-v4 pings (to be removed, but capability to remain).
What: Drops data entirely from the pipeline; there will be no traces of it downstream from the edge server.
When: Only to be used in extreme situations (e.g. PII exposure). We also use it for dropping too-large messages and headers.
How: Can be used to filter by URI, namespaces, apps, etc. (from the URL or from the HTTP headers); but not anything in the payload.
Beam Filtering
Where: Filtered in the message scrubber.
What: Causes data to be written to the error stream or to be dropped entirely.
When: Filter out data we absolutely know we will never need to see (e.g. data from forked applications).
How: Can filter out namespaces, doctypes, or URIs currently; in the extreme can filter on any message Attribute or payload field.
JSON Schema Filtering
Where: During ingestion, as defined in the payload schema.
What: Causes data to be written to the error stream.
When: When trying to remove bad analysis data that we know we will never need (e.g. huge values, improper strings, etc.). Usually these indicate something went wrong with the payload.
How: Can filter on values in the payload, using the JSON schema.
Filtering from the stable tables
Where: After ingestion to live tables, but before copying to the stable tables.
What: Allows data to exist in the live tables, but removes it from the stable tables.
When: Use for data that may be needed for some analyses on recent data, but not for data that will need long-term historical analyses or for use in any downstream reporting. For example, we filter out pings from automation (e.g. CI) here, so that analysis is unaffected by them, but we can still analyze what the recent CI data looks like. We also drop duplicate pings (per the document-id).
How: Can filter on any field in the schema, or any metadata.
Filtering from the exposed views
Where: After ingestion to stable tables (example).
What: Allows data to exist in stable tables, but not be exposed to users when accessing views.
When: Use for data that is a footgun for end-users (e.g. data that was collected before a product was launched), but will probably be needed by data science or eng.
How: Can filter on any field in the schema, or any metadata.
Optional filtering in Looker Explores
Where: In the explore, Looker creates a default filter for a field.
What: Allows data to exist in views, and optionally allows users to query that data (but not by default).
When: Use this for data that most of the time should not be queried in Looker. Downside is too many of these will clutter the Looker explore.
How: Can filter on any field in the schema, or any metadata.
Querying the Error Stream
The data engineering team has exposed some tables to make querying the error stream easier.
The schema errors dashboard will let you choose your namespace and doctype to see errors over the past week.
If that data is not granular enough, the error stream can be queried directly:
SELECT
udf.parse_desktop_telemetry_uri(uri) AS parsed_uri,
* EXCEPT(payload),
udf_js.gunzip(payload) AS payload
FROM
`moz-fx-data-shared-prod.payload_bytes_error.telemetry`
WHERE
DATE(submission_timestamp) = "2021-01-07"
LIMIT
1000