Recommended practices
Queries
- Should be defined in files named as `sql/<project>/<dataset>/<table>_<version>/query.sql` e.g. `sql/moz-fx-data-shared-prod/telemetry_derived/clients_daily_v7/query.sql`
  - `<project>` defines both where the destination table resides and in which project the query job runs
- Queries that populate tables should always be named with a version suffix; we assume that future optimizations to the data representation may require schema-incompatible changes such as dropping columns
- May be generated using a python script that prints the query to stdout
  - Should save output as `sql/<project>/<dataset>/<table>_<version>/query.sql` as above
  - Should be named as `sql/<project>/query_type.sql.py` e.g. `sql/moz-fx-data-shared-prod/clients_daily.sql.py`
  - May use options to generate queries for different destination tables e.g. using `--source telemetry_core_parquet_v3` to generate `sql/moz-fx-data-shared-prod/telemetry/core_clients_daily_v1/query.sql` and using `--source main_summary_v4` to generate `sql/moz-fx-data-shared-prod/telemetry/clients_daily_v7/query.sql`
  - Should output a header indicating options used e.g. `-- Query generated by: sql/moz-fx-data-shared-prod/clients_daily.sql.py --source telemetry_core_parquet`
- For tables in `moz-fx-data-shared-prod` the project prefix should be omitted to simplify testing. (Other projects do need the project prefix)
- Should be incremental
- Should filter input tables on partition and clustering columns
- Should use `_` prefix in generated column names not meant for output
- Should use `_bits` suffix for any integer column that represents a bit pattern
- Should not use `DATETIME` type, due to incompatibility with spark-bigquery-connector
- Should read from `*_stable` tables instead of including custom deduplication
  - Should use the earliest row for each `document_id` by `submission_timestamp` where filtering duplicates is necessary (see the sketch after this list)
- Should not refer to views in the `mozdata` project which are duplicates of views in another project (commonly `moz-fx-data-shared-prod`). Refer to the original view instead.
- Should escape identifiers that match keywords, even if they aren't reserved keywords
- Queries are interpreted as Jinja templates, so it is possible to use Jinja statements and expressions
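A minimal sketch tying several of these practices together; the table `telemetry_stable.example_v1`, its columns, and the destination are hypothetical names for illustration, not an actual query in this repository:

```sql
-- Hypothetical incremental query; runs once per day for the @submission_date parameter.
WITH deduplicated AS (
  SELECT
    document_id,
    submission_timestamp,
    normalized_channel,
    ROW_NUMBER() OVER (
      PARTITION BY document_id
      ORDER BY submission_timestamp
    ) AS _row_number  -- `_` prefix: generated column not meant for output
  FROM
    telemetry_stable.example_v1  -- read from a *_stable table; project prefix omitted
  WHERE
    DATE(submission_timestamp) = @submission_date  -- filter on the partition column
)
SELECT
  @submission_date AS submission_date,
  normalized_channel,
  COUNT(*) AS document_count
FROM
  deduplicated
WHERE
  _row_number = 1  -- keep only the earliest row for each document_id
GROUP BY
  normalized_channel
```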
Querying Metrics
- Queries, views and UDFs can reference metrics and data sources that have been defined in metric-hub
- To reference metrics use `{{ metrics.calculate() }}`:

  ```sql
  SELECT
    *
  FROM
    {{ metrics.calculate(
      metrics=['days_of_use', 'active_hours'],
      platform='firefox_desktop',
      group_by={'sample_id': 'sample_id', 'channel': 'application.channel'},
      where='submission_date = "2023-01-01"'
    ) }}

  -- this translates to
  SELECT
    *
  FROM
    (
      WITH clients_daily AS (
        SELECT
          client_id AS client_id,
          submission_date AS submission_date,
          COALESCE(SUM(active_hours_sum), 0) AS active_hours,
          COUNT(submission_date) AS days_of_use,
        FROM
          mozdata.telemetry.clients_daily
        GROUP BY
          client_id,
          submission_date
      )
      SELECT
        clients_daily.client_id,
        clients_daily.submission_date,
        active_hours,
        days_of_use,
      FROM
        clients_daily
    )
  ```

  - `metrics`: unique reference(s) to metric definition, all metric definitions are aggregations (e.g. SUM, AVG, ...)
  - `platform`: platform to compute metrics for (e.g. `firefox_desktop`, `firefox_ios`, `fenix`, ...)
  - `group_by`: fields used in the GROUP BY statement; this is a dictionary where the key represents the alias, the value is the field path; `GROUP BY` always includes the configured `client_id` and `submission_date` fields
  - `where`: SQL filter clause
  - `group_by_client_id`: Whether the field configured as `client_id` (defined as part of the data source specification in metric-hub) should be part of the `GROUP BY`. `True` by default
  - `group_by_submission_date`: Whether the field configured as `submission_date` (defined as part of the data source specification in metric-hub) should be part of the `GROUP BY`. `True` by default
- To reference data source definitions use `{{ metrics.data_source() }}`:

  ```sql
  SELECT
    *
  FROM
    {{ metrics.data_source(
      data_source='main',
      platform='firefox_desktop',
      where='submission_date = "2023-01-01"'
    ) }}

  -- this translates to
  SELECT
    *
  FROM
    (
      SELECT
        *
      FROM
        `mozdata.telemetry.main`
      WHERE
        submission_date = "2023-01-01"
    )
  ```
- To render queries that use Jinja expressions or statements use `./bqetl query render path/to/query.sql`
  - The `generated-sql` branch has rendered queries/views/UDFs
  - `./bqetl query run` does support running Jinja queries
Query Metadata
- For each query, a `metadata.yaml` file should be created in the same directory
- This file contains a description, owners and labels. As an example:
  ```yaml
  friendly_name: SSL Ratios
  description: >
    Percentages of page loads Firefox users have performed that were
    conducted over SSL broken down by country.
  owners:
    - example@mozilla.com
  labels:
    application: firefox
    incremental: true # incremental queries add data to existing tables
    schedule: daily # scheduled in Airflow to run daily
    public_json: true
    public_bigquery: true
    review_bugs:
      - 1414839 # Bugzilla bug ID of data review
    incremental_export: false # non-incremental JSON export writes all data to a single location
  ```
- Only labels whose values are either integers or strings are published; all other value types are skipped
Dynamic Schemas
For tables whose schemas may evolve over time (e.g., tables that use `--schema_update_option=ALLOW_FIELD_ADDITION`), you can indicate this in the metadata to ensure schema updates are not skipped:
```yaml
schema:
  allow_field_addition: true
```
This setting ensures that when running `./bqetl query schema update --skip-existing`, the schema for this query will still be updated even if a `schema.yaml` file already exists. This is particularly useful for:
- Tables that receive new fields dynamically
- Tables using BigQuery's schema auto-detection
- Tables that use the `ALLOW_FIELD_ADDITION` schema update option

Without this flag, `--skip-existing` will skip schema updates for queries that already have a `schema.yaml` file.
Views
- Should be defined in files named as `sql/<project>/<dataset>/<table>/view.sql` e.g. `sql/moz-fx-data-shared-prod/telemetry/core/view.sql`
- Views should generally not be named with a version suffix; a view represents a
  stable interface for users and whenever possible should maintain compatibility
  with existing queries; if the view logic cannot be adapted to changes in underlying
  tables, breaking changes must be communicated to fx-data-dev@mozilla.org (see the
  sketch after this list)
- Must specify project and dataset in all table names
- Should default to using the `moz-fx-data-shared-prod` project; the `scripts/publish_views` tooling can handle parsing the definitions to publish to other projects such as `derived-datasets`
- Should not refer to views in the `mozdata` project which are duplicates of views in another project (commonly `moz-fx-data-shared-prod`). Refer to the original view instead.
- Views are interpreted as Jinja templates, so it is possible to use Jinja statements and expressions
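A minimal sketch of a view definition following these conventions; the `telemetry.example` view and the underlying `telemetry_derived.example_v1` table are hypothetical names, not objects in this repository:

```sql
-- Hypothetical sql/moz-fx-data-shared-prod/telemetry/example/view.sql
CREATE OR REPLACE VIEW
  `moz-fx-data-shared-prod.telemetry.example`
AS
SELECT
  *
FROM
  -- project and dataset are fully specified; the versioned derived table sits
  -- behind an unversioned view that acts as the stable user-facing interface
  `moz-fx-data-shared-prod.telemetry_derived.example_v1`
```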
UDFs
- Should limit the number of expression subqueries to avoid:
  `BigQuery error in query operation: Resources exceeded during query execution: Not enough resources for query planning - too many subqueries or query is too complex.`
- Should be used to avoid code duplication
- Must be named in files with lower snake case names ending in `.sql` e.g. `mode_last.sql`
  - Each file must only define effectively private helper functions and one public function which must be defined last (see the sketch after this list)
    - Helper functions must not conflict with function names in other files
  - SQL UDFs must be defined in the `udf/` directory and JS UDFs must be defined in the `udf_js` directory
    - The `udf_legacy/` directory is an exception which must only contain compatibility functions for queries migrated from Athena/Presto.
- Functions must be defined as persistent UDFs using `CREATE OR REPLACE FUNCTION` syntax
  - Function names must be prefixed with a dataset of `<dir_name>.` so, for example, all functions in `udf/*.sql` are part of the `udf` dataset
    - The final syntax for creating a function in a file will look like `CREATE OR REPLACE FUNCTION <dir_name>.<file_name>`
  - We provide tooling in `scripts/publish_persistent_udfs` for publishing these UDFs to BigQuery
    - Changes made to UDFs need to be published manually in order for the dry run CI task to pass
- In order to run UDFs in `private-bigquery-etl` in the BigQuery Console, you will need to not only write the UDF but also a dummy/stub file in `bigquery-etl`.
  - Actual UDF in `private-bigquery-etl`:

    ```sql
    CREATE OR REPLACE FUNCTION udf.distribution_model_ga_metrics()
    RETURNS STRING AS (
      (SELECT 'non-distribution' AS distribution_model)
    );

    SELECT
      mozfun.assert.equals(udf.distribution_model_ga_metrics(), 'non-distribution');
    ```

  - Stub file in `bigquery-etl`:

    ```sql
    CREATE OR REPLACE FUNCTION udf.distribution_model_ga_metrics()
    RETURNS STRING AS (
      'helloworld'
    );

    SELECT
      mozfun.assert.equals(udf.distribution_model_ga_metrics(), 'helloworld');
    ```
- Should use SQL over `js` for performance
- UDFs are interpreted as Jinja templates, so it is possible to use Jinja statements and expressions
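A minimal sketch of what a UDF file following these conventions might look like; the file `udf/example_total.sql`, the helper, and the public function are hypothetical names used for illustration only:

```sql
-- Hypothetical udf/example_total.sql
-- Effectively private helper; its name must not clash with names in other files.
CREATE OR REPLACE FUNCTION udf.example_total_helper(input_values ARRAY<INT64>)
RETURNS INT64 AS (
  (SELECT SUM(v) FROM UNNEST(input_values) AS v)
);

-- The single public function, named after the file and defined last.
CREATE OR REPLACE FUNCTION udf.example_total(input_values ARRAY<INT64>)
RETURNS INT64 AS (
  COALESCE(udf.example_total_helper(input_values), 0)
);
```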
Large Backfills
- Should be documented and reviewed by a peer using a new bug that describes the context that required the backfill and the command or script used.
- Frequent backfills should be avoided
- Backfills may increase storage costs for a table for 90 days by moving data from long-term storage to short-term storage and requiring a production snapshot.
- Should combine multiple backfills happening around the same time
- Should delay column deletes until the next backfill that happens for other reasons
  - Should use `NULL` for new data and `EXCEPT` to exclude from views until dropped (see the sketch at the end of this section)
- After running the backfill, it is important to validate that the job(s) ran without errors
and the execution times and bytes processed are as expected.
Here is a query you may use for this purpose:
```sql
SELECT
  job_type,
  state,
  submission_date,
  destination_dataset_id,
  destination_table_id,
  total_terabytes_billed,
  total_slot_ms,
  error_location,
  error_reason,
  error_message
FROM
  `moz-fx-data-shared-prod.monitoring.bigquery_usage`
WHERE
  submission_date <= CURRENT_DATE()
  AND destination_dataset_id LIKE "%backfills_staging_derived%"
  AND destination_table_id LIKE "%{{ your table name }}%"
ORDER BY
  submission_date DESC
```
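On the `EXCEPT` guidance above, a minimal sketch with hypothetical view, table, and column names: while a column awaits deletion, the query writing the table can populate it with `NULL` and the user-facing view can hide it until the next backfill drops it.

```sql
-- Hypothetical user-facing view excluding a column that is slated for deletion.
CREATE OR REPLACE VIEW
  `moz-fx-data-shared-prod.telemetry.example`
AS
SELECT
  * EXCEPT (deprecated_metric)  -- hide until the next backfill drops the column
FROM
  `moz-fx-data-shared-prod.telemetry_derived.example_v1`
```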