# Recommended practices
## Queries
- Should be defined in files named as `sql/<project>/<dataset>/<table>_<version>/query.sql` e.g.
  `sql/moz-fx-data-shared-prod/telemetry_derived/clients_daily_v7/query.sql`
  - `<project>` defines both where the destination table resides and in which project the query job runs
  - Queries that populate tables should always be named with a version suffix;
    we assume that future optimizations to the data representation may require
    schema-incompatible changes such as dropping columns
- May be generated using a python script that prints the query to stdout
  - Should save output as `sql/<project>/<dataset>/<table>_<version>/query.sql` as above
  - Should be named as `sql/<project>/query_type.sql.py` e.g.
    `sql/moz-fx-data-shared-prod/clients_daily.sql.py`
  - May use options to generate queries for different destination tables e.g.
    using `--source telemetry_core_parquet_v3` to generate
    `sql/moz-fx-data-shared-prod/telemetry/core_clients_daily_v1/query.sql` and
    using `--source main_summary_v4` to generate
    `sql/moz-fx-data-shared-prod/telemetry/clients_daily_v7/query.sql`
  - Should output a header indicating options used e.g.
    `-- Query generated by: sql/moz-fx-data-shared-prod/clients_daily.sql.py --source telemetry_core_parquet`
- For tables in `moz-fx-data-shared-prod` the project prefix should be omitted to simplify testing
  (other projects do need the project prefix)
- Should be incremental
- Should filter input tables on partition and clustering columns
  (see the first sketch after this list)
- Should use `_` prefix in generated column names not meant for output
- Should use `_bits` suffix for any integer column that represents a bit pattern
- Should not use `DATETIME` type, due to incompatibility with spark-bigquery-connector
- Should read from `*_stable` tables instead of including custom deduplication
  - Should use the earliest row for each `document_id` by `submission_timestamp`
    where filtering duplicates is necessary (see the second sketch after this list)
- Should not refer to views in the `mozdata` project which are duplicates of views in another project
  (commonly `moz-fx-data-shared-prod`). Refer to the original view instead.
- Should escape identifiers that match keywords, even if they aren't reserved keywords
- Queries are interpreted as Jinja templates, so it is possible to use Jinja statements and expressions
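As an illustration of filtering on partition columns in an incremental query, here is a minimal sketch; the source table, column names, and the `@submission_date` parameter follow common bigquery-etl conventions but are assumptions for illustration rather than an existing query:

```sql
-- Sketch: an incremental query that processes a single day of input.
SELECT
  DATE(submission_timestamp) AS submission_date,
  client_id,
  COUNT(*) AS ping_count
FROM
  telemetry_stable.main_v5
WHERE
  -- Filtering on the partition column limits the scan to one partition.
  DATE(submission_timestamp) = @submission_date
GROUP BY
  submission_date,
  client_id
```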
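Where custom deduplication is unavoidable, a minimal sketch of keeping the earliest row per `document_id` by `submission_timestamp`; the source table name is an assumption for illustration:

```sql
-- Sketch: keep only the earliest-received row for each document_id.
SELECT
  *
FROM
  telemetry_live.main_v5
WHERE
  DATE(submission_timestamp) = @submission_date
QUALIFY
  ROW_NUMBER() OVER (
    PARTITION BY document_id
    ORDER BY submission_timestamp
  ) = 1
```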
### Querying Metrics
- Queries, views and UDFs can reference metrics and data sources that have been defined in metric-hub
- To reference metrics use `{{ metrics.calculate() }}`:

  ```sql
  SELECT
    *
  FROM
    {{ metrics.calculate(
      metrics=['days_of_use', 'active_hours'],
      platform='firefox_desktop',
      group_by={'sample_id': 'sample_id', 'channel': 'application.channel'},
      where='submission_date = "2023-01-01"'
    ) }}

  -- this translates to
  SELECT
    *
  FROM
    (
      WITH clients_daily AS (
        SELECT
          client_id AS client_id,
          submission_date AS submission_date,
          COALESCE(SUM(active_hours_sum), 0) AS active_hours,
          COUNT(submission_date) AS days_of_use,
        FROM
          mozdata.telemetry.clients_daily
        GROUP BY
          client_id,
          submission_date
      )
      SELECT
        clients_daily.client_id,
        clients_daily.submission_date,
        active_hours,
        days_of_use,
      FROM
        clients_daily
    )
  ```
  The `metrics.calculate()` parameters are:

  - `metrics`: unique reference(s) to metric definitions; all metric definitions are aggregations (e.g. `SUM`, `AVG`, ...)
  - `platform`: platform to compute metrics for (e.g. `firefox_desktop`, `firefox_ios`, `fenix`, ...)
  - `group_by`: fields used in the `GROUP BY` statement; this is a dictionary where the key represents the alias and the value is the field path; `GROUP BY` always includes the configured `client_id` and `submission_date` fields
  - `where`: SQL filter clause
  - `group_by_client_id`: whether the field configured as `client_id` (defined as part of the data source specification in metric-hub) should be part of the `GROUP BY`; `True` by default
  - `group_by_submission_date`: whether the field configured as `submission_date` (defined as part of the data source specification in metric-hub) should be part of the `GROUP BY`; `True` by default
- To reference data source definitions use `{{ metrics.data_source() }}`:

  ```sql
  SELECT
    *
  FROM
    {{ metrics.data_source(
      data_source='main',
      platform='firefox_desktop',
      where='submission_date = "2023-01-01"'
    ) }}

  -- this translates to
  SELECT
    *
  FROM
    (
      SELECT *
      FROM `mozdata.telemetry.main`
      WHERE submission_date = "2023-01-01"
    )
  ```
- To render queries that use Jinja expressions or statements use
  `./bqetl query render path/to/query.sql`
  - The `generated-sql` branch has rendered queries/views/UDFs
  - `./bqetl query run` does support running Jinja queries
- For each query, a `metadata.yaml` file should be created in the same directory
  - This file contains a description, owners and labels. As an example:

    ```yaml
    friendly_name: SSL Ratios
    description: >
      Percentages of page loads Firefox users have performed that were
      conducted over SSL broken down by country.
    owners:
      - example@mozilla.com
    labels:
      application: firefox
      incremental: true # incremental queries add data to existing tables
      schedule: daily # scheduled in Airflow to run daily
      public_json: true
      public_bigquery: true
      review_bugs:
        - 1414839 # Bugzilla bug ID of data review
      incremental_export: false # non-incremental JSON export writes all data to a single location
    ```
  - Only labels whose values are integers or strings are published; all other value types are skipped
## Views
- Should be defined in files named as `sql/<project>/<dataset>/<table>/view.sql` e.g.
  `sql/moz-fx-data-shared-prod/telemetry/core/view.sql` (see the sketch after this list)
- Views should generally not be named with a version suffix; a view represents a
stable interface for users and whenever possible should maintain compatibility
with existing queries; if the view logic cannot be adapted to changes in underlying
tables, breaking changes must be communicated to
fx-data-dev@mozilla.org
- Must specify project and dataset in all table names
- Should default to using the `moz-fx-data-shared-prod` project;
  the `scripts/publish_views` tooling can handle parsing the definitions to publish
  to other projects such as `derived-datasets`
- Should not refer to views in the `mozdata` project which are duplicates of views in another project
  (commonly `moz-fx-data-shared-prod`). Refer to the original view instead.
- Views are interpreted as Jinja templates, so it is possible to use Jinja statements and expressions
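A minimal sketch of what `sql/moz-fx-data-shared-prod/telemetry/core/view.sql` might contain; the underlying table name is an assumption for illustration:

```sql
-- Sketch: views fully qualify project and dataset in all table references.
CREATE OR REPLACE VIEW
  `moz-fx-data-shared-prod.telemetry.core`
AS
SELECT
  *
FROM
  `moz-fx-data-shared-prod.telemetry_stable.core_v10`
```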
## UDFs
- Should limit the number of expression subqueries to avoid:
  `BigQuery error in query operation: Resources exceeded during query execution: Not enough resources for query planning - too many subqueries or query is too complex.`
- Should be used to avoid code duplication
- Must be named in files with lower snake case names ending in `.sql`
  e.g. `mode_last.sql`
- Each file must only define effectively private helper functions and one
  public function which must be defined last (see the sketch at the end of this list)
- Helper functions must not conflict with function names in other files
- SQL UDFs must be defined in the `udf/` directory and JS UDFs must be defined
  in the `udf_js` directory
- The `udf_legacy/` directory is an exception which must only contain
  compatibility functions for queries migrated from Athena/Presto.
- Functions must be defined as persistent UDFs using `CREATE OR REPLACE FUNCTION` syntax
- Function names must be prefixed with a dataset of `<dir_name>.` so, for example,
  all functions in `udf/*.sql` are part of the `udf` dataset
- The final syntax for creating a function in a file will look like
  `CREATE OR REPLACE FUNCTION <dir_name>.<file_name>`
- We provide tooling in `scripts/publish_persistent_udfs` for
  publishing these UDFs to BigQuery
- Changes made to UDFs need to be published manually in order for the
dry run CI task to pass
- In order to run UDFs in `private-bigquery-etl` in the BigQuery Console, you will
  need to write not only the UDF but also a dummy/stub file in bigquery-etl.
  - Actual UDF in `private-bigquery-etl`:

    ```sql
    CREATE OR REPLACE FUNCTION udf.distribution_model_ga_metrics()
    RETURNS STRING AS (
      (SELECT 'non-distribution' AS distribution_model)
    );

    SELECT
      mozfun.assert.equals(udf.distribution_model_ga_metrics(), 'non-distribution');
    ```
  - Stub file in bigquery-etl:

    ```sql
    CREATE OR REPLACE FUNCTION udf.distribution_model_ga_metrics()
    RETURNS STRING AS (
      'helloworld'
    );

    SELECT
      mozfun.assert.equals(udf.distribution_model_ga_metrics(), 'helloworld');
    ```
- Should use SQL over JS for performance
- UDFs are interpreted as Jinja templates, so it is possible to use Jinja statements and expressions
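As referenced above, a minimal sketch of a UDF file with one effectively private helper followed by the single public function; the file name, function names, and logic are hypothetical:

```sql
-- Hypothetical file: udf/example_doubled_sum.sql
-- Effectively private helper, only intended for use within this file.
CREATE OR REPLACE FUNCTION udf.example_doubled_sum_helper(x INT64)
RETURNS INT64 AS (
  x * 2
);

-- Public function, defined last; it is prefixed with the directory name (udf)
-- as its dataset and named after the file.
CREATE OR REPLACE FUNCTION udf.example_doubled_sum(xs ARRAY<INT64>)
RETURNS INT64 AS (
  (SELECT SUM(udf.example_doubled_sum_helper(x)) FROM UNNEST(xs) AS x)
);

-- Tests
SELECT
  mozfun.assert.equals(udf.example_doubled_sum([1, 2, 3]), 12);
```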
## Large Backfills
- Should be documented and reviewed by a peer using a
new bug that describes
the context that required the backfill and the command or script used.
- Frequent backfills should be avoided
- Backfills may increase storage costs for a table for 90 days by moving
data from long-term storage to short-term storage and requiring a production snapshot.
- Should combine multiple backfills happening around the same time
- Should delay column deletes until the next other backfill
  - Should use `NULL` for new data and `EXCEPT` to exclude from views until
    dropped (see the sketch at the end of this section)
- After running the backfill, it is important to validate that the job(s) ran without errors
and the execution times and bytes processed are as expected.
Here is a query you may use for this purpose:
  ```sql
  SELECT
    job_type,
    state,
    submission_date,
    destination_dataset_id,
    destination_table_id,
    total_terabytes_billed,
    total_slot_ms,
    error_location,
    error_reason,
    error_message
  FROM
    `moz-fx-data-shared-prod.monitoring.bigquery_usage`
  WHERE
    submission_date <= CURRENT_DATE()
    AND destination_dataset_id LIKE "%backfills_staging_derived%"
    AND destination_table_id LIKE "%{{ your table name }}%"
  ORDER BY
    submission_date DESC
  ```
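As referenced in the column-deletion bullet above, a minimal sketch of writing `NULL` for a deprecated column in new data and using `EXCEPT` to hide it from the user-facing view until it can be dropped in a later backfill; the table, view, and column names are hypothetical:

```sql
-- Hypothetical query.sql for the derived table: keep the column in the schema,
-- but write NULL for new partitions instead of backfilling immediately.
SELECT
  client_id,
  DATE(submission_timestamp) AS submission_date,
  CAST(NULL AS STRING) AS legacy_field
FROM
  telemetry_stable.example_v1
WHERE
  DATE(submission_timestamp) = @submission_date;

-- Hypothetical view.sql: exclude the deprecated column until it is dropped.
CREATE OR REPLACE VIEW
  `moz-fx-data-shared-prod.telemetry.example`
AS
SELECT
  * EXCEPT (legacy_field)
FROM
  `moz-fx-data-shared-prod.telemetry_derived.example_v1`
```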