# Recommended practices

## Queries
- Should be defined in files named as `sql/<project>/<dataset>/<table>_<version>/query.sql` e.g.
  `sql/moz-fx-data-shared-prod/telemetry_derived/clients_daily_v7/query.sql`
  - `<project>` defines both where the destination table resides and in which project the query job runs
- Queries that populate tables should always be named with a version suffix;
we assume that future optimizations to the data representation may require
schema-incompatible changes such as dropping columns
- May be generated using a python script that prints the query to stdout
  - Should save output as `sql/<project>/<dataset>/<table>_<version>/query.sql` as above
  - Should be named as `sql/<project>/query_type.sql.py` e.g. `sql/moz-fx-data-shared-prod/clients_daily.sql.py`
  - May use options to generate queries for different destination tables e.g. using
    `--source telemetry_core_parquet_v3` to generate
    `sql/moz-fx-data-shared-prod/telemetry/core_clients_daily_v1/query.sql` and using
    `--source main_summary_v4` to generate
    `sql/moz-fx-data-shared-prod/telemetry/clients_daily_v7/query.sql`
  - Should output a header indicating options used e.g.
    `-- Query generated by: sql/moz-fx-data-shared-prod/clients_daily.sql.py --source telemetry_core_parquet`
- For tables in `moz-fx-data-shared-prod`, the project prefix should be omitted to simplify testing (other projects do need the project prefix)
- Should be incremental
- Should filter input tables on partition and clustering columns
- Should use a `_` prefix in generated column names not meant for output
- Should use a `_bits` suffix for any integer column that represents a bit pattern
- Should not use the `DATETIME` type, due to incompatibility with spark-bigquery-connector
- Should read from `*_stable` tables instead of including custom deduplication
  - Should use the earliest row for each `document_id` by `submission_timestamp`
    where filtering duplicates is necessary (see the sketch after this list)
- Should not refer to views in the `mozdata` project which are duplicates of views in another
  project (commonly `moz-fx-data-shared-prod`); refer to the original view instead
- Should escape identifiers that match keywords, even if they aren't reserved keywords
- Queries are interpreted as Jinja templates, so it is possible to use Jinja statements and expressions
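
The sketch below ties several of the bullets above together: it filters the input on the partition column, keeps the earliest row per `document_id`, uses a `_`-prefixed helper column, and exposes a `_bits` column. It is illustrative only; the table and column names (`telemetry_stable.example_v1`, `days_seen_bits`) are hypothetical, and `@submission_date` is assumed to be the run-date parameter supplied by the scheduler.

```sql
SELECT
  @submission_date AS submission_date,
  client_id,
  1 AS days_seen_bits,  -- `_bits` suffix: integer column holding a bit pattern
FROM
  (
    SELECT
      client_id,
      document_id,
      submission_timestamp,
      -- `_` prefix: helper column not meant for output
      ROW_NUMBER() OVER (
        PARTITION BY document_id
        ORDER BY submission_timestamp
      ) AS _n,
    FROM
      telemetry_stable.example_v1
    WHERE
      -- filter on the partition column
      DATE(submission_timestamp) = @submission_date
  )
WHERE
  _n = 1  -- earliest row for each `document_id` wins
GROUP BY
  client_id
```
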
### Querying Metrics
- Queries, views and UDFs can reference metrics and data sources that have been defined in metric-hub
- To reference metrics use `{{ metrics.calculate() }}`:

```sql
SELECT
    *
FROM
    {{ metrics.calculate(
        metrics=['days_of_use', 'active_hours'],
        platform='firefox_desktop',
        group_by={'sample_id': 'sample_id', 'channel': 'application.channel'},
        where='submission_date = "2023-01-01"'
    ) }}

-- this translates to
SELECT
    *
FROM
    (
        WITH clients_daily AS (
            SELECT
                client_id AS client_id,
                submission_date AS submission_date,
                COALESCE(SUM(active_hours_sum), 0) AS active_hours,
                COUNT(submission_date) AS days_of_use,
            FROM
                mozdata.telemetry.clients_daily
            GROUP BY
                client_id,
                submission_date
        )
        SELECT
            clients_daily.client_id,
            clients_daily.submission_date,
            active_hours,
            days_of_use,
        FROM
            clients_daily
    )
```
  - `metrics`: unique reference(s) to metric definitions; all metric definitions are aggregations (e.g. SUM, AVG, ...)
  - `platform`: platform to compute metrics for (e.g. `firefox_desktop`, `firefox_ios`, `fenix`, ...)
  - `group_by`: fields used in the `GROUP BY` statement; this is a dictionary where the key represents the alias and the value is the field path; `GROUP BY` always includes the configured `client_id` and `submission_date` fields
  - `where`: SQL filter clause
  - `group_by_client_id`: whether the field configured as `client_id` (defined as part of the data source specification in metric-hub) should be part of the `GROUP BY`; `True` by default (see the sketch at the end of this section)
  - `group_by_submission_date`: whether the field configured as `submission_date` (defined as part of the data source specification in metric-hub) should be part of the `GROUP BY`; `True` by default
- To reference data source definitions use `{{ metrics.data_source() }}`:

```sql
SELECT
    *
FROM
    {{ metrics.data_source(
        data_source='main',
        platform='firefox_desktop',
        where='submission_date = "2023-01-01"'
    ) }}

-- this translates to
SELECT
    *
FROM
    (
        SELECT *
        FROM `mozdata.telemetry.main`
        WHERE submission_date = "2023-01-01"
    )
```
- To render queries that use Jinja expressions or statements use `./bqetl query render path/to/query.sql`
  - The `generated-sql` branch has rendered queries/views/UDFs
  - `./bqetl query run` does support running Jinja queries
- For each query, a `metadata.yaml` file should be created in the same directory
  - This file contains a description, owners and labels. As an example:

```yaml
friendly_name: SSL Ratios
description: >
  Percentages of page loads Firefox users have performed that were
  conducted over SSL broken down by country.
owners:
  - example@mozilla.com
labels:
  application: firefox
  incremental: true # incremental queries add data to existing tables
  schedule: daily # scheduled in Airflow to run daily
  public_json: true
  public_bigquery: true
  review_bugs:
    - 1414839 # Bugzilla bug ID of data review
  incremental_export: false # non-incremental JSON export writes all data to a single location
```
  - Only labels whose values are integers or strings are published; all other value types are skipped
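
As a further usage sketch for `{{ metrics.calculate() }}` (not taken from the docs above): with `group_by_client_id=False`, the configured `client_id` field is left out of the `GROUP BY`, so each metric is aggregated across all clients per group. This assumes `group_by_client_id` and `group_by_submission_date` are passed as keyword arguments like the other parameters, and that the `active_hours` metric is defined for `firefox_desktop` in metric-hub.

```sql
SELECT
    *
FROM
    {{ metrics.calculate(
        metrics=['active_hours'],
        platform='firefox_desktop',
        group_by={'channel': 'application.channel'},
        where='submission_date = "2023-01-01"',
        group_by_client_id=False,
        group_by_submission_date=True
    ) }}
```
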
## Views
- Should be defined in files named as `sql/<project>/<dataset>/<table>/view.sql` e.g.
  `sql/moz-fx-data-shared-prod/telemetry/core/view.sql` (see the sketch after this list)
- Views should generally not be named with a version suffix; a view represents a
stable interface for users and whenever possible should maintain compatibility
with existing queries; if the view logic cannot be adapted to changes in underlying
tables, breaking changes must be communicated to fx-data-dev@mozilla.org
- Must specify project and dataset in all table names
- Should default to using the `moz-fx-data-shared-prod` project;
  the `scripts/publish_views` tooling can handle parsing the definitions to publish
  to other projects such as `derived-datasets`
- Should not refer to views in the `mozdata` project which are duplicates of views in another
  project (commonly `moz-fx-data-shared-prod`); refer to the original view instead
- Views are interpreted as Jinja templates, so it is possible to use Jinja statements and expressions
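
A minimal sketch of a view file; the view and table names are hypothetical, but it shows a versionless view providing a stable interface over a versioned table, with project and dataset fully specified and `moz-fx-data-shared-prod` as the default project.

```sql
-- Hypothetical sql/moz-fx-data-shared-prod/telemetry/example/view.sql
CREATE OR REPLACE VIEW
  `moz-fx-data-shared-prod.telemetry.example`
AS
SELECT
  *
FROM
  `moz-fx-data-shared-prod.telemetry_derived.example_v1`
```
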
## UDFs
- Should limit the number of expression subqueries to avoid:
  `BigQuery error in query operation: Resources exceeded during query execution: Not enough resources for query planning - too many subqueries or query is too complex.`
- Should be used to avoid code duplication
- Must be named in files with lower snake case names ending in `.sql` e.g. `mode_last.sql`
- Each file must only define effectively private helper functions and one
  public function, which must be defined last (see the sketch after this list)
- Helper functions must not conflict with function names in other files
- SQL UDFs must be defined in the `udf/` directory and JS UDFs must be defined in the `udf_js` directory
- The `udf_legacy/` directory is an exception which must only contain compatibility functions for queries migrated from Athena/Presto
- Functions must be defined as persistent UDFs using `CREATE OR REPLACE FUNCTION` syntax
- Function names must be prefixed with a dataset of `<dir_name>.` so, for example,
  all functions in `udf/*.sql` are part of the `udf` dataset
  - The final syntax for creating a function in a file will look like `CREATE OR REPLACE FUNCTION <dir_name>.<file_name>`
- We provide tooling in `scripts/publish_persistent_udfs` for publishing these UDFs to BigQuery
  - Changes made to UDFs need to be published manually in order for the dry run CI task to pass
- Should use SQL over JS for performance
- UDFs are interpreted as Jinja templates, so it is possible to use Jinja statements and expressions
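
As a sketch of the file layout described above (the function and helper names are made up for illustration), a file `udf/example_bitmask_lowest_28.sql` would define any effectively private helpers first and the public function last, both as persistent UDFs in the `udf` dataset.

```sql
-- Hypothetical udf/example_bitmask_lowest_28.sql
-- Private helper; its name should not conflict with functions in other files.
CREATE OR REPLACE FUNCTION udf.example_bitmask_lowest_28_helper(bits INT64) AS (
  bits & ((1 << 28) - 1)
);

-- Public function, defined last; its name matches the file name.
CREATE OR REPLACE FUNCTION udf.example_bitmask_lowest_28(days_seen_bits INT64) AS (
  udf.example_bitmask_lowest_28_helper(COALESCE(days_seen_bits, 0))
);
```
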
## Large Backfills
- Should be documented and reviewed by a peer using a new bug that describes
  the context that required the backfill and the command or script used.
- Frequent backfills should be avoided
- Backfills may increase storage costs for a table for 90 days by moving
data from long-term storage to short-term storage and requiring a production snapshot.
- Should combine multiple backfills happening around the same time
- Should delay column deletes until the next other backfill
  - Should use `NULL` for new data and `EXCEPT` to exclude from views until
    dropped (see the sketch at the end of this section)
- After running the backfill, it is important to validate that the job(s) ran without errors
and the execution times and bytes processed are as expected.
Here is a query you may use for this purpose:
```sql
SELECT
  job_type,
  state,
  submission_date,
  destination_dataset_id,
  destination_table_id,
  total_terabytes_billed,
  total_slot_ms,
  error_location,
  error_reason,
  error_message
FROM
  `moz-fx-data-shared-prod.monitoring.bigquery_usage`
WHERE
  submission_date <= CURRENT_DATE()
  AND destination_dataset_id LIKE "%backfills_staging_derived%"
  AND destination_table_id LIKE "%{{ your table name }}%"
ORDER BY
  submission_date DESC
```
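
A small sketch of the `EXCEPT` pattern mentioned above, reusing the hypothetical names from the Views sketch: the user-facing view hides a column that is written as `NULL` for new data until it is actually dropped alongside the next backfill.

```sql
-- Hypothetical view definition excluding a to-be-dropped column.
CREATE OR REPLACE VIEW
  `moz-fx-data-shared-prod.telemetry.example`
AS
SELECT
  * EXCEPT (deprecated_column)
FROM
  `moz-fx-data-shared-prod.telemetry_derived.example_v1`
```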