Recommended practices

Queries

Should be defined in files named as sql/<project>/<dataset>/<table>_<version>/query.sql e.g.
- <project> defines both where the destination table resides and in which project the query job runs sql/moz-fx-data-shared-prod/telemetry_derived/clients_daily_v7/query.sql
- Queries that populate tables should always be named with a version suffix; we assume that future optimizations to the data representation may require schema-incompatible changes such as dropping columns
May be generated using a python script that prints the query to stdout
- Should save output as sql/<project>/<dataset>/<table>_<version>/query.sql as above
- Should be named as sql/<project>/query_type.sql.py e.g. sql/moz-fx-data-shared-prod/clients_daily.sql.py
- May use options to generate queries for different destination tables e.g. using --source telemetry_core_parquet_v3 to generate sql/moz-fx-data-shared-prod/telemetry/core_clients_daily_v1/query.sql and using --source main_summary_v4 to generate sql/moz-fx-data-shared-prod/telemetry/clients_daily_v7/query.sql
- Should output a header indicating options used e.g.
```
-- Query generated by: sql/moz-fx-data-shared-prod/clients_daily.sql.py --source telemetry_core_parquet
```
For tables in moz-fx-data-shared-prod the project prefix should be omitted to simplify testing. (Other projects do need the project prefix)
Should be incremental
Should filter input tables on partition and clustering columns
Should use _ prefix in generated column names not meant for output
Should use _bits suffix for any integer column that represents a bit pattern
Should not use DATETIME type, due to incompatibility with spark-bigquery-connector
Should read from *_stable tables instead of including custom deduplication
- Should use the earliest row for each document_id by submission_timestamp where filtering duplicates is necessary
Should not refer to views in the mozdata project which are duplicates of views in another project (commonly moz-fx-data-shared-prod). Refer to the original view instead.
Should escape identifiers that match keywords, even if they aren't reserved keywords
Queries are interpreted as Jinja templates, so it is possible to use Jinja statements and expressions

Querying Metrics

Queries, views and UDFs can reference metrics and data sources that have been defined in metric-hub

To reference metrics use {{ metrics.calculate() }}:

SELECT
  *
FROM
  {{ metrics.calculate(
    metrics=['days_of_use', 'active_hours'],
    platform='firefox_desktop',
    group_by={'sample_id': 'sample_id', 'channel': 'application.channel'},
    where='submission_date = "2023-01-01"'
  ) }}

-- this translates to
SELECT
  *
FROM
  (
    WITH clients_daily AS (
      SELECT
        client_id AS client_id,
        submission_date AS submission_date,
        COALESCE(SUM(active_hours_sum), 0) AS active_hours,
        COUNT(submission_date) AS days_of_use,
      FROM
        mozdata.telemetry.clients_daily
      GROUP BY
        client_id,
        submission_date
    )
    SELECT
      clients_daily.client_id,
      clients_daily.submission_date,
      active_hours,
      days_of_use,
    FROM
      clients_daily
  )

metrics: unique reference(s) to metric definition, all metric definitions are aggregations (e.g. SUM, AVG, ...)
platform: platform to compute metrics for (e.g. firefox_desktop, firefox_ios, fenix, ...)
group_by: fields used in the GROUP BY statement; this is a dictionary where the key represents the alias, the value is the field path; GROUP BY always includes the configured client_id and submission_date fields
where: SQL filter clause
group_by_client_id: Whether the field configured as client_id (defined as part of the data source specification in metric-hub) should be part of the GROUP BY. True by default
group_by_submission_date: Whether the field configured as submission_date (defined as part of the data source specification in metric-hub) should be part of the GROUP BY. True by default

To reference data source definitions use {{ metrics.data_source() }}:

SELECT
  *
FROM
  {{ metrics.data_source(
    data_source='main',
    platform='firefox_desktop',
    where='submission_date = "2023-01-01"'
  ) }}

-- this translates to
SELECT
  *
FROM
  (
    SELECT *
    FROM `mozdata.telemetry.main`
    WHERE submission_date = "2023-01-01"
  )

To render queries that use Jinja expressions or statements use ./bqetl query render path/to/query.sql
The generated-sql branch has rendered queries/views/UDFs
./bqetl query run does support running Jinja queries

Query Metadata

For each query, a metadata.yaml file should be created in the same directory
This file contains a description, owners and labels. As an example:

friendly_name: SSL Ratios
description: >
  Percentages of page loads Firefox users have performed that were
  conducted over SSL broken down by country.
owners:
  - example@mozilla.com
labels:
  application: firefox
  incremental: true # incremental queries add data to existing tables
  schedule: daily # scheduled in Airflow to run daily
  public_json: true
  public_bigquery: true
  review_bugs:
    - 1414839 # Bugzilla bug ID of data review
  incremental_export: false # non-incremental JSON export writes all data to a single location

only labels where value types are eithers integers or strings are published, all other values types are being skipped

Views

Should be defined in files named as sql/<project>/<dataset>/<table>/view.sql e.g. sql/moz-fx-data-shared-prod/telemetry/core/view.sql
- Views should generally not be named with a version suffix; a view represents a stable interface for users and whenever possible should maintain compatibility with existing queries; if the view logic cannot be adapted to changes in underlying tables, breaking changes must be communicated to fx-data-dev@mozilla.org
Must specify project and dataset in all table names
- Should default to using the moz-fx-data-shared-prod project; the scripts/publish_views tooling can handle parsing the definitions to publish to other projects such as derived-datasets
Should not refer to views in the mozdata project which are duplicates of views in another project (commonly moz-fx-data-shared-prod). Refer to the original view instead.
Views are interpreted as Jinja templates, so it is possible to use Jinja statements and expressions

UDFs

Should limit the number of expression subqueries to avoid: BigQuery error in query operation: Resources exceeded during query execution: Not enough resources for query planning - too many subqueries or query is too complex.
Should be used to avoid code duplication
Must be named in files with lower snake case names ending in .sql e.g. mode_last.sql
- Each file must only define effectively private helper functions and one public function which must be defined last
  - Helper functions must not conflict with function names in other files
- SQL UDFs must be defined in the udf/ directory and JS UDFs must be defined in the udf_js directory
  - The udf_legacy/ directory is an exception which must only contain compatibility functions for queries migrated from Athena/Presto.
Functions must be defined as persistent UDFs using CREATE OR REPLACE FUNCTION syntax
- Function names must be prefixed with a dataset of <dir_name>. so, for example, all functions in udf/*.sql are part of the udf dataset
  - The final syntax for creating a function in a file will look like CREATE OR REPLACE FUNCTION <dir_name>.<file_name>
- We provide tooling in scripts/publish_persistent_udfs for publishing these UDFs to BigQuery
  - Changes made to UDFs need to be published manually in order for the dry run CI task to pass
Should use SQL over js for performance
UDFs are interpreted as Jinja templates, so it is possible to use Jinja statements and expressions

Large Backfills

Should be documented and reviewed by a peer using a new bug that describes the context that required the backfill and the command or script used.
Frequent backfills should be avoided
- Backfills may increase storage costs for a table for 90 days by moving data from long-term storage to short-term storage and requiring a production snapshot.
- Should combine multiple backfills happening around the same time
- Should delay column deletes until the next other backfill
  - Should use NULL for new data and EXCEPT to exclude from views until dropped

After running the backfill, it is important to validate that the job(s) ran without errors and the execution times and bytes processed are as expected. Here is a query you may use for this purpose:

SELECT
  job_type,
  state,
  submission_date,
  destination_dataset_id,
  destination_table_id,
  total_terabytes_billed,
  total_slot_ms,
  error_location,
  error_reason,
  error_message
FROM
  moz-fx-data-shared-prod.monitoring.bigquery_usage
WHERE
  submission_date <= CURRENT_DATE()
  AND destination_dataset_id LIKE "%backfills_staging_derived%"
  AND destination_table_id LIKE "%{{ your table name }}%"
ORDER BY
  submission_date DESC