Recommended practices

Queries

  • Should be defined in files named as sql/<project>/<dataset>/<table>_<version>/query.sql e.g. sql/moz-fx-data-shared-prod/telemetry_derived/clients_daily_v7/query.sql
    • <project> defines both where the destination table resides and in which project the query job runs
    • Queries that populate tables should always be named with a version suffix; we assume that future optimizations to the data representation may require schema-incompatible changes such as dropping columns
  • May be generated using a python script that prints the query to stdout
    • Should save output as sql/<project>/<dataset>/<table>_<version>/query.sql as above
    • Should be named as sql/<project>/query_type.sql.py e.g. sql/moz-fx-data-shared-prod/clients_daily.sql.py
    • May use options to generate queries for different destination tables e.g. using --source telemetry_core_parquet_v3 to generate sql/moz-fx-data-shared-prod/telemetry/core_clients_daily_v1/query.sql and using --source main_summary_v4 to generate sql/moz-fx-data-shared-prod/telemetry/clients_daily_v7/query.sql
    • Should output a header indicating options used e.g.
      -- Query generated by: sql/moz-fx-data-shared-prod/clients_daily.sql.py --source telemetry_core_parquet
      
  • For tables in moz-fx-data-shared-prod the project prefix should be omitted to simplify testing. (Other projects do need the project prefix)
  • Should be incremental (see the sketch after this list)
  • Should filter input tables on partition and clustering columns
  • Should use _ prefix in generated column names not meant for output
  • Should use _bits suffix for any integer column that represents a bit pattern
  • Should not use DATETIME type, due to incompatibility with spark-bigquery-connector
  • Should read from *_stable tables instead of including custom deduplication
    • Should use the earliest row for each document_id by submission_timestamp where filtering duplicates is necessary
  • Should not refer to views in the mozdata project which are duplicates of views in another project (commonly moz-fx-data-shared-prod). Refer to the original view instead.
  • Should escape identifiers that match keywords, even if they aren't reserved keywords
  • Queries are interpreted as Jinja templates, so it is possible to use Jinja statements and expressions
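
A minimal illustrative sketch (table, column, and parameter names are hypothetical, and it assumes the usual @submission_date query parameter) that combines several of the practices above: an incremental query.sql that fills a single day, filters the *_stable input table on its partition column, keeps the earliest row per document_id, uses a _-prefixed working column, and names a bit-pattern column with the _bits suffix.

  WITH deduplicated AS (
    SELECT
      * EXCEPT (_n)
    FROM
      (
        SELECT
          *,
          -- _ prefix marks a working column not meant for output
          ROW_NUMBER() OVER (
            PARTITION BY document_id
            ORDER BY submission_timestamp
          ) AS _n
        FROM
          -- hypothetical *_stable input; project prefix omitted for moz-fx-data-shared-prod
          telemetry_stable.example_v1
        WHERE
          -- filter on the partition column
          DATE(submission_timestamp) = @submission_date
      )
    WHERE
      _n = 1  -- earliest row per document_id
  )
  SELECT
    DATE(submission_timestamp) AS submission_date,
    client_id,
    1 AS days_seen_bits,  -- hypothetical bit-pattern column, hence the _bits suffix
  FROM
    deduplicated
  GROUP BY
    submission_date,
    client_id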

Querying Metrics

  • Queries, views and UDFs can reference metrics and data sources that have been defined in metric-hub
    • To reference metrics use {{ metrics.calculate() }}:
      SELECT
        *
      FROM
        {{ metrics.calculate(
          metrics=['days_of_use', 'active_hours'],
          platform='firefox_desktop',
          group_by={'sample_id': 'sample_id', 'channel': 'application.channel'},
          where='submission_date = "2023-01-01"'
        ) }}
      
      -- this translates to
      SELECT
        *
      FROM
        (
          WITH clients_daily AS (
            SELECT
              client_id AS client_id,
              submission_date AS submission_date,
              COALESCE(SUM(active_hours_sum), 0) AS active_hours,
              COUNT(submission_date) AS days_of_use,
            FROM
              mozdata.telemetry.clients_daily
            GROUP BY
              client_id,
              submission_date
          )
          SELECT
            clients_daily.client_id,
            clients_daily.submission_date,
            active_hours,
            days_of_use,
          FROM
            clients_daily
        )
      
      • metrics: unique reference(s) to metric definitions; all metric definitions are aggregations (e.g. SUM, AVG, ...)
      • platform: platform to compute metrics for (e.g. firefox_desktop, firefox_ios, fenix, ...)
      • group_by: fields used in the GROUP BY statement; this is a dictionary where the key represents the alias, the value is the field path; GROUP BY always includes the configured client_id and submission_date fields
      • where: SQL filter clause
      • group_by_client_id: Whether the field configured as client_id (defined as part of the data source specification in metric-hub) should be part of the GROUP BY. True by default
      • group_by_submission_date: Whether the field configured as submission_date (defined as part of the data source specification in metric-hub) should be part of the GROUP BY. True by default
    • To reference data source definitions use {{ metrics.data_source() }}:
      SELECT
        *
      FROM
        {{ metrics.data_source(
          data_source='main',
          platform='firefox_desktop',
          where='submission_date = "2023-01-01"'
        ) }}
      
      -- this translates to
      SELECT
        *
      FROM
        (
          SELECT *
          FROM `mozdata.telemetry.main`
          WHERE submission_date = "2023-01-01"
        )
      
  • To render queries that use Jinja expressions or statements use ./bqetl query render path/to/query.sql (see the example after this list)
  • The generated-sql branch contains the rendered queries, views, and UDFs
  • ./bqetl query run does support running Jinja queries
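
For example, a hypothetical query using a standard Jinja for-loop (the table and column names are illustrative); ./bqetl query render expands the template into plain SQL:

  SELECT
    DATE(submission_timestamp) AS submission_date,
    {% for channel in ["release", "beta", "nightly"] %}
    -- one COUNTIF per channel, generated by the Jinja loop
    COUNTIF(normalized_channel = "{{ channel }}") AS {{ channel }}_count,
    {% endfor %}
  FROM
    telemetry_stable.example_v1
  GROUP BY
    submission_date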

Query Metadata

  • For each query, a metadata.yaml file should be created in the same directory
  • This file contains a description, owners and labels. As an example:
    friendly_name: SSL Ratios
    description: >
      Percentages of page loads Firefox users have performed that were
      conducted over SSL broken down by country.
    owners:
      - example@mozilla.com
    labels:
      application: firefox
      incremental: true # incremental queries add data to existing tables
      schedule: daily # scheduled in Airflow to run daily
      public_json: true
      public_bigquery: true
      review_bugs:
        - 1414839 # Bugzilla bug ID of data review
      incremental_export: false # non-incremental JSON export writes all data to a single location
  • Only labels whose values are integers or strings are published; all other value types are skipped

Views

  • Should be defined in files named as sql/<project>/<dataset>/<table>/view.sql e.g. sql/moz-fx-data-shared-prod/telemetry/core/view.sql
    • Views should generally not be named with a version suffix; a view represents a stable interface for users and whenever possible should maintain compatibility with existing queries; if the view logic cannot be adapted to changes in underlying tables, breaking changes must be communicated to fx-data-dev@mozilla.org
  • Must specify project and dataset in all table names (see the sketch after this list)
    • Should default to using the moz-fx-data-shared-prod project; the scripts/publish_views tooling can handle parsing the definitions to publish to other projects such as derived-datasets
  • Should not refer to views in the mozdata project which are duplicates of views in another project (commonly moz-fx-data-shared-prod). Refer to the original view instead.
  • Views are interpreted as Jinja templates, so it is possible to use Jinja statements and expressions
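
A minimal sketch of a hypothetical view.sql (view and table names are illustrative, and it assumes views are written as full CREATE OR REPLACE VIEW statements): the view carries no version suffix and fully qualifies the underlying versioned table:

  CREATE OR REPLACE VIEW
    `moz-fx-data-shared-prod.telemetry.example`
  AS
  SELECT
    *
  FROM
    `moz-fx-data-shared-prod.telemetry_derived.example_v1`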

UDFs

  • Should limit the number of expression subqueries to avoid: BigQuery error in query operation: Resources exceeded during query execution: Not enough resources for query planning - too many subqueries or query is too complex.
  • Should be used to avoid code duplication
  • Must be named in files with lower snake case names ending in .sql e.g. mode_last.sql
    • Each file must only define effectively private helper functions and one public function which must be defined last
      • Helper functions must not conflict with function names in other files
    • SQL UDFs must be defined in the udf/ directory and JS UDFs must be defined in the udf_js/ directory
      • The udf_legacy/ directory is an exception which must only contain compatibility functions for queries migrated from Athena/Presto.
  • Functions must be defined as persistent UDFs using CREATE OR REPLACE FUNCTION syntax (see the sketch after this list)
    • Function names must be prefixed with the <dir_name> dataset; for example, all functions in udf/*.sql are part of the udf dataset
      • The final syntax for creating a function in a file will look like CREATE OR REPLACE FUNCTION <dir_name>.<file_name>
    • We provide tooling in scripts/publish_persistent_udfs for publishing these UDFs to BigQuery
      • Changes made to UDFs need to be published manually in order for the dry run CI task to pass
  • Should prefer SQL over JavaScript for performance
  • UDFs are interpreted as Jinja templates, so it is possible to use Jinja statements and expressions
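
A minimal sketch of a hypothetical udf/example_active_days.sql (names and logic are illustrative): one effectively private helper followed by the single public function named after the file, both created as persistent functions in the udf dataset:

  -- Effectively private helper; its name must not conflict with helpers in other files
  CREATE OR REPLACE FUNCTION udf.example_active_days_helper(days_seen_bits INT64)
  RETURNS INT64 AS (
    BIT_COUNT(days_seen_bits & 0x0FFFFFFF)
  );

  -- Public function: defined last and named <dir_name>.<file_name>
  CREATE OR REPLACE FUNCTION udf.example_active_days(days_seen_bits INT64)
  RETURNS INT64 AS (
    udf.example_active_days_helper(days_seen_bits)
  );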

Large Backfills

  • Should be documented and reviewed by a peer in a new bug that describes the context requiring the backfill and the command or script used.
  • Frequent backfills should be avoided
    • Backfills may increase storage costs for a table for 90 days by moving data from long-term storage to short-term storage and requiring a production snapshot.
    • Should combine multiple backfills happening around the same time
    • Should delay column deletes until the next backfill that is needed for other reasons
      • Should use NULL for new data and EXCEPT to exclude the column from views until it is dropped (see the sketch after the validation query below)
  • After running the backfill, it is important to validate that the job(s) ran without errors and the execution times and bytes processed are as expected. Here is a query you may use for this purpose:
    SELECT
      job_type,
      state,
      submission_date,
      destination_dataset_id,
      destination_table_id,
      total_terabytes_billed,
      total_slot_ms,
      error_location,
      error_reason,
      error_message
    FROM
      `moz-fx-data-shared-prod.monitoring.bigquery_usage`
    WHERE
      submission_date <= CURRENT_DATE()
      AND destination_dataset_id LIKE "%backfills_staging_derived%"
      AND destination_table_id LIKE "%{{ your table name }}%"
    ORDER BY
      submission_date DESC
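
As referenced above, a minimal sketch of delaying a column delete (the table, view, and column names are illustrative): the query keeps writing the column as NULL, and the user-facing view excludes it until the next backfill drops it:

  -- In the incremental query: keep the column in the schema but write NULL
  SELECT
    client_id,
    CAST(NULL AS STRING) AS deprecated_column,
  FROM
    telemetry_stable.example_v1

  -- In the user-facing view: hide the column until it is actually dropped
  SELECT
    * EXCEPT (deprecated_column)
  FROM
    `moz-fx-data-shared-prod.telemetry_derived.example_v1`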