Common bigquery-etl workflows

This is a quick guide to performing common workflows in bigquery-etl using the bqetl CLI.

For any of these workflows, the bigquery-etl repository needs to be available locally, for example by cloning the repository, and the bqetl CLI needs to be installed by running ./bqetl bootstrap.
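
For example, assuming the repository lives at its usual GitHub location:

    git clone https://github.com/mozilla/bigquery-etl.git
    cd bigquery-etl
    ./bqetl bootstrap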

Adding a new scheduled query

The Creating derived datasets tutorial provides a more detailed guide on creating scheduled queries.

  1. Run ./bqetl query create <dataset>.<table>_<version>
    • Specify the desired destination dataset and table name for <dataset>.<table>_<version>
    • Directories and files are generated automatically
  2. Open the query.sql file that has been created in sql/moz-fx-data-shared-prod/<dataset>/<table>_<version>/ and write the query
  3. [Optional] Run ./bqetl query schema update <dataset>.<table>_<version> to generate the schema.yaml file
    • Optionally add column descriptions to schema.yaml
  4. Open the metadata.yaml file in sql/moz-fx-data-shared-prod/<dataset>/<table>_<version>/
    • Add a description of the query
    • Add BigQuery information such as table partitioning or clustering (see the metadata.yaml sketch after this list)
  5. Run ./bqetl query validate <dataset>.<table>_<version> to dry run and format the query
  6. To schedule the query, first select a DAG from the output of ./bqetl dag info, or create a new DAG with ./bqetl dag create <bqetl_new_dag>
  7. Run ./bqetl query schedule <dataset>.<table>_<version> --dag <bqetl_dag> to schedule the query
  8. Run ./bqetl dag generate <bqetl_dag> to update the DAG file
  9. Create a pull request
    • CI fails since the destination table doesn't exist yet
  10. PR gets reviewed and eventually approved
  11. Create the destination table: ./bqetl query schema deploy <dataset>.<table>_<version> (requires a schema.yaml file to be present)
    • This step needs to be performed by a data engineer as it requires DE credentials.
  12. Merge the pull request
  13. Backfill data
    • Option 1: via Airflow interface
    • Option 2: ./bqetl query backfill <dataset>.<table>_<version>
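
As an end-to-end illustration, the steps above might look as follows for a hypothetical telemetry_derived.example_clicks_v1 table scheduled on a hypothetical bqetl_example DAG (all names are made up; check each command's --help for exact flags):

    # Step 1: generate the directories and template files
    ./bqetl query create telemetry_derived.example_clicks_v1
    # Step 3: after writing query.sql, generate schema.yaml
    ./bqetl query schema update telemetry_derived.example_clicks_v1
    # Step 5: dry run and format the query
    ./bqetl query validate telemetry_derived.example_clicks_v1
    # Steps 6-8: schedule the query and regenerate the DAG file
    ./bqetl query schedule telemetry_derived.example_clicks_v1 --dag bqetl_example
    ./bqetl dag generate bqetl_example
    # Step 11 (data engineers only): create the destination table
    ./bqetl query schema deploy telemetry_derived.example_clicks_v1
    # Step 13, option 2: backfill historical data
    ./bqetl query backfill telemetry_derived.example_clicks_v1

For step 4, the bigquery section of metadata.yaml controls partitioning and clustering. A minimal sketch, in which the submission_date and country fields and all other values are assumptions (existing tables' metadata.yaml files show the full set of supported keys):

    friendly_name: Example Clicks
    description: Daily click counts per client (hypothetical example).
    owners:
      - example@mozilla.com
    bigquery:
      time_partitioning:
        type: day
        field: submission_date
        require_partition_filter: true
      clustering:
        fields:
          - country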

Update an existing query

  1. Open the query.sql file of the query to be updated and make changes
  2. Run ./bqetl query validate <dataset>.<table>_<version> to dry run and format the query
  3. If the query scheduling metadata has changed, run ./bqetl dag generate <bqetl_dag> to update the DAG file
  4. If the query adds new columns, run ./bqetl query schema update <dataset>.<table>_<version> to make local schema.yaml updates (a schema.yaml sketch follows this list)
  5. Open PR with changes
    • CI can fail if schema updates haven't been propagated to destination tables, for example when adding new fields
  6. PR reviewed and approved
  7. Deploy schema changes by running ./bqetl query schema deploy <dataset>.<table>_<version>
  8. Merge the pull request
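
For reference, schema.yaml lists the destination table's fields in BigQuery schema format; adding a column to the query means adding a matching entry, as in this sketch with made-up field names:

    fields:
    - name: submission_date
      type: DATE
      mode: NULLABLE
      description: Date the data was received.
    - name: click_count
      type: INTEGER
      mode: NULLABLE
      description: Newly added column with the number of clicks per client.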

Adding a new mozfun UDF

  1. Run ./bqetl mozfun create <dataset>.<name> --udf
  2. Navigate to the udf.sql file in sql/mozfun/<dataset>/<name>/ and add the UDF definition and tests (a sketch follows this list)
  3. Run ./bqetl mozfun validate <dataset>.<name> for formatting and running tests
  4. Open a PR
  5. PR gets reviewed, approved and merged
  6. To publish the UDF immediately:
    • Go to the mozfun DAG in Airflow and clear the latest run
    • Otherwise it will be published within a day, the next time mozfun is executed
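
For illustration, a udf.sql file contains the function definition followed by its tests; a minimal sketch for a hypothetical examples.add_one UDF, using mozfun's assert helpers for the tests:

    -- sql/mozfun/examples/add_one/udf.sql (hypothetical dataset and name)
    CREATE OR REPLACE FUNCTION examples.add_one(x INT64)
    RETURNS INT64 AS (
      x + 1
    );

    -- Tests
    SELECT
      assert.equals(2, examples.add_one(1)),
      assert.equals(0, examples.add_one(-1));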

Adding a new internal UDF

Internal UDFs are usually only used by specific queries. If your UDF might be useful to others, consider publishing it as a mozfun UDF.

  1. Run ./bqetl routine create <dataset>.<name> --udf
  2. Navigate to the udf.sql file in sql/moz-fx-data-shared-prod/<dataset>/<name>/ and add the UDF definition and tests
  3. Run ./bqetl routine validate <dataset>.<name> for formatting and running tests
  4. Open a PR
  5. PR gets reviewed, approved and merged
  6. To publish the UDF immediately:
    • Run ./bqetl routine publish
    • Otherwise it will take up to a day until the UDF is published automatically
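
The commands mirror the mozfun workflow; for a hypothetical internal UDF in the udf dataset:

    ./bqetl routine create udf.example_add_one --udf
    # edit sql/moz-fx-data-shared-prod/udf/example_add_one/udf.sql, then:
    ./bqetl routine validate udf.example_add_one
    # after the PR is merged, publish without waiting for the daily job:
    ./bqetl routine publish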

Adding a stored procedure

The same steps as for creating a new UDF apply to stored procedures, except that the initial create command is ./bqetl mozfun create <dataset>.<name> --stored_procedure, or ./bqetl routine create <dataset>.<name> --stored_procedure for internal stored procedures.
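
For example, with hypothetical dataset and procedure names:

    # published (mozfun) stored procedure
    ./bqetl mozfun create examples.refresh_stats --stored_procedure
    # internal stored procedure
    ./bqetl routine create udf.refresh_stats --stored_procedure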

Updating an existing UDF

  1. Navigate to the udf.sql file and make updates
  2. Run ./bqetl mozfun validate <dataset>.<name> or ./bqetl routine validate <dataset>.<name> for formatting and running tests
  3. Open a PR
  4. PR gets reviewed, approved and merged

Renaming an existing UDF

  1. Run ./bqetl mozfun rename <dataset>.<name> <new_dataset>.<new_name>
    • References in queries to the UDF are automatically updated
  2. Open a PR
  3. PR gets reviewed, approved and merged
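
For example, renaming a hypothetical UDF:

    # moves sql/mozfun/examples/add_one/ to sql/mozfun/examples/increment/
    # and rewrites references to it in queries
    ./bqetl mozfun rename examples.add_one examples.increment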

Publishing data

  1. Get a data review by following the data publishing process
  2. Update the metadata.yaml file of the query to be published
    • Set public_bigquery: true and optionally public_json: true
    • Specify the review_bugs (see the sketch after this list)
  3. If an internal dataset already exists, move it to mozilla-public-data
  4. If an init.sql file exists for the query, change the destination project for the created table to mozilla-public-data
  5. Run ./bqetl dag generate bqetl_public_data_json to update the DAG
  6. Open a PR
  7. PR gets reviewed, approved and merged
    • Once the ETL is running, a view referencing the public dataset will automatically be published to moz-fx-data-shared-prod
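
The publishing-related metadata.yaml entries might look like this (a sketch; the bug number is a placeholder for the actual data-review bug, and key placement should be checked against existing public datasets):

    labels:
      public_bigquery: true
      public_json: true
      review_bugs:
        - 1234567  # placeholder for the data-review bug number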