bqetl CLI

The bqetl command-line tool aims to simplify working with the bigquery-etl repository by supporting common workflows, such as creating, validating and scheduling queries or adding new UDFs.

Running some commands, for example to create or query tables, will require Mozilla GCP access.

Installation

Follow the Quick Start to set up bigquery-etl and the bqetl CLI.

Configuration

bqetl can be configured via the bqetl_project.yaml file. See Configuration to find available configuration options.

Commands

To list all available commands in the bqetl CLI:

$ ./bqetl

Usage: bqetl [OPTIONS] COMMAND [ARGS]...

  CLI tools for working with bigquery-etl.

Options:
  --version  Show the version and exit.
  --help     Show this message and exit.

Commands:
  alchemer    Commands for importing alchemer data.
  dag         Commands for managing DAGs.
  dependency  Build and use query dependency graphs.
  dryrun      Dry run SQL.
  format      Format SQL.
  glam        Tools for GLAM ETL.
  mozfun      Commands for managing mozfun routines.
  query       Commands for managing queries.
  routine     Commands for managing routines.
  stripe      Commands for Stripe ETL.
  view        Commands for managing views.
  backfill    Commands for managing backfills.

See help for any command:

$ ./bqetl [command] --help

query

Commands for managing queries.

create

Create a new query with name <dataset>.<table_name>, for example: telemetry_derived.active_profiles. Use the --project_id option to change the project the query is added to; the default is moz-fx-data-shared-prod. Views are automatically generated in the publicly facing dataset.

Usage

$ ./bqetl query create [OPTIONS] [name]

Options:

--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID
--owner: Owner of the query (email address)
--dag: Name of the DAG the query should be scheduled under. If no DAG name is specified, the query is scheduled by default in the DAG bqetl_default. To skip automated scheduling, use --no_schedule. To see available DAGs, run `bqetl dag info`. To create a new DAG, run `bqetl dag create`.
--no_schedule: Using this option creates the query without scheduling information. Use `bqetl query schedule` to add it manually if required.

Examples

./bqetl query create telemetry_derived.deviations_v1 \
  --owner=example@mozilla.com


# The query version gets autocompleted to v1. Queries are created in the
# _derived dataset and accompanying views in the public dataset.
./bqetl query create telemetry.deviations --owner=example@mozilla.com
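The --no_schedule flag can be combined with query creation when scheduling will be added later; a sketch (the query name reuses the example above):

```shell
# Create a query without scheduling information; add scheduling later
# with `bqetl query schedule` if required.
./bqetl query create telemetry_derived.deviations_v1 \
  --owner=example@mozilla.com \
  --no_schedule
```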

schedule

Schedule an existing query

Usage

$ ./bqetl query schedule [OPTIONS] [name]

Options:

--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID
--dag: Name of the DAG the query should be scheduled under. To see available DAGs run `bqetl dag info`. To create a new DAG run `bqetl dag create`.
--depends_on_past: Only execute query if previous scheduled run succeeded.
--task_name: Custom name for the Airflow task. By default the task name is a combination of the dataset and table name.

Examples

./bqetl query schedule telemetry_derived.deviations_v1 \
  --dag=bqetl_deviations


# Set a specific name for the task
./bqetl query schedule telemetry_derived.deviations_v1 \
  --dag=bqetl_deviations \
  --task-name=deviations

info

Get information about all or specific queries.

Usage

$ ./bqetl query info [OPTIONS] [name]

Options:

--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID
--cost: Include information about query costs
--last_updated: Include timestamps when destination tables were last updated

Examples

# Get info for specific queries
./bqetl query info telemetry_derived.*


# Get cost and last update timestamp information
./bqetl query info telemetry_derived.clients_daily_v6 \
  --cost --last_updated

backfill

Run a backfill for a query. Additional parameters are passed through to bq.

Usage

$ ./bqetl query backfill [OPTIONS] [name]

Options:

--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID
--start_date: First date to be backfilled
--end_date: Last date to be backfilled
--exclude: Dates excluded from backfill. Date format: yyyy-mm-dd
--dry_run: Dry run the backfill
--max_rows: How many rows to return in the result
--parallelism: Number of threads to run the backfill in parallel
--destination_table: Destination table name results are written to. If not set, determines destination table based on query.
--checks: Whether to run checks during backfill
--scheduling_parameters_override: List of parameters to override the query's existing scheduling parameters.

Examples

# Backfill for a specific date range
./bqetl query backfill telemetry_derived.ssl_ratios_v1 \
  --start_date=2021-03-01 \
  --end_date=2021-03-31


# Dryrun backfill for specific date range and exclude date
./bqetl query backfill telemetry_derived.ssl_ratios_v1 \
  --start_date=2021-03-01 \
  --end_date=2021-03-31 \
  --exclude=2021-03-03 \
  --dry_run

run

Run a query. Additional parameters are passed through to bq.
If a destination_table is set, the query result is written to BigQuery; without a destination_table, the results are not stored.
If the name is not found within the sql/ folder, bqetl assumes it hasn't been generated yet and starts the generation process for all sql_generators/ files. This generation process takes some time and runs dry run calls against BigQuery, which is expected.
Additional parameters (all parameters not listed in the Options) must come after the query name. Otherwise, the first parameter that is not an option is interpreted as the query name; since it can't be found, the generation process will start.

Usage

$ ./bqetl query run [OPTIONS] [name]

Options:

--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID
--public_project_id: Project with publicly accessible data
--destination_table: Destination table name results are written to. If not set, the query result will not be written to BigQuery.
--dataset_id: Destination dataset results are written to. If not set, determines destination dataset based on query.

Examples

# Run a query by name
./bqetl query run telemetry_derived.ssl_ratios_v1


# Run a query file
./bqetl query run /path/to/query.sql


# Run a query and save the result to BigQuery
./bqetl query run telemetry_derived.ssl_ratios_v1 \
  --project_id=moz-fx-data-shared-prod \
  --dataset_id=telemetry_derived \
  --destination_table=ssl_ratios_v1
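As noted above, additional bq parameters must follow the query name; a sketch (the parameter name submission_date is an assumption about this query's parameters):

```shell
# Correct ordering: extra bq parameters come after the query name.
# If they came first, bqetl would treat them as the query name and
# kick off the generation process.
./bqetl query run telemetry_derived.ssl_ratios_v1 \
  --parameter=submission_date:DATE:2021-03-01
```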

run-multipart

Run a multipart query.

Usage

$ ./bqetl query run-multipart [OPTIONS] [query_dir]

Options:

--using: comma separated list of join columns to use when combining results
--parallelism: Maximum number of queries to execute concurrently
--dataset_id: Default dataset, if not specified all tables must be qualified with dataset
--project_id: GCP project ID
--temp_dataset: Dataset where intermediate query results will be temporarily stored, formatted as PROJECT_ID.DATASET_ID
--destination_table: table where combined results will be written
--time_partitioning_field: time partition field on the destination table
--clustering_fields: comma separated list of clustering fields on the destination table
--dry_run: Print bytes that would be processed for each part and don't run queries
--parameters: query parameter(s) to pass when running parts
--priority: Priority for BigQuery query jobs; BATCH priority will significantly slow down queries if reserved slots are not enabled for the billing project; defaults to INTERACTIVE
--schema_update_options: Optional options for updating the schema.

Examples

# Run a multipart query
./bqetl query run-multipart /path/to/query.sql
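Join columns and a destination can be specified when combining part results; a hedged sketch (the directory, column, dataset, and table names are illustrative):

```shell
# Combine part results on client_id and write them to a destination table
./bqetl query run-multipart /path/to/query_dir \
  --using=client_id \
  --dataset_id=telemetry_derived \
  --destination_table=example_v1
```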

validate

Validate a query. Checks formatting, scheduling information and dry runs the query.

Usage

$ ./bqetl query validate [OPTIONS] [name]

Options:

--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID
--use_cloud_function: Use the Cloud Function for dry running SQL, if set to `True`. The Cloud Function can only access tables in shared-prod. If set to `False`, use active GCP credentials for the dry run.
--validate_schemas: Require dry run schema to match destination table and file if present.
--respect_dryrun_skip: Respect or ignore dry run skip configuration. Default is --ignore-dryrun-skip.
--no_dryrun: Skip running dryrun. Default is False.

Examples

./bqetl query validate telemetry_derived.clients_daily_v6


# Validate query not in shared-prod
./bqetl query validate \
  --use_cloud_function=false \
  --project_id=moz-fx-data-marketing-prod \
  ga_derived.blogs_goals_v1

initialize

Run a full backfill on the destination table for the query. Using this command will:

- Create the table if it doesn't exist and run a full backfill.
- Run a full backfill if the table exists and is empty.
- Raise an exception if the table exists and has data, or if the table exists and the schema doesn't match the query.

It supports query.sql files that use the is_init() pattern. To run in parallel per sample_id, include a @sample_id parameter in the query.

Usage

$ ./bqetl query initialize [OPTIONS] [name]

Options:

--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID
--dry_run: Dry run the initialization
--parallelism: Number of threads for parallel processing
--skip_existing: Skip initialization for existing artifacts. This ensures that artifacts, like materialized views, only get initialized if they don't already exist.
--force: Run the initialization even if the destination table contains data.

Examples

# For init.sql files
./bqetl query initialize telemetry_derived.ssl_ratios_v1

# For query.sql files, run in parallel
./bqetl query initialize sql/moz-fx-data-shared-prod/telemetry_derived/clients_first_seen_v2/query.sql

render

Render a query Jinja template.

Usage

$ ./bqetl query render [OPTIONS] [name]

Options:

--sql_dir: Path to directory which contains queries.
--output_dir: Output directory generated SQL is written to. If not specified, rendered queries are printed to console.
--parallelism: Number of threads for parallel processing

Examples

./bqetl query render telemetry_derived.ssl_ratios_v1 \
  --output-dir=/tmp

schema

Commands for managing query schemas.

update

Update the query schema based on the destination table schema and the query schema. If no schema.yaml file exists for a query, one will be created.

Usage

$ ./bqetl query schema update [OPTIONS] [name]

Options:

--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID
--update_downstream: Update downstream dependencies. GCP authentication required.
--tmp_dataset: GCP datasets for creating updated tables temporarily.
--use_cloud_function: Use the Cloud Function for dry running SQL, if set to `True`. The Cloud Function can only access tables in shared-prod. If set to `False`, use active GCP credentials for the dry run.
--respect_dryrun_skip: Respect or ignore dry run skip configuration. Default is --respect-dryrun-skip.
--parallelism: Number of threads for parallel processing
--is_init: Indicates whether the `is_init()` condition should be set to true or false.

Examples

./bqetl query schema update telemetry_derived.clients_daily_v6

# Update schema including downstream dependencies (requires GCP)
./bqetl query schema update telemetry_derived.clients_daily_v6 --update-downstream

deploy

Deploy the query schema.

Usage

$ ./bqetl query schema deploy [OPTIONS] [name]

Options:

--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID
--force: Deploy the schema file without validating that it matches the query
--use_cloud_function: Use the Cloud Function for dry running SQL, if set to `True`. The Cloud Function can only access tables in shared-prod. If set to `False`, use active GCP credentials for the dry run.
--respect_dryrun_skip: Respect or ignore dry run skip configuration. Default is --respect-dryrun-skip.
--skip_existing: Skip updating existing tables. This option ensures that only new tables get deployed.
--skip_external_data: Skip publishing external data, such as Google Sheets.
--destination_table: Destination table name results are written to. If not set, determines destination table based on query.  Must be fully qualified (project.dataset.table).
--parallelism: Number of threads for parallel processing

Examples

./bqetl query schema deploy telemetry_derived.clients_daily_v6

validate

Validate the query schema

Usage

$ ./bqetl query schema validate [OPTIONS] [name]

Options:

--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID
--use_cloud_function: Use the Cloud Function for dry running SQL, if set to `True`. The Cloud Function can only access tables in shared-prod. If set to `False`, use active GCP credentials for the dry run.
--respect_dryrun_skip: Respect or ignore dry run skip configuration. Default is --respect-dryrun-skip.

Examples

./bqetl query schema validate telemetry_derived.clients_daily_v6

dag

Commands for managing DAGs.

info

Get information about available DAGs.

Usage

$ ./bqetl dag info [OPTIONS] [name]

Options:

--dags_config: Path to dags.yaml config file
--sql_dir: Path to directory which contains queries.
--with_tasks: Include scheduled tasks

Examples

# Get information about all available DAGs
./bqetl dag info

# Get information about a specific DAG
./bqetl dag info bqetl_ssl_ratios

# Get information about a specific DAG including scheduled tasks
./bqetl dag info --with_tasks bqetl_ssl_ratios

create

Create a new DAG with name bqetl_<name>, for example: bqetl_search. When creating new DAGs, the DAG name must have a bqetl_ prefix. Created DAGs are added to the dags.yaml file.

Usage

$ ./bqetl dag create [OPTIONS] [name]

Options:

--dags_config: Path to dags.yaml config file
--schedule_interval: Schedule interval of the new DAG. Schedule intervals can be either in CRON format or one of: once, hourly, daily, weekly, monthly, yearly, or a timedelta in the form [days]d[hours]h[minutes]m
--owner: Email address of the DAG owner
--description: Description for DAG
--tag: Tag to apply to the DAG
--start_date: First date for which scheduled queries should be executed
--email: Email addresses that Airflow will send alerts to
--retries: Number of retries Airflow will attempt in case of failures
--retry_delay: Time period Airflow will wait after failures before running failed tasks again

Examples

./bqetl dag create bqetl_core \
--schedule-interval="0 2 * * *" \
--owner=example@mozilla.com \
--description="Tables derived from `core` pings sent by mobile applications." \
--tag=impact/tier_1 \
--start-date=2019-07-25


# Create DAG and overwrite default settings
./bqetl dag create bqetl_ssl_ratios --schedule-interval="0 2 * * *" \
--owner=example@mozilla.com \
--description="The DAG schedules SSL ratios queries." \
--tag=impact/tier_1 \
--start-date=2019-07-20 \
--email=example2@mozilla.com \
--email=example3@mozilla.com \
--retries=2 \
--retry_delay=30m
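Schedule intervals can also be timedeltas rather than CRON expressions; a minimal sketch (the DAG name, interval, and dates are illustrative):

```shell
# Create a DAG that runs every 6 hours using a timedelta interval
./bqetl dag create bqetl_example \
  --schedule-interval=6h \
  --owner=example@mozilla.com \
  --description="Example DAG with a timedelta schedule." \
  --start-date=2023-01-01
```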

generate

Generate Airflow DAGs from DAG definitions.

Usage

$ ./bqetl dag generate [OPTIONS] [name]

Options:

--dags_config: Path to dags.yaml config file
--sql_dir: Path to directory which contains queries.
--output_dir: Path directory with generated DAGs

Examples

# Generate all DAGs
./bqetl dag generate

# Generate a specific DAG
./bqetl dag generate bqetl_ssl_ratios

remove

Remove a DAG. This will also remove the scheduling information from the queries that were scheduled as part of the DAG.

Usage

$ ./bqetl dag remove [OPTIONS] [name]

Options:

--dags_config: Path to dags.yaml config file
--sql_dir: Path to directory which contains queries.
--output_dir: Path directory with generated DAGs

Examples

# Remove a specific DAG
./bqetl dag remove bqetl_vrbrowser

dependency

Build and use query dependency graphs.

show

Show table references in sql files.

Usage

$ ./bqetl dependency show [OPTIONS] [paths]
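A usage sketch (the directory path is illustrative):

```shell
# Show table references for all queries in a dataset directory
./bqetl dependency show sql/moz-fx-data-shared-prod/telemetry_derived/
```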

record

Record table references in metadata. Fails if the metadata already contains a references section.

Usage

$ ./bqetl dependency record [OPTIONS] [paths]
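A usage sketch (the directory path is illustrative):

```shell
# Record table references in the metadata files for a dataset directory
./bqetl dependency record sql/moz-fx-data-shared-prod/telemetry_derived/
```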

dryrun

Dry run SQL. By default, uses the dryrun Cloud Function, which only has access to shared-prod. To dry run queries accessing tables in another project, set --use-cloud-function=false and ensure that the command line has access to a GCP service account.

Usage

$ ./bqetl dryrun [OPTIONS] [paths]

Options:

--use_cloud_function: Use the Cloud Function for dry running SQL, if set to `True`. The Cloud Function can only access tables in shared-prod. If set to `False`, use active GCP credentials for the dry run.
--validate_schemas: Require dry run schema to match destination table and file if present.
--respect_skip: Respect or ignore query skip configuration. Default is --respect-skip.
--project: GCP project to perform dry run in when --use_cloud_function=False

Examples

./bqetl dryrun sql/moz-fx-data-shared-prod/telemetry_derived/

# Dry run SQL with tables that are not in shared prod
./bqetl dryrun --use-cloud-function=false sql/moz-fx-data-marketing-prod/

format

Format SQL files.

Usage

$ ./bqetl format [OPTIONS] [paths]

Examples

# Format a specific file
./bqetl format sql/moz-fx-data-shared-prod/telemetry/core/view.sql

# Format all SQL files in `sql/`
./bqetl format sql

# Format standard in (will write to standard out)
echo 'SELECT 1,2,3' | ./bqetl format

routine

Commands for managing routines for internal use.

create

Create a new routine. Specify whether the routine is a UDF or stored procedure by adding a --udf or --stored_procedure flag.

Usage

$ ./bqetl routine create [OPTIONS] [name]

Options:

--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID
--udf: Create a new UDF
--stored_procedure: Create a new stored procedure

Examples

# Create a UDF
./bqetl routine create --udf udf.array_slice


# Create a stored procedure
./bqetl routine create --stored_procedure udf.events_daily


# Create a UDF in a project other than shared-prod
./bqetl routine create --udf udf.active_last_week --project_id=moz-fx-data-marketing-prod

info

Get routine information.

Usage

$ ./bqetl routine info [OPTIONS] [name]

Options:

--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID
--usages: Show routine usages

Examples

# Get information about all internal routines in a specific dataset
./bqetl routine info udf.*


# Get usage information of specific routine
./bqetl routine info --usages udf.get_key

validate

Validate formatting of routines and run tests.

Usage

$ ./bqetl routine validate [OPTIONS] [name]

Options:

--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID
--docs_only: Only validate docs.

Examples

# Validate all routines
./bqetl routine validate


# Validate selected routines
./bqetl routine validate udf.*

publish

Publish routines to BigQuery. Requires service account access.

Usage

$ ./bqetl routine publish [OPTIONS] [name]

Options:

--project_id: GCP project ID
--dependency_dir: The directory where JavaScript dependency files for UDFs are stored.
--gcs_bucket: The GCS bucket where dependency files are uploaded to.
--gcs_path: The GCS path in the bucket where dependency files are uploaded to.
--dry_run: Dry run publishing UDFs.

Examples

# Publish all routines
./bqetl routine publish


# Publish selected routines
./bqetl routine publish udf.*

rename

Rename routine or routine dataset. Replaces all usages in queries with the new name.

Usage

$ ./bqetl routine rename [OPTIONS] [name] [new_name]

Options:

--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID

Examples

# Rename routine
./bqetl routine rename udf.array_slice udf.list_slice


# Rename routine matching a specific pattern
./bqetl routine rename udf.array_* udf.list_*

mozfun

Commands for managing public mozfun routines.

create

Create a new mozfun routine. Specify whether the routine is a UDF or stored procedure by adding a --udf or --stored_procedure flag. UDFs are added to the mozfun project.

Usage

$ ./bqetl mozfun create [OPTIONS] [name]

Options:

--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID
--udf: Create a new UDF
--stored_procedure: Create a new stored procedure

Examples

# Create a UDF
./bqetl mozfun create --udf bytes.zero_right


# Create a stored procedure
./bqetl mozfun create --stored_procedure event_analysis.events_daily

info

Get mozfun routine information.

Usage

$ ./bqetl mozfun info [OPTIONS] [name]

Options:

--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID
--usages: Show routine usages

Examples

# Get information about all internal routines in a specific dataset
./bqetl mozfun info hist.*


# Get usage information of specific routine
./bqetl mozfun info --usages hist.mean

validate

Validate formatting of mozfun routines and run tests.

Usage

$ ./bqetl mozfun validate [OPTIONS] [name]

Options:

--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID
--docs_only: Only validate docs.

Examples

# Validate all routines
./bqetl mozfun validate


# Validate selected routines
./bqetl mozfun validate hist.*

publish

Publish mozfun routines. This command is used by Airflow only.

Usage

$ ./bqetl mozfun publish [OPTIONS] [name]

Options:

--project_id: GCP project ID
--dependency_dir: The directory where JavaScript dependency files for UDFs are stored.
--gcs_bucket: The GCS bucket where dependency files are uploaded to.
--gcs_path: The GCS path in the bucket where dependency files are uploaded to.
--dry_run: Dry run publishing UDFs.
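Although this command is normally run by Airflow, it can be dry run locally; a sketch:

```shell
# Dry run publishing all mozfun routines (no changes are made)
./bqetl mozfun publish --dry_run
```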

rename

Rename mozfun routine or mozfun routine dataset. Replaces all usages in queries with the new name.

Usage

$ ./bqetl mozfun rename [OPTIONS] [name] [new_name]

Options:

--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID

Examples

# Rename routine
./bqetl mozfun rename hist.extract hist.ext


# Rename routine matching a specific pattern
./bqetl mozfun rename *.array_* *.list_*


# Rename routine dataset
./bqetl mozfun rename hist.* histogram.*

backfill

Commands for managing backfills.

create

Create a new backfill entry in the backfill.yaml file. A backfill.yaml file is created if it does not already exist.

Usage

$ ./bqetl backfill create [OPTIONS] [qualified_table_name]

Options:

--sql_dir: Path to directory which contains queries.
--start_date: First date to be backfilled. Date format: yyyy-mm-dd
--end_date: Last date to be backfilled. Date format: yyyy-mm-dd
--exclude: Dates excluded from backfill. Date format: yyyy-mm-dd
--watcher: Watcher of the backfill (email address)

Examples

./bqetl backfill create moz-fx-data-shared-prod.telemetry_derived.deviations_v1 \
  --start_date=2021-03-01 \
  --end_date=2021-03-31 \
  --exclude=2021-03-03

validate

Validate backfill.yaml file format and content.

Usage

$ ./bqetl backfill validate [OPTIONS] [qualified_table_name]

Options:

--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID

Examples

./bqetl backfill validate moz-fx-data-shared-prod.telemetry_derived.clients_daily_v6


# Validate all backfill.yaml files if no table is specified.
# Use the `--project_id` option to change the project to be validated;
# default is `moz-fx-data-shared-prod`.
./bqetl backfill validate

info

Get backfill(s) information from all or specific table(s).

Usage

$ ./bqetl backfill info [OPTIONS] [qualified_table_name]

Options:

--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID
--status: Filter backfills with this status.

Examples

# Get info for specific table.
./bqetl backfill info moz-fx-data-shared-prod.telemetry_derived.clients_daily_v6


# Get info for all tables.
./bqetl backfill info


# Get info from all tables with specific status.
./bqetl backfill info --status=Initiate

scheduled

Get information on backfill(s) that require processing.

Usage

$ ./bqetl backfill scheduled [OPTIONS] [qualified_table_name]

Options:

--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID
--status: Whether to get backfills to process or to complete.
--json_path: None

Examples

# Get info for specific table.
./bqetl backfill scheduled moz-fx-data-shared-prod.telemetry_derived.clients_daily_v6


# Get info for all tables.
./bqetl backfill scheduled

initiate

Process entries in backfill.yaml with Initiate status that have not yet been processed.

Usage

$ ./bqetl backfill initiate [OPTIONS] [qualified_table_name]

Options:

--parallelism: Maximum number of queries to execute concurrently
--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID

Examples

# Initiate backfill entry for specific table
./bqetl backfill initiate moz-fx-data-shared-prod.telemetry_derived.clients_daily_v6

Use the `--project_id` option to change the project;
default project_id is `moz-fx-data-shared-prod`.

complete

Complete entries in backfill.yaml with Complete status that have not yet been processed.

Usage

$ ./bqetl backfill complete [OPTIONS] [qualified_table_name]

Options:

--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID

Examples

# Complete backfill entry for specific table
./bqetl backfill complete moz-fx-data-shared-prod.telemetry_derived.clients_daily_v6

Use the `--project_id` option to change the project;
default project_id is `moz-fx-data-shared-prod`.