bqetl CLI

The bqetl command-line tool aims to simplify working with the bigquery-etl repository by supporting common workflows, such as creating, validating and scheduling queries or adding new UDFs.

Installation

Follow the Quick Start to set up bigquery-etl and the bqetl CLI.

Commands

To list all available commands in the bqetl CLI:

$ ./bqetl

Usage: bqetl [OPTIONS] COMMAND [ARGS]...

  CLI tools for working with bigquery-etl.

Options:
  --version  Show the version and exit.
  --help     Show this message and exit.

Commands:
  alchemer    Commands for importing alchemer data.
  dag         Commands for managing DAGs.
  dependency  Build and use query dependency graphs.
  dryrun      Dry run SQL.
  format      Format SQL.
  glam        Tools for GLAM ETL.
  mozfun      Commands for managing mozfun routines.
  query       Commands for managing queries.
  routine     Commands for managing routines.
  stripe      Commands for Stripe ETL.
  view        Commands for managing views.

See help for any command:

$ ./bqetl [command] --help

query

Commands for managing queries.

create

Create a new query with name <dataset>.<query_name>, for example: telemetry_derived.asn_aggregates. Use the --project_id option to change the project the query is added to; the default is moz-fx-data-shared-prod. Views are automatically generated in the publicly facing dataset.

Usage

$ ./bqetl query create [OPTIONS] [name]

Options:

--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID
--owner: Owner of the query (email address)
--init: Create an init.sql file to initialize the table

Examples

./bqetl query create telemetry_derived.deviations_v1 \
  --owner=example@mozilla.com


# The query version gets autocompleted to v1. Queries are created in the
# _derived dataset and accompanying views in the public dataset.
./bqetl query create telemetry.deviations --owner=example@mozilla.com

schedule

Schedule an existing query.

Usage

$ ./bqetl query schedule [OPTIONS] [name]

Options:

--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID
--dag: Name of the DAG the query should be scheduled under. To see available DAGs run `bqetl dag info`. To create a new DAG run `bqetl dag create`.
--depends_on_past: Only execute query if previous scheduled run succeeded.
--task_name: Custom name for the Airflow task. By default the task name is a combination of the dataset and table name.

Examples

./bqetl query schedule telemetry_derived.deviations_v1 \
  --dag=bqetl_deviations


# Set a specific name for the task
./bqetl query schedule telemetry_derived.deviations_v1 \
  --dag=bqetl_deviations \
  --task-name=deviations

info

Get information about all or specific queries.

Usage

$ ./bqetl query info [OPTIONS] [name]

Options:

--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID
--cost: Include information about query costs
--last_updated: Include timestamps when destination tables were last updated

Examples

# Get info for specific queries
./bqetl query info telemetry_derived.*


# Get cost and last update timestamp information
./bqetl query info telemetry_derived.clients_daily_v6 \
  --cost --last_updated

backfill

Run a backfill for a query. Additional parameters are passed on to bq.

Usage

$ ./bqetl query backfill [OPTIONS] [name]

Options:

--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID
--start_date: First date to be backfilled
--end_date: Last date to be backfilled
--exclude: Dates excluded from backfill. Date format: yyyy-mm-dd
--dry_run: Dry run the backfill
--max_rows: How many rows to return in the result
--parallelism: Number of threads for running the backfill in parallel

Examples

# Backfill for specific date range
./bqetl query backfill telemetry_derived.ssl_ratios_v1 \
  --start_date=2021-03-01 \
  --end_date=2021-03-31


# Dryrun backfill for specific date range and exclude date
./bqetl query backfill telemetry_derived.ssl_ratios_v1 \
  --start_date=2021-03-01 \
  --end_date=2021-03-31 \
  --exclude=2021-03-03 \
  --dry_run

validate

Validate a query. Checks formatting and scheduling information, and dry runs the query.

Usage

$ ./bqetl query validate [OPTIONS] [name]

Options:

--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID
--use_cloud_function: Use the Cloud Function for dry running SQL, if set to `True`. The Cloud Function can only access tables in shared-prod. If set to `False`, use active GCP credentials for the dry run.

Examples

./bqetl query validate telemetry_derived.clients_daily_v6


# Validate query not in shared-prod
./bqetl query validate \
  --use_cloud_function=false \
  --project_id=moz-fx-data-marketing-prod \
  ga_derived.blogs_goals_v1

initialize

Create and initialize the destination table for the query. Only for queries that have an init.sql file.

Usage

$ ./bqetl query initialize [OPTIONS] [name]

Options:

--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID
--dry_run: Dry run the initialization

Examples

./bqetl query initialize telemetry_derived.ssl_ratios_v1

schema

Commands for managing query schemas.

update

Update the query schema based on the destination table schema and the query schema. If no schema.yaml file exists for a query, one will be created.

Usage

$ ./bqetl query schema update [OPTIONS] [name]

Options:

--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID
--update_downstream: Update downstream dependencies. GCP authentication required.
--tmp_dataset: GCP dataset where updated tables are temporarily created.

Examples

./bqetl query schema update telemetry_derived.clients_daily_v6

deploy

Deploy the query schema.

Usage

$ ./bqetl query schema deploy [OPTIONS] [name]

Options:

--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID
--force: Deploy the schema file without validating that it matches the query

Examples

./bqetl query schema deploy telemetry_derived.clients_daily_v6

validate

Validate the query schema.

Usage

$ ./bqetl query schema validate [OPTIONS] [name]

Options:

--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID

Examples

./bqetl query schema validate telemetry_derived.clients_daily_v6

dag

Commands for managing DAGs.

info

Get information about available DAGs.

Usage

$ ./bqetl dag info [OPTIONS] [name]

Options:

--dags_config: Path to dags.yaml config file
--with_tasks: Include scheduled tasks

Examples

# Get information about all available DAGs
./bqetl dag info

# Get information about a specific DAG
./bqetl dag info bqetl_ssl_ratios

# Get information about a specific DAG including scheduled tasks
./bqetl dag info --with_tasks bqetl_ssl_ratios

create

Create a new DAG with name bqetl_<name>, for example: bqetl_search. When creating new DAGs, the DAG name must have a bqetl_ prefix. Created DAGs are added to the dags.yaml file.

Usage

$ ./bqetl dag create [OPTIONS] [name]

Options:

--dags_config: Path to dags.yaml config file
--schedule_interval: Schedule interval of the new DAG. Schedule intervals can be either in CRON format or one of: once, hourly, daily, weekly, monthly, yearly, or a timedelta of the form <days>d<hours>h<minutes>m
--owner: Email address of the DAG owner
--description: Description for DAG
--start_date: First date for which scheduled queries should be executed
--email: List of email addresses that Airflow will send alerts to
--retries: Number of retries Airflow will attempt in case of failures
--retry_delay: Time period Airflow will wait after failures before running failed tasks again

Examples

./bqetl dag create bqetl_core \
--schedule-interval="0 2 * * *" \
--owner=example@mozilla.com \
--description='Tables derived from `core` pings sent by mobile applications.' \
--start-date=2019-07-25


# Create DAG and overwrite default settings
./bqetl dag create bqetl_ssl_ratios --schedule-interval="0 2 * * *" \
--owner=example@mozilla.com \
--description="The DAG schedules SSL ratios queries." \
--start-date=2019-07-20 \
--email=example2@mozilla.com,example3@mozilla.com \
--retries=2 \
--retry_delay=30m

generate

Generate Airflow DAGs from DAG definitions. Requires Java.

Usage

$ ./bqetl dag generate [OPTIONS] [name]

Options:

--dags_config: Path to dags.yaml config file
--output_dir: Path to directory where generated DAGs are stored

Examples

# Generate all DAGs
./bqetl dag generate

# Generate a specific DAG
./bqetl dag generate bqetl_ssl_ratios

remove

Remove a DAG. This will also remove the scheduling information from the queries that were scheduled as part of the DAG.

Usage

$ ./bqetl dag remove [OPTIONS] [name]

Options:

--dags_config: Path to dags.yaml config file
--output_dir: Path to directory where generated DAGs are stored

Examples

# Remove a specific DAG
./bqetl dag remove bqetl_vrbrowser

dependency

Build and use query dependency graphs.

show

Show table references in SQL files. Requires Java.

Usage

$ ./bqetl dependency show [OPTIONS] [paths]
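Examples

The invocations below are illustrative; any query file or directory under sql/ can be passed as a path (the clients_daily path is an example, not a required input):

```shell
# Show table references for all queries in a dataset directory
./bqetl dependency show sql/moz-fx-data-shared-prod/telemetry_derived/

# Show table references for a single query file
./bqetl dependency show sql/moz-fx-data-shared-prod/telemetry_derived/clients_daily_v6/query.sql
```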

record

Record table references in metadata. Fails if the metadata already contains a references section. Requires Java.

Usage

$ ./bqetl dependency record [OPTIONS] [paths]
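Examples

An illustrative invocation, assuming the queries under the given directory do not yet have a references section in their metadata:

```shell
# Record table references for all queries in a dataset directory
./bqetl dependency record sql/moz-fx-data-shared-prod/telemetry_derived/
```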

dryrun

Dry run SQL. Uses the dryrun Cloud Function by default, which only has access to shared-prod. To dry run queries accessing tables in another project, set --use-cloud-function=false and ensure that the command line has access to a GCP service account.

Usage

$ ./bqetl dryrun [OPTIONS] [path]

Options:

--use_cloud_function: Use the Cloud Function for dry running SQL, if set to `True`. The Cloud Function can only access tables in shared-prod. If set to `False`, use active GCP credentials for the dry run.
--project: GCP project to perform dry run in when --use_cloud_function=False

Examples

./bqetl dryrun sql/moz-fx-data-shared-prod/telemetry_derived/

# Dry run SQL with tables that are not in shared prod
./bqetl dryrun --use-cloud-function=false sql/moz-fx-data-marketing-prod/

format

Format SQL files.

Usage

$ ./bqetl format [OPTIONS] [path]

Examples

# Format all SQL files
./bqetl format

# Format a specific file
./bqetl format sql/moz-fx-data-shared-prod/telemetry/core/view.sql

routine

Commands for managing routines for internal use.

create

Create a new routine. Specify whether the routine is a UDF or stored procedure by adding a --udf or --stored_procedure flag.

Usage

$ ./bqetl routine create [OPTIONS] [name]

Options:

--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID
--udf: Create a new UDF
--stored_procedure: Create a new stored procedure

Examples

# Create a UDF
./bqetl routine create --udf udf.array_slice


# Create a stored procedure
./bqetl routine create --stored_procedure udf.events_daily


# Create a UDF in a project other than shared-prod
./bqetl routine create --udf udf.active_last_week --project_id=moz-fx-data-marketing-prod

info

Get routine information.

Usage

$ ./bqetl routine info [OPTIONS] [name]

Options:

--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID
--usages: Show routine usages

Examples

# Get information about all internal routines in a specific dataset
./bqetl routine info udf.*


# Get usage information of specific routine
./bqetl routine info --usages udf.get_key

validate

Validate formatting of routines and run tests.

Usage

$ ./bqetl routine validate [OPTIONS] [name]

Options:

--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID

Examples

# Validate all routines
./bqetl routine validate


# Validate selected routines
./bqetl routine validate udf.*

publish

Publish routines to BigQuery. Requires service account access.

Usage

$ ./bqetl routine publish [OPTIONS] [path]

Options:

--project_id: GCP project ID
--dependency_dir: The directory where JavaScript dependency files for UDFs are stored.
--gcs_bucket: The GCS bucket where dependency files are uploaded to.
--gcs_path: The GCS path in the bucket where dependency files are uploaded to.

Examples

# Publish all routines
./bqetl routine publish

rename

Rename routine or routine dataset. Replaces all usages in queries with the new name.

Usage

$ ./bqetl routine rename [OPTIONS] [name] [new_name]

Options:

--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID

Examples

# Rename routine
./bqetl routine rename udf.array_slice udf.list_slice


# Rename routine matching a specific pattern
./bqetl routine rename udf.array_* udf.list_*

mozfun

Commands for managing public mozfun routines.

create

Create a new mozfun routine. Specify whether the routine is a UDF or stored procedure by adding a --udf or --stored_procedure flag. UDFs are added to the mozfun project.

Usage

$ ./bqetl mozfun create [OPTIONS] [name]

Options:

--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID
--udf: Create a new UDF
--stored_procedure: Create a new stored procedure

Examples

# Create a UDF
./bqetl mozfun create --udf bytes.zero_right


# Create a stored procedure
./bqetl mozfun create --stored_procedure event_analysis.events_daily

info

Get mozfun routine information.

Usage

$ ./bqetl mozfun info [OPTIONS] [name]

Options:

--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID
--usages: Show routine usages

Examples

# Get information about all internal routines in a specific dataset
./bqetl mozfun info hist.*


# Get usage information of specific routine
./bqetl mozfun info --usages hist.mean

validate

Validate formatting of mozfun routines and run tests.

Usage

$ ./bqetl mozfun validate [OPTIONS] [name]

Options:

--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID

Examples

# Validate all routines
./bqetl mozfun validate


# Validate selected routines
./bqetl mozfun validate hist.*

publish

Publish mozfun routines. This command is used by Airflow only.

Usage

$ ./bqetl mozfun publish [OPTIONS] [path]

Options:

--project_id: GCP project ID
--dependency_dir: The directory where JavaScript dependency files for UDFs are stored.
--gcs_bucket: The GCS bucket where dependency files are uploaded to.
--gcs_path: The GCS path in the bucket where dependency files are uploaded to.
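
Examples

Although this command is normally run by Airflow, a sketch of a manual invocation would look like the following (requires appropriate service account access):

```shell
# Publish all mozfun routines
./bqetl mozfun publish
```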

rename

Rename mozfun routine or mozfun routine dataset. Replaces all usages in queries with the new name.

Usage

$ ./bqetl mozfun rename [OPTIONS] [name] [new_name]

Options:

--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID

Examples

# Rename routine
./bqetl mozfun rename hist.extract hist.ext


# Rename routine matching a specific pattern
./bqetl mozfun rename *.array_* *.list_*


# Rename routine dataset
./bqetl mozfun rename hist.* histogram.*