bqetl CLI
The bqetl
command-line tool aims to simplify working with the bigquery-etl repository by supporting common workflows, such as creating, validating and scheduling queries or adding new UDFs.
Running some commands, for example to create or query tables, will require Mozilla GCP access.
Installation
Follow the Quick Start to set up bigquery-etl and the bqetl CLI.
Configuration
bqetl
can be configured via the bqetl_project.yaml
file. See Configuration to find available configuration options.
Commands
To list all available commands in the bqetl CLI:
$ ./bqetl
Usage: bqetl [OPTIONS] COMMAND [ARGS]...
CLI tools for working with bigquery-etl.
Options:
--version Show the version and exit.
--help Show this message and exit.
Commands:
alchemer Commands for importing alchemer data.
dag Commands for managing DAGs.
dependency Build and use query dependency graphs.
dryrun Dry run SQL.
format Format SQL.
glam Tools for GLAM ETL.
mozfun Commands for managing mozfun routines.
query Commands for managing queries.
routine Commands for managing routines.
stripe Commands for Stripe ETL.
view Commands for managing views.
backfill Commands for managing backfills.
See help for any command:
$ ./bqetl [command] --help
query
Commands for managing queries.
create
Create a new query with name
--project_id
option to change the project the query is added to;
default is moz-fx-data-shared-prod
. Views are automatically generated
in the publicly facing dataset.
Usage
$ ./bqetl query create [OPTIONS] [name]
Options:
--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID
--owner: Owner of the query (email address)
--dag: Name of the DAG the query should be scheduled under.If there is no DAG name specified, the query isscheduled by default in DAG bqetl_default.To skip the automated scheduling use --no_schedule.To see available DAGs run `bqetl dag info`.To create a new DAG run `bqetl dag create`.
--no_schedule: Using this option creates the query without scheduling information. Use `bqetl query schedule` to add it manually if required.
Examples
./bqetl query create telemetry_derived.deviations_v1 \
--owner=example@mozilla.com
# The query version gets autocompleted to v1. Queries are created in the
# _derived dataset and accompanying views in the public dataset.
./bqetl query create telemetry.deviations --owner=example@mozilla.com
schedule
Schedule an existing query
Usage
$ ./bqetl query schedule [OPTIONS] [name]
Options:
--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID
--dag: Name of the DAG the query should be scheduled under. To see available DAGs run `bqetl dag info`. To create a new DAG run `bqetl dag create`.
--depends_on_past: Only execute query if previous scheduled run succeeded.
--task_name: Custom name for the Airflow task. By default the task name is a combination of the dataset and table name.
Examples
./bqetl query schedule telemetry_derived.deviations_v1 \
--dag=bqetl_deviations
# Set a specific name for the task
./bqetl query schedule telemetry_derived.deviations_v1 \
--dag=bqetl_deviations \
--task-name=deviations
info
Get information about all or specific queries.
Usage
$ ./bqetl query info [OPTIONS] [name]
Options:
--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID
Examples
# Get info for specific queries
./bqetl query info telemetry_derived.*
# Get cost and last update timestamp information
./bqetl query info telemetry_derived.clients_daily_v6 \
--cost --last_updated
backfill
Run a backfill for a query. Additional parameters will get passed to bq.
Usage
$ ./bqetl query backfill [OPTIONS] [name]
Options:
--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID
--billing_project: GCP project ID to run the query in. This can be used to run a query using a different slot reservation than the one used by the query's default project.
--start_date: First date to be backfilled
--end_date: Last date to be backfilled
--exclude: Dates excluded from backfill. Date format: yyyy-mm-dd
--dry_run: Dry run the backfill
--max_rows: How many rows to return in the result
--parallelism: How many threads to run backfill in parallel
--destination_table: Destination table name results are written to. If not set, determines destination table based on query.
--checks: Whether to run checks during backfill
--custom_query_path: Name of a custom query to run the backfill. If not given, the proces runs as usual.
--checks_file_name: Name of a custom data checks file to run after each partition backfill. E.g. custom_checks.sql. Optional.
--scheduling_overrides: Pass overrides as a JSON string for scheduling sections: parameters and/or date_partition_parameter as needed.
Examples
# Backfill for specific date range
# second comment line
./bqetl query backfill telemetry_derived.ssl_ratios_v1 \
--start_date=2021-03-01 \
--end_date=2021-03-31
# Dryrun backfill for specific date range and exclude date
./bqetl query backfill telemetry_derived.ssl_ratios_v1 \
--start_date=2021-03-01 \
--end_date=2021-03-31 \
--exclude=2021-03-03 \
--dry_run
run
Run a query. Additional parameters will get passed to bq.
If a destination_table is set, the query result will be written to BigQuery. Without a destination_table specified, the results are not stored.
If the name
is not found within the sql/
folder bqetl assumes it hasn't been generated yet
and will start the generating process for all sql_generators/
files.
This generation process will take some time and run dryrun calls against BigQuery but this is expected.
Additional parameters (all parameters that are not specified in the Options) must come after the query-name.
Otherwise the first parameter that is not an option is interpreted as the query-name and since it can't be found the generation process will start.
Usage
$ ./bqetl query run [OPTIONS] [name]
Options:
--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID
--billing_project: GCP project ID to run the query in. This can be used to run a query using a different slot reservation than the one used by the query's default project.
--public_project_id: Project with publicly accessible data
--destination_table: Destination table name results are written to. If not set, the query result will not be written to BigQuery.
--dataset_id: Destination dataset results are written to. If not set, determines destination dataset based on query.
Examples
# Run a query by name
./bqetl query run telemetry_derived.ssl_ratios_v1
# Run a query file
./bqetl query run /path/to/query.sql
# Run a query and save the result to BigQuery
./bqetl query run telemetry_derived.ssl_ratios_v1 --project_id=moz-fx-data-shared-prod --dataset_id=telemetry_derived --destination_table=ssl_ratios_v1
run-multipart
Run a multipart query.
Usage
$ ./bqetl query run-multipart [OPTIONS] [query_dir]
Options:
--using: comma separated list of join columns to use when combining results
--parallelism: Maximum number of queries to execute concurrently
--dataset_id: Default dataset, if not specified all tables must be qualified with dataset
--project_id: GCP project ID
--temp_dataset: Dataset where intermediate query results will be temporarily stored, formatted as PROJECT_ID.DATASET_ID
--destination_table: table where combined results will be written
--time_partitioning_field: time partition field on the destination table
--clustering_fields: comma separated list of clustering fields on the destination table
--dry_run: Print bytes that would be processed for each part and don't run queries
--parameters: query parameter(s) to pass when running parts
--priority: Priority for BigQuery query jobs; BATCH priority will significantly slow down queries if reserved slots are not enabled for the billing project; defaults to INTERACTIVE
--schema_update_options: Optional options for updating the schema.
Examples
# Run a multipart query
./bqetl query run_multipart /path/to/query.sql
validate
Validate a query. Checks formatting, scheduling information and dry runs the query.
Usage
$ ./bqetl query validate [OPTIONS] [name]
Options:
--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID
--use_cloud_function: Use the Cloud Function for dry running SQL, if set to `True`. The Cloud Function can only access tables in shared-prod. If set to `False`, use active GCP credentials for the dry run.
--validate_schemas: Require dry run schema to match destination table and file if present.
--respect_dryrun_skip: Respect or ignore dry run skip configuration. Default is --ignore-dryrun-skip.
--no_dryrun: Skip running dryrun. Default is False.
Examples
./bqetl query validate telemetry_derived.clients_daily_v6
# Validate query not in shared-prod
./bqetl query validate \
--use_cloud_function=false \
--project_id=moz-fx-data-marketing-prod \
ga_derived.blogs_goals_v1
initialize
Run a full backfill on the destination table for the query.
Using this command will:
- Create the table if it doesn't exist and run a full backfill.
- Run a full backfill if the table exists and is empty.
- Raise an exception if the table exists and has data, or if the table exists and the schema doesn't match the query.
It supports query.sql
files that use the is_init() pattern.
To run in parallel per sample_id, include a @sample_id parameter in the query.
Usage
$ ./bqetl query initialize [OPTIONS] [name]
Options:
--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID
--billing_project: GCP project ID to run the query in. This can be used to run a query using a different slot reservation than the one used by the query's default project.
--dry_run: Dry run the initialization
--parallelism: Number of threads for parallel processing
--skip_existing: Skip initialization for existing artifacts. This ensures that artifacts, like materialized views only get initialized if they don't already exist.
--force: Run the initialization even if the destination table contains data.
Examples
Examples:
- For init.sql files: ./bqetl query initialize telemetry_derived.ssl_ratios_v1
- For query.sql files and parallel run: ./bqetl query initialize sql/moz-fx-data-shared-prod/telemetry_derived/clients_first_seen_v2/query.sql
render
Render a query Jinja template.
Usage
$ ./bqetl query render [OPTIONS] [name]
Options:
--sql_dir: Path to directory which contains queries.
--output_dir: Output directory generated SQL is written to. If not specified, rendered queries are printed to console.
--parallelism: Number of threads for parallel processing
Examples
./bqetl query render telemetry_derived.ssl_ratios_v1 \
--output-dir=/tmp
schema
Commands for managing query schemas.
update
Update the query schema based on the destination table schema and the query schema. If no schema.yaml file exists for a query, one will be created.
Usage
$ ./bqetl query schema update [OPTIONS] [name]
Options:
--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID
--update_downstream: Update downstream dependencies. GCP authentication required.
--tmp_dataset: GCP datasets for creating updated tables temporarily.
--use_cloud_function: Use the Cloud Function for dry running SQL, if set to `True`. The Cloud Function can only access tables in shared-prod. If set to `False`, use active GCP credentials for the dry run.
--respect_dryrun_skip: Respect or ignore dry run skip configuration. Default is --respect-dryrun-skip.
--parallelism: Number of threads for parallel processing
--is_init: Indicates whether the `is_init()` condition should be set to true of false.
Examples
./bqetl query schema update telemetry_derived.clients_daily_v6
# Update schema including downstream dependencies (requires GCP)
./bqetl query schema update telemetry_derived.clients_daily_v6 --update-downstream
deploy
Deploy the query schema.
Usage
$ ./bqetl query schema deploy [OPTIONS] [name]
Options:
--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID
--force: Deploy the schema file without validating that it matches the query
--use_cloud_function: Use the Cloud Function for dry running SQL, if set to `True`. The Cloud Function can only access tables in shared-prod. If set to `False`, use active GCP credentials for the dry run.
--respect_dryrun_skip: Respect or ignore dry run skip configuration. Default is --respect-dryrun-skip.
--skip_existing: Skip updating existing tables. This option ensures that only new tables get deployed.
--skip_external_data: Skip publishing external data, such as Google Sheets.
--destination_table: Destination table name results are written to. If not set, determines destination table based on query. Must be fully qualified (project.dataset.table).
--parallelism: Number of threads for parallel processing
Examples
./bqetl query schema deploy telemetry_derived.clients_daily_v6
validate
Validate the query schema
Usage
$ ./bqetl query schema validate [OPTIONS] [name]
Options:
--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID
--use_cloud_function: Use the Cloud Function for dry running SQL, if set to `True`. The Cloud Function can only access tables in shared-prod. If set to `False`, use active GCP credentials for the dry run.
--respect_dryrun_skip: Respect or ignore dry run skip configuration. Default is --respect-dryrun-skip.
Examples
./bqetl query schema validate telemetry_derived.clients_daily_v6
dag
Commands for managing DAGs.
info
Get information about available DAGs.
Usage
$ ./bqetl dag info [OPTIONS] [name]
Options:
--dags_config: Path to dags.yaml config file
--sql_dir: Path to directory which contains queries.
--with_tasks: Include scheduled tasks
Examples
# Get information about all available DAGs
./bqetl dag info
# Get information about a specific DAG
./bqetl dag info bqetl_ssl_ratios
# Get information about a specific DAG including scheduled tasks
./bqetl dag info --with_tasks bqetl_ssl_ratios
create
Create a new DAG with name bqetl_bqetl_
prefix.
Created DAGs are added to the dags.yaml
file.
Usage
$ ./bqetl dag create [OPTIONS] [name]
Options:
--dags_config: Path to dags.yaml config file
--schedule_interval: Schedule interval of the new DAG. Schedule intervals can be either in CRON format or one of: once, hourly, daily, weekly, monthly, yearly or a timedelta []d[]h[]m
--owner: Email address of the DAG owner
--description: Description for DAG
--tag: Tag to apply to the DAG
--start_date: First date for which scheduled queries should be executed
--email: Email addresses that Airflow will send alerts to
--retries: Number of retries Airflow will attempt in case of failures
--retry_delay: Time period Airflow will wait after failures before running failed tasks again
Examples
./bqetl dag create bqetl_core \
--schedule-interval="0 2 * * *" \
--owner=example@mozilla.com \
--description="Tables derived from `core` pings sent by mobile applications." \
--tag=impact/tier_1 \
--start-date=2019-07-25
# Create DAG and overwrite default settings
./bqetl dag create bqetl_ssl_ratios --schedule-interval="0 2 * * *" \
--owner=example@mozilla.com \
--description="The DAG schedules SSL ratios queries." \
--tag=impact/tier_1 \
--start-date=2019-07-20 \
--email=example2@mozilla.com \
--email=example3@mozilla.com \
--retries=2 \
--retry_delay=30m
generate
Generate Airflow DAGs from DAG definitions.
Usage
$ ./bqetl dag generate [OPTIONS] [name]
Options:
--dags_config: Path to dags.yaml config file
--sql_dir: Path to directory which contains queries.
--output_dir: Path directory with generated DAGs
Examples
# Generate all DAGs
./bqetl dag generate
# Generate a specific DAG
./bqetl dag generate bqetl_ssl_ratios
remove
Remove a DAG. This will also remove the scheduling information from the queries that were scheduled as part of the DAG.
Usage
$ ./bqetl dag remove [OPTIONS] [name]
Options:
--dags_config: Path to dags.yaml config file
--sql_dir: Path to directory which contains queries.
--output_dir: Path directory with generated DAGs
Examples
# Remove a specific DAG
./bqetl dag remove bqetl_vrbrowser
dependency
Build and use query dependency graphs.
show
Show table references in sql files.
Usage
$ ./bqetl dependency show [OPTIONS] [paths]
record
Record table references in metadata. Fails if metadata already contains references section.
Usage
$ ./bqetl dependency record [OPTIONS] [paths]
dryrun
Dry run SQL.
Uses the dryrun Cloud Function by default which only has access to shared-prod.
To dryrun queries accessing tables in another project use set
--use-cloud-function=false
and ensure that the command line has access to a
GCP service account.
Usage
$ ./bqetl dryrun [OPTIONS] [paths]
Options:
--use_cloud_function: Use the Cloud Function for dry running SQL, if set to `True`. The Cloud Function can only access tables in shared-prod. If set to `False`, use active GCP credentials for the dry run.
--validate_schemas: Require dry run schema to match destination table and file if present.
--respect_skip: Respect or ignore query skip configuration. Default is --respect-skip.
--project: GCP project to perform dry run in when --use_cloud_function=False
Examples
Examples:
./bqetl dryrun sql/moz-fx-data-shared-prod/telemetry_derived/
# Dry run SQL with tables that are not in shared prod
./bqetl dryrun --use-cloud-function=false sql/moz-fx-data-marketing-prod/
format
Format SQL files.
Usage
$ ./bqetl format [OPTIONS] [paths]
Options:
--check: do not write changes, just return status; return code 0 indicates nothing would change; return code 1 indicates some files would be reformatted
--parallelism: Number of threads for parallel processing
Examples
# Format a specific file
./bqetl format sql/moz-fx-data-shared-prod/telemetry/core/view.sql
# Format all SQL files in `sql/`
./bqetl format sql
# Format standard in (will write to standard out)
echo 'SELECT 1,2,3' | ./bqetl format
routine
Commands for managing routines for internal use.
create
Create a new routine. Specify whether the routine is a UDF or stored procedure by adding a --udf or --stored_prodecure flag.
Usage
$ ./bqetl routine create [OPTIONS] [name]
Options:
--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID
--udf: Create a new UDF
--stored_procedure: Create a new stored procedure
Examples
# Create a UDF
./bqetl routine create --udf udf.array_slice
# Create a stored procedure
./bqetl routine create --stored_procedure udf.events_daily
# Create a UDF in a project other than shared-prod
./bqetl routine create --udf udf.active_last_week --project=moz-fx-data-marketing-prod
info
Get routine information.
Usage
$ ./bqetl routine info [OPTIONS] [name]
Options:
--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID
--usages: Show routine usages
Examples
# Get information about all internal routines in a specific dataset
./bqetl routine info udf.*
# Get usage information of specific routine
./bqetl routine info --usages udf.get_key
validate
Validate formatting of routines and run tests.
Usage
$ ./bqetl routine validate [OPTIONS] [name]
Options:
--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID
--docs_only: Only validate docs.
Examples
# Validate all routines
./bqetl routine validate
# Validate selected routines
./bqetl routine validate udf.*
publish
Publish routines to BigQuery. Requires service account access.
Usage
$ ./bqetl routine publish [OPTIONS] [name]
Options:
--project_id: GCP project ID
--dependency_dir: The directory JavaScript dependency files for UDFs are stored.
--gcs_bucket: The GCS bucket where dependency files are uploaded to.
--gcs_path: The GCS path in the bucket where dependency files are uploaded to.
--dry_run: Dry run publishing udfs.
Examples
# Publish all routines
./bqetl routine publish
# Publish selected routines
./bqetl routine validate udf.*
rename
Rename routine or routine dataset. Replaces all usages in queries with the new name.
Usage
$ ./bqetl routine rename [OPTIONS] [name] [new_name]
Options:
--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID
Examples
# Rename routine
./bqetl routine rename udf.array_slice udf.list_slice
# Rename routine matching a specific pattern
./bqetl routine rename udf.array_* udf.list_*
mozfun
Commands for managing public mozfun routines.
create
Create a new mozfun routine. Specify whether the routine is a UDF or
stored procedure by adding a --udf or --stored_prodecure flag. UDFs
are added to the mozfun
project.
Usage
$ ./bqetl mozfun create [OPTIONS] [name]
Options:
--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID
--udf: Create a new UDF
--stored_procedure: Create a new stored procedure
Examples
# Create a UDF
./bqetl mozfun create --udf bytes.zero_right
# Create a stored procedure
./bqetl mozfun create --stored_procedure event_analysis.events_daily
info
Get mozfun routine information.
Usage
$ ./bqetl mozfun info [OPTIONS] [name]
Options:
--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID
--usages: Show routine usages
Examples
# Get information about all internal routines in a specific dataset
./bqetl mozfun info hist.*
# Get usage information of specific routine
./bqetl mozfun info --usages hist.mean
validate
Validate formatting of mozfun routines and run tests.
Usage
$ ./bqetl mozfun validate [OPTIONS] [name]
Options:
--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID
--docs_only: Only validate docs.
Examples
# Validate all routines
./bqetl mozfun validate
# Validate selected routines
./bqetl mozfun validate hist.*
publish
Publish mozfun routines. This command is used by Airflow only.
Usage
$ ./bqetl mozfun publish [OPTIONS] [name]
Options:
--project_id: GCP project ID
--dependency_dir: The directory JavaScript dependency files for UDFs are stored.
--gcs_bucket: The GCS bucket where dependency files are uploaded to.
--gcs_path: The GCS path in the bucket where dependency files are uploaded to.
--dry_run: Dry run publishing udfs.
rename
Rename mozfun routine or mozfun routine dataset. Replaces all usages in queries with the new name.
Usage
$ ./bqetl mozfun rename [OPTIONS] [name] [new_name]
Options:
--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID
Examples
# Rename routine
./bqetl mozfun rename hist.extract hist.ext
# Rename routine matching a specific pattern
./bqetl mozfun rename *.array_* *.list_*
# Rename routine dataset
./bqetl mozfun rename hist.* histogram.*
backfill
Commands for managing backfills.
create
Create a new backfill entry in the backfill.yaml file. Create a backfill.yaml file if it does not already exist.
Usage
$ ./bqetl backfill create [OPTIONS] [qualified_table_name]
Options:
--sql_dir: Path to directory which contains queries.
--start_date: First date to be backfilled. Date format: yyyy-mm-dd
--end_date: Last date to be backfilled. Date format: yyyy-mm-dd
--exclude: Dates excluded from backfill. Date format: yyyy-mm-dd
--watcher: Watcher of the backfill (email address)
--custom_query_path: Path of the custom query to run the backfill. Optional.
--shredder_mitigation: Wether to run a backfill using an auto-generated query that mitigates shredder effect.
--billing_project: GCP project ID to run the query in. This can be used to run a query using a different slot reservation than the one used by the query's default project.
Examples
./bqetl backfill create moz-fx-data-shared-prod.telemetry_derived.deviations_v1 \
--start_date=2021-03-01 \
--end_date=2021-03-31 \
--exclude=2021-03-03 \
validate
Validate backfill.yaml file format and content.
Usage
$ ./bqetl backfill validate [OPTIONS] [qualified_table_name]
Options:
--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID
Examples
./bqetl backfill validate moz-fx-data-shared-prod.telemetry_derived.clients_daily_v6
# validate all backfill.yaml files if table is not specified
Use the `--project_id` option to change the project to be validated;
default is `moz-fx-data-shared-prod`.
./bqetl backfill validate
info
Get backfill(s) information from all or specific table(s).
Usage
$ ./bqetl backfill info [OPTIONS] [qualified_table_name]
Options:
--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID
--status: Filter backfills with this status.
Examples
# Get info for specific table.
./bqetl backfill info moz-fx-data-shared-prod.telemetry_derived.clients_daily_v6
# Get info for all tables.
./bqetl backfill info
# Get info from all tables with specific status.
./bqetl backfill info --status=Initiate
scheduled
Get information on backfill(s) that require processing.
Usage
$ ./bqetl backfill scheduled [OPTIONS] [qualified_table_name]
Options:
--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID
--status: Whether to get backfills to process or to complete.
--json_path: None
Examples
# Get info for specific table.
./bqetl backfill scheduled moz-fx-data-shared-prod.telemetry_derived.clients_daily_v6
# Get info for all tables.
./bqetl backfill scheduled
initiate
Process entry in backfill.yaml with Initiate status that has not yet been processed.
Usage
$ ./bqetl backfill initiate [OPTIONS] [qualified_table_name]
Options:
--parallelism: Maximum number of queries to execute concurrently
--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID
Examples
# Initiate backfill entry for specific table
./bqetl backfill initiate moz-fx-data-shared-prod.telemetry_derived.clients_daily_v6
Use the `--project_id` option to change the project;
default project_id is `moz-fx-data-shared-prod`.
complete
Complete entry in backfill.yaml with Complete status that has not yet been processed..
Usage
$ ./bqetl backfill complete [OPTIONS] [qualified_table_name]
Options:
--sql_dir: Path to directory which contains queries.
--project_id: GCP project ID
Examples
# Complete backfill entry for specific table
./bqetl backfill complete moz-fx-data-shared-prod.telemetry_derived.clients_daily_v6
Use the `--project_id` option to change the project;
default project_id is `moz-fx-data-shared-prod`.