Skip to content

BigQuery ETL

Udf js

mozilla/bigquery-etl

Udf js

bootstrap_percentile_ci (UDF)

Calculate a confidence interval using an efficient bootstrap sampling technique for a given percentile of a histogram. This implementation relies on the stdlib.js library and the binomial quantile function (https://github.com/stdlib-js/stats-base-dists-binomial-quantile/) for randomly sampling from a binomial distribution.

Parameters

INPUTS

percentiles ARRAY<INT64>, histogram STRUCT<values ARRAY<STRUCT<key FLOAT64, value FLOAT64>>>, metric STRING

OUTPUTS

ARRAY<STRUCT<metric STRING, statistic STRING, point FLOAT64, lower FLOAT64, upper FLOAT64, parameter STRING>>DETERMINISTIC

crc32 (UDF)

Calculate the CRC-32 hash of an input string. The implementation here could be optimized. In particular, it calculates a lookup table on every invocation which could be cached and reused. In practice, though, this implementation appears to be fast enough that further optimization is not yet warranted. Based on https://stackoverflow.com/a/18639999/1260237 See https://en.wikipedia.org/wiki/Cyclic_redundancy_check

Parameters

INPUTS

data STRING

OUTPUTS

INT64 DETERMINISTIC

decode_uri_attribution (UDF)

URL decodes the raw firefox_installer.install.attribution string to a STRUCT. The fields campaign, content, dlsource, dltoken, experiment, medium, source, ua, variation the string are extracted. If any value is (not+set) it is converted to (not set) to match the text from GA when the fields are not set.

Parameters

INPUTS

attribution STRING

OUTPUTS

STRUCT<campaign STRING, content STRING, dlsource STRING, dltoken STRING, experiment STRING, medium STRING, source STRING, ua STRING, variation STRING>DETERMINISTIC

extract_string_from_bytes (UDF)

Related to https://mozilla-hub.atlassian.net/browse/RS-682. The function extracts string data from payload which is in bytes.

Parameters

INPUTS

payload BYTES

OUTPUTS

STRING

gunzip (UDF)

Unzips a GZIP string. This implementation relies on the zlib.js library (https://github.com/imaya/zlib.js) and the atob function for decoding base64.

Parameters

INPUTS

input BYTES

OUTPUTS

STRING DETERMINISTIC

jackknife_mean_ci (UDF)

Calculates a confidence interval using a jackknife resampling technique for the mean of an array of values for various buckets; see https://en.wikipedia.org/wiki/Jackknife_resampling Users must specify the number of expected buckets as the first parameter to guard against the case where empty buckets lead to an array with missing elements. Usage generally involves first calculating an aggregate per bucket, then aggregating over buckets, passing ARRAY_AGG(metric) to this function.

Parameters

INPUTS

n_buckets INT64, values_per_bucket ARRAY<FLOAT64>

OUTPUTS

STRUCT<low FLOAT64, high FLOAT64, pm FLOAT64>DETERMINISTIC

jackknife_percentile_ci (UDF)

Calculate a confidence interval using a jackknife resampling technique for a given percentile of a histogram.

Parameters

INPUTS

percentile FLOAT64, histogram STRUCT<values ARRAY<STRUCT<key FLOAT64, value FLOAT64>>>

OUTPUTS

STRUCT<low FLOAT64, high FLOAT64, percentile FLOAT64>DETERMINISTIC

jackknife_ratio_ci (UDF)

Calculates a confidence interval using a jackknife resampling technique for the weighted mean of an array of ratios for various buckets; see https://en.wikipedia.org/wiki/Jackknife_resampling Users must specify the number of expected buckets as the first parameter to guard against the case where empty buckets lead to an array with missing elements. Usage generally involves first calculating an aggregate per bucket, then aggregating over buckets, passing ARRAY_AGG(metric) to this function. Example: WITH bucketed AS ( SELECT submission_date, SUM(active_days_in_week) AS active_days_in_week, SUM(wau) AS wau FROM mytable GROUP BY submission_date, bucket_id ) SELECT submission_date, udf_js.jackknife_ratio_ci(20, ARRAY_AGG(STRUCT(CAST(active_days_in_week AS float64), CAST(wau as FLOAT64)))) AS intensity FROM bucketed GROUP BY submission_date

Parameters

INPUTS

n_buckets INT64, values_per_bucket ARRAY<STRUCT<numerator FLOAT64, denominator FLOAT64>>

OUTPUTS

intensity FROM bucketed GROUP BY submission_date */ CREATE OR REPLACE FUNCTION udf_js.jackknife_ratio_ci( n_buckets INT64, values_per_bucket ARRAY<STRUCT<numerator FLOAT64, denominator FLOAT64>>

jackknife_sum_ci (UDF)

Calculates a confidence interval using a jackknife resampling technique for the sum of an array of counts for various buckets; see https://en.wikipedia.org/wiki/Jackknife_resampling Users must specify the number of expected buckets as the first parameter to guard against the case where empty buckets lead to an array with missing elements. Usage generally involves first calculating an aggregate count per bucket, then aggregating over buckets, passing ARRAY_AGG(metric) to this function. Example: WITH bucketed AS ( SELECT submission_date, SUM(dau) AS dau_sum FROM mytable GROUP BY submission_date, bucket_id ) SELECT submission_date, udf_js.jackknife_sum_ci(ARRAY_AGG(dau_sum)).* FROM bucketed GROUP BY submission_date

Parameters

INPUTS

n_buckets INT64, counts_per_bucket ARRAY<INT64>

OUTPUTS

STRUCT<total INT64, low INT64, high INT64, pm INT64>DETERMINISTIC

json_extract_events (UDF)

Parameters

INPUTS

input STRING

OUTPUTS

ARRAY<STRUCT<event_process STRING, event_timestamp INT64, event_category STRING, event_object STRING, event_method STRING, event_string_value STRING, event_map_values ARRAY<STRUCT<key STRING, value STRING>>>>DETERMINISTIC

json_extract_histogram (UDF)

Returns a parsed struct from a JSON string representing a histogram. This implementation uses JavaScript and is provided for performance comparison; see udf/udf_json_extract_histogram for a pure SQL implementation that will likely be more usable in practice.

Parameters

INPUTS

input STRING

OUTPUTS

STRUCT<bucket_count INT64, histogram_type INT64, `sum` INT64, `range` ARRAY<INT64>, `values` ARRAY<STRUCT<key INT64, value INT64>>>DETERMINISTIC

json_extract_keyed_histogram (UDF)

Returns an array of parsed structs from a JSON string representing a keyed histogram. This is likely only useful for histograms that weren't properly parsed to fields, so ended up embedded in an additional_properties JSON blob. Normally, keyed histograms will be modeled as a key/value struct where the values are JSON representations of single histograms. There is no pure SQL equivalent to this function, since BigQuery does not provide any functions for listing or iterating over keysn in a JSON map.

Parameters

INPUTS

input STRING

OUTPUTS

ARRAY<STRUCT<key STRING, bucket_count INT64, histogram_type INT64, `sum` INT64, `range` ARRAY<INT64>, `values` ARRAY<STRUCT<key INT64, value INT64>>>>DETERMINISTIC

json_extract_missing_cols (UDF)

Extract missing columns from additional properties. More generally, get a list of nodes from a JSON blob. Array elements are indicated as [...]. param input: The JSON blob to explode param indicates_node: An array of strings. If a key's value is an object, and contains one of these values, that key is returned as a node. param known_nodes: An array of strings. If a key is in this array, it is returned as a node. Notes: - Use indicates_node for things like histograms. For example ['histogram_type'] will ensure that each histogram will be returned as a missing node, rather than the subvalues within the histogram (e.g. values, sum, etc.) - Use known_nodes if you're aware of a missing section, like ['simpleMeasurements'] See here for an example usage https://sql.telemetry.mozilla.org/queries/64460/source

Parameters

INPUTS

input STRING, indicates_node ARRAY<STRING>, known_nodes ARRAY<STRING>

OUTPUTS

ARRAY<STRING>DETERMINISTIC

main_summary_active_addons (UDF)

Add fields from additional_attributes to active_addons in main pings. Return an array instead of a "map" for backwards compatibility. The INT64 columns from BigQuery may be passed as strings, so parseInt before returning them if they will be coerced to BOOL. The fields from additional_attributes due to union types: integer or boolean for foreignInstall and userDisabled; string or number for version. https://github.com/mozilla/telemetry-batch-view/blob/ea0733c00df191501b39d2c4e2ece3fe703a0ef3/src/main/scala/com/mozilla/telemetry/views/MainSummaryView.scala#L422-L449

Parameters

INPUTS

active_addons ARRAY<STRUCT<key STRING, value STRUCT<app_disabled BOOL, blocklisted BOOL, description STRING, foreign_install INT64, has_binary_components BOOL, install_day INT64, is_system BOOL, is_web_extension BOOL, multiprocess_compatible BOOL, name STRING, scope INT64, signed_state INT64, type STRING, update_day INT64, user_disabled INT64, version STRING>>>, active_addons_json STRING

OUTPUTS

ARRAY<STRUCT<addon_id STRING, blocklisted BOOL, name STRING, user_disabled BOOL, app_disabled BOOL, version STRING, scope INT64, type STRING, foreign_install BOOL, has_binary_components BOOL, install_day INT64, update_day INT64, signed_state INT64, is_system BOOL, is_web_extension BOOL, multiprocess_compatible BOOL>>DETERMINISTIC

main_summary_addon_scalars (UDF)

Parse scalars from payload.processes.dynamic into map columns for each value type. https://github.com/mozilla/telemetry-batch-view/blob/ea0733c00df191501b39d2c4e2ece3fe703a0ef3/src/main/scala/com/mozilla/telemetry/utils/MainPing.scala#L385-L399

Parameters

INPUTS

dynamic_scalars_json STRING, dynamic_keyed_scalars_json STRING

OUTPUTS

STRUCT<keyed_boolean_addon_scalars ARRAY<STRUCT<key STRING, value ARRAY<STRUCT<key STRING, value BOOL>>>>, keyed_uint_addon_scalars ARRAY<STRUCT<key STRING, value ARRAY<STRUCT<key STRING, value INT64>>>>, string_addon_scalars ARRAY<STRUCT<key STRING, value STRING>>, keyed_string_addon_scalars ARRAY<STRUCT<key STRING, value ARRAY<STRUCT<key STRING, value STRING>>>>, uint_addon_scalars ARRAY<STRUCT<key STRING, value INT64>>, boolean_addon_scalars ARRAY<STRUCT<key STRING, value BOOL>>>DETERMINISTIC

main_summary_disabled_addons (UDF)

Report the ids of the addons which are in the addonDetails but not in the activeAddons. They are the disabled addons (possibly because they are legacy). We need this as addonDetails may contain both disabled and active addons. https://github.com/mozilla/telemetry-batch-view/blob/ea0733c00df191501b39d2c4e2ece3fe703a0ef3/src/main/scala/com/mozilla/telemetry/views/MainSummaryView.scala#L451-L464

Parameters

INPUTS

active_addon_ids ARRAY<STRING>, addon_details_json STRING

OUTPUTS

ARRAY<STRING>DETERMINISTIC

parse_sponsored_interaction (UDF)

Related to https://mozilla-hub.atlassian.net/browse/RS-682. The function parses the sponsored interaction column from payload_error_bytes.contextual_services table.

Parameters

INPUTS

params STRING

OUTPUTS

STRUCT<`source` STRING, formFactor STRING, scenario STRING, interactionType STRING, contextId STRING, reportingUrl STRING, requestId STRING, submissionTimestamp TIMESTAMP, parsedReportingUrl JSON, originalDocType STRING, originalNamespace STRING, interactionCount INTEGER, flaggedFraud BOOLEAN>

sample_id (UDF)

Stably hash a client_id to an integer between 0 and 99. This function is technically defined in SQL, but it calls a JS UDF implementation of a CRC-32 hash, so we defined it here to make it clear that its performance may be limited by BigQuery's JavaScript UDF environment.

Parameters

INPUTS

client_id STRING

OUTPUTS

INT64