Skip to content

norm

Functions for normalizing data.

extract_version (UDF)

Extracts numeric version data from a version string like <major>.<minor>.<patch>.

Note: Non-zero minor and patch versions will be floating point Numeric.

Usage:

SELECT
    mozfun.norm.extract_version(version_string, 'major') as major_version,
    mozfun.norm.extract_version(version_string, 'minor') as minor_version,
    mozfun.norm.extract_version(version_string, 'patch') as patch_version

Example using "96.05.01":

SELECT
    mozfun.norm.extract_version('96.05.01', 'major') as major_version, -- 96
    mozfun.norm.extract_version('96.05.01', 'minor') as minor_version, -- 5
    mozfun.norm.extract_version('96.05.01', 'patch') as patch_version  -- 1
```[Source](https://github.com/mozilla/bigquery-etl/blob/generated-sql/sql/mozfun/norm/extract_version)  |  [Edit](https://github.com/mozilla/bigquery-etl/edit/generated-sql/sql/mozfun/norm/extract_version/metadata.yaml)



## fenix_app_info (UDF)

Returns canonical, human-understandable identification info for Fenix sources.

The Glean telemetry library for Android by design routes pings based
on the Play Store appId value of the published application.
As of August 2020, there have been 5 separate Play Store appId
 values associated with different builds of Fenix,
 each corresponding to different datasets in BigQuery,
and the mapping of appId to logical app names
(Firefox vs. Firefox Preview) and channel names
 (nightly, beta, or release) has changed over time; see the
[spreadsheet of naming history for Mozilla's mobile browsers](
  <https://docs.google.com/spreadsheets/d/18PzkzZxdpFl23__-CIO735NumYDqu7jHpqllo0sBbPA/edit#gid=0).>

This function is intended as the source of truth for how to map a
specific ping in BigQuery to a logical app names and channel.
It should be expected that the output of this function may evolve
over time. If we rename a product or channel, we may choose to
update the values here so that analyses consistently get the new name.

The first argument (`app_id`) can be fairly fuzzy; it is tolerant
of actual Google Play Store appId values like 'org.mozilla.firefox_beta'
(mix of periods and underscores) as well as BigQuery dataset names
with suffixes like 'org_mozilla_firefox_beta_stable'.

The second argument (`app_build_id`) should be the value
in client_info.app_build.

The function returns a `STRUCT` that contains the logical `app_name`
and `channel` as well as the Play Store `app_id` in the canonical
form which would appear in Play Store URLs.

Note that the naming of Fenix applications changed on 2020-07-03,
so to get a continuous view of the pings associated with a logical
app channel, you may need to union together tables from multiple
BigQuery datasets. To see data for all Fenix channels together, it
is necessary to union together tables from all 5 datasets.
For basic usage information, consider using
`telemetry.fenix_clients_last_seen` which already handles the
union. Otherwise, see the example below as a template for how
construct a custom union.

Mapping of channels to datasets:

- release: `org_mozilla_firefox`
- beta: `org_mozilla_firefox_beta` (current) and `org_mozilla_fenix`
- nightly: `org_mozilla_fenix` (current), `org_mozilla_fennec_aurora`,
and `org_mozilla_fenix_nightly`


```sql
-- Example of a query over all Fenix builds advertised as "Firefox Beta"
CREATE TEMP FUNCTION extract_fields(app_id STRING, m ANY TYPE) AS (
  (
    SELECT AS STRUCT
      m.submission_timestamp,
      m.metrics.string.geckoview_version,
      mozfun.norm.fenix_app_info(app_id, m.client_info.app_build).*
  )
);

WITH base AS (
  SELECT
    extract_fields('org_mozilla_firefox_beta', m).*
  FROM
    org_mozilla_firefox_beta.metrics AS m
  UNION ALL
  SELECT
    extract_fields('org_mozilla_fenix', m).*
  FROM
    org_mozilla_fenix.metrics AS m
)
SELECT
  DATE(submission_timestamp) AS submission_date,
  geckoview_version,
  COUNT(*)
FROM
  base
WHERE
  app_name = 'Fenix'  -- excludes 'Firefox Preview'
  AND channel = 'beta'
  AND DATE(submission_timestamp) = '2020-08-01'
GROUP BY
  submission_date,
  geckoview_version

Source | Edit

fenix_build_to_date (UDF)

Convert the Fenix client_info.app_build-format string to a DATETIME. May return NULL on failure.

Fenix originally used an 8-digit app_build format>

In short it is yDDDHHmm:

  • y is years since 2018
  • DDD is day of year, 0-padded, 001-366
  • HH is hour of day, 00-23
  • mm is minute of hour, 00-59

The last date seen with an 8-digit build ID is 2020-08-10.

Newer builds use a 10-digit format> where the integer represents a pattern consisting of 32 bits. The 17 bits starting 13 bits from the left represent a number of hours since UTC midnight beginning 2014-12-28.

This function tolerates both formats.

After using this you may wish to DATETIME_TRUNC(result, DAY) for grouping by build date.

Source | Edit

glean_ping_info (UDF)

Accepts a glean ping_info struct as input and returns a modified struct that includes a few parsed or normalized variants of the input fields.

Source | Edit

metadata (UDF)

Accepts a pipeline metadata struct as input and returns a modified struct that includes a few parsed or normalized variants of the input metadata fields.

Source | Edit

os (UDF)

Normalize an operating system string to one of the three major desktop platforms, one of the two major mobile platforms, or "Other".

This is a reimplementation of logic used in the data pipeline> to populate normalized_os.

Source | Edit

glean_baseline_client_info (UDF)

Accepts a glean client_info struct as input and returns a modified struct that includes a few parsed or normalized variants of the input fields.

Source | Edit

truncate_version (UDF)

Truncates a version string like <major>.<minor>.<patch> to either the major or minor version. The return value is NUMERIC, which means that you can sort the results without fear (e.g. 100 will be categorized as greater than 80, which isn't the case when sorting lexigraphically).

For example, "5.1.0" would be translated to 5.1 if the parameter is "minor" or 5 if the parameter is major.

If the version is only a major and/or minor version, then it will be left unchanged (for example "10" would stay as 10 when run through this function, no matter what the arguments).

This is useful for grouping Linux and Mac operating system versions inside aggregate datasets or queries where there may be many different patch releases in the field.

Source | Edit

vpn_attribution (UDF)

Accepts vpn attribution fields as input and returns a struct of normalized fields.

Source | Edit

product_info (UDF)

Returns a normalized app_name and canonical_app_name for a product based on legacy_app_name and normalized_os values. Thus, this function serves as a bridge to get from legacy application identifiers to the consistent identifiers we are using for reporting in 2021.

As of 2021, most Mozilla products are sending telemetry via the Glean SDK, with Glean telemetry in active development for desktop Firefox as well. The probeinfo API is the single source of truth for metadata about applications sending Glean telemetry; the values for app_name and canonical_app_name returned here correspond to the "end-to-end identifier" values documented in the v2 Glean app listings endpoint . For non-Glean telemetry, we provide values in the same style to provide continuity as we continue the migration to Glean.

For legacy telemetry pings like main ping for desktop and core ping for mobile products, the legacy_app_name given as input to this function should come from the submission URI (stored as metadata.uri.app_name in BigQuery ping tables). For Glean pings, we have invented product values that can be passed in to this function as the legacy_app_name parameter.

The returned app_name values are intended to be readable and unambiguous, but short and easy to type. They are suitable for use as a key in derived tables. product is a deprecated field that was similar in intent.

The returned canonical_app_name is more verbose and is suited for displaying in visualizations. canonical_name is a synonym that we provide for historical compatibility with previous versions of this function.

The returned struct also contains boolean contributes_to_2021_kpi as the canonical reference for whether the given application is included in KPI reporting. Additional fields may be added for future years.

The normalized_os value that's passed in should be the top-level normalized_os value present in any ping table or you may want to wrap a raw value in mozfun.norm.os like mozfun.norm.product_info(app_name, mozfun.norm.os(os)).

This function also tolerates passing in a product value as legacy_app_name so that this function is still useful for derived tables which have thrown away the raw app_name value from legacy pings.

The mappings are as follows:

legacy_app_name normalized_os app_name product canonical_app_name 2019 2020 2021
Firefox * firefox_desktop Firefox Firefox for Desktop true true true
Fenix Android fenix Fenix Firefox for Android (Fenix) true true true
Fennec Android fennec Fennec Firefox for Android (Fennec) true true true
Firefox Preview Android firefox_preview Firefox Preview Firefox Preview for Android true true true
Fennec iOS firefox_ios Firefox iOS Firefox for iOS true true true
FirefoxForFireTV Android firefox_fire_tv Firefox Fire TV Firefox for Fire TV false false false
FirefoxConnect Android firefox_connect Firefox Echo Firefox for Echo Show true true false
Zerda Android firefox_lite Firefox Lite Firefox Lite true true false
Zerda_cn Android firefox_lite_cn Firefox Lite CN Firefox Lite (China) false false false
Focus Android focus_android Focus Android Firefox Focus for Android true true true
Focus iOS focus_ios Focus iOS Firefox Focus for iOS true true true
Klar Android klar_android Klar Android Firefox Klar for Android false false false
Klar iOS klar_ios Klar iOS Firefox Klar for iOS false false false
Lockbox Android lockwise_android Lockwise Android Lockwise for Android true true false
Lockbox iOS lockwise_ios Lockwise iOS Lockwise for iOS true true false
FirefoxReality* Android firefox_reality Firefox Reality Firefox Reality false false false
Source Edit

browser_version_info (UDF)

Adds metadata related to the browser version in a struct.

This is a temporary solution that allows browser version analysis. It should eventually be replaced with one or more browser version tables that serves as a source of truth for version releases.

Source | Edit

diff_months (UDF)

Determine the number of whole months after grace period between start and end. Month is dependent on timezone, so start and end must both be datetimes, or both be dates, in the correct timezone. Grace period can be used to account for billing delay, usually 1 day, and is counted after months. When inclusive is FALSE, start and end are not included in whole months. For example, diff_months(start => '2021-01-01', end => '2021-03-01', grace_period => INTERVAL 0 day, inclusive => FALSE) returns 1, because start plus two months plus grace period is not less than end. Changing inclusive to TRUE returns 2, because start plus two months plus grace period is less than or equal to end. diff_months(start => '2021-01-01', end => '2021-03-02 00:00:00.000001', grace_period => INTERVAL 1 DAY, inclusive => FALSE) returns 2, because start plus two months plus grace period is less than end.

Source | Edit