Skip to content

norm

Functions for normalizing data.

app_channel (UDF)

Normalize app channel name, returning "Other" for unrecognized values.

Normalize app channel name, returning "Other" for unrecognized values.

Based on the logic used in ingestion: https://github.com/mozilla/gcp-ingestion/blob/fb7e9ed9e891e3e2320d85e05d7c1a1aedb57780/ingestion-beam/src/main/java/com/mozilla/telemetry/transforms/NormalizeAttributes.java#L34

Parameters

INPUTS

channel_name STRING

OUTPUTS

STRING

Source | Edit

browser_version_info (UDF)

Adds metadata related to the browser version in a struct.

This is a temporary solution that allows browser version analysis. It should eventually be replaced with one or more browser version tables that serves as a source of truth for version releases.

Parameters

INPUTS

version_string STRING

OUTPUTS

STRUCT<version STRING, major_version NUMERIC, minor_version NUMERIC, patch_revision NUMERIC, is_major_release BOOLEAN>

Source | Edit

diff_months (UDF)

Determine the number of whole months after grace period between start and end. Month is dependent on timezone, so start and end must both be datetimes, or both be dates, in the correct timezone. Grace period can be used to account for billing delay, usually 1 day, and is counted after months. When inclusive is FALSE, start and end are not included in whole months. For example, diff_months(start => '2021-01-01', end => '2021-03-01', grace_period => INTERVAL 0 day, inclusive => FALSE) returns 1, because start plus two months plus grace period is not less than end. Changing inclusive to TRUE returns 2, because start plus two months plus grace period is less than or equal to end. diff_months(start => '2021-01-01', end => '2021-03-02 00:00:00.000001', grace_period => INTERVAL 1 DAY, inclusive => FALSE) returns 2, because start plus two months plus grace period is less than end.

Parameters

INPUTS

start DATETIME, `end` DATETIME, grace_period INTERVAL, inclusive BOOLEAN

Source | Edit

extract_version (UDF)

Extracts numeric version data from a version string like <major>.<minor>.<patch>.

Note: Non-zero minor and patch versions will be floating point Numeric.

Usage:

SELECT
    mozfun.norm.extract_version(version_string, 'major') as major_version,
    mozfun.norm.extract_version(version_string, 'minor') as minor_version,
    mozfun.norm.extract_version(version_string, 'patch') as patch_version

Example using "96.05.01":

SELECT
    mozfun.norm.extract_version('96.05.01', 'major') as major_version, -- 96
    mozfun.norm.extract_version('96.05.01', 'minor') as minor_version, -- 5
    mozfun.norm.extract_version('96.05.01', 'patch') as patch_version  -- 1

Parameters

INPUTS

version_string STRING, extraction_level STRING

OUTPUTS

NUMERIC

Source | Edit

fenix_app_info (UDF)

Returns canonical, human-understandable identification info for Fenix sources.

The Glean telemetry library for Android by design routes pings based on the Play Store appId value of the published application. As of August 2020, there have been 5 separate Play Store appId values associated with different builds of Fenix, each corresponding to different datasets in BigQuery, and the mapping of appId to logical app names (Firefox vs. Firefox Preview) and channel names (nightly, beta, or release) has changed over time; see the spreadsheet of naming history for Mozilla's mobile browsers.

This function is intended as the source of truth for how to map a specific ping in BigQuery to a logical app names and channel. It should be expected that the output of this function may evolve over time. If we rename a product or channel, we may choose to update the values here so that analyses consistently get the new name.

The first argument (app_id) can be fairly fuzzy; it is tolerant of actual Google Play Store appId values like 'org.mozilla.firefox_beta' (mix of periods and underscores) as well as BigQuery dataset names with suffixes like 'org_mozilla_firefox_beta_stable'.

The second argument (app_build_id) should be the value in client_info.app_build.

The function returns a STRUCT that contains the logical app_name and channel as well as the Play Store app_id in the canonical form which would appear in Play Store URLs.

Note that the naming of Fenix applications changed on 2020-07-03, so to get a continuous view of the pings associated with a logical app channel, you may need to union together tables from multiple BigQuery datasets. To see data for all Fenix channels together, it is necessary to union together tables from all 5 datasets. For basic usage information, consider using telemetry.fenix_clients_last_seen which already handles the union. Otherwise, see the example below as a template for how construct a custom union.

Mapping of channels to datasets:

  • release: org_mozilla_firefox
  • beta: org_mozilla_firefox_beta (current) and org_mozilla_fenix
  • nightly: org_mozilla_fenix (current), org_mozilla_fennec_aurora, and org_mozilla_fenix_nightly
-- Example of a query over all Fenix builds advertised as "Firefox Beta"
CREATE TEMP FUNCTION extract_fields(app_id STRING, m ANY TYPE) AS (
  (
    SELECT AS STRUCT
      m.submission_timestamp,
      m.metrics.string.geckoview_version,
      mozfun.norm.fenix_app_info(app_id, m.client_info.app_build).*
  )
);

WITH base AS (
  SELECT
    extract_fields('org_mozilla_firefox_beta', m).*
  FROM
    `mozdata.org_mozilla_firefox_beta.metrics` AS m
  UNION ALL
  SELECT
    extract_fields('org_mozilla_fenix', m).*
  FROM
    `mozdata.org_mozilla_fenix.metrics` AS m
)
SELECT
  DATE(submission_timestamp) AS submission_date,
  geckoview_version,
  COUNT(*)
FROM
  base
WHERE
  app_name = 'Fenix'  -- excludes 'Firefox Preview'
  AND channel = 'beta'
  AND DATE(submission_timestamp) = '2020-08-01'
GROUP BY
  submission_date,
  geckoview_version

Parameters

INPUTS

app_id STRING, app_build_id STRING

OUTPUTS

STRUCT<app_name STRING, channel STRING, app_id STRING>

Source | Edit

fenix_build_to_datetime (UDF)

Convert the Fenix client_info.app_build-format string to a DATETIME. May return NULL on failure.

Fenix originally used an 8-digit app_build format

In short it is yDDDHHmm:

  • y is years since 2018
  • DDD is day of year, 0-padded, 001-366
  • HH is hour of day, 00-23
  • mm is minute of hour, 00-59

The last date seen with an 8-digit build ID is 2020-08-10.

Newer builds use a 10-digit format where the integer represents a pattern consisting of 32 bits. The 17 bits starting 13 bits from the left represent a number of hours since UTC midnight beginning 2014-12-28.

This function tolerates both formats.

After using this you may wish to DATETIME_TRUNC(result, DAY) for grouping by build date.

Parameters

INPUTS

app_build STRING

OUTPUTS

INT64

Source | Edit

firefox_android_package_name_to_channel (UDF)

Map Fenix package name to the channel name

Parameters

INPUTS

package_name STRING

OUTPUTS

STRING

Source | Edit

get_earliest_value (UDF)

This UDF returns the earliest not-null value pair and datetime from a list of values and their corresponding timestamp.

The function will return the first value pair in the input array, that is not null and has the earliest timestamp.

Because there may be more than one value on the same date e.g. more than one value reported by different pings on the same date, the dates must be given as TIMESTAMPS and the values as STRING.

Usage:

SELECT
   mozfun.norm.get_earliest_value(ARRAY<STRUCT<value STRING, value_source STRING, value_date DATETIME>>) AS <alias>

Parameters

INPUTS

value_set ARRAY<STRUCT<value STRING, value_source STRING, value_date DATETIME>>

OUTPUTS

STRUCT<earliest_value STRING, earliest_value_source STRING, earliest_date DATETIME>

Source | Edit

get_windows_info (UDF)

Exract the name, the version name, the version number, and the build number corresponding to a Microsoft Windows operating system version string in the form of .. or ... for most release versions of Windows after 2007.

Windows Names, Versions, and Builds

Summary

This function is primarily designed to parse the field os_version in table mozdata.default_browser_agent.default_browser. Given a Microsoft Windows OS version string, the function returns the name of the operating system, the version name, the version number, and the build number corresponding to the operating system. As of November 2022, the parser can handle 99.89% of the os_version values collected in table mozdata.default_browser_agent.default_browser.

Status as of November 2022

As of November 2022, the expected valid values of os_version are either x.y.z or w.x.y.z where w, x, y, and z are integers.

As of November 2022, the return values for Windows 10 and Windows 11 are based on Windows 10 release information and Windows 11 release information. For 3-number version strings, the parser assumes the valid values of z in x.y.z are at most 5 digits in length. For 4-number version strings, the parser assumes the valid values of z in w.x.y.z are at most 6 digits in length. The function makes an educated effort to handle Windows Vista, Windows 7, Windows 8, and Windows 8.1 information, but does not guarantee the return values are absolutely accurate. The function assumes the presence of undocumented non-release versions of Windows 10 and Windows 11, and will return an estimated name, version number, build number but not the version name. The function does not handle other versions of Windows.

As of November 2022, the parser currently handles just over 99.89% of data in the field os_version in table mozdata.default_browser_agent.default_browser.

Build number conventions

Note: Microsoft convention for build numbers for Windows 10 and 11 include two numbers, such as build number 22621.900 for version 22621. The first number repeats the version number and the second number uniquely identifies the build within the version. To simplify data processing and data analysis, this function returns the second unique identifier as an integer instead of returning the full build number as a string.

Example usage

SELECT
  `os_version`,
  mozfun.norm.get_windows_info(`os_version`) AS windows_info
FROM `mozdata.default_browser_agent.default_browser`
WHERE `submission_timestamp` > (CURRENT_TIMESTAMP() - INTERVAL 7 DAY) AND LEFT(document_id, 2) = '00'
LIMIT 1000

Mapping

os_version windows_name windows_version_name windows_version_number windows_build_number
6.0.z Windows Vista 6.0 6.0 z
6.1.z Windows 7 7.0 6.1 z
6.2.z Windows 8 8.0 6.2 z
6.3.z Windows 8.1 8.1 6.3 z
10.0.10240.z Windows 10 1507 10240 z
10.0.10586.z Windows 10 1511 10586 z
10.0.14393.z Windows 10 1607 14393 z
10.0.15063.z Windows 10 1703 15063 z
10.0.16299.z Windows 10 1709 16299 z
10.0.17134.z Windows 10 1803 17134 z
10.0.17763.z Windows 10 1809 17763 z
10.0.18362.z Windows 10 1903 18362 z
10.0.18363.z Windows 10 1909 18363 z
10.0.19041.z Windows 10 2004 19041 z
10.0.19042.z Windows 10 20H2 19042 z
10.0.19043.z Windows 10 21H1 19043 z
10.0.19044.z Windows 10 21H2 19044 z
10.0.19045.z Windows 10 22H2 19045 z
10.0.y.z Windows 10 UNKNOWN y z
10.0.22000.z Windows 11 21H2 22000 z
10.0.22621.z Windows 11 22H2 22621 z
10.0.y.z Windows 11 UNKNOWN y z
all other values (null) (null) (null) (null)

Parameters

INPUTS

os_version STRING

OUTPUTS

STRUCT<name STRING, version_name STRING, version_number DECIMAL, build_number INT64>

Source | Edit

glean_baseline_client_info (UDF)

Accepts a glean client_info struct as input and returns a modified struct that includes a few parsed or normalized variants of the input fields.

Parameters

INPUTS

client_info ANY TYPE, metrics ANY TYPE

OUTPUTS

string

Source | Edit

glean_ping_info (UDF)

Accepts a glean ping_info struct as input and returns a modified struct that includes a few parsed or normalized variants of the input fields.

Parameters

INPUTS

ping_info ANY TYPE

Source | Edit

metadata (UDF)

Accepts a pipeline metadata struct as input and returns a modified struct that includes a few parsed or normalized variants of the input metadata fields.

Parameters

INPUTS

metadata ANY TYPE

OUTPUTS

`date`, CAST(NULL

Source | Edit

os (UDF)

Normalize an operating system string to one of the three major desktop platforms, one of the two major mobile platforms, or "Other".

This is a reimplementation of logic used in the data pipeline to populate normalized_os.

Parameters

INPUTS

os STRING

Source | Edit

product_info (UDF)

Returns a normalized app_name and canonical_app_name for a product based on legacy_app_name and normalized_os values. Thus, this function serves as a bridge to get from legacy application identifiers to the consistent identifiers we are using for reporting in 2021.

As of 2021, most Mozilla products are sending telemetry via the Glean SDK, with Glean telemetry in active development for desktop Firefox as well. The probeinfo API is the single source of truth for metadata about applications sending Glean telemetry; the values for app_name and canonical_app_name returned here correspond to the "end-to-end identifier" values documented in the v2 Glean app listings endpoint . For non-Glean telemetry, we provide values in the same style to provide continuity as we continue the migration to Glean.

For legacy telemetry pings like main ping for desktop and core ping for mobile products, the legacy_app_name given as input to this function should come from the submission URI (stored as metadata.uri.app_name in BigQuery ping tables). For Glean pings, we have invented product values that can be passed in to this function as the legacy_app_name parameter.

The returned app_name values are intended to be readable and unambiguous, but short and easy to type. They are suitable for use as a key in derived tables. product is a deprecated field that was similar in intent.

The returned canonical_app_name is more verbose and is suited for displaying in visualizations. canonical_name is a synonym that we provide for historical compatibility with previous versions of this function.

The returned struct also contains boolean contributes_to_2021_kpi as the canonical reference for whether the given application is included in KPI reporting. Additional fields may be added for future years.

The normalized_os value that's passed in should be the top-level normalized_os value present in any ping table or you may want to wrap a raw value in mozfun.norm.os like mozfun.norm.product_info(app_name, mozfun.norm.os(os)).

This function also tolerates passing in a product value as legacy_app_name so that this function is still useful for derived tables which have thrown away the raw app_name value from legacy pings.

The mappings are as follows:

legacy_app_name normalized_os app_name product canonical_app_name 2019 2020 2021
Firefox * firefox_desktop Firefox Firefox for Desktop true true true
Fenix Android fenix Fenix Firefox for Android (Fenix) true true true
Fennec Android fennec Fennec Firefox for Android (Fennec) true true true
Firefox Preview Android firefox_preview Firefox Preview Firefox Preview for Android true true true
Fennec iOS firefox_ios Firefox iOS Firefox for iOS true true true
FirefoxForFireTV Android firefox_fire_tv Firefox Fire TV Firefox for Fire TV false false false
FirefoxConnect Android firefox_connect Firefox Echo Firefox for Echo Show true true false
Zerda Android firefox_lite Firefox Lite Firefox Lite true true false
Zerda_cn Android firefox_lite_cn Firefox Lite CN Firefox Lite (China) false false false
Focus Android focus_android Focus Android Firefox Focus for Android true true true
Focus iOS focus_ios Focus iOS Firefox Focus for iOS true true true
Klar Android klar_android Klar Android Firefox Klar for Android false false false
Klar iOS klar_ios Klar iOS Firefox Klar for iOS false false false
Lockbox Android lockwise_android Lockwise Android Lockwise for Android true true false
Lockbox iOS lockwise_ios Lockwise iOS Lockwise for iOS true true false
FirefoxReality* Android firefox_reality Firefox Reality Firefox Reality false false false

Parameters

INPUTS

legacy_app_name STRING, normalized_os STRING

OUTPUTS

STRUCT<app_name STRING, product STRING, canonical_app_name STRING, canonical_name STRING, contributes_to_2019_kpi BOOLEAN, contributes_to_2020_kpi BOOLEAN, contributes_to_2021_kpi BOOLEAN>

Source | Edit

result_type_to_product_name (UDF)

Convert urlbar result types into product-friendly names

This UDF converts result types from urlbar events (engagement, impression, abandonment) into product-friendly names.

Parameters

INPUTS

res STRING

OUTPUTS

STRING

Source | Edit

truncate_version (UDF)

Truncates a version string like <major>.<minor>.<patch> to either the major or minor version. The return value is NUMERIC, which means that you can sort the results without fear (e.g. 100 will be categorized as greater than 80, which isn't the case when sorting lexigraphically).

For example, "5.1.0" would be translated to 5.1 if the parameter is "minor" or 5 if the parameter is major.

If the version is only a major and/or minor version, then it will be left unchanged (for example "10" would stay as 10 when run through this function, no matter what the arguments).

This is useful for grouping Linux and Mac operating system versions inside aggregate datasets or queries where there may be many different patch releases in the field.

Parameters

INPUTS

os_version STRING, truncation_level STRING

OUTPUTS

NUMERIC

Source | Edit

vpn_attribution (UDF)

Accepts vpn attribution fields as input and returns a struct of normalized fields.

Parameters

INPUTS

utm_campaign STRING, utm_content STRING, utm_medium STRING, utm_source STRING

OUTPUTS

STRUCT<normalized_acquisition_channel STRING, normalized_campaign STRING, normalized_content STRING, normalized_medium STRING, normalized_source STRING, website_channel_group STRING>

Source | Edit

windows_version_info (UDF)

Given an unnormalized set off Windows identifiers, return a friendly version of the operating system name.

Requires os, os_version and windows_build_number.

E.G. from windows_build_number >= 22000 return Windows 11

Parameters

INPUTS

os STRING, os_version STRING, windows_build_number INT64

OUTPUTS

STRING

Source | Edit