norm
Functions for normalizing data.
app_channel (UDF)
Normalize app channel name, returning "Other" for unrecognized values.
Normalize app channel name, returning "Other" for unrecognized values.
Based on the logic used in ingestion: https://github.com/mozilla/gcp-ingestion/blob/fb7e9ed9e891e3e2320d85e05d7c1a1aedb57780/ingestion-beam/src/main/java/com/mozilla/telemetry/transforms/NormalizeAttributes.java#L34
Parameters
INPUTS
channel_name STRING
OUTPUTS
STRING
browser_version_info (UDF)
Adds metadata related to the browser version in a struct.
This is a temporary solution that allows browser version analysis. It should eventually be replaced with one or more browser version tables that serves as a source of truth for version releases.
Parameters
INPUTS
version_string STRING
OUTPUTS
STRUCT<version STRING, major_version NUMERIC, minor_version NUMERIC, patch_revision NUMERIC, is_major_release BOOLEAN>
diff_months (UDF)
Determine the number of whole months after grace period between start and end. Month is dependent on timezone, so start and end must both be datetimes, or both be dates, in the correct timezone. Grace period can be used to account for billing delay, usually 1 day, and is counted after months. When inclusive is FALSE, start and end are not included in whole months. For example, diff_months(start => '2021-01-01', end => '2021-03-01', grace_period => INTERVAL 0 day, inclusive => FALSE) returns 1, because start plus two months plus grace period is not less than end. Changing inclusive to TRUE returns 2, because start plus two months plus grace period is less than or equal to end. diff_months(start => '2021-01-01', end => '2021-03-02 00:00:00.000001', grace_period => INTERVAL 1 DAY, inclusive => FALSE) returns 2, because start plus two months plus grace period is less than end.
Parameters
INPUTS
start DATETIME, `end` DATETIME, grace_period INTERVAL, inclusive BOOLEAN
extract_version (UDF)
Extracts numeric version data from a version string like
<major>.<minor>.<patch>
.
Note: Non-zero minor and patch versions will be floating point Numeric
.
Usage:
SELECT
mozfun.norm.extract_version(version_string, 'major') as major_version,
mozfun.norm.extract_version(version_string, 'minor') as minor_version,
mozfun.norm.extract_version(version_string, 'patch') as patch_version
Example using "96.05.01"
:
SELECT
mozfun.norm.extract_version('96.05.01', 'major') as major_version, -- 96
mozfun.norm.extract_version('96.05.01', 'minor') as minor_version, -- 5
mozfun.norm.extract_version('96.05.01', 'patch') as patch_version -- 1
Parameters
INPUTS
version_string STRING, extraction_level STRING
OUTPUTS
NUMERIC
fenix_app_info (UDF)
Returns canonical, human-understandable identification info for Fenix sources.
The Glean telemetry library for Android by design routes pings based on the Play Store appId value of the published application. As of August 2020, there have been 5 separate Play Store appId values associated with different builds of Fenix, each corresponding to different datasets in BigQuery, and the mapping of appId to logical app names (Firefox vs. Firefox Preview) and channel names (nightly, beta, or release) has changed over time; see the spreadsheet of naming history for Mozilla's mobile browsers.
This function is intended as the source of truth for how to map a specific ping in BigQuery to a logical app names and channel. It should be expected that the output of this function may evolve over time. If we rename a product or channel, we may choose to update the values here so that analyses consistently get the new name.
The first argument (app_id
) can be fairly fuzzy; it is tolerant
of actual Google Play Store appId values like 'org.mozilla.firefox_beta'
(mix of periods and underscores) as well as BigQuery dataset names
with suffixes like 'org_mozilla_firefox_beta_stable'.
The second argument (app_build_id
) should be the value
in client_info.app_build.
The function returns a STRUCT
that contains the logical app_name
and channel
as well as the Play Store app_id
in the canonical
form which would appear in Play Store URLs.
Note that the naming of Fenix applications changed on 2020-07-03,
so to get a continuous view of the pings associated with a logical
app channel, you may need to union together tables from multiple
BigQuery datasets. To see data for all Fenix channels together, it
is necessary to union together tables from all 5 datasets.
For basic usage information, consider using
telemetry.fenix_clients_last_seen
which already handles the
union. Otherwise, see the example below as a template for how
construct a custom union.
Mapping of channels to datasets:
- release:
org_mozilla_firefox
- beta:
org_mozilla_firefox_beta
(current) andorg_mozilla_fenix
- nightly:
org_mozilla_fenix
(current),org_mozilla_fennec_aurora
, andorg_mozilla_fenix_nightly
-- Example of a query over all Fenix builds advertised as "Firefox Beta"
CREATE TEMP FUNCTION extract_fields(app_id STRING, m ANY TYPE) AS (
(
SELECT AS STRUCT
m.submission_timestamp,
m.metrics.string.geckoview_version,
mozfun.norm.fenix_app_info(app_id, m.client_info.app_build).*
)
);
WITH base AS (
SELECT
extract_fields('org_mozilla_firefox_beta', m).*
FROM
`mozdata.org_mozilla_firefox_beta.metrics` AS m
UNION ALL
SELECT
extract_fields('org_mozilla_fenix', m).*
FROM
`mozdata.org_mozilla_fenix.metrics` AS m
)
SELECT
DATE(submission_timestamp) AS submission_date,
geckoview_version,
COUNT(*)
FROM
base
WHERE
app_name = 'Fenix' -- excludes 'Firefox Preview'
AND channel = 'beta'
AND DATE(submission_timestamp) = '2020-08-01'
GROUP BY
submission_date,
geckoview_version
Parameters
INPUTS
app_id STRING, app_build_id STRING
OUTPUTS
STRUCT<app_name STRING, channel STRING, app_id STRING>
fenix_build_to_datetime (UDF)
Convert the Fenix client_info.app_build-format string to a DATETIME. May return NULL on failure.
Fenix originally used an 8-digit app_build format
In short it is yDDDHHmm
:
- y is years since 2018
- DDD is day of year, 0-padded, 001-366
- HH is hour of day, 00-23
- mm is minute of hour, 00-59
The last date seen with an 8-digit build ID is 2020-08-10.
Newer builds use a 10-digit format where the integer represents a pattern consisting of 32 bits. The 17 bits starting 13 bits from the left represent a number of hours since UTC midnight beginning 2014-12-28.
This function tolerates both formats.
After using this you may wish to DATETIME_TRUNC(result, DAY)
for
grouping by build date.
Parameters
INPUTS
app_build STRING
OUTPUTS
INT64
firefox_android_package_name_to_channel (UDF)
Map Fenix package name to the channel name
Parameters
INPUTS
package_name STRING
OUTPUTS
STRING
get_earliest_value (UDF)
This UDF returns the earliest not-null value pair and datetime from a list of values and their corresponding timestamp.
The function will return the first value pair in the input array, that is not null and has the earliest timestamp.
Because there may be more than one value on the same date e.g. more than one value reported by different pings on the same date, the dates must be given as TIMESTAMPS and the values as STRING.
Usage:
SELECT
mozfun.norm.get_earliest_value(ARRAY<STRUCT<value STRING, value_source STRING, value_date DATETIME>>) AS <alias>
Parameters
INPUTS
value_set ARRAY<STRUCT<value STRING, value_source STRING, value_date DATETIME>>
OUTPUTS
STRUCT<earliest_value STRING, earliest_value_source STRING, earliest_date DATETIME>
get_windows_info (UDF)
Exract the name, the version name, the version number, and the build number
corresponding to a Microsoft Windows operating system version string in the
form of
Windows Names, Versions, and Builds
Summary
This function is primarily designed to parse the field os_version
in table mozdata.default_browser_agent.default_browser
. Given a Microsoft Windows OS version string, the function returns the name of the operating system, the version name, the version number, and the build number corresponding to the operating system. As of November 2022, the parser can handle 99.89% of the os_version
values collected in table mozdata.default_browser_agent.default_browser
.
Status as of November 2022
As of November 2022, the expected valid values of os_version
are either x.y.z
or w.x.y.z
where w
, x
, y
, and z
are integers.
As of November 2022, the return values for Windows 10 and Windows 11 are based on Windows 10 release information and Windows 11 release information. For 3-number version strings, the parser assumes the valid values of z
in x.y.z
are at most 5 digits in length. For 4-number version strings, the parser assumes the valid values of z
in w.x.y.z
are at most 6 digits in length. The function makes an educated effort to handle Windows Vista, Windows 7, Windows 8, and Windows 8.1 information, but does not guarantee the return values are absolutely accurate. The function assumes the presence of undocumented non-release versions of Windows 10 and Windows 11, and will return an estimated name, version number, build number but not the version name. The function does not handle other versions of Windows.
As of November 2022, the parser currently handles just over 99.89% of data in the field os_version
in table mozdata.default_browser_agent.default_browser
.
Build number conventions
Note: Microsoft convention for build numbers for Windows 10 and 11 include two numbers, such as build number 22621.900
for version 22621
. The first number repeats the version number and the second number uniquely identifies the build within the version. To simplify data processing and data analysis, this function returns the second unique identifier as an integer instead of returning the full build number as a string.
Example usage
SELECT
`os_version`,
mozfun.norm.get_windows_info(`os_version`) AS windows_info
FROM `mozdata.default_browser_agent.default_browser`
WHERE `submission_timestamp` > (CURRENT_TIMESTAMP() - INTERVAL 7 DAY) AND LEFT(document_id, 2) = '00'
LIMIT 1000
Mapping
os_version | windows_name | windows_version_name | windows_version_number | windows_build_number |
---|---|---|---|---|
6.0.z | Windows Vista | 6.0 | 6.0 | z |
6.1.z | Windows 7 | 7.0 | 6.1 | z |
6.2.z | Windows 8 | 8.0 | 6.2 | z |
6.3.z | Windows 8.1 | 8.1 | 6.3 | z |
10.0.10240.z | Windows 10 | 1507 | 10240 | z |
10.0.10586.z | Windows 10 | 1511 | 10586 | z |
10.0.14393.z | Windows 10 | 1607 | 14393 | z |
10.0.15063.z | Windows 10 | 1703 | 15063 | z |
10.0.16299.z | Windows 10 | 1709 | 16299 | z |
10.0.17134.z | Windows 10 | 1803 | 17134 | z |
10.0.17763.z | Windows 10 | 1809 | 17763 | z |
10.0.18362.z | Windows 10 | 1903 | 18362 | z |
10.0.18363.z | Windows 10 | 1909 | 18363 | z |
10.0.19041.z | Windows 10 | 2004 | 19041 | z |
10.0.19042.z | Windows 10 | 20H2 | 19042 | z |
10.0.19043.z | Windows 10 | 21H1 | 19043 | z |
10.0.19044.z | Windows 10 | 21H2 | 19044 | z |
10.0.19045.z | Windows 10 | 22H2 | 19045 | z |
10.0.y.z | Windows 10 | UNKNOWN | y | z |
10.0.22000.z | Windows 11 | 21H2 | 22000 | z |
10.0.22621.z | Windows 11 | 22H2 | 22621 | z |
10.0.y.z | Windows 11 | UNKNOWN | y | z |
all other values | (null) | (null) | (null) | (null) |
Parameters
INPUTS
os_version STRING
OUTPUTS
STRUCT<name STRING, version_name STRING, version_number DECIMAL, build_number INT64>
glean_baseline_client_info (UDF)
Accepts a glean client_info struct as input and returns a modified struct that includes a few parsed or normalized variants of the input fields.
Parameters
INPUTS
client_info ANY TYPE, metrics ANY TYPE
OUTPUTS
string
glean_ping_info (UDF)
Accepts a glean ping_info struct as input and returns a modified struct that includes a few parsed or normalized variants of the input fields.
Parameters
INPUTS
ping_info ANY TYPE
metadata (UDF)
Accepts a pipeline metadata struct as input and returns a modified struct that includes a few parsed or normalized variants of the input metadata fields.
Parameters
INPUTS
metadata ANY TYPE
OUTPUTS
`date`, CAST(NULL
os (UDF)
Normalize an operating system string to one of the three major desktop platforms, one of the two major mobile platforms, or "Other".
This is a reimplementation of
logic used in the data pipeline
to populate normalized_os
.
Parameters
INPUTS
os STRING
product_info (UDF)
Returns a normalized app_name
and canonical_app_name
for a product based on legacy_app_name
and normalized_os
values.
Thus, this function serves as a bridge to get from legacy application
identifiers to the consistent identifiers we are using for reporting in 2021.
As of 2021, most Mozilla products are sending telemetry via the Glean SDK,
with Glean telemetry in active development for desktop Firefox as well. The
probeinfo
API is the single source of truth for metadata about applications
sending Glean telemetry; the values for app_name
and
canonical_app_name
returned here correspond to the "end-to-end identifier"
values documented in
the v2 Glean app listings endpoint
.
For non-Glean telemetry, we provide values in the same style to provide
continuity as we continue the migration to Glean.
For legacy telemetry pings like main
ping for desktop and core
ping for
mobile products, the legacy_app_name
given as input to this function
should come from the submission URI (stored as
metadata.uri.app_name
in BigQuery ping tables).
For Glean pings, we have invented product
values that can be passed
in to this function as the legacy_app_name
parameter.
The returned app_name
values are intended to be
readable and unambiguous, but short and easy to type.
They are suitable for use as a key in derived tables.
product
is a deprecated field that was similar in intent.
The returned canonical_app_name
is more verbose and is suited for displaying
in visualizations. canonical_name
is a synonym that we provide for
historical compatibility with previous versions of this function.
The returned struct also contains boolean contributes_to_2021_kpi
as the
canonical reference for whether the given application is included in KPI
reporting. Additional fields may be added for future years.
The normalized_os
value that's passed in should be the top-level
normalized_os
value present in any ping table or you may want
to wrap a raw value in mozfun.norm.os
like mozfun.norm.product_info(app_name, mozfun.norm.os(os))
.
This function also tolerates passing in a product
value as
legacy_app_name
so that this function is still useful for derived tables
which have thrown away the raw app_name
value from legacy pings.
The mappings are as follows:
legacy_app_name | normalized_os | app_name | product | canonical_app_name | 2019 | 2020 | 2021 |
---|---|---|---|---|---|---|---|
Firefox | * | firefox_desktop | Firefox | Firefox for Desktop | true | true | true |
Fenix | Android | fenix | Fenix | Firefox for Android (Fenix) | true | true | true |
Fennec | Android | fennec | Fennec | Firefox for Android (Fennec) | true | true | true |
Firefox Preview | Android | firefox_preview | Firefox Preview | Firefox Preview for Android | true | true | true |
Fennec | iOS | firefox_ios | Firefox iOS | Firefox for iOS | true | true | true |
FirefoxForFireTV | Android | firefox_fire_tv | Firefox Fire TV | Firefox for Fire TV | false | false | false |
FirefoxConnect | Android | firefox_connect | Firefox Echo | Firefox for Echo Show | true | true | false |
Zerda | Android | firefox_lite | Firefox Lite | Firefox Lite | true | true | false |
Zerda_cn | Android | firefox_lite_cn | Firefox Lite CN | Firefox Lite (China) | false | false | false |
Focus | Android | focus_android | Focus Android | Firefox Focus for Android | true | true | true |
Focus | iOS | focus_ios | Focus iOS | Firefox Focus for iOS | true | true | true |
Klar | Android | klar_android | Klar Android | Firefox Klar for Android | false | false | false |
Klar | iOS | klar_ios | Klar iOS | Firefox Klar for iOS | false | false | false |
Lockbox | Android | lockwise_android | Lockwise Android | Lockwise for Android | true | true | false |
Lockbox | iOS | lockwise_ios | Lockwise iOS | Lockwise for iOS | true | true | false |
FirefoxReality* | Android | firefox_reality | Firefox Reality | Firefox Reality | false | false | false |
Parameters
INPUTS
legacy_app_name STRING, normalized_os STRING
OUTPUTS
STRUCT<app_name STRING, product STRING, canonical_app_name STRING, canonical_name STRING, contributes_to_2019_kpi BOOLEAN, contributes_to_2020_kpi BOOLEAN, contributes_to_2021_kpi BOOLEAN>
result_type_to_product_name (UDF)
Convert urlbar result types into product-friendly names
This UDF converts result types from urlbar events (engagement, impression, abandonment) into product-friendly names.
Parameters
INPUTS
res STRING
OUTPUTS
STRING
truncate_version (UDF)
Truncates a version string like <major>.<minor>.<patch>
to
either the major or minor version. The return value is NUMERIC
,
which means that you can sort the results without fear (e.g. 100
will be categorized as greater than 80, which isn't the case when
sorting lexigraphically).
For example, "5.1.0" would be translated to 5.1
if the parameter is
"minor" or 5
if the parameter is major.
If the version is only a major and/or minor version, then it will be left
unchanged (for example "10" would stay as 10
when run through this
function, no matter what the arguments).
This is useful for grouping Linux and Mac operating system versions inside aggregate datasets or queries where there may be many different patch releases in the field.
Parameters
INPUTS
os_version STRING, truncation_level STRING
OUTPUTS
NUMERIC
vpn_attribution (UDF)
Accepts vpn attribution fields as input and returns a struct of normalized fields.
Parameters
INPUTS
utm_campaign STRING, utm_content STRING, utm_medium STRING, utm_source STRING
OUTPUTS
STRUCT<normalized_acquisition_channel STRING, normalized_campaign STRING, normalized_content STRING, normalized_medium STRING, normalized_source STRING, website_channel_group STRING>
windows_version_info (UDF)
Given an unnormalized set off Windows identifiers, return a friendly version of the operating system name.
Requires os, os_version and windows_build_number.
E.G. from windows_build_number >= 22000 return Windows 11
Parameters
INPUTS
os STRING, os_version STRING, windows_build_number INT64
OUTPUTS
STRING