mozetl.taar package

Submodules

mozetl.taar.taar_amodump module

class mozetl.taar.taar_amodump.AMOAddonFile[source]

Bases: mozetl.taar.taar_amodump.JSONSchema

meta = {'id': <class 'int'>, 'is_webextension': <class 'bool'>, 'platform': <class 'str'>, 'status': <class 'str'>}
class mozetl.taar.taar_amodump.AMOAddonInfo[source]

Bases: mozetl.taar.taar_amodump.JSONSchema

meta = {'categories': typing.Dict[str, typing.List[str]], 'current_version': <class 'mozetl.taar.taar_amodump.AMOAddonVersion'>, 'default_locale': <class 'str'>, 'description': typing.Dict[str, str], 'guid': <class 'str'>, 'name': typing.Dict[str, str], 'ratings': typing.Dict[str, float], 'summary': typing.Dict[str, str], 'tags': typing.List[str], 'weekly_downloads': <class 'int'>}
class mozetl.taar.taar_amodump.AMOAddonVersion[source]

Bases: mozetl.taar.taar_amodump.JSONSchema

meta = {'files': typing.List[mozetl.taar.taar_amodump.AMOAddonFile]}
class mozetl.taar.taar_amodump.AMODatabase(worker_count)[source]

Bases: object

fetch_addons()[source]
class mozetl.taar.taar_amodump.JSONSchema[source]

Bases: object

class mozetl.taar.taar_amodump.Undefined[source]

Bases: object

This value is used to disambiguate None vs a non-existent value on dict.get() lookups

mozetl.taar.taar_amodump.logger = <Logger amo_database (INFO)>

JSON from the addons.mozilla.org server is parsed using subclasses of JSONSchema to declare the types that we want to extract.

The top level object that we are parsing is AMOAddonInfo.

All classes of type JSONSchema have a meta dictionary attribute which defines the keys we are interested in extracting.

Each key is the name of the attribute we want to retrieve.

The value defines how we want to coerce the inbound data. There are 3 general cases:

  1. Subclasses of JSONSchema are nested objects, represented as dictionaries.

  2. List<T> or Dict<T,T> types, where values are coerced recursively using the marshal function.

  3. Everything else. These are callable type definitions, usually Python built-ins like str or bool. It is possible to define custom callables if you want to do custom data conversion.

mozetl.taar.taar_amodump.marshal(value, name, type_def)[source]
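As a rough illustration of cases 1-3 above, the following is a minimal, hypothetical sketch of how a meta dictionary can drive this kind of recursive coercion. It is not the actual implementation; the marshal_sketch helper and its handling of typing generics are assumptions.

import typing


class JSONSchema:
    """Base marker class; subclasses declare a `meta` dict of field -> type."""
    meta = {}


def marshal_sketch(value, name, type_def):
    # Case 1 - nested objects: recurse into the subclass's own meta definition.
    if isinstance(type_def, type) and issubclass(type_def, JSONSchema):
        return {
            key: marshal_sketch(value.get(key), key, subtype)
            for key, subtype in type_def.meta.items()
            if value.get(key) is not None
        }
    # Case 2 - typed containers: coerce each element recursively.
    origin = typing.get_origin(type_def)
    if origin is list:
        (item_type,) = typing.get_args(type_def)
        return [marshal_sketch(v, name, item_type) for v in value]
    if origin is dict:
        _, val_type = typing.get_args(type_def)
        return {k: marshal_sketch(v, k, val_type) for k, v in value.items()}
    # Case 3 - everything else: a callable type definition such as str, int or bool.
    return type_def(value)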

mozetl.taar.taar_amowhitelist module

class mozetl.taar.taar_amowhitelist.AMOTransformer(bucket, prefix, fname, min_rating, min_age)[source]

Bases: object

This class transforms the raw AMO addon JSON dump by filtering out addons that do not meet the minimum requirements for ‘whitelisted’ addons. See the documentation in the transform method for details.

extract()[source]
get_featuredlist()[source]
get_featuredwhitelist()[source]
get_whitelist()[source]
load()[source]
load_featuredlist(jdata)[source]
load_featuredwhitelist(jdata)[source]
load_whitelist(jdata)[source]
transform(json_data)[source]

We currently whitelist addons which meet the following minimum criteria:

  • An average rating of at least 3.0

  • At least 60 days old as computed using the ‘first_create_date’ field in the addon JSON

  • Not the Firefox Pioneer addon

The criteria are discussed at:

https://github.com/mozilla/taar-lite/issues/1
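As a rough illustration only, the following sketch checks these criteria for a single add-on record; the field names (ratings['average'], first_create_date) and the Pioneer GUID constant are assumptions, not the job's actual values.

import datetime

MIN_RATING = 3.0
MIN_AGE_DAYS = 60
PIONEER_GUID = "pioneer-opt-in@mozilla.org"  # hypothetical constant


def passes_whitelist(addon, today=None):
    today = today or datetime.date.today()
    # Criterion 1: average rating of at least 3.0.
    if addon.get("ratings", {}).get("average", 0.0) < MIN_RATING:
        return False
    # Criterion 2: at least 60 days old, based on the assumed first_create_date field.
    created = datetime.date.fromisoformat(addon["first_create_date"])
    if (today - created).days < MIN_AGE_DAYS:
        return False
    # Criterion 3: never whitelist the Firefox Pioneer addon.
    return addon.get("guid") != PIONEER_GUID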

class mozetl.taar.taar_amowhitelist.AbstractAccumulator[source]

Bases: object

get_results()[source]
abstract process_record(guid, addon_data)[source]
class mozetl.taar.taar_amowhitelist.FeaturedAccumulator[source]

Bases: mozetl.taar.taar_amowhitelist.AbstractAccumulator

process_record(guid, addon_data)[source]
class mozetl.taar.taar_amowhitelist.WhitelistAccumulator(min_age, min_rating)[source]

Bases: mozetl.taar.taar_amowhitelist.AbstractAccumulator

process_record(guid, addon_data)[source]
class mozetl.taar.taar_amowhitelist.WhitelistFeaturedAccumulator(min_age, min_rating)[source]

Bases: mozetl.taar.taar_amowhitelist.WhitelistAccumulator

get_results()[source]
process_record(guid, addon_data)[source]

mozetl.taar.taar_dynamo module

This module replicates the Scala script at

https://github.com/mozilla/telemetry-batch-view/blob/1c544f65ad2852703883fe31a9fba38c39e75698/src/main/scala/com/mozilla/telemetry/views/HBaseAddonRecommenderView.scala

This should be invoked with something like this:

spark-submit --master=spark://ec2-52-32-39-246.us-west-2.compute.amazonaws.com taar_dynamo.py --date=20180218 --region=us-west-2 --table=taar_addon_data_20180206 --prod-iam-role=arn:aws:iam::361527076523:role/taar-write-dynamodb-from-dev

class mozetl.taar.taar_dynamo.CredentialSingleton[source]

Bases: object

getInstance(prod_iam_role)[source]
get_new_creds(prod_iam_role)[source]
class mozetl.taar.taar_dynamo.DynamoReducer(prod_iam_role, region_name=None, table_name=None)[source]

Bases: object

dynamo_reducer(list_a, list_b, force_write=False)[source]

This function can be used to reduce tuples of the form produced by list_transformer. Data is merged, and once MAX_RECORDS JSON blobs have been accumulated, the list of JSON is batch-written into DynamoDB.

hash_client_ids(data_tuple)[source]

Clobber the client_id by replacing it with its SHA256 hash encoded as hex, based on the JS code in Firefox.
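A minimal sketch of this kind of client_id clobbering; the exact encoding used by the job may differ.

import hashlib


def hash_client_id(client_id):
    # Replace the raw client_id with its SHA256 digest encoded as hex.
    return hashlib.sha256(client_id.encode("utf8")).hexdigest()


# Example: the original UUID-style id is no longer recoverable from the output.
print(hash_client_id("c0ffee00-1234-5678-9abc-def012345678"))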

push_to_dynamo(data_tuple)[source]

This connects to DynamoDB and pushes records in item_list into a table.

We accumulate a list of up to 50 elements to allow debugging of write errors.

mozetl.taar.taar_dynamo.etl(spark, run_date, region_name, table_name, prod_iam_role, sample_rate)[source]

This function is responsible for extract, transform and load.

Data is extracted from Parquet files in Amazon S3. Transforms and filters are applied to the data to create 3-tuples that are easily merged in a map-reduce fashion.

The 3-tuples are then loaded into DynamoDB using a map-reduce operation in Spark.

mozetl.taar.taar_dynamo.extract_transform(spark, run_date, sample_rate=0)[source]
mozetl.taar.taar_dynamo.filterDateAndClientID(row_jstr)[source]

Filter out any rows where the client_id is None or where the subsession_start_date is not a valid date
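A plain-Python sketch of this row filter; the real function parses a JSON string from a Spark row, and the date format assumed below is illustrative.

import datetime
import json


def keep_row(row_jstr):
    row = json.loads(row_jstr)
    # Drop rows without a client_id.
    if row.get("client_id") is None:
        return False
    # Drop rows whose subsession_start_date does not parse as a date.
    try:
        datetime.datetime.strptime(row["subsession_start_date"][:10], "%Y-%m-%d")
    except (KeyError, ValueError, TypeError):
        return False
    return True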

mozetl.taar.taar_dynamo.json_serial(obj)[source]

JSON serializer for objects not serializable by default json code

mozetl.taar.taar_dynamo.list_transformer(row_jsonstr)[source]

We need to merge two elements of the row data, namely the client_id and the start_date, into the main JSON blob.

This is then packaged into a 4-tuple:

The first integer represents the number of records that have been pushed into DynamoDB.

The second is the length of the JSON data list. This prevents us from having to compute the length of the JSON list unnecessarily.

The third element of the tuple is the list of JSON data.

The fourth element is a list of invalid JSON blobs. We maintain this to be no more than 50 elements long.
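An illustrative sketch of such 4-tuples and how a reducer might merge them; the helper names are hypothetical and the real job carries additional state.

def make_tuple(json_blob, is_valid=True):
    # (records_pushed, length_of_json_list, json_list, invalid_json_list)
    if is_valid:
        return (0, 1, [json_blob], [])
    return (0, 0, [], [json_blob])


def merge_tuples(a, b):
    # Merge counts and lists; cap the invalid list at 50 entries for debugging.
    return (
        a[0] + b[0],
        a[1] + b[1],
        a[2] + b[2],
        (a[3] + b[3])[:50],
    )


t = merge_tuples(make_tuple({"client_id": "abc"}), make_tuple("{broken", is_valid=False))
print(t)  # (0, 1, [{'client_id': 'abc'}], ['{broken'])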

mozetl.taar.taar_dynamo.load_rdd(prod_iam_role, region_name, table_name, rdd)[source]
mozetl.taar.taar_dynamo.run_etljob(spark, run_date, region_name, table_name, prod_iam_role, sample_rate)[source]

mozetl.taar.taar_lite_guidguid module

This ETL job computes the co-installation occurrence of white-listed Firefox webextensions for a sample of the longitudinal telemetry dataset.

mozetl.taar.taar_lite_guidguid.extract_telemetry(spark)[source]

Load some training data from telemetry given a sparkContext

mozetl.taar.taar_lite_guidguid.get_addons_per_client(broadcast_amo_whitelist, users_df)[source]

Extracts a DataFrame that contains one row for each client along with the list of active add-on GUIDs.

mozetl.taar.taar_lite_guidguid.get_initial_sample(spark)[source]

Takes an initial sample from the longitudinal dataset (randomly sampled from main summary). Coarse filtering on:

  • number of installed addons (greater than 1)

  • corrupt and generally weird telemetry entries

  • isolating the release channel

  • column selection

mozetl.taar.taar_lite_guidguid.is_valid_addon(broadcast_amo_whitelist, guid, addon)[source]

Filter out individual addons to exclude system addons, legacy addons, disabled addons, and sideloaded addons.

mozetl.taar.taar_lite_guidguid.key_all(a)[source]

Return (for each Row) a two-column set of Rows that contains each individual installed addon (the key_addon) as the first column and an array of GUIDs of all other addons that were seen co-installed with the key_addon as the second column. The key_addon is excluded from the second column to avoid inflated counts in later aggregation.
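A plain-Python sketch of the pairing performed per row (the real function operates on Spark Rows):

def key_all_sketch(addons):
    # For each installed addon, emit (key_addon, [all other co-installed addons]).
    return [
        (key_addon, [other for other in addons if other != key_addon])
        for key_addon in addons
    ]


print(key_all_sketch(["guid-a", "guid-b", "guid-c"]))
# [('guid-a', ['guid-b', 'guid-c']), ('guid-b', ['guid-a', 'guid-c']), ('guid-c', ['guid-a', 'guid-b'])]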

mozetl.taar.taar_lite_guidguid.load_s3(result_df, date, prefix, bucket)[source]
mozetl.taar.taar_lite_guidguid.transform(longitudinal_addons)[source]

mozetl.taar.taar_lite_guidranking module

This ETL job computes the installation rate of all addons and then cross-references against the whitelist to compute the total install rate for all whitelisted addons.

mozetl.taar.taar_lite_guidranking.extract_telemetry(sparkSession)[source]

Load some training data from telemetry given a sparkContext

mozetl.taar.taar_lite_guidranking.load_s3(result_data, date, prefix, bucket)[source]
mozetl.taar.taar_lite_guidranking.transform(frame)[source]

Convert the dataframe to JSON and augment each record to include the install count for each addon.

mozetl.taar.taar_locale module

Bug 1396549 - TAAR top addons per locale dictionary. This notebook is adapted from a gist that computes the top N addons per locale after filtering for good candidates (e.g. no unsigned, no disabled, …) [1].

[1] https://gist.github.com/mlopatka/46dddac9d063589275f06b0443fcc69d

mozetl.taar.taar_locale.compute_noisy_counts(locale_addon_counts, addon_limits, whitelist, eps=0.4)[source]

Apply DP protections to the raw per-locale add-on frequency counts.

Laplace noise is added to each of the counts. Additionally, each per-locale set of frequency counts is expanded to include every add-on in the whitelist, even if some were not observed in the raw data.

This computation is done in local memory, rather than in Spark, to simplify working with random number generation. This relies on the assumption that the number of unique locales and whitelist add-ons each remain small (on the order of 100-1000).

Parameters
  • locale_addon_counts – a Pandas DF of per-locale add-on frequency counts, with columns locale, addon, count

  • addon_limits – a dict mapping locale strings to ints representing the max number of add-ons retained per client in that locale. Any locale not present in the dict is excluded from the final dataset.

  • whitelist – a list of add-on IDs belonging to the AMO whitelist

  • eps – the DP epsilon parameter, representing the privacy budget

Returns

a DF with the same structure as locale_addon_counts. Counts may now be non-integer and negative.
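A minimal pandas/NumPy sketch of this style of Laplace-mechanism protection. The noise scale below (per-locale add-on limit divided by eps) is an illustrative choice, not necessarily the sensitivity calculation used by the job.

import numpy as np
import pandas as pd


def noisy_counts_sketch(locale_addon_counts, addon_limits, whitelist, eps=0.4, seed=None):
    rng = np.random.default_rng(seed)
    frames = []
    for locale, group in locale_addon_counts.groupby("locale"):
        if locale not in addon_limits:
            continue  # locales without a limit are dropped from the final dataset
        # Expand to every whitelisted addon, filling unobserved ones with 0.
        counts = group.set_index("addon")["count"].reindex(whitelist, fill_value=0)
        # Laplace mechanism: scale = sensitivity / eps, where the sensitivity is
        # taken here as the per-locale addon limit (an illustrative assumption).
        scale = addon_limits[locale] / eps
        noisy = counts + rng.laplace(0.0, scale, size=len(counts))
        frames.append(pd.DataFrame({
            "locale": locale, "addon": counts.index, "count": noisy.values,
        }))
    return pd.concat(frames, ignore_index=True)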

mozetl.taar.taar_locale.generate_dictionary(spark, num_addons, dataset_num_days)[source]

Compile lists of top add-ons by locale from per-client add-on data.

Runs a fresh data pull against clients_daily, computes DP-protected frequency counts, and generates a weighted list of top add-ons by locale.

Parameters
  • num_addons – number of add-on recommendations to report for each locale

  • dataset_num_days – number of days the raw data should cover

Returns

a dictionary {<locale>: [(‘GUID1’, 0.4), (‘GUID2’, 0.25), …]} as returned by get_top_addons_by_locale()

mozetl.taar.taar_locale.get_addon_limits_by_locale(client_addons_df)[source]

Determine the max number of add-ons per user in each locale.

We allow for the possibility of basing this on a summary statistic computed from the original data, in which case the limits will remain private.

Parameters

client_addons_df – a DF listing add-on IDs by client ID and locale, as generated by get_client_addons()

Returns

a dict mapping locale strings to their add-on limits

mozetl.taar.taar_locale.get_client_addons(spark, start_date, end_date=None)[source]

Returns a Spark DF listing add-ons by client_id and locale.

Only Firefox release clients are considered. The query finds each client’s most recent record in the clients_daily dataset over the given time period and returns its installed add-ons. System add-ons, disabled add-ons, and unsigned add-ons are filtered out.

Parameters
  • start_date – the earliest submission date to include (yyyymmdd)

  • end_date (optional) – the latest submission date to include (yyyymmdd)

Returns

a DF with columns locale, client_id, addon

mozetl.taar.taar_locale.get_protected_locale_addon_counts(spark, client_addons_df)[source]

Compute DP-protected per-locale add-on frequency counts.

Privacy-preserving counts are generated using the Laplace mechanism, restricting to add-ons in the AMO whitelist.

Parameters

client_addons_df – a DF listing add-on IDs by client ID and locale, as generated by get_client_addons()

Returns

a Pandas DF with columns locale, addon, count containing DP-protected counts for each whitelist add-on within each locale. Unlike the true counts, weights may be non-integer or negative.

mozetl.taar.taar_locale.get_top_addons_by_locale(addon_counts, num_addons)[source]

Generate a dictionary of top-weighted add-ons by locale.

Raw counts are normalized by converting to relative proportions.

Parameters
  • addon_counts – a Pandas DF of per-locale add-on counts, with columns locale, addon, count.

  • num_addons – requested number of recommendations.

Returns

a dictionary {<locale>: [(‘GUID1’, 0.4), (‘GUID2’, 0.25), …]}
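A minimal pandas sketch of normalizing counts into relative proportions and keeping the top N per locale, matching the documented return format:

import pandas as pd


def top_addons_by_locale_sketch(addon_counts, num_addons):
    result = {}
    for locale, group in addon_counts.groupby("locale"):
        # Normalize counts into relative proportions within the locale.
        total = group["count"].sum()
        weighted = group.assign(weight=group["count"] / total)
        top = weighted.nlargest(num_addons, "weight")
        result[locale] = list(zip(top["addon"], top["weight"].round(4)))
    return result


df = pd.DataFrame({
    "locale": ["en-US"] * 3,
    "addon": ["GUID1", "GUID2", "GUID3"],
    "count": [40.0, 25.0, 35.0],
})
print(top_addons_by_locale_sketch(df, 2))  # {'en-US': [('GUID1', 0.4), ('GUID3', 0.35)]}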

mozetl.taar.taar_locale.limit_client_addons(spark, client_addons_df, addon_limits, whitelist)[source]

Limit the number of add-ons associated with a single client ID.

This is a part of the privacy protection mechanism applied to the raw data. For each client in the dataset, we retain a randomly selected subset of their add-ons which belong to the whitelist. The max number of add-ons may differ by locale.

Parameters
  • client_addons_df – a DF listing add-on IDs by client ID and locale, as generated by get_client_addons()

  • addon_limits – a dict mapping locale strings to ints representing the max number of add-ons retained per client in that locale. Any locale not present in the dict is excluded from the final dataset.

  • whitelist – a list of add-on IDs belonging to the AMO whitelist

Returns

a DF containing a subset of the rows of client_addons_df

mozetl.taar.taar_locale.rlaplace()

laplace(loc=0.0, scale=1.0, size=None)

Draw samples from the Laplace or double exponential distribution with specified location (or mean) and scale (decay).

The Laplace distribution is similar to the Gaussian/normal distribution, but is sharper at the peak and has fatter tails. It represents the difference between two independent, identically distributed exponential random variables.

Note

New code should use the laplace method of a default_rng() instance instead; please see the NumPy random Quick Start.

Parameters
  • loc (float or array_like of floats, optional) – The position, \(\mu\), of the distribution peak. Default is 0.

  • scale (float or array_like of floats, optional) – \(\lambda\), the exponential decay. Default is 1. Must be non-negative.

  • size (int or tuple of ints, optional) – Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. If size is None (default), a single value is returned if loc and scale are both scalars. Otherwise, np.broadcast(loc, scale).size samples are drawn.

Returns

out – Drawn samples from the parameterized Laplace distribution.

Return type

ndarray or scalar

See also

Generator.laplace

which should be used for new code.

Notes

It has the probability density function

\[f(x; \mu, \lambda) = \frac{1}{2\lambda} \exp\left(-\frac{|x - \mu|}{\lambda}\right).\]

The first law of Laplace, from 1774, states that the frequency of an error can be expressed as an exponential function of the absolute magnitude of the error, which leads to the Laplace distribution. For many problems in economics and health sciences, this distribution seems to model the data better than the standard Gaussian distribution.

References

1. Abramowitz, M. and Stegun, I. A. (Eds.). “Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, 9th printing,” New York: Dover, 1972.

2. Kotz, Samuel, et al. “The Laplace Distribution and Generalizations,” Birkhauser, 2001.

3. Weisstein, Eric W. “Laplace Distribution.” From MathWorld–A Wolfram Web Resource. http://mathworld.wolfram.com/LaplaceDistribution.html

4. Wikipedia, “Laplace distribution”, https://en.wikipedia.org/wiki/Laplace_distribution

Examples

Draw samples from the distribution

>>> import numpy as np
>>> loc, scale = 0., 1.
>>> s = np.random.laplace(loc, scale, 1000)

Display the histogram of the samples, along with the probability density function:

>>> import matplotlib.pyplot as plt
>>> count, bins, ignored = plt.hist(s, 30, density=True)
>>> x = np.arange(-8., 8., .01)
>>> pdf = np.exp(-abs(x-loc)/scale)/(2.*scale)
>>> plt.plot(x, pdf)

Plot Gaussian for comparison:

>>> g = (1/(scale * np.sqrt(2 * np.pi)) *
...      np.exp(-(x - loc)**2 / (2 * scale**2)))
>>> plt.plot(x,g)

mozetl.taar.taar_similarity module

Bug 1386274 - TAAR similarity-based add-on donor list

This job clusters users into different groups based on their active add-ons. A representative sample of users (“donors”) is selected from each cluster and saved to a model file along with a feature vector that will be used by the TAAR library module to perform recommendations.

mozetl.taar.taar_similarity.compute_clusters(addons_df, num_clusters, random_seed)[source]

Performs user clustering by using add-on ids as features.
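A hedged Spark ML sketch of clustering clients on their add-on GUIDs; the CountVectorizer featurization and column names below are assumptions about the job, not its actual implementation.

from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import CountVectorizer


def compute_clusters_sketch(addons_df, num_clusters, random_seed):
    # addons_df is assumed to have columns: client_id, addon_ids (array<string>).
    vectorizer = CountVectorizer(inputCol="addon_ids", outputCol="features")
    features_df = vectorizer.fit(addons_df).transform(addons_df)
    kmeans = KMeans(k=num_clusters, seed=random_seed,
                    featuresCol="features", predictionCol="prediction")
    # Returns the input rows with an added cluster assignment column.
    return kmeans.fit(features_df).transform(features_df)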

mozetl.taar.taar_similarity.format_donors_dictionary(donors_df)[source]
mozetl.taar.taar_similarity.generate_non_cartesian_pairs(first_rdd, second_rdd)[source]
mozetl.taar.taar_similarity.get_addons_per_client(users_df, addon_whitelist, minimum_addons_count)[source]

Extracts a DataFrame that contains one row for each client along with the list of active add-on GUIDs.

mozetl.taar.taar_similarity.get_donor_pools(users_df, clusters_df, num_donors, random_seed=None)[source]

Samples users from each cluster.

mozetl.taar.taar_similarity.get_donors(spark, num_clusters, num_donors, addon_whitelist, date_from, random_seed=None)[source]
mozetl.taar.taar_similarity.get_lr_curves(spark, features_df, cluster_ids, kernel_bandwidth, num_pdf_points, random_seed=None)[source]

Compute the likelihood ratio curves for clustered clients.

The work-flow followed in this function is:

  • Access the DataFrame including cluster numbers and features.

  • Load the same similarity function that will be used in the TAAR module.

  • Iterate through each cluster and compute in-cluster similarity.

  • Iterate through each cluster and compute out-cluster similarity.

  • Compute the kernel density estimate (KDE) per similarity score.

  • Linearly down-sample both PDFs to 1000 points.

Parameters
  • spark – the SparkSession object.

  • features_df – the DataFrame containing the user features (e.g. the ones coming from |get_donors|).

  • cluster_ids – the list of cluster ids (e.g. the one coming from |get_donors|).

  • kernel_bandwidth – the kernel bandwidth used to estimate the kernel densities.

  • num_pdf_points – the number of points to sample for the LR-curves.

  • random_seed – the provided random seed (fixed in tests).

Returns

A list in the following format [(idx, (lr-numerator-for-idx, lr-denominator-for-idx)), (…), …]
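A compact sketch of the final KDE and down-sampling steps, given flat arrays of in-cluster and out-cluster similarity scores; the handling of kernel_bandwidth and num_pdf_points below is illustrative only.

import numpy as np
from scipy.stats import gaussian_kde


def lr_curves_sketch(in_cluster_scores, out_cluster_scores,
                     kernel_bandwidth, num_pdf_points):
    # Fit one kernel density estimate per population of similarity scores.
    numerator_kde = gaussian_kde(in_cluster_scores, bw_method=kernel_bandwidth)
    denominator_kde = gaussian_kde(out_cluster_scores, bw_method=kernel_bandwidth)
    # Down-sample both PDFs on a common, linearly spaced grid of scores.
    lo = min(np.min(in_cluster_scores), np.min(out_cluster_scores))
    hi = max(np.max(in_cluster_scores), np.max(out_cluster_scores))
    grid = np.linspace(lo, hi, num_pdf_points)
    return [(float(x), (float(n), float(d)))
            for x, n, d in zip(grid, numerator_kde(grid), denominator_kde(grid))]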

mozetl.taar.taar_similarity.get_samples(spark, date_from)[source]

Get a DataFrame with a valid set of samples to base the next processing on.

The sample is limited to submissions received since date_from, keeping the latest row for each client.

Reference documentation is found here:

Firefox Clients Daily telemetry table https://docs.telemetry.mozilla.org/datasets/batch_view/clients_daily/reference.html

BUG 1485152: PR include active_addons to clients_daily table: https://github.com/mozilla/telemetry-batch-view/pull/490

mozetl.taar.taar_similarity.similarity_function(x, y)[source]

Similarity function for comparing user features.

This really should be implemented in taar.similarity_recommender and then imported here for consistency.

mozetl.taar.taar_similarity.today_minus_90_days()[source]

mozetl.taar.taar_utils module

mozetl.taar.taar_utils.hash_telemetry_id(telemetry_id)[source]
This hashing function is a reference implementation based on:

https://phabricator.services.mozilla.com/D8311

mozetl.taar.taar_utils.load_amo_curated_whitelist()[source]

Return the curated whitelist of addon GUIDs

mozetl.taar.taar_utils.load_amo_external_whitelist()[source]

Download and parse the AMO add-on whitelist.

Raises

RuntimeError – the AMO whitelist file cannot be downloaded or contains no valid add-ons.
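A minimal sketch of this download-and-validate pattern; the URL and the JSON layout assumed below are placeholders, not the job's actual source.

import json
import urllib.request

AMO_DUMP_URL = "https://example.com/addons/extended_addons_database.json"  # hypothetical


def load_whitelist_sketch(url=AMO_DUMP_URL):
    try:
        with urllib.request.urlopen(url) as response:
            amo_dump = json.load(response)
    except Exception as exc:
        raise RuntimeError("Cannot download the AMO whitelist") from exc
    # Keep only entries that look like valid, current addons (assumed layout:
    # a dict keyed by GUID with a current_version field per addon).
    whitelist = [guid for guid, addon in amo_dump.items()
                 if addon.get("current_version")]
    if not whitelist:
        raise RuntimeError("Cannot find valid add-ons in the AMO whitelist")
    return whitelist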

mozetl.taar.taar_utils.read_from_s3(s3_dest_file_name, s3_prefix, bucket)[source]

Read JSON from an S3 bucket and return the decoded JSON blob

mozetl.taar.taar_utils.selfdestructing_path(dirname)[source]
mozetl.taar.taar_utils.store_json_to_s3(json_data, base_filename, date, prefix, bucket)[source]

Saves the JSON data to a local file and then uploads it to S3.

Two copies of the file are uploaded: one as “<base_filename>.json” and the other as “<base_filename><YYYYMMDD>.json” for backup purposes.

Parameters
  • json_data – A string with the JSON content to write.

  • base_filename – A string with the base name of the file to use for saving locally and uploading to S3.

  • date – A date string in the “YYYYMMDD” format.

  • prefix – The S3 prefix.

  • bucket – The S3 bucket name.
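A minimal boto3 sketch of the two-copy upload described above; uploading via put_object (rather than staging through a local file as the docstring describes) is a simplification, and the example bucket and prefix are hypothetical.

import boto3


def store_json_to_s3_sketch(json_data, base_filename, date, prefix, bucket):
    client = boto3.client("s3")
    # Upload the "current" copy and a dated backup copy of the same content.
    for name in (base_filename, "{}{}".format(base_filename, date)):
        key = "{}{}.json".format(prefix, name)
        client.put_object(Bucket=bucket, Key=key, Body=json_data.encode("utf8"))


# Example usage (hypothetical bucket/prefix):
# store_json_to_s3_sketch('{"a": 1}', "top_addons", "20180218", "taar/locale/", "my-bucket")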

mozetl.taar.taar_utils.write_to_s3(source_file_name, s3_dest_file_name, s3_prefix, bucket)[source]

Store the new json file containing current top addons per locale to S3.

Parameters
  • source_file_name – The name of the local source file.

  • s3_dest_file_name – The name of the destination file on S3.

  • s3_prefix – The S3 prefix in the bucket.

  • bucket – The S3 bucket.

Module contents