mozetl.clientsdaily package

Submodules

mozetl.clientsdaily.fields module

mozetl.clientsdaily.fields.agg_first(field_name)[source]
mozetl.clientsdaily.fields.agg_max(field_name, alias=None)[source]
mozetl.clientsdaily.fields.agg_mean(field_name, alias=None)[source]
mozetl.clientsdaily.fields.agg_sum(field_name, alias=None, expression=None)[source]
mozetl.clientsdaily.fields.get_alias(field_name, alias, kind)[source]
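The fields helpers carry no docstrings here. A minimal sketch of how get_alias and the agg_* builders might compose, assuming they emit Spark SQL aggregation snippets as strings (the real implementation may instead return pyspark Column objects; the naming scheme shown is an assumption):

```python
def get_alias(field_name, alias, kind):
    """Pick the output column name: an explicit alias, else '<field>_<kind>'."""
    return alias if alias else "{}_{}".format(field_name, kind)

def agg_sum(field_name, alias=None, expression=None):
    """Build a SUM over the field, or over an arbitrary SQL expression."""
    target = expression if expression else field_name
    return "SUM({}) AS {}".format(target, get_alias(field_name, alias, "sum"))

def agg_mean(field_name, alias=None):
    """Build an AVG over the field."""
    return "AVG({}) AS {}".format(field_name, get_alias(field_name, alias, "mean"))
```

For example, `agg_sum("active_ticks", alias="active_hours", expression="active_ticks / 720.0")` would yield `"SUM(active_ticks / 720.0) AS active_hours"`, while `agg_mean("places_pages_count")` falls back to the derived alias `places_pages_count_mean`.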

mozetl.clientsdaily.rollup module

mozetl.clientsdaily.rollup.extract_search_counts(frame)[source]

:param frame: DataFrame conforming to main_summary’s schema.

:returns: one row for each row in frame, replacing the nullable array-of-structs column “search_counts” with seven columns named “search_count_{access_point}_sum”: one for each valid SEARCH_ACCESS_POINT, plus one named “all”, which is always the sum of the other six.

All seven columns default to 0 and are 0 if search_counts was NULL. Note that the Mozilla term of art “search access point”, which refers to GUI elements, is named “source” in main_summary.

This routine is hairy because it generates a lot of SQL and Spark pseudo-SQL; see inline comments.
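The per-row column logic can be illustrated without Spark. A plain-Python sketch of the transformation described above (the access-point list and the dict-shaped input rows are assumptions; the real routine builds the equivalent as generated Spark SQL over the whole frame):

```python
# Assumed set of valid "source" values in main_summary.
SEARCH_ACCESS_POINTS = [
    "abouthome", "contextmenu", "newtab", "searchbar", "system", "urlbar"
]

def extract_search_counts(search_counts):
    """Collapse a nullable list of {source, count} structs into seven sums."""
    sums = {"search_count_{}_sum".format(sap): 0 for sap in SEARCH_ACCESS_POINTS}
    total = 0
    for entry in (search_counts or []):  # NULL column -> all columns stay 0
        sap = entry.get("source")
        count = entry.get("count") or 0
        if sap in SEARCH_ACCESS_POINTS:
            sums["search_count_{}_sum".format(sap)] += count
            total += count
    # "all" is always the sum of the other six columns.
    sums["search_count_all_sum"] = total
    return sums
```

A NULL search_counts yields seven zero-valued columns, and entries with an unrecognized source are excluded from both the per-source columns and the “all” total.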

mozetl.clientsdaily.rollup.get_partition_count_for_writing(is_sampled)[source]

Return a reasonable partition count.

:param is_sampled: boolean. One day is O(140MB) when filtering down to a single sample_id, but O(14GB) if not. Google reports 256MB < partition size < 1GB as ideal.
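The sizing heuristic above can be made concrete. A sketch using the stated sizes (~140MB sampled, ~14GB unsampled) and a target partition size in the middle of the ideal 256MB–1GB range; the exact constants and the returned counts are assumptions, not the library’s actual values:

```python
DAY_SIZE_SAMPLED_MB = 140      # one day filtered to a single sample_id (assumed)
DAY_SIZE_FULL_MB = 14 * 1024   # one full, unsampled day (assumed)
TARGET_PARTITION_MB = 512      # midpoint of the ideal 256MB-1GB range

def get_partition_count_for_writing(is_sampled):
    """Pick a partition count that keeps partitions near the target size."""
    day_size = DAY_SIZE_SAMPLED_MB if is_sampled else DAY_SIZE_FULL_MB
    return max(1, day_size // TARGET_PARTITION_MB)
```

With these constants, a sampled day writes to a single partition, and a full day writes to 28 partitions of roughly 512MB each.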

mozetl.clientsdaily.rollup.load_main_summary(spark, input_bucket, input_prefix)[source]
mozetl.clientsdaily.rollup.to_profile_day_aggregates(frame_with_extracts)[source]
mozetl.clientsdaily.rollup.write_one_activity_day(results, date, output_prefix, partition_count)[source]

Module contents