mozetl.clientsdaily package

Submodules

mozetl.clientsdaily.fields module

mozetl.clientsdaily.fields.agg_first(field_name)[source]
mozetl.clientsdaily.fields.agg_max(field_name, alias=None)[source]
mozetl.clientsdaily.fields.agg_mean(field_name, alias=None)[source]
mozetl.clientsdaily.fields.agg_sum(field_name, alias=None, expression=None)[source]
mozetl.clientsdaily.fields.get_alias(field_name, alias, kind)[source]
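The fields helpers carry no docstrings here. A minimal sketch of how get_alias and the agg_* builders might compose, assuming they emit Spark SQL aggregation snippets as strings (the real implementation may instead return pyspark Column objects; the naming scheme shown is an assumption):

```python
def get_alias(field_name, alias, kind):
    """Pick the output column name: an explicit alias, else '<field>_<kind>'."""
    return alias if alias else "{}_{}".format(field_name, kind)

def agg_sum(field_name, alias=None, expression=None):
    """Build a SUM over the field, or over an arbitrary SQL expression."""
    target = expression if expression else field_name
    return "SUM({}) AS {}".format(target, get_alias(field_name, alias, "sum"))

def agg_mean(field_name, alias=None):
    """Build an AVG over the field."""
    return "AVG({}) AS {}".format(field_name, get_alias(field_name, alias, "mean"))
```

For example, `agg_sum("active_ticks", alias="active_hours", expression="active_ticks / 720.0")` would yield `"SUM(active_ticks / 720.0) AS active_hours"`, while `agg_mean("places_pages_count")` falls back to the derived alias `places_pages_count_mean`.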

mozetl.clientsdaily.rollup module

mozetl.clientsdaily.rollup.extract_search_counts(frame)[source]

:param frame: DataFrame conforming to main_summary’s schema.

:returns: one row for each row in frame, replacing the nullable array-of-structs column “search_counts” with seven columns named “search_count_{access_point}_sum”: one for each valid SEARCH_ACCESS_POINT, plus one named “all”, which is always the sum of the other six.

All seven columns default to 0 and are 0 if search_counts was NULL. Note that the Mozilla term of art “search access point”, which refers to GUI elements, is named “source” in main_summary.

This routine is hairy because it generates a lot of SQL and Spark pseudo-SQL; see inline comments.
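The per-row column logic can be illustrated without Spark. A plain-Python sketch of the transformation described above (the access-point list and the dict-shaped input rows are assumptions; the real routine builds the equivalent as generated Spark SQL over the whole frame):

```python
# Assumed set of valid "source" values in main_summary.
SEARCH_ACCESS_POINTS = [
    "abouthome", "contextmenu", "newtab", "searchbar", "system", "urlbar"
]

def extract_search_counts(search_counts):
    """Collapse a nullable list of {source, count} structs into seven sums."""
    sums = {"search_count_{}_sum".format(sap): 0 for sap in SEARCH_ACCESS_POINTS}
    total = 0
    for entry in (search_counts or []):  # NULL column -> all columns stay 0
        sap = entry.get("source")
        count = entry.get("count") or 0
        if sap in SEARCH_ACCESS_POINTS:
            sums["search_count_{}_sum".format(sap)] += count
            total += count
    # "all" is always the sum of the other six columns.
    sums["search_count_all_sum"] = total
    return sums
```

A NULL search_counts yields seven zero-valued columns, and entries with an unrecognized source are excluded from both the per-source columns and the “all” total.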

mozetl.clientsdaily.rollup.get_partition_count_for_writing(is_sampled)[source]

Return a reasonable partition count.

:param is_sampled: boolean. One day is O(140MB) when filtering down to a single sample_id, but O(14GB) if not. Google reports 256MB < partition size < 1GB as ideal.
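The sizing heuristic above can be made concrete. A sketch using the stated sizes (~140MB sampled, ~14GB unsampled) and a target partition size in the middle of the ideal 256MB–1GB range; the exact constants and the returned counts are assumptions, not the library’s actual values:

```python
DAY_SIZE_SAMPLED_MB = 140      # one day filtered to a single sample_id (assumed)
DAY_SIZE_FULL_MB = 14 * 1024   # one full, unsampled day (assumed)
TARGET_PARTITION_MB = 512      # midpoint of the ideal 256MB-1GB range

def get_partition_count_for_writing(is_sampled):
    """Pick a partition count that keeps partitions near the target size."""
    day_size = DAY_SIZE_SAMPLED_MB if is_sampled else DAY_SIZE_FULL_MB
    return max(1, day_size // TARGET_PARTITION_MB)
```

With these constants, a sampled day writes to a single partition, and a full day writes to 28 partitions of roughly 512MB each.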

mozetl.clientsdaily.rollup.load_main_summary(spark, input_bucket, input_prefix)[source]
mozetl.clientsdaily.rollup.to_profile_day_aggregates(frame_with_extracts)[source]
mozetl.clientsdaily.rollup.write_one_activity_day(results, date, output_prefix, partition_count)[source]

Module contents