retention table provides client counts relevant to client retention at a
1-day granularity. The project is tracked in Bug 1381840
retention table contains a set of attribute columns used to specify a
cohort of users and a set of metric columns to describe cohort activity. Each
row contains a permutation of attributes, an approximate set of clients in a
cohort, and the aggregate engagement metrics.
This table uses the HyperLogLog (HLL) sketch to create an approximate set of
clients in a cohort. HLL allows counting across overlapping cohorts in a single
pass while avoiding the problem of double counting. This data-structure has the
benefit of being compact and performant in the context of retention analysis,
at the expense of precision. For example, calculating a 7-day retention period
can be obtained by aggregating over a week of retention data using the union
operation. With SQL primitive, this requires a recalculation of COUNT DISTINCT
client_id's in the 7-day window.
- The data starts at 2017-03-06, the merge date where Nightly started to
track Firefox 55 in Mozilla-Central. However, there was
not a consistent view into the behavior of first session profiles until the
new_profileping. This means much of the data is inaccurate before 2017-06-26.
- This dataset uses 4 day reporting latency to aggregate at least 99% of the data in a given submission date. This figure is derived from the telemetry-health measurements on submission latency, with the discussion in Bug 1407410. This latency metric was reduced Firefox 55 with the introduction of the shutdown ping-sender mechanism.
- Caution should be taken before adding new columns. Additional attribute columns will grow the number of rows exponentially.
- The number of HLL bits chosen for this dataset is 13. This means the default size of the HLL object is 2^13 bits or 1KiB. This maintains about a 1% error on average. See this table from Algebird's HLL implementation for more details.
The data is primarily available through Re:dash on STMO via the Presto source. This service has been configured to use predefined HLL functions.
The column should first be cast to the HLL type. The scalar
cardinality(<hll_column>) function will approximate the number of unique
items per HLL object. The aggregate
merge(<hll_column>) function will perform
the set union between all objects in a column.
Example: Cast the count column into the appropriate type.
SELECT cast(hll as HLL) as n_profiles_hll FROM retention
Count the number of clients seen over all attribute combinations.
SELECT cardinality(cast(hll as HLL)) FROM retention
Group-by and aggregate client counts over different release channels.
SELECT channel, cardinality(merge(cast(hll AS HLL)) FROM retention GROUP BY channel
Also see the
See the Example Usage Dashboard for more usages of datasets of the same shape.
The job is scheduled on Airflow on a daily basis after
main_summary is run
for the day. This job requires both
As of 2017-10-10, the current version of
v1 and has a schema
root |-- subsession_start: string (nullable = true) |-- profile_creation: string (nullable = true) |-- days_since_creation: long (nullable = true) |-- channel: string (nullable = true) |-- app_version: string (nullable = true) |-- geo: string (nullable = true) |-- distribution_id: string (nullable = true) |-- is_funnelcake: boolean (nullable = true) |-- source: string (nullable = true) |-- medium: string (nullable = true) |-- content: string (nullable = true) |-- sync_usage: string (nullable = true) |-- is_active: boolean (nullable = true) |-- hll: binary (nullable = true) |-- usage_hours: double (nullable = true) |-- sum_squared_usage_hours: double (nullable = true) |-- total_uri_count: long (nullable = true) |-- unique_domains_count: double (nullable = true)
The runner script performs all the necessary setup to run on EMR. This script can be used to perform backfill.