- Data Reference
The churn dataset tracks the 7-day churn rate of telemetry profiles. This dataset is generally used for analyzing cohort churn across segments and time.
Churn is the rate of attrition defined by
(clients seen in week N)/(clients seen in week 0)
for groups of clients with some shared attributes. A group of clients with
shared attributes is called a cohort. The cohorts in this dataset are created
every week and can be tracked over time using the
acquisition_date and the
weeks since acquisition or
The following example demonstrates the current logic for generating this dataset. Each column represents the days since some arbitrary starting date.
All three clients are part of the same cohort. Client A is retained for weeks 0 and 1 since there is activity in both periods. A client only needs to show up once in the period to be counted as retained. Client B is acquired in week 0 and is active frequently but does not appear in following weeks. Client B is considered churned on week 1. However, a client that is churned can become retained again. Client C is considered churned on week 1 but retained on week 2.
The following table summarizes the above daily activity into the following view where every column represents the current week since acquisition date..
The clients are then grouped into cohorts by attributes. An attribute describes a property about the cohort such as the country of origin or the binary distribution channel. Each group also contains descriptive aggregates of engagement. Each metric describes the activity of a cohort such as size and overall usage at a given time instance.
Background and Caveats
The original concept for churn is captured in this Mana
The original derived data-set was created in bug
1198537. The first
major revision (
this data-set added attribution, search, and uri counts. The second major
additional clients through the
new-profile ping and adjusted the collection
window from 10 to 5 days.
- Each row in this dataset describes a unique segment of users
- The number of rows is exponential with the number of dimensions
- New fields should be added sparing to account for data-set size
- The dataset lags by 10 days in order account for submission latency
- This value was determined to be time for 99% of main pings to arrive at the
server. With the shutdown-ping sender, this has been reduced to 4 days.
churn_v3still tracks releases older than Firefox 55.
- This value was determined to be time for 99% of main pings to arrive at the server. With the shutdown-ping sender, this has been reduced to 4 days. However,
- The start of the period is fixed to Sundays. Once it has been aggregated, the
period cannot be shifted due to the way clients are counted.
- A supplementary 1-day
retentiondataset using HyperLogLog for client counts is available for counting over arbitrary retention periods and date offsets. Additionally, calculating churn or retention over specific cohorts is tractable in STMO with
- A supplementary 1-day
Accessing the Data
churn is available in Re:dash under Athena and Presto. The data is also
available in parquet for consumption by columnar data engines at
This section walks through a typical query to generate data suitable for visualization.
| ||common, attribute||The start date bucket of the cohort. This is week the client was acquired.|
| ||common, attribute||The number of periods that have elapsed since the cohort date. In this dataset, the retention period is 7 days.|
| ||attribute||Part of the release train model. An attribute that distinguishes cohorts.|
| ||filter attribute||Country code. Used to filter out all countries other than the 'US'|
| ||metric||Count of users in a cohort. Use sum to aggregate.|
First the fields are extracted and aliased for consistency.
elapsed_periods are common to most retention queries and are useful concepts
for building on other datasets.
WITH extracted AS ( SELECT acquisition_period AS cohort_date, current_week AS elapsed_periods, n_profiles, channel, geo FROM churn ),
The extracted table is filtered down to the attributes of interest. The cohorts
of interest originate in the US and are in the release or beta channels. Note
channel here is the concatenation of the normalized channel and the
funnelcake id. Only cohorts appearing after August 6, 2017 are chosen to be in
population AS ( SELECT channel, cohort_date, elapsed_periods, n_profiles FROM extracted WHERE geo = 'US' AND channel IN ('release', 'beta') AND cohort_date > '20170806' -- filter out noise from clients with incorrect dates AND elapsed_periods >= 0 AND elapsed_periods < 12 ),
The number of profiles is aggregated by the cohort dimensions. The cohort acquisition date and elapsed periods since acquisition are fundamental to cohort analysis.
cohorts AS ( SELECT channel, cohort_date, elapsed_periods, sum(n_profiles) AS n_profiles FROM population GROUP BY 1, 2, 3 ),
The table will have the following structure. The table is sorted by the first three columns for demonstration.
| || || || |
Finally, retention is calculated through the number of profiles at the time of
elapsed_period relative to the initial period. This data can be imported
into a pivot table for further analysis.
results AS ( SELECT c.*, iv.n_profiles AS total_n_profiles, (0.0+c.n_profiles)*100/iv.n_profiles AS percentage_n_profiles FROM cohorts c JOIN ( SELECT * FROM cohorts WHERE elapsed_periods = 0 ) iv ON ( c.cohort_date = iv.cohort_date AND c.channel = iv.channel ) )
| || || || || || |
Obtain the results.
SELECT * FROM results
You may consider visualizing using cohort graphs, line charts, or a pivot tables. See Firefox Telemetry Retention: Dataset Example Usage for more examples.
The aggregated churn data is updated weekly on Wednesday.
As of 2017-10-15, the current version of
v3 and has a schema as follows:
root |-- channel: string (nullable = true) |-- geo: string (nullable = true) |-- is_funnelcake: string (nullable = true) |-- acquisition_period: string (nullable = true) |-- start_version: string (nullable = true) |-- sync_usage: string (nullable = true) |-- current_version: string (nullable = true) |-- current_week: long (nullable = true) |-- source: string (nullable = true) |-- medium: string (nullable = true) |-- campaign: string (nullable = true) |-- content: string (nullable = true) |-- distribution_id: string (nullable = true) |-- default_search_engine: string (nullable = true) |-- locale: string (nullable = true) |-- is_active: string (nullable = true) |-- n_profiles: long (nullable = true) |-- usage_hours: double (nullable = true) |-- sum_squared_usage_hours: double (nullable = true) |-- total_uri_count: long (nullable = true) |-- unique_domains_count_per_profile: double (nullable = true)