Churn

As of 2019-08-21, this dataset has been deprecated and is no longer maintained. See Bug 1561048 for historical sources.

Introduction
Data Reference

Introduction

The churn dataset tracks the 7-day churn rate of telemetry profiles. This dataset is generally used for analyzing cohort churn across segments and time.

Content

Churn is the rate of attrition defined by (clients seen in week N)/(clients seen in week 0) for groups of clients with some shared attributes. A group of clients with shared attributes is called a cohort. The cohorts in this dataset are created every week and can be tracked over time using the acquisition_date and the weeks since acquisition or current_week.

The following example demonstrates the current logic for generating this dataset. Each column represents the days since some arbitrary starting date.

client	00	01	02	03	04	05	06	07	14
A	X							X
B		X	X	X	X	X	X
C	X								X

All three clients are part of the same cohort. Client A is retained for weeks 0 and 1 since there is activity in both periods. A client only needs to show up once in the period to be counted as retained. Client B is acquired in week 0 and is active frequently but does not appear in following weeks. Client B is considered churned on week 1. However, a client that is churned can become retained again. Client C is considered churned on week 1 but retained on week 2.

The following table summarizes the above daily activity into the following view where every column represents the current week since acquisition date..

client	0	1	2
A	X	X
B	X
C	X		X

The clients are then grouped into cohorts by attributes. An attribute describes a property about the cohort such as the country of origin or the binary distribution channel. Each group also contains descriptive aggregates of engagement. Each metric describes the activity of a cohort such as size and overall usage at a given time instance.

Background and Caveats

The original concept for churn is captured in this Mana page. The original derived data-set was created in bug 1198537. The first major revision (v2) of this data-set added attribution, search, and uri counts. The second major revision (v3) included additional clients through the new-profile ping and adjusted the collection window from 10 to 5 days.

Each row in this dataset describes a unique segment of users
- The number of rows is exponential with the number of dimensions
- New fields should be added sparing to account for data-set size
The dataset lags by 10 days in order account for submission latency
- This value was determined to be time for 99% of main pings to arrive at the server. With the shutdown-ping sender, this has been reduced to 4 days. However, churn_v3 still tracks releases older than Firefox 55.
The start of the period is fixed to Sundays. Once it has been aggregated, the period cannot be shifted due to the way clients are counted.
- A supplementary 1-day retention dataset using HyperLogLog for client counts is available for counting over arbitrary retention periods and date offsets. Additionally, calculating churn or retention over specific cohorts is tractable in STMO with main_summary or clients_daily datasets.

Accessing the Data

churn is available in STMO under Athena and Presto. The data is also available in parquet for consumption by columnar data engines at s3://telemetry-parquet/churn/v3.

Data Reference

Example Queries

This section walks through a typical query to generate data suitable for visualization.

field	type	description
`cohort_date`	common, attribute	The start date bucket of the cohort. This is week the client was acquired.
`elapsed_periods`	common, attribute	The number of periods that have elapsed since the cohort date. In this dataset, the retention period is 7 days.
`channel`	attribute	Part of the release train model. An attribute that distinguishes cohorts.
`geo`	filter attribute	Country code. Used to filter out all countries other than the 'US'
`n_profiles`	metric	Count of users in a cohort. Use sum to aggregate.

First the fields are extracted and aliased for consistency. cohort_date and elapsed_periods are common to most retention queries and are useful concepts for building on other datasets.

WITH extracted AS (
    SELECT acquisition_period AS cohort_date,
           current_week AS elapsed_periods,
           n_profiles,
           channel,
           geo
    FROM churn
),

The extracted table is filtered down to the attributes of interest. The cohorts of interest originate in the US and are in the release or beta channels. Note that channel here is the concatenation of the normalized channel and the funnelcake id. Only cohorts appearing after August 6, 2017 are chosen to be in this population.

 population AS (
    SELECT channel,
           cohort_date,
           elapsed_periods,
           n_profiles
    FROM extracted
    WHERE geo = 'US'
      AND channel IN ('release', 'beta')
      AND cohort_date > '20170806'
      -- filter out noise from clients with incorrect dates
      AND elapsed_periods >= 0
      AND elapsed_periods < 12
),

The number of profiles is aggregated by the cohort dimensions. The cohort acquisition date and elapsed periods since acquisition are fundamental to cohort analysis.

 cohorts AS (
     SELECT channel,
            cohort_date,
            elapsed_periods,
            sum(n_profiles) AS n_profiles
     FROM population
     GROUP BY 1, 2, 3
),

The table will have the following structure. The table is sorted by the first three columns for demonstration.

`channel`	`cohort_date`	`elapsed_periods`	`n_profiles`
release	20170101	0	100
release	20170101	1	90
release	20170101	2	80
...	...	...	...
beta	20170128	10	25

Finally, retention is calculated through the number of profiles at the time of the elapsed_period relative to the initial period. This data can be imported into a pivot table for further analysis.

results AS (
    SELECT c.*,
           iv.n_profiles AS total_n_profiles,
           (0.0+c.n_profiles)*100/iv.n_profiles AS percentage_n_profiles
    FROM cohorts c
    JOIN (
        SELECT *
        FROM cohorts
        WHERE elapsed_periods = 0
    ) iv ON (
        c.cohort_date = iv.cohort_date
        AND c.channel = iv.channel
    )
)

`channel`	`cohort_date`	`elapsed_periods`	`n_profiles`	`total_n_profiles`	`percentage_n_profiles`
release	20170101	0	100	100	1.0
release	20170101	1	90	100	0.9
release	20170101	2	80	100	0.8
...	....	...	...	...	...
beta	20170128	10	25	50	0.5

Obtain the results.

SELECT *
FROM results

You may consider visualizing using cohort graphs, line charts, or a pivot tables. See Firefox Telemetry Retention: Dataset Example Usage for more examples.

Scheduling

The aggregated churn data is updated weekly on Wednesday.

Schema

As of 2017-10-15, the current version of churn is v3 and has a schema as follows:

root
 |-- channel: string (nullable = true)
 |-- geo: string (nullable = true)
 |-- is_funnelcake: string (nullable = true)
 |-- acquisition_period: string (nullable = true)
 |-- start_version: string (nullable = true)
 |-- sync_usage: string (nullable = true)
 |-- current_version: string (nullable = true)
 |-- current_week: long (nullable = true)
 |-- source: string (nullable = true)
 |-- medium: string (nullable = true)
 |-- campaign: string (nullable = true)
 |-- content: string (nullable = true)
 |-- distribution_id: string (nullable = true)
 |-- default_search_engine: string (nullable = true)
 |-- locale: string (nullable = true)
 |-- is_active: string (nullable = true)
 |-- n_profiles: long (nullable = true)
 |-- usage_hours: double (nullable = true)
 |-- sum_squared_usage_hours: double (nullable = true)
 |-- total_uri_count: long (nullable = true)
 |-- unique_domains_count_per_profile: double (nullable = true)

Code Reference

The script for generating churn currently lives in mozilla/python_mozetl. The job can be found in mozetl/engagement/churn.

Mozilla Data Documentation