Churn

As of 2019-08-21, this dataset has been deprecated and is no longer maintained. See Bug 1561048 for historical sources.

Introduction

The churn dataset tracks the 7-day churn rate of telemetry profiles. This dataset is generally used for analyzing cohort churn across segments and time.

Content

Churn is the rate of attrition defined by (clients seen in week N)/(clients seen in week 0) for groups of clients with some shared attributes. A group of clients with shared attributes is called a cohort. The cohorts in this dataset are created every week and can be tracked over time using the acquisition_date and the weeks since acquisition or current_week.

The following example demonstrates the current logic for generating this dataset. Each column represents the days since some arbitrary starting date.

client000102030405060708091011121314
AXX
BXXXXXX
CXX

All three clients are part of the same cohort. Client A is retained for weeks 0 and 1 since there is activity in both periods. A client only needs to show up once in the period to be counted as retained. Client B is acquired in week 0 and is active frequently but does not appear in following weeks. Client B is considered churned on week 1. However, a client that is churned can become retained again. Client C is considered churned on week 1 but retained on week 2.

The following table summarizes the above daily activity into the following view where every column represents the current week since acquisition date..

client012
AXX
BX
CXX

The clients are then grouped into cohorts by attributes. An attribute describes a property about the cohort such as the country of origin or the binary distribution channel. Each group also contains descriptive aggregates of engagement. Each metric describes the activity of a cohort such as size and overall usage at a given time instance.

Background and Caveats

The original concept for churn is captured in this Mana page. The original derived data-set was created in bug 1198537. The first major revision (v2) of this data-set added attribution, search, and uri counts. The second major revision (v3) included additional clients through the new-profile ping and adjusted the collection window from 10 to 5 days.

  • Each row in this dataset describes a unique segment of users
    • The number of rows is exponential with the number of dimensions
    • New fields should be added sparing to account for data-set size
  • The dataset lags by 10 days in order account for submission latency
    • This value was determined to be time for 99% of main pings to arrive at the server. With the shutdown-ping sender, this has been reduced to 4 days. However, churn_v3 still tracks releases older than Firefox 55.
  • The start of the period is fixed to Sundays. Once it has been aggregated, the period cannot be shifted due to the way clients are counted.
    • A supplementary 1-day retention dataset using HyperLogLog for client counts is available for counting over arbitrary retention periods and date offsets. Additionally, calculating churn or retention over specific cohorts is tractable in STMO with main_summary or clients_daily datasets.

Accessing the Data

churn is available in STMO under Athena and Presto. The data is also available in parquet for consumption by columnar data engines at s3://telemetry-parquet/churn/v3.

Data Reference

Example Queries

This section walks through a typical query to generate data suitable for visualization.

fieldtypedescription
cohort_datecommon, attributeThe start date bucket of the cohort. This is week the client was acquired.
elapsed_periodscommon, attributeThe number of periods that have elapsed since the cohort date. In this dataset, the retention period is 7 days.
channelattributePart of the release train model. An attribute that distinguishes cohorts.
geofilter attributeCountry code. Used to filter out all countries other than the 'US'
n_profilesmetricCount of users in a cohort. Use sum to aggregate.

First the fields are extracted and aliased for consistency. cohort_date and elapsed_periods are common to most retention queries and are useful concepts for building on other datasets.

WITH extracted AS (
    SELECT acquisition_period AS cohort_date,
           current_week AS elapsed_periods,
           n_profiles,
           channel,
           geo
    FROM churn
),

The extracted table is filtered down to the attributes of interest. The cohorts of interest originate in the US and are in the release or beta channels. Note that channel here is the concatenation of the normalized channel and the funnelcake id. Only cohorts appearing after August 6, 2017 are chosen to be in this population.

 population AS (
    SELECT channel,
           cohort_date,
           elapsed_periods,
           n_profiles
    FROM extracted
    WHERE geo = 'US'
      AND channel IN ('release', 'beta')
      AND cohort_date > '20170806'
      -- filter out noise from clients with incorrect dates
      AND elapsed_periods >= 0
      AND elapsed_periods < 12
),

The number of profiles is aggregated by the cohort dimensions. The cohort acquisition date and elapsed periods since acquisition are fundamental to cohort analysis.

 cohorts AS (
     SELECT channel,
            cohort_date,
            elapsed_periods,
            sum(n_profiles) AS n_profiles
     FROM population
     GROUP BY 1, 2, 3
),

The table will have the following structure. The table is sorted by the first three columns for demonstration.

channelcohort_dateelapsed_periodsn_profiles
release201701010100
release20170101190
release20170101280
............
beta201701281025

Finally, retention is calculated through the number of profiles at the time of the elapsed_period relative to the initial period. This data can be imported into a pivot table for further analysis.

results AS (
    SELECT c.*,
           iv.n_profiles AS total_n_profiles,
           (0.0+c.n_profiles)*100/iv.n_profiles AS percentage_n_profiles
    FROM cohorts c
    JOIN (
        SELECT *
        FROM cohorts
        WHERE elapsed_periods = 0
    ) iv ON (
        c.cohort_date = iv.cohort_date
        AND c.channel = iv.channel
    )
)
channelcohort_dateelapsed_periodsn_profilestotal_n_profilespercentage_n_profiles
release2017010101001001.0
release201701011901000.9
release201701012801000.8
...................
beta201701281025500.5

Obtain the results.

SELECT *
FROM results

You may consider visualizing using cohort graphs, line charts, or a pivot tables. See Firefox Telemetry Retention: Dataset Example Usage for more examples.

Scheduling

The aggregated churn data is updated weekly on Wednesday.

Schema

As of 2017-10-15, the current version of churn is v3 and has a schema as follows:

root
 |-- channel: string (nullable = true)
 |-- geo: string (nullable = true)
 |-- is_funnelcake: string (nullable = true)
 |-- acquisition_period: string (nullable = true)
 |-- start_version: string (nullable = true)
 |-- sync_usage: string (nullable = true)
 |-- current_version: string (nullable = true)
 |-- current_week: long (nullable = true)
 |-- source: string (nullable = true)
 |-- medium: string (nullable = true)
 |-- campaign: string (nullable = true)
 |-- content: string (nullable = true)
 |-- distribution_id: string (nullable = true)
 |-- default_search_engine: string (nullable = true)
 |-- locale: string (nullable = true)
 |-- is_active: string (nullable = true)
 |-- n_profiles: long (nullable = true)
 |-- usage_hours: double (nullable = true)
 |-- sum_squared_usage_hours: double (nullable = true)
 |-- total_uri_count: long (nullable = true)
 |-- unique_domains_count_per_profile: double (nullable = true)

Code Reference

The script for generating churn currently lives in mozilla/python_mozetl. The job can be found in mozetl/engagement/churn.