Firefox Data Documentation

This documentation is intended to help Mozilla's developers and data scientists analyze and interpret the data gathered by the Firefox Telemetry system.

At Mozilla, our data-gathering and data-handling practices are anchored in our Data Privacy Principles and elaborated in the Mozilla Privacy Policy. You can learn more about what data Firefox collects and the choices you can make as a Firefox user in the Firefox Privacy Notice.

If there's information missing from these docs, or if you'd like to contribute, see this article on contributing, and feel free to file a bug here.

The source for this documentation can be found in this repo.

Using this document

This documentation is divided into four main sections:

Getting Started

This section provides a quick introduction to analyzing telemetry data. After reading these articles, you will be able to confidently perform analysis over telemetry data.

Tools

Describes the tools we maintain to access and analyze product data.

Cookbooks & Tutorials

This section contains tutorials presented in a simple problem/solution format.

Data Collection and Datasets

Describes all available data we have from our products. For each dataset, we include a description of the dataset's purpose, what data is included, how the data is collected, and how you can change or augment the dataset. You do not need to read this section end-to-end.

You can find the fully-rendered documentation here, rendered with mdBook and hosted on GitHub Pages.

Reporting a problem

If you have a problem with data tools, datasets, or other pieces of infrastructure, please help us out by reporting it.

Most of our work is tracked in Bugzilla in the Data Platform and Tools product.

Bugs should be filed in the closest-matching component in the Data Platform and Tools product, but if there is no component for the item in question, please file an issue in the General component.

Components are triaged at least weekly by the component owner(s). For issues needing urgent attention, it is recommended that you use the needinfo flag to attract attention from a specific person. If an issue doesn't receive the appropriate attention within a week, you can send email to the fx-data-dev mailing list, reach out on IRC in #datapipeline, or on Slack in #fx-metrics.

When a bug is triaged, it will be assigned a priority and points. Priorities have the following meanings:

  • P1: in active development in the current sprint
  • P2: planned to be worked on in the current quarter
  • P3: planned to be worked on next quarter
  • P4 and beyond: nice to have, we would accept a patch, but not actively being worked on.

Points reflect the amount of effort required for a bug and are assigned as follows:

  • 1 point: one day or less of effort
  • 2 points: two days of effort
  • 3 points: three days to a week of effort
  • 5 points or more: SO MUCH EFFORT, major project.

Problems with the data

There are Bugzilla components for several of the core datasets described in this documentation, so please use the specific matching component where possible.

If there is a problem with a dataset that does not have its own component, please file an issue in the Datasets: General component.

Problems with tools

There are Bugzilla components for several of the tools that comprise the Data Platform, so please file a bug in the specific component that most closely matches the tool in question.

Operational bugs, such as services being unavailable, should be filed either in the component for the service itself or in the Operations component.

Other problems

When in doubt, please file issues in the General component.

Terminology

  • Analyst: Someone performing analysis. This is more general than data scientist.
  • Ping: A message sent from the Firefox browser to our telemetry servers containing information on browser state, user actions, etc... (more details)
  • Dataset: A set of data, includes ping data, derived datasets, etc...
  • Derived Dataset: A processed dataset, such as main_summary or the longitudinal dataset
  • Session: The time from when a Firefox browser starts until it shuts down
  • Subsession: Sessions are split into subsessions when a 24 hour threshold is crossed or an environment change occurs (more details)
  • ...

Getting Started

This document is meant to be a complete guide to using Firefox Data, so it can look overwhelming at first. These readings will get you up and running quickly. After completing them you should be able to produce simple analyses, but you should definitely get your analyses reviewed.

This section is meant to introduce new analysts to our data. I consider a "new analyst" to be an employee who is interested in working with our data but doesn't have previous experience with our tools/data. They could be technical or non-technical: engineer, PM, or data scientist.

Getting Started with Firefox Data

Firefox clients out in the wild send us data as pings. Main pings contain some combination of environment data (e.g. operating system, hardware, Firefox version), measurements (e.g. max number of open tabs, time spent running in JavaScript garbage collection), and events. We have quite a few different pings, but most of our data for Firefox Desktop comes in from main pings.

Measurement Types

When we need to measure specific things about clients, we use probes. A single ping will send in many different probes. There are two types of probes that we are interested in here: Histograms and Scalars.

You can search for and find more details about probes using the Probe Dictionary. It shows things like probe descriptions, when a probe started being collected, and whether it is collected on the release channel.

Histograms are bucketed counts. The Histograms.json file has the definitions for all histograms; each definition includes the minimum, maximum, and number of buckets. Rather than storing the exact value, each recorded value simply increments its associated bucket. We have four main types of histograms:

  1. Boolean - Only two buckets, associated with true and false.
  2. Enumerated - Integer buckets, where usually each bucket has a label.
  3. Linear - Buckets are divided evenly between the minimum and maximum; e.g. [1-2] is a bucket, and so is [100-101].
  4. Exponential - Larger valued buckets cover a larger range; e.g. [1-2] is a bucket, and so is [100-200].

To see some of these in action, take a look at the Histogram Simulator.

Scalars are simply single values. The Scalars.yaml file has the definitions for all scalars. These values can be integers, strings, or booleans.

TMO

The simplest way to start looking at probe data is to head over to telemetry.mozilla.org or TMO for short.

From there, you will likely want either the Measurement Dashboard or the Evolution Dashboard. Using these dashboards you can compare a probe's value between populations, e.g. GC_MS for 64 bit vs. 32 bit, and even track it across builds.

The Measurement Dashboard is a snapshot, aggregating all the data from all chosen dimensions. The Y axis is % of samples, and the X axis is the bucket. You can compare between dimensions, but it does not give you the ability to see how data is changing over time. To investigate that you must use the Evolution Dashboard.

The Evolution Dashboard shows how the data changes over time. Choose which statistics you'd like to plot over time, e.g. Median or 95th percentile. The X axis is time, and the Y axis is the value for whichever statistic you've chosen. This dashboard, for example, shows how GC_MS is improving from nightly 53 to nightly 56! While the median is not changing much, the 95th percentile is trending down, indicating that long garbage collections are being shortened.

The X axis on the Evolution Dashboard shows either Build ID (a date), or Submission Date. The difference is that on any single date we might receive submissions from lots of builds, but aggregating by Build ID means we can be sure a change was because of a new build.

The second plot on the Evolution View is the number of pings we saw containing that probe (Metric Count).

TMO Caveats

  • Data is aggregated on a per-ping basis, meaning these dashboards cannot be used to say something definitive about users. Please repeat that to yourself. A trend in the evolution view may be caused not by a change affecting lots of users, but a change affecting one single user who is now sending 50% of the pings we see. And yes, that does happen.
  • Sometimes it looks like no data is there, but you think there should be. Look under advanced settings and select "Don't Sanitize" and "Don't Trim Buckets". If it's still not there, let us know in IRC on #telemetry.
  • TMO Measurement Dashboards do not currently show release-channel data. Release-channel data ceased being aggregated as of Firefox 58. We're looking into ways of doing this correctly in the near future.

Where to go next

Choosing a Desktop Product Dataset

This document will help you find the best data source for a given analysis.

This guide focuses on descriptive datasets and does not cover experimentation. For example, this guide will help if you need to answer questions like:

  • How many users do we have in Germany? How many crashes do we see per day?
  • How many users have a given addon installed?

If you're interested in figuring out whether there's a causal link between two events take a look at our tools for experimentation.

Table of Contents

Raw Pings

We receive data from our users via pings. There are several types of pings, each containing different measurements and sent for different purposes. To review a complete list of ping types and their schemata, see this section of the Mozilla Source Tree Docs.

Many pings are also described by a JSONSchema specification which can be found in this repository.

There are a few pings that are central to delivering our core data collection primitives (Histograms, Events, Scalars) and for keeping an eye on Firefox behaviour (Environment, New Profiles, Updates, Crashes).

For instance, a user's first session in Firefox might have four pings like this:

Flowchart of pings in the user's first session

"main" ping

The "main" ping is the workhorse of the Firefox Telemetry system. It delivers the Telemetry Environment as well as Histograms and Scalars for all process types that collect data in Firefox. It has several variants each with specific delivery characteristics:

Reason | Sent when | Notes
------ | --------- | -----
shutdown | Firefox session ends cleanly | Accounts for about 80% of all "main" pings. Sent by Pingsender immediately after Firefox shuts down, subject to conditions: Firefox 55+, the OS isn't also shutting down, and this isn't the client's first session. If Pingsender fails or isn't used, the ping is sent by Firefox at the beginning of the next Firefox session.
daily | More than 24 hours have passed since the last "main" ping, and it is around local midnight | In long-lived Firefox sessions we might go days without receiving a "shutdown" ping. The "daily" ping is sent to ensure we occasionally hear from long-lived sessions.
environment-change | The Telemetry Environment changes | Sent immediately when triggered by Firefox (installing or removing an addon or changing a monitored user preference are common ways for the Telemetry Environment to change).
aborted-session | Firefox session doesn't end cleanly | Sent by Firefox at the beginning of the next Firefox session.

It was introduced in Firefox 38.

"first-shutdown" ping

The "first-shutdown" ping is identical to the "main" ping with reason "shutdown" created at the end of the user's first session, but sent with a different ping type. This was introduced when we started using Pingsender to send shutdown pings as there would be a lot of first-session "shutdown" pings that we'd start receiving all of a sudden.

It is sent using Pingsender.

It was introduced in Firefox 57.

"event" ping

The "event" ping provides low-latency eventing support to Firefox Telemetry. It delivers the Telemetry Environment, Telemetry Events from all Firefox processes, and some diagnostic information about Event Telemetry. It is sent every hour if there have been events recorded, and up to once every 10 minutes (governed by a preference) if the maximum event limit for the ping (default to 1000 per process, governed by a preference) is reached before the hour is up.

It was introduced in Firefox 62.

"update" ping

Firefox Update is the most important means we have of reaching our users with the latest fixes and features. The "update" ping notifies us when an update is downloaded and ready to be applied (reason: "ready") and when the update has been successfully applied (reason: "success"). It contains the Telemetry Environment and information about the update.

It was introduced in Firefox 56.

"new-profile" ping

When a user starts up Firefox for the first time, a profile is created. Telemetry marks the occasion with the "new-profile" ping which sends the Telemetry Environment. It is sent either 30 minutes after Firefox starts running for the first time in this profile (reason: "startup") or at the end of the profile's first session (reason: "shutdown"), whichever comes first. "new-profile" pings are sent immediately when triggered. Those with reason "startup" are sent by Firefox. Those with reason "shutdown" are sent by Pingsender.

It was introduced in Firefox 55.

"crash" ping

The "crash" ping provides diagnostic information whenever a Firefox process exits abnormally. Unlike the "main" ping with reason "aborted-session", this ping does not contain Histograms or Scalars. It contains a Telemetry Environment, Crash Annotations, and Stack Traces.

It was introduced in Firefox 40.

"optout" ping

In the event a user opts out of Telemetry, we send one final "optout" ping to let us know. We try exactly once to send it, discarding the ping if sending fails. It contains only the common ping data and an empty payload.

It was introduced in Firefox 63.

Pingsender

Pingsender is a small application shipped with Firefox which attempts to send pings even if Firefox is not running. If Firefox has crashed or has already shut down we would otherwise have to wait for the next Firefox session to begin to send pings.

Pingsender was introduced in Firefox 54 to send "crash" pings. It was expanded to send "main" pings of reason "shutdown" in Firefox 55 (excepting the first session). It has sent the "first-shutdown" ping since that ping's introduction in Firefox 57.

Analysis

The large majority of analyses can be completed using only the main ping. This ping includes histograms, scalars, and other performance and diagnostic data.

Few analyses actually rely directly on any raw ping data. Instead, we provide derived datasets which are processed versions of these data, made to be:

  • Easier and faster to query
  • Organized to make the data easier to analyze
  • Cleaned of erroneous or misleading data

Before analyzing raw ping data, check to make sure there isn't already a derived dataset made for your purpose. If you do need to work with raw ping data, be aware that loading the data can take a while. Try to limit the size of your data by controlling the date range, etc.

Accessing the Data

Ping data lives in BigQuery and is accessible in re:dash; see our BigQuery intro. There is currently limited history for main pings available in BigQuery; an import of historical data is planned, but without a determined timeline, so longer history requires an ATMO cluster using the Dataset API.

Further Reading

The complete ping documentation can be found in the Mozilla Source Tree Docs. To augment our data collection, see Collecting New Data and the Data Collection Policy.

Main Ping Derived Datasets

The main ping contains most of the measurements used to track performance and health of Firefox in the wild. This ping includes histograms, scalars, and events.

This section describes the derived datasets we provide to make analyzing this data easier.

longitudinal

The longitudinal dataset is a 1% sample of main ping data organized so that each row corresponds to a client_id. If you're not sure which dataset to use for your analysis, this is probably what you want.

Contents

Each row in the longitudinal dataset represents one client_id, which is approximately a user. Each column represents a field from the main ping. Most fields contain arrays of values, with one value for each ping associated with a client_id. Using arrays gives you access to the raw data from each ping, but can be difficult to work with from SQL. Here's a query showing some sample data to help illustrate.
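
For example, a minimal sketch of working with those per-ping arrays in Presto SQL on re:dash might look like the following; the submission_date and session_length column names are illustrative assumptions and your columns of interest may differ:

-- Sketch only: assumes per-ping array columns named submission_date and session_length
SELECT
  client_id,
  cardinality(submission_date) AS n_pings,            -- one array entry per ping
  element_at(session_length, 1) AS latest_session_length
FROM longitudinal
LIMIT 10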

Background and Caveats

Think of the longitudinal table as wide and short. The dataset contains more columns than main_summary and down-samples to 1% of all clients to reduce query computation time and save resources.

In summary, the longitudinal table differs from main_summary in two important ways:

  • The longitudinal dataset groups all data so that one row represents a client_id
  • The longitudinal dataset samples to 1% of all client_ids

Please note that this dataset only contains release (or opt-out) histograms and scalars.

Accessing the Data

The longitudinal dataset is available in re:dash, though it can be difficult to work with the array values in SQL. Take a look at this example query.

The data is stored as a parquet table in S3 at the following address.

s3://telemetry-parquet/longitudinal/

main_summary

The main_summary table is the most direct representation of a main ping but can be difficult to work with due to its size. Prefer the clients_daily dataset unless it doesn't aggregate the measurements you're interested in.

Contents

The main_summary table contains one row for each ping. Each column represents one field from the main ping payload, though only a subset of all main ping fields are included. This dataset does not include most histograms.

Background and Caveats

This table is massive, and due to its size, it can be difficult to work with. You should avoid querying main_summary from re:dash. Your queries will be slow to complete and can impact performance for other users, since re:dash runs on a shared cluster.

Instead, we recommend using the longitudinal or clients_daily dataset where possible. If these datasets do not suffice, consider using Spark on Databricks. In the odd case where these queries are necessary, make use of the sample_id field and limit to a short submission date range.
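
As a rough sketch of that advice, a query restricted to one sample_id and a short submission date range might look like this (Presto/Athena SQL; the submission_date_s3 and sample_id column names are assumptions based on the Athena/Presto tables and may differ in your environment):

SELECT
  normalized_channel,
  COUNT(*) AS n_pings
FROM main_summary
WHERE submission_date_s3 BETWEEN '20190401' AND '20190407'  -- keep the date range short
  AND sample_id = '42'                                      -- roughly a 1% sample of clients
GROUP BY normalized_channel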

Accessing the Data

The data is stored as a parquet table in S3 at the following address.

s3://telemetry-parquet/main_summary/v4/

Though not recommended, main_summary is accessible through re:dash. Here's an example query. Your queries will be slow to complete and can impact performance for other users, since re:dash runs on a shared cluster.

Further Reading

The technical documentation for main_summary is located in the telemetry-batch-view documentation.

The code responsible for generating this dataset is here.

first_shutdown_summary

The first_shutdown_summary table is a summary of the first-shutdown ping.

Contents

The first shutdown ping contains first session usage data. The dataset has rows similar to the telemetry_new_profile_parquet, but in the shape of main_summary.

Background and Caveats

Ping latency was reduced through the shutdown ping-sender mechanism in Firefox 55. To maintain consistent historical behavior, the first main ping is not sent until the second start up. In Firefox 57, a separate first-shutdown ping was created to evaluate first-shutdown behavior while maintaining backwards compatibility.

In many cases, the first-shutdown ping is a duplicate of the main ping. The first-shutdown summary can be used in conjunction with the main summary by taking the union and deduplicating on the document_id.
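
A minimal sketch of that union-and-deduplicate pattern in Presto/Athena SQL follows; it assumes both tables share the main_summary column layout and a submission_date_s3 partition column:

-- Sketch only: combine the two tables and keep one row per document_id
WITH combined AS (
  SELECT * FROM main_summary WHERE submission_date_s3 = '20190401'
  UNION ALL
  SELECT * FROM first_shutdown_summary WHERE submission_date_s3 = '20190401'
),
ranked AS (
  SELECT
    *,
    row_number() OVER (PARTITION BY document_id) AS rn
  FROM combined
)
SELECT * FROM ranked WHERE rn = 1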

Accessing the Data

The data can be accessed as first_shutdown_summary. It is currently stored in the following path.

s3://telemetry-parquet/first_shutdown_summary/v4/

The data is backfilled to 2017-09-22, the date of its first nightly appearance. This data should be available to all releases on and after Firefox 57.

client_count_daily

The client_count_daily dataset is useful for estimating user counts over a few pre-defined dimensions.

The client_count_daily dataset is similar to the deprecated client_count dataset except that it is aggregated by submission date rather than activity date.

Content

This dataset includes columns for a dozen factors and an HLL variable. The hll column contains a HyperLogLog variable, which is an approximation to the exact count. The factor columns include submission date and the dimensions listed here. Each row represents one combination of the factor columns.

Background and Caveats

It's important to understand that the hll column is not a standard count. The hll variable avoids double-counting users when aggregating over multiple days. The HyperLogLog variable is a far more efficient way to count distinct elements of a set, but comes with some complexity. To find the cardinality of an HLL use cardinality(cast(hll AS HLL)). To find the union of two HLLs over different dates, use merge(cast(hll AS HLL)). The Firefox ER Reporting Query is a good example to review. Finally, Roberto has a relevant write-up here.
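
For instance, a sketch of counting distinct clients over a week without double-counting (Presto SQL; the submission_date column name and date format are assumptions and may differ):

-- Sketch only: merge the daily HLL sketches before taking the cardinality
SELECT cardinality(merge(cast(hll AS HLL))) AS weekly_client_count
FROM client_count_daily
WHERE submission_date BETWEEN '20190401' AND '20190407'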

Accessing the Data

The data is available in Re:dash. Take a look at this example query.

I don't recommend accessing this data from ATMO.

Further Reading

clients_last_seen

The clients_last_seen dataset is useful for efficiently determining exact user counts such as DAU and MAU.

It does not use approximations, unlike the HyperLogLog algorithm used in the client_count_daily dataset, and it includes the most recent values in a 28 day window for all columns in the clients_daily dataset.

This dataset should be used instead of client_count_daily.

Content

For each submission_date this dataset contains one row per client_id that appeared in clients_daily in a 28 day window including submission_date and preceding days.

The days_since_seen column indicates the difference between submission_date and the most recent submission_date in clients_daily where the client_id appeared. A client observed on the given submission_date will have days_since_seen = 0.

Other days_since_ columns use the most recent date in clients_daily where a certain condition was met. If the condition was not met for a client_id in a 28 day window NULL is used. For example days_since_visited_5_uri uses the condition scalar_parent_browser_engagement_total_uri_count_sum >= 5. These columns can be used for user counts where a condition must be met on any day in a window instead of using the most recent values for each client_id.

The rest of the columns use the most recent value in clients_daily where the client_id appeared.

Background and Caveats

User counts generated using days_since_seen only reflect the most recent values from clients_daily for each client_id in a 28 day window. This means Active MAU as defined cannot be efficiently calculated using days_since_seen because if a given client_id appeared every day in February and only on February 1st had scalar_parent_browser_engagement_total_uri_count_sum >= 5 then it would only be counted on the 1st, and not the 2nd-28th. Active MAU can be efficiently and correctly calculated using days_since_visited_5_uri.

MAU can be calculated over a GROUP BY submission_date[, ...] clause using COUNT(*), because there is exactly one row in the dataset for each client_id in the 28 day MAU window for each submission_date.

User counts generated using days_since_seen can use SUM to reduce groups, because a given client_id will only be in one group per submission_date. So if MAU were calculated by country and channel, then the sum of the MAU for each country would be the same as if MAU were calculated only by channel.
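
Putting those rules together, a sketch of a MAU query in BigQuery standard SQL might look like this (the submission_date value is only an example):

SELECT
  submission_date,
  COUNT(*) AS mau,                                      -- one row per client_id in the 28 day window
  COUNTIF(days_since_visited_5_uri < 28) AS active_mau  -- clients meeting the 5-URI condition in the window
FROM
  telemetry.clients_last_seen
WHERE
  submission_date = '2019-04-01'
GROUP BY
  submission_date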

Accessing the Data

The data is available in Re:dash and BigQuery. Take a look at this full running example query in Re:dash.

clients_daily

The clients_daily table is intended as the first stop for asking questions about how people use Firefox. It should be easy to answer simple questions. Each row in the table is a (client_id, submission_date) pair and contains a number of aggregates about that day's activity.

Contents

Many questions about Firefox take the form "What did clients with characteristics X, Y, and Z do during the period S to E?" The clients_daily table is aimed at answering those questions.

Accessing the Data

The data is stored as a parquet table in S3 at the following address.

s3://telemetry-parquet/clients_daily/v6/

The clients_daily table is accessible through re:dash using the Athena data source. It is also available via the Presto data source, though Athena should be preferred for performance and stability reasons.

Here's an example query.
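
As an illustration, here is a sketch of the kind of question clients_daily is designed for (Presto/Athena SQL; the country, active_hours_sum, and submission_date_s3 column names are assumptions and may differ):

-- Sketch only: average daily activity for clients in Germany over one week
SELECT
  submission_date_s3,
  COUNT(*) AS clients,
  AVG(active_hours_sum) AS avg_active_hours
FROM clients_daily
WHERE submission_date_s3 BETWEEN '20190401' AND '20190407'
  AND country = 'DE'
GROUP BY submission_date_s3
ORDER BY submission_date_s3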

Crash Ping Derived Datasets

The crash ping is captured after the main Firefox process crashes or after a content process crashes, whether or not the crash report is submitted to crash-stats.mozilla.org. It includes non-identifying metadata about the crash.

This section describes the derived datasets we provide to make analyzing this data easier.

error_aggregates

The error_aggregates_v2 table contains counts of errors derived from main and crash pings, aggregated every 5 minutes. It is the dataset backing the main mission control view, but may also be queried independently.

Contents

The error_aggregates_v2 table contains counts of various error measures (for example: crashes, "the slow script dialog showing"), aggregated across each unique set of dimensions (for example: channel, operating system) every 5 minutes. You can get an aggregated count for any particular set of dimensions by summing using SQL.

Experiment unpacking

It's important to note that when this dataset is written, pings from clients participating in an experiment are aggregated on the experiment_id and experiment_branch dimensions corresponding to the experiment and branch they are participating in. However, they are also aggregated with the rest of the population, where the values of these dimensions are null. Therefore, care must be taken when writing aggregate queries over the whole population: in these cases you need to filter for experiment_id IS NULL and experiment_branch IS NULL in order not to double-count pings from experiments.
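
A sketch of a whole-population aggregation with that filter applied (SQL via the Athena source; the main_crashes, usage_hours, and submission_date column names are assumptions and may differ):

SELECT
  channel,
  SUM(main_crashes) AS main_crashes,
  SUM(usage_hours) AS usage_hours
FROM telemetry.error_aggregates_v2
WHERE submission_date = '2019-04-01'
  AND experiment_id IS NULL        -- exclude the experiment-specific rows
  AND experiment_branch IS NULL
GROUP BY channel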

Accessing the data

You can access the data via re:dash. Choose Athena and then select the telemetry.error_aggregates_v2 table.

Further Reading

The code responsible for generating this dataset is here.

crash_summary

The crash_summary table is the most direct representation of a crash ping.

Contents

The crash_summary table contains one row for each crash ping. Each column represents one field from the crash ping payload, though only a subset of all crash ping fields are included.

Accessing the Data

The data is stored as a parquet table in S3 at the following address.

s3://telemetry-parquet/crash_summary/v1/

crash_summary is accessible through re:dash. Here's an example query.

Further Reading

The technical documentation for crash_summary is located in the telemetry-batch-view documentation.

The code responsible for generating this dataset is here.

New-Profile Derived Datasets

The new-profile ping is sent from Firefox Desktop on the first session of a newly created profile and contains the initial information about the user environment.

This data is available in the telemetry_new_profile_parquet dataset.

The telemetry_new_profile_parquet table is the most direct representation of a new-profile ping.

Contents

The table contains one row for each ping. Each column represents one field from the new-profile ping payload, though only a subset of all fields are included.

Accessing the Data

The data is stored as a parquet table in S3 at the following address.

s3://net-mozaws-prod-us-west-2-pipeline-data/telemetry-new-profile-parquet/v2/

The telemetry_new_profile_parquet is accessible through re:dash. Here's an example query.

Further Reading

This dataset is generated automatically using direct to parquet. The configuration responsible for generating this dataset was introduced in bug 1360256.

Update Derived Dataset

The update ping is sent from Firefox Desktop when a browser update is ready to be applied and after it was correctly applied. It contains the build information and the update blob information, in addition to some information about the user environment. The telemetry_update_parquet table is the most direct representation of an update ping.

Contents

The table contains one row for each ping. Each column represents one field from the update ping payload, though only a subset of all fields are included.

Accessing the Data

The data is stored as a parquet table in S3 at the following address.

s3://net-mozaws-prod-us-west-2-pipeline-data/telemetry-update-parquet/v1/

The telemetry_update_parquet is accessible through re:dash. Here's an example query.

Further Reading

This dataset is generated automatically using direct to parquet. The configuration responsible for generating this dataset was introduced in bug 1384861.

Other Datasets

Public crash statistics for Firefox are available through the Data Platform in a socorro_crash dataset. The crash data in Socorro is sanitized and made available to ATMO and STMO. A nightly import job converts batches of JSON documents into a columnar format using the associated JSON Schema.

Contents

Accessing the Data

The dataset is available in parquet at s3://telemetry-parquet/socorro_crash/v2. It is also indexed with Athena and Presto with the table name socorro_crash.

Obsolete Datasets

heavy_users

The heavy_users table provides information about whether a given client_id is considered a "heavy user" on each day (using submission date).

Contents

The heavy_users table contains one row per client-day, where day is submission_date. A client has a row for a specific submission_date if they were active at all in the 28 day window ending on that submission_date.

A user is a "heavy user" as of day N if, for the 28 day period ending on day N, the sum of their active_ticks is in the 90th percentile (or above) of all clients during that period. For more analysis on this, and a discussion of new profiles, see this link.

Background and Caveats

  1. Data starts at 20170801. There is technically data in the table before this, but the heavy_user column is NULL for those dates because it needed to bootstrap the first 28 day window.
  2. Because it is the top 10% of clients for each 28 day period, more than 10% of clients active on a given submission_date will be considered heavy users. If you join with another data source (main_summary, for example), you may see a larger proportion of heavy users than expected.
  3. Each day has a separate, but related, set of heavy users. Initial investigations show that approximately 97.5% of heavy users as of a certain day are still considered heavy users as of the next day.
  4. There is no "fixing" or weighting of new profiles - days before the profile was created are counted as zero active_ticks. Analyses may need to use the included profile_creation_date field to take this into account.

Accessing the Data

The data is available both via sql.t.m.o and Spark.

In Spark:

spark.read.parquet("s3://telemetry-parquet/heavy_users/v1")

In SQL:

SELECT * FROM heavy_users LIMIT 3
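
Building on that, a sketch of counting heavy users on a single day; the heavy_user boolean and submission_date_s3 column names are assumptions based on the description above and may differ:

-- Sketch only: count clients flagged as heavy users on one submission date
SELECT COUNT(*) AS n_heavy_users
FROM heavy_users
WHERE submission_date_s3 = '20180101'
  AND heavy_user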

Further Reading

The code responsible for generating this dataset is here.

retention

The retention table provides client counts relevant to client retention at a 1-day granularity. The project is tracked in Bug 1381840.

Contents

The retention table contains a set of attribute columns used to specify a cohort of users and a set of metric columns to describe cohort activity. Each row contains a permutation of attributes, an approximate set of clients in a cohort, and the aggregate engagement metrics.

This table uses the HyperLogLog (HLL) sketch to create an approximate set of clients in a cohort. HLL allows counting across overlapping cohorts in a single pass while avoiding the problem of double counting. This data structure has the benefit of being compact and performant in the context of retention analysis, at the expense of precision. For example, a 7-day retention count can be obtained by aggregating over a week of retention data using the union operation. With plain SQL primitives, this would instead require recalculating COUNT DISTINCT over client_ids in the 7-day window.

Background and Caveats

  1. The data starts at 2017-03-06, the merge date when Nightly started to track Firefox 55 in Mozilla-Central. However, there was not a consistent view into the behavior of first session profiles until the new_profile ping. This means much of the data is inaccurate before 2017-06-26.
  2. This dataset uses a 4 day reporting latency to aggregate at least 99% of the data in a given submission date. This figure is derived from the telemetry-health measurements on submission latency, with the discussion in Bug 1407410. This latency was reduced in Firefox 55 with the introduction of the shutdown ping-sender mechanism.
  3. Caution should be taken before adding new columns. Additional attribute columns will grow the number of rows exponentially.
  4. The number of HLL bits chosen for this dataset is 13. This means the default size of the HLL object is 2^13 bits or 1KiB. This maintains about a 1% error on average. See this table from Algebird's HLL implementation for more details.

Accessing the Data

The data is primarily available through Re:dash on STMO via the Presto source. This service has been configured to use predefined HLL functions.

The column should first be cast to the HLL type. The scalar cardinality(<hll_column>) function will approximate the number of unique items per HLL object. The aggregate merge(<hll_column>) function will perform the set union between all objects in a column.

Example: Cast the count column into the appropriate type.

SELECT cast(hll as HLL) as n_profiles_hll FROM retention

Count the number of clients seen over all attribute combinations.

SELECT cardinality(cast(hll as HLL)) FROM retention

Group-by and aggregate client counts over different release channels.

SELECT channel, cardinality(merge(cast(hll AS HLL)))
FROM retention
GROUP BY channel
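
And, as described above, a 7-day retention count can be sketched by merging a week of per-day HLL objects before taking the cardinality; the start_date column name here is an assumption and the actual date column may differ:

-- Sketch only: distinct clients across one week of daily HLL sketches
SELECT cardinality(merge(cast(hll AS HLL))) AS weekly_clients
FROM retention
WHERE start_date BETWEEN '2018-01-01' AND '2018-01-07'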

The HyperLogLog library wrappers are available for use outside of the configured STMO environment, spark-hyperloglog and presto-hyperloglog.

Also see the client_count_daily dataset.

churn

The churn dataset tracks the 7-day churn rate of telemetry profiles. This dataset is generally used for analyzing cohort churn across segments and time.

Content

Churn is the rate of attrition defined by (clients seen in week N)/(clients seen in week 0) for groups of clients with some shared attributes. A group of clients with shared attributes is called a cohort. The cohorts in this dataset are created every week and can be tracked over time using the acquisition_date and the weeks since acquisition or current_week.

The following example demonstrates the current logic for generating this dataset. In the table below, each column represents a day since some arbitrary starting date, and an X marks a day on which the client was active.

client | 00 | 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 | 10 | 11 | 12 | 13 | 14
------ | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
A | X |  |  |  |  |  |  | X |  |  |  |  |  |  |
B | X | X | X | X | X | X |  |  |  |  |  |  |  |  |
C | X |  |  |  |  |  |  |  |  |  |  |  |  |  | X

All three clients are part of the same cohort. Client A is retained for weeks 0 and 1 since there is activity in both periods. A client only needs to show up once in the period to be counted as retained. Client B is acquired in week 0 and is active frequently but does not appear in following weeks. Client B is considered churned on week 1. However, a client that is churned can become retained again. Client C is considered churned on week 1 but retained on week 2.

The following table summarizes the above daily activity into a view where every column represents the current week since the acquisition date.

client | 0 | 1 | 2
------ | - | - | -
A | X | X |
B | X |  |
C | X |  | X

The clients are then grouped into cohorts by attributes. An attribute describes a property about the cohort such as the country of origin or the binary distribution channel. Each group also contains descriptive aggregates of engagement. Each metric describes the activity of a cohort such as size and overall usage at a given time instance.
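
A sketch of how a cohort's retention curve (from which churn follows) could be queried; the acquisition_date, current_week, and n_profiles column names are assumptions based on the description above and may differ:

-- Sketch only: cohort size per week since acquisition for one weekly cohort
SELECT
  acquisition_date,
  current_week,
  SUM(n_profiles) AS cohort_size
FROM churn
WHERE acquisition_date = '20180107'
GROUP BY acquisition_date, current_week
ORDER BY current_week

Dividing each week's cohort_size by the week 0 value gives the retention series, (clients seen in week N)/(clients seen in week 0); churn is its complement.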

Background and Caveats

The original concept for churn is captured in this Mana page. The original derived data-set was created in bug 1198537. The first major revision (v2) of this data-set added attribution, search, and uri counts. The second major revision (v3) included additional clients through the new-profile ping and adjusted the collection window from 10 to 5 days.

  • Each row in this dataset describes a unique segment of users
    • The number of rows is exponential with the number of dimensions
    • New fields should be added sparingly to account for data-set size
  • The dataset lags by 10 days in order to account for submission latency
    • This value was determined to be the time for 99% of main pings to arrive at the server. With the shutdown-ping sender, this has been reduced to 4 days. However, churn_v3 still tracks releases older than Firefox 55.
  • The start of the period is fixed to Sundays. Once it has been aggregated, the period cannot be shifted due to the way clients are counted.
    • A supplementary 1-day retention dataset using HyperLogLog for client counts is available for counting over arbitrary retention periods and date offsets. Additionally, calculating churn or retention over specific cohorts is tractable in STMO with main_summary or clients_daily datasets.

Accessing the Data

churn is available in Re:dash under Athena and Presto. The data is also available in parquet for consumption by columnar data engines at s3://telemetry-parquet/churn/v3.

Appendix

Mobile Metrics

There are several tables owned by the mobile team documented here:

  • android_addons
  • mobile_clients

Choosing a Mobile Product Dataset

Products Overview

Before doing an analysis, it is important to know which products you want to include. Here is a quick overview of Mozilla's mobile products.

Product Name | App Name | OS | Notes
------------ | -------- | -- | -----
Firefox Android | Fennec | Android |
Firefox iOS | Fennec | iOS |
Focus Android | Focus | Android | Privacy browser
Focus iOS | Focus | iOS | Privacy browser
Klar | Klar | Android | German Focus release
Firefox for Fire TV | FirefoxForFireTV | Android |
Firefox for Echo Show | FirefoxConnect | Android |
Firefox Lite | Zerda | Android | Formerly Rocket (see below)
Fenix (Firefox Preview) | N/A | Android | Uses Glean (see below)

Firefox Lite was formerly known as Rocket. It is only available in certain countries in Asia Pacific - for more information on Firefox Lite data please see the telemetry documentation.

Focus is our privacy focused mobile browser which blocks trackers by default and does not store a browsing history.

Klar is the release name for Focus in Germany.

For more information on how telemetry is sent for iOS apps, see the telemetry documentation.

Some telemetry is also sent by FirefoxReality and some non-Mozilla forks of our browsers. It is best to filter on metadata_app_name to ensure you are looking at only the app you are trying to analyze data for.

Raw Pings

Mobile data is structured differently than desktop. Instead of sending a main ping, mobile has two key types of pings - a core ping and an events ping. The core ping is sent once per session and contains a much smaller set of metrics than the main ping, due to network and data size constraints. All mobile apps send the core ping. For more information on the core ping, there is telemetry documentation here.

Event pings are not sent for all products. Event pings are sent by Focus Android, Focus iOS, Klar, Firefox for FireTV, Firefox for Echo Show and Firefox Lite. Event pings are sent more often than core pings, at most once per 10 minute interval. If the ping records 10,000 events it is sent immediately, unless it is within 10 minutes of the last event ping sent, in which case some data may be lost. For more information on the event ping, there is telemetry documentation here.

Fennec (Firefox Android) does not send event pings, but instead has a saved_session ping which has the same format as main_summary but is only available for pre-release users and a select few release users who have opted in to telemetry collection. Data from this must be treated with caution as it comes from a biased population and should not be used to make conclusions about Fennec users as a whole.

For more information on the implementation of the event pings and to view event descriptions for Focus, Firefox for FireTV or Firefox for Echo Show please see the linked documentation.

Core Ping Derived Datasets

telemetry_core_parquet

For most analyses of mobile data, use the telemetry_core_parquet table. This table contains data for all the non-desktop Firefox applications which send core pings.

Unlike main_summary, you can query telemetry_core_parquet directly. Remember to filter on app_name and os, as Firefox iOS and Firefox Android have the same app_name. Best practice is to always filter on app_name, os, app version (found as metadata_app_version) and release channel (found under metadata as metadata.normalized_channel).

There are versioned tables for telemetry_core_parquet, but the table without a _v# suffix is the most up to date and is the one you should use in your analysis.

The metadata field contains a list of useful metrics. To access them, query metadata.metric_name for the metric of your choice. Metrics included in metadata are: [document_id, timestamp, date, geo_country, geo_city, app_build_id, normalized_channel] as described here.

The seq field indicates the order of the pings coming in. seq = 1 is the first ping we have received for that client id and can be used as a proxy for new users.
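
For example, here is a sketch of counting new Focus Android profiles on one day, applying the recommended filters and using seq = 1 as the new-profile proxy (the submission_date_s3 partition column is an assumption and may differ):

SELECT
  submission_date_s3,
  COUNT(DISTINCT client_id) AS new_clients
FROM telemetry_core_parquet
WHERE submission_date_s3 = '20190401'
  AND app_name = 'Focus'
  AND os = 'Android'
  AND metadata.normalized_channel = 'release'
  AND seq = 1                                   -- first ping received for the client
GROUP BY submission_date_s3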

Other Tables

Mobile has a core_client_count table which has a created date and unique client id for each new install. This does not fully replicate what client_count_daily does for Desktop but can be useful for some analyses.

For other core ping derived tables see the documentation here. These (with the exception of mobile_clients) are derived from the saved_session ping only available as an opt-in on Fennec release, so should be used with caution.

Event Ping Derived Datasets

There are multiple event tables for mobile data. The two main event tables are telemetry_mobile_event_parquet and telemetry_focus_event_parquet. As the name suggests, the event pings from Focus (iOS, Android and Klar) get sent to telemetry_focus_event_parquet and the other apps send data to telemetry_mobile_event_parquet. Both tables have the same format and columns.

telemetry_mobile_events_parquet

This table contains event data for Firefox for Fire TV, Firefox for Echo Show and Firefox Lite. There is a metadata column containing a list of metrics.

Like when querying telemetry_core_parquet, there are multiple apps contained in each table, so it is best practice to filter on at least app_name and os. One thing to note is that there is no app_version field in these tables, so in order to filter or join on a specific version you must know the corresponding metadata.app_build_id(s) for that app_version. This can be found by reaching out to the engineering team building the app.

Some other applications also send event data to this table, including Lockbox and FirefoxReality. For more information on the event data sent from these applications, see their documentation.
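
As a quick sanity-check sketch, counting event pings per app and OS in this table (the submission_date_s3 partition column is an assumption, and the table name follows the heading above):

SELECT
  app_name,
  os,
  COUNT(*) AS n_event_pings
FROM telemetry_mobile_events_parquet
WHERE submission_date_s3 = '20190401'
GROUP BY app_name, os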

telemetry_focus_events_parquet

This table contains event data for Focus Android, Focus iOS and Klar.

Like when querying telemetry_core_parquet, there are multiple apps contained in each table, so it is best practice to filter on at least app_name and os. One thing to note is that there is no app_version field in these tables, so in order to filter or join on a specific version you must know the corresponding app_build_id(s) for that app_version. This can be found by reaching out to the engineering team building the app.

Some other applications send data to this table, but it is preferred to use this only for analysis of event data from Focus and its related apps.

Notes

Each app has its own set of release channels and each app implements them in its own way. Most have a nightly, beta, release and an other channel, used at various stages of development. Users sign up to test pre-release versions of the app. In Focus Android, the beta channel uses the same APK in the Google Play Store as the release channel, but beta users get access to this version earlier than the release population. Once the release version is published, beta users will be on the same version of the app as release users and will be indistinguishable (without a query going back and flagging them by client_id). Beta releases have their normalized_channel tagged as release, and the only way to identify beta users is to check that they were on a higher version number before the official release date.

There was an incident on Oct 25, 2018 where a chunk of client_ids on Firefox Android were reset to the same client_id. For more information see the blameless post-mortem document here or bug 1501329. Because of this, some retention analyses spanning this time frame may be impacted.

Upcoming Changes

In the future, Android apps will use Glean, the new mobile telemetry SDK. Plans are to integrate this new SDK starting with Project Fenix, then update the other existing apps to Glean starting in the second half of 2019. Instead of core and event pings, Glean will send baseline, metrics and events pings. For more information on Glean visit their GitHub page or #Glean on Slack.

Introduction

STMO is shorthand for sql.telemetry.mozilla.org, an installation of the excellent Re:dash data analysis and dashboarding tool that has been customized and configured for use with a number of the Firefox organization's data sets. As the name and URL imply, effective use of this tool requires familiarity with the SQL query language, with which all of the tool's data extraction and analysis are performed.

Concepts

There are three building blocks from which analyses in STMO are constructed: queries, visualizations, and dashboards.

Queries

STMO's basic unit of analysis is the query. A query is a block of SQL code that extracts and (optionally) transforms data from a single data source. Queries can vary widely in complexity. Some queries are trivial one liners (e.g. SELECT * FROM tablename LIMIT 10), while others are many pages long, small programs in their own right.

The raw output from a query is tabular data, where each row is one set of return values for the query's output columns. A query can be run manually or it can be specified to have a refresh schedule, where it will execute automatically after a specified interval of time.

Visualizations

Tabular data is great, but rarely is a grid of values the best way to make sense of your data. Each query can be associated with multiple visualizations, each visualization rendering the extracted data in some more useful format. There are many visualization types, including charts (line, bar, area, pie, etc.), counters, maps, pivot tables, and more. Each visualization type provides a set of configuration parameters that allow you to specify how to map from the raw query output to the desired visualization. Some visualization types make demands of the query output; a map visualization requires each row to contain a longitude value and a latitude value, for instance.

Dashboards

A dashboard is a collection of visualizations, combined into a single visual presentation for convenient perusal. A dashboard is decoupled from any particular queries. While it is possible to include multiple visualizations from a single query in one dashboard, it is not required; users can add any visualizations they can access to the dashboards they create.

Data Sources

SQL provides the ability to extract and manipulate the data, but you won't get very far without having some familiarity with what data is actually available, and how that data is structured. Each query in STMO is associated with exactly one data source, and you have to know ahead of time which data source contains the information that you need. One of the most commonly used data sources is called Athena (referring to Amazon's Athena query service, on which it is built), which contains most of the data that is obtained from telemetry pings received from Firefox clients. The BigQuery (referring to Google's BigQuery service) source is slowly replacing the Athena and Presto data sources. BigQuery contains some of the data that's exposed via Athena, as well as new data that is calculated there. Presto contains all of the data that's exposed via Athena and more, but returns query results much more slowly.

Other available data sources include Crash DB, Tiles, Sync Stats, Push, Test Pilot, ATMO, and even a Re:dash metadata source which connects to STMO's own Re:dash database. You can learn more about the available data sets and how to find the one that's right for you on the Choosing a dataset page. If you have data set questions, or would like to know if specific data is or can be made available in STMO, please inquire in the #datapipeline or #datatools channels on irc.mozilla.org.

Creating an Example Dashboard

The rest of this document will take you through the process of creating a simple dashboard using STMO.

Creating A Query

We start by creating a query. Our first query will count the number of client ids that we have coming from each country, for the top N countries. Clicking on the 'New Query' button near the top left of the site will bring you to the query editing page:

New Query Page

For this (and most queries where we're counting distinct client IDs) we'll want to use clients_last_seen, which is generated from Firefox telemetry pings.

  • Check if the table is in BigQuery

    As mentioned above, BigQuery is replacing Athena and Presto, but not all data sets are yet available in BigQuery. Click on the 'Data Source' drop-down and select BigQuery, then check to see if the one we want is available by typing clients_last_seen into the "Search schema..." search box above the schema browser interface to the left of the main query edit box. You should see that there is, in fact, a clients_last_seen table (showing up as telemetry.clients_last_seen), as well as versioned clients_last_seen data sets (showing up as telemetry.clients_last_seen_v<VERSION>).

  • Check if the table is in Athena

    If it's not in BigQuery, now we should check to see if it's in Athena. If you click on the 'Data Source' drop-down and change the selection from 'BigQuery' to 'Athena' (with clients_last_seen still populating the filter input), you should see that there is a match for clients_last_seen, which means this table is available in Athena.

  • Check if the table is in Presto

    If it's also not in Athena, now we should check to see if it's in Presto. If you click on the 'Data Source' drop-down and change the selection from 'Athena' to 'Presto' (with clients_last_seen still populating the filter input), you should see that there is a match for clients_last_seen, which means this table is available in Presto.

  • Introspect the available columns

    Click on the 'Data Source' drop-down and change the selection to 'BigQuery', and click on telemetry.clients_last_seen in the schema browser to expose the columns that are available in the table. Three of the columns are of interest to us for this query: country, days_since_seen, and submission_date.

So a query that extracts all of the unique country values and the MAU for one day for each one, sorted from highest MAU to lowest MAU looks like this:

SELECT
  country,
  COUNTIF(days_since_seen < 28) AS mau
FROM
  telemetry.clients_last_seen
WHERE
  submission_date = '2019-04-01'
GROUP BY
  country
ORDER BY
  mau DESC

If you type that into the main query edit box and then click on the "Execute" button, you should see a blue bar appear below the edit box containing the text "Executing query..." followed by a timer indicating how long the query has been running. After a reasonable (for some definition of "reasonable", usually less than a minute) amount of time the query should complete, resulting in a table showing a MAU value for each country. Congratulations, you've just created and run your first STMO query!

Now would be a good time to click on the large "New Query" text near the top of the page; it should turn into an edit box, allowing you to rename the query. For this exercise, you should use a unique prefix (such as your name) for your query name, so it will be easy to find your query later; I used rmiller:Top Countries.

Creating A Visualization

Now that the query is created, we'll want to provide a simple visualization. The table with results from the first query execution should be showing up underneath the query edit box. Next to the TABLE heading should be another heading entitled +NEW VISUALIZATION.

New Visualization

Clicking on the +NEW VISUALIZATION link should bring you to the "Visualization Editor" screen, where you can specify a visualization name ("Top Countries bar chart"), a chart type ("Bar"), an x-axis column (country), and a y-axis column (mau):

Visualization Editor

After the GENERAL settings have been specified, we'll want to tweak a few more settings on the X AXIS tab. You'll want to click on this tab and then change the 'Scale' setting to 'Category', and un-check the 'Sort Values' check-box to allow the query's sort order to take precedence:

Visualization X Axis

A Note About Limits

Once you save the visualization settings and return to the query source page, you should have a nice bar graph near the bottom of the page. You may notice, however, that the graph has quite a long tail. Rather than seeing all of the countries, it might be nicer to only see the top 20. We can do this by adding a LIMIT clause to the end of our query:

SELECT
  country,
  COUNTIF(days_since_seen < 28) AS mau
FROM
  telemetry.clients_last_seen
WHERE
  submission_date = '2019-04-01'
GROUP BY
  country
ORDER BY
  mau DESC
LIMIT
  20

If you edit the query to add a limit clause and again hit the 'Execute' button, you should get a new bar graph that only shows the 20 countries with the highest number of unique clients. In this case, the full result set has approximately 250 return values, and so limiting the result count improves readability. In other cases, however, an unlimited query might return thousands or even millions of rows. When those queries are run, readability is not the only problem; queries that return millions of rows can tax the system, failing to return the desired results, and negatively impacting the performance of all of the other users of the system. Thus, a very important warning:

ALL QUERIES SHOULD INCLUDE A "LIMIT" STATEMENT BY DEFAULT!

You should be in the habit of adding a "LIMIT 100" clause to the end of all new queries, to prevent your query from returning a gigantic result set that causes UI and performance problems. You may learn that the total result set is small enough that the limit is unnecessary, but unless you're certain that is the case specifying an explicit LIMIT helps prevent unnecessary issues.

Query Parameters

We got our chart under control by adding a "LIMIT 20" clause at the end. But what if we only want the top 10? Or maybe sometimes we want to see the top 30? We don't always know how many results our users will want. Is it possible to allow users to specify how many results they want to see?

As you've probably guessed, I wouldn't be asking that question if the answer wasn't "yes". STMO allows queries to accept user arguments by the use of double curly-braces around a variable name. So our query now becomes the following:

SELECT
  country,
  COUNTIF(days_since_seen < 28) AS mau
FROM
  telemetry.clients_last_seen
WHERE
  submission_date = '2019-04-01'
GROUP BY
  country
ORDER BY
  mau DESC
LIMIT
  {{country_count}}

Once you replace the hard coded limit value with {{country_count}} you should see an input field show up directly above the bar chart. If you enter a numeric value into this input box and click on 'Execute' the query will run with the specified limit. Clicking on the 'Save' button will then save the query, using the entered parameter value as the default.

Creating A Dashboard

Now we can create a dashboard to display our visualization. Do this by clicking on the 'Dashboards' drop-down near the top left of the page, and then clicking the 'New Dashboard' option. Choose a name for your dashboard, and you will be brought to a mostly empty page. Clicking on the '...' button near the top right of the page will give you the option to 'Add Widget'. This displays the following dialog:

Add Widget

The "Search a query by name" field is where you can enter in the unique prefix used in your query name to find the query you created. This will not yet work, however, because your query isn't published. Publishing a query makes it show up in searches and on summary pages. Since this is only an exercise, we won't want to leave our query published, but it must be published briefly in order to add it to our dashboard. You can publish your query by returning to the query source page and clicking the "Publish" button near the top right of the screen.

Once your query is published, it should show up in the search results when you type your unique prefix into the "Search a query by name" field on the "Add Widget" dialog. When you select your query, a new "Choose Visualization" drop-down will appear, allowing you to choose which of the query's visualizations to use. Choose the bar chart you created and then click on the "Add to Dashboard" button. Voila! Now your dashboard should have a bar chart, and you should be able to edit the country_count value and click the refresh button to change the number of countries that show up on the chart.

Completing the Dashboard

A dashboard with just one graph is a bit sad, so let's flesh it out by adding a handful of additional widgets. We're going to create three more queries, each with a very similar bar chart. The text of the queries will be provided here, but creating the queries and the visualizations and wiring them up to the dashboard will be left as an exercise to the user. The queries are as follows:

  • Top OSes (recommended os_count value == 6)
SELECT
  os,
  COUNTIF(days_since_seen < 28) AS mau
FROM
  telemetry.clients_last_seen
WHERE
  submission_date = '2019-04-01'
GROUP BY
  os
ORDER BY
  mau DESC
LIMIT
  {{os_count}}
  • Channel Counts
SELECT
  normalized_channel AS channel,
  COUNTIF(days_since_seen < 28) AS mau
FROM
  telemetry.clients_last_seen
WHERE
  submission_date = '2019-04-01'
GROUP BY
  channel
ORDER BY
  mau DESC
  • App Version Counts (recommended app_version_count value == 20)
SELECT
  app_name,
  app_version,
  COUNTIF(days_since_seen < 28) AS mau
FROM
  telemetry.clients_last_seen
WHERE
  submission_date = '2019-04-01'
GROUP BY
  app_name,
  app_version
ORDER BY
  mau DESC
LIMIT
  {{app_version_count}}

Creating bar charts for these queries and adding them to the original dashboard can result in a dashboard resembling this:

Completed Dashboard

Some final notes to help you create your dashboards:

  • Don't forget that you'll need to publish each of your queries before you can add its visualizations to your dashboard.

  • Similarly, it's a good idea to un-publish any test queries after you've used them in a dashboard so as not to permanently pollute everyone's search results with your tests and experiments. Queries that are the result of actual work-related analysis should usually remain published, so others can see and learn from them.

  • The 'Firefox' label on the 'App Version Counts' graph comes from the use of the 'Group by' visualization setting. I encourage you to experiment with 'Group by' in your graphs to learn more about how it can be used.

  • This tutorial has only scratched the surface of the wide variety of sophisticated visualizations supported by STMO. You can find many more sophisticated queries and dashboards by browsing around and exploring the work that has been published by others.

  • The Re:dash help center is useful for further deep diving into Re:dash and all of its capabilities.

Prototyping Queries

Sometimes you want to start working on your query before the data is available. You can do this with most of the data sources by selecting a static test data set, then working with it as usual. You can also use this method to explore how a given SQL backend behaves.

Note that UNION ALL will retain duplicate rows while UNION will discard them.

Here are a couple of examples:

Simple three-column test dataset

WITH test AS (
 SELECT 1 AS client_id, 'foo' AS v1, 'bar' AS v2 UNION ALL
 SELECT 2 AS client_id, 'bla' AS v1, 'baz' AS v2 UNION ALL
 SELECT 3 AS client_id, 'bla' AS v1, 'bar' AS v2 UNION ALL
 SELECT 2 AS client_id, 'bla' AS v1, 'baz' AS v2 UNION ALL
 SELECT 3 AS client_id, 'bla' AS v1, 'bar' AS v2
)

SELECT * FROM test

Convert a semantic version string to a sortable array field

WITH foo AS (
 SELECT '1.0.1' AS v UNION
 SELECT '1.10.3' AS v UNION
 SELECT '1.0.2' AS v UNION
 SELECT '1.1' AS v UNION
 -- Doesn't work with these types of strings due to casting
 -- SELECT '1.3a1' AS v UNION
 SELECT '1.2.1' AS v
)

SELECT cast(split(v, '.') AS array<bigint>) FROM foo ORDER BY 1

How do boolean fields get parsed from strings?

WITH bar AS (
 SELECT '1' AS b UNION
 SELECT '0' UNION
 SELECT 't' UNION
 SELECT 'f' UNION
 SELECT 'true' UNION
 SELECT 'false' UNION
 SELECT 'turkey'
)
SELECT b, try(cast(b AS boolean)) FROM bar

Analysis Gotchas

When performing analysis on any data there are some mistakes that are easy to make and details that are easy to overlook. Do you know exactly what question you hope to answer? Is your sample representative of your population? Is your result "real"? How precisely can you state your conclusion?

This document is not about those traps. Instead, it is about quirks and pitfalls specific to Telemetry.

Notable historic events

When looking at trends, it is helpful to be aware of events from the past that might impact comparisons with history. Here are a few to keep in mind:

  • October 29 2019 - Glean SDK Timing Distribution(s) are reporting buckets 1 nanosecond apart. This is due to a potential rounding bug in Glean SDK versions less than 19.0.0. See this bug.
  • October 23 2019 - Hot-fix shipped through add-ons that reset the Telemetry endpoint preference back to the default for a large number of users.
  • September 1 - October 18 2019 - BigQuery Ping tables are missing the X-PingSender-Version header information. This data is available before and after this time period.
  • May 4 - May 11 2019 - Telemetry source data deleted. No source data is available for this period and derived tables may have missing days or imputed values. Derived tables that depend on multiple days may have affected dates beyond the deletion region.
  • January 31 2019 - Profile-per-install landed in mozilla-central and affects how new profiles are created. See discussion in bigquery-etl#212.
  • October 25 2018 - many client_ids on Firefox Android were reset to the same client_id. For more information see the blameless post-mortem document here or bug 1501329.
  • November 2017 - Quantum Launch. There was a surge in new profiles and usage.
  • June 1 and 5, 2016 - Main Summary v4 data is missing for these two days.
  • March 2016 - Unified Telemetry launched.

Pseudo-replication

Telemetry data is a collection of pings. A single main-ping represents a single subsession. Some clients have more subsessions than others.

So when you say "63% of beta 53 has Firefox set as its default browser", make sure you specify it is 63% of pings, since it is only around 46% of clients. (Apparently users with Firefox Beta 53 set as their default browser submit more main-pings than users who don't).

Profiles vs Users

In the section above you'll notice I said "clients" not "users." That is because of all the things we're able to count, users isn't one of them.

Users can have multiple Firefox profiles running on the same computer at the same time (like developers).

Users can have the same Firefox profile running on several computers on different days of the week (also developers).

The only things we can count are pings and clients. Clients we can count because we send a client_id with each ping that uniquely identifies the profile from which it came. This is generally close enough to our idea of "user" that we can get away with counting profiles and calling them users, but you may run into some instances where the distinction matters.

When in doubt, be precise. You're counting clients.

Opt-in vs Opt-out

We don't collect the same information from everyone.

Every profile that doesn't have Telemetry disabled sends us "opt-out" Telemetry. This includes:

Most probes are "opt-in": we do not get information from them unless the user opts into sending us this information. Users can opt-in in two ways:

  1. Using Firefox's Options UI to tick the box that gives us permission
  2. Installing any pre-release version of Firefox

The nature of selection bias is such that the population in #1 is useless for analysis. If you want to encourage users to collect good information for us, ask them to install Beta: it's only slightly harder than finding and checking the opt-in check-box in Firefox.

Trusting Dates

Don't trust client times.

Any timestamp recorded by the user is subject to "clock skew." The user's clock can be set (purposefully or accidentally) to any time at all. The nature of SSL certificates tends to keep this within a relatively accurate window, because a user whose clock is too far in the past or too far in the future might confuse certain expiration-date-checking code.

Examples of client times: crashDate, crashTime, meta/Date, sessionStartDate, subsessionStartDate, profile/creationDate

Examples of server times you can trust: Timestamp, submission_date

Note that submissionDate does not appear in the ping documentation because it is added in post-processing. It can be found in the meta field of the ping as in the Databricks Example.

Date Formats

Not all dates and times are created equal. Most of the dates and times in Telemetry pings are ISO 8601. Most are full timestamps, though their resolution may range from per-second to per-day.

Then there's profile/creationDate which is just a number of days since epoch. Like 17177 for the date 2017-01-11.

Tip: To convert profile/creationDate to a usable date in SQL: DATE_ADD('day', profile_created, DATE '1970-01-01')
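
For example, a minimal sketch of that conversion in a full query (profile_created is the column name used in the tip above; the table name and partition format are assumptions):

-- DATE_ADD('day', 17177, DATE '1970-01-01') evaluates to DATE '2017-01-11'
SELECT
  profile_created,
  DATE_ADD('day', profile_created, DATE '1970-01-01') AS profile_creation_date
FROM main_summary
WHERE submission_date_s3 = '20190401'
LIMIT 100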

In derived datasets ISO dates are sometimes converted to strings in one of two formats: %Y-%m-%d or %Y%m%d.

The date formats for different rows in main_summary are described on the main_summary reference page.

Build ids look like dates but aren't. If you take the first eight characters you can use that as a proxy for the day the build was released.
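
For example, a sketch in Presto-style SQL that parses the first eight characters into a date (app_build_id and the table name are assumptions here):

SELECT
  app_build_id,
  DATE_PARSE(SUBSTR(app_build_id, 1, 8), '%Y%m%d') AS build_day
FROM main_summary
WHERE submission_date_s3 = '20190401'
LIMIT 100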

metadata/Date is an HTTP Date header in an RFC 7231-compatible format.

Tip: To parse metadata/Date to a usable date in SQL: DATE_PARSE(SUBSTR(client_submission_date, 1, 25), '%a, %d %b %Y %H:%i:%s')
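
Wrapped in a full query, that might look like the following sketch (client_submission_date is the column name used in the tip above; the table name and partition format are assumptions):

SELECT
  client_submission_date,
  DATE_PARSE(SUBSTR(client_submission_date, 1, 25), '%a, %d %b %Y %H:%i:%s') AS client_date
FROM main_summary
WHERE submission_date_s3 = '20190401'
LIMIT 100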

Delays

Telemetry data takes a while to get into our hands. The largest data mule in Telemetry is the main-ping. It is (pending bug 1336360) sent at the beginning of a client's next Firefox session. If the user shuts down their Firefox for the weekend, we won't get their Friday data until Monday morning.

A rule of thumb is that data from two days ago is usually fairly representative.

If you'd like to read more about this subject and look at pretty graphs, there are a series of blog posts here, here and here.

Pingsender

Pingsender greatly reduces delay in sending pings to Mozilla, but only some types of pings are sent by Pingsender. Bug 1310703 introduced Pingsender for crash pings and was merged in Firefox 54, which hit release on June 13, 2017. Bug 1336360 moved shutdown pings to Pingsender and was merged in Firefox 55, which hit release on August 8, 2017. Bug 1374270 added sending health pings on shutdown via Pingsender and was merged in Firefox 56, which hit release on Sept 28, 2017. Other types of pings are not sent with Pingsender. This is usually okay because Firefox is expected to continue running long enough to send those pings.

Mobile clients do not have Pingsender, so they suffer delay as given in this query.

Submission Date

submission_date is the server time at which a ping is received from the client. We use it to partition many of our data sets.

In bug 1422892 we decided to standardize on submission_date.

TL;DR

  • not subject to client clock skew
  • doesn't require normalization
  • good for backfill
  • good for daily processing
  • and usually good enough

Optimizing SQL Queries

After you write a query in STMO, you can often significantly improve its performance by understanding how the data is stored, what the databases are doing under the covers, and how you can change your query to take advantage of those two things.

Note that this advice is most relevant for the Presto, Athena, and Presto-Search data sources, as well as Spark SQL and Spark notebooks in general.

TL;DR: What to do for quick improvements

  • Switch to Athena
  • Filter on a partitioned column† (even if you have a LIMIT)
  • Select the columns you want explicitly (Don't use SELECT *)
  • Use approximate algorithms: e.g. approx_distinct(...) instead of COUNT(DISTINCT ...)

† Partitioned columns can be identified in the Schema Explorer in re:dash. They are the first few columns under a table name, and their name is preceded by a [P].
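
Putting a few of these tips together, here is a hedged sketch; the table, column, and partition names are illustrative and will vary by dataset:

SELECT
  normalized_channel,
  approx_distinct(client_id) AS approx_client_count  -- instead of COUNT(DISTINCT client_id)
FROM main_summary
WHERE submission_date_s3 = '20190401'  -- filter on the partitioned column
GROUP BY normalized_channel
LIMIT 100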

Some Explanations

There are a few key things to understand about our data storage and these databases to learn how to properly optimize queries.

What are these databases?

The databases we use are not traditional relational databases like PostgreSQL or MySQL. They are distributed SQL engines, where the data is stored separately from the cluster itself. They include multiple machines all working together in a coordinated fashion. This is why the clusters can get slow when there are lots of competing queries - because the queries are sharing resources.

Note that Athena is serverless, which is why we recommend people use that when they can.

How does this impact my queries?

What that means is that multiple machines will be working together to get the result of your query. Because there is more than one machine, we worry a lot about Data Shuffles: when all of the machines have to send data around to all of the other machines.

For example, consider the following query, which gives the number of rows present for each client_id:

SELECT client_id, COUNT(*)
FROM main_summary
GROUP BY client_id

The steps that would happen are this:

  1. Each machine reads a different piece of the data, and parses out the client_id for each row. Internally, it then computes the number of rows seen for each client_id, but only for the data that it read.
  2. Each machine is then given a set of client_ids to aggregate. For example, the first machine may be told to get the count of client1. It will then have to ask every other machine for the total seen for client1. It can then aggregate the total.
  3. Given that every client_id has now been aggregated, each machine reports to the coordinator the client_ids that it was responsible for, as well as the count of rows seen for each. The coordinator is responsible for returning the result of the query to the client, which in our example is STMO.

A similar process happens on data joins, where different machines are told to join on different keys. In that case, data from both tables needs to be shuffled to every machine.

Why do we have multiple databases? Why not use Athena for everything?

Great question! Presto is something we control, and we can upgrade it at will. Athena is currently a serverless version of Presto, and as such doesn't have all of the bells and whistles. For example, it doesn't support lambda expressions or UDFs, the latter of which we use in the Client Count Daily dataset.

Key Takeaways

  • Use Athena, since it doesn't have the resource constraints that Presto or Spark do.
  • Use LIMIT. At the end of a query, all the data needs to be sent to a single machine; using LIMIT will reduce that amount and possibly prevent an out-of-memory situation.
  • Use approximate algorithms. These mean less data needs to be shuffled, since we can use probabilistic data structures instead of the raw data itself.
  • Specify large tables first in a JOIN operation. That way, the smaller tables can be sent to every machine, eliminating one data shuffle operation (see the sketch below). Note that Spark also supports an explicit broadcast command.
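
Here is a minimal sketch of the join-ordering advice; main_summary stands in for the large table, and country_names is a hypothetical small lookup table used only for illustration:

SELECT
  ms.normalized_channel,
  COUNT(*) AS ping_count
FROM main_summary ms             -- large table listed first
JOIN country_names cn            -- hypothetical small lookup table
  ON ms.country = cn.country_code
WHERE ms.submission_date_s3 = '20190401'
GROUP BY ms.normalized_channel
LIMIT 100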

How is the data stored?

The data is stored in columnar format. Let's try and understand that with an example.

Traditional Row Stores

Consider a completely normal CSV file, which is actually an example of a row store.

name,age,height
"Ted",27,6.0
"Emmanuel",45,5.9
"Cadence",5,3.5

When this data is stored to disk, you can read an entire record in consecutive order. For example, if the first " was stored at block 1 on disk, then a sequential scan from block 1 will give the first row of data: "Ted",27,6.0. Keep scanning and you'll get \n"Emm... and so on.

So for the above, the following query will be fast:

SELECT *
FROM people
WHERE name = 'Ted'

This is fast, since the database can just scan the first row of data. However, the following is more difficult:

SELECT name
FROM people

Now the database has to read all of the rows, and then pick out the name column. This is a lot more overhead!

Columnar Stores

Columnar turns the data sideways. For example, we can make a columnar version of the above data, and still store it in CSV:

name,"Ted","Emmanuel","Cadence"
age,27,45,5
height,6.0,5.9,3.5

Pretty easy! Now let's consider how we can query the data when it's stored this way.

SELECT *
FROM people
WHERE name == "ted"

This query is pretty hard! We now have to read all of the data, because a row's (name, age, height) values aren't stored together.

Now let's consider our other query:

SELECT name
FROM people

Suddenly, this is easy! We don't have to check as many places for data; we can just read the first few blocks of the disk sequentially.

Data Partitions

We can improve performance even further by taking advantage of partitions. These are entire files of data that share a value for a column. So for example, if everyone in the people table lived in DE, then we could add that to the filename: /country=DE/people.csv.

From there, our query engine would have to know how to read that path, and understand that it's telling us that all of those people share a country. So if we were to query for this:

SELECT *
FROM people
WHERE country = 'US'

The database wouldn't even have to read the file! It could just look at the path and realize there was nothing of interest.

Our tables are often partitioned by date, e.g. submission_date_s3.
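
For example, a query that filters on a single day's partition only has to read that day's files (a sketch; the exact partition column and date format vary by table):

SELECT
  COUNT(*) AS row_count
FROM main_summary
WHERE submission_date_s3 = '20190401'  -- only this partition is read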

Key Takeaways

  • Limit queries to a specific few columns you need, to reduce the amount of data that has to be read
  • Filter on partitions to prune down the data you need

Getting Help

Mailing lists

Telemetry-related announcements including new datasets, outages, feature releases, etc. are sent to fx-data-dev@mozilla.org, a public mailing list. Please subscribe to the mailing list here.

There's also an internal mailing list, fx-data-platform@mozilla.com, meant for internal data platform team communications. Please speak to your manager if you believe you should be on the list.

IRC

The data platform team is available in #telemetry on irc.mozilla.org. For pipeline-specific issues, you can also find us in #datapipeline.

Slack

The Mozilla data org is reachable at #fx-metrics in the internal Mozilla Slack.

Tools

This section describes tools we recommend using to analyze Firefox data.

Projects

Below are a number of trailheads that lead into the projects and code that comprise the Firefox Data Platform.

Telemetry APIs

  • python_moztelemetry: Python APIs for Mozilla Telemetry
  • moztelemetry: Scala APIs for Mozilla Telemetry
  • spark-hyperloglog: Algebird's HyperLogLog support for Apache Spark
  • mozanalysis: A library for Mozilla experiments analysis
  • glean: A client-side mobile Telemetry SDK for collecting metrics and sending them to Mozilla's Telemetry service

ETL code and Datasets

  • telemetry-batch-view: Scala ETL code for derived datasets
  • python_mozetl: Python ETL code for derived datasets
  • telemetry-airflow: Airflow configuration and DAGs for scheduled jobs
  • python_mozaggregator: Aggregation job for telemetry.mozilla.org aggregates
  • telemetry-streaming: Spark Streaming ETL jobs for Mozilla Telemetry

See also firefox-data-docs for documentation on datasets.

Infrastructure

  • mozilla-pipeline-schemas: JSON and Parquet Schemas for Mozilla Telemetry and other structured data
  • hindsight: Real-time data processing
  • lua_sandbox: Generic sandbox for safe data analysis
  • lua_sandbox_extensions: Modules and packages that extend the Lua sandbox
  • nginx_moz_ingest: Nginx module for Telemetry data ingestion
  • puppet-config: Cloud services puppet config for deploying infrastructure
  • parquet2hive: Hive import statement generator for Parquet datasets
  • edge-validator: A service endpoint for validating incoming data
  • gcp-ingestion: Documentation and implementation of the Mozilla telemetry ingestion system on Google Cloud Platform

EMR Bootstrap scripts

  • emr-bootstrap-spark: AWS bootstrap scripts for Spark
  • emr-bootstrap-presto: AWS bootstrap scripts for Presto

Data applications

  • telemetry.mozilla.org: Main entry point for viewing aggregate Telemetry data
  • Cerberus & Medusa: Automatic alert system for telemetry aggregates
  • Mission Control: Low latency dashboard for stability and health metrics
  • Re:dash: Mozilla's fork of the data query / visualization system
  • redash-stmo: Mozilla's extensions to Re:dash
  • TAAR: Telemetry-aware addon recommender
  • Ensemble: A minimalist platform for publishing data
  • Hardware Report: Firefox Hardware Report, available here
  • python-zeppelin: Convert Zeppelin notebooks to Markdown
  • St. Mocli: A command-line interface to STMO
  • probe-scraper: Scrape and publish Telemetry probe data from Firefox
  • test-tube: Compare data across branches in experiments
  • experimenter: A web application for managing experiments
  • St. Moab: Automatically generate Re:dash dashboard for A/B experiments

Legacy projects

Projects in this section are less active, but may not be officially deprecated. Please check with the fx-data-dev mailing list before starting a new project using anything in this section.

  • telemetry-next-node: A node.js package for accessing Telemetry Aggregates data

Reference materials

Public

  • firefox-data-docs: All the info you need to answer questions about Firefox users with data
  • Firefox source docs: Mozilla Source Tree Docs - Telemetry section
  • reports.t.m.o: Knowledge repository for public reports

Non-public

  • Fx-Data-Planning: Quarterly goals and internal documentation

An overview of Mozilla’s Data Pipeline

Note: This article describes the AWS-based pipeline which is being retired; the client-side concepts here still apply, but this article will be updated to reflect the new GCP pipeline.

This post describes the architecture of Mozilla’s data pipeline, which is used to collect Telemetry data from our users and logs from various services. One of the cool perks of working at Mozilla is that most of what we do is out in the open, and because of that I can do more than just show you some diagram with arrows of our architecture; I can point you to the code, scripts & configuration that underlie it!

To make the examples concrete, the following description is centered around the collection of Telemetry data. The same tool-chain is used to collect, store and analyze data coming from disparate sources though, such as service logs.

graph TD
  firefox((fa:fa-firefox Firefox))-->|JSON| elb
  elb[Load Balancer]-->|JSON| nginx
  nginx-->|JSON| landfill(fa:fa-database S3 Landfill)
  nginx-->|protobuf| kafka[fa:fa-files-o Kafka]
  kafka-->|protobuf| cep(Hindsight CEP)
  kafka-->|protobuf| dwl(Hindsight DWL)
  cep--> hsui(Hindsight UI)
  dwl-->|protobuf| datalake(fa:fa-database S3 Data Lake)
  dwl-->|parquet| datalake
  datalake-->|parquet| prestodb
  prestodb-->redash[fa:fa-line-chart Re:dash]
  datalake-->spark
  spark-->datalake
  airflow[fa:fa-clock-o Airflow]-->|Scheduled tasks|spark{fa:fa-star Spark}
  spark-->|aggregations|rdbms(fa:fa-database PostgreSQL)
  rdbms-->tmo[fa:fa-bar-chart TMO]
  rdbms-->cerberus[fa:fa-search-plus Cerberus]


style firefox fill:#f61
style elb fill:#777
style nginx fill:green
style landfill fill:tomato
style datalake fill:tomato
style kafka fill:#aaa
style cep fill:palegoldenrod
style dwl fill:palegoldenrod
style hsui fill:palegoldenrod
style prestodb fill:cornflowerblue
style redash fill:salmon
style spark fill:darkorange
style airflow fill:lawngreen
style rdbms fill:cornflowerblue
style tmo fill:lightgrey
style cerberus fill:royalblue

Firefox

There are different APIs and formats to collect data in Firefox, all suiting different use cases:

  • histograms – for recording multiple data points;
  • scalars – for recording single values;
  • timings – for measuring how long operations take;
  • events – for recording time-stamped events.

These are commonly referred to as probes. Each probe must declare the collection policy it conforms to: either release or prerelease. When adding a new measurement, data reviewers carefully inspect the probe and eventually approve the requested collection policy:

  • Release data is collected from all Firefox users.
  • Prerelease data is collected from users on Firefox Nightly and Beta channels.

Users may choose to turn the data collection off in preferences.

A session begins when Firefox starts up and ends when it shuts down. As a session could be long-running and last weeks, it gets sliced into smaller logical units called subsessions. Each subsession generates a batch of data containing the current state of all probes collected so far, i.e. a main ping, which is sent to our servers. The main ping is just one of the many ping types we support. Developers can create their own ping types if needed.

Pings are submitted via an API that performs an HTTP POST request to our edge servers. If a ping fails to submit successfully (e.g. because of a missing internet connection), Firefox will store the ping on disk and retry sending it until the maximum ping age is exceeded.

Kafka

HTTP submissions coming in from the wild hit a load balancer and then an NGINX module. The module accepts data via an HTTP request which it wraps in a Hindsight protobuf message and forwards to two places: a Kafka cluster and a short-lived S3 bucket (landfill) which acts as a fail-safe in case there is a processing error and/or data loss within the rest of the pipeline. The deployment scripts and configuration files of NGINX and Kafka live in a private repository.

The data from Kafka is read by the Complex Event Processors (CEP) and the Data Warehouse Loader (DWL), both of which use Hindsight.

Hindsight

Hindsight, an open source stream processing software system developed by Mozilla as Heka’s successor, is useful for a wide variety of different tasks, such as:

  • converting data from one format to another;
  • shipping data from one location to another;
  • performing real time analysis, graphing, and anomaly detection.

Hindsight’s core is a lightweight data processing kernel written in C that controls a set of Lua plugins executed inside a sandbox.

The CEP is a set of custom plugins that are created, configured, and deployed from a UI, producing real-time plots such as the number of pings matching certain criteria. Mozilla employees can access the UI and create/deploy their own custom plugins in real time without interfering with the other plugins that are running.

CEP Custom Plugin

The DWL is composed of a set of plugins that transform, convert & finally shovel pings into S3 for long term storage. In the specific case of Telemetry data, an input plugin reads pings from Kafka, pre-processes them and sends batches to S3, our data lake, for long term storage. The data is compressed and partitioned by a set of dimensions, like date and application.

The data has traditionally been serialized to Protobuf sequence files which contain some nasty “free-form” JSON fields. Hindsight has recently gained the ability to dump data directly in Parquet form though.

The deployment scripts and configuration files of the CEP & DWL live in a private repository.

Spark

Once the data reaches our data lake on S3 it can be processed with Spark on Mozilla's Databricks instance. Databricks allows Mozilla employees to write custom analyses in notebooks, and also schedule Databricks jobs to run periodically.

As mentioned earlier, most of our data lake contains data serialized to Protobuf with free-form JSON fields. Needless to say, parsing JSON is terribly slow when ingesting terabytes of data per day. A set of ETL jobs, written in Scala by Data Engineers and scheduled with Airflow, create Parquet views of our raw data. We have a GitHub repository, telemetry-batch-view, that showcases this.

Aggregates Dataset

graph TD
%% Data Flow Diagram for mozaggregator/TMO-adjacent services
firefox((fa:fa-firefox Firefox)) --> |main ping| pipeline
fennec((fa:fa-firefox Fennec)) --> |saved-session ping| pipeline
pipeline((Telemetry Pipeline))

subgraph mozaggregator
  service(service)
  aggregator
  rdbms(fa:fa-database PostgreSQL)
end

pipeline --> aggregator
pipeline --> spark{fa:fa-star Spark}
pipeline --> redash[fa:fa-line-chart Re:dash]

subgraph telemetry.mozilla.org
  telemetry.js(telemetry.js) --> dist
  telemetry.js --> evo
  orphan[Update Orphaning]
  crashdash[tmo/crashes]
end

redash --> crashdash
service --> telemetry.js
spark --> orphan

telemetry.js --> telemetry-next-node(telemetry-next-node)
subgraph alerts.tmo
  cerberus[fa:fa-search-plus Cerberus] -->medusa
  medusa --> html
  medusa --> email
end

telemetry-next-node --> cerberus

style redash fill:salmon
style spark fill:darkorange
style rdbms fill:cornflowerblue
style cerberus fill:royalblue
style firefox fill:#f61
style fennec fill:#f61
style telemetry.js fill:lightgrey
style dist fill:lightgrey
style evo fill:lightgrey

A dedicated Spark job feeds daily aggregates to a PostgreSQL database which powers an HTTP service to easily retrieve faceted roll-ups. The service is mainly used by TMO, a dashboard that visualizes distributions and time-series, and by Cerberus, an anomaly detection tool that detects and alerts developers to changes in the distributions. Originally the sole purpose of the Telemetry pipeline was to feed data into this dashboard, but in time its scope and flexibility grew to support more general use-cases.

TMO

Presto & re:dash

We maintain a couple of Presto clusters and a centralized Hive metastore to query Parquet data with SQL. The Hive metastore provides a universal view of our Parquet dataset to both the Spark and Presto clusters.

Presto, and other databases, are behind a re:dash service (STMO) which provides a convenient & powerful interface to query SQL engines and build dashboards that can be shared within the company. Mozilla maintains its own fork of re:dash to iterate quickly on new features, but as good open source citizens we push our changes upstream.

STMO

Is that it?

No, not really. If you want to read more, check out this article. For example, the DWL pushes some of the Telemetry data to Redshift and other tools that satisfy more niche needs. The pipeline ingests logs from services as well and there are many specialized dashboards out there I haven’t mentioned.

There is a vast ecosystem of tools for processing data at scale, each with its pros & cons. The pipeline grew organically and we added new tools as new use-cases came up that we couldn’t solve with our existing stack. There are still scars left from that growth, though, which require some effort to get rid of, like ingesting data from schema-less formats.

A Detailed Look at the Data Platform

For a more gentle introduction to the data platform, please read the Pipeline Overview article.

This article goes into more depth about the architecture and flow of data in the platform.

The Entire Platform

The full detail of the platform can get quite complex, but at a high level the structure is fairly simple.

graph LR
  Producers[Data Producers] --> Ingestion
  Ingestion --> Storage[Long-term Storage]
  Ingestion --> Stream[Stream Processing]
  Stream --> Storage
  Batch[Batch Processing] --> Storage
  Storage --> Batch
  Self[Self Serve] -.- Stream
  Self -.- Batch
  Stream -.-> Visualization
  Batch -.-> Visualization
  Stream --> Export
  Batch --> Export

Each of these high-level parts of the platform are described in more detail below.

Data Producers

By far most data handled by the Data Platform is produced by Firefox. There are other producers, though, and the eventual aim is to generalize data production using a client SDK or set of standard tools.

Most data is submitted via HTTP POST, but data is also produced in the form of service logs and statsd messages.

If you would like to locally test a new data producer, the gzipServer project provides a simplified server that makes it easy to inspect submitted messages.

Ingestion

graph LR
  subgraph HTTP
    tee
    lb[Load Balancer]
    mozingest
  end
  subgraph Kafka
    kafka_unvalidated[Kafka unvalidated]
    kafka_validated[Kafka validated]
    zookeeper[ZooKeeper] -.- kafka_unvalidated
    zookeeper -.- kafka_validated
  end
  subgraph Storage
    s3_heka[S3 Heka Protobuf Storage]
    s3_parquet[S3 Parquet Storage]
  end
  subgraph Data Producers
    Firefox --> lb
    more_producers[Other Producers] --> lb
  end

  lb --> tee
  tee --> mozingest
  mozingest --> kafka_unvalidated
  mozingest --> Landfill
  kafka_unvalidated --> dwl[Data Store Loader]
  kafka_validated --> cep[Hindsight CEP]
  kafka_validated --> sparkstreaming[Spark Streaming]
  Schemas -.->|validation| dwl
  dwl --> kafka_validated
  dwl --> s3_heka
  dwl --> s3_parquet
  sparkstreaming --> s3_parquet

Data arrives as an HTTP POST of an optionally gzipped payload of JSON. See the common Edge Server specification for details.

Submissions hit a load balancer which handles the SSL connection, then forwards to a "tee" server, which may direct some or all submissions to alternate backends. In the past, the tee was used to manage the cutover between different versions of the backend infrastructure. It is implemented as an OpenResty plugin.

From there, the mozingest HTTP Server receives submissions from the tee and batches and stores data durably on Amazon S3 as a fail-safe (we call this "Landfill"). Data is then passed along via Kafka for validation and further processing. If there is a problem with decoding, validation, or any of the code described in the rest of this section, data can be re-processed from this fail-safe store. The mozingest server is implemented as an nginx module.

Validation, at a minimum, ensures that a payload is valid JSON (possibly compressed). Many document types also have a JSONSchema specification, and are further validated against that.

Invalid messages are redirected to a separate "errors" stream for debugging and inspection.

Valid messages proceed for further decoding and processing. This involves things like doing GeoIP lookup and discarding the IP address, and attaching some HTTP header info as annotated metadata.

Validated and annotated messages become available for stream processing.

They are also batched and stored durably for later batch processing and ad-hoc querying.

See also the "generic ingestion" proposal which aims to make ingestion, validation, storage, and querying available as self-serve for platform users.

Data flow for valid submissions
sequenceDiagram
    participant Fx as Firefox
    participant lb as Load Balancer
    participant mi as mozingest
    participant lf as Landfill
    participant k as Kafka
    participant dwl as Data Store Loader
    participant dl as Data Lake

    Fx->>lb: HTTPS POST
    lb->>mi: forward
    mi-->>lf: failsafe store
    mi->>k: enqueue
    k->>dwl: validate, decode
    dwl->>k: enqueue validated
    dwl->>dl: store durably
Other ingestion methods

Hindsight is used for the ingestion of logs from applications and services; it supports parsing log lines and appending similar metadata as the HTTP ingestion above (timestamp, source, and so on).

Statsd messages are ingested in the usual way.

Storage

graph TD
  subgraph RDBMS
    PostgreSQL
    Redshift
    MySQL
    BigQuery
  end
  subgraph NoSQL
    DynamoDB
  end
  subgraph S3
    landfill[Landfill]
    s3_heka[Heka Data Lake]
    s3_parquet[Parquet Data Lake]
    s3_analysis[Analysis Outputs]
    s3_public[Public Outputs]
  end

  Ingestion --> s3_heka
  Ingestion --> s3_parquet
  Ingestion --> landfill
  Ingestion -.-> stream[Stream Processing]
  stream --> s3_parquet
  batch[Batch Processing] --> s3_parquet
  batch --> PostgreSQL
  batch --> DynamoDB
  batch --> s3_public
  selfserve[Self Serve] --> s3_analysis
  s3_analysis --> selfserve
  Hive -->|Presto| redash[Re:dash]
  PostgreSQL --> redash
  Redshift --> redash
  MySQL --> redash
  BigQuery --> redash

  s3_parquet -.- Hive

Amazon S3 forms the backbone of the platform storage layer. The primary format used in the Data Lake is parquet, which is a strongly typed columnar storage format that can easily be read and written by Spark, as well as being compatible with SQL interfaces such as Hive and Presto. Some data is also stored in Heka-framed protobuf format. This custom format is usually reserved for data where we do not have a complete JSONSchema specification.

Using S3 for storage avoids the need for an always-on cluster, which means that data at rest is inexpensive. S3 also makes it very easy to automatically expire (delete) objects after a certain period of time, which is helpful for implementing data retention policies.

Once written to S3, the data is typically treated as immutable - data is not appended to existing files, nor is data normally updated in place. The exception here is when data is back-filled, in which case previous data may be overwritten.

There are a number of other types of storage used for more specialized applications, including relational databases (such as PostgreSQL for the Telemetry Aggregates) and NoSQL databases (DynamoDB is used for a backing store for the TAAR project). Reading data from a variety of RDBMS sources is also supported via Re:dash.

The data stored in Heka format is readable from Spark using libraries in Scala or Python.

Parquet data can be read and written natively from Spark, and many datasets are indexed in a Hive Metastore, making them available through a SQL interface on Re:dash and in notebooks via Spark SQL. Many other SQL data sources are also made available via Re:dash, see this article for more information on accessing data using SQL.

There is a separate data store for self-serve Analysis Outputs, intended to keep ad-hoc, temporary data out of the Data Lake. This is implemented as a separate S3 location, with personal output locations prefixed with each person's user id, similar to the layout of the /home directory on a Unix system.

Analysis outputs can also be made public using the Public Outputs bucket. This is a web-accessible S3 location for powering public dashboards. This public data is available at https://analysis-output.telemetry.mozilla.org/<job name>/data/<files>.

Stream Processing

Stream processing is done using Hindsight and Spark Streaming.

Hindsight allows you to run plugins written in Lua inside a sandbox. This gives a safe, performant way to do self-serve streaming analysis. See this article for an introduction. Hindsight plugins do the initial data validation and decoding, as well as writing out to long-term storage in both Heka-framed protobuf and parquet forms.

Spark Streaming is used to read from Kafka and perform low-latency ETL and aggregation tasks. These aggregates are currently used by Mission Control and are also available for querying via Re:dash.

Batch Processing

Batch processing is done using Spark. Production ETL code is written in both Python and Scala.

There are Python and Scala libraries for reading data from the Data Lake in Heka-framed protobuf form, though it is much easier and more performant to make use of a derived dataset whenever possible.

Datasets in parquet format can be read natively by Spark, either using Spark SQL or by reading data directly from S3.

Data produced by production jobs goes into the Data Lake, while output from ad-hoc jobs goes into Analysis Outputs.

Job scheduling and dependency management is done using Airflow. Most jobs run once a day, processing data from "yesterday" on each run. A typical job launches a cluster, which fetches the specified ETL code as part of its bootstrap on startup, runs the ETL code, then shuts down upon completion. If something goes wrong, a job may time out or fail, and in this case it is retried automatically.

Self Serve Data Analysis

graph TD
  subgraph Storage
    lake[Data Lake]
    s3_output_public[Public Outputs]
    s3_output_private[Analysis Outputs]
  end
  subgraph STMO
    redash[Re:dash] -->|read| lake
  end
  subgraph TMO
    evo[Evolution Dashboard]
    histo[Histogram Dashboard]
    agg[Telemetry Aggregates]
    evo -.- agg
    histo -.- agg
  end
  subgraph Databricks
    db_notebook[Notebook]
    db_notebook -->|read + write| lake
  end

Most of the data analysis tooling has been developed with the goal of being "self-serve". This means that people should be able to access and analyze data on their own, without involving data engineers or operations. This allows data access to scale beyond a small set of people with specialized knowledge of the entire pipeline.

The use of these self-serve tools is described in the Getting Started article. This section focuses on how these tools integrate with the platform infrastructure.

STMO: SQL Analysis

STMO is a customized Re:dash installation that provides self-serve access to a variety of different datasets. From here, you can query data in the Parquet Data Lake as well as various RDBMS data sources.

STMO interfaces with the data lake using both Presto and Amazon Athena. Each has its own data source in Re:dash. Since Athena does not support user-defined functions, datasets with HyperLogLog columns, such as client_count_daily, are only available via Presto.

Different Data Sources in STMO connect to different backends, and each backend might use a slightly different flavor of SQL. You should find a link to the documentation for the expected SQL variant next to the Data Sources list.

Queries can be run just once, or scheduled to run periodically to keep data up-to-date.

There is a command-line interface to STMO called St. Mocli, if you prefer writing SQL using your own editor and tools.

Databricks: Managed Spark Analysis

Our Databricks instance (see Databricks docs) offers another notebook interface for doing analysis in Scala, SQL, Python and R.

Databricks provides an always-on shared server which is nice for quick data investigations.

ATMO (deprecated): Spark Analysis

ATMO was a service for managing Spark clusters for data analysis on AWS. It was deprecated in Q3 2019 and removed in Q4.

TMO: Aggregate Graphs

TMO provides easy visualizations of histogram and scalar measures over time. Time can be in terms of either builds or submission dates. This is the most convenient interface to the Telemetry data, as it does not require any custom code.

Visualization

There are a number of visualization libraries and tools being used to display data.

TMO Dashboards

The landing page at telemetry.mozilla.org is a good place to look for existing graphs, notably the measurement dashboard which gives a lot of information about histogram and scalar measures collected on pre-release channels.

Notebooks

Use of interactive notebooks has become a standard in the industry, and Mozilla makes heavy use of this approach. Databricks makes it easy to run, share, and schedule notebooks.

Others

Re:dash lets you query the data using SQL, but it also supports a number of useful visualizations.

Hindsight's web interface has the ability to visualize time-series data.

Mission Control gives a low-latency view into release health.

Many bespoke visualizations are built using the Metrics Graphics library as a display layer.

Monitoring and Alerting

There are multiple layers of monitoring and alerting.

At a low level, the system is monitored to ensure that it is functioning as expected. This includes things like machine-level resources (network capacity, disk space, available RAM, CPU load) which are typically monitored using DataDog.

Next, we monitor the "transport" functionality of the system. This includes monitoring incoming submission rates, payload sizes, traffic patterns, schema validation failure rates, and alerting if anomalies are detected. This type of anomaly detection and alerting is handled by Hindsight.

Once data has been safely ingested and stored, we run some automatic regression detection on all Telemetry histogram measures using Cerberus. This code looks for changes in the distribution of a measure, and emails probe owners if a significant change is observed.

Production ETL jobs are run via Airflow, which monitors batch job progress and alerts if there are failures in any job. Self-serve batch jobs running via Databricks also generate alerts upon failure.

Scheduled Re:dash queries may also be configured to generate alerts, which is used to monitor the last-mile, user-facing status of derived datasets. Re:dash may also be used to monitor and alert on high-level characteristics of the data, or really anything you can think of.

Data Exports

Data is exported from the pipeline to a few other tools and systems. Examples include integration with Amplitude for mobile and product analytics, publishing reports and visualizations to the Mozilla Data Collective, and shipping data to other parts of the Mozilla organization.

There are also a few data sets which are made publicly available, such as the Firefox Hardware Report.

Bringing it all together

Finally, here is a more detailed view of the entire platform. Some connections are omitted for clarity.

graph LR
 subgraph Data Producers
  Firefox
  more_producers[...]
 end
 subgraph Storage
  Landfill
  warehouse_heka[Heka Data Lake]
  warehouse_parquet[Parquet Data Lake]
  warehouse_analysis[Analysis Outputs]
  PostgreSQL
  Redshift
  MySQL
  hive[Hive] -.- warehouse_parquet
 end
 subgraph Stream Processing
  cep[Hindsight Streaming]
  dwl[Data Store Loader] --> warehouse_heka
  dwl --> warehouse_parquet
  sparkstreaming[Spark Streaming] --> warehouse_parquet
 end
 subgraph Ingestion
  Firefox --> lb[Load Balancer]
  more_producers --> lb
  lb --> tee
  tee --> mozingest
  mozingest --> kafka
  mozingest --> Landfill
  ZooKeeper -.- kafka[Kafka]
  kafka --> dwl
  kafka --> cep
  kafka --> sparkstreaming
 end
 subgraph Batch Processing
  Airflow -.->|spark|tbv[telemetry-batch-view]
  Airflow -.->|spark|python_mozetl
  warehouse_heka --> tbv
  warehouse_parquet --> tbv
  warehouse_heka --> python_mozetl
  warehouse_parquet --> python_mozetl
  tmo_agg[Telemetry Aggregates]
 end
 subgraph Visualization
  Hindsight
  Jupyter
  Zeppelin
  TMO
  redash_graphs[Re:dash]
  MissionControl
  bespoke_viz[Bespoke Viz]
 end
 subgraph Export
  tbv --> Amplitude
  sparkstreaming --> Amplitude
 end
 subgraph Self Serve
  redash[Re:dash] -.-> Presto
  Presto --> hive
  redash -.-> Athena
  Athena --> hive
  warehouse_heka --> spcluster
  warehouse_parquet --> spcluster
  spcluster --> warehouse_analysis
 end
 Schemas -.->|validation| dwl

HTTP Edge Server Specification

This document specifies the behavior of the server that accepts submissions from any HTTP client, e.g. Firefox telemetry.

The original implementation of the HTTP Edge Server was tracked in Bug 1129222.

General Data Flow

HTTP submissions come in from the wild, hit a load balancer, then optionally an Nginx proxy, then the HTTP Edge Server described in this document. Data is accepted via a POST/PUT request from clients, which the server will wrap in a Heka message and forward to two places: the Services Data Pipeline, where any further processing, analysis, and storage will be handled; as well as to a short-lived S3 bucket which will act as a fail-safe in case there is a processing error and/or data loss within the main Data Pipeline.

Namespaces

Namespaces are used to control the processing of data from different types of clients, from the metadata that is collected to the destinations where the data is written, processed and accessible. Namespaces are configured in Nginx using a location directive. To request a new namespace, file a bug against the Data Platform Team with a short description of what the namespace will be used for and the desired configuration options. Data sent to a namespace that is not specifically configured is assumed to be in the non-Telemetry JSON format described here.

Forwarding to the pipeline

The constructed Heka protobuf message is written to disk and to the pub/sub pipeline (currently Kafka). The messages written to disk serve as a fail-safe; they are batched and written to S3 (landfill) when they reach a certain size or timeout.

Edge Server Heka Message Schema

  • required binary Uuid; // Internal identifier randomly generated
  • required int64 Timestamp; // Submission time (server clock)
  • required string Hostname; // Hostname of the edge server e.g. ip-172-31-2-68
  • required string Type; // Kafka topic name e.g. telemetry-raw
  • required group Fields
    • required string uri; // Submission URI e.g. /submit/telemetry/6c49ec73-4350-45a0-9c8a-6c8f5aded0cf/main/Firefox/58.0.2/release/20180206200532
    • required binary content; // POST Body
    • required string protocol; // e.g. HTTP/1.1
    • optional string args; // Query parameters e.g. v=4
    • optional string remote_addr; // In our setup it is usually a load balancer e.g. 172.31.32.229
    • // HTTP Headers specified in the production edge server configuration
    • optional string Content-Length; // e.g. 4722
    • optional string Date; // e.g. Mon, 12 Mar 2018 00:02:18 GMT
    • optional string DNT; // e.g. 1
    • optional string Host; // e.g. incoming.telemetry.mozilla.org
    • optional string User-Agent; // e.g. pingsender/1.0
    • optional string X-Forwarded-For; // Last entry is treated as the client IP for geoIP lookup e.g. 10.98.132.74, 103.3.237.12
    • optional string X-PingSender-Version;// e.g. 1.0

Server Request/Response

GET Request

Accept GET on /status, returning OK if all is well. This can be used to check the health of web servers.

GET Response codes

  • 200 - OK. /status and all’s well
  • 404 - Any GET other than /status
  • 500 - All is not well

POST/PUT Request

Treat POST and PUT the same. Accept POST or PUT to URLs of the form

^/submit/namespace/[id[/dimensions]]$

Example Telemetry format:

/submit/telemetry/docId/docType/appName/appVersion/appUpdateChannel/appBuildID

Specific Telemetry example:

/submit/telemetry/ce39b608-f595-4c69-b6a6-f7a436604648/main/Firefox/61.0a1/nightly/20180328030202

Example non-Telemetry format:

/submit/namespace/doctype/docversion/docid

Specific non-Telemetry example:

/submit/eng-workflow/hgpush/1/2c3a0767-d84a-4d02-8a92-fa54a3376049

Note that id above is a unique document ID, which is used for de-duping submissions. This is not intended to be the clientId field from Telemetry. id is required, and it is recommended that id be a UUID.

POST/PUT Response codes

  • 200 - OK. Request accepted into the pipeline.
  • 400 - Bad request, for example an un-encoded space in the URL.
  • 404 - not found - POST/PUT to an unknown namespace
  • 405 - wrong request type (anything other than POST/PUT)
  • 411 - missing content-length header
  • 413 - request body too large (Note that if we have badly-behaved clients that retry on 4XX, we should send back 202 on body/path too long).
  • 414 - request path too long (See above)
  • 500 - internal error

Other Considerations

Compression

It is not desirable to do decompression on the edge node. We want to pass along messages from the HTTP Edge node without "cracking the egg" of the payload.

We may also receive badly formed payloads, and we will want to track the incidence of such things within the main pipeline.

Bad Messages

Since the actual message is not examined by the edge server the only failures that occur are defined by the response status codes above. Messages are only forwarded to the pipeline when a response code of 200 is returned to the client.

GeoIP Lookups

No GeoIP lookup is performed by the edge server. If a client IP is available, the data warehouse loader performs the lookup and then discards the IP before the message hits long-term storage.

Data Retention

The edge server only stores data while batching and will have a retention time of moz_ingest_landfill_roll_timeout which is generally only a few minutes. Retention time for the S3 landfill, pub/sub, and the data warehouse is outside the scope of this document.

Event Data Pipeline

We collect event-oriented data from different sources. This data is collected and processed in a specific path through our data pipeline, which we will detail here.

graph TD

subgraph Products
fx_code(fa:fa-cog Firefox code) --> firefox(fa:fa-firefox Firefox Telemetry)
fx_extensions(fa:fa-cog Mozilla extensions) --> firefox
mobile(fa:fa-cog Mobile products) --> mobile_telemetry(fa:fa-firefox Mobile Telemetry)
end

subgraph Data Platform
firefox -.->|main ping, Firefox <62| pipeline((fa:fa-database Firefox Data Pipeline))
firefox -->|event ping, Firefox 62+| pipeline
mobile_telemetry --> |mobile events ping| pipeline
pipeline -->|Firefox <62 events| main_summary[fa:fa-bars main summary table]
pipeline -->|Firefox 62+ events| events_table[fa:fa-bars events table]
main_summary --> events_table
pipeline -->|Mobile events| mobile_events_table[fa:fa-bars mobile events table]
end

subgraph Data Tools
events_table --> redash
mobile_events_table --> redash
main_summary --> redash(fa:fa-bar-chart Redash)
pipeline -->|on request| amplitude(fa:fa-bar-chart Amplitude)
end

style fx_code fill:#f94,stroke-width:0px
style fx_extensions fill:#f94,stroke-width:0px
style mobile fill:#f94,stroke-width:0px
style firefox fill:#f61,stroke-width:0px
style mobile_telemetry fill:#f61,stroke-width:0px
style pipeline fill:#79d,stroke-width:0px
style main_summary fill:lightblue,stroke-width:0px
style events_table fill:lightblue,stroke-width:0px
style mobile_events_table fill:lightblue,stroke-width:0px
style redash fill:salmon,stroke-width:0px
style amplitude fill:salmon,stroke-width:0px

Overview

Across the different Firefox teams there is a common need for a more fine-grained understanding of product usage, like understanding the order of interactions or how they occur over time. To address that, our data pipeline needs to support working with event-oriented data.

We specify a common event data format, which allows for broader, shared usage of data processing tools. To make working with event data feasible, we provide different mechanisms to get the event data from products to our data pipeline and make the data available in tools for analysis.

The event format

Events are submitted as an array, e.g.:

[
  [2147, "ui", "click", "back_button"],
  [2213, "ui", "search", "search_bar", "google"],
  [2892, "ui", "completion", "search_bar", "yahoo",
    {"querylen": "7", "results": "23"}],
  [5434, "dom", "load", "frame", null,
    {"prot": "https", "src": "script"}],
  // ...
]

Each event is of the form:

[timestamp, category, method, object, value, extra]

Where the individual fields are:

  • timestamp: Number, positive integer. This is the time in ms when the event was recorded, relative to the main process start time.
  • category: String, identifier. The category is a group name for events and helps to avoid name conflicts.
  • method: String, identifier. This describes the type of event that occurred, e.g. click, keydown or focus.
  • object: String, identifier. This is the object the event occurred on, e.g. reload_button or urlbar.
  • value: String, optional, may be null. This is a user defined value, providing context for the event.
  • extra: Object, optional, may be null. This is an object of the form {"key": "value", ...}, both keys and values need to be strings. This is used for events when additional richer context is needed.

See also the Firefox Telemetry documentation.

Event data collection

Firefox event collection

To collect this event data, there are different APIs in Firefox, all addressing different use cases:

For all these APIs, events will get sent to the pipeline through the event ping, which gets sent hourly, if any events were recorded, or up to every 10 minutes whenever 1000 events were recorded. Before Firefox 62, events were sent through the main ping instead, with a hard limit of 500 events per ping. From Firefox 61, all events recorded through these APIs are automatically counted in scalars.

Finally, custom pings can follow the event data format and potentially connect to the existing tooling with some integration work.

Mobile event collection

Mobile events data primarily flows through the mobile events ping (ping schema), from e.g. Firefox iOS, Firefox for Fire TV and Rocket.

Currently we also collect event data from Firefox Focus through the focus-events ping, using the telemetry-ios and telemetry-android libraries.

Datasets

On the pipeline side, the event data is made available in different datasets:

  • main_summary has a row for each main ping and includes its event payload for Firefox versions before 62.
  • events contains a row for each event received from main pings and event pings. See this sample query.
  • telemetry_mobile_event_parquet contains a row for each mobile event ping. See this sample query.
  • focus_events_longitudinal currently contains events from Firefox Focus.

Data tooling

The above datasets are all accessible through Re:dash and Spark jobs.
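
For example, from a notebook you can load one of these tables with Spark SQL. This is only a minimal sketch; it assumes an events table is registered in your cluster's metastore (the exact table name may differ) and uses the spark session object described later in this document:

events = spark.sql("SELECT * FROM events")
events.printSchema()
events.limit(10).show()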

For product analytics based on event data, we have Amplitude (hosted by the IT data team). We can connect our event data sources to Amplitude. We have an active connector for mobile events, which pushes event data to Amplitude daily. For Firefox Desktop events this will be available soon.

Mozilla Firefox Data Analysis Tools

This is a starting point for making sense of (and gaining access to) all of the Firefox-related data analysis tools. There are a number of different tools available, all with their own strengths, tailored to a variety of use cases and skill sets.

sql.telemetry.mozilla.org (STMO)

The sql.telemetry.mozilla.org (STMO) site is an instance of the very fine Re:dash software, allowing for SQL-based exploratory analysis and visualization / dashboard construction. Requires (surprise!) familiarity with SQL, and for your data to be explicitly exposed as an STMO data source. Bugs or feature requests can be reported in our issue tracker.

analysis.telemetry.mozilla.org (ATMO)

The analysis.telemetry.mozilla.org (ATMO) site can be used to launch and gain access to virtual machines running Apache Spark clusters which have been pre-configured with access to the raw data stored in our long term storage S3 buckets. Spark allows you to use Python or Scala to perform arbitrary analysis and generate arbitrary output. Once developed, ATMO can also be used to run recurring Spark jobs for data transformation, processing, or reporting. Requires Python or Scala programming skills and knowledge of various data APIs. Learn more by visiting the documentation or tutorials.

Databricks

Offers a notebook interface with a shared, always-on, autoscaling cluster (attaching your notebooks to shared_serverless is the best way to start). Convenient for quick data investigations. Users can get help in the #databricks channel on IRC and are advised to join the databricks-discuss@mozilla.com group.

telemetry.mozilla.org (TMO)

Our telemetry.mozilla.org (TMO) site is the 'venerable standby' of Firefox telemetry analysis tools. It uses aggregate telemetry data (as opposed to the collated data sets exposed to most of the other tools), so it provides lower latency than most, but is unsuitable for examining data at the individual client level. It provides a powerful UI that allows for sophisticated ad-hoc analysis without the need for any specialized programming skills, but with so many options the UI can be a bit intimidating for novice users.

Real Time / CEP

The "real time" or "complex event processing" (CEP) system is part of the ingestion infrastructure that processes all of our Firefox telemetry data. It provides extremely low latency access to the data as it's flowing through our ingestion system on its way to long term storage. As a CEP system, it is unlike the rest of our analysis tools in that it is up to the analyst to specify and maintain state from the data that is flowing; it is non-trivial to revisit older data that has already passed through the system. The CEP is very powerful, allowing for sophisticated monitoring, alerting, reporting, and dashboarding. Developing new analysis plugins requires knowledge of the Lua programming language, relevant APIs, and a custom filter configuration syntax. Learn more about how to do this in our Creating a Real-time Analysis Plugin article.

Introduction

Apache Spark is a data processing engine designed to be fast and easy to use. We have set up Jupyter notebooks that use Spark to analyze our Telemetry data. Jupyter notebooks can be easily shared and updated among colleagues, and, when combined with Spark, enable richer analysis than SQL alone.

The Spark clusters can be launched from ATMO. The Spark Python API is called PySpark.

Note that this documentation focuses on ATMO, which is deprecated. Databricks is the preferred Spark analysis platform. For more information please see this example notebook.

Setting Up a Spark Cluster On ATMO

  1. Go to https://analysis.telemetry.mozilla.org
  2. Click “Launch an ad-hoc Spark cluster”.
  3. Enter some details:
    1. The “Cluster Name” field should be a short descriptive name, like “chromehangs analysis”.
    2. Set the number of workers for the cluster. Please keep in mind to use resources sparingly; use a single worker to write and debug your job.
    3. Upload your SSH public key.
  4. Click “Submit”.
  5. A cluster will be launched on AWS pre-configured with Spark, Jupyter and some handy data analysis libraries like pandas and matplotlib.

Once the cluster is ready, you can tunnel Jupyter through SSH by following the instructions on the dashboard. For example:

ssh -i ~/.ssh/id_rsa -L 8888:localhost:8888 hadoop@ec2-54-70-129-221.us-west-2.compute.amazonaws.com

Finally, you can launch Jupyter in Firefox by visiting http://localhost:8888.

The Python Jupyter Notebook

When you access http://localhost:8888, two example Jupyter notebooks are available to peruse.

Starting out, we recommend looking through the Telemetry Hello World notebook. It gives a nice overview of Jupyter and analyzing telemetry data using PySpark and the RDD API.

Using Jupyter

Jupyter Notebooks contain a series of cells. Each cell contains code or markdown. To switch between the two, use the drop-down at the top. To run a cell, use shift-enter; this either compiles the markdown or runs the code. To create a new cell, select Insert -> Insert Cell Below.

A cell can output text or plots. To output plots inlined with the cell, run %pylab inline, usually below your import statements.
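
For illustration, a minimal cell might look like the following (pandas is just an example here; any library that renders through matplotlib will work):

import pandas as pd
%pylab inline

# a quick sanity-check plot; the figure renders inline below the cell
pd.Series([1, 3, 2, 5]).plot()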

The notebook is set up to work with Spark. See the "Using Spark" section for more information.

Schedule a periodic job

Scheduled Spark jobs allow a Jupyter notebook to be updated consistently, making a nice and easy-to-use dashboard.

To schedule a Spark job:

  1. Visit the analysis provisioning dashboard at https://analysis.telemetry.mozilla.org and sign in
  2. Click “Schedule a Spark Job”
  3. Enter some details:
    1. The “Job Name” field should be a short descriptive name, like “chromehangs analysis”.
    2. Upload your Jupyter notebook containing the analysis.
    3. Set the number of workers of the cluster in the “Cluster Size” field.
    4. Set a schedule frequency using the remaining fields.

Now, the notebook will be updated automatically and the results can be easily shared. Furthermore, all files stored in the notebook's local working directory at the end of the job will be automatically uploaded to S3, which comes in handy, for example, for simple ETL workloads.
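
For instance, a final notebook cell could write a small summary file into the working directory so the scheduled job publishes it to S3. This is only a sketch; the file name and contents are illustrative:

import pandas as pd

# anything written to the local working directory at the end of the job
# is uploaded to S3 automatically
summary = pd.DataFrame({"os": ["Darwin", "Windows_NT"], "count": [123, 456]})
summary.to_csv("os_counts.csv", index=False)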

For reference, see Simple Dashboard with Scheduled Spark Jobs and Plotly.

Sharing a Notebook

Jupyter notebooks can be shared in a few different ways.

Sharing a Static Notebook

An easy way to share is using a gist on Github.

  1. Download file as .ipynb
  2. Upload to a gist on gist.github.com
  3. Enter the gist URL at Jupyter nbviewer
  4. Share with your colleagues!

Sharing a Scheduled Notebook

Set up your scheduled notebook. After it has run, do the following:

  1. Go to the "Schedule a Spark job" tab in ATMO
  2. Get the URL for the notebook (under 'Currently Scheduled Jobs')
  3. Paste that URL into Jupyter nbviewer

Zeppelin Notebooks

We also have support for Apache Zeppelin notebooks. The notebook server for that is running on port 8890, so you can connect to it just by tunnelling the port (instead of port 8888 for Jupyter). For example:

ssh -i ~/.ssh/id_rsa -L 8890:localhost:8890 hadoop@ec2-54-70-129-221.us-west-2.compute.amazonaws.com

Using Spark

Spark is a general-purpose cluster computing system - it allows users to run general execution graphs. APIs are available in Python, Scala, and Java. The Jupyter notebook utilizes the Python API. In a nutshell, it provides a way to run functional code (e.g. map, reduce, etc.) on large, distributed data.

Check out Spark Best Practices for tips on using Spark to its full capabilities.

SparkContext (sc)

Access to the Spark API is provided through SparkContext. In the Jupyter notebook, this is the sc object. For example, to create a distributed RDD of monotonically increasing numbers 1-1000:

numbers = range(1, 1001)
# no need to initialize sc in the Jupyter notebook
numsRdd = sc.parallelize(numbers)
numsRdd.take(10)  # no guaranteed order

Spark RDD

The Resilient Distributed Dataset (RDD) is Spark's basic data structure. The operations that are performed on these structures are distributed to the cluster. Only certain actions (such as collect() or take(N)) pull an RDD in locally.

RDDs are nice because there is no imposed schema - whatever they contain, they distribute around the cluster. Additionally, RDDs can be cached in memory, which can greatly improve performance of some algorithms that need access to data over and over again.
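
For example, continuing with the numsRdd defined above, a small sketch of caching an RDD before repeated use:

doubled = numsRdd.map(lambda x: x * 2).cache()
doubled.count()  # the first action computes the RDD and caches its partitions
doubled.sum()    # subsequent actions reuse the cached data instead of recomputing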

Additionally, RDD operations are all part of a directed, acyclic graph. This gives increased redundancy, since Spark is always able to recreate an RDD from the base data (by rerunning the graph), but also provides lazy evaluation. No computation is performed while an RDD is just being transformed (a la map), but when an action is taken (e.g. reduce, take) the entire computation graph is evaluated. Continuing from our previous example, the following gives some of the peaks of a sine wave:

import numpy as np
#no computation is performed on the following line!
sin_values = numsRdd.map(lambda x : np.float(x) / 10).map(lambda x : (x, np.sin(x)))
#now the entire computation graph is evaluated
sin_values.takeOrdered(5, lambda x : -x[1])

For jumping into working with Spark RDDs, we recommend reading the Spark Programming Guide.

Spark SQL and Spark DataFrames/Datasets

Spark also supports traditional SQL, along with special data structures that require schemas. The Spark SQL API can be accessed with the spark object. For example:

   longitudinal = spark.sql('SELECT * FROM longitudinal')

creates a DataFrame that contains all the longitudinal data. A Spark DataFrame is essentially a distributed table, a la Pandas or R DataFrames. Under the covers they are an RDD of Row objects, and thus the entirety of the RDD API is available for DataFrames, as well as a DataFrame specific API. For example, a SQL-like way to get the count of a specific OS:

   longitudinal.select("os").where("os = 'Darwin'").count()

To transform the DataFrame object to an RDD, simply do:

  longitudinal_rdd = longitudinal.rdd

In general, however, the DataFrames are performance optimized, so it's worth the effort to learn the DataFrame API.
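
As a sketch of the DataFrame API, the count-per-OS query above can be generalized to an aggregation over all OS values (this assumes the longitudinal DataFrame created earlier):

from pyspark.sql import functions as F

longitudinal.groupBy("os").agg(F.count("*").alias("count")).orderBy(F.desc("count")).show()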

For more overview, see the SQL Programming Guide. See also the Longitudinal Tutorial, one of the available example notebooks when you start a cluster.

Available Data Sources for SparkSQL

For information about data sources available for querying (e.g. Longitudinal dataset), see Choosing a Dataset.

These datasets are optimized for fast access, and will far out-perform analysis on the raw Telemetry ping data.

Persisting data

You can save data to the Databricks Filesystem or to a subdirectory of the S3 bucket s3://net-mozaws-prod-us-west-2-pipeline-analysis/<username>/.
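
For example, a Spark DataFrame can be written to your subdirectory of that bucket as Parquet; the output path below is illustrative:

longitudinal.write.mode("overwrite").parquet(
    "s3://net-mozaws-prod-us-west-2-pipeline-analysis/<username>/longitudinal_sample"
)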

Accessing the Spark UI

After establishing an SSH connection to the Spark cluster, go to https://localhost:8888/spark to see the Spark UI. It has information about job statuses and task completion, and may help you debug your job.

The MozTelemetry Library

We have provided a library that gives easy access to the raw telemetry ping data. For example usage, see the Telemetry Hello World example notebook. Detailed API documentation for the library can be found at the Python MozTelemetry Documentation.

Using the Raw Ping Data

First off, import the moztelemetry library using the following:

from moztelemetry.dataset import Dataset

The ping data is an RDD of JSON elements. For example, using the following:

pings = Dataset.from_source("telemetry") \
    .where(docType='main') \
    .where(submissionDate="20180101") \
    .where(appUpdateChannel="nightly") \
    .records(sc, sample=0.01)

returns an RDD of 1/100th of the Firefox Nightly JSON pings submitted on January 1, 2018. Now, because it's JSON, pings are easy to access. For example, to get the count of each OS type:

os_names = pings.map(lambda x: (x['environment']['system']['os']['name'], 1))
os_counts = os_names.reduceByKey(lambda x, y: x + y)
os_counts.collect()

Alternatively, moztelemetry provides the get_pings_properties function, which will gather the data for you:

from moztelemetry import get_pings_properties
subset = get_pings_properties(pings, ["environment/system/os/name"])
subset.map(lambda x: (x["environment/system/os/name"], 1)).reduceByKey(lambda x, y: x + y).collect()

FAQ

Please add more FAQ as questions are answered by you or for you.

How can I load parquet datasets in a Jupyter notebook?

Load tables with:

dataset = spark.table("main_summary")

or use spark.read.parquet like:

dataset = spark.read.parquet("s3://the_bucket/the_prefix/the_version")

I got a REMOTE HOST IDENTIFICATION HAS CHANGED! error

AWS recycles hostnames, so this warning is expected. Removing the offending key from $HOME/.ssh/known_hosts will remove the warning. You can find the line to remove by finding the line in the output that says

Offending key in /path/to/hosts/known_hosts:2

Where 2 is the line number of the key that can be deleted. Just remove that line, save the file, and try again.

Why is my notebook hanging?

There are a few common causes for this:

  1. Currently, our Spark notebooks can only run a single Python kernel at a time. If you open multiple notebooks on the same cluster and try to run both, the second notebook will hang. Be sure to close notebooks using "Close and Halt" under the "File" drop-down.
  2. The connection from PySpark to the Spark driver might be lost. Unfortunately the best way to recover from this for the moment seems to be spinning up a new cluster.
  3. Cancelling execution of a notebook cell doesn't cancel any Spark jobs that might be running in the background. If your Spark commands seem to be hanging, try running sc.cancelAllJobs().

How can I keep running after closing the notebook?

For long-running computation, it might be nice to close the notebook (and the SSH session) and look at the results later. Unfortunately, the output of the currently running cell is lost when the notebook is closed. To alleviate this, there are a few options:

  1. Have everything output to a variable. These values should still be available when you reconnect.
  2. Put %%capture at the beginning of the cell to store all output. See the documentation and the sketch below.
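
A minimal sketch of the %%capture approach (the variable name results is arbitrary):

%%capture results
# the long-running work whose output we want to keep
squares = sc.parallelize(range(10 ** 6)).map(lambda x: x * x)
print(squares.sum())

# later, in a new cell after reconnecting, replay the captured output:
# results.show()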

How do I load an external library into the cluster?

Assuming you've got a URL for the repo, you can create an egg for it this way:

!git clone <repo url> && cd <repo-name> && python setup.py bdist_egg
sc.addPyFile('<repo-name>/dist/my-egg-file.egg')

Alternately, you could just create that egg locally, upload it to a web server, then download and install it:

import requests
r = requests.get('<url-to-my-egg-file>')
with open('mylibrary.egg', 'wb') as f:
    f.write(r.content)
sc.addPyFile('mylibrary.egg')

You will want to do this before you load the library. If the library is already loaded, restart the kernel in the Jupyter notebook.

SQL Style Guide

Consistency

From PEP 8:

A style guide is about consistency. Consistency with this style guide is important. Consistency within a project is more important. Consistency within one module or function is the most important.

However, know when to be inconsistent -- sometimes style guide recommendations just aren't applicable. When in doubt, use your best judgment. Look at other examples and decide what looks best. And don't hesitate to ask!

Reserved Words

Always use uppercase for reserved keywords like SELECT, WHERE, or AS.

Variable Names

  1. Use consistent and descriptive identifiers and names.
  2. Use lower case names with underscores, such as first_name. Do not use CamelCase.
  3. Functions, such as cardinality, approx_distinct, or substr, are identifiers and should be treated like variable names.
  4. Names must begin with a letter and may not end in an underscore.
  5. Only use letters, numbers, and underscores in variable names.

Be Explicit

When choosing between explicit or implicit syntax, prefer explicit.

Aliasing

Always include the AS keyword when aliasing a variable or table name; it's easier to read when explicit.

Good

SELECT
  substr(submission_date, 1, 6) AS month
FROM
  main_summary
LIMIT
  10

Bad

SELECT
  substr(submission_date, 1, 6) month
FROM
  main_summary
LIMIT
  10

Joins

Always include the JOIN type rather than relying on the default join.

Good

-- BigQuery Standard SQL Syntax
SELECT
  submission_date,
  experiment.key AS experiment_id,
  experiment.value AS experiment_branch,
  count(*) AS count
FROM
  telemetry.clients_daily
CROSS JOIN
  UNNEST(experiments.key_value) AS experiment
WHERE
  submission_date > '2019-07-01'
  AND sample_id = '10'
GROUP BY
  submission_date,
  experiment_id,
  experiment_branch

Bad

-- BigQuery Standard SQL Syntax
SELECT
  submission_date,
  experiment.key AS experiment_id,
  experiment.value AS experiment_branch,
  count(*) AS count
FROM
  telemetry.clients_daily,
  UNNEST(experiments.key_value) AS experiment -- Implicit JOIN
WHERE
  submission_date > '2019-07-01'
  AND sample_id = '10'
GROUP BY
  1, 2, 3 -- Implicit grouping column names

Grouping Columns

In the previous example, implicit grouping columns were discouraged, but there are cases where it makes sense.

In some SQL flavors (such as Presto) grouping elements must refer to the expression before any aliasing is done. If you are grouping by a complex expression it may be desirable to use implicit grouping columns rather than repeating the expression.

Good

-- BigQuery SQL Syntax
SELECT
  submission_date,
  normalized_channel IN ('nightly', 'aurora', 'beta') AS is_prerelease,
  count(*) AS count
FROM
  telemetry.clients_daily
WHERE
  submission_date > '2019-07-01'
GROUP BY
  submission_date,
  is_prerelease -- Grouping by aliases is supported in BigQuery

Good

-- Presto SQL Syntax
SELECT
  submission_date,
  normalized_channel IN ('nightly', 'aurora', 'beta') AS is_prerelease,
  count(*) AS count
FROM
  telemetry.clients_daily
WHERE
  submission_date > '20190701'
GROUP BY 
  1, 2 -- Implicit grouping avoids repeating expressions

Bad

-- Presto SQL Syntax
SELECT
  submission_date,
  normalized_channel IN ('nightly', 'aurora', 'beta') AS is_prerelease,
  count(*) AS count
FROM
  telemetry.clients_daily
WHERE
  submission_date > '20190701'
GROUP BY
  submission_date,
  normalized_channel IN ('nightly', 'aurora', 'beta')

Left Align Root Keywords

Root keywords should all start on the same character boundary. This is counter to the common "rivers" pattern described here.

Good:

SELECT
  client_id,
  submission_date
FROM
  main_summary
WHERE
  sample_id = '42'
  AND submission_date > '20180101'
LIMIT
  10

Bad:

SELECT client_id,
       submission_date
  FROM main_summary
 WHERE sample_id = '42'
   AND submission_date > '20180101'

Code Blocks

Root keywords should be on their own line. For example:

Good:

SELECT
  client_id,
  submission_date
FROM
  main_summary
WHERE
  submission_date > '20180101'
  AND sample_id = '42'
LIMIT
  10

It's acceptable to include an argument on the same line as the root keyword, if there is exactly one argument.

Acceptable:

SELECT
  client_id,
  submission_date
FROM main_summary
WHERE
  submission_date > '20180101'
  AND sample_id = '42'
LIMIT 10

Do not include multiple arguments on one line.

Bad:

SELECT client_id, submission_date
FROM main_summary
WHERE
  submission_date > '20180101'
  AND sample_id = '42'
LIMIT 10

Bad

SELECT
  client_id,
  submission_date
FROM main_summary
WHERE submission_date > '20180101'
  AND sample_id = '42'
LIMIT 10

Parentheses

If parentheses span multiple lines:

  1. The opening parenthesis should terminate the line.
  2. The closing parenthesis should be lined up under the first character of the line that starts the multi-line construct.
  3. The contents of the parentheses should be indented one level.

For example:

Good

WITH sample AS (
  SELECT
    client_id
  FROM
    main_summary
  WHERE
    sample_id = '42'
)

Bad (Terminating parenthesis on shared line)

WITH sample AS (
  SELECT
    client_id
  FROM
    main_summary
  WHERE
    sample_id = '42')

Bad (No indent)

WITH sample AS (
SELECT
  client_id
FROM
  main_summary
WHERE
  sample_id = '42'
)

Boolean at the Beginning of Line

AND and OR should always be at the beginning of the line. For example:

Good

...
WHERE
  submission_date > 20180101
  AND sample_id = '42'

Bad

...
WHERE
  submission_date > 20180101 AND
  sample_id = '42'

Nested Queries

Do not use nested queries. Instead, use common table expressions to improve readability.

Good:

WITH sample AS (
  SELECT
    client_id,
    submission_date
  FROM
    main_summary
  WHERE
    sample_id = '42'
)

SELECT *
FROM sample
LIMIT 10

Bad:

SELECT *
FROM (
  SELECT
    client_id,
    submission_date
  FROM
    main_summary
  WHERE
    sample_id = '42'
)
LIMIT 10

About this Document

This document was heavily influenced by https://www.sqlstyle.guide/

Changes to the style guide should be reviewed by at least one member of both the Data Engineering team and the Data Science team.

Glean - product analytics & telemetry

For Mozilla, getting reliable data from our products is critical to inform our decision making. Glean is our new product analytics & telemetry solution that provides that data for our mobile products. It aims to be easy to integrate, reliable and transparent by providing an SDK and integrated tools.

It currently supports Android products, while iOS support is planned. Note that this is different from Telemetry for Firefox Desktop (library, datasets), although it provides similar capabilities.

Contents:

Overview

Glean consists of different pieces:

  • Product-side tools - the Glean SDK is what products integrate and record data into.
  • Services - this is where the data is stored and made available for analysis in our data platform.
  • Data Tools - these are used to look at the data, performing analysis and setting up dashboards.

What does it offer

Glean is designed to support typical product analytics use-cases and to encourage best practices by requiring clearly defined metrics. It does this through the following:

Basic product analytics are collected out-of-the-box in a standardized way. A baseline of analysis is important for all our mobile applications, from counting active users to retention and session times. This is supported out-of-the-box by the library and works consistently across our mobile products.

No custom code is required for adding new metrics to a product. To make engineers more productive, the SDK keeps the amount of instrumentation code required for metrics as small as possible. Engineers only need to specify what they want to instrument and with which semantics, and then record the data using the Glean SDK. The SDK takes care of storing & sending that data reliably.

Following lean data practices through SDK design choices. It's easy to limit data collection to what's necessary and documentation can be generated easily, aiding both transparency & understanding for analysis.

Better data tooling integration due to standardized data types & registering them in machine-readable files. By having collected data described in machine-readable files, our various data tools can read them and support metrics automatically, without manual work.

Due to common high-level concepts for metrics, APIs & data tools can better match the use-cases. To make the choice easier for which metric type to use, we are introducing higher-level data types that offer clear and understandable semantics - for example, when you want to count something, you use the "count" type. This also gives us opportunities to offer better tooling for the data, both on the client and for data tooling.

Basic semantics on how the data is collected are clearly defined by the library. To make it easier to understand the general semantics of our data, the Glean SDK will define and document when each kind of data gets sent. This gives data analysis common basic semantics.

How to use Glean

Contact

References

Using the Glean debug ping view

What is this good for?

Glean Debug Ping View enables you to easily see in real-time what data your mobile application is sending through Glean.

This data is what actually arrives in our data pipeline, shown in a web interface that is automatically updated when new data arrives.

What setup is needed for applications?

You can use the debug view for all our mobile applications that use Glean (and enable it), including those installed from the app store. To enable this you need to run a command in adb that tags the outgoing data as "debug data". You will provide a debug tag, which makes it easier to identify your device in the web interface.

adb shell am start -n <application-id>/mozilla.components.service.glean.debug.GleanDebugActivity \
  --ez logPings true \
  --es sendPing baseline \
  --es tagPings my-debug-tag

my-debug-tag is what will help you identify your data in the web interface, while <application-id> is the application identifier as declared in the manifest (e.g. org.mozilla.reference.browser). The debug commands are documented in more detail in the Glean documentation.

Supported applications

As of now, the following application IDs are supported:

  • org.mozilla.fenix
  • org.mozilla.reference.browser
  • org.mozilla.samples.glean
  • org.mozilla.tv.firefox
  • ... and some debug versions of the above applications.

Where can I see the data?

The data is provided in this web interface. It lists all recently active devices and updates automatically. You can use your debug identifier to quickly identify your own testing data.

Any data sent from a mobile device usually shows up within 10 seconds, updating the pages automatically.

Can you give me an example?

For example to send a baseline ping immediately from the Reference Browser, with a debug identifier of johndoe-test1:

adb shell am start -n org.mozilla.reference.browser/mozilla.components.service.glean.debug.GleanDebugActivity \
  --es sendPing baseline \
  --es tagPings johndoe-test1

baseline pings are also sent automatically by Glean when the application goes to the background. So to check these you can set the tag:

adb shell am start -n org.mozilla.reference.browser/mozilla.components.service.glean.debug.GleanDebugActivity \
  --es tagPings johndoe-test1

Now whenever you put the application in the background, a baseline ping should show up in the web interface.

If you triggered some event recording and want to confirm the events arrived, you can send the events ping:

adb shell am start -n org.mozilla.reference.browser/mozilla.components.service.glean.debug.GleanDebugActivity \
  --es sendPing events \
  --es tagPings johndoe-test1

Note: Glean will always attempt to collect data for the ping that was requested using the sendPing command line switch. However, if no data is recorded by the application, nothing will be sent. The baseline ping is guaranteed to always be sent, since it’s populated by Glean itself.

Caveats

Some important things to watch out for (see also the Glean SDK documentation):

  • Options that are set using the adb flags are not immediately reset and will persist until the application is closed or manually reset.

  • There are a couple of different ways in which to send pings through the GleanDebugActivity:

    1. You can use the GleanDebugActivity in order to tag pings and trigger them manually using the UI. This should always produce a ping with all required fields.
    2. You can use the GleanDebugActivity to tag and send pings. This has the side effect of potentially sending a ping which does not include all fields, because sendPing triggers the ping to be sent before certain application behaviors can occur which would record that information. For example, duration is not calculated or included in a baseline ping sent with sendPing, because it forces the ping to be sent before the duration metric has been recorded.

Troubleshooting

If nothing is showing up on the dashboard, it would be useful to check the following:

  • If adb logcat reports ”Glean must be enabled before sending pings.” right after calling the GleanDebugActivity, then the application has disabled Glean. Please check with the application team on how to fix that.
  • If no error is reported when triggering tagged pings, but the data doesn't show up on the dashboard, check if the <application-id> you used is the one expected by the Glean pipeline (i.e. the one used to publish the application on the Play Store).
  • Fenix and the reference-browser debug builds currently don't enable Glean. You could override this in local builds.

Questions? Problems?

Reach out to Alessio Placitelli (:dexter) or Arkadiusz Komarzewski (:akomar) in #glean on slack or send an email to glean-team@mozilla.com.

References

Telemetry Alerts

Many Telemetry probes were created to show performance trends over time. Sudden changes happening in Nightly could be the sign of an unintentional performance regression, so we introduced a system to automatically detect and alert developers about such changes.

Thus we created Telemetry Alerts. It comes in two pieces: Cerberus the Detector and Medusa the Front-end.

Cerberus

Every day Cerberus grabs the latest aggregated information about all non-keyed Telemetry probes from aggregates.telemetry.mozilla.org and compares the distribution of values from the Nightly builds of the past two days to the distribution of values from the Nightly builds of the past seven days.

It does this by calculating the Bhattacharyya distance between the two distributions and guessing whether or not they are significant and narrow.
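
For reference, this is roughly what that comparison looks like. The sketch below is not Cerberus's actual code, just the standard Bhattacharyya calculation applied to two normalized histograms over the same buckets:

import numpy as np

def bhattacharyya_distance(p, q):
    # p and q are histogram counts over the same buckets
    p = np.asarray(p, dtype=float) / np.sum(p)
    q = np.asarray(q, dtype=float) / np.sum(q)
    bc = np.sum(np.sqrt(p * q))  # Bhattacharyya coefficient; 1.0 for identical distributions
    return -np.log(bc)           # distance; 0.0 for identical distributions

# e.g. bhattacharyya_distance(last_two_days_hist, last_seven_days_hist)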

It places all detected changes in a file for ingestion by Medusa.

Medusa

Medusa is in charge of emailing people when distributions change and of displaying the website https://alerts.telemetry.mozilla.org, which contains pertinent information about each detected regression.

Medusa also checks for expiring histograms and sends emails notifying of their expiry.

What it can do

Telemetry Alerts is very good at identifying sudden changes in the shapes of normalized distributions of Telemetry probes. If you can see the distribution of GC_MS shift from one day to the next, then likely so can Cerberus.

What it can't do

Telemetry Alerts is not able to see sudden shifts in volume. It is also very easily fooled if a change happens over a long period of time or doesn't fundamentally alter the shape of the probe's histogram.

So if you have a probe like SCALARS_BROWSER.ENGAGEMENT.MAX_CONCURRENT_TAB_COUNT, Cerberus won't notice if:

  • The number of pings reporting this value decreased by half, but otherwise reported the same spread of numbers
  • The value increases very slowly over time (which I'd expect it to do given how good Session Restore is these days)
  • We suddenly received twice as many pings from 200-tab subsessions (the dominance of 1-tab pings would likely ensure the overall shape of the distribution changed insufficiently much for Cerberus to pick up on it)

Telemetry Alert Emails

One of the main ways humans interact with Telemetry Alerts is through the emails sent by Medusa.

At present the email contains a link to the alert's page on https://alerts.telemetry.mozilla.org and a link to a pushlog on https://hg.mozilla.org detailing the changes newly-present in the Nightly build that exhibited the change.

Triaging a Telemetry Alert Email

Congratulations! You have just received a Telemetry Alert!

Now what?

Assumption: Alerts happen because of changes in probes. Changes in probes happen because of changes in related code. If we can identify the code change, we can find the bug that introduced the code change. If we can find the bug, we can ni? the person who made the change.

Goal: Identify the human responsible for the Alert so they can identify if it is good/bad/intentional/exceptional/temporary/permanent/still relevant/having its alerts properly looked after.

Guide:

  1. Is this alert just one of a group of similar changes by topic? By build?
  • If there's a group by topic (SPDY, URLCLASSIFIER, ...) check to see if the changes are similar in direction/magnitude. They usually are.
  • If there's a group by build but not topic, maybe a large merge kicked things over. Unfortunate, as that will make finding the source more difficult.
  2. Open the hg.mozilla.org and alerts.telemetry.mozilla.org links in tabs
  • On alerts.tmo, does it look like an improvement or regression? (This is just a first idea and might change. There are often extenuating circumstances that make something that looks bad into an improvement, and vice versa.)
  • On hg.mo, does the topic of the changed probe exist in the pushlog? In other words, does any part of the probe's name show up in the summaries of any of the commits?
  3. From alerts.tmo, open the https://telemetry.mozilla.org link by clicking on the plot's title. Open another tab to the Evolution View.
  • Is the change temporary? (might have been noticed elsewhere and backed out)
  • Is the change up or down?
  • Has it happened before?
  • Was it accompanied by a decrease in submission volume? (the second graph at the bottom of the Evolution View)
  • On the Distribution View, did the Sample Count increase? Decrease? (this signifies that the change could be because of the addition or subtraction of a population of values. For instance, we could suddenly stop sending 0 values which would shift the graph to the right. This could be a good thing (we're not handling useless things any longer) a bad thing (something broke and we're no longer measuring the same thing we used to measure) or indifferent)
  4. If you still don't have a cause
  • Use DXR or searchfox to find where the probe is accumulated.
  • Click "Log" in that view.
  • Are there any changesets in the resultant hg.mo list that ended up in the build we received the Alert for?
  5. If you still don't know what's going on
  • Find a domain expert on IRC and bother them to help you out. Domain knowledge is awesome.

From pursuing these steps or sub-steps you should now have two things: a bug that likely caused the alert, and an idea of what the alert is about.

Now comment on the bug. Feel free to use this script:

This bug may have contributed to a sudden change in the Telemetry probe <PROBE_NAME>[1] which seems to have occurred in Nightly <builddate>[2][3].

There was a <describe the change: increase/decrease, population addition/subtraction, regression/improvement, change in submission/sample volume...>.
This might mean <wild speculation. It'll encourage the ni? to refute it :) >

Is this an improvement? A regression?

Is this intentional? Is this expected?

Is this probe still measuring something useful?

[1]: <the alerts.tmo link>
[2]: <the hg.mo link for the pushlog>
[3]: <the telemetry.mozilla.org link showing the Evolution View>

Then ni? the person who pushed the change. Reply-all to the dev-telemetry-alerts mail with a link to the bug and some short notes on what you found.

From here the user on ni? should get back to you in fairly short order and either help you find the real bug that caused it, or help explain what the Alert was all about. More often than not it is an expected change from a probe that is still operating correctly and there is no action to take...

...except making sure you never have to respond to an Alert for this probe again, that is. File a bug in that bug's component to update the Alerting probe to have a valid, monitored alert_emails field so that the next time it misbehaves they can be the ones to explain themselves without you having to spend all this time tracking them down.

Cookbooks

A Cookbook is a focused tutorial that guides you through a specific task. For example, a Cookbook could:

  • Introduce you to what types of analyses are common for (e.g.) Search or Crash data
  • Guide you through an example analysis to demonstrate the basic principles behind a new statistical technique

Accessing and working with BigQuery

This guide will give you a quick introduction to working with data stored in BigQuery.

BigQuery uses a columnar data storage format called Capacitor which supports semi-structured data.

There is a cost associated with using BigQuery based on operations. As of right now we pay an on-demand pricing for queries based on how much data a query scans. To minimize costs see Query Optimizations. More detailed pricing information can be found here.

As we transition to GCP, BigQuery has become our primary data warehouse and SQL query engine. Our previous SQL query engines, Presto and Athena, and our Parquet data lake will no longer be accessible by the end of 2019. Specific guidance for transitioning off of the AWS data infrastructure, including up-to-date timelines of data availability, is maintained in the Data Access Continuity Guide Google Doc.

Access

There are multiple ways to access BigQuery. For most users the primary interface will be re:dash.

See below for additional interfaces. All other interfaces will require access to be provisioned.

Interfaces

BigQuery datasets and tables can be accessed by the following methods:

Access Request

For access to BigQuery via GCP Console and API please file a bug here. As part of this request we will add you to the appropriate Google Groups and provision a GCP Service Account.

From re:dash

All Mozilla users will be able to access BigQuery via re:dash through the following Data Sources:

  • Telemetry (BigQuery)
  • Telemetry Search (BigQuery)
    • This group is restricted to users in the re:dash search group.

Access via re:dash is read-only. You will not be able to create views or tables via re:dash.

GCP BigQuery Console

  • File a bug with Data Operations for access to GCP Console.
  • Visit GCP BigQuery Console
  • Switch to the project provided to you during your access request, e.g. moz-fx-data-bq-<team-name>

See Using the BigQuery web UI in the GCP Console for more details.

GCP BigQuery API Access

  • File a bug with Data Operations for access to GCP BigQuery API Access.

A list of supported BigQuery client libraries can be found here.

Detailed REST reference can be found here.

From bq Command-line Tool

  • Install the GCP SDK
  • Authorize gcloud with either your user account or provisioned service account. See documentation here.
    • gcloud auth login
  • Set your google project to your team project
    • gcloud config set project moz-fx-data-bq-<team-name>
    • The project name will be provided for you when your account is provisioned.

bq Examples

List tables and views in a BigQuery dataset

bq ls moz-fx-data-derived-datasets:telemetry

Query a table or view

bq query --nouse_legacy_sql 'select count(*) from `moz-fx-data-derived-datasets.telemetry.main` where submission_date = "2019-08-22" LIMIT 10'

Additional examples and documentation can be found here.

From client SDKs

Client SDKs for various programming languages don't access credentials the same way as the gcloud and bq command-line tools. The client SDKs generally assume that the machine is configured with a service account and will look for JSON-based credentials in several well-known locations rather than looking for user credentials.

If you have service account credentials, you can point client SDKs at them by setting:

export GOOGLE_APPLICATION_CREDENTIALS=/path/to/creds.json

If you don't have appropriate service account credentials, but your GCP user account has sufficient access, you can have your user credentials mimic a service account by running:

gcloud auth application-default login

Once you've followed the browser flow to grant access, you should be able to, for example, access BigQuery from Python:

pip install google-cloud-bigquery
python -c 'from google.cloud import bigquery; print([d.dataset_id for d in bigquery.Client().list_datasets()])'

From Spark

We recommend the Storage API Connector for accessing BigQuery tables in Spark as it is the most modern and actively developed connector. It works well with the BigQuery client library which is useful if you need to run arbitrary SQL queries (see example Databricks notebook) and load their results into Spark.
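
As a sketch (not an official recipe), reading a table through the Storage API Connector from a notebook looks roughly like this; the table and column names are illustrative, and the connector jar must already be available on the cluster, as it is in the setups described below:

df = (
    spark.read.format("bigquery")
    .option("table", "moz-fx-data-shared-prod.telemetry_derived.clients_daily_v6")
    .load()
    .where("submission_date = '2019-08-22'")
)
df.select("client_id", "submission_date").limit(10).show()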

On Databricks

The shared_serverless_python3 cluster is configured with shared default GCP credentials that will be automatically picked up by BigQuery client libraries. It also has the Storage API Connector library added as seen in the example Python notebook.

On Dataproc

Dataproc is Google's managed Spark cluster service. Accessing BigQuery from there will be faster than from Databricks because it will not involve cross-cloud data transfers.

You can spin up a Dataproc cluster with Jupyter using the following command. Insert your values for cluster-name, bucket-name, and project-id there. Your notebooks will be stored in Cloud Storage under gs://bucket-name/notebooks/jupyter:

gcloud beta dataproc clusters create cluster-name \
    --optional-components=ANACONDA,JUPYTER \
    --image-version=1.4 \
    --enable-component-gateway \
    --properties=^#^spark:spark.jars=gs://spark-lib/bigquery/spark-bigquery-latest.jar \
    --num-workers=3 \
    --max-idle=3h \
    --bucket bucket-name \
    --region=us-west1 \
    --project project-id

The Jupyter URL can be retrieved with the following command:

gcloud beta dataproc clusters describe cluster-name --region=us-west1 --project project-id | grep Jupyter

After you've finished your work, it's a good practice to delete your cluster:

gcloud beta dataproc clusters delete cluster-name --region=us-west1 --project project-id --quiet

From Colaboratory

Colaboratory is a Jupyter notebook environment, managed by Google and running in the cloud. Notebooks are stored in Google Drive and can be shared in a similar way to Google Docs.

Colaboratory can be used to easily access BigQuery and perform interactive analyses. See Telemetry Hello World notebook.

Note: this is very similar to API Access, so you will need access to your team's GCP project - file a request as described above.

Querying Tables

Projects, Datasets and Tables in BigQuery

In GCP a project is a way to organize cloud resources. We use multiple projects to maintain our BigQuery datasets.

Note that we have historically used the term dataset to describe a set of records all following the same schema, but this idea corresponds to a table in BigQuery. In BigQuery terminology, datasets are a top-level container used to organize and control access to tables and views.

Caveats

  • The date partition field (e.g. submission_date_s3, submission_date) is mostly used as a partitioning column, but it has changed from YYYYMMDD string form to a proper DATE type that accepts string literals in the more standards-friendly YYYY-MM-DD form.
  • Unqualified queries can become very costly very easily. We've placed restrictions on large tables to prevent accidentally querying "all data for all time": you must make use of the date partition fields when querying large tables (like main_summary or clients_daily).
  • Please read Query Optimizations section that contains advice on how to reduce cost and improve query performance.
  • re:dash BigQuery data sources will have a 10 TB data scanned limit per query. Please let us know in #fx-metrics on Slack if you run into issues!
  • There is no native map support in BigQuery. Instead, we are using structs with fields [key, value]. We have provided convenience functions to access these like key-value maps (described below.)

Projects with BigQuery datasets

  • moz-fx-data-shared-prod: All production data including full pings, imported parquet data, BigQuery ETL, and ad-hoc analysis. Datasets in this project:
    • <namespace>_live: See live datasets below
    • <namespace>_stable: See stable datasets below
    • <namespace>_derived: See derived datasets below
    • <namespace>: See user-facing (unsuffixed) datasets below
    • analysis: User generated tables for analysis
    • backfill: Temporary staging area for back-fills
    • blpadi: Blocklist ping derived data (restricted)
    • payload_bytes_raw: Raw JSON payloads as received from clients, used for reprocessing scenarios, a.k.a. "landfill" (restricted)
    • payload_bytes_decoded: gzip-compressed decoded JSON payloads, used for reprocessing scenarios
    • payload_bytes_error: gzip-compressed JSON payloads that were rejected in some phase of the pipeline; particularly useful for investigating schema validation errors
    • search: Search data imported from parquet (restricted)
    • static: Static tables, often useful for data-enriching joins
    • tmp: Temporary staging area for parquet data loads
    • udf: Persistent user-defined functions defined in SQL; see Using UDFs
    • udf_js: Persistent user-defined functions defined in JavaScript; see Using UDFs
    • validation: Temporary staging area for validation
  • moz-fx-data-derived-datasets: Legacy project that contains mostly views to data in moz-fx-data-shared-prod during a transition period; STMO currently points at this project but we will announce a transition to moz-fx-data-shared-prod by end of 2019. Datasets in this project:
    • analysis: User generated tables for analysis; note that this dataset is separate from moz-fx-data-shared-prod:analysis and users are responsible for migrating or cloning data during the transition period
  • moz-fx-data-shar-nonprod-efed: Non-production data produced by stage ingestion infrastructure

Table Layout and Naming

Under the single moz-fx-data-shared-prod project, each document namespace (corresponding to folders underneath the schemas directory of mozilla-pipeline-schemas) has four BigQuery datasets provisioned with the following properties:

  • Live datasets (telemetry_live, activity_stream_live, etc.) contain live ping tables (see definitions of table types in the next paragraph)
  • Stable datasets (telemetry_stable, activity_stream_stable, etc.) contain historical ping tables
  • Derived datasets (telemetry_derived, activity_stream_derived, etc.) contain derived tables, primarily populated via nightly queries defined in BigQuery ETL and managed by Airflow
  • User-facing (unsuffixed) datasets (telemetry, activity_stream, etc.) contain user-facing views on top of the tables in the corresponding stable and derived datasets.

The table and view types referenced above are defined as follows:

  • Live ping tables are the final destination for the telemetry ingestion pipeline. Dataflow jobs process incoming ping payloads from clients, batch them together by document type, and load the results to these tables approximately every five minutes, although a few document types are opted in to a more expensive streaming path that makes records available in BigQuery within seconds of ingestion. These tables are partitioned by date according to submission_timestamp and are also clustered on that same field, so it is possible to make efficient queries over short windows of recent data such as the last hour. They have a rolling expiration period of 30 days, but that window may be shortened in the future. Analyses should only use these tables if they need results for the current (partial) day.
  • Historical ping tables have exactly the same schema as their corresponding live ping tables, but they are populated only once per day via an Airflow job and have a 25 month retention period. These tables are superior to the live ping tables for historical analysis because they never contain partial days, they have additional deduplication applied, and they are clustered on sample_id, allowing efficient queries on a 1% sample of clients. It is guaranteed that document_id is distinct within each UTC day of each historical ping table, but it is still possible for a document to appear multiple times if a client sends the same payload across multiple days. Note that this requirement is relaxed for older telemetry ping data that was backfilled from AWS; approximately 0.5% of documents are duplicated in telemetry.main and other historical ping tables for 2019-04-30 and earlier dates.
  • Derived tables are populated by nightly Airflow jobs and are considered an implementation detail; their structure may change at any time at the discretion of the data platform team to allow refactoring or efficiency improvements.
  • User-facing views are the schema objects that users are primarily expected to use in analyses. Many of these views correspond directly to an underlying historical ping table or derived table, but they provide the flexibility to hide deprecated columns or present additional calculated columns to users. These views are the schema contract with users and they should not change in backwards-incompatible ways without a version increase or an announcement to users about a breaking change.

Spark and other applications relying on the BigQuery Storage API for data access need to reference derived tables or historical ping tables directly rather than user-facing views. Unless the query result is relatively large, we recommend instead that users run a query on top of user-facing views with the output saved in a destination table, which can then be accessed from Spark.

Structure of Ping Tables in BigQuery

Unlike with the previous AWS-based data infrastructure, we don't have different mechanisms for accessing entire pings vs. "summary" tables. As such, there are no longer special libraries or infrastructure necessary for accessing full pings, rather each document type maps to a user-facing view that can be queried in STMO. For example:

  • "main" pings are accessible from view telemetry.main
  • "crash" pings are accessible from view telemetry.crash
  • "baseline" pings for Fenix are accessible from view org_mozilla_fenix.baseline

All fields in the incoming pings are accessible in these views, and (where possible) match the nested data structures of the original JSON. Field names are converted from camelCase form to snake_case for consistency and SQL compatibility.

Any fields not present in the ping schemas are present in an additional_properties field containing leftover JSON. BigQuery provides functions for parsing and manipulating JSON data via SQL.

Later in this document, we demonstrate the use of a few Mozilla-specific functions that we have defined to allow ergonomic querying of map-like fields (which are represented as arrays of structs in BigQuery) and histograms (which are encoded as raw JSON strings).

Writing Queries

To query a BigQuery table you will need to specify the dataset and table name. It is good practice to specify the project; however, depending on which project the query originates from, this is optional.

SELECT
    col1,
    col2
FROM
    `project.dataset.table`
WHERE
    -- date_partition_field will vary based on table
    date_partition_field >= DATE_SUB(CURRENT_DATE, INTERVAL 1 MONTH)

An example query from the Clients Last Seen Reference:

SELECT
    submission_date,
    os,
    COUNT(*) AS count
FROM
    telemetry.clients_last_seen
WHERE
    submission_date >= DATE_SUB(CURRENT_DATE, INTERVAL 1 WEEK)
    AND days_since_seen = 0
GROUP BY
    submission_date,
    os
HAVING
    count > 10 -- remove outliers
    AND lower(os) NOT LIKE '%windows%'
ORDER BY
    os,
    submission_date DESC

Check out the BigQuery Standard SQL Functions & Operators for detailed documentation.

Writing query results to a permanent table

You can write query results to a BigQuery table you have access to via the GCP BigQuery Console or the GCP BigQuery API. A Python sketch follows the steps below.

  • Use moz-fx-data-shared-prod.analysis dataset.
    • Prefix your table with your username. If your username is username@mozilla.com, create a table named username_my_table.
  • See Writing query results documentation for detailed steps.
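
If you prefer to do this from Python rather than the console, a rough sketch with the google-cloud-bigquery client looks like the following; the destination table name follows the username_my_table convention above, and the query itself is just an example:

from google.cloud import bigquery

client = bigquery.Client(project="moz-fx-data-shared-prod")

# direct the query results to a personal table in the analysis dataset
job_config = bigquery.QueryJobConfig()
job_config.destination = bigquery.TableReference.from_string(
    "moz-fx-data-shared-prod.analysis.username_my_table"
)
job_config.write_disposition = "WRITE_TRUNCATE"

query = """
    SELECT submission_date, COUNT(*) AS n
    FROM `moz-fx-data-shared-prod.telemetry.clients_daily`
    WHERE submission_date = '2019-08-22'
    GROUP BY submission_date
"""
client.query(query, job_config=job_config).result()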

Writing results to GCS (object store)

If a BigQuery table is not a suitable destination for your analysis results, we also have a GCS bucket available for storing analysis results. It is usually Spark jobs that will need to do this.

  • Use bucket gs://moz-fx-data-prod-analysis/
    • Prefix object paths with your username. If your username is username@mozilla.com, you might store a file to gs://moz-fx-data-prod-analysis/username/myresults.json.

Creating a View

You can create views in BigQuery if you have access via GCP BigQuery Console or GCP BigQuery API Access.

  • Use moz-fx-data-shared-prod.analysis dataset.
    • Prefix your view with your username. If your username is username@mozilla.com, create a view named username_my_view.
  • See Creating Views documentation for detailed steps.

Using UDFs

BigQuery offers user-defined functions (UDFs) that can be defined in SQL or JavaScript as part of a query or as a persistent function stored in a dataset. We have defined a suite of persistent functions to enable transformations specific to our data formats, available in datasets udf (for functions defined in SQL) and udf_js (for functions defined in JavaScript). Note that JavaScript functions are potentially much slower than those defined in SQL, so use functions in udf_js with some caution, likely only after performing aggregation in your query.

We document a few of the most broadly useful UDFs below, but you can see the full list of UDFs with source code in bigquery-etl/udf and bigquery-etl/udf_js. Publishing a full reference page for our persistent UDFs is a planned improvement, tracked in bigquery-etl#228.

Accessing map-like fields

BigQuery currently lacks native map support and our workaround is to use a STRUCT type with fields named [key, value]. We've created a UDF that provides key-based access with the signature: udf.get_key(<struct>, <key>). The example below generates a count per reason key in the event_map_values field in the telemetry events table for Normandy unenrollment events from yesterday.

SELECT udf.get_key(event_map_values, 'reason') AS reason,
       COUNT(*) AS EVENTS
FROM telemetry.events
WHERE submission_date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
  AND event_category='normandy'
  AND event_method='unenroll'
GROUP BY 1
ORDER BY 2 DESC

Accessing histograms

We considered many potential ways to represent histograms as BigQuery fields and found the most efficient encoding was actually to leave them as raw JSON strings. To make these strings easier to use for analysis, you can convert them into nested structures using udf.json_extract_histogram:

WITH
  extracted AS (
  SELECT
    submission_timestamp,
    udf.json_extract_histogram(payload.histograms.a11y_consumers) AS a11y_consumers
  FROM
    telemetry.main )
  --
SELECT
  a11y_consumers.bucket_count,
  a11y_consumers.sum,
  a11y_consumers.range[ORDINAL(1)] AS range_low,
  udf.get_key(a11y_consumers.values, 11) AS value_11
FROM
  extracted
WHERE
  a11y_consumers.bucket_count IS NOT NULL
  AND DATE(submission_timestamp) = "2019-08-09"
LIMIT
  10

Query Optimizations

To improve query performance and minimize the cost associated with using BigQuery please see the following query optimizations:

  • Avoid SELECT * by selecting only the columns you need
    • Using SELECT * is the most expensive way to query data. When you use SELECT * BigQuery does a full scan of every column in the table.
    • Applying a LIMIT clause to a SELECT * query might not affect the amount of data read, depending on the table structure.
      • Many of our tables are configured to use clustering in which case a LIMIT clause does effectively limit the amount of data that needs to be scanned.
      • Tables that include a sample_id field will usually have that as one of the clustering fields and you can efficiently scan random samples of users by specifying WHERE sample_id = 0 (1% sample), WHERE sample_id < 10 (10% sample), etc. This can be especially helpful with main_summary, clients_daily, and clients_last_seen which are very large tables and are all clustered on sample_id.
      • To check whether your LIMIT and WHERE clauses are actually improving performance, you should see a lower value reported for actual "Data Scanned" by a query compared to the prediction ("This query will process X bytes") in STMO or the BigQuery UI.
    • If you are experimenting with data or exploring data, use one of the data preview options instead of SELECT *.
      • Preview support is coming soon to BigQuery data sources in re:dash
  • Limit the amount of data scanned by using a date partition filter
    • Tables that are larger than 1 TB will require that you provide a date partition filter as part of the query.
    • You will receive an error if you attempt to query a table that requires a partition filter.
      • Cannot query over table 'moz-fx-data-shared-prod.telemetry_derived.main_summary_v4' without a filter over column(s) 'submission_date' that can be used for partition elimination
    • See Writing Queries for examples.
  • Reduce data before using a JOIN
    • Trim the data as early in the query as possible, before the query performs a JOIN. If you reduce data early in the processing cycle, shuffling and other complex operations only execute on the data that you need.
    • Use subqueries with filters, or intermediate tables or views, to reduce the size of each side of a join before the join itself.
  • Do not treat WITH clauses as prepared statements
    • WITH clauses are used primarily for readability because they are not materialized. For example, placing all your queries in WITH clauses and then running UNION ALL is a misuse of the WITH clause. If a query appears in more than one WITH clause, it executes in each clause.
  • Use approximate aggregation functions
    • If the SQL aggregation function you're using has an equivalent approximation function, the approximation function will yield faster query performance. For example, instead of using COUNT(DISTINCT), use APPROX_COUNT_DISTINCT().
    • See approximate aggregation functions in the standard SQL reference.
  • Reference the data size prediction ("This query will process X bytes") in STMO and the BigQuery UI to help gauge the efficiency of your queries. You should see this number go down as you limit the range of submission_dates or include fewer fields in your SELECT statement. For clustered tables, this estimate won't take into account the benefits of LIMIT and WHERE clauses on clustering fields, so you'll need to compare against the actual "Data Scanned" after the query has run. Queries are charged by data scanned at $5/TB, so each 200 GB of data scanned costs $1; it can be useful to keep the data estimate below 200 GB while developing and testing a query to limit cost and query time, then open up to the full range of data you need once you have confidence in the results. A sketch combining several of these optimizations follows this list.
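
For illustration, here is a hedged sketch that combines several of these optimizations: it selects only the columns it needs, applies a date partition filter, scans a 1% sample via the sample_id clustering field, and uses an approximate aggregation. It assumes the BigQuery clients_daily table with submission_date, sample_id, and client_id columns as described above; adjust the names and dates for your own table.

SELECT
  submission_date,
  -- Approximate distinct counts are cheaper than COUNT(DISTINCT ...)
  APPROX_COUNT_DISTINCT(client_id) AS approx_clients
FROM
  telemetry.clients_daily
WHERE
  -- Partition filter: required on large partitioned tables and limits the scan
  submission_date >= '2019-08-01'
  AND submission_date < '2019-09-01'
  -- Clustering filter: scan only a 1% sample of clients
  AND sample_id = 0
GROUP BY
  submission_date
ORDER BY
  submission_date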

A complete list of optimizations can be found here, and cost optimizations here.

Scheduling BigQuery Queries in Airflow

Queries in bigquery-etl can be scheduled in Airflow to be run regularly with the results written to a table.

In bigquery-etl

In the bigquery-etl project, queries are written in /templates. The directory structure is based on the destination table: /templates/{dataset_id}/{table_name}. For example, /templates/telemetry/core_clients_last_seen_raw_v1/query.sql is a query that will write results to the core_clients_last_seen_raw_v1 table in the telemetry dataset. This can be overridden in Airflow.

If we want to create a new table containing just client_ids each day, called client_ids in the example dataset, we should create /templates/example/client_ids/query.sql:

SELECT DISTINCT
  client_id,
  submission_date
FROM
  telemetry_derived.main_summary_v4
WHERE
  submission_date = @submission_date

@submission_date is a parameter that will be filled in by Airflow.

After /templates/example/client_ids/query.sql is created, /script/generate_sql can be run to generate the associated query in /sql/example/client_ids/query.sql, which is the query that will be run by the Airflow task.

Commit the changes in both /templates and /sql. When a commit is made to master in bigquery-etl, the Docker image is pushed and made available to Airflow.

In telemetry-airflow

The next step is to create a DAG or add a task to an existing DAG that will run the query.
In telemetry-airflow, BigQuery related functions are found in /dags/utils/gcp.py. The function we are interested in is bigquery_etl_query.

For our client_ids example, we could create a new DAG, /dags/client_ids.py:

from airflow import models
from utils.gcp import bigquery_etl_query

default_args = {
    ...
}

dag_name = 'client_ids'

with models.DAG(dag_name, schedule_interval='0 1 * * *', default_args=default_args) as dag:
    client_ids = bigquery_etl_query(
        task_id='client_ids',
        destination_table='client_ids',
        dataset_id='example'
    )

By default, bigquery_etl_query will execute the query in /sql/{dataset_id}/{destination_table}/query.sql and write to the derived-datasets project but this can be changed via the function arguments.

This DAG will then execute /sql/example/client_ids/query.sql every day, writing results to the client_ids table in the example dataset in the derived-datasets project.

Other considerations

  • The Airflow task will overwrite the destination table partition
    • Destination table should be partitioned by submission_date
    • date_partition_parameter argument in bigquery_etl_query can be set to None to overwrite the whole table
  • Airflow can be tested locally following instructions here: https://github.com/mozilla/telemetry-airflow#testing-gke-jobs-including-bigquery-etl-changes
  • It's possible to change the Docker image that Airflow uses to test changes to bigquery-etl before merging changes to master
    • Supply a value to the docker_image argument in bigquery_etl_query

Dataset Specific

Working with Crash Pings

Here are some snippets to get you started querying crash pings from the Dataset API.

We can first load and instantiate a Dataset object to query the crash pings, and look at the possible fields to filter on:

from moztelemetry.dataset import Dataset
telem = Dataset.from_source("telemetry")
telem.schema
# => 'submissionDate, sourceName, sourceVersion, docType, appName, appUpdateChannel,
#     appVersion, appBuildId'

The more specific these filters, the faster the data can be pulled. The fields can be filtered by either a value or a callable. For example, a version and date range can be specified using the v5758 and dates lambdas below:

v5758 = lambda x: x[:2] in ('57', '58')
dates = lambda x: '20180126' <= x <= '20180202'
telem = (
    Dataset.from_source("telemetry")
    .where(docType='crash', appName="Firefox", appUpdateChannel="release",
           appVersion=v5758, submissionDate=dates)
)

Now, referencing the docs for the crash ping, the desired fields can be selected and brought in as a Spark RDD named pings:

sel = (
    telem.select(
        os_name='environment.system.os.name',
        os_version='environment.system.os.version',
        app_version='application.version',
        app_architecture='application.architecture',
        clientId='clientId',
        creationDate='creationDate',
        submissionDate='meta.submissionDate',
        sample_id='meta.sampleId',
        modules='payload.stackTraces.modules',
        stackTraces='payload.stackTraces',
        oom_size='payload.metadata.OOMAllocationSize',
        AvailablePhysicalMemory='payload.metadata.AvailablePhysicalMemory',
        AvailableVirtualMemory='payload.metadata.AvailableVirtualMemory',
        TotalPhysicalMemory='payload.metadata.TotalPhysicalMemory',
        TotalVirtualMemory='payload.metadata.TotalVirtualMemory',
        reason='payload.metadata.MozCrashReason',
        payload='payload',
    )
)
pings = sel.records(sc)

Working with Normandy events

A common request is to count the number of users who have enrolled or unenrolled from a SHIELD experiment.

The events table includes Normandy enrollment and unenrollment events for both pref-flip and add-on studies. Note that the events table is updated nightly.

Normandy events have event_category normandy. The event_string_value will contain the experiment slug (for pref-flip experiments) or name (for add-on experiments).

Normandy events are described in detail in the Firefox source tree docs.

Note that add-on studies do not have branch information in the events table, since add-ons, not Normandy, are responsible for branch assignment. For studies built with the add-on utilities, branch assignments are published to the telemetry_shield_study_parquet dataset.
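
For example, a heavily hedged sketch of counting clients per branch from that dataset might look like the following; the payload.study_name and payload.branch field names (and the study slug) are assumptions for illustration, so verify them against the dataset reference before relying on this:

SELECT
  payload.branch,
  COUNT(DISTINCT client_id) AS n_clients
FROM telemetry_shield_study_parquet
WHERE payload.study_name = 'my-addon-study-slug'  -- hypothetical study name
GROUP BY 1
ORDER BY 2 DESC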

Counting pref-flip enrollment events by branch

The event_map_values column of enroll events contains a branch key, describing which branch the user enrolled in.

To fetch a count of events by branch in PySpark:

import pyspark.sql.functions as f
events = spark.table("events")

# For example...
EXPERIMENT_SLUG = "prefflip-webrender-v1-2-1492568"
EXPERIMENT_START = "20180920"

enrollments_by_day = (
  events
  .filter(events.event_category == "normandy")
  .filter(events.event_method == "enroll")
  .filter(events.event_string_value == EXPERIMENT_SLUG)
  .filter(events.submission_date_s3 >= EXPERIMENT_START)
  .withColumn("branch", events.event_map_values.getItem("branch"))
  .groupBy(events.submission_date_s3, "branch")
  .agg(f.count("*").alias("n"))
  .toPandas()
)

Equivalently, in Presto SQL:

SELECT
  submission_date_s3,
  event_map_values['branch'] AS branch,
  COUNT(*) AS n
FROM events
WHERE
  event_category = 'normandy'
  AND event_method = 'enroll'
  AND event_string_value = '{{experiment_slug}}'
  AND submission_date_s3 >= '{{experiment_start}}'
GROUP BY 1, 2
ORDER BY 1, 2

Counting pref-flip unenrollment events by branch

The event_map_values column of unenroll events includes a reason key. Reasons are described in the Normandy docs. Normal unenroll events at the termination of a study will occur for the reason recipe-not-seen.

To fetch a count of events by reason and branch in PySpark:

unenrollments_by_reason = (
  events
  .filter(events.event_category == "normandy")
  .filter(events.event_method == "unenroll")
  .filter(events.event_string_value == EXPERIMENT_SLUG)
  .filter(events.submission_date_s3 >= EXPERIMENT_START)
  .withColumn("branch", events.event_map_values.getItem("branch"))
  .withColumn("reason", events.event_map_values.getItem("reason"))
  .groupBy(events.submission_date_s3, "branch", "reason")
  .agg(f.count("*").alias("n"))
  .toPandas()
)

Real-time

Creating a Real-time Analysis Plugin

This technique relies on the AWS ingestion pipeline. In BigQuery, the tables in the moz-fx-data-shared-prod:telemetry_live dataset have only a few minutes of latency, so you can query those from STMO or the BigQuery console for near real-time data access instead of writing an analysis plugin.
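
For example, a quick near real-time check of how many main pings arrived over the last half hour can be written as the following sketch (swap in the live table for the doctype you care about):

SELECT
  COUNT(*) AS pings_last_30_min
FROM
  `moz-fx-data-shared-prod.telemetry_live.main_v4`
WHERE
  DATE(submission_timestamp) = CURRENT_DATE()
  AND submission_timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 MINUTE)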

Getting Started

Creating an analysis plugin consists of three steps:

  1. Writing a message matcher

    The message matcher allows one to select specific data from the data stream.

  2. Writing the analysis code/business logic

    The analysis code allows one to aggregate, detect anomalies, apply machine learning algorithms etc.

  3. Writing the output code

    The output code allows one to structure the analysis results in an easy-to-consume format.

Step by Step Setup

  1. Go to the CEP site: https://pipeline-cep.prod.mozaws.net/

  2. Login/Register using your Google @mozilla.com account

  3. Click on the Plugin Deployment tab

  4. Create a message matcher

    1. Edit the message_matcher variable in the Heka Analysis Plugin Configuration text area. For this example we are selecting all telemetry messages. The full syntax of the message matcher can be found here: http://mozilla-services.github.io/lua_sandbox/util/message_matcher.html

      message_matcher = "Type == 'telemetry'"
      
  5. Test the message matcher

    1. Click the Run Matcher button.

      Your results or error message will appear to the right. You can browse the returned messages to examine their structure and the data they contain; this is very helpful when developing the analysis code but is also useful for data exploration even when not developing a plugin.

  6. Delete the code in the Heka Analysis Plugin text area

  7. Create the Analysis Code (process_message)

    The process_message function is invoked every time a message is matched and should return 0 for success and -1 for failure. Full interface documentation: http://mozilla-services.github.io/lua_sandbox/heka/analysis.html

    1. Here is the minimum implementation; type it into the Heka Analysis Plugin text area:

      function process_message()
          return 0 -- success
      end
      
  8. Create the Output Code (timer_event)

    The timer_event function is invoked every ticker_interval seconds.

    1. Here is the minimum implementation; type it into the Heka Analysis Plugin text area:

      function timer_event()
      end
      
  9. Test the Plugin

    1. Click the Test Plugin button.

      Your results or error message will appear to the right. If an error is output, correct it and test again.

  10. Extend the Code to Perform a Simple Message Count Analysis/Output

    1. Replace the code in the Heka Analysis Plugin text area with the following:

      local cnt = 0
      function process_message()
          cnt = cnt + 1                       -- count the number of messages that matched
          return 0
      end
      
      function timer_event()
          inject_payload("txt", "types", cnt) -- output the count
      end
      
  11. Test the Plugin

    1. Click the Test Plugin button.

      Your results or error message will appear to the right. If an error is output, correct it and test again.

  12. Extend the Code to Perform a More Complex Count by Type Analysis/Output

    1. Replace the code in the Heka Analysis Plugin text area with the following:

      types = {}
      function process_message()
          -- read the docType from the message, if it doesn't exist set it to "unknown"
          local dt = read_message("Fields[docType]") or "unknown"
      
          -- look up the docType in the types hash
          local cnt = types[dt]
          if cnt then
              types[dt] = cnt + 1   -- if the type cnt exists, increment it by one
          else
              types[dt] = 1         -- if the type cnt didn't exist, initialize it to one
          end
          return 0
      end
      
      function timer_event()
          add_to_payload("docType = Count\n")   -- add a header to the output
          for k, v in pairs(types) do           -- iterate over all the key/values (docTypes/cnt in the hash)
              add_to_payload(k, " = ", v, "\n") -- add a line to the output
          end
          inject_payload("txt", "types")        -- finalize all the data written to the payload
      end
      
  13. Test the Plugin

    1. Click the Test Plugin button.

      Your results or error message will appear to the right. If an error is output, correct it and test again.

  14. Deploy the plugin

    1. Click the Deploy Plugin button and dismiss the successfully deployed dialog.
  15. View the running plugin

    1. Click the Plugins tab and look for the plugin that was just deployed: {user}.example
    2. Right-click on the plugin to activate the context menu, allowing you to view the source or stop the plugin.
  16. View the plugin output

    1. Click on the Dashboards tab
    2. Click on the Raw Dashboard Output link
    3. Click on analysis.{user}.example.types.txt link

Where to go from here

  • Lua Reference: http://www.lua.org/manual/5.1/manual.html
  • Available Lua Modules: https://mozilla-services.github.io/lua_sandbox_extensions/
  • Support

See My Pings

This technique relies on the AWS ingestion pipeline. In BigQuery, the tables in the moz-fx-data-shared-prod:telemetry_live dataset have only a few minutes of latency, so you can query those tables for pings from your client_id using STMO or the BigQuery console instead of writing an analysis plugin.
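
As a sketch, a query like the following returns your most recent main pings from today; substitute your own clientId from about:telemetry (the value below is a placeholder, not a real client):

SELECT
  submission_timestamp,
  document_id,
  normalized_channel
FROM
  `moz-fx-data-shared-prod.telemetry_live.main_v4`
WHERE
  DATE(submission_timestamp) = CURRENT_DATE()
  -- Placeholder: replace with your own clientId
  AND client_id = '00000000-0000-0000-0000-000000000000'
ORDER BY
  submission_timestamp DESC
LIMIT 10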

So you want to see what you're sending to the telemetry pipeline, huh? Well, follow these steps and we'll have you reading some JSON in no time.

For a more thorough introduction, see Creating a Real-Time Analysis Plugin Cookbook.

Steps to Create a Viewing Output

  1. Get your clientId from whatever product you're using. For desktop, it's available in about:telemetry.

  2. Go to the CEP site: https://pipeline-cep.prod.mozaws.net/

  3. Login/Register using your Google @mozilla.com account

  4. Click on the "Analysis Plugin Deployment" tab

  5. Under "Heka Analysis Plugin Configuration", put the following config:

filename = '<your_name>_<product>_pings.lua'
message_matcher = 'Type == "telemetry" && Fields[docType] == "<doctype>" && Fields[clientId] == "<your_client_id>"'
preserve_data = false
ticker_interval = 60

Where <product> is whatever product you're testing, and <doctype> is whatever ping you're testing (e.g. main, core, mobile-event, etc.).

  1. Under "Heka Analysis Plugin" put the following. This will, by default, show the most recent 10 pings that match your clientId on the specified docType.

NOTE: If you are looking at main, saved-session, or crash pings, the submitted data is split out into several pieces. Reading just Fields[submission] will not give you the entire submitted ping contents. You can change that to e.g. Fields[environment.system], Fields[payload.histograms], Fields[payload.keyedHistograms]. To see all of the available fields, look at a ping in the Matcher tab.

require "string"
require "table"

output = {}
max_len = 10
cur_ind = 1

function process_message()
    output[cur_ind] = read_message("Fields[submission]")
    cur_ind = cur_ind + 1
    if cur_ind > max_len then
        cur_ind = 1
    end
    return 0
end

function timer_event(ns, shutdown)
    local res = table.concat(output, ",")
    add_to_payload("[" .. res .. "]")
    inject_payload("json")
end
  1. Click "Run Matcher", then "Test Plugin". Check that no errors appear in "Debug Output"

  2. Click "Deploy Plugin". Your output will be available at https://pipeline-cep.prod.mozaws.net/dashboard_output/analysis.<username>_mozilla_com.<your_name>_<product>_pings..json

CEP Matcher

This technique relies on the AWS ingestion pipeline. In BigQuery, the tables in the moz-fx-data-shared-prod:telemetry_live dataset have only a few minutes of latency, so you can query those from STMO or the BigQuery console for near real-time data access instead of writing an analysis plugin.

The CEP Matcher tab lets you easily view some current pings of any ping type. To access it, follow these first few directions for accessing the CEP. Once there, click on the "Matcher" tab. The message-matcher is set by default to TRUE, meaning all pings will be matched. Click "Run Matcher" and a few pings will show up.

Editing the Message Matcher

Changing the message matcher will filter down the accepted pings, letting you home in on a certain type. Generally, you can filter on any fields in a ping. For example, docType:

Fields[docType] == "main"

Or OS:

Fields[os] == "Android"

We can also combine matchers together:

Fields[docType] == "core" && Fields[os] == "Android" && Fields[appName] == "Focus"

which would get us a sample of Focus Android core pings.

Note that most of the time, you want just proper telemetry pings, so include this in your matcher:

Type == "telemetry"

The Message Matcher documentation has more information on the syntax.

To see the available fields that you can filter on for any docType, see this document. For example, look under the telemetry top-level field at system-addon-deployment-diagnostics. The available fields to filter on are:

required binary Logger;
required fixed_len_byte_array(16) Uuid;
optional int32 Pid;
optional int32 Severity;
optional binary EnvVersion;
required binary Hostname;
required int64 Timestamp;
optional binary Payload;
required binary Type;
required group Fields {
    required binary submission;
    required binary Date;
    required binary appUpdateChannel;
    required double sourceVersion;
    required binary documentId;
    required binary docType;
    required binary os;
    optional binary environment.addons;
    optional binary DNT;
    required binary environment.partner;
    required binary sourceName;
    required binary appVendor;
    required binary environment.profile;
    required binary environment.settings;
    required binary normalizedChannel;
    required double sampleId;
    required binary Host;
    required binary geoCountry;
    required binary geoCity;
    required boolean telemetryEnabled;
    required double creationTimestamp;
    required binary appVersion;
    required binary appBuildId;
    required binary environment.system;
    required binary environment.build;
    required binary clientId;
    required binary submissionDate;
    required binary appName;
}

So, for example, you could have a message matcher like:

Type == "telemetry" && Fields[geoCountry] == "US"

Metrics

DAU and MAU

For the purposes of DAU, a profile is considered active if it sends any main ping.

  • Dates are defined by submission_date_s3 or submission_date.

DAU is the number of clients sending a main ping on a given day.

MAU is the number of unique clients who have been a DAU on any day in the last 28 days. In other words, any client that contributes to DAU in the last 28 days would also contribute to MAU for that day. Note that this is not simply the sum of DAU over 28 days, since any particular client could be active on many days.

WAU is the number of unique clients who have been a DAU on any day in the last 7 days. Caveats above for MAU also apply to WAU.

To make the time boundaries more clear, let's consider a particular date 2019-01-28. The DAU number assigned to 2019-01-28 should consider all main pings received during 2019-01-28 UTC. We cannot observe the full data until 2019-01-28 closes (and in practice we need to wait a bit longer since we are usually referencing derived datasets like clients_daily that are updated once per day over several hours following midnight UTC), so the earliest we can calculate this value is on 2019-01-29. If plotted as a time series, this value should always be plotted at the point labeled 2019-01-28. Likewise, MAU for 2019-01-28 should consider a 28 day range that includes main pings received on 2019-01-28 and back to beginning of day UTC 2019-01-01. Again, the earliest we can calculate the value is on 2019-01-29.

For quick analysis, using firefox_desktop_exact_mau28_by_dimensions is recommended. Below is an example query for getting MAU, WAU, and DAU for 2018 using firefox_desktop_exact_mau28_by_dimensions.

SELECT
  submission_date,
  SUM(mau) AS mau,
  SUM(wau) AS wau,
  SUM(dau) AS dau
FROM
  telemetry.firefox_desktop_exact_mau28_by_dimensions
WHERE
  submission_date >= '2018-01-01'
  AND submission_date < '2019-01-01'
GROUP BY
  submission_date
ORDER BY
  submission_date

For analysis of dimensions not available in firefox_desktop_exact_mau28_by_dimensions, using clients_last_seen is recommended. Below is an example query for getting MAU, WAU, and DAU by app_version for 2018 using clients_last_seen.

SELECT
  submission_date,
  app_version,
  -- days_since_seen is always between 0 and 28, so MAU could also be
  -- calculated with COUNT(days_since_seen) or COUNT(*)
  COUNTIF(days_since_seen < 28) AS mau,
  COUNTIF(days_since_seen < 7) AS wau,
  -- days_since_* values are always between 0 and 28 or null, so DAU could also
  -- be calculated with COUNTIF(days_since_seen = 0)
  COUNTIF(days_since_seen < 1) AS dau
FROM
  telemetry.clients_last_seen
WHERE
  submission_date >= '2018-01-01'
  AND submission_date < '2019-01-01'
GROUP BY
  submission_date,
  app_version
ORDER BY
  submission_date,
  app_version

For analysis of only DAU, using clients_daily is more efficient than clients_last_seen. Getting MAU and WAU from clients_daily is not recommended. Below is an example query for getting DAU for 2018 using clients_daily.

SELECT
  submission_date_s3,
  COUNT(*) AS dau
FROM
  telemetry.clients_daily
WHERE
  -- In BigQuery use yyyy-MM-DD, e.g. '2018-01-01'
  submission_date_s3 >= '20180101'
  AND submission_date_s3 < '20190101'
GROUP BY
  submission_date_s3
ORDER BY
  submission_date_s3

main_summary can also be used for getting DAU. Below is an example query using a 1% sample over March 2018 using main_summary:

SELECT
  submission_date_s3,
  -- Note: this does not include NULL client_id in count where above methods do
  COUNT(DISTINCT client_id) * 100 AS DAU
FROM
  telemetry.main_summary
WHERE
  sample_id = '51'
  -- In BigQuery use yyyy-MM-DD, e.g. '2018-03-01'
  AND submission_date_s3 >= '20180301'
  AND submission_date_s3 < '20180401'
GROUP BY
  submission_date_s3
ORDER BY
  submission_date_s3

Active DAU and Active MAU

An Active User is defined as a client with total_daily_uri >= 5 for a given date.

  • Dates are defined by submission_date_s3 or submission_date.
  • A client's total_daily_uri is defined as the sum of their scalar_parent_browser_engagement_total_uri_count values for a given date [1].

Active DAU (aDAU) is the number of Active Users on a given day.

Active MAU (aMAU) is the number of unique clients who have been an Active User on any day in the last 28 days. In other words, any client that contributes to aDAU in the last 28 days would also contribute to aMAU for that day. Note that this is not simply the sum of aDAU over 28 days, since any particular client could be active on many days.

Active WAU (aWAU) is the number of unique clients who have been an Active User on any day in the last 7 days. Caveats above for aMAU also apply to aWAU.

To make the time boundaries more clear, let's consider a particular date 2019-01-28. The aDAU number assigned to 2019-01-28 should consider all main pings received during 2019-01-28 UTC. We cannot observe the full data until 2019-01-28 closes (and in practice we need to wait a bit longer since we are usually referencing derived datasets like clients_daily that are updated once per day over several hours following midnight UTC), so the earliest we can calculate this value is on 2019-01-29. If plotted as a time series, this value should always be plotted at the point labeled 2019-01-28. Likewise, aMAU for 2019-01-28 should consider a 28 day range that includes main pings received on 2019-01-28 and back to beginning of day UTC 2019-01-01. Again, the earliest we can calculate the value is on 2019-01-29.

For quick analysis, using firefox_desktop_exact_mau28_by_dimensions is recommended. Below is an example query for getting aMAU, aWAU, and aDAU for 2018 using firefox_desktop_exact_mau28_by_dimensions.

SELECT
  submission_date,
  SUM(visited_5_uri_mau) AS visited_5_uri_mau,
  SUM(visited_5_uri_wau) AS visited_5_uri_wau,
  SUM(visited_5_uri_dau) AS visited_5_uri_dau
FROM
  telemetry.firefox_desktop_exact_mau28_by_dimensions
WHERE
  submission_date >= '2018-01-01'
  AND submission_date < '2019-01-01'
GROUP BY
  submission_date
ORDER BY
  submission_date

For analysis of dimensions not available in firefox_desktop_exact_mau28_by_dimensions, using clients_last_seen is recommended. Below is an example query for getting aMAU, aWAU, and aDAU by app_version for 2018 using clients_last_seen.

SELECT
  submission_date,
  app_version,
  -- days_since_* values are always < 28 or null, so aMAU could also be
  -- calculated with COUNT(days_since_visited_5_uri)
  COUNTIF(days_since_visited_5_uri < 28) AS visited_5_uri_mau,
  COUNTIF(days_since_visited_5_uri < 7) AS visited_5_uri_wau,
  -- days_since_* values are always >= 0 or null, so aDAU could also be
  -- calculated with COUNTIF(days_since_visited_5_uri = 0)
  COUNTIF(days_since_visited_5_uri < 1) AS visited_5_uri_dau
FROM
  telemetry.clients_last_seen
WHERE
  submission_date >= '2018-01-01'
  AND submission_date < '2019-01-01'
GROUP BY
  submission_date,
  app_version
ORDER BY
  submission_date,
  app_version

For analysis of only aDAU, using clients_daily is more efficient than clients_last_seen. Getting aMAU and aWAU from clients_daily is not recommended. Below is an example query for getting aDAU for 2018 using clients_daily.

SELECT
  submission_date_s3,
  COUNT(*) AS visited_5_uri_dau
FROM
  telemetry.clients_daily
WHERE
  scalar_parent_browser_engagement_total_uri_count_sum >= 5
  -- In BigQuery use yyyy-MM-DD, e.g. '2018-01-01'
  AND submission_date_s3 >= '20180101'
  AND submission_date_s3 < '20190101'
GROUP BY
  submission_date_s3
ORDER BY
  submission_date_s3

main_summary can also be used for getting aDAU. Below is an example query using a 1% sample over March 2018 using main_summary:

SELECT
  submission_date_s3,
  COUNT(*) * 100 AS visited_5_uri_dau
FROM (
  SELECT
    submission_date_s3,
    client_id,
    SUM(scalar_parent_browser_engagement_total_uri_count) >= 5 AS visited_5_uri
  FROM
    telemetry.main_summary
  WHERE
    sample_id = '51'
    -- In BigQuery use yyyy-MM-DD, e.g. '2018-03-01'
    AND submission_date_s3 >= '20180301'
    AND submission_date_s3 < '20180401'
  GROUP BY
    submission_date_s3,
    client_id)
WHERE
  visited_5_uri
GROUP BY
  submission_date_s3
ORDER BY
  submission_date_s3

[1]: Note that the probe measuring scalar_parent_browser_engagement_total_uri_count only exists in clients on Firefox 50 and up. Clients on earlier versions of Firefox won't be counted as Active Users (regardless of their usage). Similarly, scalar_parent_browser_engagement_total_uri_count doesn't increment when a client is in Private Browsing mode, so Private Browsing usage isn't counted either.

Authored by the Product Data Science Team. Please direct questions/concerns to Ben Miroglio (bmiroglio).

Retention

Retention measures the rate at which users are continuing to use Firefox, making it one of the more important metrics we track. We commonly measure retention between releases, experiment cohorts, and various Firefox subpopulations to better understand how a change to the user experience or the use of a specific feature affects behavior.

N Week Retention

Time is an embedded component of retention. Most retention analysis starts with some anchor, or action that is associated with a date (experiment enrollment date, profile creation date, button clicked on date d, etc.). We then look 1, 2, …, N weeks beyond the anchor to see what percent of users have submitted a ping (signaling their continued use of Firefox).

For example, let’s say we are calculating retention for new Firefox users. Each user can then be anchored by their profile_creation_date, and we can count the number of users who submitted a ping between 7-13 days after profile creation (1 Week retention), 14-20 days after profile creation (2 Week Retention), etc.

Example Methodology

Given a dataset in Spark, we can construct a field retention_period that uses submission_date_s3 to determine the period to which a ping belongs (i.e. if a user created their profile on April 1st, all pings submitted between April 8th and April 14th are assigned to week 1). 1-week retention can then be simplified to the percent of users with a 1 value for retention_period, 2-week retention simplifies to the percent of users with a 2 value for retention_period, ..., etc. Note that each retention period is independent of the others, so it is possible to have higher 2-week retention than 1-week retention (especially during holidays).

First, let's map 1, 2, ..., N week retention to the number of days elapsed after the anchor point:

PERIODS = {}
N_WEEKS = 6
for i in range(1, N_WEEKS + 1):
    PERIODS[i] = {
        'start': i * 7,
        'end': i * 7 + 6
    }  

Which gives us

{1: {'end': 13, 'start': 7},
 2: {'end': 20, 'start': 14},
 3: {'end': 27, 'start': 21},
 4: {'end': 34, 'start': 28},
 5: {'end': 41, 'start': 35},
 6: {'end': 48, 'start': 42}}

Next, let's define some helper functions:

import datetime as dt
import pandas as pd
import pyspark.sql.types as st
import pyspark.sql.functions as F

udf = F.udf

def date_diff(d1, d2, fmt='%Y%m%d'):
    """
    Returns days elapsed from d2 to d1 as an integer

    Params:
    d1 (str)
    d2 (str)
    fmt (str): format of d1 and d2 (must be the same)

    >>> date_diff('20170205', '20170201')
    4

    >>> date_diff('20170201', '20170205')
    -4
    """
    try:
        return (pd.to_datetime(d1, format=fmt) -
                pd.to_datetime(d2, format=fmt)).days
    except:
        return None


@udf(returnType=st.IntegerType())
def get_period(anchor, submission_date_s3):
    """
    Given an anchor and a submission_date_s3,
    returns what period a ping belongs to. This
    is a spark UDF.

    Params:
    anchor (col): anchor date
    submission_date_s3 (col): a ping's submission_date to s3

    Global:
    PERIODS (dict): defined globally based on n-week method

    Returns an integer indicating the retention period
    """
    if anchor is not None:
        diff = date_diff(submission_date_s3, anchor)
        if diff >= 7: # exclude first 7 days
            for period in sorted(PERIODS):
                if diff <= PERIODS[period]['end']:
                    return period

@udf(returnType=st.StringType())
def from_unixtime_handler(ut):
    """
    Converts unix time (in days) to a string in %Y%m%d format.
    This is a spark UDF.

    Params:
    ut (int): unix time in days

    Returns a date as a string if it is parsable by datetime, otherwise None
    """
    if ut is not None:
        try:
            return (dt.datetime.fromtimestamp(ut * 24 * 60 * 60).strftime("%Y%m%d"))
        except:
            return None

Now we can load in a subset of main_summary and construct the necessary fields for retention calculations:

ms = spark.sql("""
    SELECT
        client_id,
        submission_date_s3,
        profile_creation_date,
        os
    FROM main_summary
    WHERE
        submission_date_s3 >= '20180401'
        AND submission_date_s3 <= '20180603'
        AND sample_id = '42'
        AND app_name = 'Firefox'
        AND normalized_channel = 'release'
        AND os in ('Darwin', 'Windows_NT', 'Linux')
    """)

PCD_CUTS = ('20180401', '20180415')

ms = (
    ms.withColumn("pcd", from_unixtime_handler("profile_creation_date")) # i.e. 17500 -> '20171130'
      .filter("pcd >= '{}'".format(PCD_CUTS[0]))
      .filter("pcd <= '{}'".format(PCD_CUTS[1]))
      .withColumn("period", get_period("pcd", "submission_date_s3"))
)

Note that we filter to profiles that were created in the first half of April so that we have sufficient time to observe 6 weeks of behavior. Now we can calculate retention!

os_counts = (
    ms
    .groupby("os")
    .agg(F.countDistinct("client_id").alias("total_clients"))
)

weekly_counts = (
    ms
    .groupby("period", "os")
    .agg(F.countDistinct("client_id").alias("n_week_clients"))
)

retention_by_os = (
    weekly_counts
    .join(os_counts, on='os')
    .withColumn("retention", F.col("n_week_clients") / F.col("total_clients"))
    # Add a 95% confidence interval based on the normal approximation for a binomial distribution,
    # p ± z * sqrt(p*(1-p)/n).
    # The 95% CI spans the range `retention ± ci_95_semi_interval`.
    .withColumn(
      "ci_95_semi_interval",
      F.lit(1.96) * F.sqrt(F.col("retention") * (F.lit(1) - F.col("retention")) / F.col("total_clients"))
    )
)

Peeking at 6-Week Retention

retention_by_os.filter("period = 6").show()
+----------+------+--------------+-------------+-------------------+--------------------+
|        os|period|n_week_clients|total_clients| retention         | ci_95_semi_interval|
+----------+------+--------------+-------------+-------------------+--------------------+
|     Linux|     6|          1495|        22422|0.06667558647756668|0.003265266498407...|
|    Darwin|     6|          1288|         4734|0.27207435572454586|0.012677372722376635|
|Windows_NT|     6|         29024|       124872|0.23243000832852842|0.002342764476746...|
+----------+------+--------------+-------------+-------------------+--------------------+


We observe that 6.7% ± 0.3% of Linux users whose profiles were created in the first half of April submitted a ping 6 weeks later, and so forth. The example code snippets are consolidated in this notebook.

New vs. Existing User Retention

The above example calculates New User Retention, which is distinct from Existing User Retention. This distinction is important when understanding retention baselines (i.e. does this number make sense?). Existing users typically have much higher retention numbers than new users.

Note that it is more common in industry to refer to Existing User Retention in terms of "Churn" (Churn = 1 - Retention); however, we use retention across the board for the sake of consistency and interpretability.

Please be sure to specify whether or not your retention analysis is for new or existing users.

What If There's No Anchor Point?

Sometimes there isn't a clear anchor point like profile_creation_date or enrollment_date.

For example, imagine you are tasked with reporting retention numbers for users that have enabled sync (sync_configured) compared to users that haven't. Being a boolean pref, there is no straightforward way to determine when sync_configured flipped from false to true aside from looking at a client's entire history (which is not recommended!). What now?

We can construct an artificial anchor point using fixed weekly periods; the retention concepts then remain unchanged. The process can be summarized by the following steps:

  • Define a baseline week cohort
    • For this example let's define the baseline as users that submitted pings between 2018-01-01 and 2018-01-07
  • Count all users with/without sync enabled in this period
  • Assign these users to an anchor point of 2018-01-01 (the beginning of the baseline week)
  • Count the number of users in the baseline week that submitted a ping between 7-13 days after 2018-01-01 (1 Week retention), 14-20 days after 2018-01-01 (2 Week Retention), etc.
  • Shift the baseline week up 7 days (and all other dates) and repeat as necessary

This method is also valid in the presence of an anchor point; however, it is recommended that the anchor-point method be used when possible.
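
As a rough SQL sketch of the fixed-baseline-week approach (the other examples in this section use Spark), the query below computes 1-week retention for a single baseline week from clients_daily, split by whether sync was configured during that week. The sync_configured column name and the BigQuery-style date strings are assumptions here, so verify them against the dataset reference before using this:

WITH baseline AS (
  -- Baseline cohort: anyone who sent a ping during the baseline week,
  -- flagged by whether sync was configured at any point that week.
  SELECT
    client_id,
    LOGICAL_OR(sync_configured) AS sync_configured  -- column name is an assumption
  FROM telemetry.clients_daily
  WHERE submission_date >= '2018-01-01'
    AND submission_date <= '2018-01-07'
  GROUP BY client_id
),
week_1 AS (
  -- Clients seen 7-13 days after the artificial anchor of 2018-01-01.
  SELECT DISTINCT client_id
  FROM telemetry.clients_daily
  WHERE submission_date >= '2018-01-08'
    AND submission_date <= '2018-01-14'
)
SELECT
  b.sync_configured,
  COUNT(*) AS baseline_clients,
  COUNTIF(w.client_id IS NOT NULL) / COUNT(*) AS week_1_retention
FROM baseline AS b
LEFT JOIN week_1 AS w USING (client_id)
GROUP BY b.sync_configured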

Confounding Factors

When performing retention analysis between two or more groups, it is important to look at other usage metrics to get an understanding of other influential factors.

For example (borrowing the sync example from the previous section), you find that users with and without sync have a 1-week retention of 0.80 and 0.40, respectively. Wow, we should really be promoting sync, as it could double retention numbers!

Not quite. It turns out that when you next look at active_ticks and total_uri_count, you find that sync users report much higher numbers for these measures as well. Now how can we explain this difference in retention?

There could be an entirely separate cookbook devoted to answering this question; this contrived example is simply meant to demonstrate that comparing retention numbers between two groups doesn't capture the full story. Without an experiment or a model-based approach, all we can say is "enabling sync is associated with higher retention numbers." There is still value in this assertion, but it should be stressed that association/correlation != causation!

Collecting New Data

Guidelines

For information about what sorts of data may be collected, and for information on getting a data collection request reviewed, please read the Data Collection Guidelines.

Mechanics

The mechanics of how to instrument new data collection in Firefox are covered in Adding a new Telemetry probe.

For non-Telemetry data collection, we have a mechanism for streamlining ingestion of structured (JSON) data that utilizes the same underlying infrastructure. See this cookbook for details on using it.

Client Implementation Guidelines for Experiments

There are two supported approaches for enabling experimental features for Firefox:

  • Firefox Prefs
    • Prefs can be used to control features that land in-tree. Feature Gates provide a wrapper around prefs that can be used from JavaScript.
  • Firefox Extensions AKA "Add-ons".
    • If the feature being tested should not land in the tree, or if it will ultimately ship as an extension, then an extension should be used.

New features go through the standard Firefox review, testing, and deployment processes, and are then enabled experimentally in the field using Normandy.

Prefs

Firefox Preferences (AKA "prefs") are commonly used to enable and disable features. However, prefs are more complex to implement correctly than feature gates.

Each pref should represent a different experimental treatment. If your experimental feature requires multiple prefs, then Normandy does not currently support this but will soon. In the meantime, an extension such as multipreffer may be used.

There are three types of Prefs:

  1. Built-in prefs - shipped with Firefox, in firefox.js.
  2. user branch - set by the user, overriding built-in prefs.
  3. default branch - Overrides built-in prefs, but not user branch prefs (a user-set value still takes precedence). Only persists until the browser session ends; the next restart will revert to either the built-in or user branch value (if set).

Normandy supports overriding both the user and default branches, although the latter is preferred as it does not permanently override user settings. default branch prefs are simple to reset since they do not persist past a restart.

In order for features to be activated experimentally using default branch prefs:

  • The feature must not start up before final-ui-startup is observed.

For instance, to set an observer:

Services.obs.addObserver(this, "final-ui-startup", true);

In this example, this would implement an observe(subject, topic, data) function which will be called when final-ui-startup is observed. See the Observer documentation for more information.

  • It must be possible to enable/disable the feature at runtime, via a pref change.

This is similar to the observer pattern above:

Services.prefs.addObserver("pref_name", this);

More information is available in the Preference service documentation.

  • Never use Services.prefs.prefHasUserValue(), or any other function specific to user branch prefs.

  • Prefs should be set by default in firefox.js

If your feature cannot abide by one or more of these rules (for instance, it needs to run at startup and/or cannot be toggled at runtime), then experimental preferences can be set on the user branch. This is more complex than using the methods described above: user branch prefs override the user's choice, which is a really complex thing to try to support when flipping prefs experimentally. We also need to be careful to back up and reset the pref, and then figure out how to resolve conflicts if the user has changed the pref in the meantime.

Feature Gates

A new Feature Gate library for Firefox Desktop is now available.

Each feature gate should represent a different experimental treatment. If your experimental feature requires multiple flags, then Normandy will not be able to support this directly and an extension may be used.

Feature Gate caveats

The current Feature Gate library comes with a few caveats, and may not be appropriate for your situation:

  • Only JS is supported.
  • Always asynchronous.

Future versions of the Feature Gate API will include C++/Rust support and a synchronous API.

Using the Feature Gate library

Read the documentation to get started.

Extensions

Firefox currently supports the Web Extensions API.

If new WebExtension APIs are needed, they should land in-tree. Extensions which are signed by Mozilla can load privileged code using WebExtension Experiments, but this is not preferred.

WebExtensions go through the same correctness and performance tests as other features. This is possible using the Mozilla tryserver by dropping your XPI into testing/profiles/common/extensions in mozilla-central and pushing to Tryserver - see the Testing Extensions section below.

NOTE - it is ideal to test against the version of Firefox which the extension will release against, but there is a bug related to artifact builds on release channels which must be worked around. The workaround is pretty simple (modify an artifacts.py file), but once this bug is resolved the process will be much simpler.

Each extension can represent a different experimental treatment (preferred), or the extension can choose the branch internally.

SHIELD studies

The previous version of the experiments program, SHIELD, always bundled privileged code with extensions and would do things such as mock UI features in Firefox.

This sort of approach is discouraged for new features - land these (or the necessary WebExtension APIs) in-tree instead.

For the moment, the SHIELD Study Add-on Utilities may be used if the extension needs to control the lifecycle of the study, but using one extension per experimental treatment makes this unnecessary and is preferred. The APIs provided by the SHIELD Study Add-on Utilities will be available as privileged APIs shipped with Firefox soon.

Development and Testing

Testing Built-in Features

Firefox features go through standard development and testing processes. See the Firefox developer guide for more information.

Testing Extensions

Extensions do not need to go through the same process, but should take advantage of Mozilla CI and bug tracking systems:

  1. Use the Mozilla CI to test changes (tryserver).
  2. Performance tests (this step is required) - extension XPI files should be placed in testing/profiles/common/extensions/, which will cause test harnesses to load the XPI.
  3. Custom unit/functional tests (AKA xpcshell/mochitest) may be placed in testing/extensions, although running these tests outside Mozilla CI is acceptable so these are optional.
  4. Receive reviewer approval. A Firefox peer must sign off if this extension contains privileged code, aka WebExtension Experiments.
  • Any Firefox Peer should be able to do the review, or point you to someone who can.
  5. Extension is signed.
  6. Email to pi-request@mozilla.com is sent to request QA.
  7. QA approval signed off in Bugzilla.
  8. Extension is shipped via Normandy.

Example Extensions Testing Workflow

Note that for the below to work you only need Mercurial installed, but if you want to do local testing you must be set up to build Firefox. You don't need to build Firefox from source; artifact builds are sufficient.

In order to use Mozilla CI (AKA "Tryserver"), you must have a full clone of the mozilla-central repository:

hg clone https://hg.mozilla.org/mozilla-central
cd mozilla-central

Copy in unsigned XPI, and commit it to your local Mercurial repo:

cp ~/src/my-extension.xpi testing/profiles/common/extensions/
hg add testing/profiles/common/extensions/my-extension.xpi
hg commit -m "Bug nnn - Testing my extension" testing/profiles/common/extensions/my-extension.xpi

Push to Try:

./mach try -p linux64,macosx64,win64 -b do -u none -t all --artifact

This will run Mozilla CI tests on all platforms

Note that you must have Level 1 commit access to use tryserver. If you are interested in interacting with Mozilla CI from Github (which only requires users to be in the Mozilla GitHub org), check out the Taskcluster Integration proof-of-concept.

Also note that this requires an investment of time to set up, just as CircleCI or Travis CI would, so it's not really appropriate for short-term projects. Use tryserver directly instead.

Telemetry Events Best Practices

Overview:

The Telemetry Events API allows users to define and record events in the browser.

Events are defined in Events.yaml and each event creates records with the following properties:

  • timestamp
  • category
  • method
  • object
  • value
  • extra

With the following restrictions and features:

  • The category, method, and object properties of any record produced by an event must have a value.
  • All combinations of values from the category, method, and object properties must be unique to that particular event (no other event can produce events with the same combination).
  • Events can be 'turned on' or 'turned off' by their category value; i.e., we can instruct the browser to "stop sending us events from the devtools category."

These records are then stored in event pings and available in the events dataset.
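
These properties surface as columns of the events dataset (the same event_category, event_method, event_object, event_string_value, and event_map_values columns used in the Normandy examples elsewhere in this documentation), so a basic exploratory query looks something like this sketch:

SELECT
  event_category,
  event_method,
  event_object,
  COUNT(*) AS n
FROM events
WHERE submission_date_s3 = '20190801'  -- illustrative date
GROUP BY 1, 2, 3
ORDER BY n DESC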

Identifying Events

One challenge with this data is that it can be difficult to identify all the records from a particular event. Unlike Scalars and Histograms, which keep data in individual locations (like scalar_parent_browser_engagement_total_uri_count for total_uri_count), all event records are stored together, regardless of which event generated them. The records themselves don't have a field identifying which event produced them [1].

Take, for example, the manage event in the addonsManager category.

addonsManager: # category
  manage: # event name
    description: >
      ...
    objects: ["extension", "theme", "locale", "dictionary", "other"] # object values
    methods: ["disable", "enable", "sideload_prompt", "uninstall"] # method values
    extra_keys: # extra values
      ...
    notification_emails: ...
    expiry_version: ...
    record_in_processes: ...
    bug_numbers: ...
    release_channel_collection: ...

This event will produce records that look like:

| timestamp | category | method | object | value | extra |
| --- | --- | --- | --- | --- | --- |
| ... | addonsManager | install | extension | ... | |
| ... | addonsManager | update | locale | ... | |
| ... | addonsManager | sideload_prompt | other | ... | |

But none of these records will indicate that it was produced by the manage event. To find all records produced by manage, one would have to query all records where

category = ...
AND method in [...,]
AND object in [...,]

which is not ideal.
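
Concretely, using the methods and objects listed for manage above, such a query would look something like this sketch:

SELECT *
FROM events
WHERE submission_date_s3 = '20190801'  -- illustrative date
  AND event_category = 'addonsManager'
  AND event_method IN ('disable', 'enable', 'sideload_prompt', 'uninstall')
  AND event_object IN ('extension', 'theme', 'locale', 'dictionary', 'other')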

Furthermore, if someone encounters this data without knowing how the manage event works, they need to look up the event based on its category, method, and object values, and then query the data again to find all the related records. It's not immediately clear from the data whether this record:

| timestamp | category | method | object | value | extra |
| --- | --- | --- | --- | --- | --- |
| ... | addonsManager | update | locale | ... | |

and this record:

| timestamp | category | method | object | value | extra |
| --- | --- | --- | --- | --- | --- |
| ... | addonsManager | install | extension | ... | |

are related or not.

Another factor that can add to confusion is the fact that other events can share similar values for methods or objects (or even the combination of method and object). For example:

| timestamp | category | method | object | value | extra |
| --- | --- | --- | --- | --- | --- |
| ... | normandy | update | preference_rollout | ... | |

which can further confuse users.

[1]: Events do have name fields, but they aren't included in the event records and thus are not present in the resulting dataset. Also, if a user defines an event in Events.yaml without specifying a list of acceptable methods, the method will default to the name of the event for records created by that event.

Suggested Convention:

To simplify things in the future, we suggest adding the event name to the category field using dot notation when designing new events:

"category.event_name"

For example:

  • "navigation.search"
  • "addonsManager.manage"
  • "frame.tab"

This provides 3 advantages:

  1. Records produced by this event will be easily identifiable. Also, the event which produced the record will be easier to locate in the code.
  2. Events can be controlled more easily. The category field is what we use to "turn on" and "turn off" events. By creating a 1 to 1 mapping between categories and events, we can control events on an individual level.
  3. By having the category field act as the event identifier, it makes it easier to pass on events to Amplitude and other platforms.
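
With this convention, all records produced by an event can be selected with a single equality filter on the category column. For example (a sketch using a hypothetical dotted category value that follows the convention):

SELECT *
FROM events
WHERE submission_date_s3 = '20190801'          -- illustrative date
  AND event_category = 'addonsManager.manage'  -- hypothetical dotted category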

Sending a Custom Ping

Got some new data you want to send to us? How in the world do you send a new ping? Follow this guide to find out.

Note: Most new data collection in Firefox via Telemetry or Glean does not require creating a new ping document type. To add a histogram, scalar, or event collection to Firefox, please see the documentation on adding a new probe.

Write Your Questions

Do not try to implement new pings unless you know specifically what questions you're trying to answer. General questions such as "How do users use our product?" won't cut it - these need to be specific, concrete asks that can be translated to data points. This will also make it easier down the line as you start data review.

More detail on how to design and implement new pings for Firefox Desktop can be found here.

Choose a Namespace and DocType

Choose a namespace that uniquely identifies the product that will be generating the data. The telemetry namespace is reserved for pings added by the Firefox Client Telemetry team.

The DocType is used to differentiate pings within a namespace. It can be as simple as event, but should generally be descriptive of the data being collected.

Both namespace and DocType are limited to the pattern [a-zA-Z-]. In other words, hyphens and letters from the ISO basic Latin alphabet.

Create a Schema

Write a JSON Schema. See the "Adding a new schema" documentation and example schemas in the Mozilla Pipeline Schemas repo. This schema is used to validate the incoming data; any ping that doesn't match the schema will be removed. The schema will also be transformed into a BigQuery table schema via the Mozilla Schema Generator. Note that parquet schemas are no longer necessary because of the generated schemas. Validate your JSON Schema using a validation tool.

Ensuring the ping contains a unique top-level id will enable document-level deduplication, which catches over 90% of duplicates and removes them from the dataset.

Start a Data Review

Data review for new pings is often more complicated than adding new probes. See Data Review for Focus-Event Ping as an example. Consider where the data falls under the Data Collection Categories.

Submit Schema to mozilla-services/mozilla-pipeline-schemas

Create a PR including a template and rendered schema to mozilla-pipeline-schemas. Add at least one validation ping that exercises the structure of the schema as a test. These pings are validated during the build and help catch mistakes during the writing process.

Example: A rendered schema for response times

We may want to collect a set of response measurements in milliseconds on a per-client basis. The pings take on the following shape:

{"id": "08317b11-85f7-4688-9b35-48af10c3ccdf", "clientId": "1d5ce2fc-a554-42f0-ab21-2ad8ada9bb88", "payload": {"response_ms": 324}}
{"id": "a97108ac-483b-40be-9c64-3419326f5113", "clientId": "3f1b2e1c-c241-464f-aa46-576f5795e488", "payload": {"response_ms": 221}}
{"id": "b8a7e3f9-38c0-4a13-b42a-c969feb454f6", "clientId": "14f27409-5f6f-46e0-9f9d-da5cd716ee42", "payload": {"response_ms": 549}}

This document can be described in the following way:

{
    "$schema": "http://json-schema.org/draft-04/schema#",
    "type": "object",
    "properties": {
        "id": {
            "type": "string",
            "description": "The document identifier"
        },
        "clientId": {
            "type": "string",
            "description": "The client identifier"
        },
        "payload": {
            "type": "object",
            "properties": {
                "response_ms": {
                    "type": "integer",
                    "minimum": 0,
                    "description": "Response time of the client, in milliseconds"
                }
            }
        }
    }
}

Fields like id and clientId have template components as part of the build system. These would be included as @TELEMETRY_ID_1_JSON@ and @TELEMETRY_CLIENTID_1_JSON@, respectively. The best way to become familiar with template schemas is to browse the repository; the telemetry/main/main.4.schema.json document is a good starting place.

As part of the automated deployment process, the JSON schemas are translated into a table schema used by BigQuery. These schemas closely reflect the schemas used for data validation.

[
  {
    "mode": "NULLABLE",
    "name": "clientId",
    "type": "STRING"
  },
  {
    "mode": "NULLABLE",
    "name": "id",
    "type": "STRING"
  },
  {
    "fields": [
      {
        "mode": "NULLABLE",
        "name": "response_ms",
        "type": "INT64"
      }
    ],
    "mode": "NULLABLE",
    "name": "payload",
    "type": "RECORD"
  }
]

Ingestion Metadata

The generated schemas contain metadata added to the schema before deployment to the ingestion service. These are fields added to the ping at ingestion time; they might come from the URL submitted to the edge server, or the IP Address used to make the request. This document lists available metadata fields for the telemetry-ingestion pings, which are largely shared across all namespaces.

A list of metadata fields is included here for reference, but refer to the above document or the schema explorer for an up-to-date list of metadata fields.

| field | description |
|---|---|
| additional_properties | A JSON string containing any payload properties not present in the schema |
| document_id | The document ID specified in the URI when the client sent this message |
| normalized_app_name | Set to "Other" if this message contained an unrecognized app name |
| normalized_channel | Set to "Other" if this message contained an unrecognized channel name |
| normalized_country_code | An ISO 3166-1 alpha-2 country code |
| normalized_os | Set to "Other" if this message contained an unrecognized OS name |
| normalized_os_version | N/A |
| sample_id | Hashed version of client_id (if present) useful for partitioning; ranges from 0 to 99 |
| submission_timestamp | Time when the ingestion edge server accepted this message |
| metadata.user_agent.browser | N/A |
| metadata.user_agent.os | N/A |
| metadata.user_agent.version | N/A |
| metadata.uri.app_build_id | N/A |
| metadata.uri.app_name | N/A |
| metadata.uri.app_update_channel | N/A |
| metadata.uri.app_version | N/A |
| metadata.header.date | Date HTTP header |
| metadata.header.dnt | DNT (Do Not Track) HTTP header |
| metadata.header.x_debug_id | X-Debug-Id HTTP header |
| metadata.header.x_pingsender_version | X-PingSender-Version HTTP header |
| metadata.geo.city | N/A |
| metadata.geo.country | An ISO 3166-1 alpha-2 country code |
| metadata.geo.db_version | The specific geo database version used for this lookup |
| metadata.geo.subdivision1 | First major country subdivision, typically a state, province, or county |
| metadata.geo.subdivision2 | Second major country subdivision; not applicable for most countries |
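
For example, here is a minimal sketch that selects a few of these metadata columns from the live main ping table referenced elsewhere in this document (adjust the table to your own namespace and doctype):

-- Sketch: inspect ingestion metadata on recently received main pings.
SELECT
  submission_timestamp,
  normalized_channel,
  normalized_country_code,
  metadata.geo.city,
  metadata.header.x_pingsender_version
FROM
  `moz-fx-data-shared-prod.telemetry_live.main_v4`
WHERE
  submission_timestamp > TIMESTAMP_SUB(current_timestamp, INTERVAL 30 minute)
LIMIT 10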

Testing The Schema

For new data, use the edge validator to test your schema.

Deployment

Schemas are automatically deployed once a day around 00:00 UTC, scheduled after the probe scraper in the following Airflow DAG. The latest schemas can be viewed at mozilla-pipeline-schemas/generated-schemas.

Start Sending Data

Use the built-in Telemetry APIs when possible. A few examples are the Gecko Telemetry APIs, or the iOS Telemetry APIs. Users on Android should use Glean, which does not require building out custom pings.

For all other use-cases, send documents to the ingestion endpoint:

https://incoming.telemetry.mozilla.org

See the HTTP edge server specification and the non-Telemetry example for documentation about the expected format.

Access Your Data

First confirm with the reviewers of your schema pull request that your schemas have been deployed. You may also check the diff of the latest commit to mozilla-pipeline-schemas/generated-schemas.

In the following links, replace <namespace>, <doctype>, and <docversion> with appropriate values. Also replace - with _ in <namespace> if your namespace contains - characters.

STMO / BigQuery

In the BigQuery (beta) data source, several new tables will be created for your data.

The first table is the live table found under moz-fx-data-shared-prod.<namespace>_live.<doctype>_v<docversion>. This table is updated on a 5 minute interval, partitioned on submission_timestamp, and may contain partial days of data.

SELECT
    count(*) as n_rows
FROM
  `moz-fx-data-shared-prod.telemetry_live.main_v4`
WHERE
  submission_timestamp > TIMESTAMP_SUB(current_timestamp, INTERVAL 30 minute)

The second is a view over the clustered table, found under moz-fx-data-shared-prod.<namespace>.<doctype>_v<docversion>. This view contains only complete days of submissions. The data is clustered by submission_timestamp and sample_id to improve the efficiency of queries.

SELECT
  COUNT(DISTINCT client_id)*100 AS dau
FROM
  `moz-fx-data-shared-prod.telemetry.main`
WHERE
  submission_timestamp > TIMESTAMP_SUB(current_timestamp, INTERVAL 1 day)
  AND sample_id = 1

This table may take up to a day to appear in the BigQuery source; if you still don't see a table for your new ping after 24 hours, contact Data Operations so that they can investigate. Once the table is available, it should contain all the pings sent during that first day, regardless of how long it takes for the table to appear.

Spark

Refer to the Spark FAQ for details on accessing this table via Spark.

Build Dashboards Using Spark or STMO

Last steps! What are you using this data for anyway?

Dataset Reference

After completing Choosing a Dataset you should have a high level understanding of what questions each dataset is able to answer. This section contains references that focus on a single dataset each. Reading this section front to back is not recommended. Instead, identify a dataset you'd like to understand better and read through the relevant documentation. After reading the tutorial, you should know all you need about the dataset.

Each tutorial should include:

  • Introduction
    • A short overview of why we built the dataset and what need it's meant to solve
    • What data source the data is collected from, and a high level overview of how the data is organized
    • How it is stored and how to access the data including
      • whether the data is available in re:dash
      • S3 paths
  • Reference
    • An example query to give the reader an idea of what the data looks like and how it is meant to be used
    • How the data is processed and sampled
    • How frequently it's updated, and how it's scheduled
    • An up-to-date schema for the dataset
    • How to augment or modify the dataset

Raw Ping Data

Introduction

We receive data from our users via pings. There are several types of pings, each containing different measurements and sent for different purposes. To review a complete list of ping types and their schemata, see this section of the Mozilla Source Tree Docs.

Many pings are also described by a JSONSchema specification which can be found in this repository.

There are a few pings that are central to delivering our core data collection primitives (Histograms, Events, Scalars) and for keeping an eye on Firefox behaviour (Environment, New Profiles, Updates, Crashes).

For instance, a user's first session in Firefox might have four pings like this:

Flowchart of pings in the user's first session

"main" ping

The "main" ping is the workhorse of the Firefox Telemetry system. It delivers the Telemetry Environment as well as Histograms and Scalars for all process types that collect data in Firefox. It has several variants each with specific delivery characteristics:

| Reason | Sent when | Notes |
|---|---|---|
| shutdown | Firefox session ends cleanly | Accounts for about 80% of all "main" pings. Sent by Pingsender immediately after Firefox shuts down, subject to conditions: Firefox 55+, if the OS isn't also shutting down, and if this isn't the client's first session. If Pingsender fails or isn't used, the ping is sent by Firefox at the beginning of the next Firefox session. |
| daily | It has been more than 24 hours since the last "main" ping, and it is around local midnight | In long-lived Firefox sessions we might go days without receiving a "shutdown" ping. Thus the "daily" ping is sent to ensure we occasionally hear from long-lived sessions. |
| environment-change | Telemetry Environment changes | Is sent immediately when triggered by Firefox (installing or removing an addon or changing a monitored user preference are common ways for the Telemetry Environment to change) |
| aborted-session | Firefox session doesn't end cleanly | Sent by Firefox at the beginning of the next Firefox session. |

It was introduced in Firefox 38.

"first-shutdown" ping

The "first-shutdown" ping is identical to the "main" ping with reason "shutdown" created at the end of the user's first session, but sent with a different ping type. This was introduced when we started using Pingsender to send shutdown pings as there would be a lot of first-session "shutdown" pings that we'd start receiving all of a sudden.

It is sent using Pingsender.

It was introduced in Firefox 57.

"event" ping

The "event" ping provides low-latency eventing support to Firefox Telemetry. It delivers the Telemetry Environment, Telemetry Events from all Firefox processes, and some diagnostic information about Event Telemetry. It is sent every hour if there have been events recorded, and up to once every 10 minutes (governed by a preference) if the maximum event limit for the ping (default to 1000 per process, governed by a preference) is reached before the hour is up.

It was introduced in Firefox 62.

"update" ping

Firefox Update is the most important means we have of reaching our users with the latest fixes and features. The "update" ping notifies us when an update is downloaded and ready to be applied (reason: "ready") and when the update has been successfully applied (reason: "success"). It contains the Telemetry Environment and information about the update.

It was introduced in Firefox 56.

"new-profile" ping

When a user starts up Firefox for the first time, a profile is created. Telemetry marks the occasion with the "new-profile" ping which sends the Telemetry Environment. It is sent either 30 minutes after Firefox starts running for the first time in this profile (reason: "startup") or at the end of the profile's first session (reason: "shutdown"), whichever comes first. "new-profile" pings are sent immediately when triggered. Those with reason "startup" are sent by Firefox. Those with reason "shutdown" are sent by Pingsender.

It was introduced in Firefox 55.

"crash" ping

The "crash" ping provides diagnostic information whenever a Firefox process exits abnormally. Unlike the "main" ping with reason "aborted-session", this ping does not contain Histograms or Scalars. It contains a Telemetry Environment, Crash Annotations, and Stack Traces.

It was introduced in Firefox 40.

"optout" ping

In the event a user opts out of Telemetry, we send one final "optout" ping to let us know. We try exactly once to send it, discarding the ping if sending fails. It contains only the common ping data and an empty payload.

It was introduced in Firefox 63.

Pingsender

Pingsender is a small application shipped with Firefox which attempts to send pings even if Firefox is not running. If Firefox has crashed or has already shut down we would otherwise have to wait for the next Firefox session to begin to send pings.

Pingsender was introduced in Firefox 54 to send "crash" pings. It was expanded to send "main" pings of reason "shutdown" in Firefox 55 (excepting the first session). It sends the "first-shutdown" ping since its introduction in Firefox 57.

Analysis

The large majority of analyses can be completed using only the main ping. This ping includes histograms, scalars, and other performance and diagnostic data.

Few analyses actually rely directly on any raw ping data. Instead, we provide derived datasets which are processed versions of these data, made to be:

  • Easier and faster to query
  • Organized to make the data easier to analyze
  • Cleaned of erroneous or misleading data

Before analyzing raw ping data, check to make sure there isn't already a derived dataset made for your purpose. If you do need to work with raw ping data, be aware that loading the data can take a while. Try to limit the size of your data by controlling the date range, etc.

Accessing the Data

Ping data lives in BigQuery and is accessible in re:dash; see our BigQuery intro. There is currently limited history for main pings available in BigQuery; an import of historical data is planned, but without a determined timeline, so longer history requires an ATMO cluster using the Dataset API.
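
For example, here is a minimal sketch of a query over the raw main ping table (referenced earlier in this document) that limits the scan to one week of data and a single sample_id:

-- Sketch: keep raw-ping queries small by restricting the date range and sample.
SELECT
  COUNT(*) AS n_pings
FROM
  `moz-fx-data-shared-prod.telemetry.main`
WHERE
  DATE(submission_timestamp) BETWEEN '2019-08-01' AND '2019-08-07'
  AND sample_id = 42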

Further Reading

You can find the complete ping documentation in the Mozilla Source Tree Docs. To augment our data collection, see Collecting New Data and the Data Collection Policy.

Data Reference

You can find the reference documentation for all ping types here.

Ping Metadata

The telemetry data pipeline appends metadata to arriving pings containing information about the ingestion environment, including timestamps; Geo-IP data about the client; and fields extracted from the ping or client headers that are useful for downstream processing.

The Dataset API represents this metadata by appending a meta key to the ping body. These fields may also be available as members of a metadata struct column in direct-to-parquet datasets.

Since the metadata are not present in the ping as it is sent by the client, these fields are documented here, instead of in the source tree docs.

As of September 28, 2018, members of the meta key on main pings include:

| Key | Description |
|---|---|
| appBuildId | |
| appName | e.g. "Firefox" |
| appUpdateChannel | Raw incoming update channel. E.g. nightly-cck-example |
| appVendor | e.g. "Mozilla" |
| appVersion | |
| clientId | |
| creationTimestamp | Client creationDate field, transformed to nanoseconds since epoch |
| Date | Client HTTP header reflecting the client time when the ping is sent, like Fri, 28 Sep 2018 14:01:57 GMT |
| DNT | Client "do not track" HTTP header. Not present in all pings |
| docType | e.g. "main" |
| documentId | A UUID identifying the ping, generated by the client |
| geoCity | from Geo-IP lookup of client IP; ?? if unknown |
| geoCountry | from Geo-IP lookup of client IP; ?? if unknown |
| geoSubdivision1 | from Geo-IP lookup of client IP; not present in all pings |
| geoSubdivision2 | from Geo-IP lookup of client IP; not present in all pings |
| Host | Public hostname of the ingestion endpoint |
| Hostname | Private hostname of the ingestion endpoint |
| normalizedChannel | Normalized update channel. E.g. "release" |
| normalizedOSVersion | |
| os | |
| reason | Documented in the source tree |
| sampleId | crc32(clientId) % 100 |
| sourceName | e.g. "telemetry" |
| sourceVersion | Client version field (reflecting the version of the ping format) |
| submissionDate | Server date (GMT) when ping was received, like 20180928; derived from Timestamp |
| telemetryEnabled | Extracted value of environment.settings.telemetryEnabled, describing whether opt-in telemetry is enabled |
| Timestamp | Server timestamp when the ping is received, expressed as nanoseconds since epoch |
| Type | e.g. "telemetry" |
| X-PingSender-Version | Present when a ping is sent with PingSender |

Derived Datasets

See Choosing a Dataset for a discussion on the differences between pings and derived datasets.

Intro

The active_profiles dataset gives client-level estimates of whether a profile is still an active user of the browser at a given point in time, as well as probabilistic forecasts of the client's future activity. These quantities are estimated by a model that attempts to infer and decouple a client's latent propensity to leave Firefox and become inactive, as well as their latent propensity to use the browser while still active. These estimates are currently generated for release desktop browser profiles only, across all operating systems and geographies.

Model

The model generates predictions for each client by looking at just the recency and frequency of the client's daily usage within the previous 90 day window. Usage is defined by a daily-level binary indicator: whether the client shows up in clients_daily on a given day.

The table contains columns related to these quantities:

  • submission_date: Day marking the end of the 90 day window. Earliest submission_date that the table covers is '2019-05-13'.
  • min_day: First day in the window that the client was seen. This could be anywhere between the first day in the window and the last day in the window.
  • max_day: Last day in the window the client was seen. The highest value this can be is submission_date.
  • recency: Age of client in days.
  • frequency: Number of days in the window that a client has returned to use the browser after min_day.
  • num_opportunities: Given a first appearance at min_day, the highest number of days a client could have returned; that is, the highest possible value for frequency.

Since the model is only using these 2 coarse-grained statistics, these columns should make it relatively straightforward to interpret why the model made the predictions that it did for a given profile.

Latent quantities

The model estimates the expected value for 2 related latent probability variables for a user. The values in prob_daily_leave give our expectation of the probability that they will become inactive on a given day, and prob_daily_usage represents the probability that a user will return on a given day, given that they are still active.

These quantities could be useful for disentangling usage rate from the likelihood that a user is still using the browser. We could, for example, identify intense users who are at risk of churning, or users who at first glance appear to have churned, but are actually just infrequent users.

prob_active is the expected value of the probability that a user is still active on submission_date, given their most recent 90 days' of activity. 'Inactive' in this sense means that the profile will not use the browser again, whether because they have uninstalled the browser or for some other reason.
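
As a sketch of the kind of segmentation this enables, the following query (assuming the table also exposes a client_id column) looks for clients with a high daily usage propensity whose probability of still being active is nevertheless low:

-- Sketch: heavy users the model suspects may have churned.
select
  client_id
  , prob_active
  , prob_daily_usage
  , prob_daily_leave
from `telemetry.active_profiles`
where submission_date = '2019-08-01'
      and sample_id = 1
      and prob_daily_usage > 0.5
      and prob_active < 0.2
order by prob_daily_usage desc
limit 100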

Predictions

There are several columns of the form e_total_days_in_next_7_days, which give the expected number of times that a user will show up in the next 7 days (or 14, 21, 28 days). These predictions take into account both the likelihood that a user will become inactive in the future, as well as their daily propensity to use the browser, given that they are still active. The values in e_total_days_in_next_7_days will be between 0 and 7.

An estimate for the probability that a client will contribute to MAU is available in the column prob_mau. This is simply the probability that the user will return at any point in the following 28 days, thereby contributing to MAU. Since it is a probability, the values will range between 0 and 1, just like prob_daily_leave and prob_daily_usage.

Attributes

There are several columns that contain attributes of the client, like os, locale, normalized_channel, normalized_os_version, and country. sample_id is also included, which can be useful for quicker queries, as the table is clustered by this column in BigQuery.

Remarks on the model

A way to think about the model that infers these quantities is to imagine a simple process where each client is given 2 weighted coins when they become users, which they flip each day. Since they're weighted, the probability of heads won't be 50%, but rather some probability between 0 and 100%, specific to each client's coin. One coin, called L, comes up heads with probability prob_daily_leave, and if it ever comes up heads, the client will never use the browser again. The daily usage coin, U, comes up heads prob_daily_usage% of the time. While they are still active, clients flip this coin to decide whether they will use the browser on that day, and show up in clients_daily.

The combination of these two coin flipping processes results in a history of activity that we can see in clients_daily. While the model is simple, it has very good predictive power that can tell, in aggregate, how many users will still be active at some point in the future. A downside of the model's simplicity, however, is that its predictions are not highly tailored to an individual client. The very simplified features do not take into account things like seasonality, or finer grained attributes of their usage (like active hours, addons, etc.). Further, the predictions in this table only account for existing users that have been seen in the 90 days of history, and so longer term forecasts of user activity would need to somehow model new users separately.

Caveats and future work

Due to the lightweight feature space of the model, the predictions perform better at the population level rather than the individual client level, and there will be a lot of client-level variation in behavior. That is, when grouping clients by different dimensions, say all of the en-IN users on Darwin, the average MAU prediction should be quite close, but a lot of users' behavior can deviate significantly from the predictions.

The model will also be better at medium- to longer-term forecasts. In particular, the model will not be well suited to give predictions for new users who have appeared only once in the data set very recently. These constitute a disproportionately large share of users, but do not have enough history for this model to make good use of. These single-day profiles are currently the subject of an investigation that will hopefully yield good heuristics for users that only show up for a single day.

Sample query

Here is a sample query that will give averages for predicted MAU, probability that users are still active, and other quantities across different operating systems:

select
  ap.os
  , cast(sum(ap.prob_mau) AS int64) as predicted_mau
  , count(*) as n
  , round(avg(ap.prob_active) * 100, 1) as prob_active
  , round(avg(ap.prob_daily_leave) * 100, 1) as prob_daily_leave
  , round(avg(ap.prob_daily_usage) * 100, 1) as prob_daily_usage
  , round(avg(ap.e_total_days_in_next_28_days), 1) as e_total_days_in_next_28_days
from `telemetry.active_profiles` ap
where submission_date = '2019-08-01'
      and sample_id = 1
group by 1
having count(*) > 100
order by avg(ap.prob_daily_usage) desc

Scheduling

The code behind the model can be found in the bgbb_lib repo, or on PyPI. The airflow job is defined in the bgbb_airflow repo.

The model to fit the parameters is run weekly, and the table is updated daily.

Addons Datasets

Introduction

This is a work in progress. The work is being tracked here.

Data Reference

Example Queries
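
Until this section is fleshed out, here is a minimal sketch of counting clients per add-on for a single day; it assumes the dataset is exposed in re:dash as addons, like the other parquet-backed datasets in this documentation:

-- Sketch: approximate client counts per add-on for one day (table name assumed).
SELECT addon_id,
       arbitrary(name) AS name,
       approx_distinct(client_id) AS client_count
FROM addons
WHERE submission_date_s3 = '20190601'
  AND addon_id IS NOT NULL
GROUP BY addon_id
ORDER BY client_count DESC
LIMIT 20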

Sampling

It contains one or more records for every Main Summary record that contains a non-null value for client_id. Each Addons record contains info for a single addon, or if the main ping did not contain any active addons, there will be a row with nulls for all the addon fields (to identify client_ids/records without any addons).

Like the Main Summary dataset, no attempt is made to de-duplicate submissions by documentId, so any analysis that could be affected by duplicate records should take care to remove duplicates using the documentId field.

Scheduling

This dataset is updated daily via the telemetry-airflow infrastructure. The job DAG runs every day after the Main Summary data has been generated. The DAG is here.

Schema

As of 2017-03-16, the current version of the addons dataset is v2, and has a schema as follows:

root
 |-- document_id: string (nullable = true)
 |-- client_id: string (nullable = true)
 |-- subsession_start_date: string (nullable = true)
 |-- normalized_channel: string (nullable = true)
 |-- addon_id: string (nullable = true)
 |-- blocklisted: boolean (nullable = true)
 |-- name: string (nullable = true)
 |-- user_disabled: boolean (nullable = true)
 |-- app_disabled: boolean (nullable = true)
 |-- version: string (nullable = true)
 |-- scope: integer (nullable = true)
 |-- type: string (nullable = true)
 |-- foreign_install: boolean (nullable = true)
 |-- has_binary_components: boolean (nullable = true)
 |-- install_day: integer (nullable = true)
 |-- update_day: integer (nullable = true)
 |-- signed_state: integer (nullable = true)
 |-- is_system: boolean (nullable = true)
 |-- submission_date_s3: string (nullable = true)
 |-- sample_id: string (nullable = true)

For more detail on where these fields come from in the raw data, please look in the AddonsView code.

The fields are all simple scalar values.

addons_daily Derived Dataset

Introduction

The addons_daily dataset serves as the central hub for all Firefox extension related questions. This includes questions regarding browser performance, user engagement, click through rates, etc. Each row in the table represents a unique add-on, and each column is a unique metric.

Contents

Prior to the construction of this dataset, extension-related data lived in several different sources. addons_daily combines metrics aggregated from several of those sources, including raw pings, telemetry data, and Google Analytics data. Note that the data is a 1% sample of Release Firefox, so metrics like DAU, WAU, etc. are approximate.

Accessing the Data

The data is stored as a parquet table in S3 at the following address: s3://telemetry-parquet/addons_daily/v1/

The addons_daily table is accessible through re:dash using the Athena data source. It is also available via the Presto data source, though Athena should be preferred for performance and stability reasons.

Data Reference

Example Queries

Query 1

Select average daily, weekly, and monthly active users, as well as the proportion of total daily active users, for all non-system add-ons.

SELECT addon_id,
       arbitrary(name) as name,
       avg(dau) as "Average DAU",
       avg(wau) as "Average WAU",
       avg(mau) as "Average MAU",
       avg(dau_prop) as "Average % of Total DAU"
FROM addons_daily
WHERE
  is_system = false
  and addon_id not like '%mozilla%'
GROUP BY 1
ORDER BY 3 DESC

This query can be seen and run in STMO here.

Query 2

Select average daily active users for the uBlock add-on for all dates.

SELECT DATE_PARSE(submission_date_s3, '%Y%m%d') as "Date",
       dau as "DAU"
FROM addons_daily
WHERE
    addon_id = 'uBlock0@raymondhill.net'

This query can be seen and run in STMO here.
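
Map-valued columns such as os_pct can be unpacked with UNNEST. Here is a minimal sketch (Presto/Athena syntax) of the OS distribution for the same add-on on a single, arbitrary day:

-- Sketch: expand the os_pct map into (os, pct) rows for one add-on and day.
SELECT os, pct
FROM addons_daily
CROSS JOIN UNNEST(os_pct) AS t (os, pct)
WHERE addon_id = 'uBlock0@raymondhill.net'
  AND submission_date_s3 = '20190601'
ORDER BY pct DESC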

Scheduling

This dataset is updated daily via the telemetry-airflow infrastructure. The job runs as part of the main_summary DAG.

Schema

The data is partitioned by submission_date_s3 which is formatted as %Y%m%d, like 20180130. As of 2019-06-05, the current version of the addons_daily dataset is v1, and has a schema as follows:

root
 |-- addon_id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- os_pct: map (nullable = true)
 |    |-- key: string
 |    |-- value: double (valueContainsNull = false)
 |-- country_pct: map (nullable = true)
 |    |-- key: string
 |    |-- value: double (valueContainsNull = false)
 |-- avg_time_total: double (nullable = true)
 |-- active_hours: double (nullable = true)
 |-- disabled: long (nullable = true)
 |-- avg_tabs: double (nullable = true)
 |-- avg_bookmarks: double (nullable = true)
 |-- avg_toolbox_opened_count: double (nullable = true)
 |-- avg_uri: double (nullable = true)
 |-- pct_w_tracking_prot_enabled: double (nullable = true)
 |-- mau: long (nullable = true)
 |-- wau: long (nullable = true)
 |-- dau: long (nullable = true)
 |-- dau_prop: double (nullable = true)
 |-- search_with_ads: map (nullable = true)
 |    |-- key: string
 |    |-- value: long (valueContainsNull = true)
 |-- ad_click: map (nullable = true)
 |    |-- key: string
 |    |-- value: long (valueContainsNull = true)
 |-- organic_searches: map (nullable = true)
 |    |-- key: string
 |    |-- value: long (valueContainsNull = true)
 |-- sap_searches: map (nullable = true)
 |    |-- key: string
 |    |-- value: long (valueContainsNull = true)
 |-- tagged_sap_searches: map (nullable = true)
 |    |-- key: string
 |    |-- value: long (valueContainsNull = true)
 |-- installs: map (nullable = true)
 |    |-- key: string
 |    |-- value: long (valueContainsNull = true)
 |-- download_times: map (nullable = true)
 |    |-- key: string
 |    |-- value: double (valueContainsNull = false)
 |-- uninstalls: map (nullable = true)
 |    |-- key: string
 |    |-- value: long (valueContainsNull = true)
 |-- is_system: boolean (nullable = true)
 |-- avg_webext_storage_local_get_ms_: double (nullable = true)
 |-- avg_webext_storage_local_set_ms_: double (nullable = true)
 |-- avg_webext_extension_startup_ms_: double (nullable = true)
 |-- top_10_coinstalls: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- avg_webext_background_page_load_ms_: double (nullable = true)
 |-- avg_webext_browseraction_popup_open_ms_: double (nullable = true)
 |-- avg_webext_pageaction_popup_open_ms_: double (nullable = true)
 |-- avg_webext_content_script_injection_ms_: double (nullable = true)

Code Reference

All code can be found here. Refer to this repository for information on how to run or augment the dataset.

Clients Daily

Introduction

The clients_daily table is intended as the first stop for asking questions about how people use Firefox. It should be easy to answer simple questions. Each row in the table is a (client_id, submission_date) and contains a number of aggregates about that day's activity.

Contents

Many questions about Firefox take the form "What did clients with characteristics X, Y, and Z do during the period S to E?" The clients_daily table is aimed at answering those questions.

Accessing the Data

The data is stored as a parquet table in S3 at the following address.

s3://telemetry-parquet/clients_daily/v6/

The clients_daily table is accessible through re:dash using the Athena data source. It is also available via the Presto data source, though Athena should be preferred for performance and stability reasons.

Here's an example query.

Data Reference

Example Queries

Compute Churn for a one-day cohort:

SELECT
  date_parse(submission_date_s3, '%Y%m%d') AS submission_date_s3,
  approx_distinct(client_id) AS cohort_dau
FROM clients_daily
WHERE
  submission_date_s3 > '20170831'
  AND submission_date_s3 < '20171001'
  AND profile_creation_date LIKE '2017-09-01%'
GROUP BY 1
ORDER BY 1

Distribution of pings per client per day:

SELECT
  normalized_channel,
  CASE
    WHEN pings_aggregated_by_this_row > 50 THEN 50
    ELSE pings_aggregated_by_this_row
  END AS pings_per_day,
  approx_distinct(client_id) AS client_count
FROM clients_daily
WHERE
  submission_date_s3 = '20170901'
  AND normalized_channel <> 'Other'
GROUP BY
  1,
  2
ORDER BY
  2,
  1

Scheduling

This dataset is updated daily via the telemetry-airflow infrastructure. The job runs as part of the main_summary DAG.

Schema

The data is partitioned by submission_date_s3 which is formatted as %Y%m%d, like 20180130.

As of 2018-11-01, the current version of the clients_daily dataset is v6, and has a schema as follows:

root
 |-- client_id: string (nullable = true)
 |-- aborts_content_sum: long (nullable = true)
 |-- aborts_gmplugin_sum: long (nullable = true)
 |-- aborts_plugin_sum: long (nullable = true)
 |-- active_addons: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- addon_id: string (nullable = true)
 |    |    |-- blocklisted: boolean (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- user_disabled: boolean (nullable = true)
 |    |    |-- app_disabled: boolean (nullable = true)
 |    |    |-- version: string (nullable = true)
 |    |    |-- scope: integer (nullable = true)
 |    |    |-- type: string (nullable = true)
 |    |    |-- foreign_install: boolean (nullable = true)
 |    |    |-- has_binary_components: boolean (nullable = true)
 |    |    |-- install_day: integer (nullable = true)
 |    |    |-- update_day: integer (nullable = true)
 |    |    |-- signed_state: integer (nullable = true)
 |    |    |-- is_system: boolean (nullable = true)
 |    |    |-- is_web_extension: boolean (nullable = true)
 |    |    |-- multiprocess_compatible: boolean (nullable = true)
 |-- active_addons_count_mean: double (nullable = true)
 |-- active_hours_sum: double (nullable = true)
 |-- addon_compatibility_check_enabled: boolean (nullable = true)
 |-- app_build_id: string (nullable = true)
 |-- app_display_version: string (nullable = true)
 |-- app_name: string (nullable = true)
 |-- app_version: string (nullable = true)
 |-- attribution: struct (nullable = true)
 |    |-- source: string (nullable = true)
 |    |-- medium: string (nullable = true)
 |    |-- campaign: string (nullable = true)
 |    |-- content: string (nullable = true)
 |-- blocklist_enabled: boolean (nullable = true)
 |-- channel: string (nullable = true)
 |-- city: string (nullable = true)
 |-- client_clock_skew_mean: double (nullable = true)
 |-- client_submission_latency_mean: double (nullable = true)
 |-- country: string (nullable = true)
 |-- cpu_cores: integer (nullable = true)
 |-- cpu_count: integer (nullable = true)
 |-- cpu_family: integer (nullable = true)
 |-- cpu_l2_cache_kb: integer (nullable = true)
 |-- cpu_l3_cache_kb: integer (nullable = true)
 |-- cpu_model: integer (nullable = true)
 |-- cpu_speed_mhz: integer (nullable = true)
 |-- cpu_stepping: integer (nullable = true)
 |-- cpu_vendor: string (nullable = true)
 |-- crashes_detected_content_sum: long (nullable = true)
 |-- crashes_detected_gmplugin_sum: long (nullable = true)
 |-- crashes_detected_plugin_sum: long (nullable = true)
 |-- crash_submit_attempt_content_sum: long (nullable = true)
 |-- crash_submit_attempt_main_sum: long (nullable = true)
 |-- crash_submit_attempt_plugin_sum: long (nullable = true)
 |-- crash_submit_success_content_sum: long (nullable = true)
 |-- crash_submit_success_main_sum: long (nullable = true)
 |-- crash_submit_success_plugin_sum: long (nullable = true)
 |-- default_search_engine: string (nullable = true)
 |-- default_search_engine_data_load_path: string (nullable = true)
 |-- default_search_engine_data_name: string (nullable = true)
 |-- default_search_engine_data_origin: string (nullable = true)
 |-- default_search_engine_data_submission_url: string (nullable = true)
 |-- devtools_toolbox_opened_count_sum: long (nullable = true)
 |-- distribution_id: string (nullable = true)
 |-- e10s_enabled: boolean (nullable = true)
 |-- env_build_arch: string (nullable = true)
 |-- env_build_id: string (nullable = true)
 |-- env_build_version: string (nullable = true)
 |-- experiments: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- first_paint_mean: double (nullable = true)
 |-- flash_version: string (nullable = true)
 |-- geo_subdivision1: string (nullable = true)
 |-- geo_subdivision2: string (nullable = true)
 |-- gfx_features_advanced_layers_status: string (nullable = true)
 |-- gfx_features_d2d_status: string (nullable = true)
 |-- gfx_features_d3d11_status: string (nullable = true)
 |-- gfx_features_gpu_process_status: string (nullable = true)
 |-- histogram_parent_devtools_aboutdebugging_opened_count_sum: long (nullable = true)
 |-- histogram_parent_devtools_animationinspector_opened_count_sum: long (nullable = true)
 |-- histogram_parent_devtools_browserconsole_opened_count_sum: long (nullable = true)
 |-- histogram_parent_devtools_canvasdebugger_opened_count_sum: long (nullable = true)
 |-- histogram_parent_devtools_computedview_opened_count_sum: long (nullable = true)
 |-- histogram_parent_devtools_custom_opened_count_sum: long (nullable = true)
 |-- histogram_parent_devtools_developertoolbar_opened_count_sum: long (nullable = true)
 |-- histogram_parent_devtools_dom_opened_count_sum: long (nullable = true)
 |-- histogram_parent_devtools_eyedropper_opened_count_sum: long (nullable = true)
 |-- histogram_parent_devtools_fontinspector_opened_count_sum: long (nullable = true)
 |-- histogram_parent_devtools_inspector_opened_count_sum: long (nullable = true)
 |-- histogram_parent_devtools_jsbrowserdebugger_opened_count_sum: long (nullable = true)
 |-- histogram_parent_devtools_jsdebugger_opened_count_sum: long (nullable = true)
 |-- histogram_parent_devtools_jsprofiler_opened_count_sum: long (nullable = true)
 |-- histogram_parent_devtools_layoutview_opened_count_sum: long (nullable = true)
 |-- histogram_parent_devtools_memory_opened_count_sum: long (nullable = true)
 |-- histogram_parent_devtools_menu_eyedropper_opened_count_sum: long (nullable = true)
 |-- histogram_parent_devtools_netmonitor_opened_count_sum: long (nullable = true)
 |-- histogram_parent_devtools_options_opened_count_sum: long (nullable = true)
 |-- histogram_parent_devtools_paintflashing_opened_count_sum: long (nullable = true)
 |-- histogram_parent_devtools_picker_eyedropper_opened_count_sum: long (nullable = true)
 |-- histogram_parent_devtools_responsive_opened_count_sum: long (nullable = true)
 |-- histogram_parent_devtools_ruleview_opened_count_sum: long (nullable = true)
 |-- histogram_parent_devtools_scratchpad_opened_count_sum: long (nullable = true)
 |-- histogram_parent_devtools_scratchpad_window_opened_count_sum: long (nullable = true)
 |-- histogram_parent_devtools_shadereditor_opened_count_sum: long (nullable = true)
 |-- histogram_parent_devtools_storage_opened_count_sum: long (nullable = true)
 |-- histogram_parent_devtools_styleeditor_opened_count_sum: long (nullable = true)
 |-- histogram_parent_devtools_webaudioeditor_opened_count_sum: long (nullable = true)
 |-- histogram_parent_devtools_webconsole_opened_count_sum: long (nullable = true)
 |-- histogram_parent_devtools_webide_opened_count_sum: long (nullable = true)
 |-- install_year: long (nullable = true)
 |-- is_default_browser: boolean (nullable = true)
 |-- is_wow64: boolean (nullable = true)
 |-- locale: string (nullable = true)
 |-- memory_mb: integer (nullable = true)
 |-- normalized_channel: string (nullable = true)
 |-- normalized_os_version: string (nullable = true)
 |-- os: string (nullable = true)
 |-- os_service_pack_major: long (nullable = true)
 |-- os_service_pack_minor: long (nullable = true)
 |-- os_version: string (nullable = true)
 |-- pings_aggregated_by_this_row: long (nullable = true)
 |-- places_bookmarks_count_mean: double (nullable = true)
 |-- places_pages_count_mean: double (nullable = true)
 |-- plugin_hangs_sum: long (nullable = true)
 |-- plugins_infobar_allow_sum: long (nullable = true)
 |-- plugins_infobar_block_sum: long (nullable = true)
 |-- plugins_infobar_shown_sum: long (nullable = true)
 |-- plugins_notification_shown_sum: long (nullable = true)
 |-- previous_build_id: string (nullable = true)
 |-- profile_age_in_days: integer (nullable = true)
 |-- profile_creation_date: string (nullable = true)
 |-- push_api_notify_sum: long (nullable = true)
 |-- sample_id: string (nullable = true)
 |-- sandbox_effective_content_process_level: integer (nullable = true)
 |-- scalar_combined_webrtc_nicer_stun_retransmits_sum: long (nullable = true)
 |-- scalar_combined_webrtc_nicer_turn_401s_sum: long (nullable = true)
 |-- scalar_combined_webrtc_nicer_turn_403s_sum: long (nullable = true)
 |-- scalar_combined_webrtc_nicer_turn_438s_sum: long (nullable = true)
 |-- scalar_content_navigator_storage_estimate_count_sum: long (nullable = true)
 |-- scalar_content_navigator_storage_persist_count_sum: long (nullable = true)
 |-- scalar_parent_aushelper_websense_reg_version: string (nullable = true)
 |-- scalar_parent_browser_engagement_max_concurrent_tab_count_max: integer (nullable = true)
 |-- scalar_parent_browser_engagement_max_concurrent_window_count_max: integer (nullable = true)
 |-- scalar_parent_browser_engagement_tab_open_event_count_sum: long (nullable = true)
 |-- scalar_parent_browser_engagement_total_uri_count_sum: long (nullable = true)
 |-- scalar_parent_browser_engagement_unfiltered_uri_count_sum: long (nullable = true)
 |-- scalar_parent_browser_engagement_unique_domains_count_max: integer (nullable = true)
 |-- scalar_parent_browser_engagement_unique_domains_count_mean: double (nullable = true)
 |-- scalar_parent_browser_engagement_window_open_event_count_sum: long (nullable = true)
 |-- scalar_parent_devtools_accessibility_node_inspected_count_sum: long (nullable = true)
 |-- scalar_parent_devtools_accessibility_opened_count_sum: long (nullable = true)
 |-- scalar_parent_devtools_accessibility_picker_used_count_sum: long (nullable = true)
 |-- scalar_parent_devtools_accessibility_select_accessible_for_node_sum: map (nullable = true)
 |    |-- key: string
 |    |-- value: integer (valueContainsNull = true)
 |-- scalar_parent_devtools_accessibility_service_enabled_count_sum: long (nullable = true)
 |-- scalar_parent_devtools_copy_full_css_selector_opened_sum: long (nullable = true)
 |-- scalar_parent_devtools_copy_unique_css_selector_opened_sum: long (nullable = true)
 |-- scalar_parent_devtools_toolbar_eyedropper_opened_sum: long (nullable = true)
 |-- scalar_parent_navigator_storage_estimate_count_sum: long (nullable = true)
 |-- scalar_parent_navigator_storage_persist_count_sum: long (nullable = true)
 |-- scalar_parent_storage_sync_api_usage_extensions_using_sum: long (nullable = true)
 |-- search_cohort: string (nullable = true)
 |-- search_count_all: long (nullable = true)
 |-- search_count_abouthome: long (nullable = true)
 |-- search_count_contextmenu: long (nullable = true)
 |-- search_count_newtab: long (nullable = true)
 |-- search_count_searchbar: long (nullable = true)
 |-- search_count_system: long (nullable = true)
 |-- search_count_urlbar: long (nullable = true)
 |-- session_restored_mean: double (nullable = true)
 |-- sessions_started_on_this_day: long (nullable = true)
 |-- shutdown_kill_sum: long (nullable = true)
 |-- subsession_hours_sum: decimal(37,6) (nullable = true)
 |-- ssl_handshake_result_failure_sum: long (nullable = true)
 |-- ssl_handshake_result_success_sum: long (nullable = true)
 |-- sync_configured: boolean (nullable = true)
 |-- sync_count_desktop_sum: long (nullable = true)
 |-- sync_count_mobile_sum: long (nullable = true)
 |-- telemetry_enabled: boolean (nullable = true)
 |-- timezone_offset: integer (nullable = true)
 |-- update_auto_download: boolean (nullable = true)
 |-- update_channel: string (nullable = true)
 |-- update_enabled: boolean (nullable = true)
 |-- vendor: string (nullable = true)
 |-- web_notification_shown_sum: long (nullable = true)
 |-- windows_build_number: long (nullable = true)
 |-- windows_ubr: long (nullable = true)
 |-- submission_date_s3: string (nullable = true)

Code Reference

This dataset is generated by telemetry-batch-view. Refer to this repository for information on how to run or augment the dataset.

Clients Last Seen Reference

Introduction

The clients_last_seen dataset is useful for efficiently determining exact user counts such as DAU and MAU.

It does not use approximations, unlike the HyperLogLog algorithm used in the client_count_daily dataset, and it includes the most recent values in a 28 day window for all columns in the clients_daily dataset.

This dataset should be used instead of client_count_daily.

Content

For each submission_date this dataset contains one row per client_id that appeared in clients_daily in a 28 day window including submission_date and preceding days.

The days_since_seen column indicates the difference between submission_date and the most recent submission_date in clients_daily where the client_id appeared. A client observed on the given submission_date will have days_since_seen = 0.

Other days_since_ columns use the most recent date in clients_daily where a certain condition was met. If the condition was not met for a client_id in a 28 day window NULL is used. For example days_since_visited_5_uri uses the condition scalar_parent_browser_engagement_total_uri_count_sum >= 5. These columns can be used for user counts where a condition must be met on any day in a window instead of using the most recent values for each client_id.

The rest of the columns use the most recent value in clients_daily where the client_id appeared.

Background and Caveats

User counts generated using days_since_seen only reflect the most recent values from clients_daily for each client_id in a 28 day window. This means Active MAU as defined cannot be efficiently calculated using days_since_seen because if a given client_id appeared every day in February and only on February 1st had scalar_parent_browser_engagement_total_uri_count_sum >= 5 then it would only be counted on the 1st, and not the 2nd-28th. Active MAU can be efficiently and correctly calculated using days_since_visited_5_uri.

MAU can be calculated over a GROUP BY submission_date[, ...] clause using COUNT(*), because there is exactly one row in the dataset for each client_id in the 28 day MAU window for each submission_date.

User counts generated using days_since_seen can use SUM to reduce groups, because a given client_id will only be in one group per submission_date. So if MAU were calculated by country and channel, then the sum of the MAU for each country would be the same as if MAU were calculated only by channel.
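
Putting these pieces together, here is a minimal sketch of computing MAU and Active MAU for a single date, using the days_since_visited_5_uri column described above:

-- Sketch: 28-day MAU (one row per client in the window) and Active MAU
-- (clients meeting the 5-URI condition on at least one day in the window).
SELECT
    submission_date,
    COUNT(*) AS mau,
    COUNTIF(days_since_visited_5_uri < 28) AS active_mau
FROM
    clients_last_seen
WHERE
    submission_date = '2019-03-01'
GROUP BY
    submission_date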

Accessing the Data

The data is available in Re:dash and BigQuery. Take a look at this full running example query in Re:dash.

Data Reference

Example Queries

Compute DAU for non-Windows clients for the last week

SELECT
    submission_date,
    os,
    COUNT(*) AS count
FROM
    clients_last_seen
WHERE
    submission_date >= DATE_SUB(CURRENT_DATE, INTERVAL 1 WEEK)
    AND days_since_seen = 0
GROUP BY
    submission_date,
    os
HAVING
    count > 10 -- remove outliers
    AND lower(os) NOT LIKE '%windows%'
ORDER BY
    os,
    submission_date DESC

Compute WAU by Channel for the last week

SELECT
    submission_date,
    normalized_channel,
    COUNT(*) AS count
FROM
    clients_last_seen
WHERE
    submission_date >= DATE_SUB(CURRENT_DATE, INTERVAL 1 WEEK)
    AND days_since_seen < 7
GROUP BY
    submission_date,
    normalized_channel
HAVING
    count > 10 -- remove outliers
ORDER BY
    normalized_channel,
    submission_date DESC

Scheduling

This dataset is updated daily via the telemetry-airflow infrastructure. The job runs as part of the main_summary DAG.

Schema

The data is partitioned by submission_date.

As of 2019-03-25, the current version of the clients_last_seen dataset is v1, and the schema is visible in the BigQuery console here.

Crash Summary Reference

Introduction

The crash_summary table is the most direct representation of a crash ping.

Contents

The crash_summary table contains one row for each crash ping. Each column represents one field from the crash ping payload, though only a subset of all crash ping fields are included.

Accessing the Data

The data is stored as a parquet table in S3 at the following address.

s3://telemetry-parquet/crash_summary/v1/

crash_summary is accessible through re:dash. Here's an example query.

Further Reading

The technical documentation for crash_summary is located in the telemetry-batch-view documentation.

The code responsible for generating this dataset is here.

Data Reference

Example Queries

Here is an example query to get the total number of main crashes by gfx_compositor:

select gfx_compositor, count(*)
from crash_summary
where application = 'Firefox'
and (payload.processType IS NULL OR payload.processType = 'main')
group by gfx_compositor

Sampling

CrashSummary contains one record for every crash ping submitted by Firefox.

Scheduling

This dataset is updated daily, shortly after midnight UTC. The job is scheduled on telemetry-airflow. The DAG is here.

Schema

root
 |-- client_id: string (nullable = true)
 |-- normalized_channel: string (nullable = true)
 |-- build_version: string (nullable = true)
 |-- build_id: string (nullable = true)
 |-- channel: string (nullable = true)
 |-- crash_time: string (nullable = true)
 |-- application: string (nullable = true)
 |-- os_name: string (nullable = true)
 |-- os_version: string (nullable = true)
 |-- architecture: string (nullable = true)
 |-- country: string (nullable = true)
 |-- experiment_id: string (nullable = true)
 |-- experiment_branch: string (nullable = true)
 |-- experiments: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- e10s_enabled: boolean (nullable = true)
 |-- gfx_compositor: string (nullable = true)
 |-- profile_created: integer (nullable = true)
 |-- payload: struct (nullable = true)
 |    |-- crashDate: string (nullable = true)
 |    |-- processType: string (nullable = true)
 |    |-- hasCrashEnvironment: boolean (nullable = true)
 |    |-- metadata: map (nullable = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)
 |    |-- version: integer (nullable = true)
 |-- submission_date: string (nullable = true)

For more detail on where these fields come from in the raw data, please look at the case classes in the CrashSummaryView code.

Cross Sectional Reference

This data set has been deprecated in favor of Clients Daily

Error Aggregates Reference

Introduction

The error_aggregates_v2 table represents counts of errors counted from main and crash pings, aggregated every 5 minutes. It is the dataset backing the main mission control view, but may also be queried independently.

Contents

The error_aggregates_v2 table contains counts of various error measures (for example: crashes, "the slow script dialog showing"), aggregated across each unique set of dimensions (for example: channel, operating system) every 5 minutes. You can get an aggregated count for any particular set of dimensions by summing using SQL.

Experiment unpacking

It's important to note that when this dataset is written, pings from clients participating in an experiment are aggregated on the experiment_id and experiment_branch dimensions corresponding to the experiment and branch they are participating in. However, they are also aggregated with the rest of the population, where the values of these dimensions are null. Therefore, care must be taken when writing aggregate queries over the whole population: in these cases, filter for experiment_id is null and experiment_branch is null so that pings from experiments are not double-counted.
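
Conversely, to look at a single experiment, filter on its experiment_id and group by experiment_branch. Here is a minimal sketch; the experiment id is a hypothetical placeholder:

-- Sketch: per-branch usage hours and main crashes for one experiment over 24 hours.
SELECT window_start,
       experiment_branch,
       sum(usage_hours) AS usage_hours,
       sum(main_crashes) AS main_crashes
FROM error_aggregates_v2
  WHERE experiment_id = 'pref-flip-example@mozilla.org'
  AND window_start > current_timestamp - (1 * interval '24' hour)
GROUP BY window_start, experiment_branch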

Accessing the data

You can access the data via re:dash. Choose Athena and then select the telemetry.error_aggregates_v2 table.

Further Reading

The code responsible for generating this dataset is here.

Data Reference

Example Queries

Getting a large number of different crash measures across many platforms and channels (view on Re:dash):

SELECT window_start,
       build_id,
       channel,
       os_name,
       version,
       sum(usage_hours) AS usage_hours,
       sum(main_crashes) AS main,
       sum(content_crashes) AS content,
       sum(gpu_crashes) AS gpu,
       sum(plugin_crashes) AS plugin,
       sum(gmplugin_crashes) AS gmplugin
FROM error_aggregates_v2
  WHERE application = 'Firefox'
  AND (os_name = 'Darwin' or os_name = 'Linux' or os_name = 'Windows_NT')
  AND (channel = 'beta' or channel = 'release' or channel = 'nightly' or channel = 'esr')
  AND build_id > '201801'
  AND window_start > current_timestamp - (1 * interval '24' hour)
  AND experiment_id IS NULL
  AND experiment_branch IS NULL
GROUP BY window_start, channel, build_id, version, os_name

Get the number of main_crashes on Windows over a small interval (view on Re:dash):

SELECT window_start as time, sum(main_crashes) AS main_crashes
FROM error_aggregates_v2
  WHERE application = 'Firefox'
  AND os_name = 'Windows_NT'
  AND channel = 'release'
  AND version = '58.0.2'
  AND window_start > timestamp '2018-02-21'
  AND window_end < timestamp '2018-02-22'
  AND experiment_id IS NULL
  AND experiment_branch IS NULL
GROUP BY window_start

Sampling

Data sources

The aggregates in this data source are derived from main, crash and core pings:

  • crash pings are used to count/gather main and content crash events; all other errors from desktop clients (including all other crashes) are gathered from main pings
  • core pings are used to count usage hours, first subsession and unique client counts.

Scheduling

The error_aggregates job is run continuously, using the Spark Streaming infrastructure.

Schema

The error_aggregates_v2 table has the following columns which define its dimensions:

  • window_start: beginning of interval when this sample was taken
  • window_end: end of interval when this sample was taken (will always be 5 minutes more than window_start for any given row)
  • submission_date_s3: the date pings were submitted for a particular aggregate
  • channel: the channel, like release or beta
  • version: the version e.g. 57.0.1
  • display_version: like version, but includes beta number if applicable e.g. 57.0.1b4
  • build_id: the YYYYMMDDhhmmss timestamp the program was built, like 20160123180541. This is also known as the build ID or buildid
  • application: application name (e.g. Firefox or Fennec)
  • os_name: name of the OS (e.g. Darwin or Windows_NT)
  • os_version: version of the OS
  • architecture: build architecture, e.g. x86
  • country: country code for the user (determined using geoIP), like US or UK
  • experiment_id: identifier of the experiment being participated in, such as e10s-beta46-noapz@experiments.mozilla.org, null if no experiment or for unpacked rows (see Experiment unpacking)
  • experiment_branch: the branch of the experiment being participated in, such as control or experiment, null if no experiment or for unpacked rows (see Experiment unpacking)

And these are the various measures we are counting:

  • usage_hours: number of usage hours (i.e. total number of session hours reported by the pings in this aggregate, note that this might include time where people are not actively using the browser or their computer is asleep)
  • count: number of pings processed in this aggregate
  • main_crashes: number of main process crashes (or just program crashes, in the non-e10s case)
  • startup_crashes : number of startup crashes
  • content_crashes: number of content process crashes (version >= 58 only)
  • gpu_crashes: number of GPU process crashes
  • plugin_crashes: number of plugin process crashes
  • gmplugin_crashes: number of Gecko media plugin (often abbreviated GMPlugin) process crashes
  • content_shutdown_crashes: number of content process crashes that were caused by failure to shut down in a timely manner (version >= 58 only)
  • browser_shim_usage_blocked: number of times a CPOW shim was blocked from being created by browser code
  • permissions_sql_corrupted: number of times the permissions SQL error occurred (beta/nightly only)
  • defective_permissions_sql_removed: number of times there was a removal of defective permissions.sqlite (beta/nightly only)
  • slow_script_notice_count: number of times the slow script notice count was shown (beta/nightly only)
  • slow_script_page_count: number of pages that trigger slow script notices (beta/nightly only)

Events

Introduction

This is a work in progress. The work is being tracked here.

Data Reference

Example Queries

This is a work in progress. The work is being tracked here.
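
In the meantime, here is a minimal sketch of counting events by category, method, and object for a single day; it assumes the dataset is exposed as events with the schema shown below:

-- Sketch: daily event volume broken down by category, method, and object.
SELECT event_category,
       event_method,
       event_object,
       count(*) AS event_count
FROM events
WHERE submission_date_s3 = '20180901'
GROUP BY 1, 2, 3
ORDER BY event_count DESC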

Sampling

The events dataset contains one row for each event in a main ping. This dataset is derived from main_summary so any of main_summary's filters affect this dataset as well.

Data is currently available from 2017-01-05 on.

Scheduling

The events dataset is updated daily, shortly after main_summary is updated. The job is scheduled on Airflow. The DAG is here.

Firefox events

Firefox has an API to record events, which are then submitted through the main ping. The format and mechanism of event collection in Firefox is documented here.

The full events data pipeline is documented here.

Schema

As of 2017-01-26, the current version of the events dataset is v1, and has a schema as follows:

root
 |-- document_id: string (nullable = true)
 |-- client_id: string (nullable = true)
 |-- normalized_channel: string (nullable = true)
 |-- country: string (nullable = true)
 |-- locale: string (nullable = true)
 |-- app_name: string (nullable = true)
 |-- app_version: string (nullable = true)
 |-- os: string (nullable = true)
 |-- os_version: string (nullable = true)
 |-- subsession_start_date: string (nullable = true)
 |-- subsession_length: long (nullable = true)
 |-- sync_configured: boolean (nullable = true)
 |-- sync_count_desktop: integer (nullable = true)
 |-- sync_count_mobile: integer (nullable = true)
 |-- timestamp: long (nullable = true)
 |-- sample_id: string (nullable = true)
 |-- event_timestamp: long (nullable = false)
 |-- event_category: string (nullable = false)
 |-- event_method: string (nullable = false)
 |-- event_object: string (nullable = false)
 |-- event_string_value: string (nullable = true)
 |-- event_map_values: map (nullable = true)
 |    |-- key: string
 |    |-- value: string
 |-- submission_date_s3: string (nullable = true)
 |-- doc_type: string (nullable = true)

Exact MAU Data

Introduction

This article introduces the usage of and methodology behind the "exact MAU" tables in BigQuery:

  • firefox_desktop_exact_mau28_by_dimensions_v1,
  • firefox_nondesktop_exact_mau28_by_dimensions_v1, and
  • firefox_accounts_exact_mau28_by_dimensions_v1.

The calculation of MAU (monthly active users) has historically been fraught with troubling details around exact definitions and computational limitations, leading to disagreements between analyses. These tables contain pre-computed MAU, WAU, and DAU aggregates for various usage criteria and dimensions, allowing efficient calculation of aggregates across arbitrary slices of those dimensions. The tables follow a consistent methodology which is intended as a standard across Mozilla for MAU analysis going forward.

Conceptual Model

Metric

A metric is anything we want to (and can) measure. In order for a metric to be calculated, a usage criterion and a slice must be specified. The metric will produce a single value per day, summarizing data:

  • for one or more days (i.e. the metric value for a particular day may depend on data from other days as well)
  • for all users (whatever notion of user makes sense for the data, generally profiles) in a particular sub-population
  • where the sub-population will include users that meet the specified usage criteria and are in the specified slice.

A simple usage criterion is "All Desktop Activity", which includes all Firefox Desktop users that we have any data (telemetry ping) for on the day in question. The simplest slice is "All" which places no restrictions on the sub-population.

For example, the metric "Daily Active Users (DAU)" with usage criteria "All Desktop Activity" and slice "All" involves summarizing data on all users of Firefox Desktop over a single day.

Usage Criteria

Active user counts must always be calculated in reference to some specific usage criterion, a binary condition we use to determine whether a given user should be considered "active" in a given product or feature. It may be something simple like "All Desktop Activity" (as above) or, similarly, "All Mobile Activity". It may also be something more specific like "Desktop Visited 5 URI" corresponding to calculation of aDAU.

Distinct usage criteria correspond to distinct *_mau columns in the Exact MAU tables.

Slice

A slice defines the sub-population on which we can calculate a metric and is specified by setting restrictions in different dimensions. Examples of dimensions include: "Country", "Attribution Source", and "Firefox Version". Thus an example slice may be "Country = US; Firefox Version = 60|61", which restricts to profiles that report usage in the US on Firefox versions 60 or 61. There is implicitly no restriction on any other dimensions. Thus, the empty slice - "All" - is also a valid slice and simply places no restrictions on any dimension. Note there are some complexities here:

  • Firstly, a dimension may be scalar and need to be suitably bucketed (instead of every possible profile age being a unique slice element, maybe we prefer to group users between 12 and 16 months old into a single slice element); likewise we may need to use normalized versions of string fields
  • Secondly, we require that dimensions be non-overlapping, especially for metrics calculated over multiple days of user activity. In a given day, a profile may be active in multiple countries, but we aggregate that to a single value by taking the most frequent value seen in that day, breaking ties by taking the value that occurs last. In a given month, the assigned country may change from day to day; we use the value from the most recent active day up to the day we're calculating usage for.

A slice is expressed as a set of conditions in WHERE or GROUP BY clauses when querying Exact MAU tables.

Using the Tables

The various Exact MAU datasets are computed daily from the *_last_seen tables (see clients_last_seen) and contain pre-computed DAU, WAU, and MAU counts per usage criterion for each unique combination of dimension values. Because of our restriction that dimension values be non-overlapping, we can recover MAU for a particular slice of the data by summing over all rows matching the slice definition.

The simple case of retrieving MAU for usage criterion "All Desktop Activity" and slice "All" looks like:

SELECT
    submission_date,
    SUM(mau) AS mau
FROM
    `moz-fx-data-derived-datasets.telemetry.firefox_desktop_exact_mau28_by_dimensions_v1`
GROUP BY
    submission_date
ORDER BY
    submission_date

Now, let's refine our slice to "Country = US; Campaign = whatsnew" via a WHERE clause:

SELECT
    submission_date,
    SUM(mau) AS mau
FROM
    `moz-fx-data-derived-datasets.telemetry.firefox_desktop_exact_mau28_by_dimensions_v1`
WHERE
    country = 'US'
    AND campaign = 'whatsnew'
GROUP BY
    submission_date
ORDER BY
    submission_date

Perhaps we want to compare MAU as above to aDAU over the same slice. The column visited_5_uri_dau gives DAU as calculated with the "Desktop Visited 5 URI" usage criterion, corresponding to aDAU:

SELECT
    submission_date,
    SUM(mau) AS mau,
    SUM(visited_5_uri_dau) AS adau
FROM
    `moz-fx-data-derived-datasets.telemetry.firefox_desktop_exact_mau28_by_dimensions_v1`
WHERE
    country = 'US'
    AND campaign = 'whatsnew'
GROUP BY
    submission_date
ORDER BY
    submission_date

Additional usage criteria may be added in the future as new columns named *_mau, *_wau, *_dau, etc., where the prefix describes the usage criterion.

For convenience and clarity, we make the exact data presented in the 2019 Key Performance Indicator Dashboard available as views that do not require any aggregation:

  • firefox_desktop_exact_mau28_v1,
  • firefox_nondesktop_exact_mau28_v1, and
  • firefox_accounts_exact_mau28_v1.

An example query for desktop:

SELECT
    submission_date,
    mau,
    tier1_mau
FROM
    `moz-fx-data-derived-datasets.telemetry.firefox_desktop_exact_mau28_v1`

These views contain no dimensions and abstract away the detail that FxA data uses the "Last Seen in Tier 1 Country" usage criterion while desktop and non-desktop data use the "Country" dimension to determine tier 1 membership.

Additional Details

Inclusive Tier 1 Calculation for FxA

The 2019 Key Performance Indicator definition for Relationships relies on a MAU calculation restricted to a specific set of "Tier 1" countries. In the Exact MAU datasets, country is a dimension that would normally be specified in a slice definition. Indeed, for desktop and non-desktop clients, the definition of "Tier 1 MAU" looks like:

SELECT
    submission_date,
    SUM(mau) AS mau
FROM
    `moz-fx-data-derived-datasets.telemetry.firefox_desktop_exact_mau28_by_dimensions_v1`
WHERE
    country IN ('US', 'UK', 'DE', 'FR', 'CA')
GROUP BY
    submission_date
ORDER BY
    submission_date

Remember that our non-overlapping dimensions methodology means that the filter in the query above considers only the country value from the most recent daily aggregation, so a user that appeared in one of the specified countries early in the month but then changed location to a non-tier 1 country would not count toward MAU.

Due to the methodology used when forecasting goal values for the year, however, we need to follow a more inclusive definition for "Tier 1 FxA MAU" where a user counts if they register even a single FxA event originating from a tier 1 country in the 28 day MAU window. That calculation requires a separate "FxA Seen in Tier 1 Country" criterion and is represented in the exact MAU table as seen_in_tier1_country_mau:

SELECT
    submission_date,
    SUM(seen_in_tier1_country_mau) AS tier1_mau
FROM
    `moz-fx-data-derived-datasets.telemetry.firefox_accounts_exact_mau28_by_dimensions_v1`
GROUP BY
    submission_date
ORDER BY
    submission_date

Confidence Intervals

The Exact MAU tables enable tracking of MAU for potentially very small subpopulations of users where statistical variation can often overwhelm real trends in the data. In order to support statistical inference (confidence intervals and hypothesis tests), these tables include a "pseudo-dimension" we call id_bucket. We assign each client (or user, in the case of FxA data) to one of 20 buckets based on a hash of their client_id (or user_id), with the effect that each user is randomly assigned to one and only one bucket. If we sum MAU numbers for each bucket individually, we can use resampling techniques to determine the magnitude of variation and assign a confidence interval to our sums.

As an example of calculating confidence intervals, see the Desktop MAU KPI query in STMO which uses a jackknife resampling technique implemented as a BigQuery UDF.
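As a minimal sketch of the per-bucket sums that feed such a resampling calculation (usage criterion "All Desktop Activity", slice "All"), group on id_bucket in addition to the date:

SELECT
    submission_date,
    id_bucket,
    SUM(mau) AS mau
FROM
    `moz-fx-data-derived-datasets.telemetry.firefox_desktop_exact_mau28_by_dimensions_v1`
GROUP BY
    submission_date,
    id_bucket
ORDER BY
    submission_date,
    id_bucket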

Data Reference

Scheduling

These tables are updated daily via the parquet to BigQuery infrastructure in the following DAGs:

Schema

The data is partitioned by submission_date.

As of 2019-03-29, the current version for all Exact MAU tables is v1, and the schemas are visible via the telemetry dataset in the BigQuery console.

First Shutdown Summary

Introduction

The first_shutdown_summary table is a summary of the first-shutdown ping.

Contents

The first shutdown ping contains first session usage data. The dataset has rows similar to the telemetry_new_profile_parquet, but in the shape of main_summary.

Background and Caveats

Ping latency was reduced through the shutdown ping-sender mechanism in Firefox 55. To maintain consistent historical behavior, the first main ping is not sent until the second start up. In Firefox 57, a separate first-shutdown ping was created to evaluate first-shutdown behavior while maintaining backwards compatibility.

In many cases, the first-shutdown ping is a duplicate of the main ping. The first-shutdown summary can be used in conjunction with the main summary by taking the union and deduplicating on the document_id.
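A minimal sketch of that approach, assuming both tables expose the same main_summary columns (the date and sample filters are illustrative):

WITH combined AS (
    SELECT * FROM main_summary
    WHERE submission_date_s3 = '20180101' AND sample_id = '42'
    UNION ALL
    SELECT * FROM first_shutdown_summary
    WHERE submission_date_s3 = '20180101' AND sample_id = '42'
)
SELECT *
FROM (
    SELECT
        combined.*,
        ROW_NUMBER() OVER (PARTITION BY document_id ORDER BY timestamp) AS row_num
    FROM combined
) deduped
WHERE row_num = 1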

Accessing the Data

The data can be accessed as first_shutdown_summary. It is currently stored in the following path.

s3://telemetry-parquet/first_shutdown_summary/v4/

The data is backfilled to 2017-09-22, the date of its first nightly appearance. This data should be available to all releases on and after Firefox 57.

Code Reference

This dataset is generated by telemetry-batch-view.

Longitudinal Reference

Introduction

The longitudinal dataset is a 1% sample of main ping data organized so that each row corresponds to a client_id. If you're not sure which dataset to use for your analysis, this is probably what you want.

Contents

Each row in the longitudinal dataset represents one client_id, which is approximately a user. Each column represents a field from the main ping. Most fields contain arrays of values, with one value for each ping associated with a client_id. Using arrays gives you access to the raw data from each ping, but can be difficult to work with from SQL. Here's a query showing some sample data to help illustrate.

Background and Caveats

Think of the longitudinal table as wide and short. The dataset contains more columns than main_summary and down-samples to 1% of all clients to reduce query computation time and save resources.

In summary, the longitudinal table differs from main_summary in two important ways:

  • The longitudinal dataset groups all data so that one row represents a client_id
  • The longitudinal dataset samples to 1% of all client_ids

Please note that this dataset only contains release (or opt-out) histograms and scalars.

Accessing the Data

The longitudinal dataset is available in re:dash, though it can be difficult to work with the array values in SQL. Take a look at this example query.

The data is stored as a parquet table in S3 at the following address.

s3://telemetry-parquet/longitudinal/

Data Reference

Sampling

Pings Within Last 6 Months

The longitudinal dataset filters to main pings submitted within the last 6 months.

1% Sample

The longitudinal dataset samples down to 1% of all clients in the above sample. The sample is generated by the following process:

  • hash the client_id for each ping from the last 6 months.
  • project that hash onto an integer from 1 to 100, inclusive
  • filter to pings with client_ids matching a 'magic number' (in this case 42)

This process has a couple of nice properties:

  • The sample is consistent over time. The longitudinal dataset is regenerated weekly. The clients included in each run are very similar with this process. The only change will come from never-before-seen clients, or clients without a ping in the last 180 days.
  • We don't need to adjust the sample as new clients enter or exit our pool.

More practically, the sample is created by filtering to pings with main_summary.sample_id == 42. If you're working with main_summary, you can recreate this sample by doing this filter manually.
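For example, a minimal sketch of the equivalent filter on main_summary (the date range and selected columns are illustrative):

SELECT
    client_id,
    subsession_start_date,
    subsession_length
FROM main_summary
WHERE sample_id = '42'
    AND submission_date_s3 BETWEEN '20190101' AND '20190107'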

Scheduling

The longitudinal job is run weekly, early on Sunday morning UTC. The job is scheduled on Airflow. The DAG is here.

Schema

TODO(harter): https://bugzilla.mozilla.org/show_bug.cgi?id=1361862

Code Reference

This dataset is generated by telemetry-batch-view. Refer to this repository for information on how to run or augment the dataset.

Main Summary

Introduction

The main_summary table is the most direct representation of a main ping but can be difficult to work with due to its size. Prefer the clients_daily dataset unless it doesn't aggregate the measurements you're interested in.

Contents

The main_summary table contains one row for each ping. Each column represents one field from the main ping payload, though only a subset of all main ping fields are included. This dataset does not include most histograms.

Background and Caveats

This table is massive, and due to its size, it can be difficult to work with. You should avoid querying main_summary from re:dash. Your queries will be slow to complete and can impact performance for other users, since re:dash runs on a shared cluster.

Instead, we recommend using the longitudinal or clients_daily dataset where possible. If these datasets do not suffice, consider using Spark on Databricks. In the odd case where these queries are necessary, make use of the sample_id field and limit to a short submission date range.

Accessing the Data

The data is stored as a parquet table in S3 at the following address.

s3://telemetry-parquet/main_summary/v4/

Though not recommended, main_summary is accessible through re:dash; here's an example query. As noted above, queries will be slow to complete and can impact performance for other users, since re:dash runs on a shared cluster.

Further Reading

The technical documentation for main_summary is located in the telemetry-batch-view documentation.

The code responsible for generating this dataset is here.

Adding New Fields

We support a few basic types that can be easily added to main_summary.

Non-addon scalars are automatically added to main_summary.

User Preferences

These are added in the userPrefsList, near the top of the Main Summary file. They must be available in the ping environment to be included here. There is more information in the file itself.

Once added, they will show as top-level fields, with the string user_pref prepended. For example, IntegerUserPref("dom.ipc.processCount") becomes user_pref_dom_ipc_processcount.
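As a hypothetical sketch using the example column above (the date and sample filters are illustrative), the resulting field can be queried like any other main_summary column:

SELECT
    user_pref_dom_ipc_processcount AS process_count,
    COUNT(*) AS pings
FROM main_summary
WHERE submission_date_s3 = '20190401'
    AND sample_id = '42'
GROUP BY 1
ORDER BY 1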

Histograms

Histograms can be added to the histogramsWhitelist near the top of the Main Summary file. Add the name of the histogram in its alphabetically-sorted position in the list.

Each process a histogram is recorded in will have a column in main_summary, with the string histogram_ prepended. For example, CYCLE_COLLECTOR_MAX_PAUSE is recorded in the parent, content, and gpu processes (according to the definition). It will then result in three columns:

  • histogram_parent_cycle_collector_max_pause
  • histogram_content_cycle_collector_max_pause
  • histogram_gpu_cycle_collector_max_pause
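
These histogram columns are key->value maps (histogram bucket to count). As a hypothetical Presto sketch, the buckets of one such column can be unnested and summed (the date and sample filters are illustrative):

SELECT
    t.bucket,
    SUM(t.count) AS total_count
FROM main_summary
CROSS JOIN UNNEST(histogram_parent_cycle_collector_max_pause) AS t (bucket, count)
WHERE submission_date_s3 = '20190401'
    AND sample_id = '42'
GROUP BY t.bucket
ORDER BY t.bucket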

Addon Scalars

Addon scalars are recorded by an addon. To include one of these, add the definition to the addon scalars definition file in telemetry-batch-view. Be sure to include the section:

    record_in_processes:
      - 'dynamic'

The addon scalars can then be found in the associated column, depending on their type:

  • string_addon_scalars
  • keyed_string_addon_scalars
  • uint_addon_scalars
  • keyed_uint_addon_scalars
  • boolean_addon_scalars
  • keyed_boolean_addon_scalars

These columns are all maps. Each addon scalar will be a key within that map, concatenating the top-level subsection within Scalars.yaml with its name to get the key. As an example, consider the following scalar definition:

test:
  misunderestimated_nucular:
    description: A test scalar, no soup for you!
    expires: never
    kind: string
    keyed: true
    notification_emails:
      - frank@mozilla.com
    record_in_processes:
      - 'dynamic'

For example, you could find the addon scalar test.misunderestimated_nucular, a keyed string scalar, using keyed_string_addon_scalars['test_misunderestimated_nucular']. In general, use element_at, which returns NULL when the key is not found: element_at(keyed_string_addon_scalars, 'test_misunderestimated_nucular')

Other Fields

We can include other types of fields as well, for example when a specific transformation needs to be applied. The data does need to be available in the main ping.

Data Reference

Example Queries

We recommend working with this dataset via Spark rather than sql.t.m.o. Due to the large number of records, queries can consume a lot of resources on the shared cluster and impact other users. Queries via sql.t.m.o should limit to a short submission_date_s3 range, and ideally make use of the sample_id field.

When using Presto to query the data from sql.t.m.o, you can use the UNNEST feature to access items in the search_counts, popup_notification_stats and active_addons fields.

For example, to compare the search volume for different search source values, you could use:

WITH search_data AS (
  SELECT
    s.source AS search_source,
    s.count AS search_count
  FROM
    main_summary
    CROSS JOIN UNNEST(search_counts) AS t(s)
  WHERE
    submission_date_s3 = '20160510'
    AND sample_id = '42'
    AND search_counts IS NOT NULL
)

SELECT
  search_source,
  sum(search_count) as total_searches
FROM search_data
GROUP BY search_source
ORDER BY sum(search_count) DESC

Sampling

The main_summary dataset contains one record for each main ping as long as the record contains a non-null value for documentId, submissionDate, and Timestamp. We do not ever expect nulls for these fields.

Scheduling

This dataset is updated daily via the telemetry-airflow infrastructure. The job DAG runs every day shortly after midnight UTC. You can find the job definition here.

Schema

As of 2017-12-03, the current version of the main_summary dataset is v4, and has a schema as follows:

root
 |-- document_id: string (nullable = false)
 |-- client_id: string (nullable = true)
 |-- channel: string (nullable = true)
 |-- normalized_channel: string (nullable = true)
 |-- normalized_os_version: string (nullable = true)
 |-- country: string (nullable = true)
 |-- city: string (nullable = true)
 |-- geo_subdivision1: string (nullable = true)
 |-- geo_subdivision2: string (nullable = true)
 |-- os: string (nullable = true)
 |-- os_version: string (nullable = true)
 |-- os_service_pack_major: long (nullable = true)
 |-- os_service_pack_minor: long (nullable = true)
 |-- windows_build_number: long (nullable = true)
 |-- windows_ubr: long (nullable = true)
 |-- install_year: long (nullable = true)
 |-- is_wow64: boolean (nullable = true)
 |-- memory_mb: integer (nullable = true)
 |-- cpu_count: integer (nullable = true)
 |-- cpu_cores: integer (nullable = true)
 |-- cpu_vendor: string (nullable = true)
 |-- cpu_family: integer (nullable = true)
 |-- cpu_model: integer (nullable = true)
 |-- cpu_stepping: integer (nullable = true)
 |-- cpu_l2_cache_kb: integer (nullable = true)
 |-- cpu_l3_cache_kb: integer (nullable = true)
 |-- cpu_speed_mhz: integer (nullable = true)
 |-- gfx_features_d3d11_status: string (nullable = true)
 |-- gfx_features_d2d_status: string (nullable = true)
 |-- gfx_features_gpu_process_status: string (nullable = true)
 |-- gfx_features_advanced_layers_status: string (nullable = true)
 |-- apple_model_id: string (nullable = true)
 |-- antivirus: array (nullable = true)
 |    |-- element: string (containsNull = false)
 |-- antispyware: array (nullable = true)
 |    |-- element: string (containsNull = false)
 |-- firewall: array (nullable = true)
 |    |-- element: string (containsNull = false)
 |-- profile_creation_date: long (nullable = true)
 |-- profile_reset_date: long (nullable = true)
 |-- previous_build_id: string (nullable = true)
 |-- session_id: string (nullable = true)
 |-- subsession_id: string (nullable = true)
 |-- previous_session_id: string (nullable = true)
 |-- previous_subsession_id: string (nullable = true)
 |-- session_start_date: string (nullable = true)
 |-- subsession_start_date: string (nullable = true)
 |-- session_length: long (nullable = true)
 |-- subsession_length: long (nullable = true)
 |-- subsession_counter: integer (nullable = true)
 |-- profile_subsession_counter: integer (nullable = true)
 |-- creation_date: string (nullable = true)
 |-- distribution_id: string (nullable = true)
 |-- submission_date: string (nullable = false)
 |-- sync_configured: boolean (nullable = true)
 |-- sync_count_desktop: integer (nullable = true)
 |-- sync_count_mobile: integer (nullable = true)
 |-- app_build_id: string (nullable = true)
 |-- app_display_version: string (nullable = true)
 |-- app_name: string (nullable = true)
 |-- app_version: string (nullable = true)
 |-- timestamp: long (nullable = false)
 |-- env_build_id: string (nullable = true)
 |-- env_build_version: string (nullable = true)
 |-- env_build_arch: string (nullable = true)
 |-- e10s_enabled: boolean (nullable = true)
 |-- e10s_multi_processes: long (nullable = true)
 |-- locale: string (nullable = true)
 |-- update_channel: string (nullable = true)
 |-- update_enabled: boolean (nullable = true)
 |-- update_auto_download: boolean (nullable = true)
 |-- attribution: struct (nullable = true)
 |    |-- source: string (nullable = true)
 |    |-- medium: string (nullable = true)
 |    |-- campaign: string (nullable = true)
 |    |-- content: string (nullable = true)
 |-- sandbox_effective_content_process_level: integer (nullable = true)
 |-- active_experiment_id: string (nullable = true)
 |-- active_experiment_branch: string (nullable = true)
 |-- reason: string (nullable = true)
 |-- timezone_offset: integer (nullable = true)
 |-- plugin_hangs: integer (nullable = true)
 |-- aborts_plugin: integer (nullable = true)
 |-- aborts_content: integer (nullable = true)
 |-- aborts_gmplugin: integer (nullable = true)
 |-- crashes_detected_plugin: integer (nullable = true)
 |-- crashes_detected_content: integer (nullable = true)
 |-- crashes_detected_gmplugin: integer (nullable = true)
 |-- crash_submit_attempt_main: integer (nullable = true)
 |-- crash_submit_attempt_content: integer (nullable = true)
 |-- crash_submit_attempt_plugin: integer (nullable = true)
 |-- crash_submit_success_main: integer (nullable = true)
 |-- crash_submit_success_content: integer (nullable = true)
 |-- crash_submit_success_plugin: integer (nullable = true)
 |-- shutdown_kill: integer (nullable = true)
 |-- active_addons_count: long (nullable = true)
 |-- flash_version: string (nullable = true)
 |-- vendor: string (nullable = true)
 |-- is_default_browser: boolean (nullable = true)
 |-- default_search_engine_data_name: string (nullable = true)
 |-- default_search_engine_data_load_path: string (nullable = true)
 |-- default_search_engine_data_origin: string (nullable = true)
 |-- default_search_engine_data_submission_url: string (nullable = true)
 |-- default_search_engine: string (nullable = true)
 |-- devtools_toolbox_opened_count: integer (nullable = true)
 |-- client_submission_date: string (nullable = true)
 |-- client_clock_skew: long (nullable = true)
 |-- client_submission_latency: long (nullable = true)
 |-- places_bookmarks_count: integer (nullable = true)
 |-- places_pages_count: integer (nullable = true)
 |-- push_api_notify: integer (nullable = true)
 |-- web_notification_shown: integer (nullable = true)
 |-- popup_notification_stats: map (nullable = true)
 |    |-- key: string
 |    |-- value: struct (valueContainsNull = true)
 |    |    |-- offered: integer (nullable = true)
 |    |    |-- action_1: integer (nullable = true)
 |    |    |-- action_2: integer (nullable = true)
 |    |    |-- action_3: integer (nullable = true)
 |    |    |-- action_last: integer (nullable = true)
 |    |    |-- dismissal_click_elsewhere: integer (nullable = true)
 |    |    |-- dismissal_leave_page: integer (nullable = true)
 |    |    |-- dismissal_close_button: integer (nullable = true)
 |    |    |-- dismissal_not_now: integer (nullable = true)
 |    |    |-- open_submenu: integer (nullable = true)
 |    |    |-- learn_more: integer (nullable = true)
 |    |    |-- reopen_offered: integer (nullable = true)
 |    |    |-- reopen_action_1: integer (nullable = true)
 |    |    |-- reopen_action_2: integer (nullable = true)
 |    |    |-- reopen_action_3: integer (nullable = true)
 |    |    |-- reopen_action_last: integer (nullable = true)
 |    |    |-- reopen_dismissal_click_elsewhere: integer (nullable = true)
 |    |    |-- reopen_dismissal_leave_page: integer (nullable = true)
 |    |    |-- reopen_dismissal_close_button: integer (nullable = true)
 |    |    |-- reopen_dismissal_not_now: integer (nullable = true)
 |    |    |-- reopen_open_submenu: integer (nullable = true)
 |    |    |-- reopen_learn_more: integer (nullable = true)
 |-- search_counts: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- engine: string (nullable = true)
 |    |    |-- source: string (nullable = true)
 |    |    |-- count: long (nullable = true)
 |-- active_addons: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- addon_id: string (nullable = false)
 |    |    |-- blocklisted: boolean (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- user_disabled: boolean (nullable = true)
 |    |    |-- app_disabled: boolean (nullable = true)
 |    |    |-- version: string (nullable = true)
 |    |    |-- scope: integer (nullable = true)
 |    |    |-- type: string (nullable = true)
 |    |    |-- foreign_install: boolean (nullable = true)
 |    |    |-- has_binary_components: boolean (nullable = true)
 |    |    |-- install_day: integer (nullable = true)
 |    |    |-- update_day: integer (nullable = true)
 |    |    |-- signed_state: integer (nullable = true)
 |    |    |-- is_system: boolean (nullable = true)
 |    |    |-- is_web_extension: boolean (nullable = true)
 |    |    |-- multiprocess_compatible: boolean (nullable = true)
 |-- disabled_addons_ids: array (nullable = true)
 |    |-- element: string (containsNull = false)
 |-- active_theme: struct (nullable = true)
 |    |-- addon_id: string (nullable = false)
 |    |-- blocklisted: boolean (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- user_disabled: boolean (nullable = true)
 |    |-- app_disabled: boolean (nullable = true)
 |    |-- version: string (nullable = true)
 |    |-- scope: integer (nullable = true)
 |    |-- type: string (nullable = true)
 |    |-- foreign_install: boolean (nullable = true)
 |    |-- has_binary_components: boolean (nullable = true)
 |    |-- install_day: integer (nullable = true)
 |    |-- update_day: integer (nullable = true)
 |    |-- signed_state: integer (nullable = true)
 |    |-- is_system: boolean (nullable = true)
 |    |-- is_web_extension: boolean (nullable = true)
 |    |-- multiprocess_compatible: boolean (nullable = true)
 |-- blocklist_enabled: boolean (nullable = true)
 |-- addon_compatibility_check_enabled: boolean (nullable = true)
 |-- telemetry_enabled: boolean (nullable = true)
 |-- user_prefs: struct (nullable = true)
 |    |-- dom_ipc_process_count: integer (nullable = true)
 |    |-- extensions_allow_non_mpc_extensions: boolean (nullable = true)
 |-- events: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- timestamp: long (nullable = false)
 |    |    |-- category: string (nullable = false)
 |    |    |-- method: string (nullable = false)
 |    |    |-- object: string (nullable = false)
 |    |    |-- string_value: string (nullable = true)
 |    |    |-- map_values: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |-- ssl_handshake_result_success: integer (nullable = true)
 |-- ssl_handshake_result_failure: integer (nullable = true)
 |-- ssl_handshake_result: map (nullable = true)
 |    |-- key: string
 |    |-- value: integer (valueContainsNull = true)
 |-- active_ticks: integer (nullable = true)
 |-- main: integer (nullable = true)
 |-- first_paint: integer (nullable = true)
 |-- session_restored: integer (nullable = true)
 |-- total_time: integer (nullable = true)
 |-- plugins_notification_shown: integer (nullable = true)
 |-- plugins_notification_user_action: struct (nullable = true)
 |    |-- allow_now: integer (nullable = true)
 |    |-- allow_always: integer (nullable = true)
 |    |-- block: integer (nullable = true)
 |-- plugins_infobar_shown: integer (nullable = true)
 |-- plugins_infobar_block: integer (nullable = true)
 |-- plugins_infobar_allow: integer (nullable = true)
 |-- plugins_infobar_dismissed: integer (nullable = true)
 |-- experiments: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- search_cohort: string (nullable = true)
 |-- gfx_compositor: string (nullable = true)
 |-- quantum_ready: boolean (nullable = true)
 |-- gc_max_pause_ms_main_above_150: long (nullable = true)
 |-- gc_max_pause_ms_main_above_250: long (nullable = true)
 |-- gc_max_pause_ms_main_above_2500: long (nullable = true)
 |-- gc_max_pause_ms_content_above_150: long (nullable = true)
 |-- gc_max_pause_ms_content_above_250: long (nullable = true)
 |-- gc_max_pause_ms_content_above_2500: long (nullable = true)
 |-- cycle_collector_max_pause_main_above_150: long (nullable = true)
 |-- cycle_collector_max_pause_main_above_250: long (nullable = true)
 |-- cycle_collector_max_pause_main_above_2500: long (nullable = true)
 |-- cycle_collector_max_pause_content_above_150: long (nullable = true)
 |-- cycle_collector_max_pause_content_above_250: long (nullable = true)
 |-- cycle_collector_max_pause_content_above_2500: long (nullable = true)
 |-- input_event_response_coalesced_ms_main_above_150: long (nullable = true)
 |-- input_event_response_coalesced_ms_main_above_250: long (nullable = true)
 |-- input_event_response_coalesced_ms_main_above_2500: long (nullable = true)
 |-- input_event_response_coalesced_ms_content_above_150: long (nullable = true)
 |-- input_event_response_coalesced_ms_content_above_250: long (nullable = true)
 |-- input_event_response_coalesced_ms_content_above_2500: long (nullable = true)
 |-- ghost_windows_main_above_1: long (nullable = true)
 |-- ghost_windows_content_above_1: long (nullable = true)
 |-- user_pref_dom_ipc_plugins_sandbox_level_flash: integer (nullable = true)
 |-- user_pref_dom_ipc_processcount: integer (nullable = true)
 |-- user_pref_extensions_allow_non_mpc_extensions: boolean (nullable = true)
 |-- user_pref_extensions_legacy_enabled: boolean (nullable = true)
 |-- user_pref_browser_search_widget_innavbar: boolean (nullable = true)
 |-- user_pref_general_config_filename: string (nullable = true)
 |-- ** dynamically included scalar fields, see source **
 |-- ** dynamically included whitelisted histograms, see source **
 |-- boolean_addon_scalars: map (nullable = true)
 |    |-- key: string
 |    |-- value: boolean (valueContainsNull = true)
 |-- keyed_boolean_addon_scalars: map (nullable = true)
 |    |-- key: string
 |    |-- value: map (valueContainsNull = true)
 |    |    |-- key: string
 |    |    |-- value: boolean (valueContainsNull = true)
 |-- string_addon_scalars: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- keyed_string_addon_scalars: map (nullable = true)
 |    |-- key: string
 |    |-- value: map (valueContainsNull = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)
 |-- uint_addon_scalars: map (nullable = true)
 |    |-- key: string
 |    |-- value: integer (valueContainsNull = true)
 |-- keyed_uint_addon_scalars: map (nullable = true)
 |    |-- key: string
 |    |-- value: map (valueContainsNull = true)
 |    |    |-- key: string
 |    |    |-- value: integer (valueContainsNull = true)
 |-- submission_date_s3: string (nullable = true)
 |-- sample_id: string (nullable = true)

For more detail on where these fields come from in the raw data, please look at the MainSummaryView code, in the buildSchema function.

Most of the fields are simple scalar values, with a few notable exceptions:

  • The search_counts field is an array of structs, each item in the array representing a 3-tuple of (engine, source, count). The engine field represents the name of the search engine against which the searches were done. The source field represents the part of the Firefox UI that was used to perform the search. It contains values such as abouthome, urlbar, and searchbar. The count field contains the number of searches performed against this engine+source combination during that subsession. Any of the fields in the struct may be null (for example if the search key did not match the expected pattern, or if the count was non-numeric).
  • The loop_activity_counter field is a simple struct containing inner fields for each expected value of the LOOP_ACTIVITY_COUNTER Enumerated Histogram. Each inner field is a count for that histogram bucket.
  • The popup_notification_stats field is a map of String keys to struct values, each field in the struct being a count for the expected values of the POPUP_NOTIFICATION_STATS Keyed Enumerated Histogram.
  • The places_bookmarks_count and places_pages_count fields contain the mean value of the corresponding Histogram, which can be interpreted as the average number of bookmarks or pages in a given subsession.
  • The active_addons field contains an array of structs, one for each entry in the environment.addons.activeAddons section of the payload. More detail in Bug 1290181.
  • The disabled_addons_ids field contains an array of strings, one for each entry in the payload.addonDetails which is not already reported in the environment.addons.activeAddons section of the payload. More detail in Bug 1390814. Please note that while using this field is generally OK, this was introduced to support the TAAR project and you should not count on it in the future. The field can stay in the main_summary, but we might need to slightly change the ping structure to something better than payload.addonDetails.
  • The active_theme field contains a single struct in the same shape as the items in the active_addons array. It contains information about the currently active browser theme.
  • The user_prefs field contains a struct with values for preferences of interest.
  • The events field contains an array of event structs.
  • Dynamically-included histogram fields are present as key->value maps, or key->(key->value) nested maps for keyed histograms.

Time formats

Columns in main_summary may use one of a handful of time formats with different precisions:

| Column Name | Origin | Description | Example | Spark | Presto |
|---|---|---|---|---|---|
| timestamp | stamped at ingestion | nanoseconds since epoch | 1504689165972861952 | from_unixtime(timestamp/1e9) | from_unixtime(timestamp/1e9) |
| submission_date_s3 | derived from timestamp | YYYYMMDD date string of timestamp in UTC | 20170906 | from_unixtime(unix_timestamp(submission_date, 'yyyyMMdd')) | date_parse(submission_date, '%Y%m%d') |
| client_submission_date | derived from HTTP header: Fields[Date] | HTTP date header string sent with the ping | Tue, 27 Sep 2016 16:28:23 GMT | unix_timestamp(client_submission_date, 'EEE, dd M yyyy HH:mm:ss zzz') | date_parse(substr(client_submission_date, 1, 25), '%a, %d %b %Y %H:%i:%s') |
| creation_date | creationDate | time of ping creation, ISO8601 at UTC+0 | 2017-09-06T08:21:36.002Z | to_timestamp(creation_date, "yyyy-MM-dd'T'HH:mm:ss.SSSXXX") | from_iso8601_timestamp(creation_date) AT TIME ZONE 'GMT' |
| timezone_offset | info.timezoneOffset | timezone offset in minutes | 120 | | |
| subsession_start_date | info.subsessionStartDate | hourly precision, ISO8601 date in local time | 2017-09-06T00:00:00.0+02:00 | | from_iso8601_timestamp(subsession_start_date) AT TIME ZONE 'GMT' |
| subsession_length | info.subsessionLength | subsession length in seconds | 599 | | date_add('second', subsession_length, subsession_start_date) |
| profile_creation_date | environment.profile.creationDate | days since epoch | 15,755 | | from_unixtime(profile_creation_date * 86400) |
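
A hypothetical Presto sketch combining a few of the conversions above (the date and sample filters are illustrative):

SELECT
    from_unixtime(timestamp / 1e9) AS received_at,
    from_iso8601_timestamp(creation_date) AT TIME ZONE 'GMT' AS created_at,
    from_unixtime(profile_creation_date * 86400) AS profile_created_at
FROM main_summary
WHERE submission_date_s3 = '20190401'
    AND sample_id = '42'
LIMIT 10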

Code Reference

This dataset is generated by telemetry-batch-view. Refer to this repository for information on how to run or augment the dataset.

New Profile

Introduction

The telemetry_new_profile_parquet table is the most direct representation of a new-profile ping.

Contents

The table contains one row for each ping. Each column represents one field from the new-profile ping payload, though only a subset of all fields are included.

Accessing the Data

The data is stored as a parquet table in S3 at the following address.

s3://net-mozaws-prod-us-west-2-pipeline-data/telemetry-new-profile-parquet/v2/

The telemetry_new_profile_parquet is accessible through re:dash. Here's an example query.

Further Reading

This dataset is generated automatically using direct to parquet. The configuration responsible for generating this dataset was introduced in bug 1360256.

Data Reference

Schema

As of 2018-06-26, the current version of the telemetry_new_profile_parquet dataset is v2, and has a schema as follows:

root
 |-- id: string (nullable = true)
 |-- client_id: string (nullable = true)
 |-- metadata: struct (nullable = true)
 |    |-- timestamp: long (nullable = true)
 |    |-- date: string (nullable = true)
 |    |-- normalized_channel: string (nullable = true)
 |    |-- geo_country: string (nullable = true)
 |    |-- geo_city: string (nullable = true)
 |    |-- geo_subdivision1: string (nullable = true)
 |    |-- geo_subdivision2: string (nullable = true)
 |    |-- creation_timestamp: long (nullable = true)
 |    |-- x_ping_sender_version: string (nullable = true)
 |-- environment: struct (nullable = true)
 |    |-- build: struct (nullable = true)
 |    |    |-- application_name: string (nullable = true)
 |    |    |-- architecture: string (nullable = true)
 |    |    |-- version: string (nullable = true)
 |    |    |-- build_id: string (nullable = true)
 |    |    |-- vendor: string (nullable = true)
 |    |    |-- hotfix_version: string (nullable = true)
 |    |-- partner: struct (nullable = true)
 |    |    |-- distribution_id: string (nullable = true)
 |    |    |-- distribution_version: string (nullable = true)
 |    |    |-- partner_id: string (nullable = true)
 |    |    |-- distributor: string (nullable = true)
 |    |    |-- distributor_channel: string (nullable = true)
 |    |    |-- partner_names: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |-- settings: struct (nullable = true)
 |    |    |-- is_default_browser: boolean (nullable = true)
 |    |    |-- default_search_engine: string (nullable = true)
 |    |    |-- default_search_engine_data: struct (nullable = true)
 |    |    |    |-- name: string (nullable = true)
 |    |    |    |-- load_path: string (nullable = true)
 |    |    |    |-- origin: string (nullable = true)
 |    |    |    |-- submission_url: string (nullable = true)
 |    |    |-- telemetry_enabled: boolean (nullable = true)
 |    |    |-- locale: string (nullable = true)
 |    |    |-- attribution: struct (nullable = true)
 |    |    |    |-- source: string (nullable = true)
 |    |    |    |-- medium: string (nullable = true)
 |    |    |    |-- campaign: string (nullable = true)
 |    |    |    |-- content: string (nullable = true)
 |    |    |-- update: struct (nullable = true)
 |    |    |    |-- channel: string (nullable = true)
 |    |    |    |-- enabled: boolean (nullable = true)
 |    |    |    |-- auto_download: boolean (nullable = true)
 |    |-- system: struct (nullable = true)
 |    |    |-- os: struct (nullable = true)
 |    |    |    |-- name: string (nullable = true)
 |    |    |    |-- version: string (nullable = true)
 |    |    |    |-- locale: string (nullable = true)
 |    |-- profile: struct (nullable = true)
 |    |    |-- creation_date: long (nullable = true)
 |-- payload: struct (nullable = true)
 |    |-- reason: string (nullable = true)
 |-- submission: string (nullable = true)

For more detail on the raw ping these fields come from, see the raw data.
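
As a minimal sketch of how this table might be queried (assuming the submission column holds the YYYYMMDD submission date; the example date is illustrative), counting new profiles per country for one day:

SELECT
    metadata.geo_country AS country,
    COUNT(DISTINCT client_id) AS new_profiles
FROM telemetry_new_profile_parquet
WHERE submission = '20190401'
GROUP BY 1
ORDER BY 2 DESC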

Socorro Crash Reports

Introduction

Public crash statistics for Firefox are available through the Data Platform in a socorro_crash dataset. The crash data in Socorro is sanitized and made available to ATMO and STMO. A nightly import job converts batches of JSON documents into a columnar format using the associated JSON Schema.

Contents

Accessing the Data

The dataset is available in parquet at s3://telemetry-parquet/socorro_crash/v2. It is also indexed with Athena and Presto with the table name socorro_crash.

Data Reference

Example

The dataset can be queried using SQL. For example, we can aggregate the number of crashes and total up-time by date and reason.

SELECT crash_date,
       reason,
       count(*) as n_crashes,
       avg(uptime) as avg_uptime,
       stddev(uptime) as stddev_uptime,
       approx_percentile(uptime, ARRAY [0.25, 0.5, 0.75]) as qntl_uptime
FROM socorro_crash
WHERE crash_date='20180520'
GROUP BY 1,
         2

STMO Source

Scheduling

The job is scheduled nightly on Airflow. The DAG is available under mozilla/telemetry-airflow:/dags/socorro_import.py.

Schema

The source schema is available on the mozilla/socorro GitHub repository. This schema is transformed into a Spark-SQL structure and serialized to parquet after transforming column names from camelCase to snake_case.

Code Reference

The code is a notebook in the mozilla-services/data-pipeline repository.

SSL Ratios

Introduction

The public SSL dataset publishes the percentage of page loads Firefox users have performed that were conducted over SSL. This dataset is used to produce graphs like Let's Encrypt's to determine SSL adoption on the Web over time.

Content

The public SSL dataset is a table where each row is a distinct set of dimensions, with their associated SSL statistics. The dimensions are submission_date, os, and country. The statistics are reporting_ratio, normalized_pageloads, and ratio.

Background and Caveats

  • We're using normalized values in normalized_pageloads to obscure absolute page load counts.
  • This is across the entirety of release, not per-version, because we're looking at Web health, not Firefox user health.
  • Any dimension tuple (any given combination of submission_date, os, and country) with fewer than 5000 page loads is omitted from the dataset.
  • This is hopefully just a temporary dataset to stopgap release aggregates going away until we can come up with a better way to publicly publish datasets.

Accessing the Data

For details on accessing the data, please look at bug 1414839.

Data Reference

Combining Rows

This is a dataset of ratios. You can't combine ratios if they have different bases. For example, if 50% of 10 loads (5 loads) were SSL and 5% of 20 loads (1 load) were SSL, you cannot calculate that 20% (6 loads) of the total loads (30 loads) were SSL unless you know that the 50% was for 10 and the 5% was for 20.

If you're reluctant, for product reasons, to share the numbers 10 and 20, this gets tricky.

So what we've done is normalize the whole batch of 30 down to 1. That means we tell you that 50% of one-third of the loads (0.333...) was SSL and 5% of the other two-thirds of the loads (0.666...) was SSL. Then you can figure out the overall 20% figure by this calculation:

0.5 * 0.333 + 0.05 * 0.666 = 0.2

For this dataset the same rule applies. To combine rows' ratios (to, for example, see what the SSL ratio was across all os and country for a given submission_date), you must first multiply them by the rows' normalized_pageloads values.

Or, in JavaScript:

let rows = query_result.data.rows;
let ratioForDateInQuestion = rows
  .filter(row => row.submission_date == dateInQuestion)
  .reduce((acc, row) => acc + row.normalized_pageloads * row.ratio, 0);

Schema

The data is output in re:dash API format:

"query_result": {
  "retrieved_at": <timestamp>,
  "query_hash": <hash>,
  "query": <SQL>,
  "runtime": <number of seconds>,
  "id": <an id>,
  "data_source_id": 26, // Athena
  "data_scanned": <some really large number, as a string>,
  "data": {
    "data_scanned": <some really large number, as a number>,
    "columns": [
      {"friendly_name": "submission_date", "type": "datetime", "name": "submission_date"},
      {"friendly_name": "os", "type": "string", "name": "os"},
      {"friendly_name": "country", "type": "string", "name": "country"},
      {"friendly_name": "reporting_ratio", "type": "float", "name": "reporting_ratio"},
      {"friendly_name": "normalized_pageloads", "type": "float", "name": "normalized_pageloads"},
      {"friendly_name": "ratio", "type": "float", "name": "ratio"}
    ],
    "rows": [
      {
        "submission_date": "2017-10-24T00:00:00", // date string, day resolution
        "os": "Windows_NT", // operating system family of the clients reporting the pageloads. One of "Windows_NT", "Linux", or "Darwin".
        "country": "CZ", // ISO 639 two-character country code, or "??" if we have no idea. Determined by performing a geo-IP lookup of the clients that submitted the pings.
        "reporting_ratio": 0.006825266611977031, // the ratio of pings that reported any pageloads at all. A number between 0 and 1. See [bug 1413258](https://bugzilla.mozilla.org/show_bug.cgi?id=1413258).
        "normalized_pageloads": 0.00001759145263985348, // the proportion of total pageloads in the dataset that are represented by this row. Provided to allow combining rows. A number between 0 and 1.
        "ratio": 0.6916961976822144 // the ratio of the pageloads that were performed over SSL. A number between 0 and 1.
      }, ...
    ]
  }
}

Scheduling

The dataset updates every 24 hours.

Code Reference

You can find the query that generates the SSL dataset here.

Telemetry Aggregates Reference

Introduction

The telemetry_aggregates dataset is a daily aggregation of the pings, aggregating the histograms across a set of dimensions.

Rows and Columns

There is one column for each dimension and one for the histogram; each row is a distinct set of dimensions, along with its associated histogram.

Accessing the Data

This dataset is accessible via re:dash by selecting from telemetry_aggregates.

The data is stored as a parquet table in S3 at the following address.

s3://telemetry-parquet/aggregates_poc/v1/

Data Reference

Example Queries

Here's an example query that shows the number of pings received per submission_date for the dimensions provided.

SELECT
    submission_date,
    SUM(count) AS pings
FROM
    telemetry_aggregates
WHERE
    channel = 'nightly'
    AND metric = 'GC_MS'
    AND aggregate_type = 'build_id'
    AND period = '201901'
GROUP BY
    submission_date
ORDER BY
    submission_date
;

Sampling

Invalid Pings

We ignore invalid pings in our processing. A ping is considered invalid if:

  • its submission date is invalid or missing,
  • its build ID is malformed,
  • its docType field is missing or unknown, or
  • its build ID is older than a defined cutoff in days (see the BUILD_ID_CUTOFFS variable in the code for the maximum days per channel).

Scheduling

The telemetry_aggregates job is run daily, at midnight UTC. The job is scheduled on Airflow. The DAG is here.

Schema

The telemetry_aggregates table has a set of dimensions and set of aggregates for those dimensions.

The partitioned dimensions are the following columns. Filtering by one of these fields limits the number of rows scanned, so queries can run significantly faster:

  • metric is the name of the metric, like "GC_MS".
  • aggregate_type is the type of aggregation, either "build_id" or "submission_date", representing how this aggregation was grouped.
  • period is a string representing the month in YYYYMM format that a ping was submitted, like '201901'.

The rest of the dimensions are:

  • submission_date is the date pings were submitted for a particular aggregate.
  • channel is the channel, like release or beta.
  • version is the program version, like 46.0a1.
  • build_id is the YYYYMMDDhhmmss timestamp the program was built, like 20190123192837.
  • application is the program name, like Firefox or Fennec.
  • architecture is the architecture that the program was built for (not necessarily the one it is running on).
  • os is the name of the OS the program is running on, like Darwin or Windows_NT.
  • os_version is the version of the OS the program is running on.
  • key is the key of a keyed metric. This will be empty if the underlying metric is not a keyed metric.
  • process_type is the process the histogram was recorded in, like content or parent.

The aggregates are:

  • count is the aggregate sum of the number of pings per dimensions.
  • sum is the aggregate sum of the histogram values per dimensions.
  • histogram is the aggregated histogram per dimensions.

Update

Introduction

The update ping is sent from Firefox Desktop when a browser update is ready to be applied and after it was correctly applied. It contains the build information and the update blob information, in addition to some information about the user environment. The telemetry_update_parquet table is the most direct representation of an update ping.

Contents

The table contains one row for each ping. Each column represents one field from the update ping payload, though only a subset of all fields are included.

Accessing the Data

The data is stored as a parquet table in S3 at the following address.

s3://net-mozaws-prod-us-west-2-pipeline-data/telemetry-update-parquet/v1/

The telemetry_update_parquet is accessible through re:dash. Here's an example query.

Further Reading

This dataset is generated automatically using direct to parquet. The configuration responsible for generating this dataset was introduced in bug 1384861.

Data Reference

Schema

As of 2017-09-07, the current version of the telemetry_update_parquet dataset is v1, and has a schema as follows:

root
 |-- id: string (nullable = true)
 |-- client_id: string (nullable = true)
 |-- metadata: struct (nullable = true)
 |    |-- timestamp: long (nullable = true)
 |    |-- date: string (nullable = true)
 |    |-- normalized_channel: string (nullable = true)
 |    |-- geo_country: string (nullable = true)
 |    |-- geo_city: string (nullable = true)
 |    |-- creation_timestamp: long (nullable = true)
 |    |-- x_ping_sender_version: string (nullable = true)
 |-- application: struct (nullable = true)
 |    |-- displayVersion: string (nullable = true)
 |-- environment: struct (nullable = true)
 |    |-- build: struct (nullable = true)
 |    |    |-- application_name: string (nullable = true)
 |    |    |-- architecture: string (nullable = true)
 |    |    |-- version: string (nullable = true)
 |    |    |-- build_id: string (nullable = true)
 |    |    |-- vendor: string (nullable = true)
 |    |    |-- hotfix_version: string (nullable = true)
 |    |-- partner: struct (nullable = true)
 |    |    |-- distribution_id: string (nullable = true)
 |    |    |-- distribution_version: string (nullable = true)
 |    |    |-- partner_id: string (nullable = true)
 |    |    |-- distributor: string (nullable = true)
 |    |    |-- distributor_channel: string (nullable = true)
 |    |    |-- partner_names: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |-- settings: struct (nullable = true)
 |    |    |-- telemetry_enabled: boolean (nullable = true)
 |    |    |-- locale: string (nullable = true)
 |    |    |-- update: struct (nullable = true)
 |    |    |    |-- channel: string (nullable = true)
 |    |    |    |-- enabled: boolean (nullable = true)
 |    |    |    |-- auto_download: boolean (nullable = true)
 |    |-- system: struct (nullable = true)
 |    |    |-- os: struct (nullable = true)
 |    |    |    |-- name: string (nullable = true)
 |    |    |    |-- version: string (nullable = true)
 |    |    |    |-- locale: string (nullable = true)
 |    |-- profile: struct (nullable = true)
 |    |    |-- creation_date: long (nullable = true)
 |-- payload: struct (nullable = true)
 |    |-- reason: string (nullable = true)
 |    |-- target_channel: string (nullable = true)
 |    |-- target_version: string (nullable = true)
 |    |-- target_build_id: string (nullable = true)
 |    |-- target_display_version: string (nullable = true)
 |    |-- previous_channel: string (nullable = true)
 |    |-- previous_version: string (nullable = true)
 |    |-- previous_build_id: string (nullable = true)
 |-- submission_date_s3: string (nullable = true)

For more detail on the raw ping these fields come from, see the raw data.
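
As a minimal sketch of how this table might be queried (the example date is illustrative), counting update pings per reason and target channel for one day:

SELECT
    payload.reason,
    payload.target_channel,
    COUNT(*) AS pings
FROM telemetry_update_parquet
WHERE submission_date_s3 = '20190401'
GROUP BY 1, 2
ORDER BY 3 DESC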

Work in Progress

This article is a work in progress. The work is being tracked in this bug.

Guide to our Experimental Tools

Shield

  • Shield is an addon-based experimentation platform with fine-tuned enrollment criteria. The system add-on landed in FF 53.
  • For the moment, it sends back data in its own shield type ping, so there's lots of flexibility in data you can collect.
  • Uses the Normandy server to serve out study “recipes” (?)
  • Annotates the main ping in the environment/experiments block
  • The shield system is itself a system add-on, so rolling out changes to the entire system does not require riding release trains
  • Strategy and Insights (strategyandinsights@mozilla.com) team are product owners and shepherd the study development and release process along
  • Opt-out experiments should be available soon?
  • Further reading:

Preference Flipping experiments

Uses Normandy, requires NO additional addon as long as a preference rides the release train

Heartbeat

Survey mechanism, also run via Normandy

Telemetry Experiments

Pre-release only: https://gecko.readthedocs.io/en/latest/browser/experiments/experiments/index.html

Funnelcake

Custom builds of Firefox that are served to some percentage of the direct download population

Accessing Heartbeat data

Heartbeat survey studies return telemetry on user engagement with the survey prompt. The heartbeat pings do not contain the survey responses themselves, which are stored by SurveyGizmo.

The telemetry is received using the heartbeat document type, which is described in the Firefox source tree docs.

These pings are aggregated into the telemetry_heartbeat_parquet table, and may also be accessed using the Dataset API.

Linking Heartbeat responses to telemetry

Heartbeat responses may be linked to Firefox telemetry if there is a "includeTelemetryUUID": true key in the arguments object of the show-heartbeat recipe.

Heartbeat never reports telemetry client_ids to SurveyGizmo, but, when includeTelemetryUUID is true, the Normandy user_id is reported to SurveyGizmo as the userid URL variable. Simultaneously, a heartbeat ping is sent to Mozilla, containing both the telemetry client_id and the Normandy userid that was reported to SurveyGizmo.

The userid is reported by appending it to the surveyId field of the ping, like:

hb-example-slug::e87bcae5-bb63-4829-822a-85ba41ee5d53

These can be extracted from the Parquet table for analysis using expressions like:

SPLIT(payload.survey_id,'::')[1] AS surveygizmo_userid
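
A hypothetical sketch of that extraction in context, reusing the expression above and the example slug (the date filter is illustrative):

SELECT
    client_id,
    SPLIT(payload.survey_id, '::')[1] AS surveygizmo_userid,
    payload.score
FROM telemetry_heartbeat_parquet
WHERE submission_date_s3 = '20190401'
    AND payload.survey_id LIKE 'hb-example-slug::%'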

Data reference

The telemetry_heartbeat_parquet table is partitioned by submission_date_s3 and has the schema:

telemetry_heartbeat_parquet
 |-- type: string (nullable = true)
 |-- id: string (nullable = true)
 |-- creation_date: string (nullable = true)
 |-- version: double (nullable = true)
 |-- client_id: string (nullable = true)
 |-- application: struct (nullable = true)
 |    |-- architecture: string (nullable = true)
 |    |-- build_id: string (nullable = true)
 |    |-- channel: string (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- platform_version: string (nullable = true)
 |    |-- version: string (nullable = true)
 |    |-- display_version: string (nullable = true)
 |    |-- vendor: string (nullable = true)
 |    |-- xpcom_abi: string (nullable = true)
 |-- payload: struct (nullable = true)
 |    |-- version: long (nullable = true)
 |    |-- flow_id: string (nullable = true)
 |    |-- offered_ts: long (nullable = true)
 |    |-- learn_more_ts: long (nullable = true)
 |    |-- voted_ts: long (nullable = true)
 |    |-- engaged_ts: long (nullable = true)
 |    |-- closed_ts: long (nullable = true)
 |    |-- expired_ts: long (nullable = true)
 |    |-- window_closed_ts: long (nullable = true)
 |    |-- score: long (nullable = true)
 |    |-- survey_id: string (nullable = true)
 |    |-- survey_version: string (nullable = true)
 |    |-- testing: boolean (nullable = true)
 |-- metadata: struct (nullable = true)
 |    |-- timestamp: long (nullable = true)
 |    |-- app_version: string (nullable = true)
 |    |-- date: string (nullable = true)
 |    |-- normalized_channel: string (nullable = true)
 |    |-- app_update_channel: string (nullable = true)
 |    |-- submission_date: string (nullable = true)
 |    |-- geo_city: string (nullable = true)
 |    |-- geo_country: string (nullable = true)
 |    |-- document_id: string (nullable = true)
 |    |-- app_build_id: string (nullable = true)
 |    |-- app_name: string (nullable = true)
 |-- submission_date_s3: string (nullable = true)

Analyzing data from SHIELD studies

This article introduces the datasets that are useful for analyzing SHIELD studies. After reading this article, you should understand how to answer questions about study enrollment, identify telemetry from clients enrolled in an experiment, and locate telemetry from add-on studies.

Dashboards

The Shield Studies Viewer and Experimenter are other places to find lists of live experiments.

Experiment slugs

Each experiment is associated with a slug, which is the label used to identify the experiment to Normandy clients. The slug is also used to identify the experiment in most telemetry. The slug for pref-flip experiments is defined in the recipe by a field named slug; the slug for add-on experiments is defined in the recipe by a field named name.

You can determine the slug for a particular experiment by consulting this summary table or the list of active recipes at https://normandy.cdn.mozilla.net/api/v1/recipe/signed/.

Tables

These tables should be accessible from ATMO, Databricks, Presto, and Athena.

experiments column

main_summary, clients_daily, crash_summary, and some other tables include an experiments column, which is a mapping from experiment slug to branch.

You can collect rows from enrolled clients using query syntax like:

SELECT
  *,
  experiments['some-experiment-slug-12345'] AS branch
FROM clients_daily
WHERE experiments['some-experiment-slug-12345'] IS NOT NULL

experiments

The experiments table is a subset of rows from main_summary reflecting pings from clients that are currently enrolled in an experiment. The experiments table has additional string-type experiment_id and experiment_branch columns, and is partitioned by experiment_id, which makes it efficient to query.

Experiments deployed to large fractions of the release channel may have the isHighVolume flag set in the Normandy recipe; those experiments will not be aggregated into the experiments table.

Please note that the experiments table cannot be used for calculating retention for periods extending beyond the end of the experiment. Once a client is unenrolled from an experiment, subsequent pings will not be captured by the experiments table.
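For example, here is a minimal sketch of a query counting enrolled clients per branch; the slug and date are placeholders, and the submission_date_s3 filter assumes the main_summary partitioning carries over:

SELECT experiment_branch,
       COUNT(DISTINCT client_id) AS clients
FROM experiments
WHERE experiment_id = 'some-experiment-slug-12345'
  AND submission_date_s3 >= '20190601'
GROUP BY experiment_branch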

events

The events table includes Normandy enrollment and unenrollment events for both pref-flip and add-on studies.

Normandy events have event category normandy. The event value will contain the experiment slug.

The event schema is described in the Firefox source tree.

The events table is updated daily.
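For example, here is a hedged sketch tallying enrollments per branch for a single experiment, assuming enroll is the relevant event method and that the branch is recorded under the branch key of event_map_values (the slug and date are placeholders):

SELECT event_map_values['branch'] AS branch,
       COUNT(*) AS enrollments
FROM events
WHERE event_category = 'normandy'
  AND event_method = 'enroll'
  AND event_string_value = 'some-experiment-slug-12345'
  AND submission_date_s3 >= '20190601'
GROUP BY 1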

telemetry_shield_study_addon_parquet

The telemetry_shield_study_addon_parquet table contains SHIELD telemetry from add-on experiments, i.e. key-value pairs sent with the browser.study.sendTelemetry() method from the SHIELD study add-on utilities library.

The study_name attribute of the payload column will contain the identifier registered with the SHIELD add-on utilities. This is set by the add-on; sometimes it takes the value of applications.gecko.id from the add-on's manifest.json. This is often not the same as the Normandy slug.

The schema for shield-study-addon pings is described in the mozilla-pipeline-schemas repository.

The key-value pairs are present in the data attribute of the payload column.

The telemetry_shield_study_addon_parquet table is produced by direct-to-parquet; data latency should be less than 1 hour.
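For example, here is a minimal sketch for pulling a study's raw key-value payloads; the study name is a placeholder, and the exact payload layout should be confirmed against the ping schema linked above:

SELECT payload.branch,
       payload.data
FROM telemetry_shield_study_addon_parquet
WHERE payload.study_name = 'my-shield-study'
  AND submission_date_s3 = '20190601'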

telemetry_shield_study_parquet

The telemetry_shield_study_parquet dataset includes enrollment and unenrollment events for add-on experiments only, sent by the SHIELD study add-on utilities.

The study_name attribute of the payload column will contain the identifier registered with the SHIELD add-on utilities. This is set by the add-on; sometimes it takes the value of applications.gecko.id from the add-on's manifest.json. This is often not the same as the Normandy slug.

Normandy also emits its own enrollment and unenrollment events for these studies, which are available in the events table.

The telemetry_shield_study_parquet table is produced by direct-to-parquet; data latency should be less than 1 hour.

Raw ping sources

telemetry-cohorts

The telemetry-cohorts dataset contains a subset of pings from clients enrolled in experiments, accessible as a Dataset, and partitioned by experimentId and docType.

Experiments deployed to large fractions of the release channel may have the isHighVolume flag set in the Normandy recipe; those experiments will not be aggregated into the telemetry-cohorts source.

To learn which branch clients are enrolled in, reference the environment.experiments map.

1 Add-on experiments are displayed in Test Tube when the name given in the Normandy recipe matches the identifier registered with the SHIELD add-on utilities (which is sometimes the applications.gecko.id listed in the add-on's manifest.json). These often do not match.

Search Data

Introduction

This article introduces the datasets we maintain for search analyses: search_aggregates and search_clients_daily. After reading this article, you should understand the search datasets well enough to produce moderately complex analyses.

Table of Contents

Permissions

Access to both search_aggregates and search_clients_daily is heavily restricted in re:dash. We also maintain a restricted group for search on Github and Bugzilla. If you hit a 404 on Github or can't access a re:dash query or bug, this is likely the reason. To get access permissions, file a bug using the search permissions template.

Once you have proper permissions, you'll have access to a new source in re:dash called Presto Search. You will not be able to access any of the search datasets via the standard Presto data source, even with proper permissions.

Terminology

Direct vs Follow-on Search

Searches can be split into three major classes: sap, follow-on, and organic.

SAP searches result from a direct interaction with a search access point (SAP), which is part of the Firefox UI. There are currently 8 SAPs:

  • urlbar - entering a search query in the Awesomebar
  • searchbar - the main search bar; not present by default for new profiles on Firefox 57+
  • newtab - the search bar on the about:newtab page
  • abouthome - the search bar on the about:home page
  • contextmenu - selecting text and clicking "Search" from the context menu
  • system - starting Firefox from the command line with an option that immediately makes a search
  • webextension - initiated from a web extension (added as of Firefox 63)
  • alias - initiated from a search keyword (like @google) (added as of Firefox 64)

Users will often interact with the Search Engine Results Page (SERP) to create "downstream" queries. These queries are called follow-on queries. These are sometimes also referred to as in-content queries since they are initiated from the content of the page itself and not from the Firefox UI.

For example, follow-on queries can be caused by:

  • Revising a query (restaurants becomes restaurants near me)
  • Clicking on the "next" button
  • Accepting spelling suggestions

Finally, we track the number of organic searches. These would be searches initiated directly from a search engine provider, not through a search access point.

Tagged vs Untagged Searches

Our partners (search engines) attribute queries to Mozilla using partner codes. When a user issues a query through one of our SAPs, we include our partner code in the URL of the resulting search.

Tagged queries are queries that include one of our partner codes.

Untagged queries are queries that do not include one of our partner codes. If a query is untagged, it's usually because we do not have a partner deal for that search engine and region (or it is an organic search that did not start from an SAP).

If an SAP query is tagged, any follow-on query should also be tagged.

Standard Search Aggregates

We report five types of searches in our search datasets: sap, tagged-sap, tagged-follow-on, organic, and unknown. These aggregates show up as columns in the search_aggregates and search_clients_daily datasets. Our search datasets are all derived from main_summary. The aggregate columns are derived from the SEARCH_COUNTS histogram.

The sap column counts all SAP (or direct) searches. sap search counts are collected via probes within the Firefox UI. These counts are very reliable, but do not count follow-on queries.

In 2017-06 we deployed the followonsearch addon, which adds probes for tagged-sap and tagged-follow-on searches. These columns attempt to count all tagged searches by looking for Mozilla partner codes in the URL of requests to partner search engines. These search counts are critical to understanding revenue since they exclude untagged searches and include follow-on searches. However, these search counts have important caveats affecting their reliability. See In Content Telemetry Issues for more information.

In 2018, we incorporated this code into the product (as of version 61) and also started tracking so-called "organic" searches that weren't initiated through a search access point (sap). This data has the same caveats as those for follow-on searches, above.

We also started tracking "unknown" searches, which generally correspond to clients submitting random/unknown search data to our servers as part of their telemetry payload. This category can generally safely be ignored, unless its value is extremely high (which indicates a bug in either Firefox or the aggregation code which creates our datasets).

In main_summary, all of these searches are stored in search_counts.count, which makes it easy to overcount searches. However, in general, please avoid using main_summary for search analyses -- it's slow and you will need to duplicate much of the work done to make analyses of our search datasets tractable.

Outlier Filtering

We remove search count observations representing more than 10,000 searches for a single search engine in a single ping.

In Content Telemetry Issues

The search code module inside Firefox (formerly implemented as an addon until version 60) implements the probe used to measure tagged-sap and tagged-follow-on searches and also tracks organic searches. This probe is critical to understanding our revenue. It's the only tool that gives us a view of follow-on searches and differentiates between tagged and untagged queries. However, it comes with some notable caveats.

Relies on whitelists

Firefox's search module attempts to count all tagged searches by looking for Mozilla partner codes in the URL of requests to partner search engines. To do this, it relies on a whitelist of partner codes and URL formats. The list of partner codes is incomplete and only covers a few top partners. These codes also occasionally change so there will be gaps in the data.

Additionally, changes to search engine URL formats can cause problems with our data collection. See this query for a notable example.

Limited historical data

The followonsearch addon was first deployed in 2017-06. There is no tagged-* search data available before this.

Search Aggregates

Introduction

search_aggregates is designed to power high level search dashboards. It's quick and easy to query, but the data are coarse. In particular, this dataset allows you to segment by a limited number of client characteristics which are relevant to search markets. However, it is not possible to normalize by client count. If you need fine-grained data, consider using search_clients_daily, which breaks down search counts by client.

Contents

Each row of search_aggregates contains the standard search count aggregations for each unique combination of the following columns. Unless otherwise noted, these columns are taken directly from main_summary.

  • submission_date - yyyymmdd
  • engine - e.g. google, bing, yahoo
  • source - The UI component used to issue a search - e.g. urlbar, abouthome
  • country
  • locale
  • addon_version - The installed version of the followonsearch addon (before version 61)
  • app_version
  • distribution_id - NULL means the standard Firefox build
  • search_cohort - NULL except for small segments relating to search experimentation

There are five aggregation columns: sap, tagged-sap, tagged-follow-on, organic, and unknown. Each of these columns represents a different type of search. For more details, see the search data documentation. Note that, if there were no such searches in a row's segment (i.e. the count would be 0), the column value is null.

Accessing the Data

Access to search_aggregates is heavily restricted. You will not be able to access this table without additional permissions. For more details see the search data documentation.

Data Reference

Example Queries

This query calculates daily US searches. If you have trouble viewing this query, it's likely you don't have the proper permissions. For more details see the search data documentation.
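If you can't view that query, a minimal sketch of a similar calculation is below; it treats SAP plus tagged follow-on searches as "searches" (one possible definition, not the canonical one), and note that hyphenated column names need double quotes in Presto:

SELECT submission_date,
       SUM(COALESCE(sap, 0)) AS sap_searches,
       SUM(COALESCE("tagged-follow-on", 0)) AS tagged_follow_on_searches
FROM search_aggregates
WHERE country = 'US'
GROUP BY submission_date
ORDER BY submission_date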

Scheduling

This job is scheduled on airflow to run daily.

Schema

As of 2018-11-28, the current version of search_aggregates is v4, and has a schema as follows. The dataset is backfilled through 2016-06-06.

root
 |-- country: string (nullable = true)
 |-- engine: string (nullable = true)
 |-- source: string (nullable = true)
 |-- submission_date: string (nullable = true)
 |-- app_version: string (nullable = true)
 |-- distribution_id: string (nullable = true)
 |-- locale: string (nullable = true)
 |-- search_cohort: string (nullable = true)
 |-- addon_version: string (nullable = true)
 |-- tagged-sap: long (nullable = true)
 |-- tagged-follow-on: long (nullable = true)
 |-- sap: long (nullable = true)
 |-- organic: long (nullable = true)
 |-- unknown: long (nullable = true)

Code Reference

The search_aggregates job is defined in python_mozetl

Search Clients Daily

Introduction

search_clients_daily is designed to enable client-level search analyses. Querying this dataset can be slow; consider using search_aggregates for coarse analyses.

Contents

search_clients_daily has one row for each unique combination of: (client_id, submission_date, engine, source).

In addition to the standard search count aggregations, this dataset includes some descriptive data for each client. For example, we include country and channel for each row of data. In the event that a client sends multiple pings on a given submission_date we choose an arbitrary value from the pings for that (client_id, submission_date), unless otherwise noted.

There are five standard search count aggregation columns: sap, tagged-sap, tagged-follow-on, organic, and unknown. Note that, if there were no such searches in a row's segment (i.e. the count would be 0), the column value is null. Each of these columns represents a different type of search. For more details, see the search data documentation.

Background and Caveats

search_clients_daily does not include (client_id, submission_date) pairs if we did not receive a ping for that submission_date.

We impute a NULL engine and source for pings with no search counts. This ensures users who never search are included in this dataset.

This dataset is large. Consider using Spark on Databricks for heavy analyses. If you're querying this dataset from re:dash, heavily limit the data you read using submission_date_s3 or sample_id.

Accessing the Data

Access to search_clients_daily is heavily restricted. You will not be able to access this table without additional permissions. For more details see the search data documentation.

Data Reference

Example Queries

This query calculates searches per normalized_channel for US clients on an arbitrary day. If you have trouble viewing this query, it's likely you don't have the proper permissions. For more details see the search data documentation.
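A minimal sketch of a similar per-channel calculation is below; the date is a placeholder, and the channel column from the schema is used here in place of normalized_channel:

SELECT channel,
       COUNT(DISTINCT client_id) AS clients,
       SUM(COALESCE(sap, 0)) AS sap_searches
FROM search_clients_daily
WHERE submission_date_s3 = '20190601'
  AND country = 'US'
GROUP BY channel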

Scheduling

This dataset is scheduled on Airflow (source).

Schema

As of 2018-11-28, the current version of search_clients_daily is v4, and has a schema as follows. It's backfilled through 2016-06-07.

root
 |-- client_id: string (nullable = true)
 |-- submission_date: string (nullable = true)
 |-- engine: string (nullable = true)
 |-- source: string (nullable = true)
 |-- country: string (nullable = true)
 |-- app_version: string (nullable = true)
 |-- distribution_id: string (nullable = true)
 |-- locale: string (nullable = true)
 |-- search_cohort: string (nullable = true)
 |-- addon_version: string (nullable = true)
 |-- os: string (nullable = true)
 |-- channel: string (nullable = true)
 |-- profile_creation_date: long (nullable = true)
 |-- default_search_engine: string (nullable = true)
 |-- default_search_engine_data_load_path: string (nullable = true)
 |-- default_search_engine_data_submission_url: string (nullable = true)
 |-- sample_id: string (nullable = true)
 |-- sessions_started_on_this_day: long (nullable = true)
 |-- profile_age_in_days: integer (nullable = true)
 |-- subsession_hours_sum: double (nullable = true)
 |-- active_addons_count_mean: double (nullable = true)
 |-- max_concurrent_tab_count_max: integer (nullable = true)
 |-- tab_open_event_count_sum: long (nullable = true)
 |-- active_hours_sum: double (nullable = true)
 |-- tagged-sap: long (nullable = true)
 |-- tagged-follow-on: long (nullable = true)
 |-- sap: long (nullable = true)
 |-- tagged_sap: long (nullable = true)
 |-- tagged_follow_on: long (nullable = true)
 |-- organic: long (nullable = true)
 |-- unknown: long (nullable = true)
 |-- submission_date_s3: string (nullable = true)

Code Reference

The search_clients_daily job is defined in python_mozetl

Other Datasets

These datasets are for projects outside of the Firefox telemetry domain.

hgpush

This dataset records facts about individual commits to the Firefox source tree in the mozilla-central source code repository.

Data Reference

The dataset is accessible via STMO. Use the eng_workflow_hgpush_parquet_v1 table with the Athena data source. (The Presto data source is also available, but much slower.)

Field Types and Descriptions

See the hgpush ping schema for a description of available fields.

Be careful to:

  • Use the latest schema version. e.g. v1. Browse the hgpush schema directory in the GitHub repo to be sure.
  • Change dataset field names from camelCaseNames to under_score_names in STMO. e.g. reviewSystemUsed in the ping schema becomes review_system_used in STMO.

Example Queries

Select the number of commits with an 'unknown' review system in the last 7 days:

select
    count(1)
from
    eng_workflow_hgpush_parquet_v1
where
    review_system_used = 'unknown'
    and date_diff('day', from_unixtime(push_date), now()) < 7

Code Reference

The dataset is populated via the Commit Telemetry Service.

What is the Stub Installer ping?

When the stub installer completes with almost any result, it generates a ping containing some data about the system and about how the installation went. This ping isn't part of Firefox unified telemetry; it's a bespoke system (we can't use the telemetry client code when it isn't installed yet).

No ping is sent if the installer exits early because initial system requirements checks fail.

How it’s processed

They are formed and sent from NSIS code (!) in the stub installer, in the SendPing subroutine.

They are processed into Redshift by dsmo_load.

How to access the data

The Redshift tables are accessible from the DSMO-RS data source in STMO.

The canonical documentation is in this tree.

There are three tables produced every day (you can see them in Redshift as {tablename}_YYYYMMDD):

  • download_stats_YYYYMMDD (source)
  • download_stats_funnelcake_YYYYMMDD (source)
  • download_stats_errors_YYYYMMDD (source)

The funnelcake tables aggregate funnelcake builds, which have additional metadata for tracking distribution experiments. More on Funnelcake.

download_stats (without the date appended) and download_stats_year are views that union all (or a year's worth) of the per-day tables together, which makes e.g. SELECT * LIMIT 10 operations on them quite slow.

Note about os_version: Previous versions of Windows have used a very small set of build numbers through their entire life cycle. However, Windows 10 gets a new build number with every major update (about every 6 months), and many more builds have been released on its insider channels. So, to prevent a huge amount of noise, queries using this field should generally filter out the build number and only use the major and minor version numbers to differentiate Windows versions, unless the build number is specifically needed.
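For example, here is a hedged sketch (Redshift SQL) that groups installs by major and minor OS version only, assuming os_version strings of the form 10.0.17134 and using a single day's table:

SELECT SPLIT_PART(os_version, '.', 1) || '.' || SPLIT_PART(os_version, '.', 2) AS os_major_minor,
       COUNT(*) AS installs
FROM download_stats_20190601
GROUP BY 1
ORDER BY installs DESC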

What is Activity Stream?

Activity Stream is the Firefox module which manages the in-product content pages for Firefox, such as about:home and about:newtab.

The Activity Stream team has implemented data collection in and around these pages. This data has some overlap with the standard Firefox Telemetry system; however, it is a custom system, designed and maintained by that team.

For specific questions about this data, reach out to the #fx-messaging-system Slack channel directly.

Activity Stream Pings

This data is measured in various custom pings that are sent via PingCentre (different from Pingsender).

Accessing Activity Stream Data

The various Activity Stream pings are stored in tables in the Tiles Redshift database, maintained by the Activity Stream team.

This database can be accessed via re:dash, or in Databricks via a workaround provided by the Data Operations team, tracked in this bug.

Gotchas and Caveats

Since this data isn't collected or maintained through our standard Telemetry API, there are a number of "gotchas" to keep in mind when working with it.

  • Ping send conditions: Activity Stream pings have different send conditions, both from Telemetry pings as well as from each other. AS Health Pings, for example, get sent by all profiles with Telemetry enabled, upon startup of each Firefox session. In contrast, AS Session Pings only get sent by profiles that entered an Activity Stream session, at the end of that session, regardless of how long that session is. Compare this to main pings, which get sent by all Telemetry enabled profiles upon subsession end (browser shutdown, environment change, or local midnight cutoff).

    Due to these inconsistencies, using data from different sources can be tricky. For example, if we wanted to know how much of DAU (from main pings) had a custom about:home page (available in AS Health Pings), joining on client_id and a date field would only provide information on profiles that started the session on that same day (active profiles on multi-day sessions would be excluded).

  • Population covered: In addition to the usual considerations when looking at a measurement (in what version of Firefox did this measurement start getting collected? In what channels is it enabled in? etc.), when working with this data, there are additional Activity Stream specific conditions to consider when deciding "who is eligible to send this ping?"

    For example, Pocket recommendations are only enabled in the US, CA, and DE countries, for profiles that are on en-US, en-CA, and DE locales. Furthermore, users can set their about:home and about:newtab page to non-Activity Stream pages. This information can be important when deciding denominators for certain metrics.

  • Different ping types in the same table: The tables in the Tiles database can contain multiple types of pings. For example, the assa_events_daily table contains both AS Page Takeover pings as well as AS User Event pings.

  • Inconsistent fields: In some tables, the same field can have different meanings for different records.

    For example, in the assa_router_events_daily table, the impression_id field corresponds to the standard Telemetry client_id field for Snippets impressions, CFR impressions for pre-release and shield experiments, and onboarding impressions. However, for CFR impressions for release, this field is a separate impression identifier.

  • Passing Experiment Tags: If a profile is enrolled in a Normandy experiment, the experiment slug for that profile is only passed to the Activity Stream data if it contains the string "activity-stream".

    In other words, Activity Stream will not tag data as belonging to an experiment if it is missing "activity-stream" in the slug, even if it is indeed enrolled in an experiment.

  • Data field formats: The format for some of the data that is shared with standard Telemetry can differ.

    For example, experiment slugs in standard Telemetry are formatted as an array of maps (one for each experiment the profile is enrolled in)

    [{'experiment1_name':'branch_name'}, {'experiment2_name':'branch_name'}]

    However, in the Activity Stream telemetry, experiment slugs are reported in a string, using ; as a separator between experiments and : as a separator between experiment name and branch name (see the query sketch after this list).

    'experiment1_name:branch_name;experiment2_name:branch_name'

  • Null handling: Some fields in the Activity Stream data encode nulls with a 'N/A' string or a -1 value.

  • Changes in ping behaviors: These pings continue to undergo development and the behavior as well as possible values for a given ping seem to change over time. For example, older versions of the event pings for clicking on a Topsite do not seem to report card_types and icon_types, while newer versions do. Caution is advised.

  • Pocket data: Data related to Pocket interaction and usage in the about:home and about:newtab pages get sent to Pocket via this data collection and pipeline. However, due to privacy reasons, that data is sanitized and client_id is randomized. So while it is possible to ask, "how many Topsites and Highlights did a given profile click on in a given day?", we cannot answer that question for Pocket tiles.
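For example, here is a hedged sketch of filtering Activity Stream session rows for a specific experiment, assuming the experiment string described above lands in the shield_id column; the slug is hypothetical:

SELECT client_id,
       date,
       shield_id
FROM assa_sessions_daily_by_client_id
WHERE date = '20190601'
  AND shield_id LIKE '%my-activity-stream-experiment%'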

Examples

Sessions per client_id

Note: only includes client_ids that completed an Activity Stream session that day.

SELECT
	client_id, 
	date, 
	count(DISTINCT session_id) as num_sessions
FROM
	assa_sessions_daily_by_client_id
WHERE
	date = '20190601' 
GROUP BY 
	1

Topsite clicks and Highlights clicks

SELECT
	client_id, 
	date, 
	session_id,
	page, 
	source, 
	action_position, 
	shield_id
FROM
	assa_events_daily
WHERE
	source in ('TOP_SITES', 'HIGHLIGHTS')
	AND event = 'CLICK'
	AND date = '20190601' 

Snippet impressions, blocks, clicks, and dismissals

Note: Which snippet message a record corresponds to can be identified by the message_id (check with Marketing for snippet recipes published).

SELECT 
    impression_id AS client_id, 
    date, 
    source,
    event,
    message_id, 
    value,
    shield_id
FROM 
	assa_router_events_daily
WHERE 
	source = 'snippets_user_event'
  	AND date = '20190601'

Obsolete Datasets

These datasets are no longer updated or maintained. Please reach out to the Data Platform team if you think your needs are best met by an obsolete dataset.

Heavy Users

As of 2018-05-18, this dataset has been deprecated and is no longer maintained. See Bug 1455314

Replacement

We've moved to assigning users an active tag based on total_uri_count; see the Active DAU definition.

The activity of a user based on active_ticks is available in clients_daily in the active_hours_sum field, which is the sum of active_ticks / 720.

To retrieve a client's 28-day active_hours, use the following query:

SELECT submission_date_s3,
       client_id,
       SUM(active_hours_sum) OVER (PARTITION BY client_id
                                   ORDER BY submission_date_s3 ASC
                                   ROWS 27 PRECEDING) AS monthly_active_hours
FROM
    clients_daily

Introduction

The heavy_users table provides information about whether a given client_id is considered a "heavy user" on each day (using submission date).

Contents

The heavy_users table contains one row per client-day, where day is submission_date. A client has a row for a specific submission_date if they were active at all in the 28 day window ending on that submission_date.

A user is a "heavy user" as of day N if, for the 28 day period ending on day N, the sum of their active_ticks is in the 90th percentile (or above) of all clients during that period. For more analysis on this, and a discussion of new profiles, see this link.

Background and Caveats

  1. Data starts at 20170801. There is technically data in the table before this, but the heavy_user column is NULL for those dates because it needed to bootstrap the first 28 day window.
  2. Because it is the top 10% of clients for each 28-day period, more than 10% of clients active on a given submission_date will be considered heavy users. If you join with another data source (main_summary, for example), you may see a larger proportion of heavy users than expected.
  3. Each day has a separate, but related, set of heavy users. Initial investigations show that approximately 97.5% of heavy users as of a certain day are still considered heavy users as of the next day.
  4. There is no "fixing" or weighting of new profiles - days before the profile was created are counted as zero active_ticks. Analyses may need to use the included profile_creation_date field to take this into account.

Accessing the Data

The data is available both via sql.t.m.o and Spark.

In Spark:

spark.read.parquet("s3://telemetry-parquet/heavy_users/v1")

In SQL:

SELECT * FROM heavy_users LIMIT 3

Further Reading

The code responsible for generating this dataset is here

Data Reference

Example Queries


Scheduling

This dataset is updated daily via the telemetry-airflow infrastructure. The job DAG runs every day after main_summary is complete. You can find the job definition here.

Schema

As of 2017-10-05, the current version of the heavy_users dataset is v1, and has a schema as follows:

root
 |-- client_id: string (nullable = true)
 |-- sample_id: integer (nullable = true)
 |-- profile_creation_date: long (nullable = true)
 |-- active_ticks: long (nullable = true)
 |-- active_ticks_period: long (nullable = true)
 |-- heavy_user: boolean (nullable = true)
 |-- prev_year_heavy_user: boolean (nullable = true)
 |-- submission_date_s3: string (nullable = true)

Code Reference

This dataset is generated by telemetry-batch-view. Refer to this repository for information on how to run or augment the dataset.

Crash Aggregates

As of 2018-04-02, this dataset has been deprecated and is no longer maintained. Please use error aggregates instead. See Bug 1388025 for more information.

Introduction

The crash_aggregates dataset compiles crash statistics over various dimensions for each day.

Rows and Columns

There's one column for each of the stratifying dimensions and the crash statistics. Each row is a distinct set of dimensions, along with their associated crash stats. Example stratifying dimensions include channel and country, example statistics include usage hours and plugin crashes.

Accessing the Data

This dataset is accessible via re:dash.

The data is stored as a parquet table in S3 at the following address.

s3://telemetry-parquet/crash_aggregates/v1/

Further Reading

The technical documentation for this dataset can be found in the telemetry-batch-view documentation

Data Reference

Example Queries

Here's an example query that computes crash rates for each channel (sorted by number of usage hours):

SELECT dimensions['channel'] AS channel,
       sum(stats['usage_hours']) AS usage_hours,
       1000 * sum(stats['main_crashes']) / sum(stats['usage_hours']) AS main_crash_rate,
       1000 * sum(stats['content_crashes']) / sum(stats['usage_hours']) AS content_crash_rate,
       1000 * sum(stats['plugin_crashes']) / sum(stats['usage_hours']) AS plugin_crash_rate,
       1000 * sum(stats['gmplugin_crashes']) / sum(stats['usage_hours']) AS gmplugin_crash_rate,
       1000 * sum(stats['gpu_crashes']) / sum(stats['usage_hours']) AS gpu_crash_rate
FROM crash_aggregates
GROUP BY dimensions['channel']
ORDER BY -sum(stats['usage_hours'])

Main process crashes by build date and OS version.

WITH channel_rates AS (
  SELECT dimensions['build_id'] AS build_id,
         SUM(stats['main_crashes']) AS main_crashes, -- total number of crashes
         SUM(stats['usage_hours']) / 1000 AS usage_kilohours, -- thousand hours of usage
         dimensions['os_version'] AS os_version -- os version
   FROM crash_aggregates
   WHERE dimensions['experiment_id'] is null -- not in an experiment
     AND regexp_like(dimensions['build_id'], '^\d{14}$') -- validate build IDs
     AND dimensions['build_id'] > '20160201000000' -- only in the date range that we care about
   GROUP BY dimensions['build_id'], dimensions['os_version']
)
SELECT cast(parse_datetime(build_id, 'yyyyMMddHHmmss') as date) as build_id, -- program build date
       usage_kilohours, -- thousands of usage hours
       os_version, -- os version
       main_crashes / usage_kilohours AS main_crash_rate -- crash rate being defined as crashes per thousand usage hours
FROM channel_rates
WHERE usage_kilohours > 100 -- only aggregates that have statistically significant usage hours
ORDER BY build_id ASC

Sampling

Invalid Pings

We ignore invalid pings in our processing. Invalid pings are defined as those that:

  • The submission dates or activity dates are invalid or missing.
  • The build ID is malformed.
  • The docType field is missing or unknown.
  • The ping is a main ping without usage hours or a crash ping with usage hours.

Scheduling

The crash_aggregates job is run daily, at midnight UTC. The job is scheduled on Airflow. The DAG is here

Schema

The crash_aggregates table has 4 commonly-used columns:

  • submission_date is the date pings were submitted for a particular aggregate.
    • For example, select sum(stats['usage_hours']) from crash_aggregates where submission_date = '2016-03-15' will give the total number of user hours represented by pings submitted on March 15, 2016.
    • The dataset is partitioned by this field. Queries that limit the possible values of submission_date can run significantly faster.
  • activity_date is the day when the activity being recorded took place.
    • For example, select sum(stats['usage_hours']) from crash_aggregates where activity_date = '2016-03-15' will give the total number of user hours represented by activities that took place on March 15, 2016.
    • This can be several days before the pings are actually submitted, so it will always be before or on its corresponding submission_date.
    • Therefore, queries that are sensitive to when measurements were taken on the client should prefer this field over submission_date.
  • dimensions is a map of all the other dimensions that we currently care about. These fields include:
    • dimensions['build_version'] is the program version, like 46.0a1.
    • dimensions['build_id'] is the YYYYMMDDhhmmss timestamp the program was built, like 20160123180541. This is also known as the build ID or buildid.
    • dimensions['channel'] is the channel, like release or beta.
    • dimensions['application'] is the program name, like Firefox or Fennec.
    • dimensions['os_name'] is the name of the OS the program is running on, like Darwin or Windows_NT.
    • dimensions['os_version'] is the version of the OS the program is running on.
    • dimensions['architecture'] is the architecture that the program was built for (not necessarily the one it is running on).
    • dimensions['country'] is the country code for the user (determined using geoIP), like US or UK.
    • dimensions['experiment_id'] is the identifier of the experiment being participated in, such as e10s-beta46-noapz@experiments.mozilla.org, or null if no experiment.
    • dimensions['experiment_branch'] is the branch of the experiment being participated in, such as control or experiment, or null if no experiment.
    • dimensions['e10s_enabled'] is whether E10s is enabled.
    • dimensions['gfx_compositor'] is the graphics backend compositor used by the program, such as d3d11, opengl and simple. Null values may be reported as none as well.
    • All of the above fields can potentially be blank, which means "not present". That means that in the actual pings, the corresponding fields were null.
  • stats contains the aggregate values that we care about:
    • stats['usage_hours'] is the number of user-hours represented by the aggregate.
    • stats['main_crashes'] is the number of main process crashes represented by the aggregate (or just program crashes, in the non-E10S case).
    • stats['content_crashes'] is the number of content process crashes represented by the aggregate.
    • stats['plugin_crashes'] is the number of plugin process crashes represented by the aggregate.
    • stats['gmplugin_crashes'] is the number of Gecko media plugin (often abbreviated GMPlugin) process crashes represented by the aggregate.
    • stats['content_shutdown_crashes'] is the number of content process crashes that were caused by failure to shut down in a timely manner.
    • stats['gpu_crashes'] is the number of GPU process crashes represented by the aggregate.

TODO(harter): https://bugzilla.mozilla.org/show_bug.cgi?id=1361862

As of 2019-10-23, this dataset has been deprecated and is no longer maintained. See Bug 1585539.

Client Count Daily Reference

As of 2019-04-10, this dataset has been deprecated and is no longer maintained. Please use clients_last_seen instead. See Bug 1543518 for more information.

Replacement

We've moved to calculating exact user counts based on clients_last_seen, see DAU and Active DAU for examples.

Introduction

The client_count_daily dataset is useful for estimating user counts over a few pre-defined dimensions.

The client_count_daily dataset is similar to the deprecated client_count dataset except that it is aggregated by submission date and not activity date.

Content

This dataset includes columns for a dozen factors and an HLL variable. The hll column contains a HyperLogLog variable, which is an approximation to the exact count. The factor columns include submission date and the dimensions listed here. Each row represents one combination of the factor columns.

Background and Caveats

It's important to understand that the hll column is not a standard count. The hll variable avoids double-counting users when aggregating over multiple days. The HyperLogLog variable is a far more efficient way to count distinct elements of a set, but comes with some complexity. To find the cardinality of an HLL, use cardinality(cast(hll AS HLL)). To find the union of two HLLs over different dates, use merge(cast(hll AS HLL)). The Firefox ER Reporting Query is a good example to review. Finally, Roberto has a relevant write-up here.
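For example, here is a minimal sketch of counting distinct clients for a single submission date:

SELECT submission_date,
       cardinality(merge(cast(hll AS HLL))) AS client_count
FROM client_count_daily
WHERE submission_date = '20180601'
GROUP BY submission_date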

Accessing the Data

The data is available in Re:dash. Take a look at this example query.

I don't recommend accessing this data from ATMO.

Further Reading

Data Reference

Example Queries

Compute DAU for non-windows clients for the last week

WITH sample AS (
  SELECT
    os,
    submission_date,
    cardinality(merge(cast(hll AS HLL))) AS count
  FROM client_count_daily
  WHERE submission_date >= DATE_FORMAT(CURRENT_DATE - INTERVAL '7' DAY, '%Y%m%d')
  GROUP BY
    submission_date,
    os
)

SELECT
  os,
  -- formatting date as late as possible improves performance dramatically
  date_parse(submission_date, '%Y%m%d') AS submission_date,
  count
FROM sample
WHERE
  count > 10 -- remove outliers
  AND lower(os) NOT LIKE '%windows%'
ORDER BY
  os,
  submission_date DESC

Compute WAU by Channel for the last week

WITH dau AS (
  SELECT
    normalized_channel,
    submission_date,
    merge(cast(hll AS HLL)) AS hll
  FROM client_count_daily
  -- 2 days of lag, 7 days of results, and 6 days preceding for WAU
  WHERE submission_date > DATE_FORMAT(CURRENT_DATE - INTERVAL '15' DAY, '%Y%m%d')
  GROUP BY
    submission_date,
    normalized_channel
),
wau AS (
  SELECT
    normalized_channel,
    submission_date,
    cardinality(merge(hll) OVER (
      PARTITION BY normalized_channel
      ORDER BY submission_date
      ROWS BETWEEN 6 PRECEDING AND 0 FOLLOWING
    )) AS count
  FROM dau
)

SELECT
  normalized_channel,
  -- formatting date as late as possible improves performance dramatically
  date_parse(submission_date, '%Y%m%d') AS submission_date,
  count
FROM wau
WHERE
  count > 10 -- remove outliers
  AND submission_date > DATE_FORMAT(CURRENT_DATE - INTERVAL '9' DAY, '%Y%m%d') -- only days that have a full WAU

Caveats

The hll column does not produce an exact count. hll stands for HyperLogLog, a sophisticated algorithm that allows for the counting of extremely high numbers of items, sacrificing a small amount of accuracy in exchange for using much less memory than a simple counting structure.

When counts are calculated over a column that may change over time, such as total_uri_count_threshold, a client is counted in every group in which they appear. Over longer windows, like MAU, this is more likely to occur.

Scheduling

This dataset is updated daily via the telemetry-airflow infrastructure. The job runs as part of the main_summary DAG.

Schema

The data is partitioned by submission_date which is formatted as %Y%m%d, like 20180130.

As of 2018-03-15, the current version of the client_count_daily dataset is v2, and has a schema as follows:

root
 |-- app_name: string (nullable = true)
 |-- app_version: string (nullable = true)
 |-- country: string (nullable = true)
 |-- devtools_toolbox_opened: boolean (nullable = true)
 |-- e10s_enabled: boolean (nullable = true)
 |-- hll: binary (nullable = true)
 |-- locale: string (nullable = true)
 |-- normalized_channel: string (nullable = true)
 |-- os: string (nullable = true)
 |-- os_version: string (nullable = true)
 |-- top_distribution_id: string (nullable = true)
 |-- total_uri_count_threshold: integer (nullable = true)

1 Day Retention

As of 2019-08-13, this dataset has been deprecated and is no longer maintained. See Bug 1571565 for historical sources. See the retention cookbook for current best practices.

Introduction

The retention table provides client counts relevant to client retention at a 1-day granularity. The project is tracked in Bug 1381840

Contents

The retention table contains a set of attribute columns used to specify a cohort of users and a set of metric columns to describe cohort activity. Each row contains a permutation of attributes, an approximate set of clients in a cohort, and the aggregate engagement metrics.

This table uses the HyperLogLog (HLL) sketch to create an approximate set of clients in a cohort. HLL allows counting across overlapping cohorts in a single pass while avoiding the problem of double counting. This data structure has the benefit of being compact and performant in the context of retention analysis, at the expense of precision. For example, 7-day retention can be obtained by aggregating over a week of retention data using the union operation. With plain SQL primitives, this would instead require recalculating COUNT DISTINCT over client_ids for the 7-day window.

Background and Caveats

  1. The data starts at 2017-03-06, the merge date when Nightly started tracking Firefox 55 in Mozilla-Central. However, there was not a consistent view into the behavior of first-session profiles until the new_profile ping. This means much of the data is inaccurate before 2017-06-26.
  2. This dataset uses a 4-day reporting latency to aggregate at least 99% of the data for a given submission date. This figure is derived from the telemetry-health measurements on submission latency, with the discussion in Bug 1407410. This latency was reduced in Firefox 55 with the introduction of the shutdown ping-sender mechanism.
  3. Caution should be taken before adding new columns. Additional attribute columns will grow the number of rows exponentially.
  4. The number of HLL bits chosen for this dataset is 13. This means the default size of the HLL object is 2^13 bits or 1KiB. This maintains about a 1% error on average. See this table from Algebird's HLL implementation for more details.

Accessing the Data

The data is primarily available through Re:dash on STMO via the Presto source. This service has been configured to use predefined HLL functions.

The column should first be cast to the HLL type. The scalar cardinality(<hll_column>) function will approximate the number of unique items per HLL object. The aggregate merge(<hll_column>) function will perform the set union between all objects in a column.

Example: Cast the count column into the appropriate type.

SELECT cast(hll as HLL) as n_profiles_hll FROM retention

Count the number of clients seen over all attribute combinations.

SELECT cardinality(cast(hll as HLL)) FROM retention

Group-by and aggregate client counts over different release channels.

SELECT channel, cardinality(merge(cast(hll AS HLL)))
FROM retention
GROUP BY channel

The HyperLogLog library wrappers are available for use outside of the configured STMO environment, spark-hyperloglog and presto-hyperloglog.

Also see the client_count_daily dataset.

Data Reference

Example Queries

See the Example Usage Dashboard for more usages of datasets of the same shape.

Scheduling

The job is scheduled on Airflow on a daily basis after main_summary is run for the day. This job requires both mozetl and telemetry-batch-view as dependencies.

Schema

As of 2017-10-10, the current version of retention is v1 and has a schema as follows:

root
 |-- subsession_start: string (nullable = true)
 |-- profile_creation: string (nullable = true)
 |-- days_since_creation: long (nullable = true)
 |-- channel: string (nullable = true)
 |-- app_version: string (nullable = true)
 |-- geo: string (nullable = true)
 |-- distribution_id: string (nullable = true)
 |-- is_funnelcake: boolean (nullable = true)
 |-- source: string (nullable = true)
 |-- medium: string (nullable = true)
 |-- content: string (nullable = true)
 |-- sync_usage: string (nullable = true)
 |-- is_active: boolean (nullable = true)
 |-- hll: binary (nullable = true)
 |-- usage_hours: double (nullable = true)
 |-- sum_squared_usage_hours: double (nullable = true)
 |-- total_uri_count: long (nullable = true)
 |-- unique_domains_count: double (nullable = true)

Code Reference

The ETL script for processing the data before aggregation is found in mozetl.engagement.retention. The aggregate job is found in telemetry-batch-view as the RetentionView.

The runner script performs all the necessary setup to run on EMR. This script can be used to perform backfill.

Churn

As of 2019-08-21, this dataset has been deprecated and is no longer maintained. See Bug 1561048 for historical sources. See the retention cookbook for current best practices.

Introduction

The churn dataset tracks the 7-day churn rate of telemetry profiles. This dataset is generally used for analyzing cohort churn across segments and time.

Content

Churn is the rate of attrition defined by (clients seen in week N)/(clients seen in week 0) for groups of clients with some shared attributes. A group of clients with shared attributes is called a cohort. The cohorts in this dataset are created every week and can be tracked over time using the acquisition_date and the weeks since acquisition or current_week.

The following example demonstrates the current logic for generating this dataset. Each column represents the days since some arbitrary starting date.

| client | 00 | 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | X |   |   |   |   |   |   |   |   | X |   |   |   |   |   |
| B | X | X | X | X | X | X |   |   |   |   |   |   |   |   |   |
| C |   | X |   |   |   |   |   |   |   |   |   |   |   |   | X |

All three clients are part of the same cohort. Client A is retained for weeks 0 and 1 since there is activity in both periods. A client only needs to show up once in the period to be counted as retained. Client B is acquired in week 0 and is active frequently but does not appear in following weeks. Client B is considered churned on week 1. However, a client that is churned can become retained again. Client C is considered churned on week 1 but retained on week 2.

The following table summarizes the daily activity above into a view where every column represents the current week since the acquisition date.

| client | 0 | 1 | 2 |
|---|---|---|---|
| A | X | X |   |
| B | X |   |   |
| C | X |   | X |

The clients are then grouped into cohorts by attributes. An attribute describes a property about the cohort such as the country of origin or the binary distribution channel. Each group also contains descriptive aggregates of engagement. Each metric describes the activity of a cohort such as size and overall usage at a given time instance.

Background and Caveats

The original concept for churn is captured in this Mana page. The original derived data-set was created in bug 1198537. The first major revision (v2) of this data-set added attribution, search, and uri counts. The second major revision (v3) included additional clients through the new-profile ping and adjusted the collection window from 10 to 5 days.

  • Each row in this dataset describes a unique segment of users
    • The number of rows is exponential with the number of dimensions
    • New fields should be added sparingly to account for data-set size
  • The dataset lags by 10 days in order to account for submission latency
    • This value was determined to be the time needed for 99% of main pings to arrive at the server. With the shutdown-ping sender, this has been reduced to 4 days. However, churn_v3 still tracks releases older than Firefox 55.
  • The start of the period is fixed to Sundays. Once it has been aggregated, the period cannot be shifted due to the way clients are counted.
    • A supplementary 1-day retention dataset using HyperLogLog for client counts is available for counting over arbitrary retention periods and date offsets. Additionally, calculating churn or retention over specific cohorts is tractable in STMO with main_summary or clients_daily datasets.

Accessing the Data

churn is available in Re:dash under Athena and Presto. The data is also available in parquet for consumption by columnar data engines at s3://telemetry-parquet/churn/v3.

Data Reference

Example Queries

This section walks through a typical query to generate data suitable for visualization.

| field | type | description |
|---|---|---|
| cohort_date | common, attribute | The start date bucket of the cohort. This is the week the client was acquired. |
| elapsed_periods | common, attribute | The number of periods that have elapsed since the cohort date. In this dataset, the retention period is 7 days. |
| channel | attribute | Part of the release train model. An attribute that distinguishes cohorts. |
| geo | filter attribute | Country code. Used to filter out all countries other than the 'US'. |
| n_profiles | metric | Count of users in a cohort. Use sum to aggregate. |

First the fields are extracted and aliased for consistency. cohort_date and elapsed_periods are common to most retention queries and are useful concepts for building on other datasets.

WITH extracted AS (
    SELECT acquisition_period AS cohort_date,
           current_week AS elapsed_periods,
           n_profiles,
           channel,
           geo
    FROM churn
),

The extracted table is filtered down to the attributes of interest. The cohorts of interest originate in the US and are in the release or beta channels. Note that channel here is the concatenation of the normalized channel and the funnelcake id. Only cohorts appearing after August 6, 2017 are chosen to be in this population.

 population AS (
    SELECT channel,
           cohort_date,
           elapsed_periods,
           n_profiles
    FROM extracted
    WHERE geo = 'US'
      AND channel IN ('release', 'beta')
      AND cohort_date > '20170806'
      -- filter out noise from clients with incorrect dates
      AND elapsed_periods >= 0
      AND elapsed_periods < 12
),

The number of profiles is aggregated by the cohort dimensions. The cohort acquisition date and elapsed periods since acquisition are fundamental to cohort analysis.

 cohorts AS (
     SELECT channel,
            cohort_date,
            elapsed_periods,
            sum(n_profiles) AS n_profiles
     FROM population
     GROUP BY 1, 2, 3
),

The table will have the following structure. The table is sorted by the first three columns for demonstration.

| channel | cohort_date | elapsed_periods | n_profiles |
|---|---|---|---|
| release | 20170101 | 0 | 100 |
| release | 20170101 | 1 | 90 |
| release | 20170101 | 2 | 80 |
| ... | ... | ... | ... |
| beta | 20170128 | 10 | 25 |

Finally, retention is calculated through the number of profiles at the time of the elapsed_period relative to the initial period. This data can be imported into a pivot table for further analysis.

results AS (
    SELECT c.*,
           iv.n_profiles AS total_n_profiles,
           (0.0+c.n_profiles)*100/iv.n_profiles AS percentage_n_profiles
    FROM cohorts c
    JOIN (
        SELECT *
        FROM cohorts
        WHERE elapsed_periods = 0
    ) iv ON (
        c.cohort_date = iv.cohort_date
        AND c.channel = iv.channel
    )
)

| channel | cohort_date | elapsed_periods | n_profiles | total_n_profiles | percentage_n_profiles |
|---|---|---|---|---|---|
| release | 20170101 | 0 | 100 | 100 | 1.0 |
| release | 20170101 | 1 | 90 | 100 | 0.9 |
| release | 20170101 | 2 | 80 | 100 | 0.8 |
| ... | ... | ... | ... | ... | ... |
| beta | 20170128 | 10 | 25 | 50 | 0.5 |

Obtain the results.

SELECT *
FROM results

You may consider visualizing using cohort graphs, line charts, or pivot tables. See Firefox Telemetry Retention: Dataset Example Usage for more examples.

Scheduling

The aggregated churn data is updated weekly on Wednesday.

Schema

As of 2017-10-15, the current version of churn is v3 and has a schema as follows:

root
 |-- channel: string (nullable = true)
 |-- geo: string (nullable = true)
 |-- is_funnelcake: string (nullable = true)
 |-- acquisition_period: string (nullable = true)
 |-- start_version: string (nullable = true)
 |-- sync_usage: string (nullable = true)
 |-- current_version: string (nullable = true)
 |-- current_week: long (nullable = true)
 |-- source: string (nullable = true)
 |-- medium: string (nullable = true)
 |-- campaign: string (nullable = true)
 |-- content: string (nullable = true)
 |-- distribution_id: string (nullable = true)
 |-- default_search_engine: string (nullable = true)
 |-- locale: string (nullable = true)
 |-- is_active: string (nullable = true)
 |-- n_profiles: long (nullable = true)
 |-- usage_hours: double (nullable = true)
 |-- sum_squared_usage_hours: double (nullable = true)
 |-- total_uri_count: long (nullable = true)
 |-- unique_domains_count_per_profile: double (nullable = true)

Code Reference

The script for generating churn currently lives in mozilla/python_mozetl. The job can be found in mozetl/engagement/churn.

Sync Summary and Sync Flat Summary Reference

Introduction

Note: some of the information in this chapter is a duplication of the info found on this wiki page. You can also find more detailed information about the data contained in the sync ping here

sync_summary and sync_flat_summary are the primary datasets that track the health of sync. sync_flat_summary is derived from sync_summary by unpacking/exploding the engines field of the latter, so they ultimately contain the same data (see below).

Which dataset should I use?

Which dataset to use depends on whether you are interested in per-engine sync success or per-sync sync success (see below). If you are interested in whether a sync failed overall, regardless of which engine may have caused the failure, then you can use sync_summary. Otherwise, if you are interested in per-engine data, you should use sync_flat_summary.

If you aren't sure, or just trying to get acquainted, you should probably just use sync_flat_summary.
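As a starting point, here is a hedged sketch of an overall (per-sync) failure count from sync_summary, assuming failure_reason is null for syncs that succeeded and that the table is partitioned by submission_date_s3 like sync_flat_summary:

SELECT submission_date_s3 AS day,
       COUNT(*) AS total_syncs,
       COUNT(CASE WHEN failure_reason IS NOT NULL THEN 1 END) AS failed_syncs
FROM telemetry.sync_summary
WHERE cast(submission_date_s3 AS integer) >= 20190101
GROUP BY 1
ORDER BY 1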

Data Reference

A note about user IDs

Unlike most other telemetry datasets, these do not contain the profile-level identifier client_id. Because you need to sign up for a Firefox Account in order to use sync, these datasets instead include an anonymised version of the user's Firefox Account user id uid and an anonymised version of their individual devices' device_ids. Put another way, each uid can have many associated device_ids.

Q: Why not include client_id in these datasets so that they can be joined on (e.g.) main_summary?

A: We've had a policy to keep main browser telemetry separate from sync and FxA telemetry. This is in part because FxA uids are ultimately associated with email addresses in the FxA database, and thus a breach of that database in combination with access to telemetry could in theory de-anonymise client-side browser metrics.

Which apps send sync telemetry? What about Fenix?

Currently, Firefox for desktop, Firefox for iOS and Firefox for Android (fennec) all have sync implemented, and they all send sync telemetry. Though there are some differences in the telemetry that each application sends, it all ends up in the sync_summary and sync_flat_summary datasets.

Starting with Fenix, however, sync telemetry will start to be sent through glean. This means that, in all likelihood, Fenix sync telemetry will initially be segregated from existing sync telemetry (one reason is that current sync telemetry is on AWS while glean pings are ingested to GCP).

What's an engine?

Firefox syncs many different types of browser data and (generally speaking) each of these data types is synced by its own engine. When the app triggers a "sync", each engine makes its own determination of what needs to be synced (if anything). Many syncs can happen in a day (dozens or more on desktop, usually fewer on mobile). Telemetry about each sync is logged, and each sync ping (sent once a day, and whenever the user logs in or out of sync) contains information about multiple syncs. The Scala code responsible for creating the sync_summary dataset unpacks each sync ping into one row per sync. The resulting engines field is an array of "engine records": data about how each engine performed during that sync. sync_flat_summary further unpacks/explodes the engines field, creating a dataset with one row per engine record.

Existing engines (engine_name in sync_flat_summary) are listed below with brief descriptions in cases where their name isn't transparent.

Note that not every device syncs each of these engines. They can be disabled individually and some are off by default.

  • addons
  • addresses mailing addresses e.g. for e-commerce; part of form autofill.
  • bookmarks
  • clients non-user-facing list of the sync account's associated devices
  • creditcards this used to be nightly only but was recently removed entirely
  • extension-storage WebExtension storage, in support of the storage.sync WebExtension API.
  • history browsing history.
  • passwords
  • forms saved values in web forms
  • prefs not all prefs are synced
  • tabs note that this is not the same as the "send tab" feature, this is the engine that syncs the tabs you have open across your devices (used to populate the synced tabs sidebar). For data on the send-tab feature use the sync_events dataset.

Example Queries

See this dashboard to get a general sense of what this dataset is typically used for.

Here's an example of a query that will calculate the failure and success rates for a subset of engines per day.

WITH
    counts AS (
        SELECT
          submission_date_s3 AS day,
          engine_name AS engine,
          COUNT(*) AS total,
          /* note that `engine_status` is null on sync success. */
          COUNT(CASE WHEN engine_status IS NOT NULL THEN true ELSE NULL END) AS count_errors,
          COUNT(CASE WHEN engine_status IS NULL THEN true ELSE NULL END) AS count_success
        FROM telemetry.sync_flat_summary
        WHERE engine_name IN ('bookmarks','history','tabs','addons','addresses','passwords','prefs')
        AND cast(submission_date_s3 AS integer) >= 20190101
        GROUP BY 1,2
        ORDER BY 1
    ),

    rates AS (
        SELECT
          day,
          engine,
          total,
          count_errors,
          count_success,
          CAST(count_errors AS double) / CAST(total AS double) * 100 AS error_rate,
          CAST(count_success AS double) / CAST(total AS double) * 100 AS success_rate
        FROM counts
        ORDER BY 1
    )

SELECT * FROM rates

Sampling

These datasets are not sampled. It should be possible to derive a sample_id from the uid, however; this would be worthwhile because querying these datasets over long time horizons is very expensive.
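One possible approach (a sketch only, not an established convention): derive a pseudo sample_id by hashing the uid, so that any given account is consistently in or out of the sample across queries:

SELECT *
FROM telemetry.sync_flat_summary
WHERE cast(submission_date_s3 AS integer) >= 20190101
  -- keep roughly 1% of accounts; this hashing scheme is an illustration, not a standard
  AND abs(from_big_endian_64(xxhash64(to_utf8(uid)))) % 100 = 0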

Scheduling

This dataset is updated daily, shortly after midnight UTC. The job is scheduled on Airflow. The DAG is here.

Sync Summary Schema

root
 |-- app_build_id: string (nullable = true)
 |-- app_display_version: string (nullable = true)
 |-- app_name: string (nullable = true)
 |-- app_version: string (nullable = true)
 |-- app_channel: string (nullable = true)
 |-- uid: string
 |-- device_id: string (nullable = true)
 |-- when: integer
 |-- took: integer
 |-- why: string (nullable = true)
 |-- failure_reason: struct (nullable = true)
 |    |-- name: string
 |    |-- value: string (nullable = true)
 |-- status: struct (nullable = true)
 |    |-- sync: string (nullable = true)
 |    |-- status: string (nullable = true)
 |-- devices: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- id: string
 |    |    |-- os: string
 |    |    |-- version: string
 |-- engines: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- name: string
 |    |    |-- took: integer
 |    |    |-- status: string (nullable = true)
 |    |    |-- failure_reason: struct (nullable = true)
 |    |    |    |-- name: string
 |    |    |    |-- value: string (nullable = true)
 |    |    |-- incoming: struct (nullable = true)
 |    |    |    |-- applied: integer
 |    |    |    |-- failed: integer
 |    |    |    |-- new_failed: integer
 |    |    |    |-- reconciled: integer
 |    |    |-- outgoing: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = false)
 |    |    |    |    |-- sent: integer
 |    |    |    |    |-- failed: integer
 |    |    |-- validation: struct (containsNull = false)
 |    |    |    |-- version: integer
 |    |    |    |-- checked: integer
 |    |    |    |-- took: integer
 |    |    |    |-- failure_reason: struct (nullable = true)
 |    |    |    |    |-- name: string
 |    |    |    |    |-- value: string (nullable = true)
 |    |    |    |-- problems: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = false)
 |    |    |    |    |    |-- name: string
 |    |    |    |    |    |-- count: integer

Sync Flat Summary Schema

root
|-- app_build_id: string (nullable = true)
|-- app_display_version: string (nullable = true)
|-- app_name: string (nullable = true)
|-- app_version: string (nullable = true)
|-- app_channel: string (nullable = true)
|-- os: string
|-- os_version: string
|-- os_locale: string
|-- uid: string
|-- device_id: string (nullable = true)
|-- when: integer
|-- took: integer
|-- failure_reason: struct (nullable = true)
|    |-- name: string
|    |-- value: string (nullable = true)
|-- status: struct (nullable = true)
|    |-- sync: string (nullable = true)
|    |-- status: string (nullable = true)
|-- why: string (nullable = true)
|-- devices: array (nullable = true)
|    |-- element: struct (containsNull = false)
|    |    |-- id: string
|    |    |-- os: string
|    |    |-- version: string
|-- sync_id: string
|-- sync_day: string
|-- engine_name: string
|-- engine_took: integer
|-- engine_status: string (nullable = true)
|-- engine_failure_reason: struct (nullable = true)
|    |-- name: string
|    |-- value: string (nullable = true)
|-- engine_incoming_applied: integer (nullable = true)
|-- engine_incoming_failed: integer (nullable = true)
|-- engine_incoming_new_failed: integer (nullable = true)
|-- engine_incoming_reconciled: integer (nullable = true)
|-- engine_outgoing_batch_count: integer (nullable = true)
|-- engine_outgoing_batch_total_sent: integer (nullable = true)
|-- engine_outgoing_batch_total_failed: integer (nullable = true)
|-- submission_date_s3: string

Firefox Accounts Data

Introduction

This article provides an overview of Firefox Accounts metrics: what is measured and how. See the other articles in this chapter for more details about the specific measurements that are available for analysis.

What is Firefox Accounts?

Firefox Accounts is Mozilla's authentication solution for account-based end-user services and features. At the time of writing, sync is by far the most popular account-relying service. Below is a partial list of current FxA-relying services (by the time you are reading this others will likely have been added; we will endeavor to update the list periodically):

  • Sync
    • Requires FxA.
  • Firefox Send
    • FxA Optional; Required to send large files.
  • Lockwise
    • Requires FxA and sync.
  • AMO
    • For developer accounts; not required by end-users to use or download addons.
  • Pocket
    • FxA is an optional authentication method among others.
  • Monitor
    • Required to receive email alerts. Not required for email scans.
  • Mozilla IAM
    • Optional authentication method among others.

A single account can be used to authenticate with all of the services listed above (though see the note below about Chinese users).

Note that in addition to being the most commonly used relier of FxA, sync is also unique in its integration with FxA - unlike the other reliers in the list above, sync is currently not an FxA oauth client. When someone signs into an oauth client using Firefox, nothing in the browser changes - more specifically, client-side telemetry probes such as FXA_CONFIGURED do not change state. Thus at the present time the only way to measure usage of FxA oauth reliers is to use the FxA server-side measures described below.

One more thing: China runs its own stack for sync, but Chinese sign-ups for oauth reliers still go through the "one and only" oauth server. This means that Chinese users who want to use both sync and an oauth service (e.g. Monitor) will have to register for two accounts. It also means that only metrics for Chinese oauth users will show up in the datasets described below; any sync-related measures will not. At present, you must contact those responsible for maintaining the FxA stack in China for metrics on Chinese sync users.

Metrics Background

Unlike most telemetry described in these docs, FxA metrics are logged server-side. There are many FxA "servers" that handle different aspects of account authentication and management. The metrics of most interest to data analysts are logged by the FxA auth server, content server, and oauth server. Each server writes its metrics into its log stream, and post-processing scripts combine the metrics events from all three servers into datasets that are available in Databricks, BigQuery, STMO and Amplitude.

In general, metrics logged by the FxA auth server reflect authentication events such as account creation, logins to existing accounts, etc. Metrics logged by the FxA content server reflect user interaction and progression through the FxA web UI - form views, form engagement, form submission, etc. The FxA oauth server logs metrics events when oauth clients (Monitor, Lockwise, etc) create and check authentication tokens.

Metrics Taxonomies

There are two overlapping taxonomies or sets of FxA event metrics.

Flow Metrics: these are an older set of metrics events that can be queried through redshift and via the FxA Activity Metrics data source in re:dash. The re:dash import jobs run once a day. See this documentation for a detailed description of the types of flow events that are logged and the tables that contain them (note that this documentation does not list every flow metric, but it is generally still accurate about the ones it describes). Note there are 50% and 10% sampled versions of the major tables, which contain more historical data than their complete counterparts. Complete tables go back 3 months, 50% tables go back 6 months, and 10% tables go back 24 months. Sampling is done at the level of the FxA user id uid (i.e. integer(uid) % 100).

Amplitude Events: FxA started to send metrics events to amplitude circa October 2017. The code responsible for batching events to amplitude over HTTP is run in more-or-less real-time. Amplitude events can be queried through the amplitude UI as well as various tables in BigQuery that maintain copies of the events that are sent to Amplitude. moz-fx-data-derived-datasets.telemetry.fxa_content_auth_events_v1 is probably the easiest BigQuery table to use, though it does not contain email bounce events and (at the time of writing) only contains data starting at 2019-03-01.

Note that the BigQuery ETL jobs run daily while real-time data is accessible through the amplitude UI.
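As a starting point, a hypothetical query against that table is sketched below; the column names (timestamp, event_type) are assumptions, so check the table's schema in the BigQuery console before relying on them:

SELECT
  DATE(timestamp) AS day,
  event_type,
  COUNT(*) AS n
FROM `moz-fx-data-derived-datasets.telemetry.fxa_content_auth_events_v1`
WHERE DATE(timestamp) >= '2019-03-01'
GROUP BY day, event_type
ORDER BY day, n DESC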

FxA's amplitude metrics were originally just re-configured and re-named versions of the flow metrics. However, things have since diverged a bit, and there are now metrics events that only have an amplitude version but no corresponding flow event, and vice-versa. If you are wondering whether a certain event is logged, it's likely you will have to check both data sources.

Generally speaking, one should first try to use the amplitude metrics rather than the flow events for these reasons:

  1. For quick answers to simple questions the amplitude UI is often more efficient than writing SQL.
    • The caveat here is that it can sometimes be too easy to build a chart in amplitude - it doesn't exactly encourage the careful consideration that having to write a query out by hand implicitly encourages.
  2. By-country data is currently not available in redshift.
  3. There have been outages in the redshift data that have not affected the amplitude data.
  4. Querying redshift is (generally) slower.

It is also possible to query the FxA server logs directly through BigQuery (ask an FxA team member for access), though virtually all analytics-related questions are more easily answered using the data sources described above.

Attribution of Firefox Accounts

Introduction

Users can create or login to an account through an increasingly large number of relying services and entrypoints. This article describes how we attribute authentications to their point of origin, and documents some of the most frequently trafficked entrypoints (it would not be feasible to list them all, but we will try to update this document when there are substantial changes).

Types of Attribution

We can attribute accounts to the service that they sign up for, as well as the entrypoint that they use to begin the authentication flow. Each service typically has many entrypoints; sync, for example, has web-based entrypoints and browser-based entrypoints (see below).

Service Attribution

There is a variable called service that we use to (1) attribute users to the relying services of FxA that they have authenticated with and (2) attribute individual events to the services they are associated with. Except in the case of sync, service is a mapping from the oauth client_id of the relying service/product to a human readable string. Note that this mapping is currently maintained by hand, and is done after the events have been logged by the server. Currently, mapping to the human-readable service variable is only done for amplitude metrics, where it is treated as a user property. There is also a service variable in the activity_events and flow_metadata re:dash tables (FxA Activity Metrics data source), however it only contains the opaque oauth client_id, not the human-readable string. A table of some of the most common oauth client_ids along with their corresponding service mapping is shown below. This is not a complete list.

service | oauth client_id | Description
lockbox | e7ce535d93522896 | Lockwise App for Android
lockbox | 98adfa37698f255b | Lockwise App for iOS
fenix | a2270f727f45f648 | Sync implementation for Fenix
fx-monitor | 802d56ef2a9af9fa | Firefox Monitor (website)
send | 1f30e32975ae5112 | Firefox Send (website)
send | 20f7931c9054d833 | Firefox Send (android app)
pocket-mobile | 7377719276ad44ee | Pocket Mobile App
pocket-web | 749818d3f2e7857f | Pocket Website
firefox-addons | 3a1f53aabe17ba32 | addons.mozilla.org
amo-web | a4907de5fa9d78fc | addons.mozilla.org (still unsure how this differs from firefox-addons)
screenshots | 5e75409a5a3f096d | Firefox Screenshots (website, no longer supported)
notes | a3dbd8c5a6fd93e2 | Firefox Notes (desktop extension)
notes | 7f368c6886429f19 | Firefox Notes (android app)
fxa-content | ea3ca969f8c6bb0d | Oauth ID used when a user is signing in with cached credentials (i.e. does not have to re-enter username/password) and when the user is logging into the FxA settings page.
mozilla-email-preferences | c40f32fd2938f0b6 | Oauth ID used when a user is signing in to modify their marketing email preferences (e.g., to opt-out.)
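If you are working with a data source that only carries the opaque oauth client_id (for example the service column in the activity_events table mentioned above), the mapping can be approximated by hand in SQL. Below is a sketch, intentionally incomplete and using only the client_ids from the table above:

SELECT
  CASE service
    WHEN 'e7ce535d93522896' THEN 'lockbox (android)'
    WHEN '98adfa37698f255b' THEN 'lockbox (ios)'
    WHEN '802d56ef2a9af9fa' THEN 'fx-monitor'
    ELSE service  -- fall back to the opaque client_id
  END AS service_name,
  COUNT(*) AS n
FROM activity_events
GROUP BY 1
ORDER BY n DESC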

In amplitude, there is also a fxa_services_used user property which maintains an array of all the services a user has authenticated with.

Some amplitude charts segmenting by service can be found here.

Funnel Attribution (entrypoint and utm parameters)

We can also attribute users to where they began the authentication process, be it from a website or an application. Attribution is done through query parameters appended to links that point at accounts.firefox.com (which hosts the actual authentication process). These parameters are logged along with any metrics events that the user generates during the authentication flow. The table below lists the query parameters that are currently in use, along with the values associated with some of the most common funnels. Note that only entrypoint is typically logged for flows beginning within the browser. Web-based entrypoints are listed first, followed by entrypoints that are found within the browser chrome itself.

See this documentation for more implementational detail on utm/entrypoint parameters.

entrypoint | utm parameters | Description & Notes
activity-stream-firstrun | utm_source = activity-stream, utm_campaign = firstrun, utm_medium = referral or email | The about:welcome page that is shown to new profiles on browser firstrun. utm_term is sometimes used to track variations for experiments.
firstrun (not supported for current versions) | utm_source = firstrun | This is the old version of the firstrun page that was hosted on the web as part of mozilla.org (example). Starting with Firefox version 62, it was replaced by an in-browser version (see row above). Although it is not used for newer versions, it is still hosted for the sake of e.g. profiles coming through the dark funnel on older versions.
mozilla.org-whatsnewXX | utm_source = whatsnewXX, utm_campaign = fxa-embedded-form, utm_content = whatsnew, utm_medium = referral or email | Where XX = the browser version, e.g. 67 (example). The "what's new" page that is shown to users after they upgrade browser versions. Important notes: (1) Users who are signed into a Firefox account have a different experience than those that are signed out. Signed-in users typically see a promotion of FxA-relying services, while signed-out users see a Call to Action to create an account. (2) The attribution parameters for this page were standardized starting on version 66. Previous values for entrypoint include whatsnew and mozilla.org-wnp64 - these values should be used when doing historical analysis of versions prior to 66.
new-install-page (current), firefox-new (previously) | varies (can contain values passed through by referrals) | example. The "install Firefox" page. This page doesn't always promote FxA and it will often only promote it to a certain % of traffic or to certain segments.
fxa-discoverability-native | NA | The in-browser toolbar icon. This was introduced with version 67.0
menupanel | NA | The in-browser account item in the "hamburger" menu on desktop (three-line menu in the upper right corner) as well as the sync/FxA menu item on android and iOS.
preferences | NA | The "sign into sync" button found in the sync section in desktop preferences.
synced-tabs | NA | The "sign into sync" button found in the synced-tabs section under the library menu.
sendtab | NA | The "sign into sync" button found in the "send tab to device" menu accessible by right-clicking on a tab.
lockbox-addon | NA | The "sign into sync" button found within the Lockwise desktop extension. This is likely to change once Lockwise becomes fully integrated into the browser.

Example amplitude charts: registrations by entrypoint, logins by entrypoint, registrations by utm_source.

Firefox Account Funnels

Introduction

There are two primary "funnels" that users step through when authenticating with FxA. The registration funnel reflects the steps required for a new FxA user (or more precisely, email address) to create an account. The login funnel reflects the steps necessary for an existing FxA user to sign into their account.

We are also in the process of developing funnels for paying subscribers. We will add documentation on that once the work is closer to complete.

Registration Funnel

While there are some variations, the typical registration funnel consists of the steps described in the chart below. Except where noted, these events are emitted by the FxA content server.

Step | Amplitude Event | Flow Event | Description
1 | fxa_email_first - view | flow.enter-email.view | View (impression) of the form that the user enters their email address into to start the process. Note that this form can be hosted by FxA, or hosted by the relying party. In the latter case, the relier is responsible for handing the user's email address off to the FxA funnel. See "counting top of funnel events" below.
2 | fxa_reg - view | flow.signup.view | View of the registration form. If the user got to this step via step 1, FxA has detected that their email address is not present in the DB, and thus a new account can be created. The user creates their password and enters their age.
3 | fxa_reg - engage | flow.signup.engage | A user focuses/clicks on one of the registration form fields.
4 | fxa_reg - submit | flow.signup.submit | A user submits the registration form (could be unsuccessfully).
5 | fxa_reg - created | account.created | This event is emitted by the auth server. It indicates that the user has entered a valid email address and password, and that their account has been created and added to the DB. However, the account is still "unverified" at this point and therefore not accessible by the user.
6 | fxa_email - sent (email_type = registration) | email.verification.sent | An email is sent to the user to verify their new account. Depending on the service, it either contains a verification link or a verification code that the user enters into the registration form to verify their email address.
7 | fxa_reg - cwts_view | flow.signup.choose-what-to-sync.view | User views the "choose what to sync" screen, which allows the user to select what types of browser data they want to synchronize. Note that the user is not required to submit this page - if they do not take any action then all the data types will be synced by default. Thus you may not want to include this (and the following two events) in your funnel analysis if you do not care about the user's actions here.
8 | fxa_reg - cwts_engage | Not Implemented | User clicks on the "choose what to sync" screen.
9 | fxa_reg - cwts_submit | Not Implemented | User submits the "choose what to sync" screen. See also the amplitude user property sync_engines which stores which data types the user selected.
10 | fxa_email - click | email.verify_code.clicked | A user has clicked on the verification link contained in the email sent in step 6. Note this only applies to cases where a clickable link is sent; for reliers that use activation codes, this event will not be emitted (so be aware of this when constructing your funnels).
11 | fxa_reg - email_confirmed | account.verified | This event is emitted by the auth server. A user has successfully verified their account. They should now be able to use it.
12 | fxa_reg - complete | flow.complete | The account registration process is complete. Note there are NO actions required of the user to advance from step 11 to step 12; there should be virtually no drop-off there. The flow event is identical for registration and login.

See this chart for an example of how this funnel can be constructed for the firstrun (about:welcome) page in amplitude. Here is a version in re:dash using the flow events.

The chart above provides the most detailed version of the registration funnel that can currently be constructed. However, it should not be considered the "canonical" version of the funnel - depending on the question it may make sense to omit some of the steps. For example, at the time of writing some browser entrypoints (e.g. menupanel) link directly to step 2 and skip the initial email form. Having both steps 7 and 8 may also be redundant in some cases, etc. Also, as noted above, you may want to omit the "choose what to sync" steps if you do not care about the users' actions there.
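If you prefer SQL over the amplitude UI, a funnel like this can also be sketched over the flow events by counting how many distinct flows reached each step. The flow_events table and its flow_id, type, and timestamp columns here are assumptions based on the flow-metrics documentation linked above, so verify them against the FxA Activity Metrics data source before use:

SELECT
  COUNT(DISTINCT CASE WHEN type = 'flow.enter-email.view' THEN flow_id END) AS step_1_email_view,
  COUNT(DISTINCT CASE WHEN type = 'flow.signup.view'      THEN flow_id END) AS step_2_reg_view,
  COUNT(DISTINCT CASE WHEN type = 'flow.signup.engage'    THEN flow_id END) AS step_3_reg_engage,
  COUNT(DISTINCT CASE WHEN type = 'flow.signup.submit'    THEN flow_id END) AS step_4_reg_submit,
  COUNT(DISTINCT CASE WHEN type = 'account.created'       THEN flow_id END) AS step_5_created,
  COUNT(DISTINCT CASE WHEN type = 'account.verified'      THEN flow_id END) AS step_11_verified,
  COUNT(DISTINCT CASE WHEN type = 'flow.complete'         THEN flow_id END) AS step_12_complete
FROM flow_events
WHERE timestamp > current_date - interval '7 days'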

Login Funnel

The login funnel describes the steps required for an existing FxA user to login to their account. With some exceptions, most of the steps here are parallel to the registration funnel (but named differently).

Users must confirm their login via email in the following cases:

  1. A user is logging into sync with an account that is more than 4 hours old.
  2. A user is logging into an oauth relier that uses encryption keys (e.g., Firefox send), if the user had not logged into their account in the previous 72? (check this) hours.

Step | Amplitude Event | Flow Event | Description
1 | fxa_email_first - view | flow.enter-email.view | Similar to the registration funnel, a view (impression) of the form that the user enters their email address into to start the process. Note that this form can be hosted by FxA, or hosted by the relying party. In the latter case, the relier is responsible for handing the user's email address off to the FxA funnel. See "counting top of funnel events" below.
2 | fxa_login - view | flow.signin.view | View of the login form. If the user got to this step via step 1, FxA has detected that their email address IS present in the DB, and thus an existing account can be logged into. The user enters their password on this form.
3 | fxa_login - engage | flow.signup.engage | A user focuses/clicks on the login form field.
4 | fxa_login - submit | flow.signup.submit | A user submits the login form (could be unsuccessfully).
5 | fxa_login - success | account.login | This event is emitted by the auth server. It indicates that the user has submitted the correct password. However, in some cases the user may still have to confirm their login via email (see above).
6 | fxa_email - sent (email_type = login) | email.confirmation.sent | An email is sent to the user to confirm the login. Depending on the service, it either contains a confirmation link or a verification code that the user enters into the login form.
7 | fxa_email - click | email.verify_code.clicked | A user has clicked on the confirmation link contained in the email sent in step 6. Note this only applies to cases where a clickable link is sent; for reliers that use confirmation codes, this event will not be emitted (so be aware of this when constructing your funnels). Note that this event is identical to its counterpart in the registration funnel.
8 | fxa_login - email_confirmed | account.confirmed | This event is emitted by the auth server. A user has successfully confirmed the login via email.
9 | fxa_login - complete | flow.complete | The login process is complete. Note there are NO actions required of the user to advance from step 8 to step 9; there should be virtually no drop-off there. The flow event is identical for registration and login.

See this chart for an example of how this funnel can be constructed for the firstrun (about:welcome) page. Here is a version in re:dash using the flow events.

Note again that you may want to check whether the service you are analyzing requires email confirmation on login.

Branches off the Login Funnel: Password Reset, Account Recovery, 2FA.

Some additional funnels are "branches" off the main login funnel above:

  1. The password reset funnel
  • Optionally - the user resets their password with a recovery key
  2. Login with 2FA (TOTP)
  • Optionally - user uses a 2FA recovery code to login to their 2FA-enabled account (e.g. if they misplace their second factor.)

Password Reset and Recovery Codes

Users can click "Forgot Password?" during sign-in to begin the password reset process. The funnel is described in the chart below.

An important "FYI" here: passwords are used to encrypt accounts' sync data. This implies a bad scenario where a change of password can lead to loss of sync data, if there are no longer any devices that can connect to the account and re-upload/restore the data after the reset occurs. This would happen, for example, if you only had one device connected to sync, lost the device, then tried to login to a new device to access your synced data. If you do a password reset while logging into the second device, the remote copy of your sync data will be overwritten (with whatever happens to be on the second device).

Thus the account recovery keys. If a user (1) generates a recovery key via settings (and stores it somewhere accessible), (2) tries to reset their password, and (3) enters a valid recovery key during the password reset process, sync data can be restored without risking the "bad scenario" above.

Password Reset Funnel Without Recovery Key

Note: There may be other places where a user can initiate the password reset process, but I think it's most common during login. In any case, the steps starting at step 2 should all be the same.

Step | Amplitude Event | Flow Event | Description
1 | fxa_login - view | flow.signin.view | View of the login form, which contains the "Forgot Password" Link.
2 | fxa_login - forgot_password | flow.signin.forgot-password | User clicks on the "Forgot Password" Link.
3 | Not Implemented | flow.reset-password.view | View of the form asking the user to confirm that they want to reset.
4 | fxa_login - forgot_submit | flow.reset-password.engage, flow.reset-password.submit | User clicks on the button confirming that they want to reset.
5 | fxa_email - delivered (email_template = recoveryEmail) | email.recoveryEmail.delivered | Delivery of the PW reset link to the user via email.
5-a | Not Implemented | flow.confirm-reset-password.view | View of the screen telling the user to confirm the reset via email.
6 | Not Implemented | flow.complete-reset-password.view | User views the form to create a new password. (viewable after clicking the link in the email above)
7 | Not Implemented | flow.complete-reset-password.engage | User clicks on the form to create a new password.
8 | Not Implemented | flow.complete-reset-password.submit | User submits the form to create a new password.
9 | fxa_login - forgot_complete | flow.complete (the auth server also emits account.reset) | User has completed the password reset funnel.

Password Reset Funnel With Recovery Key

Note: we still need to implement amplitude events for the recovery-key part of this funnel. The funnel is identical to the one above up until step 6.

Step | Amplitude Event | Flow Event | Description
1 | fxa_login - view | flow.signin.view | View of the login form, which contains the "Forgot Password" Link.
2 | fxa_login - forgot_password | flow.signin.forgot-password | User clicks on the "Forgot Password" Link.
3 | Not Implemented | flow.reset-password.view | View of the form asking the user to confirm that they want to reset.
4 | fxa_login - forgot_submit | flow.reset-password.engage, flow.reset-password.submit | User clicks on the button confirming that they want to reset.
5 | fxa_email - delivered (email_template = recoveryEmail) | email.recoveryEmail.delivered | Delivery of the PW reset link to the user via email.
5-a | Not Implemented | flow.confirm-reset-password.view | View of the screen telling the user to confirm the reset via email.
6 | Not Implemented | flow.account-recovery-confirm-key.view | User views the form to enter their account recovery key. (viewable after clicking the link in the email above)
7 | Not Implemented | flow.account-recovery-confirm-key.engage | User clicks on the form to enter their account recovery key.
8 | Not Implemented | flow.account-recovery-confirm-key.submit | User submits the form to enter their account recovery key.
9 | Not Implemented | flow.account-recovery-confirm-key.success or flow.account-recovery-confirm-key.invalidRecoveryKey | User submitted a valid (success) or invalid recovery key.
10 | Not Implemented | flow.account-recovery-reset-password.view | User views the form to change their password after submitting a valid recovery key.
11 | Not Implemented | flow.account-recovery-reset-password.view | User clicks on the form to change their password after submitting a valid recovery key.
12 | Not Implemented | flow.account-recovery-reset-password.view | User submits the form to change their password after submitting a valid recovery key.
13 | fxa_login - forgot_complete | flow.complete (the auth server also emits account.reset) | User has completed the password reset funnel.

Login with 2FA (TOTP)

Users can set up two-factor authentication (2FA) for account login. 2FA is implemented via time-based one-time passwords (TOTP). If a user has set up 2FA (via settings), they will be required to enter a passcode generated by their second factor whenever they login to their account.

Users are also provisioned a set of recovery codes as part of the 2FA setup process. These are one-time use codes that can be used to login to an account if a user loses access to their second factor. Note that these 2FA recovery codes are different than the account recovery keys described above.

Login with 2FA/TOTP Funnel (No Recovery Code)

This funnel starts after the fxa_login - success / account.login step of the login funnel.

Step | Amplitude Event | Flow Event | Description
1 | fxa_login - totp_code_view | flow.signin-totp-code.view | View of the TOTP form.
2 | fxa_login - totp_code_engage | flow.signin-totp-code.engage | Click on the TOTP form.
3 | fxa_login - totp_code_submit | flow.signin-totp-code.submit | Submission of the TOTP form.
4 | fxa_login - totp_code_success | flow.signin-totp-code.success | Successful submission of the TOTP form. Auth server also emits totpToken.verified

Login with 2FA/TOTP Funnel w/ Recovery Code

This funnel starts after the user clicks to use a recovery code during the TOTP funnel.

Step | Amplitude Event | Flow Event | Description
1 | fxa_login - totp_code_view | flow.signin-totp-code.view | View of the TOTP form.
2 | Not Implemented | flow.sign_in_recovery_code.view | View of the TOTP recovery code form.
3 | Not Implemented | recoveryCode.verified (auth server) | User submitted a valid recovery code.

Connect Another Device / SMS

Sync is most valuable to users who have multiple devices connected to their account. Thus after a user completes a sync login or registration funnel, they are shown the "connect another device" form. This Call to Action contains a form for a phone number, as well as links to the Google Play and Apple stores where users can download mobile versions of Firefox. If a user submits a valid phone number (associated with a country that our service supports), then we send them an SMS message with links to their mobile phone's app store.

At one point, at least for iOS, the SMS message contained a deep link that pre-filled the user's email address on the sign-in form once they installed the mobile browser. There is some uncertainty about whether this still works...

SMS Funnel

This funnel begins either (1) after a user has completed the login or registration funnel, or (2) if they click on "connect another device" from the FxA toolbar menu within the desktop browser (provided they are signed in). In the latter case the signin segment of the flow event will be omitted.

Step | Amplitude Event | Flow Event | Description
1 | fxa_connect_device - view (connect_device_flow = sms) | flow.signin.sms.view | User viewed the SMS form.
2 | fxa_connect_device - engage (connect_device_flow = sms) | flow.signin.sms.engage | User clicked somewhere on the SMS form.
3 | fxa_connect_device - submit (connect_device_flow = sms) | flow.signin.sms.submit | User submitted the SMS form.
4 | Not Implemented | sms.region.{country_code} | An SMS was sent to a number with the two letter country_code.
5 | Not Implemented | flow.sms.sent.view | User views the message confirming that the SMS has been sent.

The SMS form also contains app store links. If they are clicked, flow events flow.signin.sms.link.app-store.android or flow.signin.sms.link.app-store.ios will be logged.

Connect Another Device Funnel (Non-SMS)

Step | Amplitude Event | Flow Event | Description
1 | fxa_connect_device - view (connect_device_flow = cad) | flow.signin.connect-another-device.view | User viewed the CAD form.
2 | fxa_connect_device - view (connect_device_flow = cad) | flow.signin.connect-another-device.link.app-store.(android|ios) | User clicked on either the android or iOS app store button. In amplitude, use the event property connect_device_os to disambiguate which link was clicked.

Settings

A variety of metrics are logged that reflect user interaction with the settings page (https://accounts.firefox.com/settings). The chart below outlines some of these events (this may not be an exhaustive list).

Amplitude Event | Flow Event | Description
fxa_pref - view | flow.settings.view | User viewed the settings page.
fxa_pref - engage | flow.settings.*.engage | User clicked somewhere on the settings page.
fxa_pref - two_step_authentication_view | flow.settings.two-step-authentication.view | User viewed 2FA settings.
Not Implemented | flow.settings.two-step-authentication.recovery-codes.view | User viewed their 2FA recovery codes. These are viewable one time only, after a user sets up 2FA or after they generate new codes.
Not Implemented | flow.settings.two-step-authentication.recovery-codes.print-option | User clicks to print their 2FA recovery codes.
Not Implemented | flow.settings.two-step-authentication.recovery-codes.download-option | User clicks to download their 2FA recovery codes.
Not Implemented | flow.settings.two-step-authentication.recovery-codes.copy-option | User clicks to copy their 2FA recovery codes to the clipboard (this is fired only when they click the copy button, not if they copy using e.g. a keyboard shortcut).
Not Implemented | flow.settings.change-password.view | User viewed the form to change their password.
fxa_pref - password | settings.change-password.success | User changed their password via settings.
fxa_pref - newsletter (see also user property newsletter_state) | settings.communication-preferences.(optIn|optOut).success | User changed their newsletter email preferences.
Not Implemented | flow.settings.account_recovery.view | User viewed account recovery settings.
Not Implemented | flow.settings.account_recovery.engage | User clicked somewhere in account recovery settings.
Not Implemented | flow.settings.account-recovery.confirm-password.view | User viewed the password form prior to turning on account recovery. (user first has to verify their email address)
Not Implemented | flow.settings.account-recovery.confirm-password.view | User clicked the password form prior to turning on account recovery.
Not Implemented | flow.settings.account-recovery.confirm-password.submit | User submitted the password form prior to turning on account recovery.
Not Implemented | flow.settings.account-recovery.confirm-password.success | User successfully submitted the password form prior to turning on account recovery.
Not Implemented | flow.settings.account-recovery.recovery-key.view | User viewed their recovery key. This is viewable one time only, after a user sets up account recovery, or after they generate a new key.
Not Implemented | flow.settings.account-recovery.recovery-key.print-option | User clicks to print their recovery key.
Not Implemented | flow.settings.account-recovery.recovery-key.download-option | User clicks to download their recovery key.
Not Implemented | flow.settings.account-recovery.recovery-key.copy-option | User clicks to copy their recovery key to the clipboard (this is fired only when they click the copy button, not if they copy using e.g. a keyboard shortcut).
Not Implemented | flow.settings.account-recovery.refresh | User generated a new recovery key.
Not Implemented | flow.settings.clients.view | User viewed the list of clients ("Devices & Apps") connected to their account. AKA the device manager.
Not Implemented | flow.settings.clients.engage | User clicked somewhere in the list of clients connected to their account.
Not Implemented | flow.settings.clients.disconnect.view | User viewed the dialog asking to confirm disconnection of a device.

FxA Email Metrics

Introduction

Users must provide an email address when they sign up for a Firefox Account. Emails are sent to users to confirm authentication, alert them to new sign-ins, and to complete password resets. Users can also opt-in to marketing emails, however metrics for those are not covered in this article.

Events that we track relating to email:

  1. When the email is sent.
  2. If the email bounces.
  3. If the email contains a verification/confirmation link, whether the user clicked on it.

Metrics relating to emails also contain the following properties:

  1. The email service of the recipient.
  2. The email_template - the template of the email that was sent (we currently only track this for sending events, not click events).
  3. The email_type - a broader grouping of many email templates into related categories (see the chart below). The email_template is more specific than the email_type.

Email Templates and Email Types

Only emails sent by the FxA auth server are represented in the table below. TBD on marketing emails.

email_template | email_type | Description & Notes
verifySyncEmail | registration | Sent to users setting up a new sync account. Contains a verification link (user must click it for their account to become functional).
verifyEmail | registration | Sent to users setting up a new NON-sync account. Contains a verification link (user must click it for their account to become functional).
postVerifyEmail | registration | Sent after users confirm their email. Contains instructions for how to connect another device to sync.
verifyTrailheadEmail | registration | Updated version of verifySyncEmail for the trailhead promotion.
postVerifyTrailheadEmail | registration | Updated version of postVerifyEmail for the trailhead promotion.
verificationReminderFirstEmail | registration | If a user does not verify their account within 24 hours, they receive this email with an additional verification link.
verificationReminderSecondEmail | registration | If a user does not verify their account within 48 hours, they receive this email with an additional verification link.
verifyLoginEmail | login | Sent to existing accounts when they try to login to sync. User must click the verification link before the logged-in device can begin syncing.
newDeviceLoginEmail | login | Sent to existing accounts after they have logged into a device that FxA has not previously recognized.
verifyLoginCodeEmail | login | Sent to existing accounts when they try to login to sync, containing a code (rather than a link) the user must enter into the login form. Note that currently the use of confirmation codes is limited to some login contexts only - they are never used for registration.
recoveryEmail | reset_password | After a user opts to reset their password (during login, because they clicked "forgot password"), they receive this email with a link to reset their password (without using a recovery key).
passwordResetEmail | reset_password | Sent to users after they reset their password (without using a recovery key).
postAddAccountRecoveryEmail | account_recovery | Sent to users after they successfully add account recovery capabilities to their account (i.e. after generating recovery codes).
postRemoveAccountRecoveryEmail | account_recovery | Sent to users after they successfully REMOVE account recovery capabilities from their account.
passwordResetAccountRecoveryEmail | account_recovery | After a user resets their password using a recovery key, they receive this email telling them to generate a new recovery key.
passwordChangedEmail | change_password | Sent to users after they change their password via FxA settings (NOT during password reset; they must be logged in to do this).
verifyPrimaryEmail | verify | Sent to users when they request to change their primary email address via settings (is sent to their new email).
postChangePrimaryEmail | change_email | Sent to users after they successfully change their primary email address (is sent to their new email).
verifySecondaryEmail | secondary_email | Sent to users when they add a secondary email address via account settings. Contains a verification link (sent to the secondary email address).
postVerifySecondaryEmail | secondary_email | Sent to users after they successfully verified a secondary email address (sent to the secondary email address).
postRemoveSecondaryEmail | secondary_email | Sent to users after they successfully remove a secondary email address (sent to the secondary email address).
postAddTwoStepAuthenticationEmail | 2fa | Sent to users after they successfully add two-factor authentication (TOTP) to their account.
postRemoveTwoStepAuthenticationEmail | 2fa | Sent to users after they successfully REMOVE two-factor authentication (TOTP) from their account.
postConsumeRecoveryCodeEmail | 2fa | Sent to users after they successfully use a recovery code to login to their account after not being able to use their second factor.
postNewRecoveryCodesEmail | 2fa | Sent to users after they successfully generate a new set of 2FA recovery codes.
lowRecoveryCodesEmail | 2fa | Sent when a user is running low on 2FA recovery codes.

Telemetry Behavior Reference

A brief history of Firefox data collection

This section was originally included in the Project Smoot existing metrics report (Mozilla internal link).

blocklist.xml and Active Daily Installs (ADI)

The blocklist is a mechanism for informing Firefox clients about malicious add-ons, DLLs, and other extension content that should be blocked. The blocklist also notes when hardware acceleration features should be avoided on certain graphics cards. To be effective, the blocklist needs to be updated on a faster cadence than Firefox releases.

The blocklist was first implemented in 2006 for Firefox 2, and reported the app ID and version to the blocklist server.

Several additional variables, including OS version, locale, and distribution, were added to the URL for Firefox 3 in 2008. Being able to count users was already expressed as a priority in the bug comments.

A count of blocklist fetches was used to produce a metric called Active Daily Users, which was renamed to Active Daily Installs (ADI) by 2012.

Work is underway to replace blocklist.xml with a Remote Settings-based mechanism, though it is currently blocked by the risk of interrupting ADI measurement.

ADI is described in more detail in the next chapter.

Telemetry

The earliest telemetry infrastructure landed in Firefox 6, and was driven by engineering needs.

Telemetry was originally opt-out on the nightly and aurora channels, and opt-in otherwise. It originally lacked persistent client identifiers.

Firefox Health Report

The Firefox Health Report (FHR) was specified to enable longitudinal and retention analyses. FHR aimed to enable analyses that were not possible based on the blocklist ping, update ping, telemetry, Test Pilot and crash stats datasets that were already available.

FHR was first implemented in Firefox 20. It was introduced in blog posts by Mitchell Baker and Gilbert Fitzgerald.

To avoid introducing a persistent client identifier, FHR originally relied on a “document ID” system. The client would generate a new UUID (a random, unique ID) for each FHR document, and remember a list of its most recent previous document IDs. While uploading a new FHR document, the client would ask the server to remove its previous documents. The intent was that the server would end up holding at most one document from each user, and longitudinal metrics could be accumulated by the client. This approach proved fragile and was abandoned. A persistent client identifier was implemented for Firefox 30.

Telemetry today

FHR was retired and merged with telemetry to produce the current generation of telemetry data, distinguished as “v4 telemetry” or “unified telemetry.”

Instead of mapping FHR probes directly to telemetry, the unified telemetry design document describes how unified telemetry can answer the questions Mozilla had attempted to answer with FHR. The implementation of unified telemetry and opt-out delivery to the release channel was completed for Firefox 42, in 2015.

Telemetry payloads are uploaded in documents called pings. Several kinds of pings are defined, representing different kinds of measurement. These include:

  • main: activity, performance, technical, and other measurements; the workhorse of Firefox desktop telemetry
  • crash: information about crashes, including stack traces
  • opt-out: a farewell ping sent when a user disables telemetry
  • module: on Windows, records DLLs injected into the Firefox process

and others.

Browser sessions and subsessions are important concepts in telemetry. A session begins when the browser launches and ends, perhaps seconds or days later, when the parent browser process terminates.

A subsession ends

  • when its parent session ends, or
  • at local midnight, or
  • when the telemetry environment changes,

whichever comes first.

The telemetry environment describes the hardware and operating system of the client computer. It can change during a Firefox session when e.g. hardware is plugged into a laptop.

The subsession is the reporting unit for activity telemetry; each main ping describes a single subsession. Activity counters are reset once a subsession ends. Data can be accumulated for analysis by summing over a client’s pings.
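For example, here is a sketch in the SQL dialect used elsewhere in these docs, summing an activity scalar over each client's main pings for a single day; the main_summary column name for the URI-count scalar is an assumption:

SELECT
  client_id,
  SUM(scalar_parent_browser_engagement_total_uri_count) AS total_uri_count
FROM telemetry.main_summary
WHERE submission_date_s3 = '20190101'
GROUP BY client_id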

Telemetry pings can contain several different types of measurements:

  • scalars are integers describing either an event count or a measurement that occurs only once during a subsession; simpleMeasurements are an older, less flexible scalar implementation in the process of being deprecated
  • histograms represent measurements that can occur repeatedly during a subsession; histograms report the count of measurements that fell into each of a set of predefined buckets (e.g. between zero and one, between one and two, etc).
  • events represent discrete events; the time and ordering of the events are preserved, which clarifies sequences of user actions

Data types are discussed in more depth in the telemetry data collection documentation.

Profile behavior

  • Profile Creation
  • Real World Usage
  • Profile History

Profile Creation - The technical part

What is a profile?

All of the changes a user makes in Firefox, like the home page, toolbar configuration, installed add-ons, saved passwords, and bookmarks, are stored in a special folder called a profile. Telemetry stores archived and pending pings in the profile directory, as well as metadata like the client ID.

Every run of Firefox needs a profile. However, a single installation can use multiple profiles for different runs. The profile folder is stored in a separate place from the Firefox program so that, if something ever goes wrong with Firefox, the profile information will still be there.

Firefox also comes with a Profile Manager, a different run mode to create, migrate and delete the profiles.

Profile Behaviors

In order to understand the behavior of users and base analysis on things like the profile creation date, it is essential to understand how a profile is created and identified by the browser. Also, it is important to understand how user actions with and within profiles affect our ability to reason about profiles from a data perspective. This includes resetting or deleting profiles or opting into or out of sending Telemetry data.

The different cases are described in more detail in the following sections.

Profile Creation

There are multiple ways a Firefox profile can be created. Some of these behave slightly differently.

Profiles can be created and managed by the Firefox Profile Manager:

  • New profile on first launch
  • New profile from Profile Manager
  • --createprofile command line argument

Profiles can be created externally and not be managed by the Firefox Profile Manager:

  • --profile command line argument

Managed: First use

When Firefox is opened for the first time after a fresh install, without any prior Firefox profile on disk visible to Firefox, it will create a new profile. Firefox uses "Default User" as the profile name, creates the profile's directory with a random suffix and marks the new profile as default for subsequent starts of Firefox. Read where Firefox stores your profile data.

Managed: Profile Manager creation

The user can create a new profile through the Profile Manager. This can either be done on about:profiles in a running Firefox or by starting Firefox with the --ProfileManager flag. The Profile Manager will ask for a name for the profile and picks a new directory for it. The Profile Manager allows the user to create a new profile from an existing directory (in which case any files will be included) or from scratch (in which case the directory will be created).

The --createprofile flag can be used from the command line and works the same as creating a profile through the Profile Manager.

Unmanaged: Command-line start

Firefox can be started on the command line with a path to a profile directory: firefox --profile path/to/directory. If the directory does not exist it will be created.

A profile created like this will not be picked up by the Profile Manager. Its data will persist after Firefox is closed, but the Profile Manager will not know about it. The profile will not turn up in about:profiles.

Profile Reset

A user can reset the profile (see Refresh Firefox - reset addons and settings). This will copy over most user data to a new directory, keeping things like the history, bookmarks and cookies, but will remove extensions, modified preferences and added search engines.

A profile reset will not change the Telemetry clientID. The date of the most recent profile reset is saved and will be contained in Telemetry pings in the profile.resetDate field.

Profile Deletion

A profile can be deleted through the Profile Manager, which will delete all stored data from disk. The profile can also be deleted by simply removing the profile's directory. We will never know about a deletion. We simply won't see that profile in new Telemetry data anymore.

Uninstalling the Firefox installation will not remove any profile data.

Note: Removing a profile's directory while it is in use is not recommended and will lead to a corrupt state.

Telemetry opt-out

The user can opt out of sending Telemetry data. When the user opts out, Telemetry sends one "optout" ping, containing an empty payload. The local clientID is reset to a fixed value.

When a user opts into sending Telemetry data, a new clientID is generated and used in subsequent pings. The profile itself and the profile creation date are unaffected by this.

Profile Creation Date

The profile creation date is the assumed date of initial profile creation. However, it has proven not to be reliable in all cases. There are multiple ways this date is determined.

Managed: During Profile Creation

When a profile is created explicitly the profile directory is created and a times.json containing a timestamp of the current time is stored inside that profile directory. It is read at later times when the profile creation date is used.

graph TD
A[Start Firefox] -->B[Select profile dir, default or defined]
B --> C{Selected dir exist?}
C --> |No| D[Create directory]
C --> |Yes| E[Write times.json]
D --> E
E --> F[Show Browser]
F --> G[ProfileAge.jsm is called]
G --> J[Read time from times.json]
J --> S[Return creation date]

Unmanaged: Empty profile directory

When --profile path/to/directory is passed on the command line, the directory is created if it does not exist, but no times.json is written. On the first access of the profile creation date (through ProfileAge.jsm) the module will detect that the times.json is missing. It will then iterate through all files in the current profile's directory, reading file creation or modification timestamps. The oldest of these timestamps is then assumed to be the profile creation date and written to times.json. Subsequent runs of Firefox will then use this date.

graph TD
A[Start Firefox --profile path/to/dir] -->H{path/to/dir exist?}
H --> |No| K[Create directory]
K --> F[Show Browser]
H --> |Yes| F
F --> O[ProfileAge.jsm is called]
O --> R{times.json exists?}
R -->|Yes| Q[Read time from times.json]
R -->|No| L[Scan profile dir for oldest file, write to times.json]
L --> S
Q --> S[Return creation date]

Relevant part in the code: nsAppRunner::SelectProfile creating the directory.

Real World Usage

This page backs away from our profile-focused data view and examines what Firefox Desktop usage looks like in the real world. There are many components and layers that exist between a user acquiring and running Firefox, and this documentation will illuminate what those are and how they can affect the meaning of a profile.

Real Life Components of Firefox Desktop Usage

The above image illustrates all the layers that sit between a user acquiring and running Firefox Desktop and the Telemetry pings we receive.

  • 1: The user
    • A human being presumably.
  • 2: The machine
    • The physical hardware running Firefox.
  • 3: The disk image / hard drive
    • A single machine could have separate partitions running different OSes.
    • Multiple machines could run copies of a single disk image
    • Disk images are also used as backups to restore a machine.
  • 4: OS user profile
    • Most operating systems allow users to log into different user profiles with separate user directories (such as a "Guest" account).
    • Usually, Firefox is installed into a system directory that all user profiles will share, but Firefox profiles are saved within the user directories, effectively segregating them.
  • 5: Firefox binary / installer
    • The downloaded binary package or stub installer which installs Firefox into the disk image. Users can get these from our website or one of our managed properties, but they can also acquire them from 3rd-party sources.
    • Our website is instrumented with Google Analytics to track download numbers, but other properties (FTP) and 3rd party sources are not. Google Analytics data is not directly connected to Telemetry data.
    • A user can produce multiple installations from a single Firefox binary / installer. For example, if a user copies it to a USB stick or keeps it in cloud storage, they could install Firefox on multiple machines from a single binary / installer.
  • 6: Firefox installation
    • The installed Firefox program on a given disk image.
    • Since Firefox is usually installed in a system directory, the single installation of Firefox will be shared by all the OS user profiles in the disk image.
    • Stub installers are instrumented with pings to report new install counts; full binaries, however, are not.
  • 7: Firefox profile
    • The profile Firefox uses during a user's session.
    • A user can create multiple Firefox profiles using the Firefox Profile Manager, or by specifying a custom directory to use at startup. More details here.
    • This is the entity that we see in Telemetry. Profiles send pings to Telemetry with a client ID as their identifier.

Desktop Browser Use Cases

Below are the rough categories of Firefox use cases that we know happen in the real world.

Note that these categories are rough approximations and are not necessarily mutually exclusive.

Regular User

What we imagine a typical user to be. Someone who buys a computer, always uses a default OS user profile, downloads Firefox once, installs it, and continues using the default Firefox profile.

In Telemetry, this user would just show up as a single client ID.

Assuming they went through our normal funnel, they should show up once in Google Analytics as a download and once in stub installer pings as a new installation (if they used a stub installer).

Multi-Profile User

A more advanced user, who uses multiple Firefox profiles in their normal, everyday use, but otherwise is pretty 'normal' (uses the same OS user profile, etc.).

In Telemetry, this user would show up as 2 (or more) separate client IDs. We would have no way to know they came from the same computer and user without noticing that the subsessions never overlap and that large portions of the environment (CPU, GPU, displays) are identical, and even that would be no guarantee.

Assuming they went through our normal funnel, they would show up once in Google Analytics as a download and once in stub installer pings as a new installation (if they used a stub installer).

However, any subsequent new Firefox profile creations would not have any corresponding downloads or installations. Note that since Firefox 55, any newly created profile sends a "new-profile" ping.

Shared Computer

A situation where a computer is shared across multiple users and each user uses a different OS user profile. Since Firefox profiles live at the user directory level, each user would have a separate Firefox profile. Note that users logging in under a "Guest" account on most machines fall into this category.

In this case, every user who logged into this one computer with a different OS user profile would show up as a different client ID. We have no way of knowing they came from the same computer.

Furthermore, if the computer wiped the user directory after use, like Guest accounts and university computer labs often do, then they would show up as a new client ID every time they logged in, even if they have used the same computer multiple times. This use case could inflate new profile counts.

Similar to Multi-Profile Users, in this use case there would be only one download event and one install event (assuming the normal funnel and a stub installer), but multiple client IDs.

Cloned Machines

In this case, there are actually multiple users with computers that all share the same disk image at some point.

Think of the situation where the IT staff sets up the computer for a new hire at a company. Instead of going through the trouble of installing all the required programs and setting them up correctly for each computer, they'll do it once on one computer, save the disk image, and simply copy it over each time they need to issue a new machine.

Or think of the case where the IT staff of a library needs to set up 2 dozen machines at once.

In this case, depending on the state of the disk image when it was copied, we could see a separate client ID for each user+machine, or we could see all the user+machines sharing the same client ID.

If the disk image was copied after a Firefox profile was created, then the old user+machine and new user+machine will share the same client ID, and be submitting pings to us concurrently.

If the disk image was copied after the Firefox installation but before an initial Firefox profile was created, then each user+machine will get their own Firefox profile and client ID when they run Firefox for the first time.

As with the Multi-Profile User and Shared Computer case, even though there could be multiple Firefox profiles in this use case, there will only be one download and install event.

Migrations

Type 1: Migrate Disk Image

A user has a backup of their disk image and when they switch to a new computer or their current computer crashes, they simply reboot from the old disk image.

In this case, the old machine and the new machine will just share the same client ID (assuming that the disk was copied after a Firefox profile was created). In fact, it will look exactly like the Cloned Machines case, except that instead of sending pings concurrently, they'll be sending us pings first from the old machine and then from the new machine.

Also, note that their Firefox profile will 'revert' back to the state it was in when the disk image was copied, essentially starting over from the past, and any unsent pings on the image (if they exist) will be resent. For instance, we will see another ping with a profile_subsession_counter (the count of how many subsessions a profile has seen in its history) value that we previously saw some time before.
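
As a hedged sketch (not an official recipe), one way to look for this is to count how many times each counter value shows up per client. The field paths below are assumptions based on the BigQuery main ping table used elsewhere in this document (telemetry.main, counter at payload.info.profile_subsession_counter); adjust the table, field paths, and dates for whatever you actually query:

SELECT
  client_id,
  payload.info.profile_subsession_counter AS counter,
  COUNT(*) AS times_seen  -- more than 1 suggests a restored or cloned profile, or a duplicate submission
FROM
  telemetry.main
WHERE
  DATE(submission_timestamp) BETWEEN '2019-09-01' AND '2019-09-30'
GROUP BY
  client_id,
  counter
HAVING
  COUNT(*) > 1
LIMIT
  10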

Again, there will only be one download and install associated with this use case (assuming normal funnel and stub installer).

Type 2: Migrate OS User Directory

A user has a backup of their OS user directory and copies it to a new machine.

This is similar to Type 1 migration, but instead of copying the entire disk, the user only copies the OS user directory. Since the Firefox profile lives in the OS user directory, the old machine and new machine will share the same client ID.

The only difference is since the Firefox Installation lives in system directories, the client might have to re-download and re-install the browser. However, if they also copy the Firefox binary / installer, there will not be a download event, only an install event.

Type 3: Migrate Firefox Binary / Installer

A user has the Firefox binary or installer saved on their old machine and copies it over to a new machine to install Firefox.

In this case, there will not be a second download event, but there will be an install event, and the new and old machines will have separate client IDs.

Type 4: Migrate Firefox Profile

A user copies their old Firefox profile from their old machine to a new computer, and runs Firefox using the copied Firefox profile.

In this case, since the Firefox profile is being copied over, both the new and the old machine will have profiles with the same client ID. Again, the profile on the new computer will revert back to the point in its history where it was copied. And since the profile contains any unsent Telemetry pings, we may receive duplicated submissions of pings from the same client ID.

If the Firefox binary / installer was downloaded, there will be a download and install event. If it was migrated via USB stick, it will only have an install event.

Profile History

A profile's history is simply the progression of that profile's subsessions over its lifetime. We can see this in our main pings by checking:

  • profile_subsession_counter
    • A counter which starts at 1 on the very first run of a profile and increments for each subsession. This counter will be reset to 1 if a user resets / refreshes their profile.
  • subsession_start_date
    • The date and time at which the subsession starts, truncated to the hour. This field is not always reliable due to local clock skew.
  • previous_subsession_id
    • The ID of the previous subsession. Will be null for the very first subsession, or the first subsession after a user resets / refreshes their profile.
  • subsession_id
    • The ID of the current subsession.
  • submission_date_s3
    • The date we received the ping. This date is sourced from the server's time and is reliable.
  • profile_reset_date
    • The date the profile was reset. Will be null if the profile was not reset.

This is a nice, clean example of a profile history. It has a clear starting ping and progresses linearly, with each subsession connecting to the next via subsession_id. However, because profiles can be shared across machines, restored manually, and so on, strange behaviors can arise (see Real World Usage).
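
As an illustrative sketch, you can pull these fields for a single client and order them by the counter to reconstruct its history. The query assumes the payload.info field paths of the BigQuery telemetry.main table; the client ID below is a hypothetical placeholder:

SELECT
  payload.info.profile_subsession_counter AS counter,
  payload.info.subsession_id,
  payload.info.previous_subsession_id,
  payload.info.subsession_start_date,
  DATE(submission_timestamp) AS submission_date
FROM
  telemetry.main
WHERE
  client_id = '<client_id of interest>'  -- placeholder
  AND DATE(submission_timestamp) BETWEEN '2019-09-01' AND '2019-09-30'
ORDER BY
  counter,
  submission_timestamp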

Profile History Start Conditions

Under normal assumptions, we expect to see the starting ping in a profile's history in our telemetry data. The starting ping is the ping from the profile's very first subsession. We expect this ping to have profile_subsession_counter = 1, a null previous_subsession_id, and a null profile_reset_date.

However, not all profiles appear in our data with a starting ping and instead appear to us mid-history.
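
A rough sketch of surfacing such profiles, under the same assumptions about field paths, is to look for clients whose smallest observed counter in a window is greater than 1. Keep in mind that a limited date window cannot distinguish a genuinely missing beginning from a beginning that simply arrived before the window:

SELECT
  client_id,
  MIN(payload.info.profile_subsession_counter) AS min_counter
FROM
  telemetry.main
WHERE
  DATE(submission_timestamp) BETWEEN '2019-09-01' AND '2019-09-30'
GROUP BY
  client_id
HAVING
  MIN(payload.info.profile_subsession_counter) > 1  -- never saw a counter of 1 in this window
LIMIT
  10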

History Has Beginning

As you can see, this profile starts with a ping where profile_subsession_counter = 1 and previous_subsession_id is null.

History Has No Beginning

In this example, the profile simply appears in our data mid-history, with what is presumably the 25th subsession in its history. Its previous history is a mystery.

Profile History Progression Events

After a profile appears, in 'normal' conditions, there should be a linear, straightforward progression with each subsession linking to the next.

However, the following abnormal events can occur.

History Gap

There is a gap in the profile history.

It's possible this behavior is due to dropped pings.

Here, we see gaps in the history between the 30th, 41st, and 44th pings.
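
One way to surface such gaps, sketched under the same field-path assumptions as the queries above, is to flag clients whose counter jumps by more than one between consecutive observed pings:

SELECT
  client_id,
  counter,
  counter - previous_counter AS jump  -- a jump greater than 1 indicates missing pings
FROM (
  SELECT
    client_id,
    payload.info.profile_subsession_counter AS counter,
    LAG(payload.info.profile_subsession_counter)
      OVER (PARTITION BY client_id ORDER BY payload.info.profile_subsession_counter) AS previous_counter
  FROM
    telemetry.main
  WHERE
    DATE(submission_timestamp) BETWEEN '2019-09-01' AND '2019-09-30'
) AS per_ping
WHERE
  counter - previous_counter > 1
LIMIT
  10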

History Splits

The history of a profile splits, and after a single subsession, there are two (or more) subsessions that link back to it.

This is probably due to cloned machines or disk image restores. Note, after the profile splits, the two branches might continue concurrently or one branch might die while the other continues. It is very hard to distinguish between the different branches of the same profile.

  • Profile begins

  • Profile splits: branch 1

  • Profile splits: branch 2

In this example, the profile history starts normally, but on the 5th ping, the history splits into two branches that seem to progress concurrently.
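
Under the same field-path assumptions, one sketch for spotting splits is to look for a previous_subsession_id that more than one distinct subsession claims as its parent:

SELECT
  client_id,
  payload.info.previous_subsession_id AS parent_subsession,
  COUNT(DISTINCT payload.info.subsession_id) AS children  -- more than 1 means the history branched
FROM
  telemetry.main
WHERE
  DATE(submission_timestamp) BETWEEN '2019-09-01' AND '2019-09-30'
  AND payload.info.previous_subsession_id IS NOT NULL
GROUP BY
  client_id,
  parent_subsession
HAVING
  COUNT(DISTINCT payload.info.subsession_id) > 1
LIMIT
  10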

History Restarts

The history of a profile suddenly starts over, with a brand new starting ping.

  • Profile begins

  • Profile restarts

Here, we see the profile start its history normally, but then begin a new, totally unconnected branch with a starting ping that differs from the original starting ping (different subsession_ids).
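
A sketch for finding restarts, with the same caveats about field paths: look for clients that send more than one distinct starting ping. Note that a profile reset also produces a ping with a counter of 1, so you may additionally want to filter on the profile reset date described above:

SELECT
  client_id,
  COUNT(DISTINCT payload.info.subsession_id) AS starting_pings  -- more than 1 suggests a restarted history
FROM
  telemetry.main
WHERE
  DATE(submission_timestamp) BETWEEN '2019-09-01' AND '2019-09-30'
  AND payload.info.profile_subsession_counter = 1
  AND payload.info.previous_subsession_id IS NULL
GROUP BY
  client_id
HAVING
  COUNT(DISTINCT payload.info.subsession_id) > 1
LIMIT
  10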

History Reruns

(Work in Progress)

How to Order History

(Work in Progress)

Channel Behavior

Telemetry Channel Behavior

In every ping there are two channels:

  • App Update Channel
  • Normalized Channel

Expected Channels

The traditional channels we expect are:

  • release
  • beta
  • nightly
  • esr

App Update Channel

This is the channel reported by the application directly. This could really be anything, but is usually one of the expected release channels listed above.

For BigQuery tables corresponding to Telemetry Ping types, such as main, crash or event, the field here is called app_update_channel and is found in metadata.uri. For example:

SELECT
  metadata.uri.app_update_channel
FROM
  telemetry.main
WHERE
  DATE(submission_timestamp) = '2019-09-01'
LIMIT
  10

Normalized Channel

This field is a normalization of the directly reported channel, and replaces unusual and unexpected values with the string Other. There are a couple of exceptions, notably that variations on nightly-cck-* become nightly. See the relevant code here.

Normalized channel is available in the Telemetry Ping tables as a top-level field called normalized_channel. For example:

SELECT
  normalized_channel
FROM
  telemetry.crash
WHERE
  DATE(submission_timestamp) = '2019-09-01'
LIMIT
  10
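
To see which raw channel strings end up normalized, you can compare the two fields side by side. This is just an illustrative sketch in the same style as the queries above; drop the normalized_channel filter to see all mappings:

SELECT
  metadata.uri.app_update_channel AS app_update_channel,
  normalized_channel,
  COUNT(*) AS pings
FROM
  telemetry.main
WHERE
  DATE(submission_timestamp) = '2019-09-01'
  AND normalized_channel = 'Other'  -- only show channels that were normalized away
GROUP BY
  app_update_channel,
  normalized_channel
ORDER BY
  pings DESC
LIMIT
  20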

Censuses

This section was originally included in the Project Smoot existing metrics report (Mozilla internal link).

ADI and DAU are oft-discussed censuses. This chapter discusses their history and definition.

ADI / Active Daily Installs (blocklist fetches)

ADI, one of Firefox’s oldest client censuses, is computed as the number of conforming requests to the Firefox blocklist endpoint. ADI data is available since July 13, 2008.

It is not possible to opt-out of the blocklist using the Firefox UI, but users can disable the update mechanism by changing preference values.

A blocklist is shipped in each release and updated when Firefox notices that more than 24 hours have elapsed since the last update.

The blocklist request does not contain the telemetry client_id or any other persistent identifiers. Some data about the install are provided as URI parameters:

  • App ID
  • App version
  • Product name
  • Build ID
  • Build target
  • Locale
  • Update channel
  • OS version
  • Distribution
  • Distribution version
  • Number of pings sent by this client for this version of Firefox (stored in the pref extensions.blocklist.pingCountVersion)
  • Total ping count (stored in the pref extensions.blocklist.pingCountTotal)
  • Number of full days since last ping

so subsets of ADI may be queried along these dimensions.

The blocklist is kept up-to-date locally using the UpdateTimerManager facility; the update is scheduled in a manifest and performed by Blocklist#notify.

Upon browser startup, after a delay (30 seconds by default), UpdateTimerManager checks whether any of its scheduled tasks are ready. At each wakeup, the single most-overdue task is triggered, if one exists. UpdateTimerManager then sleeps at least two minutes or until the next task is scheduled.

Failures are ignored.

A visualization of real and detrended ADI is available at Desktop ADI Details: Long-term trend and decomposition. ADI is also plotted in the Mozilla Data Collective.

Telemetry only reports whether blocklist checking is enabled or disabled on the client; there is no data in telemetry about blocklist fetches, age, or update failures.

DAU / Daily Active Users

Firefox DAU is currently computed as the number of unique client_ids observed in main pings received on a calendar day. The DAU count excludes users who have opted out of telemetry.
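
A minimal sketch of that definition, written in the same style as the channel queries above. Official DAU numbers come from maintained derived datasets, so treat this purely as an illustration of the definition:

SELECT
  DATE(submission_timestamp) AS submission_date,
  COUNT(DISTINCT client_id) AS dau  -- unique clients sending a main ping that day
FROM
  telemetry.main
WHERE
  DATE(submission_timestamp) = '2019-09-01'
GROUP BY
  submission_date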

Each main ping describes a single subsession of browser activity.

When and how a ping is sent depends on the reason the subsession ends:

Table 1: When main pings are sent, and why.

| Reason | Trigger | Percent of subsessions [1] | Mechanism |
|--------|---------|----------------------------|-----------|
| shutdown | Browser is closed | 77% | For Firefox 55 or later, sent by Pingsender on browser close unless the OS is shutting down. Otherwise, sent by `TelemetrySendImpl.setup` on the following browser launch. |
| environment-change | The telemetry environment changed | 13% | Sent when the change is detected by `TelemetrySession._onEnvironmentChange`. |
| daily | More than 24 hours have elapsed since the last ping was sent and the time is local midnight | 8% | Sent at local midnight after a random 0-60 minute delay. |
| aborted-session | A session terminates uncleanly (e.g. crash or lost power) | 3% | Sent by the browser on the next launch; the payload to send is written to disk every 5 minutes during an active session and removed by a clean shutdown. |

Coverage pings

The coverage ping (announcement) is a periodic census intended to estimate telemetry opt-out rates.

We estimate that 93% of release channel profiles have telemetry enabled (and are therefore included in DAU).

Engagement metrics

This section was originally included in the Project Smoot existing metrics report (Mozilla internal link).

A handful of metrics have been adopted as engagement metrics, either as censuses of the population or to describe user activity within a session. This chapter aims to describe what those metrics are and how they’re defined.

Engagement metrics

active_ticks

The active_ticks probe is specified to increment once in every 5-second window that a user performs an action that could interact with content or chrome, including mousing over the window while it lacks focus. One additional tick is recorded after the activity stops.

Main pings provide two measurements of active_ticks: a simpleMeasurement and a scalar.

The simpleMeasurement was implemented in Firefox 37 before the launch of unified telemetry, and had previously been implemented for FHR.

The simpleMeasurement was discovered to be resetting incorrectly, which was fixed in Firefox 62.

The scalar (which was not affected by the same bug) was implemented in Firefox 56. The scalar is aggregated into main_summary, but should always be identical to the simpleMeasurement.
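
Because each tick represents a 5-second window, 720 ticks correspond to one hour of activity. The following is a hedged sketch of converting the scalar into active hours; it assumes the scalar is exposed at payload.processes.parent.scalars.browser_engagement_active_ticks in the BigQuery main ping table, so check the schema of whichever table you query:

SELECT
  client_id,
  SUM(payload.processes.parent.scalars.browser_engagement_active_ticks) / 720 AS active_hours  -- 3600 s per hour / 5 s per tick
FROM
  telemetry.main
WHERE
  DATE(submission_timestamp) = '2019-09-01'
GROUP BY
  client_id
LIMIT
  10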

subsession_length

subsession_length is the wall-clock duration of a subsession. subsession_length includes time that the computer was asleep on Windows, but not on OS X or Linux; there is a long-outstanding bug to include sleep time on all platforms.

There is another bug on file to count only the time during which the computer is not asleep.

subsession_length was first implemented with the advent of subsessions, which came with unified telemetry.

total_uri_count

total_uri_count was implemented for Firefox 50.

total_uri_count is intended to capture the number of distinct navigation events a user performs. It includes changes to the URI fragment (i.e. anchor navigation) on the page. It excludes XmlHttpRequest fetches and iframes.

It works by attaching an instance of URICountListener as a TabsProgressListener which responds to onLocationChange events.

Some filters are applied to onLocationChange events:

  • Error pages are excluded.
  • Only top-level pageloads (where webProgress.isTopLevel, documented inline, is true) are counted – i.e., not navigations within a frame.
  • Tab restore events are excluded.
  • URIs visited in private browsing mode are excluded unless browser.engagement.total_uri_count.pbm is true. (The pref has been flipped on for small populations in a couple of short studies, but, for now remains false by default.)
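
As with active_ticks, a hedged sketch of summing the scalar per client and day; it assumes the scalar is exposed at payload.processes.parent.scalars.browser_engagement_total_uri_count, and field paths may differ depending on the table you use:

SELECT
  client_id,
  SUM(payload.processes.parent.scalars.browser_engagement_total_uri_count) AS total_uris  -- navigation events per client for the day
FROM
  telemetry.main
WHERE
  DATE(submission_timestamp) = '2019-09-01'
GROUP BY
  client_id
LIMIT
  10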

unfiltered_uri_count

The unfiltered count, implemented for Firefox 51, differs only in that it includes URIs using protocol specs other than HTTP and HTTPS. It excludes some (but not all) about: pages – the set of “initial pages” defined in browser.js are excluded, but e.g. about:config and about:telemetry are included.

No applications of unfiltered_uri_count have been identified.

  1. Ping reason for main pings observed from Firefox 65 release channel users on February 21, 2019.

About this documentation

Contributing

Structure

Contributing

Documentation is critical to making a usable data platform. When we survey our users, their most common complaint is our lack of documentation. It's important that we improve our documentation as often as possible.

Bug reports

If you see an error in the documentation or want to extend a chapter, please file a bug.

Getting the Raw Documentation

The documentation is intended to be read as HTML at docs.telemetry.mozilla.org. However, we store the documentation in raw text files in the firefox-data-docs repo. To begin contributing to the docs, fork the firefox-data-docs repo.

Building the Documentation

The documentation is rendered with mdBook.

To build the documentation locally, you'll need two additional mdBook preprocessors: mdbook-toc and mdbook-mermaid.

Download the releases for your system, unpack them, and place the binaries in a directory on your $PATH.

If you have rustc already installed, you can install a pre-compiled binary directly:

curl -LSfs https://japaric.github.io/trust/install.sh | sh -s -- --git badboy/mdbook-toc
curl -LSfs https://japaric.github.io/trust/install.sh | sh -s -- --git badboy/mdbook-mermaid

This will place mdbook-toc and mdbook-mermaid into ~/.cargo/bin. Make sure this directory is on your $PATH, or copy the binaries to a directory that is.

You can also build and install the preprocessors:

cargo install mdbook-toc
cargo install mdbook-mermaid

You can then serve the documentation locally with:

mdbook serve

The complete documentation for the mdBook toolchain is available online at https://rust-lang.github.io/mdBook/. If you run into any technical limitations, let @harterrt or @badboy know. We are happy to change the tooling to make it as much fun as possible to write.

Adding a new article

Be sure to link to your new article from SUMMARY.md, or mdBook will not render the file.

The structure of the repository is outlined in this article.

This documentation is under active development, so we may already be working on the documentation you need. Take a look at this bug component to check.

Style Guide

Articles should be written in Markdown. mdBook uses the CommonMark dialect.

Limit lines to 100 characters where possible. Try to split lines at the end of sentences, or use Semantic Line Breaks. This makes it easier to reorganize your thoughts later.

This documentation is meant to be read digitally. Keep in mind that people read digital content much differently than other media. Specifically, readers are going to skim your writing, so make it easy to identify important information.

Use visual markup like bold text, code blocks, and section headers. Avoid long paragraphs. Short paragraphs that each describe one concept make it easier to find important information.

Spell checking

Articles should use proper spelling, and pull requests will be automatically checked for spelling errors.

Technical articles often contain words that are not recognized by common dictionaries. If this happens, you may either put specialized terms in code blocks or add an exception to the .spelling file in the code repository.

For things like dataset names or field names, code blocks should be preferred. Things like project names or common technical terms should be added to the .spelling file.

To run the spell checker locally, install the markdown-spellcheck library, then run the scripts/spell_check.sh script from the root of the repository.

You may also remove the --report parameter to begin an interactive fixing session. In this case, it is highly recommended to also add the --no-suggestions parameter, which greatly speeds things up.

Link checking

Any web links should be valid. A dead link might not be your fault, but you will earn a lot of good karma by fixing a dead link!

To run the link checker locally, install the markdown-link-check library, then run the scripts/link_check.sh script from the root of the repository.

Supported Plugins

Mermaid

You may use mermaid.js diagrams in code blocks:

graph LR
  you -->|write|docs
  docs --> profit!

Which will be rendered as:

graph LR
  you -->|write|docs
  docs --> profit!

Review

Once you're happy with your contribution, please open a PR and flag @harterrt for review. Please squash your changes into meaningful commits and follow these commit message guidelines.

Publishing

The documentation is hosted on Github Pages.

Updates to the documentation are automatically published to docs.telemetry.mozilla.org when changes are merged.

To publish to your own fork of this repo, changes need to be pushed manually. Use the deploy script to publish new changes.

This script depends on ghp-import.

Keep in mind that this will deploy the docs to your origin repo. If you're working from a fork (which you should be), deploy.sh will update the docs hosted from your fork - not the production docs.

Colophon

This document's structure is heavily influenced by Django's Documentation Style Guide.

You can find more context for this document in this blog post.

Documentation Structure

The directory structure is meant to feel comfortable for those familiar with the data platform:

.
|--src
   |--datasets - contains dataset level documentation
   |--tools - contains tool level documentation
   |--concepts - contains tutorials meant to introduce a new concept to the reader
   |--cookbooks - focused code examples for reference

The prose documentation is meant to take the reader from beginner to expert. To this end, the rendered documentation has an order different from the directory structure:

  • Getting Started: Get some simple analysis completed so the user understands the amount of work involved / what the product feels like
  • Tutorials
    • Data Tutorials: tutorials meant to give the reader a complete understanding of a specific dataset. Start with a high level overview, then move on to completely document the data including Data source, Sampling, Common Issues, and where the reader can find the code.
    • Tools tutorials: Tutorials meant to introduce a single data tool or analysis best practice.
  • Cookbooks
  • Reference material - TBD