This document will help you find the best data source for a given analysis.
This guide focuses on descriptive datasets and does not cover experimentation. For example, this guide will help if you need to answer questions like:
- How many users do we have in Germany, and how many crashes do we see per day?
- How many users have a given addon installed?
If you're interested in figuring out whether there's a causal link between two events, take a look at our tools for experimentation.
- Raw Pings
- Main Ping Derived Datasets
- Other Datasets
- Obsolete Datasets
We receive data from our users via pings. There are several types of pings, each containing different measurements and sent for different purposes. To review a complete list of ping types and their schemata, see this section of the Mozilla Source Tree Docs.
Pings are also described by a JSONSchema specification which can be found in the
There are a few pings that are central to delivering our core data collection primitives (Histograms, Events, Scalars) and for keeping an eye on Firefox behaviour (Environment, New Profiles, Updates, Crashes).
For instance, a user's first session in Firefox might have four pings like this:
The "main" ping is the workhorse of the Firefox Telemetry system. It delivers the Telemetry Environment as well as Histograms and Scalars for all process types that collect data in Firefox. It has several variants each with specific delivery characteristics:
| Reason | Sent when | Notes |
|---|---|---|
| shutdown | Firefox session ends cleanly | Accounts for about 80% of all "main" pings. Sent by Pingsender immediately after Firefox shuts down, subject to conditions: Firefox 55+, if the OS isn't also shutting down, and if this isn't the client's first session. If Pingsender fails or isn't used, the ping is sent by Firefox at the beginning of the next Firefox session. |
| daily | It has been more than 24 hours since the last "main" ping, and it is around local midnight | In long-lived Firefox sessions we might go days without receiving a "shutdown" ping. Thus the "daily" ping is sent to ensure we occasionally hear from long-lived sessions. |
| environment-change | Telemetry Environment changes | Sent immediately by Firefox when triggered. Installing or removing an addon or changing a monitored user preference are common ways for the Telemetry Environment to change. |
| aborted-session | Firefox session doesn't end cleanly | Sent by Firefox at the beginning of the next Firefox session. |
It was introduced in Firefox 38.
The "first-shutdown" ping is identical to the "main" ping with reason "shutdown" created at the end of the user's first session, but sent with a different ping type. This was introduced when we started using Pingsender to send shutdown pings as there would be a lot of first-session "shutdown" pings that we'd start receiving all of a sudden.
It is sent using Pingsender.
It was introduced in Firefox 57.
The "event" ping provides low-latency eventing support to Firefox Telemetry. It delivers the Telemetry Environment, Telemetry Events from all Firefox processes, and some diagnostic information about Event Telemetry. It is sent every hour if any events have been recorded, and up to once every 10 minutes (governed by a preference) if the maximum event limit for the ping (1000 per process by default, governed by a preference) is reached before the hour is up.
It was introduced in Firefox 62.
Firefox Update is the most important means we have of reaching our users with the latest fixes and features. The "update" ping notifies us when an update is downloaded and ready to be applied (reason: "ready") and when the update has been successfully applied (reason: "success"). It contains the Telemetry Environment and information about the update.
It was introduced in Firefox 56.
When a user starts up Firefox for the first time, a profile is created. Telemetry marks the occasion with the "new-profile" ping which sends the Telemetry Environment. It is sent either 30 minutes after Firefox starts running for the first time in this profile (reason: "startup") or at the end of the profile's first session (reason: "shutdown"), whichever comes first. "new-profile" pings are sent immediately when triggered. Those with reason "startup" are sent by Firefox. Those with reason "shutdown" are sent by Pingsender.
It was introduced in Firefox 55.
The "crash" ping provides diagnostic information whenever a Firefox process exits abnormally. Unlike the "main" ping with reason "aborted-session", this ping does not contain Histograms or Scalars. It contains a Telemetry Environment, Crash Annotations, and Stack Traces.
It was introduced in Firefox 40.
It was introduced in Firefox 72, replacing the "optout" ping (which was in turn introduced in Firefox 63).
Pingsender is a small application shipped with Firefox which attempts to send pings even if Firefox is not running. If Firefox has crashed or has already shut down we would otherwise have to wait for the next Firefox session to begin to send pings.
Pingsender was introduced in Firefox 54 to send "crash" pings. It was expanded to send "main" pings of reason "shutdown" in Firefox 55 (excepting the first session). It has sent the "first-shutdown" ping since that ping's introduction in Firefox 57.
The large majority of analyses can be completed using only the main ping. This ping includes histograms, scalars, and other performance and diagnostic data.
Few analyses actually rely directly on any raw ping data. Instead, we provide derived datasets which are processed versions of these data, made to be:
- Easier and faster to query
- Organized to make the data easier to analyze
- Cleaned of erroneous or misleading data
Before analyzing raw ping data, check to make sure there isn't already a derived dataset made for your purpose. If you do need to work with raw ping data, be aware that the volume of data can be high. Try to limit the size of your data by controlling the date range, and start off using a sample.
The main ping contains most of the measurements used to track performance and health of Firefox in the wild. This ping includes histograms, scalars, and events.
This section describes the derived datasets we provide to make analyzing this data easier.
The clients_daily table is intended as the first stop for asking questions
about how people use Firefox. It should be easy to answer simple questions.
Each row in the table is a (
submission_date) and contains a
number of aggregates about that day's activity.
Many questions about Firefox take the form "What did clients with
characteristics X, Y, and Z do during the period S to E?" The
clients_daily table is aimed at answering those questions.
The clients_daily table is accessible through re:dash using the
Telemetry (BigQuery) data source.
Here's an example query.
The clients_last_seen dataset is useful for efficiently determining exact
user counts such as DAU and MAU.
It does not use approximations, unlike the HyperLogLog algorithm used in the
and it includes the most recent values in a 28 day window for all columns in
This dataset should be used instead of
submission_date this dataset contains one row per
that appeared in
clients_daily in a 28 day window including
submission_date and preceding days.
The days_since_seen column indicates the difference between
and the most recent
clients_daily where the
appeared. A client observed on the given
submission_date will have
days_since_seen = 0.
The other days_since_ columns use the most recent date in
a certain condition was met. If the condition was not met for a
a 28 day window
NULL is used. For example
days_since_visited_5_uri uses the
scalar_parent_browser_engagement_total_uri_count_sum >= 5. These
columns can be used for user counts where a condition must be met on any day
in a window instead of using the most recent values for each
The days_seen_bits field stores the daily history of a client in the 28 day
window. The daily history is converted into a sequence of bits, with a 1 for
the days a client is in clients_daily and a 0 otherwise, and this sequence
is converted to an integer. A tutorial on how to use these bit patterns to
create filters in SQL can be found in
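As a rough illustration of the bit pattern described above, the following Python sketch packs a 28 day history into an integer and reads it back. It assumes that the least significant bit stands for the most recent day; that ordering is a convention picked here for illustration, and the helper names are hypothetical rather than part of any Mozilla library.

```python
WINDOW = 28  # number of days covered by the bit pattern


def encode_days_seen(days_ago_seen):
    """Pack 'days ago' offsets (0 = submission_date) into an integer.

    Assumption: bit 0 is the most recent day and bit 27 the oldest.
    """
    bits = 0
    for days_ago in days_ago_seen:
        if 0 <= days_ago < WINDOW:
            bits |= 1 << days_ago
    return bits


def days_since_seen(bits):
    """Offset of the most recent active day, or None if no bit is set."""
    if bits == 0:
        return None
    # bits & -bits isolates the lowest set bit
    return (bits & -bits).bit_length() - 1


def seen_in_last(bits, n_days):
    """True if the client was active on any of the last n_days days."""
    mask = (1 << n_days) - 1
    return (bits & mask) != 0


# A client active 3, 10 and 27 days before submission_date
history = encode_days_seen({3, 10, 27})
active_this_week = seen_in_last(history, 7)
```

Filters such as "active in the last 7 days" then reduce to a single mask comparison instead of a scan over 28 daily rows.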
The rest of the columns use the most recent value in
User counts generated using
days_since_seen only reflect the most recent
clients_daily for each
client_id in a 28 day window. This means
as defined cannot be efficiently calculated using
days_since_seen because if
client_id appeared every day in February and only on February 1st had
scalar_parent_browser_engagement_total_uri_count_sum >= 5 then it would only
be counted on the 1st, and not the 2nd-28th. Active MAU can be efficiently and
correctly calculated using
MAU can be calculated over a
GROUP BY submission_date[, ...] clause using
COUNT(*), because there is exactly one row in the dataset for each
client_id in the 28 day MAU window for each
User counts generated using
days_since_seen can use
SUM to reduce groups,
because a given
client_id will only be in one group per
if MAU were calculated by
channel, then the sum of the MAU for
country would be the same as if MAU were calculated only by
The data is available in Re:dash and BigQuery. Take a look at this full running example query in Re:dash.
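The counting rules above can be sketched in Python. The rows are made-up stand-ins for a single submission_date of clients_last_seen; only the column names client_id, country and days_since_seen come from this page, and the 7 day threshold used for weekly users is an assumption.

```python
from collections import Counter

# Hypothetical rows for one submission_date: exactly one row per client_id
rows = [
    {"client_id": "a", "country": "DE", "days_since_seen": 0},
    {"client_id": "b", "country": "DE", "days_since_seen": 5},
    {"client_id": "c", "country": "US", "days_since_seen": 0},
    {"client_id": "d", "country": "US", "days_since_seen": 20},
]


def user_count(rows, max_days_since_seen):
    """Count clients whose most recent activity is recent enough."""
    return sum(1 for r in rows if r["days_since_seen"] <= max_days_since_seen)


dau = user_count(rows, 0)  # seen on submission_date itself
wau = user_count(rows, 6)  # assumption: seen within the last 7 days
mau = len(rows)            # COUNT(*): every row lies inside the 28 day window

# Each client_id falls in exactly one group, so group counts can be summed
mau_by_country = Counter(r["country"] for r in rows)
total_from_groups = sum(mau_by_country.values())
```

Because every client contributes one row per submission_date, the per-country counts add up to the overall MAU without any distinct counting.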
Note that since the introduction of BigQuery, we are able to represent the
main ping structure in a table, available as
New analyses should avoid
main_summary, which exists only for compatibility.
The main_summary table contains one row for each ping.
Each column represents one field from the main ping payload,
though only a subset of all main ping fields are included.
This dataset does not include most histograms.
This table is massive, and due to its size, it can be difficult to work with.
Instead, we recommend using the
If you do need to query this table, make use of the
sample_id field and
limit to a short submission date range.
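As a sketch of that advice, the Python snippet below narrows a set of hypothetical rows to one sample_id and a short submission date range. The row layout is invented for illustration, and the idea that a single sample_id value selects a small, stable slice of clients is an assumption not stated on this page.

```python
from datetime import date

# Hypothetical main_summary-like rows
rows = [
    {"client_id": "a", "sample_id": 42, "submission_date": date(2019, 1, 1)},
    {"client_id": "b", "sample_id": 7, "submission_date": date(2019, 1, 1)},
    {"client_id": "c", "sample_id": 42, "submission_date": date(2019, 3, 1)},
    {"client_id": "d", "sample_id": 42, "submission_date": date(2019, 1, 2)},
]


def limit_rows(rows, sample_id, start, end):
    """Keep rows for one sample_id inside an inclusive date range."""
    return [
        r
        for r in rows
        if r["sample_id"] == sample_id and start <= r["submission_date"] <= end
    ]


# One sample and one week instead of the full table
subset = limit_rows(rows, 42, date(2019, 1, 1), date(2019, 1, 7))
```

The same two predicates translate directly into a WHERE clause when querying the table itself.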
The main_summary table is accessible through re:dash.
Here's an example query.
The first_shutdown_summary table is a summary of the
Ping latency was reduced through the shutdown ping-sender mechanism in Firefox 55. To maintain consistent historical behavior, the first main ping is not sent until the second start up. In Firefox 57, a separate first-shutdown ping was created to evaluate first-shutdown behavior while maintaining backwards compatibility.
In many cases, the first-shutdown ping is a duplicate of the main ping. The first-shutdown summary can be used in conjunction with the main summary by taking the union and deduplicating on the
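A minimal Python sketch of the union-and-deduplicate step follows. The deduplication key is not shown here, so the code uses a hypothetical field called ping_id purely as a placeholder; substitute whichever unique identifier the two tables actually share.

```python
def union_dedup(main_rows, first_shutdown_rows, key="ping_id"):
    """Concatenate both row sets and keep the first row seen for each key.

    Assumption: `key` names a field that uniquely identifies a ping.
    "ping_id" is a placeholder, not a real column name.
    """
    seen = set()
    combined = []
    for row in list(main_rows) + list(first_shutdown_rows):
        if row[key] in seen:
            continue  # duplicate of a ping already kept
        seen.add(row[key])
        combined.append(row)
    return combined


# Hypothetical rows: ping p2 appears in both tables
main_rows = [
    {"ping_id": "p1", "client_id": "a"},
    {"ping_id": "p2", "client_id": "b"},
]
first_shutdown_rows = [
    {"ping_id": "p2", "client_id": "b"},
    {"ping_id": "p3", "client_id": "c"},
]

combined = union_dedup(main_rows, first_shutdown_rows)
```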
The data can be accessed as
The data is backfilled to 2017-09-22, the date of its first nightly appearance. This data should be available to all releases on and after Firefox 57.
The client_count_daily dataset is useful for estimating user counts over a few
The client_count_daily dataset is similar to the deprecated
except that it is aggregated by submission date and not activity date.
This dataset includes columns for a dozen factors and an HLL variable.
The hll column contains a
variable, which is an approximation to the exact count.
The factor columns include submission date and the dimensions listed
Each row represents one combination of the factor columns.
It's important to understand that the
hll column is not a standard count.
The hll variable avoids double-counting users when aggregating over multiple days.
The HyperLogLog variable is a far more efficient way to count distinct elements of a set,
but comes with some complexity.
To find the cardinality of an HLL use
cardinality(cast(hll AS HLL)).
To find the union of two HLLs over different dates, use
merge(cast(hll AS HLL)).
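To make the cardinality and merge operations less abstract, here is a toy HyperLogLog in Python. It is a teaching sketch under simplifying assumptions (SHA-1 as the hash, 2^13 registers, only the small-range correction) and is not the implementation behind the hll column; the class and method names are hypothetical.

```python
import hashlib
import math


class TinyHLL:
    """Toy HyperLogLog sketch; illustrative only."""

    def __init__(self, p=13):
        self.p = p            # number of index bits
        self.m = 1 << p       # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")        # 64-bit hash value
        index = h >> (64 - self.p)                   # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)        # remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1  # leading zeros + 1
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other):
        """Set union: element-wise maximum of the registers."""
        merged = TinyHLL(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def cardinality(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)  # small-range correction
        return raw


# Two days of hypothetical clients with 3000 clients in common
day1, day2 = TinyHLL(), TinyHLL()
for i in range(6000):
    day1.add(f"client-{i}")
for i in range(3000, 9000):
    day2.add(f"client-{i}")

union_estimate = day1.merge(day2).cardinality()       # close to 9000
naive_sum = day1.cardinality() + day2.cardinality()   # close to 12000
```

Merging the sketches before taking the cardinality counts each returning client once, whereas adding the two daily estimates counts the 3000 shared clients twice.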
The Firefox ER Reporting Query
is a good example to review.
Finally, Roberto has a relevant write-up
The data is available in Re:dash. Take a look at this example query.
I don't recommend accessing this data from ATMO.
Public crash statistics for Firefox are available through the Data Platform in a
The crash data in Socorro is sanitized and made available to ATMO and STMO.
A nightly import job converts batches of JSON documents into a columnar format using the associated JSON Schema.
The dataset is available in parquet at
It is also indexed with Athena and Presto with the table name
The heavy_users table provides information about whether a given
considered a "heavy user" on each day (using submission date).
The heavy_users table contains one row per client-day, where day is
submission_date. A client has a row for a specific
they were active at all in the 28 day window ending on that
A user is a "heavy user" as of day N if, for the 28 day period ending
on day N, the sum of their
active_ticks is in the 90th percentile (or
above) of all clients during that period. For more analysis on this,
and a discussion of new profiles, see
- Data starts at 20170801. There is technically data in the table before
this, but the
NULL for those dates because it needed to bootstrap the first 28 day window.
- Because heavy users are the top 10% of clients for each 28 day period, more than 10% of clients active on a given submission_date will be considered heavy users. If you join with another data source (main_summary, for example), you may see a larger proportion of heavy users than expected.
- Each day has a separate, but related, set of heavy users. Initial investigations show that approximately 97.5% of heavy users as of a certain day are still considered heavy users as of the next day.
- There is no "fixing" or weighting of new profiles: days before the profile was created are counted as zero active_ticks. Analyses may need to use the included profile_creation_date field to take this into account.
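The definition and the "more than 10%" caveat above can be checked with a small Python simulation. Everything in it is invented for illustration: 20 clients, a fixed 10 active_ticks per active day, a contrived activity pattern, and the 90th percentile taken from `statistics.quantiles` with its default method, which may differ from the cutoff the real job uses.

```python
from statistics import quantiles

WINDOW = 28                # days in the window ending on day N
TICKS_PER_ACTIVE_DAY = 10  # made-up constant


def active_days(k):
    """Hypothetical pattern: client k is active on k days of the window."""
    if k >= 11:
        return set(range(WINDOW - k + 1, WINDOW + 1))  # the last k days
    return set(range(1, k + 1))                        # the first k days


clients = {k: active_days(k) for k in range(1, 21)}
totals = {k: len(days) * TICKS_PER_ACTIVE_DAY for k, days in clients.items()}

# 90th percentile of the 28 day active_ticks sums
cutoff = quantiles(list(totals.values()), n=10)[-1]
heavy = {k for k, total in totals.items() if total >= cutoff}

# Share of heavy users among everyone active in the window
share_of_window = len(heavy) / len(clients)

# Share of heavy users among clients active on the last day only
active_on_last_day = {k for k, days in clients.items() if WINDOW in days}
share_of_last_day = len(heavy & active_on_last_day) / len(active_on_last_day)
```

In this toy population heavy users make up 10% of the window but a larger share of the clients seen on the final day, since clients with many active days are more likely to show up on any particular day.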
The data is available both via sql.t.m.o and Spark.
SELECT * FROM heavy_users LIMIT 3
The code responsible for generating this dataset is here
The retention table provides client counts relevant to client retention at a
1-day granularity. The project is tracked in Bug 1381840.
The retention table contains a set of attribute columns used to specify a
cohort of users and a set of metric columns to describe cohort activity. Each
row contains a permutation of attributes, an approximate set of clients in a
cohort, and the aggregate engagement metrics.
This table uses the HyperLogLog (HLL) sketch to create an approximate set of
clients in a cohort. HLL allows counting across overlapping cohorts in a single
pass while avoiding the problem of double counting. This data-structure has the
benefit of being compact and performant in the context of retention analysis,
at the expense of precision. For example, a 7-day retention period can be
calculated by aggregating over a week of retention data using the union
operation. With SQL primitives alone, this requires recalculating COUNT DISTINCT
over the client_ids in the 7-day window.
- The data starts at 2017-03-06, the merge date where Nightly started to track Firefox 55 in Mozilla-Central. However, there was not a consistent view into the behavior of first session profiles until the new_profile ping. This means much of the data is inaccurate before 2017-06-26.
- This dataset uses a 4 day reporting latency to aggregate at least 99% of the data in a given submission date. This figure is derived from the telemetry-health measurements on submission latency, with the discussion in Bug 1407410. This latency was reduced in Firefox 55 with the introduction of the shutdown ping-sender mechanism.
- Caution should be taken before adding new columns. Additional attribute columns will grow the number of rows exponentially.
- The number of HLL bits chosen for this dataset is 13. This means the default size of the HLL object is 2^13 bits or 1KiB. This maintains about a 1% error on average. See this table from Algebird's HLL implementation for more details.
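The size and error figures in the last bullet can be reproduced with a short calculation. The sketch below uses the common HyperLogLog rule of thumb that the relative standard error is about 1.04 / sqrt(m) for m = 2^bits; whether the Algebird table referenced above is built on exactly this formula is an assumption, and the function name is hypothetical.

```python
import math


def hll_size_and_error(bits):
    """Return (m, size in bytes, approximate relative standard error).

    Assumptions: m = 2**bits, the object is counted as m bits as in the
    bullet above, and the error follows the 1.04 / sqrt(m) rule of thumb.
    """
    m = 2 ** bits
    size_bytes = m // 8
    std_error = 1.04 / math.sqrt(m)
    return m, size_bytes, std_error


m, size_bytes, std_error = hll_size_and_error(13)
```

With 13 bits this gives 8192 for m, 1024 bytes (1 KiB), and an error a little above 1%, in line with the figures quoted in the bullet.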
The data is primarily available through Re:dash on STMO via the Presto source. This service has been configured to use predefined HLL functions.
The column should first be cast to the HLL type. The scalar
cardinality(<hll_column>) function will approximate the number of unique
items per HLL object. The aggregate
merge(<hll_column>) function will perform
the set union between all objects in a column.
Example: Cast the count column into the appropriate type.
SELECT cast(hll as HLL) as n_profiles_hll FROM retention
Count the number of clients seen over all attribute combinations.
SELECT cardinality(cast(hll as HLL)) FROM retention
Group-by and aggregate client counts over different release channels.
SELECT channel, cardinality(merge(cast(hll AS HLL))) FROM retention GROUP BY channel
Also see the
The churn dataset tracks the 7-day churn rate of telemetry profiles. This dataset is generally used for analyzing cohort churn across segments and time.
Churn is the rate of attrition defined by
(clients seen in week N)/(clients seen in week 0)
for groups of clients with some shared attributes. A group of clients with
shared attributes is called a cohort. The cohorts in this dataset are created
every week and can be tracked over time using the
acquisition_date and the
weeks since acquisition or
The following example demonstrates the current logic for generating this dataset. Each column represents the days since some arbitrary starting date.
All three clients are part of the same cohort. Client A is retained for weeks 0 and 1 since there is activity in both periods. A client only needs to show up once in the period to be counted as retained. Client B is acquired in week 0 and is active frequently but does not appear in following weeks. Client B is considered churned on week 1. However, a client that is churned can become retained again. Client C is considered churned on week 1 but retained on week 2.
The following table summarizes the above daily activity into a view where every column represents the current week since the acquisition date.
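The weekly bucketing described above can be sketched in Python. The activity dates below are invented so that clients A, B and C behave as in the narrative (A seen in weeks 0 and 1, B only in week 0, C in weeks 0 and 2); the acquisition date and function names are hypothetical.

```python
from datetime import date, timedelta

acquisition = date(2017, 1, 1)  # hypothetical shared acquisition date


def on_day(n):
    """Date that is n days after the acquisition date."""
    return acquisition + timedelta(days=n)


# Made-up daily activity, expressed as days since acquisition
activity = {
    "A": [on_day(0), on_day(8)],
    "B": [on_day(0), on_day(1), on_day(2), on_day(3), on_day(5)],
    "C": [on_day(1), on_day(15)],
}


def weeks_seen(active_dates):
    """Weeks since acquisition in which the client appeared at least once."""
    return {(d - acquisition).days // 7 for d in active_dates if d >= acquisition}


weekly = {client: weeks_seen(dates) for client, dates in activity.items()}


def fraction_seen(week):
    """(clients seen in week N) / (clients seen in week 0)."""
    cohort = [c for c, weeks in weekly.items() if 0 in weeks]
    seen = [c for c in cohort if week in weekly[c]]
    return len(seen) / len(cohort)
```

A single appearance inside a week is enough to count for that week, which is why client B's frequent activity in week 0 counts the same as client C's single visit.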
The clients are then grouped into cohorts by attributes. An attribute describes a property about the cohort such as the country of origin or the binary distribution channel. Each group also contains descriptive aggregates of engagement. Each metric describes the activity of a cohort such as size and overall usage at a given time instance.
The original concept for churn is captured in this Mana
The original derived data-set was created in bug
1198537. The first
major revision (
this data-set added attribution, search, and uri counts. The second major
additional clients through the
new-profile ping and adjusted the collection
window from 10 to 5 days.
- Each row in this dataset describes a unique segment of users
- The number of rows is exponential with the number of dimensions
- New fields should be added sparingly to account for data-set size
- The dataset lags by 10 days in order to account for submission latency
- This value was determined to be the time for 99% of main pings to arrive at the server. With the shutdown-ping sender, this has been reduced to 4 days. However, churn_v3 still tracks releases older than Firefox 55.
- The start of the period is fixed to Sundays. Once it has been aggregated, the
period cannot be shifted due to the way clients are counted.
- A supplementary 1-day retention dataset using HyperLogLog for client counts is available for counting over arbitrary retention periods and date offsets. Additionally, calculating churn or retention over specific cohorts is tractable in STMO with
churn is available in Re:dash under Athena and Presto. The data is also
available in parquet for consumption by columnar data engines at
The error_aggregates_v2 table holds counts of errors drawn from main and crash
pings, aggregated every 5 minutes. It is the dataset backing the main mission
control view, but may also be queried
The error_aggregates_v2 table contains counts of various error measures (for
example: crashes, "the slow script dialog showing"), aggregated across each
unique set of dimensions (for example: channel, operating system) every 5
minutes. You can get an aggregated count for any particular set of dimensions
by summing using SQL.
It's important to note that when this dataset is written, pings from clients participating in an experiment
are aggregated on the
experiment_branch dimensions corresponding to what experiment and branch
they are participating in. However, they are also aggregated with the rest of the population where the values of
these dimensions are null.
Therefore, care must be taken when writing aggregating queries over the whole population: in these cases one needs to filter for rows where experiment_id is null and experiment_branch is null in order to not double-count pings from experiments.
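The double-counting risk can be seen in a few lines of Python. The rows below are hypothetical: experiment_id and experiment_branch are the dimensions named above, while channel comes from the list of example dimensions and main_crashes is a made-up measure name.

```python
# Hypothetical 5 minute aggregates for one channel. The row with null
# experiment dimensions already includes clients enrolled in experiments.
rows = [
    {"channel": "release", "experiment_id": None, "experiment_branch": None, "main_crashes": 10},
    {"channel": "release", "experiment_id": "exp-1", "experiment_branch": "control", "main_crashes": 3},
    {"channel": "release", "experiment_id": "exp-1", "experiment_branch": "treatment", "main_crashes": 2},
]


def total_crashes(rows, whole_population_only):
    """Sum the measure, optionally keeping only rows with null experiment dimensions."""
    return sum(
        r["main_crashes"]
        for r in rows
        if not whole_population_only
        or (r["experiment_id"] is None and r["experiment_branch"] is None)
    )


naive_total = total_crashes(rows, whole_population_only=False)   # counts experiment pings twice
correct_total = total_crashes(rows, whole_population_only=True)  # whole population only
```

The naive sum is inflated by exactly the crashes that also appear in the experiment rows, which is the overlap the null filter removes.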
You can access the data via re:dash. Choose
Athena and then select the
The code responsible for generating this dataset is here.
There are several tables owned by the mobile team documented here: