Choosing a Desktop Product Dataset

This document will help you find the best data source for a given analysis. It focuses on descriptive datasets and does not cover anything attempting to explain why something is observed. This guide will help if you need to answer questions like:

  • How many Firefox users are active in Germany?
  • How many crashes occur each day?
  • How many users have installed a specific add-on?

If you want to know whether a causal link exists between two events, see tools for experimentation.

Raw Pings

We receive data from our users via pings. There are several types of pings, each containing different measurements and sent for different purposes. To review a complete list of ping types and their schemata, see this section of the Mozilla Source Tree Docs.

Pings are also described by a JSONSchema specification which can be found in the mozilla-pipeline-schemas repository.

There are a few pings that are central to delivering our core data collection primitives (Histograms, Events, Scalars) and for keeping an eye on Firefox behaviour (Environment, New Profiles, Updates, Crashes).

For instance, a user's first session in Firefox might have four pings like this:

[Figure: flowchart of pings in the user's first session]

"main" ping

The "main" ping is the workhorse of the Firefox Telemetry system. It delivers the Telemetry Environment as well as Histograms and Scalars for all process types that collect data in Firefox. It has several variants each with specific delivery characteristics:

| Reason | Sent when | Notes |
| --- | --- | --- |
| shutdown | Firefox session ends cleanly | Accounts for about 80% of all "main" pings. Sent by Pingsender immediately after Firefox shuts down, subject to conditions: Firefox 55+, if the OS isn't also shutting down, and if this isn't the client's first session. If Pingsender fails or isn't used, the ping is sent by Firefox at the beginning of the next Firefox session. |
| daily | It has been more than 24 hours since the last "main" ping, and it is around local midnight | In long-lived Firefox sessions we might go days without receiving a "shutdown" ping. Thus the "daily" ping is sent to ensure we occasionally hear from long-lived sessions. |
| environment-change | Telemetry Environment changes | Is sent immediately when triggered by Firefox (Installing or removing an addon or changing a monitored user preference are common ways for the Telemetry Environment to change) |
| aborted-session | Firefox session doesn't end cleanly | Sent by Firefox at the beginning of the next Firefox session. |

The "main" ping was introduced in Firefox 38.

"first-shutdown" ping

The "first-shutdown" ping is identical to the "main" ping with reason "shutdown" created at the end of the user's first session, but sent with a different ping type. This was introduced when we started using Pingsender to send shutdown pings as there would be a lot of first-session "shutdown" pings that we'd start receiving all of a sudden.

It is sent using Pingsender.

It was introduced in Firefox 57.

"event" ping

The "event" ping provides low-latency eventing support to Firefox Telemetry. It delivers the Telemetry Environment, Telemetry Events from all Firefox processes, and some diagnostic information about Event Telemetry. It is sent every hour if events have been recorded. If the maximum event limit for the ping (1000 per process by default, governed by a preference) is reached before the hour is up, it is sent early, but at most once every 10 minutes (also governed by a preference).
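The sending rule can be sketched as a small decision function. This is a simplified model for illustration, not Firefox's implementation: the function name is hypothetical, the per-process limit is collapsed into a single count, and the constants are the defaults quoted in the text.

```python
HOUR_S = 60 * 60          # regular interval between "event" pings
MIN_INTERVAL_S = 10 * 60  # shortest interval allowed when the event limit is hit
EVENT_LIMIT = 1000        # default event limit quoted in the text


def should_send_event_ping(seconds_since_last_ping: int, events_recorded: int) -> bool:
    """Hypothetical model of when an "event" ping would be sent."""
    if events_recorded == 0:
        return False  # nothing was recorded, so there is nothing to send
    if seconds_since_last_ping >= HOUR_S:
        return True   # regular hourly submission
    # Early submission: the limit was reached, but never more often than every 10 minutes.
    return events_recorded >= EVENT_LIMIT and seconds_since_last_ping >= MIN_INTERVAL_S
```

For instance, a session that records 1000 events within 10 minutes would send early, while one that records 999 events in half an hour would wait for the hourly submission.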

It was introduced in Firefox 62.

"update" ping

Firefox Update is the most important means we have of reaching our users with the latest fixes and features. The "update" ping notifies us when an update is downloaded and ready to be applied (reason: "ready") and when the update has been successfully applied (reason: "success"). It contains the Telemetry Environment and information about the update.

It was introduced in Firefox 56.

"new-profile" ping

When a user starts up Firefox for the first time, a profile is created. Telemetry marks the occasion with the "new-profile" ping which sends the Telemetry Environment. It is sent either 30 minutes after Firefox starts running for the first time in this profile (reason: "startup") or at the end of the profile's first session (reason: "shutdown"), whichever comes first. "new-profile" pings are sent immediately when triggered. Those with reason "startup" are sent by Firefox. Those with reason "shutdown" are sent by Pingsender.

It was introduced in Firefox 55.

"crash" ping

The "crash" ping provides diagnostic information whenever a Firefox process exits abnormally. Unlike the "main" ping with reason "aborted-session", this ping does not contain Histograms or Scalars. It contains a Telemetry Environment, Crash Annotations, and Stack Traces.

It was introduced in Firefox 40.

"deletion-request" ping

In the event a user opts out of Telemetry, we send one final "deletion-request" ping to let us know. It contains only the common ping data and an empty payload.

It was introduced in Firefox 72, replacing the "optout" ping (which was in turn introduced in Firefox 63).

Pingsender

Pingsender is a small application shipped with Firefox which attempts to send pings even if Firefox is not running. If Firefox has crashed or has already shut down we would otherwise have to wait for the next Firefox session to begin to send pings.

Pingsender was introduced in Firefox 54 to send "crash" pings. It was expanded to send "main" pings of reason "shutdown" in Firefox 55 (excepting the first session). It sends the "first-shutdown" ping since its introduction in Firefox 57.
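As a rough summary of that history, the sketch below encodes which pings Pingsender handles in a given Firefox version. The function and parameter names are hypothetical, the "new-profile" rule is taken from the "new-profile" ping section, and other conditions (such as the OS shutting down at the same time) are deliberately ignored.

```python
def pingsender_handles(ping_type: str, reason: str, firefox_version: int,
                       first_session: bool = False) -> bool:
    """Return True if, per the history in this document, Pingsender sends this ping."""
    if ping_type == "crash":
        return firefox_version >= 54
    if ping_type == "main" and reason == "shutdown":
        # First-session "shutdown" pings are excluded from Pingsender.
        return firefox_version >= 55 and not first_session
    if ping_type == "new-profile" and reason == "shutdown":
        return firefox_version >= 55
    if ping_type == "first-shutdown":
        return firefox_version >= 57
    return False  # everything else is sent by Firefox itself
```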

Analysis

The large majority of analyses can be completed using only the main ping. This ping includes histograms, scalars, and other performance and diagnostic data.

Few analyses actually rely directly on any raw ping data. Instead, we provide derived datasets which are processed versions of these data, made to be:

  • Easier and faster to query
  • Organized to make the data easier to analyze
  • Cleaned of erroneous or misleading data

Before analyzing raw ping data, check to make sure there isn't already a derived dataset made for your purpose. If you do need to work with raw ping data, be aware that the volume of data can be high. Try to limit the size of your data by controlling the date range, and start off using a sample.

Accessing the Data

Ping data lives in BigQuery and is accessible in re:dash; see the BigQuery cookbook section for more information.

Further Reading

You can find the complete ping documentation. To augment our data collection, see Collecting New Data and the Data Collection Policy.

Main Ping Derived Datasets

The main ping includes most of the measurements that track the performance and health of Firefox in the wild. This ping includes histograms, scalars, and events.

In its raw form, the main ping can be a bit difficult to work with. To make analyzing data easier, some datasets have been provided that simplify and aggregate information provided by the main ping.

clients_daily

The clients_daily table is intended as the first stop for asking questions about how people use Firefox. It should be easy to answer simple questions. Each row in the table corresponds to one (client_id, submission_date) pair and contains a number of aggregates about that day's activity.

Contents

Many questions about Firefox take the form "What did clients with characteristics X, Y, and Z do during the period S to E?" The clients_daily table is aimed at answering those questions.

Accessing the Data

The clients_daily table is accessible through re:dash using the Telemetry (BigQuery) data source.

Here's an example query.

clients_last_seen

The clients_last_seen dataset is useful for efficiently determining exact user counts such as DAU and MAU. It can also allow efficient calculation of other windowed usage metrics like retention via its bit pattern fields.

Unlike the client_count_daily dataset, which relies on the HyperLogLog algorithm, it does not use approximations. It includes the most recent values in a 28 day window for all columns in the clients_daily dataset.

This dataset should be used instead of client_count_daily.

Content

For each submission_date this dataset contains one row per client_id that appeared in clients_daily in a 28 day window including submission_date and preceding days.

The days_since_seen column indicates the difference between submission_date and the most recent submission_date in clients_daily where the client_id appeared. A client observed on the given submission_date will have days_since_seen = 0.

Other days_since_ columns use the most recent date in clients_daily where a certain condition was met. If the condition was not met for a client_id in a 28 day window, NULL is used. For example, days_since_visited_5_uri uses the condition scalar_parent_browser_engagement_total_uri_count_sum >= 5. These columns can be used for user counts where a condition must be met on any day in a window instead of using the most recent values for each client_id.

The days_seen_bits field stores the daily history of a client in the 28 day window. The daily history is converted into a sequence of bits, with a 1 for the days a client is in clients_daily and a 0 otherwise, and this sequence is converted to an integer. A tutorial on how to use these bit patterns to create filters in SQL can be found in this notebook.
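To make the encoding concrete, here is a minimal Python sketch of such a bit pattern. It assumes that the least significant bit corresponds to submission_date and that each higher bit is one day earlier; that ordering, along with all function names, is an assumption for illustration and should be checked against the notebook before use.

```python
WINDOW = 28  # number of days tracked in the bit pattern


def encode_days_seen(days_ago_seen):
    """Pack 'days ago' offsets (0 = submission_date, assumed LSB) into an integer."""
    bits = 0
    for d in days_ago_seen:
        if 0 <= d < WINDOW:
            bits |= 1 << d
    return bits


def days_since_seen(bits):
    """Offset of the most recent active day, or None if no day is set."""
    if bits == 0:
        return None
    # Isolate the lowest set bit, then convert it to a position.
    return (bits & -bits).bit_length() - 1


def active_days_in_last(bits, n):
    """Count active days among the n most recent days."""
    mask = (1 << n) - 1
    return bin(bits & mask).count("1")


# A client seen today, 2 days ago, and 9 days ago.
bits = encode_days_seen({0, 2, 9})
```

Under this assumption, filters such as "active on at least two of the last seven days" reduce to a mask and a bit count.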

The rest of the columns use the most recent value in clients_daily where the client_id appeared.

Background and Caveats

User counts generated using days_since_seen only reflect the most recent values from clients_daily for each client_id in a 28 day window. This means Active MAU as defined cannot be efficiently calculated using days_since_seen. For example, if a given client_id appeared every day in February but had scalar_parent_browser_engagement_total_uri_count_sum >= 5 only on February 1st, it would be counted as active only on the 1st, and not on the 2nd through the 28th. Active MAU can be efficiently and correctly calculated using days_since_visited_5_uri.
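The February example can be reproduced with a few in-memory rows. This is a sketch with made-up data: the rows stand in for clients_last_seen on a single submission_date (February 28th), and the two helper functions are hypothetical.

```python
URI_COL = "scalar_parent_browser_engagement_total_uri_count_sum"

# Made-up rows for submission_date = February 28th, one per client in the window.
rows = [
    # Client "a" appeared every day but reached 5 URIs only on February 1st.
    {"client_id": "a", "days_since_seen": 0, URI_COL: 2, "days_since_visited_5_uri": 27},
    {"client_id": "b", "days_since_seen": 3, URI_COL: 9, "days_since_visited_5_uri": 3},
    # Client "c" never reached 5 URIs in the window, so the column is NULL (None).
    {"client_id": "c", "days_since_seen": 10, URI_COL: 1, "days_since_visited_5_uri": None},
]


def active_mau_from_latest_values(rows):
    """Incorrect approach: applies the condition to the most recent values only."""
    return sum(1 for r in rows if r["days_since_seen"] < 28 and r[URI_COL] >= 5)


def active_mau(rows):
    """Counts clients that met the condition on any day in the 28 day window."""
    return sum(
        1 for r in rows
        if r["days_since_visited_5_uri"] is not None and r["days_since_visited_5_uri"] < 28
    )
```

With this data the first function misses client "a", while the second counts both "a" and "b".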

MAU can be calculated over a GROUP BY submission_date[, ...] clause using COUNT(*), because there is exactly one row in the dataset for each client_id in the 28 day MAU window for each submission_date.

User counts generated using days_since_seen can use SUM to reduce groups, because a given client_id will only be in one group per submission_date. So if MAU were calculated by country and channel, then the sum of the MAU for each country would be the same as if MAU were calculated only by channel.
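The additivity property can be checked on a toy example. The rows below are made up, and the helper name is hypothetical; the point is only that counting rows per group and then summing the country-level counts gives the same result as grouping by channel directly.

```python
from collections import Counter

# Made-up rows: (submission_date, client_id, country, channel), one row per client.
rows = [
    ("2020-01-28", "c1", "DE", "release"),
    ("2020-01-28", "c2", "DE", "release"),
    ("2020-01-28", "c3", "FR", "release"),
    ("2020-01-28", "c4", "FR", "beta"),
]


def mau_by(rows, key):
    """Equivalent of COUNT(*) with GROUP BY over the given key function."""
    counts = Counter()
    for row in rows:
        counts[key(row)] += 1
    return dict(counts)


by_country_channel = mau_by(rows, lambda r: (r[0], r[2], r[3]))
by_channel = mau_by(rows, lambda r: (r[0], r[3]))

# Roll the finer grouping up to channel level with SUM.
rolled_up = Counter()
for (date, country, channel), n in by_country_channel.items():
    rolled_up[(date, channel)] += n
```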

Accessing the Data

The data is available in Re:dash and BigQuery. Take a look at this full running example query in Re:dash.

main_summary

Contents

Note that since the introduction of BigQuery, we are able to represent the full main ping structure in a table, available as telemetry.main. New analyses should avoid main_summary, which exists only for compatibility.

The main_summary table contains one row for each ping. Each column represents one field from the main ping payload, though only a subset of all main ping fields are included. This dataset does not include most histograms.

Background and Caveats

This table is massive, and due to its size, it can be difficult to work with.

Instead, we recommend using the clients_daily or clients_last_seen dataset where possible.

If you do need to query this table, make use of the sample_id field and limit to a short submission date range.
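One way to keep such a query small is to build it from a helper that always requires a date range and a sample. The sketch below only assembles a SQL string; the helper name and the date column name passed to it are assumptions for illustration, so check the table schema for the actual column names before running anything.

```python
def build_sampled_query(table, date_column, start_date, end_date, sample_id):
    """Build a SQL string restricted to one sample_id and a short date range.

    Values are interpolated directly for readability; this is only suitable
    for trusted, hand-written inputs.
    """
    return (
        f"SELECT COUNT(*) AS n_pings\n"
        f"FROM {table}\n"
        f"WHERE {date_column} BETWEEN '{start_date}' AND '{end_date}'\n"
        f"  AND sample_id = {int(sample_id)}"
    )


# "submission_date" is an assumed column name, not taken from the table schema.
query = build_sampled_query("main_summary", "submission_date", "2020-01-01", "2020-01-07", 42)
```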

Accessing the Data

The main_summary table is accessible through re:dash. Here's an example query.

first_shutdown_summary

The first_shutdown_summary table is a summary of the first-shutdown ping.

Contents

The first shutdown ping contains first session usage data. The dataset has rows similar to the telemetry_new_profile_parquet, but in the shape of main_summary.

Background and Caveats

Ping latency was reduced through the shutdown Pingsender mechanism in Firefox 55. To maintain consistent historical behavior, the first main ping is not sent until the second startup. In Firefox 57, a separate first-shutdown ping was created to evaluate first-shutdown behavior while maintaining backwards compatibility.

In many cases, the first-shutdown ping is a duplicate of the main ping. The first-shutdown summary can be used in conjunction with the main summary by taking the union and deduplicating on the document_id.
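The union and deduplication step might look like the following in plain Python. The rows are made up and the function name is hypothetical; only the document_id key comes from the text.

```python
def union_dedup(main_rows, first_shutdown_rows):
    """Union two lists of row dicts, keeping the first row seen per document_id."""
    seen = set()
    combined = []
    for row in list(main_rows) + list(first_shutdown_rows):
        doc_id = row["document_id"]
        if doc_id in seen:
            continue  # duplicate of a ping that was already kept
        seen.add(doc_id)
        combined.append(row)
    return combined


# Made-up rows: "d2" arrives both as a main ping and as a first-shutdown ping.
main_rows = [
    {"document_id": "d1", "source": "main"},
    {"document_id": "d2", "source": "main"},
]
first_shutdown_rows = [
    {"document_id": "d2", "source": "first-shutdown"},
    {"document_id": "d3", "source": "first-shutdown"},
]
combined = union_dedup(main_rows, first_shutdown_rows)
```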

Accessing the Data

The data can be accessed as first_shutdown_summary.

The data is backfilled to 2017-09-22, the date of its first nightly appearance. This data should be available for all releases on and after Firefox 57.

client_count_daily

The client_count_daily dataset is useful for estimating user counts over a few pre-defined dimensions.

The client_count_daily dataset is similar to the deprecated client_count dataset except that it is aggregated by submission date and not activity date.

Content

This dataset includes columns for a dozen factors and an HLL variable. The hll column contains a HyperLogLog variable, which is an approximation to the exact count. The factor columns include submission date and the dimensions listed here. Each row represents one combination of the factor columns.

Background and Caveats

It's important to understand that the hll column is not a standard count. The hll variable avoids double-counting users when aggregating over multiple days. The HyperLogLog variable is a far more efficient way to count distinct elements of a set, but comes with some complexity. To find the cardinality of an HLL, use cardinality(cast(hll AS HLL)). To find the union of two HLLs over different dates, use merge(cast(hll AS HLL)). The Firefox ER Reporting Query is a good example to review. Finally, Roberto has a relevant write-up here.
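The reason to merge before counting can be illustrated with ordinary Python sets. Sets are used here as an exact stand-in for HLL sketches, which are approximate; the cardinality and merge helpers below are hypothetical and only mirror the names of the SQL functions, and the data is made up.

```python
# Made-up daily client sets; each set stands in for one day's hll value.
daily_clients = {
    "2020-01-01": {"a", "b", "c"},
    "2020-01-02": {"b", "c", "d"},
}


def cardinality(sketch):
    """Stand-in for cardinality(): number of distinct clients in one sketch."""
    return len(sketch)


def merge(sketches):
    """Stand-in for merge(): union of several sketches."""
    merged = set()
    for s in sketches:
        merged |= s
    return merged


# Summing daily counts double-counts clients "b" and "c".
naive_total = sum(cardinality(s) for s in daily_clients.values())

# Merging first, then taking the cardinality, counts each client once.
distinct_total = cardinality(merge(daily_clients.values()))
```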

Accessing the Data

The data is available in Re:dash. Take a look at this example query.

Other Datasets

Public crash statistics for Firefox are available through the Data Platform in a socorro_crash dataset. The crash data in Socorro is sanitized and made available to STMO. A nightly import job converts batches of JSON documents into a columnar format using the associated JSON Schema.

Accessing the Data

The dataset is available in parquet at s3://telemetry-parquet/socorro_crash/v2. It is also indexed with Athena and Presto with the table name socorro_crash.