Choosing a Desktop Product Dataset

This guide helps you find the best data source for a given analysis. It covers descriptive datasets, which report what was observed, and does not cover methods for explaining why something was observed. It is useful if you need to answer questions like:

  • How many Firefox users are active in Germany?
  • How many crashes occur each day?
  • How many users have installed a specific add-on?

If you want to know whether a causal link exists between two events, see tools for experimentation.

There are two types of datasets that you might want to use: those based on raw pings and those derived from them.

Raw Ping Datasets

We receive data from Firefox users via pings: small JSON payloads sent by clients at specified intervals. There are many types of pings, each containing different measurements and sent for different purposes.

These pings are then collected into ping-level datasets that can be queried using BigQuery. Raw pings can be difficult to work with and expensive to query, so use a derived dataset to answer your question where possible.

For more information on pings and how to use them, see Raw Ping Data.

Derived Datasets

Derived datasets are built from the raw ping data described above. They apply various transformations that make the data easier to work with and help you avoid the pitfall of pseudo-replication, that is, treating multiple pings from the same client as if they were independent observations. You can find a full list in the derived datasets section, but two commonly used ones are "Clients Daily" and "Clients Last Seen".
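To illustrate pseudo-replication, here is a minimal Python sketch using made-up ping records. The field names `client_id` and `crashed` are hypothetical and do not reflect a real ping schema. It shows how counting pings rather than clients lets one heavily reporting client dominate an estimate:

```python
from collections import defaultdict

# Hypothetical pings: client "a" sends four pings, clients "b" and "c" one each.
pings = [
    {"client_id": "a", "crashed": True},
    {"client_id": "a", "crashed": True},
    {"client_id": "a", "crashed": True},
    {"client_id": "a", "crashed": True},
    {"client_id": "b", "crashed": False},
    {"client_id": "c", "crashed": False},
]


def share_of_pings_with_crash(pings):
    """Fraction of pings that report a crash (each ping counted separately)."""
    return sum(p["crashed"] for p in pings) / len(pings)


def share_of_clients_with_crash(pings):
    """Fraction of distinct clients that reported at least one crash."""
    crashed_by_client = defaultdict(bool)
    for p in pings:
        crashed_by_client[p["client_id"]] |= p["crashed"]
    return sum(crashed_by_client.values()) / len(crashed_by_client)


print(share_of_pings_with_crash(pings))
print(share_of_clients_with_crash(pings))
```

In this toy data, two thirds of pings report a crash but only one in three clients does. Collapsing to one row per client before computing the metric, as the derived datasets do, avoids the inflated figure.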

Clients Daily

Many questions about Firefox take the form "What did clients with characteristics X, Y, and Z do during the period S to E?" The clients_daily table aims to answer these questions. Each row represents one (client_id, submission_date) pair and contains a number of aggregates about that client's activity on that day.
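As a rough illustration of that question shape, the following Python sketch filters rows shaped like clients_daily by one characteristic and a date range, then totals an aggregate per client. The rows are invented, and the `country` and `active_hours` columns are assumptions made for the example; check the reference for the actual column names.

```python
from datetime import date

# Hypothetical rows: one row per (client_id, submission_date).
rows = [
    {"client_id": "a", "submission_date": date(2024, 1, 1), "country": "DE", "active_hours": 1.5},
    {"client_id": "a", "submission_date": date(2024, 1, 2), "country": "DE", "active_hours": 0.5},
    {"client_id": "b", "submission_date": date(2024, 1, 1), "country": "FR", "active_hours": 2.0},
    {"client_id": "c", "submission_date": date(2024, 1, 3), "country": "DE", "active_hours": 1.0},
]


def active_hours_by_client(rows, country, start, end):
    """Total active hours per client in `country`, for start <= submission_date <= end."""
    totals = {}
    for row in rows:
        in_period = start <= row["submission_date"] <= end
        if row["country"] == country and in_period:
            totals[row["client_id"]] = totals.get(row["client_id"], 0.0) + row["active_hours"]
    return totals


result = active_hours_by_client(rows, "DE", date(2024, 1, 1), date(2024, 1, 2))
print(result)
```

Because each client contributes at most one row per day, the filter and the sum stay simple, and the number of distinct clients in the period is just the number of keys in the result.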

See the clients_daily reference for more information.

Clients Last Seen

The clients_last_seen dataset is useful for efficiently determining exact user counts such as daily active users (DAU) and monthly active users (MAU). Its bit pattern fields also allow efficient calculation of other windowed usage metrics, such as retention. For every column in the clients_daily dataset, it includes the most recent value observed within a 28-day window.
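The idea behind a bit pattern field can be sketched in Python as follows. This is only an illustration under stated assumptions: it assumes a 28-bit integer in which the lowest bit stands for the most recent day, and the function names are hypothetical. Consult the reference for the actual field definitions.

```python
WINDOW_DAYS = 28
MASK = (1 << WINDOW_DAYS) - 1  # keeps only the 28 most recent days


def advance_one_day(bits, seen_today):
    """Shift the pattern by one day and record today's activity in the lowest bit."""
    return ((bits << 1) & MASK) | int(seen_today)


def days_since_seen(bits):
    """Days since the most recent active day, or None if inactive for the whole window."""
    if bits == 0:
        return None
    lowest_set_bit = bits & -bits
    return lowest_set_bit.bit_length() - 1


def active_in_last(bits, n_days):
    """True if the client was active on any of the most recent n_days days."""
    return (bits & ((1 << n_days) - 1)) != 0


# A client active on day 1, then inactive for two days.
bits = 0
for seen in (True, False, False):
    bits = advance_one_day(bits, seen)

print(days_since_seen(bits))
print(active_in_last(bits, 1), active_in_last(bits, 7), active_in_last(bits, 28))
```

Under this encoding, a client counts toward DAU when `active_in_last(bits, 1)` is true and toward a 28-day MAU when `active_in_last(bits, 28)` is true, so both counts can be read from a single day's rows without scanning the preceding days.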

See the clients_last_seen reference for more information.