GLAM datasets
GLAM provides aggregated telemetry data in a way that makes it easy to understand how a given probe or metric has been changing over subsequent builds. GLAM aggregations are statistically validated by data scientists to ensure an accurate picture of the observed behavior of telemetry data.
The data behind GLAM is also meant to be explored on its own: the GLAM aggregation tables are useful for accessing the data that drives GLAM whenever more digging is required. Please read through the next section to learn more!
GLAM final tables (Aggregates dataset)
The following datasets are split into three categories: Firefox Desktop (Glean), Firefox for Android, and Firefox Desktop (Legacy Telemetry). The tables contain the final aggregated data that powers GLAM.
Each link below points to the dataset's page on Mozilla's Data Catalog where you can find the dataset's full documentation.
NOTE: You may find that the Aggregates dataset does not have the dimensions you need. For example, it does not contain client-level or day-by-day aggregations. If you need to dive deeper or aggregate on a field that isn't included in the Aggregates dataset, you will need to write queries against raw telemetry tables. Should that be your quest, you don't have to start from scratch: GLAM has a View SQL Query -> Telemetry SQL feature, accessible once you pick a metric or probe, which gives you a query that already works so you can tweak it. Additionally, you can read material such as Visualizing Percentiles of a Main Ping Exponential Histogram to learn how to write queries that give you what you need. Finally, #data-help on Slack is a place where all questions related to data are welcome.
Firefox Desktop (Glean)
moz-fx-data-shared-prod.glam_etl.glam_fog_nightly_aggregates
moz-fx-data-shared-prod.glam_etl.glam_fog_beta_aggregates
moz-fx-data-shared-prod.glam_etl.glam_fog_release_aggregates
Firefox for Android
moz-fx-data-shared-prod.glam_etl.glam_fenix_nightly_aggregates
moz-fx-data-shared-prod.glam_etl.glam_fenix_beta_aggregates
moz-fx-data-shared-prod.glam_etl.glam_fenix_release_aggregates
Firefox Desktop (Legacy Telemetry)
moz-fx-data-shared-prod.glam_etl.glam_desktop_nightly_aggregates
moz-fx-data-shared-prod.glam_etl.glam_desktop_beta_aggregates
moz-fx-data-shared-prod.glam_etl.glam_desktop_release_aggregates
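If you just want to peek at the data, the aggregate tables above can be queried directly. Below is a minimal sketch; the column names (`metric`, `os`, `app_version`, `total_users`) and the probe name are assumptions for illustration, so check the table's schema in the Data Catalog before relying on them.

```sql
-- Hypothetical exploration of a GLAM aggregates table; column names and
-- the probe name are illustrative, not the confirmed schema.
SELECT
  app_version,
  os,
  metric,
  total_users
FROM `moz-fx-data-shared-prod.glam_etl.glam_desktop_nightly_aggregates`
WHERE metric = 'gc_ms'  -- hypothetical probe name
ORDER BY app_version DESC
LIMIT 100;
```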
In addition to the tables above, the GLAM ETL stores the intermediate data produced by each transformation step. The next section provides an overview of these steps and the datasets they produce.
ETL Pipeline
Scheduling
Most of the GLAM ETL is scheduled to run daily via Airflow. There are separate ETL pipelines for computing GLAM datasets:
- Firefox Desktop on Glean
  - Runs daily
  - Only the `daily_` (first "half") jobs for release are processed
- Firefox Desktop on Glean (release)
  - Runs weekly
  - The second "half" of the release ETL is processed
- Firefox Desktop legacy
  - Runs daily
- Firefox for Android
  - Runs daily
Source Code
The ETL code base lives in the bigquery-etl repository and is partially generated. The scripts for generating ETL queries for Firefox Desktop Legacy currently live here, while the GLAM logic for Glean apps lives here.
Steps
GLAM has a separate set of steps and intermediate tables to aggregate scalar and histogram probes.
latest_versions
- This task pulls in the most recent version for each channel from https://product-details.mozilla.org/1.0/firefox_versions.json
clients_daily_histogram_aggregates_<process>
- The set of steps that load data into this table is divided by process (`parent`, `content`, `gpu`), plus a keyed step for keyed histograms.
- The `parent` job creates or overwrites the partition corresponding to the `logical_date`; the other processes append data to that partition.
- The process uses `telemetry.buildhub2` to select rows with valid `build_id`s.
- Aggregations are done per client, per day, with a row for each `submission_date`, `client_id`, `os`, `app_version`, `build_id`, and `channel` combination.
- The aggregation adds up histogram values with the same key within the dimensions listed above (see the sketch after this list).
- The queries for the different steps are generated and run as part of each step.
- The "keyed" step includes all keyed histogram probes, regardless of process (`parent`, `content`, `gpu`).
- As a result of these subdivisions, this step generates different rows for each process and keyed/non-keyed metric; they are grouped together later in the `clients_histogram_aggregates` step.
- Clients on the release channel running Windows are sampled to reduce the data size.
- The partitions are set to expire after 7 days.
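As a rough illustration of the aggregation described above, here is a minimal BigQuery sketch. The source table `raw_histograms` and the histogram shape (`ARRAY<STRUCT<key STRING, value INT64>>`) are assumptions; the real queries are generated in bigquery-etl.

```sql
-- Sketch of the per-client daily aggregation: explode each histogram into
-- key/value pairs, sum values per key within the listed dimensions, then
-- reassemble the histogram. Names are illustrative.
WITH exploded AS (
  SELECT
    submission_date, client_id, os, app_version, build_id, channel, metric,
    bucket.key AS key, bucket.value AS value
  FROM raw_histograms  -- hypothetical per-ping source
  CROSS JOIN UNNEST(histogram) AS bucket
),
summed AS (
  SELECT
    submission_date, client_id, os, app_version, build_id, channel, metric,
    key, SUM(value) AS value
  FROM exploded
  GROUP BY submission_date, client_id, os, app_version, build_id, channel,
    metric, key
)
SELECT
  submission_date, client_id, os, app_version, build_id, channel, metric,
  ARRAY_AGG(STRUCT(key, value)) AS histogram
FROM summed
GROUP BY submission_date, client_id, os, app_version, build_id, channel, metric;
```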
clients_histogram_aggregates_new
- This step groups together all rows with the same `submission_date` and `logical_date` from the different process and keyed/non-keyed sources, and combines them into a single row in the `histogram_aggregates` column, summing histogram values that share the same key.
- Only the last three versions are processed (see the sketch after this list).
- The table is overwritten at every execution of this step.
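A minimal sketch of the "last three versions" filter, assuming `latest_versions` exposes one `latest_version` per `channel` (the column names are assumptions, not the confirmed schema):

```sql
-- Keep only rows belonging to the three most recent versions per channel.
SELECT agg.*
FROM clients_daily_histogram_aggregates AS agg
JOIN latest_versions AS lv
  USING (channel)
WHERE SAFE_CAST(agg.app_version AS INT64) > lv.latest_version - 3;
```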
clients_histogram_aggregates
- This is the most important histogram table in the intermediate dataset: each row represents a `client_id` together with the cumulative sum of its histograms over the last three versions of all metrics.
- New entries from `clients_histogram_aggregates_new` are merged with the last three versions of the previous day's partition and written to the current day's partition.
- This table only holds the most recent `submission_date`, which marks the most recent date of data ingestion. A check before running this job ensures that the ETL does not skip days; in other words, the ETL only processes date `d` if the last date processed was `d-1` (see the sketch after this list).
- In case of failures in the GLAM ETL, this table must be backfilled one day at a time.
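A sketch of the "no skipped days" check, assuming the job receives its execution date as a `@logical_date` query parameter (an assumption for illustration):

```sql
-- The job for date d may only run if the newest data in the table is
-- from d-1; otherwise a backfill is required first.
SELECT
  MAX(submission_date) = DATE_SUB(@logical_date, INTERVAL 1 DAY) AS can_run
FROM clients_histogram_aggregates;
```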
clients_histogram_buckets_counts
- This process creates wildcards for `os` and `app_build_id`, which are needed later for aggregating values across OSes and build IDs (see the sketch after this list).
- It then adds a normalized histogram per client, while keeping the non-normalized histogram.
- Finally, it removes the `client_id` dimension by breaking histograms into key/value pairs and summing all values with the same key for each metric/os/version/build combination.
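A sketch of the wildcarding, using a hypothetical `per_client` input: each row is emitted once as-is and once per wildcard combination, so downstream steps can aggregate across all OSes and/or all builds.

```sql
-- '*' rows stand for "any os" / "any build". SELECT * REPLACE is
-- standard BigQuery syntax for overriding selected columns.
SELECT * FROM per_client
UNION ALL
SELECT * REPLACE ('*' AS os) FROM per_client
UNION ALL
SELECT * REPLACE ('*' AS app_build_id) FROM per_client
UNION ALL
SELECT * REPLACE ('*' AS os, '*' AS app_build_id) FROM per_client;
```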
clients_histogram_probe_counts
- This process uses the `metric_type` to select the algorithm for rebuilding histograms from the per-bucket counts of the previous step. Histograms can be `linear`, `exponential`, or `custom`.
- It then aggregates metrics per wildcard (`os`, `app_build_id`).
- Finally, it rebuilds the histograms using the Dirichlet distribution, normalized by the number of clients that contributed to each histogram in the `clients_histogram_buckets_counts` step (see the sketch after this list).
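As a rough sketch of that rebuild (hedged: the exact formula lives in the ETL's UDFs), a Dirichlet posterior mean with a uniform prior over $K$ buckets, $N$ contributing clients, and $x_k$ the summed per-client normalized counts for bucket $k$ would give:

$$\hat{p}_k = \frac{x_k + \frac{1}{K}}{N + 1}$$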
clients_daily_scalar_aggregates
- The set of steps that load data into this table is divided into non-keyed `scalar`, `keyed_boolean`, and `keyed_scalar`. The non-keyed `scalar` job creates or overwrites the partition corresponding to the `logical_date`; the other processes append data to that partition.
- The process uses `telemetry.buildhub2` to select rows with valid `build_id`s.
- Aggregations are done per client, per day, with a row for each `client_id`, `os`, `app_version`, `build_id`, and `channel` combination (see the sketch after this list).
- The queries for the different steps are generated and run as part of each step. All steps include probes regardless of process (`parent`, `content`, `gpu`).
- As a result of these subdivisions, this step generates different rows for each keyed/non-keyed, boolean/scalar metric; they are grouped together later in `clients_scalar_aggregates`.
- Clients on the release channel running Windows are sampled to reduce the data size.
- Partitions expire after 7 days.
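A minimal sketch of how one per-client daily row per `agg_type` could be produced, using a hypothetical unnested source `daily_scalars` (names are illustrative; the real queries are generated):

```sql
-- One output row per metric per agg_type for each client/day.
SELECT
  submission_date, client_id, os, app_version, build_id, channel,
  metric, 'max' AS agg_type, MAX(value) AS value
FROM daily_scalars
GROUP BY submission_date, client_id, os, app_version, build_id, channel, metric
UNION ALL
SELECT
  submission_date, client_id, os, app_version, build_id, channel,
  metric, 'sum' AS agg_type, SUM(value) AS value
FROM daily_scalars
GROUP BY submission_date, client_id, os, app_version, build_id, channel, metric;
```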
clients_scalar_aggregates
- This process groups all rows with the same `submission_date` and `logical_date` from `clients_daily_scalar_aggregates` and combines them into a single row in the `scalar_aggregates` column.
- If the `agg_type` is `count`, `sum`, `true`, or `false`, the process sums the values. If the `agg_type` is `max` it takes the maximum value, and if it is `min` it takes the minimum value (see the sketch after this list).
- Only the last three versions are processed.
- The table is partitioned by `submission_date`, and partitions expire after 7 days.
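A sketch of the `agg_type` combination rule described above (table and column names are illustrative):

```sql
-- max/min keep the extreme value; count, sum, true and false are summed.
SELECT
  client_id, os, app_version, build_id, channel, metric, key, agg_type,
  CASE agg_type
    WHEN 'max' THEN MAX(value)
    WHEN 'min' THEN MIN(value)
    ELSE SUM(value)  -- count, sum, true, false
  END AS value
FROM clients_daily_scalar_aggregates
GROUP BY client_id, os, app_version, build_id, channel, metric, key, agg_type;
```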
client_scalar_probe_counts
- This step processes booleans and scalars, although booleans are not supported by GLAM.
- For boolean metrics the process aggregates their values with the following rule: "never" if all values for a metric are false, "always" if all values are true, and "sometimes" if there's a mix.
- For `scalar` and `keyed_scalar` probes, the process starts by building the buckets per metric, then generates wildcards for `os` and `app_build_id`. It then aggregates all submissions from the same `client_id` into one row and assigns it a `user_count` value with the following rule: 10 if the os is "Windows" and the channel is "release", 1 otherwise. It finishes by aggregating the rows per metric, placing the scalar values in their appropriate buckets and summing up all `user_count` values for that metric (see the sketch after this list).
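A sketch of the `user_count` weighting rule from the text; the query shape and the `per_client_buckets` input are assumptions:

```sql
-- Sampled Windows release clients each stand in for 10 users; all other
-- clients count as 1.
SELECT
  os, app_build_id, metric, bucket,
  SUM(IF(os = 'Windows' AND channel = 'release', 10, 1)) AS user_count
FROM per_client_buckets  -- hypothetical: one row per client per bucket
GROUP BY os, app_build_id, metric, bucket;
```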
glam_sample_counts
- This process calculates the `total_sample` column (see the sketch below).
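A hedged sketch, assuming `total_sample` is the sum of all bucket values for a metric (the table and column names are assumptions):

```sql
-- Sum every bucket count for each metric/dimension combination.
SELECT
  os, app_version, app_build_id, metric,
  SUM(bucket.value) AS total_sample
FROM clients_histogram_buckets_counts
CROSS JOIN UNNEST(histogram) AS bucket
GROUP BY os, app_version, app_build_id, metric;
```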