GLAM datasets
GLAM aims to answer most of the "easy" questions about how a probe or metric has changed over time. If you need more exploration than the GLAM UI offers, the GLAM aggregation tables give direct access to the data that drives it. Querying these tables takes you somewhat further, but they still have limitations: if you need to dive deeper or aggregate on a field that isn't included here, consider reading Visualizing Percentiles of a Main Ping Exponential Histogram.
The GLAM tables:
- Are aggregated at the client level, not the submission ping level
- Provide a set of dimensions for subsetting: channel, OS, process, or ping type
- Are aggregated by build ID and version
- For each aggregation, the distribution and percentiles over time are calculated
- Have the last 3 versions of data aggregated every day
- Retain data for up to 10 major versions
Firefox Desktop
Data source table
moz-fx-data-shared-prod.telemetry.client_probe_counts
Data reference
`os`
: One of Windows, Mac, Linux, or NULL for all OSes

`app_version`
: Integer representing the major version

`app_build_id`
: The full build ID, or NULL if aggregated by major version

`channel`
: One of nightly, beta, or release

`metric`
: The name of the metric

`metric_type`
: The type of the metric, e.g. `histogram-enumerated`

`key`
: The key, if the metric is a keyed metric

`process`
: The process the metric was recorded in

`client_agg_type`
: The type of client aggregation used, e.g. `summed_histogram`

`agg_type`
: One of `histogram` or `percentiles`, representing what data is in the `aggregates` column

`total_users`
: The number of users that submitted data for the combination of dimensions

`aggregates`
: The data as a key/value record, either percentiles or histogram
Sample query
SELECT
os,
app_version,
app_build_id,
channel,
metric,
metric_type,
key,
process,
client_agg_type,
agg_type,
total_users,
mozfun.glam.histogram_cast_json(aggregates) AS aggregates
FROM
`moz-fx-data-shared-prod.telemetry.client_probe_counts`
WHERE
metric="checkerboard_severity"
AND channel="nightly"
AND os IS NULL
AND process="parent"
AND app_build_id IS NULL
Notes:
- To query all OSes, use:
WHERE os IS NULL
- To query by build ID, use:
WHERE app_build_id IS NOT NULL
- To query by version, use:
WHERE app_build_id IS NULL
Firefox for Android
Data source tables
org_mozilla_fenix_glam_release__view_probe_counts_v1
org_mozilla_fenix_glam_beta__view_probe_counts_v1
org_mozilla_fenix_glam_nightly__view_probe_counts_v1
Data reference
`os`
: Just "Android" for now

`app_version`
: Integer representing the major version

`app_build_id`
: The full build ID, or "*" if aggregated by major version

`channel`
: Always "*"; use the different source tables to select a channel

`metric`
: The name of the metric

`metric_type`
: The type of the metric, e.g. `timing_distribution`

`key`
: The key, if the metric is a keyed metric

`ping_type`
: The type of ping, or "*" for all ping types

`client_agg_type`
: The type of client aggregation used, e.g. `summed_histogram`

`agg_type`
: One of `histogram` or `percentiles`, representing what data is in the `aggregates` column

`total_users`
: The number of users that submitted data for the combination of dimensions

`aggregates`
: The data as a key/value record, either percentiles or histogram
Sample query
SELECT
ping_type,
os,
app_version,
app_build_id,
metric,
metric_type,
key,
client_agg_type,
agg_type,
total_users,
  mozfun.glam.histogram_cast_json(aggregates) AS aggregates
FROM
`moz-fx-data-shared-prod.glam_etl.org_mozilla_fenix_glam_release__view_probe_counts_v1`
WHERE
metric="performance_time_dom_complete"
AND os="Android"
AND ping_type="*"
AND app_build_id!="*"
Notes:
- To query all ping types, use:
WHERE ping_type = "*"
- To query by build ID, use:
WHERE app_build_id != "*"
- To query by version, use:
WHERE app_build_id = "*"
GLAM Intermediate Tables
In addition to the above tables, the GLAM ETL produces intermediate tables that can be useful outside of the GLAM ETL in some cases. These tables include the client ID, so they can be joined with other tables to filter on client-based data such as specific hardware (see the sketch below).
Firefox Desktop
Data sources:
moz-fx-data-shared-prod.telemetry.clients_daily_histogram_aggregates
moz-fx-data-shared-prod.telemetry.clients_daily_scalar_aggregates
These tables are:
- Preprocessed from main telemetry into intermediate data with one row per client, per metric, per day, which is then aggregated and normalized across clients.
- Analogous to clients_daily, with:
  - all metrics aggregated
  - min, max, average, sum, and count aggregations for each scalar
  - each histogram aggregated over all client data per day
  - each date further aggregated over the dimensions: channel, os, version, build ID
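For example, a minimal sketch of joining the histogram intermediate table to `telemetry.clients_daily` to filter by hardware; the hardware column shown is illustrative, so check the table schemas for the exact fields:

SELECT
  agg.*
FROM
  `moz-fx-data-shared-prod.telemetry.clients_daily_histogram_aggregates` AS agg
JOIN
  `moz-fx-data-shared-prod.telemetry.clients_daily` AS cd
  USING (submission_date, client_id)
WHERE
  cd.cpu_cores = 8 -- illustrative hardware filter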
ETL Pipeline
Scheduling
GLAM is scheduled to run daily via Airflow. There are two separate ETL pipelines: one that computes GLAM datasets for legacy Firefox Desktop telemetry, and one for Glean apps such as Fenix and Firefox on Glean.
Source Code
The ETL code base lives in the bigquery-etl repository and is partially generated. The scripts that generate the ETL queries for legacy Firefox Desktop currently live here, while the GLAM logic for Glean apps lives here.
Steps
GLAM has a separate set of steps and intermediate tables to aggregate scalar and histogram probes.
latest_versions
- This task pulls in the most recent version for each channel from https://product-details.mozilla.org/1.0/firefox_versions.json
clients_daily_histogram_aggregates_<process>
- The set of steps that load data into this table is split by process (`parent`, `content`, `gpu`), plus a "keyed" step for keyed histograms.
- The parent job creates or overwrites the partition corresponding to the `logical_date`; the other processes append data to that partition.
- The process uses `telemetry.buildhub2` to select rows with valid `build_id`s.
- Aggregations are done per client, per day, with a row for each `submission_date`, `client_id`, `os`, `app_version`, `build_id`, and `channel`.
- The aggregation adds histogram values with the same key within the dimensions listed above (a sketch follows this list).
- The queries for the different steps are generated and run as part of each step.
- The "keyed" step includes all keyed histogram probes, regardless of process (`parent`, `content`, `gpu`).
- Because of these subdivisions, this step generates separate rows per process and per keyed/non-keyed metric; these are grouped back together in the `clients_histogram_aggregates` step.
- Clients on the release channel on Windows are sampled to reduce the data size.
- The partitions are set to expire after 7 days.
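As an illustration, a minimal sketch of the per-client, per-day aggregation, assuming a hypothetical flattened shape in which each histogram bucket is a (key, value) row (the real tables store histograms as nested records):

SELECT
  submission_date,
  client_id,
  os,
  app_version,
  build_id,
  channel,
  metric,
  key,                 -- histogram bucket
  SUM(value) AS value  -- add values with the same key
FROM
  per_ping_histogram_rows -- hypothetical input
GROUP BY
  submission_date, client_id, os, app_version, build_id, channel, metric, key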
clients_histogram_aggregates_new
- This step groups together all rows that have the same `submission_date` and `logical_date` from the different process, keyed, and non-keyed sources, and combines them into a single row in the `histogram_aggregates` column, summing histogram values with the same key (see the sketch below).
- This process is only applied to the last three versions.
- The table is overwritten at every execution of this step.
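A minimal sketch of this combine step, assuming the `mozfun.map.sum` UDF (which sums map entries by key) and hypothetical input/column names:

SELECT
  client_id,
  metric,
  -- concatenate the per-source histograms, then sum values with the same key
  mozfun.map.sum(ARRAY_CONCAT_AGG(histogram)) AS histogram_aggregates
FROM
  per_process_and_keyed_rows -- hypothetical union of the per-process sources
GROUP BY
  client_id, metric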
clients_histogram_aggregates
- New entries from `clients_histogram_aggregates_new` are merged with the last 3 versions of the previous day's partition and written to the current day's partition.
- The most recent partition contains the current snapshot of the last three versions of data.
- The partitions expire in 7 days.
clients_histogram_buckets_counts
- This process starts by creating wildcards for `os` and `app_build_id`, which are needed later for aggregating values across OSes and build IDs.
- It then filters out builds that have less than 0.5% of WAU (a threshold that can vary per channel); see https://github.com/mozilla/glam/issues/1575#issuecomment-946880387 for background.
- The process then normalizes histograms per client: the sum of each client's histogram values for a given metric is set to 1 (see the sketch below).
- Finally, it removes the `client_id` dimension by aggregating all histograms for a given metric, adding the clients' histogram values.
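A minimal sketch of the per-client normalization, again assuming a hypothetical flattened (client_id, metric, key, value) shape:

SELECT
  client_id,
  metric,
  key,
  -- scale each bucket so a client's histogram sums to 1
  value / SUM(value) OVER (PARTITION BY client_id, metric) AS normalized_value
FROM
  clients_histogram_rows -- hypothetical input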
clients_histogram_probe_counts
- This process generates buckets, which can be linear or exponential, based on the `metric_type`.
- It then aggregates metrics over the wildcard dimensions (`os`, `app_build_id`).
- Finally, it rebuilds histograms using the Dirichlet distribution, normalized by the number of clients that contributed to that histogram in the `clients_histogram_buckets_counts` step.
histogram_percentiles
- Uses the `mozfun.glam.percentile` UDF to build histogram percentiles from 0.1 to 99.9.
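For example, a hedged sketch of computing a single percentile with that UDF; the argument order shown (percentile, histogram, metric type) should be checked against the mozfun documentation, and the filters are illustrative:

SELECT
  metric,
  -- sketch only: assumes glam.percentile(percentile, histogram, type)
  mozfun.glam.percentile(95.0, aggregates, metric_type) AS p95
FROM
  `moz-fx-data-shared-prod.telemetry.client_probe_counts`
WHERE
  metric = "checkerboard_severity"
  AND channel = "nightly"
  AND agg_type = "histogram"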
clients_daily_scalar_aggregates
- The set of steps that load data into this table is divided into non-keyed `scalar`, `keyed_boolean`, and `keyed_scalar`. The non-keyed scalar job creates or overwrites the partition corresponding to the `logical_date`; the other processes append data to that partition.
- The process uses `telemetry.buildhub2` to select rows with valid `build_id`s.
- Aggregations are done per client, per day, with a row for each `client_id`, `os`, `app_version`, `build_id`, and `channel` (a sketch follows this list).
- The queries for the different steps are generated and run as part of each step. All steps include probes regardless of process (`parent`, `content`, `gpu`).
- Because of these subdivisions, this step generates separate rows for each keyed/non-keyed, boolean/scalar metric; these are grouped back together in `clients_scalar_aggregates`.
- Clients on the release channel on Windows are sampled to reduce the data size.
- Partitions expire in 7 days.
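A minimal sketch of the per-client, per-day scalar aggregation, with hypothetical input/column names:

SELECT
  submission_date,
  client_id,
  os,
  app_version,
  build_id,
  channel,
  metric,
  MIN(value) AS min,
  MAX(value) AS max,
  AVG(value) AS avg,
  SUM(value) AS sum,
  COUNT(value) AS count
FROM
  per_ping_scalar_rows -- hypothetical input
GROUP BY
  submission_date, client_id, os, app_version, build_id, channel, metric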
clients_scalar_aggregates
- The process starts by taking `clients_daily_scalar_aggregates` as the primary source.
- It then groups all rows that have the same `submission_date` and `logical_date` from the keyed and non-keyed, scalar and boolean sources, and combines them into a single row in the `scalar_aggregates` column.
- If the `agg_type` is `count`, `sum`, `true`, or `false`, the process sums the values.
- If the `agg_type` is `max`, it takes the maximum value; if it is `min`, it takes the minimum (see the sketch below).
- This process is only applied to the last three versions.
- The partitions expire in 7 days.
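A minimal sketch of the combine rule, assuming a hypothetical flattened (metric, agg_type, value) shape:

SELECT
  client_id,
  metric,
  agg_type,
  CASE
    WHEN agg_type IN ('count', 'sum', 'true', 'false') THEN SUM(value)
    WHEN agg_type = 'max' THEN MAX(value)
    WHEN agg_type = 'min' THEN MIN(value)
  END AS value
FROM
  unnested_scalar_aggregates -- hypothetical input
GROUP BY
  client_id, metric, agg_type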
scalar_percentiles
- This process produces a user count and percentiles for scalar metrics.
- It generates wildcard combinations of `os` and `app_build_id`, and merges all submissions from a client for the same `os`, `app_version`, `app_build_id`, and `channel` into the `scalar_aggregates` column.
- The `user_count` column is computed taking sampling into account.
- Finally, it splits the aggregates into percentiles from 0.1 to 99.9.
client_scalar_probe_counts
- This step processes booleans and scalars, although booleans are not supported by GLAM.
- For boolean metrics, the process aggregates values with the following rule: "never" if all values for a metric are false, "always" if all are true, and "sometimes" if there is a mix (see the sketch below).
- For scalar and `keyed_scalar` probes, the process starts by building the buckets per metric, then generates wildcards for `os` and `app_build_id`. It then aggregates all submissions from the same `client_id` under one row and assigns the `user_count` column with the following rule: 10 if the os is "Windows" and the channel is "release", 1 otherwise. It finishes by aggregating the rows per metric, placing the scalar values into their appropriate buckets and summing up all `user_count` values for each metric.
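A minimal sketch of the boolean roll-up and the sampling weight, with hypothetical input/column names:

SELECT
  client_id,
  metric,
  CASE
    WHEN LOGICAL_AND(NOT value) THEN 'never'   -- all false
    WHEN LOGICAL_AND(value) THEN 'always'      -- all true
    ELSE 'sometimes'                           -- a mix
  END AS boolean_agg,
  -- sampled Windows/release clients each stand in for 10 users
  IF(os = 'Windows' AND channel = 'release', 10, 1) AS user_count
FROM
  clients_boolean_rows -- hypothetical input
GROUP BY
  client_id, metric, os, channel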
glam_user_counts
- Combines both aggregated scalar and histogram values.
- This process produces a user count for each combination of `os`, `app_version`, `app_build_id`, and `channel`.
- It builds a client count from the union of histograms and scalars, including all combinations in which `os`, `app_version`, `app_build_id`, and `channel` are wildcards (see the sketch below).
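A minimal sketch of the client count over the union of sources (wildcard handling omitted, table shapes simplified):

SELECT
  os,
  app_version,
  app_build_id,
  channel,
  COUNT(DISTINCT client_id) AS total_users
FROM
  (
    SELECT client_id, os, app_version, app_build_id, channel
    FROM clients_histogram_aggregates
    UNION ALL
    SELECT client_id, os, app_version, app_build_id, channel
    FROM clients_scalar_aggregates
  )
GROUP BY
  os, app_version, app_build_id, channel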
glam_sample_counts
- This process calculates the `total_sample` column by adding up all the `aggregates` values (see the sketch below).
- This works because, in the primary sources, the values also represent a count of the samples that registered their respective keys.
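A minimal sketch, assuming `aggregates` is an array of (key, value) records and a hypothetical source table:

SELECT
  metric,
  -- sum every bucket value of every histogram for the metric
  SUM((SELECT SUM(a.value) FROM UNNEST(aggregates) AS a)) AS total_sample
FROM
  probe_counts -- hypothetical input
GROUP BY
  metric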
extract_user_counts
- This step exports user counts, in their final shape, to GCS as CSV files.
- It first copies a deduplicated version of the primary source to a temporary table, removes the previously exported CSV files from GCS, and then exports the temporary table to GCS as CSV files (see the sketch below).
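A minimal sketch of the final export using BigQuery's `EXPORT DATA` statement; the bucket path and source table are hypothetical:

EXPORT DATA OPTIONS (
  uri = 'gs://example-glam-bucket/user-counts/*.csv', -- hypothetical bucket
  format = 'CSV',
  overwrite = true,
  header = true
) AS
SELECT DISTINCT
  *
FROM
  glam_user_counts_temp -- hypothetical deduplicated temporary table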