Main Ping Tables
As described in the pipeline schemas deployment docs,
data engineering has a process of generating schemas for pings and deploying them to BigQuery. The Main Ping table (telemetry.main
)
is one of those generated tables.
As the number of telemetry probes defined in Firefox grows, so does the number of columns in telemetry.main
. As of January 2021, we have nearly 10,000 columns, and we ingest many terabytes of main ping data per day. This combination of a very wide schema and a high data volume means that BigQuery has to reference metadata for a very large number of files each time it runs a query, even if it only ends up needing to read a small fraction of those files. This has led to a problematic experience for iterative analysis use cases.
To reduce the time for querying main ping data, we have included two new tables: telemetry.main_1pct
, and telemetry.main_nightly
. These can return results for simple queries in a matter of seconds where a logically equivalent query on telemetry.main
may take minutes.
Main Ping Sample: telemetry.main_1pct
This table includes a 1% sample across all channels from telemetry.main
(sample_id = 0
).
It includes 6 months of history.
It includes an additional subsample_id
field that is similar to sample_id
and allows
efficient sampling for even smaller population sizes. The following query would reflect
a 0.01% sample (one thousandth) of the total desktop Firefox population:
SELECT
COUNT(*) AS n
FROM
mozdata.telemetry.main_1pct
WHERE
subsample_id = 0
AND DATE(submission_timestamp) = '2021-01-01'
The choice of implementation for subsample_id
is not particularly well vetted;
it's simply chosen to be a hash that's stable, has a reasonable
avalanche effect, and is different from sample_id
.
The definition is MOD(ABS(FARM_FINGERPRINT(client_id)), 100) AS subsample_id
which is the same approach we use for choosing id_bucket
in Exact MAU tables.
Nightly Main Ping Data: telemetry.main_nightly
This table includes only data from the nightly release channel (normalized_channel = 'nightly'
).
It includes 6 months of history.