The cross_sectional dataset provides descriptive statistics
for each client_id in a 1% sample of main ping data.
This dataset simplifies the longitudinal table by replacing
the longitudinal arrays with summary statistics.
This is the most useful dataset for describing our user base.
Each row in the
cross_sectional dataset represents one client_id,
which is approximately a user.
Each column is a summary statistic describing the client_id.
For example, the longitudinal table has a column
which contains an array of country codes.
For the same client_id, the cross_sectional dataset
has columns
containing single summary statistics for that array:
the modal country and the number of distinct countries in the array.
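| client_id | country array (longitudinal) | modal country | distinct countries |
|---|---|---|---|
| 2 | array<"DE", "DE", "US"> | "DE" | 2 |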
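To make the relationship between the array and its summary columns concrete, here is a minimal Python sketch of how a modal value and a distinct count can be derived from one client's array. The function name `summarize` is hypothetical, and the tie-breaking rule (first value seen wins) is an assumption; the actual job may resolve ties differently.

```python
from collections import Counter


def summarize(values):
    """Collapse an array of values into (mode, number of distinct values)."""
    counts = Counter(values)
    # most_common(1) returns [(value, count)]; on ties, the value seen first wins.
    mode = counts.most_common(1)[0][0]
    return mode, len(counts)


# The country array from the example row above.
countries = ["DE", "DE", "US"]
mode, distinct = summarize(countries)
print(mode, distinct)  # DE 2
```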
This table is much easier to work with than the longitudinal dataset because you don't need to work with arrays. However, it has a limited number of pre-computed summary statistics, so your metric may not be included.
Note that this dataset is a summary of the longitudinal dataset, so it is also a 1% sample of all client_ids.
All summary statistics are computed over the last 180 days, so this dataset can be insensitive to changes over time.
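The following sketch illustrates why a 180-day summary can hide a recent change. It assumes, purely for simplicity, one country value per day for a hypothetical client who moved 30 days ago; the real arrays are not necessarily one value per day.

```python
from collections import Counter

WINDOW_DAYS = 180

# Hypothetical client: 150 days in DE, then 30 days in US.
daily_country = ["DE"] * 150 + ["US"] * 30
assert len(daily_country) == WINDOW_DAYS

# The mode over the whole window still reports the old country.
window_mode = Counter(daily_country).most_common(1)[0][0]
current_value = daily_country[-1]
print(window_mode, current_value)  # DE US
```

The modal value only flips once the new value accounts for more of the window than the old one, which in this toy case takes more than 90 days.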
The cross_sectional dataset is available in re:dash. Here's an example query.
The data is stored as a parquet table in S3 at the following address. See this cookbook to get started working with the data in Spark.
This query counts release-channel clients by their modal OS, which gives the relative OS frequencies for that channel:
    SELECT os_name_mode,
           COUNT(1)
    FROM cross_sectional
    -- Can't limit by date or app_name in the cross_sectional
    WHERE os_name_mode IN ('Windows_NT', 'Darwin', 'Linux')
      AND normalized_channel = 'release'
    GROUP BY 1
    ORDER BY 2 DESC
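The query returns raw counts, so turning them into relative frequencies takes one more step. The sketch below runs the same query against a toy in-memory SQLite table and divides each count by the total. SQLite is only a stand-in here so the example is self-contained, and the sample rows are made up; the real table is queried through re:dash.

```python
import sqlite3

# Toy stand-in for cross_sectional with only the two columns the query uses.
rows = [
    ("Windows_NT", "release"),
    ("Windows_NT", "release"),
    ("Windows_NT", "release"),
    ("Darwin", "release"),
    ("Linux", "beta"),       # excluded by the channel filter
    ("Android", "release"),  # excluded by the OS filter
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cross_sectional (os_name_mode TEXT, normalized_channel TEXT)"
)
conn.executemany("INSERT INTO cross_sectional VALUES (?, ?)", rows)

counts = conn.execute(
    """
    SELECT os_name_mode, COUNT(1)
    FROM cross_sectional
    WHERE os_name_mode IN ('Windows_NT', 'Darwin', 'Linux')
      AND normalized_channel = 'release'
    GROUP BY 1
    ORDER BY 2 DESC
    """
).fetchall()

# Convert counts to shares of the filtered population.
total = sum(n for _, n in counts)
shares = {os_name: n / total for os_name, n in counts}
print(shares)  # {'Windows_NT': 0.75, 'Darwin': 0.25}
```

Note that the shares are relative to the clients that pass the filters, not to the whole sample.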
The cross_sectional dataset is derived from the longitudinal dataset
and uses the exact same sampling.
See the longitudinal documentation for details.
The cross_sectional dataset is generated shortly after the longitudinal
dataset every Sunday.
The job is scheduled on Airflow.
This dataset is generated by telemetry-batch-view. Refer to this repository for information on how to run or augment the dataset.