Cross Sectional Reference

Introduction

The cross_sectional dataset provides descriptive statistics for each client_id in a 1% sample of main ping data. This dataset simplifies the longitudinal table by replacing the longitudinal arrays with summary statistics. This is the most useful dataset for describing our user base.

Content

Each row in the cross_sectional dataset represents one client_id, which is approximately a user. Each column is a summary statistic describing the client_id.

For example, the longitudinal table has a column called geo_country which contains an array of country codes. For the same client_id, the cross_sectional table has columns called geo_country_mode and geo_country_configs, which contain the modal country and the number of distinct countries in the array, respectively.

client_id  geo_country              geo_country_mode  geo_country_configs
1          array<"US">              "US"              1
2          array<"DE", "DE", "US">  "DE"              2
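A minimal sketch of how the two summary columns relate to the longitudinal array, in plain Python. The helper name summarize is hypothetical, and the tie-breaking rule for the mode is an assumption; the actual job that builds the dataset may behave differently.

```python
from collections import Counter


def summarize(values):
    """Return (mode, number of distinct values) for a list of values."""
    counts = Counter(values)
    # most_common(1) returns the most frequent value. Ties resolve to the
    # value encountered first, which is an assumption made for this sketch.
    mode, _ = counts.most_common(1)[0]
    return mode, len(counts)


# Rows mirror the example table: client_id -> geo_country array.
longitudinal = {1: ["US"], 2: ["DE", "DE", "US"]}

# client_id -> (geo_country_mode, geo_country_configs)
cross_sectional = {cid: summarize(arr) for cid, arr in longitudinal.items()}
print(cross_sectional)
```

Each client_id collapses to one row with scalar values, which is why no array handling is needed downstream.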

Background and Caveats

This table is much easier to work with than the longitudinal dataset because you don't need to work with arrays. However, it has a limited number of pre-computed summary statistics, so your metric may not be included.

Note that this dataset is a summary of the longitudinal dataset, so it is also a 1% sample of all client_ids.

All summary statistics are computed over the last 180 days, so this dataset can be insensitive to changes over time: a recent change in a client's behavior can be outweighed by older data in the same window.
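To illustrate why a 180-day window can hide recent changes, consider a hypothetical client that moved from Germany to the US 30 days ago. The sketch below uses made-up data with one value per day, which is a simplification and not how the real arrays are built.

```python
from collections import Counter

# Hypothetical client: 150 days in "DE", then the most recent 30 days in "US".
window = ["DE"] * 150 + ["US"] * 30

# Mode over the full 180-day window versus the most recent 30 days only.
mode_180 = Counter(window).most_common(1)[0][0]
mode_recent = Counter(window[-30:]).most_common(1)[0][0]

print(mode_180)     # DE: the 180-day mode still reports the old country
print(mode_recent)  # US: a shorter window would reflect the move
```

If your analysis depends on recent behavior, a summary over 180 days may not be the right tool.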

Accessing the Data

The cross_sectional dataset is available in re:dash. Here's an example query.

The data is stored as a parquet table in S3 at the following address. See this cookbook to get started working with the data in Spark.

s3://telemetry-parquet/cross_sectional/v1/

Further Reading

The cross_sectional dataset is generated by this code. Take a look at this query for a schema.

Data Reference

Example Queries

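This query counts client_ids per operating system on the release channel. Dividing each count by the total gives the relative OS frequencies for that channel; change the channel filter to compare other channels: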

SELECT
    os_name_mode,
    COUNT(1)
FROM cross_sectional
-- Can't limit by date or app_name in the cross_sectional
WHERE os_name_mode IN ('Windows_NT', 'Darwin', 'Linux')
  AND normalized_channel = 'release'
GROUP BY 1
ORDER BY 2 DESC
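The query returns raw counts. As a sketch of turning those counts into relative frequencies, the example below runs an adapted version of the query against an in-memory SQLite table. The rows are invented for illustration, and SQLite stands in for the real query engine, whose SQL dialect may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cross_sectional "
    "(client_id TEXT, os_name_mode TEXT, normalized_channel TEXT)"
)

# Made-up rows, one per client_id.
rows = [
    ("a", "Windows_NT", "release"),
    ("b", "Windows_NT", "release"),
    ("c", "Darwin", "release"),
    ("d", "Linux", "release"),
    ("e", "Windows_NT", "beta"),  # excluded: not on the release channel
    ("f", "Android", "release"),  # excluded: OS not in the filter list
]
conn.executemany("INSERT INTO cross_sectional VALUES (?, ?, ?)", rows)

query = """
SELECT os_name_mode, COUNT(1) AS n
FROM cross_sectional
WHERE os_name_mode IN ('Windows_NT', 'Darwin', 'Linux')
  AND normalized_channel = 'release'
GROUP BY os_name_mode
ORDER BY n DESC, os_name_mode
"""
counts = conn.execute(query).fetchall()

# Divide each count by the total to get relative frequencies.
total = sum(n for _, n in counts)
frequencies = {os_name: n / total for os_name, n in counts}
print(frequencies)
```

Because the dataset is a 1% sample, the raw counts are roughly one hundredth of the population totals, but the relative frequencies need no rescaling.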

Sampling

The cross_sectional dataset is derived from the longitudinal dataset, and uses the exact same sampling. See the longitudinal documentation for details.

Scheduling

The cross_sectional dataset is generated shortly after the longitudinal dataset every Sunday. The job is scheduled on Airflow and can be found here.

Schema

TODO(harter): https://bugzilla.mozilla.org/show_bug.cgi?id=1361862

Code Reference

This dataset is generated by telemetry-batch-view. Refer to this repository for information on how to run or augment the dataset.
