The longitudinal dataset is a 1% sample of main ping data, organized so that each row corresponds to a client_id.
If you're not sure which dataset to use for your analysis,
this is probably what you want.
Each row in the longitudinal dataset represents one client_id, which is approximately a user.
Each column represents a field from the main ping.
Most fields contain arrays of values, with one value for each ping associated with a client_id.
Using arrays gives you access to the raw data from each ping,
but arrays can be difficult to work with from SQL.
Here's a query showing some sample data to help illustrate.
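For intuition, here is a minimal sketch of what one row might look like. The field names and values below are made up for illustration; they are not the real schema:

```python
# Hypothetical sketch of a single longitudinal row. There is one row
# per client_id, and each main ping field becomes an array column with
# one element per ping. Field names here are illustrative only.
row = {
    "client_id": "aaaa-bbbb-cccc",
    "submission_date": ["2017-03-01", "2017-03-02", "2017-03-05"],
    "session_length": [3600, 1200, 5400],
}

# Array columns are index-aligned: element i of every array comes from
# the same ping, so zipping them recovers per-ping records.
pings = list(zip(row["submission_date"], row["session_length"]))
```

This index alignment is what makes the arrays useful, but it is also why the dataset can be awkward to query from SQL: you typically have to unnest the arrays before filtering or aggregating.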
Background and Caveats
Think of the longitudinal table as wide and short.
The dataset contains more columns than main_summary
and down-samples to 1% of all clients to reduce query computation time and save resources.
In summary, the longitudinal table differs from
main_summary in two important ways:
- The longitudinal dataset groups all data so that one row represents a client_id
- The longitudinal dataset samples down to 1% of all client_ids
Please note that this dataset only contains release (or opt-out) histograms and scalars.
Accessing the Data
The longitudinal dataset is available in STMO,
though it can be difficult to work with the array values in SQL.
Take a look at the sample query above for an example of working with the arrays.
The data is stored as a parquet table in S3.
Pings Within Last 6 Months
The longitudinal dataset filters to
main pings from within the last 6 months.
The longitudinal dataset samples down to 1% of all clients in the above sample. The sample is generated by the following process:
- hash the client_id for each ping from the last 6 months
- project that hash onto an integer from 1 to 100, inclusive
- filter to pings whose client_id hashes to a 'magic number' (in this case, 42)
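The steps above can be sketched in Python. The MD5 hash here is an illustrative stand-in, since this page doesn't specify the exact hash function the pipeline uses:

```python
import hashlib

MAGIC_NUMBER = 42  # the single bucket kept for the 1% sample


def sample_bucket(client_id: str) -> int:
    """Hash a client_id and project the hash onto an integer in [1, 100].

    MD5 is a stand-in here; the production pipeline's hash may differ.
    """
    digest = hashlib.md5(client_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 + 1


def in_longitudinal_sample(client_id: str) -> bool:
    """True if this client's pings would land in the sampled bucket."""
    return sample_bucket(client_id) == MAGIC_NUMBER
```

Because the bucket depends only on the client_id, a given client lands in the same bucket every time, which is what makes the weekly sample stable.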
This process has a couple of nice properties:
- The sample is consistent over time. The longitudinal dataset is regenerated weekly, and with this process the clients included in each run are very similar. The only changes come from never-before-seen clients, or clients without a ping in the last 180 days.
- We don't need to adjust the sample as new clients enter or exit our pool.
In practice, the sample is created by filtering to pings with main_summary.sample_id == 42.
If you're working with main_summary, you can recreate this sample by applying the same filter manually.
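As a sketch, assuming main_summary rows are available as plain records (the records below are fabricated for illustration):

```python
# Fabricated stand-ins for main_summary rows; only the fields used here.
main_summary = [
    {"client_id": "a", "sample_id": 42},
    {"client_id": "b", "sample_id": 7},
    {"client_id": "c", "sample_id": 42},
]

# Recreate the longitudinal sample: keep only the sample_id == 42 bucket.
longitudinal_sample = [row for row in main_summary if row["sample_id"] == 42]
```

In SQL this would simply be a `WHERE sample_id = 42` clause on main_summary.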
The longitudinal job runs weekly, early on Sunday morning UTC.
The job is scheduled as a DAG on Airflow.
This dataset is generated by telemetry-batch-view. Refer to this repository for information on how to run or augment the dataset.