- Introduction
  - Contents
  - Background and Caveats
  - Accessing the Data
- Data Reference
- Code Reference
The longitudinal dataset is a 1% sample of main ping data,
organized so that each row corresponds to a client_id.
If you're not sure which dataset to use for your analysis,
this is probably what you want.
Each row in the longitudinal dataset represents one client_id,
which is approximately a user.
Each column represents a field from the main ping.
Most fields contain arrays of values, with one value for each ping associated with a client_id.
Using arrays gives you access to the raw data from each ping,
but arrays can be difficult to work with from SQL.
Here's a query showing some sample data
to help illustrate.
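As a rough illustration of the array layout, here is a minimal Python sketch of a single row and one way to unpack it into per-ping records. The field names (ping_date, os_name, active_minutes) and the helper unpack_pings are hypothetical and chosen only for this example; they are not taken from the real schema.

```python
# Hypothetical row: one client with three pings.
# Every non-id field holds one value per ping, in the same order.
row = {
    "client_id": "c1",
    "ping_date": ["2017-03-03", "2017-03-02", "2017-03-01"],
    "os_name": ["Windows_NT", "Windows_NT", "Windows_NT"],
    "active_minutes": [120, 45, 300],
}


def unpack_pings(row, id_field="client_id"):
    """Turn one array-valued row into a list of per-ping dicts."""
    array_fields = [k for k in row if k != id_field]
    n_pings = len(row[array_fields[0]])
    # Each array should contain exactly one value per ping.
    if any(len(row[k]) != n_pings for k in array_fields):
        raise ValueError("array fields have mismatched lengths")
    return [
        {id_field: row[id_field], **{k: row[k][i] for k in array_fields}}
        for i in range(n_pings)
    ]


pings = unpack_pings(row)
```

Working with one record per ping is often easier than indexing into parallel arrays, at the cost of repeating the client_id on every record.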
Think of the longitudinal table as wide and short.
The dataset contains more columns than main_summary
and down-samples to 1% of all clients to reduce query computation time and save resources.
In summary, the longitudinal table differs from
main_summary in two important ways:
- The longitudinal dataset groups all data so that one row represents a client_id
- The longitudinal dataset samples to 1% of all clients
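To make the first difference concrete, the sketch below regroups ping-level rows (one row per ping) into one row per client with an array per field. It only illustrates the shape of the data; it is not how the real job is implemented, and the field names and the to_longitudinal helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical ping-level rows: one row per ping.
ping_rows = [
    {"client_id": "a", "ping_date": "2017-03-01", "active_minutes": 10},
    {"client_id": "b", "ping_date": "2017-03-01", "active_minutes": 5},
    {"client_id": "a", "ping_date": "2017-03-02", "active_minutes": 7},
]


def to_longitudinal(rows, id_field="client_id"):
    """Group ping-level rows into one row per client, one array per field."""
    grouped = defaultdict(lambda: defaultdict(list))
    for r in rows:
        for field, value in r.items():
            if field != id_field:
                # Append in input order so all arrays stay aligned per ping.
                grouped[r[id_field]][field].append(value)
    return [{id_field: cid, **dict(fields)} for cid, fields in grouped.items()]


wide_rows = to_longitudinal(ping_rows)
```

Three ping-level rows become two client-level rows here, which is the sense in which the longitudinal table is wide and short.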
Please note that this dataset only contains release (or opt-out) histograms and scalars.
longitudinal is available in re:dash,
though it can be difficult to work with the array values in SQL.
Take a look at this example query.
The data is stored as a parquet table in S3 at the following address.
longitudinal filters to
main pings from within the last 6 months.
The longitudinal dataset samples down to 1% of all clients in the above sample. The sample is generated by the following process:
- hash the client_id for each ping from the last 6 months
- project that hash onto an integer from 1 to 100, inclusive
- filter to pings with client_ids matching a 'magic number' (in this case 42)
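The three steps above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: CRC32 stands in for the hash function, which the text does not name, and the modulo projection and the helper names bucket_for and in_sample are invented for this example.

```python
import zlib

SAMPLE_BUCKET = 42  # the 'magic number' from the text
N_BUCKETS = 100


def bucket_for(client_id: str) -> int:
    """Map a client_id to an integer from 1 to 100, inclusive.

    Assumption: CRC32 is used only as a convenient, stable hash here;
    the real pipeline may hash differently.
    """
    return zlib.crc32(client_id.encode("utf-8")) % N_BUCKETS + 1


def in_sample(client_id: str) -> bool:
    """True if this client falls into the sampled bucket."""
    return bucket_for(client_id) == SAMPLE_BUCKET


# Hypothetical client ids, used only to exercise the functions.
client_ids = [f"client-{i}" for i in range(10_000)]
sampled = [c for c in client_ids if in_sample(c)]
```

Because membership depends only on the client_id itself, a client that is in the sample today stays in the sample on every later run, regardless of which other clients appear.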
This process has a couple of nice properties:
- The sample is consistent over time. The longitudinal dataset is regenerated weekly, and with this process the clients included in each run are nearly identical. The only changes come from never-before-seen clients and from clients without a ping in the last 180 days.
- We don't need to adjust the sample as new clients enter or exit our pool.
In practice, the sample is created by filtering to pings with
main_summary.sample_id == 42.
If you're working with main_summary,
you can recreate this sample by applying this filter manually.
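As a small sketch of that manual filter, the snippet below applies it to a handful of hypothetical main_summary-like rows held in plain Python. Only sample_id comes from the text; the other fields and the longitudinal_sample helper are invented for illustration. Whether sample_id is stored as a number or a string is an assumption, so the comparison converts it to an integer first.

```python
SAMPLE_ID = 42  # the bucket the longitudinal dataset keeps

# Hypothetical main_summary-like rows; only sample_id matters here.
rows = [
    {"client_id": "a", "sample_id": 42, "ping_date": "2017-03-01"},
    {"client_id": "b", "sample_id": 7, "ping_date": "2017-03-01"},
    {"client_id": "a", "sample_id": 42, "ping_date": "2017-03-02"},
    {"client_id": "c", "sample_id": "42", "ping_date": "2017-03-02"},
]


def longitudinal_sample(rows, sample_id=SAMPLE_ID):
    """Keep only the pings whose sample_id matches the sampled bucket."""
    # int() tolerates sample_id being stored as a number or a string.
    return [r for r in rows if int(r["sample_id"]) == sample_id]


kept = longitudinal_sample(rows)
clients = sorted({r["client_id"] for r in kept})
```

The same predicate carries over to a SQL WHERE clause or a DataFrame filter when you query main_summary directly.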
This dataset is generated by telemetry-batch-view. Refer to this repository for information on how to run or augment the dataset.