Longitudinal Reference

Introduction
Data Reference
Code Reference

Introduction

The longitudinal dataset is a 1% sample of main ping data organized so that each row corresponds to a client_id. If you're not sure which dataset to use for your analysis, this is probably what you want.

Each row in the longitudinal dataset represents one client_id, which is approximately a user. Each column represents a field from the main ping. Most fields contain arrays of values, with one value for each ping associated with a client_id. Using arrays gives you access to the raw data from each ping, but arrays can be difficult to work with from SQL. Here's a query showing some sample data (STMO#4188) to help illustrate.

Background and Caveats

Think of the longitudinal table as wide and short. The dataset contains more columns than main_summary and down-samples to 1% of all clients to reduce query computation time and save resources.

In summary, the longitudinal table differs from main_summary in two important ways:

The longitudinal dataset groups all data so that one row represents a client_id
The longitudinal dataset samples to 1% of all client_ids
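
To make the one-row-per-client layout concrete, here is a minimal Python sketch that groups ping-level records into per-client arrays. The field names other than client_id (os, session_length), the sample records, and the to_longitudinal helper are all hypothetical and chosen only for illustration. This is not the job that builds the dataset, which has many more columns.

```python
from collections import defaultdict

# Hypothetical ping-level records, one dict per main ping.
pings = [
    {"client_id": "a", "os": "Linux", "session_length": 120},
    {"client_id": "b", "os": "Windows_NT", "session_length": 45},
    {"client_id": "a", "os": "Linux", "session_length": 300},
]


def to_longitudinal(pings):
    """Group pings so each client_id maps to one row of per-field arrays."""
    rows = defaultdict(lambda: defaultdict(list))
    for ping in pings:
        for field, value in ping.items():
            if field != "client_id":
                # Append this ping's value to the client's array for the field.
                rows[ping["client_id"]][field].append(value)
    # Convert nested defaultdicts to plain dicts for readability.
    return {cid: dict(fields) for cid, fields in rows.items()}


longitudinal = to_longitudinal(pings)
```

In this sketch, client "a" ends up with a single row whose session_length field holds one value per ping, which mirrors the array-valued columns described above.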

Please note that this dataset only contains release (or opt-out) histograms and scalars.

Accessing the Data

The longitudinal dataset is available in STMO, though it can be difficult to work with the array values in SQL. Take a look at STMO#4189.

The data is stored as a Parquet table in S3 at the following address:

s3://telemetry-parquet/longitudinal/

Data Reference

Sampling

Pings Within Last 6 Months

The longitudinal dataset only includes main pings from within the last 6 months.

1% Sample

The longitudinal dataset samples down to 1% of all clients that remain after the 6-month filter above. The sample is generated by the following process:

Hash the client_id for each ping from the last 6 months.
Project that hash onto an integer from 1 to 100, inclusive.
Filter to pings with client_ids matching a 'magic number' (in this case 42).
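
The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the production job: the text does not name the hash function, so CRC32 from the standard library stands in for it, and the helper names sample_bucket and in_sample are hypothetical.

```python
import zlib

MAGIC_NUMBER = 42  # the 'magic number' named in the text


def sample_bucket(client_id: str) -> int:
    """Hash a client_id and project it onto an integer in 1..100, inclusive."""
    # Assumption: CRC32 is a stand-in for whichever hash the real job uses.
    digest = zlib.crc32(client_id.encode("utf-8"))
    # digest % 100 is in 0..99; adding 1 shifts it to 1..100.
    return digest % 100 + 1


def in_sample(client_id: str) -> bool:
    """Return True if this client_id falls in the sampled bucket."""
    return sample_bucket(client_id) == MAGIC_NUMBER
```

Because the bucket depends only on the client_id, a given client lands in the same bucket on every run.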

This process has a couple of nice properties:

The sample is consistent over time. The longitudinal dataset is regenerated weekly, and this process keeps the set of included clients very similar from run to run. The only changes come from never-before-seen clients entering the sample and clients without a ping in the last 180 days dropping out.
We don't need to adjust the sample as new clients enter or exit our pool.

More practically, the sample is created by filtering to pings with main_summary.sample_id == 42. If you're working with main_summary, you can recreate this sample by doing this filter manually.
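
As a rough illustration of that manual filter, the sketch below applies it to a few in-memory records. The rows and the longitudinal_sample helper are hypothetical; only the sample_id field and the value 42 come from the text. In practice you would apply the same condition in your SQL or Spark query against main_summary.

```python
# Hypothetical main_summary-style rows; only sample_id comes from the text.
rows = [
    {"client_id": "a", "sample_id": 42},
    {"client_id": "b", "sample_id": 7},
    {"client_id": "c", "sample_id": 42},
]


def longitudinal_sample(rows, sample_id=42):
    """Keep only the rows whose sample_id matches the sampled bucket."""
    return [row for row in rows if row["sample_id"] == sample_id]


sampled = longitudinal_sample(rows)
```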