Longitudinal Reference

Introduction

The longitudinal dataset is a 1% sample of main ping data organized so that each row corresponds to a client_id. If you're not sure which dataset to use for your analysis, this is probably what you want.

Contents

Each row in the longitudinal dataset represents one client_id, which is approximately a user. Each column represents a field from the main ping. Most fields contain arrays of values, with one value for each ping associated with a client_id. Using arrays gives you access to the raw data from each ping, but arrays can be difficult to work with from SQL. Here's a query showing some sample data to help illustrate. Take a look at the longitudinal examples if you get stuck.
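To make the row layout concrete, here is a minimal sketch in plain Python of how per-ping records collapse into one row per client_id with one array per field. The function name, the field names (submission_date, os) and the sample values are hypothetical, and keeping values in input order is an assumption; this is not the code that generates the dataset.

```python
def to_longitudinal(pings, fields):
    """Group per-ping records into one row per client_id.

    Each field in the resulting row is a list holding one value per ping,
    in the order the pings were supplied (an assumption for this sketch).
    """
    rows = {}
    for ping in pings:
        # Create an empty row (one empty list per field) the first time
        # a client_id is seen.
        row = rows.setdefault(ping["client_id"], {f: [] for f in fields})
        for f in fields:
            row[f].append(ping.get(f))
    return rows


# Hypothetical pings: client "a" sent two pings, client "b" sent one.
pings = [
    {"client_id": "a", "submission_date": "20170501", "os": "Windows_NT"},
    {"client_id": "b", "submission_date": "20170501", "os": "Darwin"},
    {"client_id": "a", "submission_date": "20170502", "os": "Windows_NT"},
]

rows = to_longitudinal(pings, ["submission_date", "os"])
```

In this sketch, rows["a"]["submission_date"] holds two values, one for each ping from client "a", which mirrors how a single longitudinal row carries the history of one client.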

Background and Caveats

Think of the longitudinal table as wide and short. The dataset contains more columns than main_summary and down-samples to 1% of all clients to reduce query computation time and save resources.

In summary, the longitudinal table differs from main_summary in two important ways:

  • The longitudinal dataset groups all data so that one row represents a client_id
  • The longitudinal dataset samples to 1% of all client_ids

Accessing the Data

The longitudinal dataset is available in re:dash, though the array values can be difficult to work with in SQL. Take a look at this example query.

The data is stored as a parquet table in S3 at the following address. See this cookbook to get started working with the data in Spark.

s3://telemetry-parquet/longitudinal/

Data Reference

Example Queries

Take a look at the Longitudinal Examples Cookbook.

Sampling

Pings Within Last 6 Months

The longitudinal dataset includes only main pings from within the last 6 months.
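As a rough illustration of this kind of time-window filter, the sketch below keeps only dates that fall within a fixed number of days before a reference date. Treating "6 months" as 180 days is an assumption based on the 180-day figure mentioned in the sampling notes on this page, and the function and constant names are hypothetical.

```python
from datetime import date, timedelta

# Assumption: "6 months" is treated as a 180-day window.
WINDOW_DAYS = 180


def within_window(submission_date, today, window_days=WINDOW_DAYS):
    """Return True if submission_date is no older than window_days before today."""
    earliest = today - timedelta(days=window_days)
    return earliest <= submission_date <= today


# Hypothetical reference date and ping dates.
today = date(2017, 5, 1)
recent_ping = date(2017, 4, 1)
old_ping = date(2016, 10, 1)

kept = [d for d in (recent_ping, old_ping) if within_window(d, today)]
```

Here only recent_ping survives the filter, because old_ping is more than 180 days older than the reference date.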

1% Sample

The longitudinal dataset samples down to 1% of all clients in the above sample. The sample is generated by the following process:

  • Hash the client_id for each ping from the last 6 months.
  • Project that hash onto an integer from 1 to 100, inclusive.
  • Filter to pings whose client_id maps to a 'magic number' (in this case 42).
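The steps above can be sketched in a few lines of Python. The text does not name the hash function, so the use of CRC32 here is an assumption, as is the exact mapping onto the range 1 to 100; the function and constant names are hypothetical. Treat this as an illustration of hash-based sampling rather than the job's actual implementation.

```python
import zlib

MAGIC_SAMPLE_ID = 42  # the 'magic number' named in the text
N_BUCKETS = 100


def sample_bucket(client_id):
    """Map a client_id onto an integer from 1 to 100, inclusive.

    Assumption: CRC32 is used as the hash; any stable hash would
    illustrate the same idea.
    """
    return zlib.crc32(client_id.encode("utf-8")) % N_BUCKETS + 1


def in_sample(client_id):
    """Return True if the client falls into the magic bucket."""
    return sample_bucket(client_id) == MAGIC_SAMPLE_ID


# Hypothetical client_ids; keep only those in the magic bucket.
client_ids = ["client-%d" % i for i in range(1000)]
sampled = [c for c in client_ids if in_sample(c)]
```

Because the bucket depends only on the client_id, a given client is either always in the sample or never in it, no matter how many pings it sends or when the sample is regenerated.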

This process has a couple of nice properties:

  • The sample is consistent over time. The longitudinal dataset is regenerated weekly, and because a client's hash never changes, each run includes nearly the same set of clients. The only changes come from never-before-seen clients entering the sample and clients without a ping in the last 180 days dropping out.
  • We don't need to adjust the sample as new clients enter or exit our pool.

More practically, the sample is created by filtering to pings with main_summary.sample_id == 42. If you're working with main_summary, you can recreate this sample by applying the same filter yourself.

Scheduling

The longitudinal job is run weekly, early on Sunday morning UTC. The job is scheduled on Airflow. The DAG is here.

Schema

TODO(harter): https://bugzilla.mozilla.org/show_bug.cgi?id=1361862

Code Reference

This dataset is generated by telemetry-batch-view. Refer to this repository for information on how to run or augment the dataset.
