Jetstream datasets
- Statistics tables
- Client-window aggregate tables
- Enrollment tables
- Scheduling
- Code reference
- Documentation
Statistical summaries of telemetry data from experiments run in Mozilla products are provided by Jetstream. These summaries are published to BigQuery and serve both as the substrate for the result visualization platform and as a resource for data scientists.
Jetstream runs as part of the nightly ETL job (see Scheduling below).
Jetstream is also run after pushes to the jetstream-config
repository.
Jetstream publishes tables to the dataset moz-fx-data-experiments.mozanalysis
.
Experiments are analyzed using the concept of analysis windows. Analysis windows describe an interval marked from each client's day of enrollment. The "day 0" analysis window aggregates data from the days that each client enrolled in the experiment. Because the intervals are demarcated from enrollment, they are not calendar dates; for some clients in an experiment, day 0 could be a Tuesday, and for others a Saturday.
The week 0 analysis window aggregates data from each client's days 0 through 6, the week 1 window aggregates data from days 7 through 13, and so on.
Clients are given a fixed amount of time, specified in Experimenter and often a week long, to enroll. Final day 0 results are available for reporting at the end of the enrollment period, after the last eligible client has enrolled, and week 0 results are available a week after the enrollment period closes. Results for each window are published as soon as complete data is available for all enrolled clients.
The "overall" window, published after the experiment has ended, is a window beginning on each client's day 0 that spans the longest period for which all clients have complete data.
Jetstream computes statistics over several metrics by default, including
for any features associated with the experiment in Experimenter. Data
scientists can provide configuration to add additional metrics. Advice
on configuring Jetstream can be found at the jetstream-config
repository.
Statistics tables
The statistics tables contain statistical summaries of their corresponding aggregate tables. These tables are suitable for plotting directly without additional transformations.
Statistics tables are named like:
statistics_<slug>_{day, week, overall}_<index>
A view is also created that concatenates all statistics tables for an experiment of a given period type, named like:
statistics_<slug>_{daily, weekly, overall}
Statistics tables have the schema:
Column name | Type | Description |
---|---|---|
segment | STRING | The segment of the population being analyzed. "all" for the entire population. |
metric | STRING | The slug of the metric, like active_ticks or retained |
statistic | STRING | The slug of the statistic that was used to summarize the metric, like "mean" or "deciles" |
parameter | NUMERIC (decimal) | A statistic-dependent quantity. For two-dimensional statistics like "decile," this represents the x axis of the plot. For one-dimensional statistics, this is NULL. |
comparison | STRING | If this row represents a comparison between two branches, this row describes what kind of comparison, like difference or relative_uplift . If this row represents a measurement of a single branch, then this column is NULL. |
comparison_to_branch | STRING | If this row represents a comparison between two branches, this row describes which branch is being compared to. For simple A/B tests, this will be "control." |
ci_width | FLOAT64 | A value between 0 and 1 describing the width of the confidence interval represented by the lower and upper columns. Valued at 0.95 for 95% confidence intervals. |
point | FLOAT64 | The point estimate of the statistic for the metric given the parameter. |
lower | FLOAT64 | The lower bound of the confidence interval for the estimate. |
upper | FLOAT64 | The upper bound of the confidence interval for the estimate. |
window_index | INT64 | (views only) A base-1 index reflecting the analysis window from which the row is drawn (i.e. day 1, day 2, …). |
analysis_basis | STRING | Analysis basis statistic result is based on. Currently, analysis_basis can be either enrollments or exposures . |
Each combination of (segment, metric, statistic, parameter, comparison, comparison_to_branch, ci_width, analysis_basis)
uniquely describes a single data
point.
The available segments in a table should be derived from inspection of the table.
Jetstream's Github wiki has a description of each statistic and comparison.
Examples
To extract the mean of active_hours
for each branch from a weekly
statistics view with a name like statistics_bug_12345_slug_weekly
,
you could run the query:
SELECT
segment,
window_index AS week,
branch,
point,
lower,
upper
FROM `moz-fx-data-experiments`.mozanalysis.statistics_bug_12345_slug_weekly
WHERE
metric = "active_hours"
AND statistic = "mean"
AND comparison IS NULL
This query would return a row for each user segment, for each week of
the experiment, for each branch, with the mean of the active_hours
metric.
To see whether the absolute difference of the mean of active_hours
was
different between the control and treatment branches, you could run:
SELECT
window_index AS week,
branch,
point,
lower,
upper
FROM `moz-fx-data-experiments`.mozanalysis.statistics_bug_12345_slug_weekly
WHERE
metric = "active_hours"
AND statistic = "mean"
AND comparison = "difference"
AND branch = "treatment"
AND comparison_to_branch = "control"
AND segment = "all"
This query would return a row for each week of the experiment containing an estimate of the absolute difference between the treatment and control branches for the segment containing all users.
Client-window aggregate tables
The aggregate tables contain one row per enrolled client_id
. An
aggregate table is written for each analysis window. The statistics
tables are derived from the aggregate tables. The aggregate tables are
less useful without additional processing but they may be useful for
diagnostics.
Aggregate tables are named like:
<slug>_<analysis_basis>_{day,week,overall}_<index>
Aggregate tables have flexible schemas. Every table contains the columns:
Column name | Type | Description |
---|---|---|
client_id | STRING | Client's telemetry client_id |
branch | STRING | Branch client enrolled in |
enrollment_date | DATE | First date that the client enrolled in the branch |
exposure_date | DATE | First date that the client saw the exposure event (Optional) |
num_enrollment_events | INT64 | Number of times a client enrolled in the given branch |
num_exposure_events | INT64 | Number of times a client has seen the exposure event |
analysis_window_start | INT64 | The day after enrollment that this analysis window began; day 0 is the day of enrollment |
analysis_window_end | INT64 | The day after enrollment that this analysis window terminated (inclusive) |
The combination of (client_id, branch)
is unique.
Each metric associated with the experiment defines an additional (arbitrarily-typed) column.
Each data source associated with the experiment defines additional
<data_source>_has_contradictory_branch
and
<data_source>_has_non_enrolled_data
columns, which respectively
indicate whether client_id
reported data from more than one branch or
without any tagged branch in that dataset over that analysis window.
Each segment associated with the experiment defines an additional boolean column.
Enrollment tables
Enrollment tables contain enrollment information per client_id
for which
an enroll
event has been received. An enrollment table for a specific experiment
is created once after the enrollment period has completed. The enrollment table
is then re-used in sub-sequent analysis runs.
Enrollment tables are named like:
enrollments_<slug>
Enrollment tables have flexible schemas, but every table contains the columns:
Column name | Type | Description |
---|---|---|
client_id | STRING | Client's telemetry client_id |
branch | STRING | Branch client enrolled in |
enrollment_date | DATE | First date that the client enrolled in the branch |
num_enrollment_events | INT64 | Number of times a client enrolled in the given branch |
The combination of (client_id, branch)
is unique.
Each segment defines an additional non-NULL boolean column per segment which is set to
true
if the client is in the segment and false
otherwise.
Scheduling
Jetstream is updated nightly by telemetry-airflow.
It is invoked by the jetstream
DAG.
Code reference
Jetstream's datasets are generated by invoking Jetstream.
Data scientists can configure Jetstream or trigger a Jetstream invocation
by interacting with the jetstream-config
repository.
Documentation
Additional documentation about Jetstream can be found in the Jetstream Experimenter Docs.