SSL Ratios
Introduction
The public SSL dataset publishes the percentage of page loads Firefox users have performed that were conducted over SSL. This dataset is used to produce graphs like Let's Encrypt's to determine SSL adoption on the Web over time.
Content
The public SSL dataset is a table where each row is a distinct set of dimensions, with its associated SSL statistics. The dimensions are submission_date, os, and country. The statistics are reporting_ratio, normalized_pageloads, and ratio.
Background and Caveats
- We're using normalized values in normalized_pageloads to obscure absolute page load counts.
- This is across the entirety of release, not per-version, because we're looking at Web health, not Firefox user health.
- Any dimension tuple (any given combination of submission_date, os, and country) with fewer than 5000 page loads is omitted from the dataset.
- This is hopefully just a temporary dataset to stopgap release aggregates going away until we can come up with a better way to publicly publish datasets.
Accessing the Data
For details on accessing the data, please look at bug 1414839.
Data Reference
Combining Rows
This is a dataset of ratios. You can't combine ratios if they have different bases. For example, if 50% of 10 loads (5 loads) were SSL and 5% of 20 loads (1 load) were SSL, you cannot calculate that 20% (6 loads) of the total loads (30 loads) were SSL unless you know that the 50% was for 10 and the 5% was for 20.
If you're reluctant, for product reasons, to share the numbers 10 and 20, this gets tricky.
So what we've done is normalize the whole batch of 30 down to 1.0. That means we tell you that 50% of one-third of the loads (0.333...) was SSL and 5% of the other two-thirds of the loads (0.666...) was SSL. Then you can figure out the overall 20% figure by this calculation:
(0.5 * 0.333 + 0.05 * 0.666) / (0.333 + 0.666) = 0.2
Notice that you must divide by the sum of the normalized pageloads (0.333 + 0.666) in order to "unnormalize" the result into the true ratio. (In this toy example we're summing across all dimensions so the sum of all included normalized pageloads was 1.0.)
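The toy example can be sketched in JavaScript as follows. The raw counts (10 and 20 loads) and the helper name combineRatios are illustrative assumptions. The real dataset never exposes raw counts, only the ratio and the normalized share.

```javascript
// Hypothetical raw counts from the toy example; the published dataset hides these.
const groups = [
  { loads: 10, sslLoads: 5 }, // 50% of 10 loads were SSL
  { loads: 20, sslLoads: 1 }, // 5% of 20 loads were SSL
];

const totalLoads = groups.reduce((acc, g) => acc + g.loads, 0);

// What would be published instead: a ratio plus a normalized share of all loads.
const published = groups.map((g) => ({
  ratio: g.sslLoads / g.loads,
  normalized_pageloads: g.loads / totalLoads,
}));

// Recombine using only the published values: a weighted sum of ratios,
// divided by the sum of the weights to "unnormalize" the result.
function combineRatios(rows) {
  const weighted = rows.reduce((acc, r) => acc + r.ratio * r.normalized_pageloads, 0);
  const weightSum = rows.reduce((acc, r) => acc + r.normalized_pageloads, 0);
  return weighted / weightSum;
}

const overall = combineRatios(published);
console.log(overall.toFixed(3)); // prints "0.200"
```

The division matters most when only a subset of rows is combined, because the weights then no longer sum to 1.0.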
For this dataset the same system applies. To combine rows' ratios (to, for example, see what the SSL ratio was across all os and country for a given submission_date), you must first multiply them by the rows' normalized_pageloads values. Then you must divide them by the sum of the rows' normalized_pageloads values to "unnormalize" and get the true ratio.
Or, in JavaScript:
// dateInQuestion is assumed to hold the submission_date string to aggregate over.
let rows = query_result.data.rows;
// Weighted sum of ratios. Note that reduce passes the accumulator first.
let normalizedRatioForDateInQuestion = rows
  .filter((row) => row.submission_date == dateInQuestion)
  .reduce((acc, row) => acc + row.normalized_pageloads * row.ratio, 0);
// Sum of the weights for the same rows.
let normalizedPageloadSumForDateInQuestion = rows
  .filter((row) => row.submission_date == dateInQuestion)
  .reduce((acc, row) => acc + row.normalized_pageloads, 0);
let trueRatio = normalizedRatioForDateInQuestion / normalizedPageloadSumForDateInQuestion;
Remember that the normalization in this dataset is done across all dimensions (os, country) per submission_date. Summing ratio (or reporting_ratio) across different submission_date values will not give correct information.
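Because normalization is per submission_date, one way to stay safe is to keep each date in its own bucket when combining rows. The following is a minimal sketch under that assumption. The function name ratiosByDate and the sample rows are made up for illustration and are not real data.

```javascript
// Combine rows into one SSL ratio per submission_date, never mixing dates.
function ratiosByDate(rows) {
  const sums = new Map(); // submission_date -> { weighted, weight }
  for (const row of rows) {
    const entry = sums.get(row.submission_date) || { weighted: 0, weight: 0 };
    entry.weighted += row.ratio * row.normalized_pageloads;
    entry.weight += row.normalized_pageloads;
    sums.set(row.submission_date, entry);
  }
  const result = new Map(); // submission_date -> combined ratio
  for (const [date, { weighted, weight }] of sums) {
    result.set(date, weighted / weight);
  }
  return result;
}

// Made-up rows for illustration only.
const sampleRows = [
  { submission_date: "2017-10-24T00:00:00", os: "Windows_NT", country: "CZ", normalized_pageloads: 0.25, ratio: 0.5 },
  { submission_date: "2017-10-24T00:00:00", os: "Linux", country: "CZ", normalized_pageloads: 0.75, ratio: 0.75 },
  { submission_date: "2017-10-25T00:00:00", os: "Darwin", country: "DE", normalized_pageloads: 0.5, ratio: 0.75 },
];

const perDate = ratiosByDate(sampleRows);
for (const [date, ratio] of perDate) {
  console.log(date, ratio);
}
```

Each date's result depends only on that date's rows, so adding or removing other dates does not change it.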
Schema
The data is output in STMO API format:
"query_result": {
  "retrieved_at": <timestamp>,
  "query_hash": <hash>,
  "query": <SQL>,
  "runtime": <number of seconds>,
  "id": <an id>,
  "data_source_id": 26, // Athena
  "data_scanned": <some really large number, as a string>,
  "data": {
    "data_scanned": <some really large number, as a number>,
    "columns": [
      {"friendly_name": "submission_date", "type": "datetime", "name": "submission_date"},
      {"friendly_name": "os", "type": "string", "name": "os"},
      {"friendly_name": "country", "type": "string", "name": "country"},
      {"friendly_name": "reporting_ratio", "type": "float", "name": "reporting_ratio"},
      {"friendly_name": "normalized_pageloads", "type": "float", "name": "normalized_pageloads"},
      {"friendly_name": "ratio", "type": "float", "name": "ratio"}
    ],
    "rows": [
      {
        "submission_date": "2017-10-24T00:00:00", // date string, day resolution
        "os": "Windows_NT", // operating system family of the clients reporting the pageloads. One of "Windows_NT", "Linux", or "Darwin".
        "country": "CZ", // ISO 3166-1 alpha-2 two-character country code, or "??" if we have no idea. Determined by performing a geo-IP lookup of the clients that submitted the pings.
        "reporting_ratio": 0.006825266611977031, // the ratio of pings that reported any pageloads at all. A number between 0 and 1. See [bug 1413258](https://bugzilla.mozilla.org/show_bug.cgi?id=1413258).
        "normalized_pageloads": 0.00001759145263985348, // the proportion of total pageloads in the dataset that are represented by this row. Provided to allow combining rows. A number between 0 and 1.
        "ratio": 0.6916961976822144 // the ratio of the pageloads that were performed over SSL. A number between 0 and 1.
      }, ...
    ]
  }
}
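The field descriptions in the schema can be turned into a quick sanity check before rows are combined. This is only a sketch: isPlausibleRow and isUnitInterval are hypothetical helpers, not part of any STMO client, and the uppercase two-letter country pattern is an assumption based on the "CZ" example.

```javascript
// Operating system families listed in the schema comments.
const KNOWN_OS = ["Windows_NT", "Linux", "Darwin"];

// True for finite numbers in the closed interval [0, 1]; NaN fails both comparisons.
function isUnitInterval(value) {
  return typeof value === "number" && value >= 0 && value <= 1;
}

// Hypothetical check that a row matches the documented field shapes.
function isPlausibleRow(row) {
  return (
    typeof row.submission_date === "string" &&
    /^\d{4}-\d{2}-\d{2}T/.test(row.submission_date) && // day-resolution date string
    KNOWN_OS.includes(row.os) &&
    /^([A-Z]{2}|\?\?)$/.test(row.country) && // assumed uppercase code, or "??"
    isUnitInterval(row.reporting_ratio) &&
    isUnitInterval(row.normalized_pageloads) &&
    isUnitInterval(row.ratio)
  );
}

// The example row from the schema, plus a deliberately broken copy.
const exampleRow = {
  submission_date: "2017-10-24T00:00:00",
  os: "Windows_NT",
  country: "CZ",
  reporting_ratio: 0.006825266611977031,
  normalized_pageloads: 0.00001759145263985348,
  ratio: 0.6916961976822144,
};
const badRow = { ...exampleRow, ratio: 1.5 };

console.log(isPlausibleRow(exampleRow), isPlausibleRow(badRow)); // prints "true false"
```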
Scheduling
The dataset updates every 24 hours.
Public Data
The data is publicly available on BigQuery: mozilla-public-data.telemetry_derived.ssl_ratios_v1.
Data can also be accessed through the public HTTP endpoint: https://public-data.telemetry.mozilla.org/api/v1/tables/telemetry_derived/ssl_ratios/v1/files
Code Reference
You can find the query that generates the SSL dataset at STMO#49323.