SSL Ratios

Introduction

The public SSL dataset publishes the percentage of Firefox page loads that were conducted over SSL. This dataset is used to produce graphs like Let's Encrypt's, which track SSL adoption on the Web over time.

Content

The public SSL dataset is a table where each row is a distinct set of dimensions, with their associated SSL statistics. The dimensions are submission_date, os, and country. The statistics are reporting_ratio, normalized_pageloads, and ratio.

Background and Caveats

  • We're using normalized values in normalized_pageloads to obscure absolute page load counts.
  • This is across the entirety of release, not per-version, because we're looking at Web health, not Firefox user health.
  • Any dimension tuple (any given combination of submission_date, os, and country) with fewer than 5000 page loads is omitted from the dataset.
  • This is, we hope, only a temporary dataset: a stopgap for the release aggregates going away, until we can come up with a better way to publish datasets publicly.

Accessing the Data

For details on accessing the data, please look at bug 1414839.

Data Reference

Combining Rows

This is a dataset of ratios. You can't combine ratios if they have different bases. For example, if 50% of 10 loads (5 loads) were SSL and 5% of 20 loads (1 load) were SSL, you cannot calculate that 20% (6 loads) of the total loads (30 loads) were SSL unless you know that the 50% was for 10 and the 5% was for 20.

If you're reluctant, for product reasons, to share the numbers 10 and 20, this gets tricky.

So what we've done is normalize the whole batch of 30 down to 1.0. That means we tell you that 50% of one-third of the loads (0.333...) was SSL and 5% of the other two-thirds of the loads (0.666...) was SSL. Then you can figure out the overall 20% figure by this calculation:

(0.5 * 0.333 + 0.05 * 0.666) / (0.333 + 0.666) = 0.2

Notice that you must divide by the sum of the normalized pageloads (0.333 + 0.666) in order to "unnormalize" the result into the true ratio. (In this toy example we're summing across all dimensions so the sum of all included normalized pageloads was 1.0.)
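
The toy example above can be sketched in JavaScript. This is only an illustration: the groups array, the published array, and the combineRatios helper are hypothetical names, and the counts 10 and 20 are the toy numbers from the text, not real data.

```javascript
// Toy example: two groups whose absolute load counts (10 and 20) stay private.
const groups = [
  { loads: 10, sslLoads: 5 }, // 50% SSL
  { loads: 20, sslLoads: 1 }, // 5% SSL
];
const totalLoads = groups.reduce((acc, g) => acc + g.loads, 0);

// What gets published: a per-group ratio plus a normalized weight.
const published = groups.map((g) => ({
  ratio: g.sslLoads / g.loads,
  normalized_pageloads: g.loads / totalLoads,
}));

// Recombine using only the published values (hypothetical helper).
function combineRatios(rows) {
  const weighted = rows.reduce(
    (acc, r) => acc + r.ratio * r.normalized_pageloads, 0);
  const weight = rows.reduce((acc, r) => acc + r.normalized_pageloads, 0);
  return weighted / weight; // divide by the weight sum to "unnormalize"
}

const overall = combineRatios(published);
console.log(overall.toFixed(3)); // prints "0.200"
```

The division by the weight sum is what keeps the result correct when only a subset of the groups is combined.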

For this dataset the same system applies. To combine rows' ratios (to, for example, see what the SSL ratio was across all os and country for a given submission_date), you must first multiply each row's ratio by that row's normalized_pageloads value and sum the products. Then you must divide that sum by the sum of the rows' normalized_pageloads values to "unnormalize" and get the true ratio.

Or, in JavaScript:

let rows = query_result.data.rows;
// dateInQuestion must use the same format as the rows, e.g. "2017-10-24T00:00:00"
let rowsForDate = rows.filter((row) => row.submission_date === dateInQuestion);
// reduce() passes the accumulator first, then the current row
let normalizedRatioForDateInQuestion = rowsForDate
  .reduce((acc, row) => acc + row.normalized_pageloads * row.ratio, 0);
let normalizedPageloadSumForDateInQuestion = rowsForDate
  .reduce((acc, row) => acc + row.normalized_pageloads, 0);
let trueRatio = normalizedRatioForDateInQuestion / normalizedPageloadSumForDateInQuestion;

Remember that the normalization in this dataset is done across all dimensions (os, country) per submission_date. Each row's normalized_pageloads value is therefore relative to its own date's total, so combining ratio (or reporting_ratio) across different submission_date values will not give correct information.
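
One way to respect this constraint is to combine rows separately for each submission_date. The following is a minimal sketch, assuming rows shaped like query_result.data.rows; the function name sslRatioByDate and the sample values are made up for illustration and are not real dataset values.

```javascript
// Compute one combined SSL ratio per submission_date (hypothetical helper).
function sslRatioByDate(rows) {
  const sums = new Map(); // submission_date -> { weighted, weight }
  for (const row of rows) {
    const s = sums.get(row.submission_date) || { weighted: 0, weight: 0 };
    s.weighted += row.ratio * row.normalized_pageloads;
    s.weight += row.normalized_pageloads;
    sums.set(row.submission_date, s);
  }
  const result = {};
  for (const [date, s] of sums) {
    result[date] = s.weighted / s.weight; // "unnormalize" within each date
  }
  return result;
}

// Made-up rows for illustration only.
const sampleRows = [
  { submission_date: "2017-10-24T00:00:00", os: "Windows_NT", country: "CZ",
    normalized_pageloads: 0.25, ratio: 0.8 },
  { submission_date: "2017-10-24T00:00:00", os: "Linux", country: "CZ",
    normalized_pageloads: 0.75, ratio: 0.4 },
  { submission_date: "2017-10-25T00:00:00", os: "Darwin", country: "DE",
    normalized_pageloads: 0.5, ratio: 0.6 },
];
const perDate = sslRatioByDate(sampleRows);
console.log(perDate);
```

Rows are never mixed across dates here, so each result only uses weights that share the same base.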

Schema

The data is output in STMO API format:

"query_result": {
  "retrieved_at": <timestamp>,
  "query_hash": <hash>,
  "query": <SQL>,
  "runtime": <number of seconds>,
  "id": <an id>,
  "data_source_id": 26, // Athena
  "data_scanned": <some really large number, as a string>,
  "data": {
    "data_scanned": <some really large number, as a number>,
    "columns": [
      {"friendly_name": "submission_date", "type": "datetime", "name": "submission_date"},
      {"friendly_name": "os", "type": "string", "name": "os"},
      {"friendly_name": "country", "type": "string", "name": "country"},
      {"friendly_name": "reporting_ratio", "type": "float", "name": "reporting_ratio"},
      {"friendly_name": "normalized_pageloads", "type": "float", "name": "normalized_pageloads"},
      {"friendly_name": "ratio", "type": "float", "name": "ratio"}
    ],
    "rows": [
      {
        "submission_date": "2017-10-24T00:00:00", // date string, day resolution
        "os": "Windows_NT", // operating system family of the clients reporting the pageloads. One of "Windows_NT", "Linux", or "Darwin".
        "country": "CZ", // ISO 3166-1 alpha-2 two-character country code, or "??" if we have no idea. Determined by performing a geo-IP lookup of the clients that submitted the pings.
        "reporting_ratio": 0.006825266611977031, // the ratio of pings that reported any pageloads at all. A number between 0 and 1. See [bug 1413258](https://bugzilla.mozilla.org/show_bug.cgi?id=1413258).
        "normalized_pageloads": 0.00001759145263985348, // the proportion of total pageloads in the dataset that are represented by this row. Provided to allow combining rows. A number between 0 and 1.
        "ratio": 0.6916961976822144 // the ratio of the pageloads that were performed over SSL. A number between 0 and 1.
      }, ...
    ]
  }
}
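
As a rough sketch of consuming this format, the snippet below pulls the rows out of an already-parsed response and checks that the six documented columns are present. It assumes the parsed object has a top-level query_result key as shown above; extractRows, EXPECTED_COLUMNS, the added day field, and the abbreviated sampleResponse are hypothetical and not part of the dataset or of any STMO client library.

```javascript
// Column names taken from the schema above.
const EXPECTED_COLUMNS = [
  "submission_date", "os", "country",
  "reporting_ratio", "normalized_pageloads", "ratio",
];

// Hypothetical helper: validate the columns, then return the rows with an
// extra "day" field (YYYY-MM-DD) cut from the submission_date string.
function extractRows(response) {
  const data = response.query_result.data;
  const names = data.columns.map((c) => c.name);
  const missing = EXPECTED_COLUMNS.filter((name) => !names.includes(name));
  if (missing.length > 0) {
    throw new Error(`Missing columns: ${missing.join(", ")}`);
  }
  return data.rows.map((row) => ({
    ...row,
    day: row.submission_date.slice(0, 10),
  }));
}

// Abbreviated response using the example row from the schema.
const sampleResponse = {
  query_result: {
    data: {
      columns: EXPECTED_COLUMNS.map((name) => ({ name })),
      rows: [{
        submission_date: "2017-10-24T00:00:00",
        os: "Windows_NT",
        country: "CZ",
        reporting_ratio: 0.006825266611977031,
        normalized_pageloads: 0.00001759145263985348,
        ratio: 0.6916961976822144,
      }],
    },
  },
};
const extracted = extractRows(sampleResponse);
console.log(extracted[0].day); // prints "2017-10-24"
```

Slicing the string avoids time zone surprises, since the timestamp carries no offset.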

Scheduling

The dataset updates every 24 hours.

Public Data

The data is publicly available on BigQuery: mozilla-public-data.telemetry_derived.ssl_ratios_v1. Data can also be accessed through the public HTTP endpoint: https://public-data.telemetry.mozilla.org/api/v1/tables/telemetry_derived/ssl_ratios/v1/files

Code Reference

You can find the query that generates the SSL dataset at STMO#49323.