Making Datasets Publicly Available

Currently, only datasets and query results that are available in BigQuery and defined in the bigquery-etl repository can be made publicly available. See the bigquery-etl documentation for information on how to create and schedule datasets. Before data can be published, a data review is required.

To make query results publicly available, a metadata.yaml file must be added alongside the query in bigquery-etl. For example:

friendly_name: SSL Ratios description: >- Percentages of page loads Firefox users have performed that were conducted over SSL broken down by country. owners: - example@mozilla.com labels: application: firefox incremental: true # incremental queries add data to existing tables schedule: daily # scheduled in Airflow to run daily public_json: true public_bigquery: true review_bug: 1414839 # Bugzilla bug ID of data review incremental_export: false # non-incremental JSON export writes all data to a single location

The following options define how data is published:

  • public_json: data is available through the public HTTP endpoint
  • public_bigquery: data is publicly available on BigQuery
    • tables will get published in the mozilla-public-data GCP project which is accessible by everyone, also external users
  • incremental_export: determines how data gets split up
    • true: data for each submission_date gets exported into separate directories (e.g. files/2020-04-15, files/2020-04-16, ...)
      • false: all data gets exported into one files/ directory
  • incremental: indicates how data gets updated based on the query and Airflow configuration
    • true: data gets incrementally updated
    • false: the entire table data gets updated
  • review_bug: Bugzilla bug number to the data review

Data will get published when the query is executed in Airflow. Metadata of available public data on Cloud Storage is updated daily through a separate Airflow task.

More information about accessing public data can be found in Accessing Public Data.