Generated Schemas

Overview

Schemas describe the structure of ingested data. They are used in the pipeline to validate the types and values of data, and to define a table schema in a data store. We use a repository of JSON Schemas to sort incoming data into decoded and error datasets. We also generate BigQuery table schemas from the JSON Schemas on business days; you can see the current status of this job on the mozilla-pipeline-schemas deploy dashboard.

[Diagram: the main and generated-schemas branches of mozilla-pipeline-schemas, the mozilla-schema-generator (which uses probe-scraper and the jsonschema-transpiler and is scheduled nightly by telemetry-airflow), and gcp-ingestion, which uses the generated schemas for validation and normalization and inserts data into BigQuery.]

Figure: An overview of generated schemas.

Schema deploys FAQ

This section answers some basic questions about the schema deployment pipeline.

How do I make changes to a schema?

This is dependent on what application you are working on.

If you are working on Firefox Telemetry and are adding a new probe, then you don't have to do anything. Changes are automatically picked up by the probe-scraper from the Histograms.json and Scalars.yaml files in mozilla-central. Non-probe changes (for example, modifications to the telemetry environment) will require you to make changes to mozilla-pipeline-schemas.

If you are working on an application using the Glean SDK, then the probe-scraper will automatically pick up changes from metrics.yaml.

When will I see new changes to the schema?

Schema deploys happen on business days around UTC+04, when new changes are found in the generated-schemas branch of mozilla-pipeline-schemas. This means that any changes merged after UTC+04 on a Friday will not propagate until UTC+04 on the following Monday. See the mozilla-pipeline-schemas deploy dashboard for up-to-date information on the most recent deploys.

What does it mean when a schema deploy is blocked?

The schema deployment pipeline has a hard dependency on the probe-scraper, a service that scours repositories for new metrics to include in generated schemas. When the probe-scraper fails, it will prevent the mozilla-schema-generator from running. If there are new changes to the main branch of mozilla-pipeline-schemas, then they will not be added to the generated-schemas branch until the failure has been resolved. Similarly, new probes and pings in either Telemetry or Glean will not be picked up until the probe-scraper failures are resolved.

If a new schema field is not registered in the schema repository before collection begins, it will be available in the additional_properties field of the generated table. If a new schema for a ping is not registered before collection begins, then it will be sorted into the error stream. Please file a bug or reach out if you believe your data may be affected by blocked schema deploys.
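For example, if a hypothetical field named newFeatureEnabled began collection before its schema change landed, it could still be recovered from the additional_properties JSON string once rows are pulled into Python. The field name and document structure below are purely illustrative.

```python
import json

# Hypothetical: `newFeatureEnabled` was collected before it was added to the
# schema, so it only exists inside the additional_properties JSON string.
row = {"additional_properties": '{"payload": {"newFeatureEnabled": true}}'}

extra = json.loads(row["additional_properties"])
new_feature_enabled = extra["payload"]["newFeatureEnabled"]  # True
```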

Schema Repository

[Diagram: templates in mozilla-pipeline-schemas (origin/main) are built with cmake into JSON Schemas under schemas/, which the mozilla-schema-generator uses to produce origin/generated-schemas.]

Figure: Template schemas are built locally to generate static JSON Schema. On a regular basis, the Mozilla Schema Generator is run to generate BigQuery schemas.

Refer to Sending a Custom Ping for an in-depth guide for adding new schemas to the repository.

Schema Transpiler

The structure validated by a JSON Schema can be mapped to BigQuery columns. This is done by the jsonschema-transpiler, a Rust application for translating between schema formats. Data normalization, such as snake casing of field names and type casting, is required as part of decoding before inserting into BigQuery. These workarounds are based on transformations that are applied when importing Avro into BigQuery.

[Diagram: the jsonschema-transpiler takes JSON Schemas as input and outputs Avro schemas or BigQuery schemas.]
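As a rough sketch of the kind of mapping involved (the field names below are made up, and the output layout is illustrative rather than the transpiler's literal output format):

```python
# A hand-written illustration of translating a JSON Schema into a BigQuery
# schema; the real translation is performed by the jsonschema-transpiler.
json_schema = {
    "type": "object",
    "properties": {
        "clientId": {"type": "string"},
        "sessionLength": {"type": "integer"},
        "isDefaultBrowser": {"type": "boolean"},
    },
}

# Column names are snake_cased, and JSON types map onto BigQuery types
# (string -> STRING, integer -> INT64, boolean -> BOOL).
bigquery_schema = [
    {"name": "client_id", "type": "STRING", "mode": "NULLABLE"},
    {"name": "session_length", "type": "INT64", "mode": "NULLABLE"},
    {"name": "is_default_browser", "type": "BOOL", "mode": "NULLABLE"},
]
```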

Mozilla Schema Generator

The schema generator populates schemas with metadata and inserts generated sub-schemas at certain paths. It generates JSON Schemas that are translated into BigQuery schemas but are not used for validation. It uses the probe information service to enumerate map-type fields; these fields are converted into structured columns that can be accessed in BigQuery with Standard SQL. Metadata includes fields added during data ingestion, such as submission_timestamp and sample_id.
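To illustrate the enumeration, consider the following sketch; the probe names and values are only examples, and the shapes are illustrative rather than copied from a generated schema.

```python
# In the raw ping, scalars arrive as a map from probe name to value.
raw_scalars = {
    "browser.engagement.active_ticks": 102,
    "browser.engagement.total_uri_count": 3,
}

# Because the probe information service knows the full set of probe names,
# each one can become its own snake_cased column in the generated schema,
# rather than remaining an opaque map.
generated_columns = {
    "browser_engagement_active_ticks": 102,
    "browser_engagement_total_uri_count": 3,
}
```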

In addition to generating BigQuery schemas, schemas are aliased in several locations. For example, the first_shutdown ping is a copy of the main ping schema. Schemas are also altered in the generator to accommodate various edge-cases in the data. For example, a field that validates against both boolean and integer types may be altered to assume a boolean type.

The main entry point is a script that merges and generates the *.schema.json files under the schemas directory, then translates them into *.bq files. It commits the resulting schemas to the generated-schemas branch, with a change-log referencing commits in the main branch.

Data Ingestion

Validation

Data that fails validation is sent to the payload_bytes_error table. Each row contains information about the error that caused it, as well as the name of the job associated with it.

    SELECT
      document_namespace,
      document_type,
      document_version,
      error_message,
      error_type,
      exception_class,
      job_name
    FROM
      `moz-fx-data-shared-prod`.payload_bytes_error.telemetry
    WHERE
      submission_timestamp > TIMESTAMP_SUB(current_timestamp, INTERVAL 1 hour)
      AND exception_class = 'org.everit.json.schema.ValidationException'
    LIMIT 5
| Column | Example Value | Notes |
|---|---|---|
| document_namespace | telemetry | |
| document_type | main | |
| document_version | null | The version in the telemetry namespace is generated after validation |
| error_message | org.everit.json.schema.ValidationException: #/environment/system/os/version: #: no subschema matched out of the total 1 subschemas | |
| error_type | ParsePayload | The ParsePayload type is associated with schema validation or corrupt data |
| exception_class | org.everit.json.schema.ValidationException | Java JSON Schema Validator library |
| job_name | decoder-0-0121192636-9c56ac6a | Name of the Dataflow job that can be used to determine the version of the schema artifact |

Decoding

The BigQuery schemas are used to normalize relevant payload data and determine additional properties. Normalization involves renaming field names and transforming certain types of data. Snake casing is employed across all schemas and ensures a consistent querying experience. Some data must be transformed before insertion, such as map-types (a.k.a. dictionaries in Python), due to limitations in BigQuery data representation. Other data may not be specified in the schema, and is instead placed into a specially constructed column named additional_properties.

Name Normalization

A reference Python implementation of the snake casing algorithm is kept compatible with the implementations in the decoder and transpiler through a shared test-suite. To illustrate the transformation, consider the a11y.theme keyed scalar in the main ping. In the JSON document, as seen in about:telemetry, it is accessed as follows:

    # Python/Javascript syntax
    ping["payload"]["processes"]["parent"]["keyedScalars"]["a11y.theme"]

The decoder will normalize the path with snake casing in BigQuery:

    SELECT
      payload.processes.parent.keyed_scalars.a11y_theme
    FROM
      `moz-fx-data-shared-prod`.telemetry.main
    WHERE
      date(submission_timestamp) = date_sub(current_date, interval 1 day)
    LIMIT 1
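A simplified sketch of such a casing transformation is shown below; it is illustrative only and is not the shared reference implementation.

```python
import re

def snake_case(name: str) -> str:
    """Illustrative snake casing; the canonical algorithm lives in the shared test-suite."""
    # Insert an underscore at lowercase/digit-to-uppercase boundaries: keyedScalars -> keyed_Scalars.
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    # Collapse non-alphanumeric separators (e.g. the '.' in a11y.theme) into underscores.
    name = re.sub(r"[^A-Za-z0-9]+", "_", name)
    return name.strip("_").lower()

assert snake_case("keyedScalars") == "keyed_scalars"
assert snake_case("a11y.theme") == "a11y_theme"
```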

Data Structure Normalization

The decoder is also responsible for transforming the data to accommodate BigQuery limitations in data representation. All transformations are defined in ingestion-beam under com.mozilla.telemetry.transforms.PubsubMessageToTableRow.

The following transformations are currently applied:

| Transformation | Description |
|---|---|
| Map Types | JSON objects that contain an unbounded number of keys with a shared value type are represented as a repeated structure containing a key and value column. |
| Nested Arrays | Nested lists are represented using a structure containing a repeated list column. |
| Tuples to Anonymous Structures | A tuple of items is represented as an anonymous structure with column names starting at _0 up to _{n} where n is the length of the tuple. |
| JSON to String coercion | A sub-tree in a JSON document will be coerced to string if specified in the BigQuery schema. One example of this transformation is the representation of histograms in the main ping. |
| Boolean to Integer coercion | A boolean may be cast into an integer type. |
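The following sketch illustrates the first two transformations in the table above; the shapes shown are illustrative, not the decoder's literal output.

```python
# Map types: an object with unbounded keys becomes a repeated key/value structure.
keyed_scalar = {"a11y.theme": False, "another.key": True}
keyed_scalar_as_bq = [
    {"key": "a11y.theme", "value": False},
    {"key": "another.key", "value": True},
]

# Nested arrays: a list of lists is wrapped so that only the outer level repeats.
nested = [[1, 2], [3]]
nested_as_bq = [
    {"list": [1, 2]},
    {"list": [3]},
]
```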

Additional properties are fields within the ingested JSON document that are not found in the schema. When all transformations are completed, any fields that were not traversed in the schema will be reconstituted into the top-level additional_properties field.
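A minimal sketch of that behavior, assuming a flat document for simplicity (the real decoder walks the full nested schema):

```python
import json

def split_additional_properties(doc, known_fields):
    """Split a decoded document into schema'd columns and leftover properties."""
    row = {k: v for k, v in doc.items() if k in known_fields}
    extra = {k: v for k, v in doc.items() if k not in known_fields}
    # Fields the schema does not know about are serialized into a single column.
    row["additional_properties"] = json.dumps(extra) if extra else None
    return row

# "new_field" has not landed in the schema yet, so it ends up in
# additional_properties rather than in its own column.
split_additional_properties({"client_id": "abc", "new_field": 1}, {"client_id"})
```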

Deploying to BigQuery

In this section, we discuss deployment of generated schemas to BigQuery. Refer to Table Layout and Naming for details about the resulting structure of the projects.

Tables are updated on every push to generated-schemas. The schemas must be backwards compatible, otherwise the checks in the staging Dataflow and BigQuery instances will fail. Such failures must be resolved by pushing a new tip to the generated-schemas branch in the schema repository. Valid changes to schemas include relaxing a column mode from REQUIRED to NULLABLE or adding new columns.

Each table is labeled with the revision of the schema repository from which it was built. Consider the org_mozilla_fenix namespace:

    $ bq ls --max_results=3 moz-fx-data-shared-prod:org_mozilla_fenix_stable
           tableId         Type                  Labels                          Time Partitioning              Clustered Fields
     ------------------- ------- --------------------------------------- ----------------------------------- -------------------------------
      activation_v1       TABLE   schema_id:glean_ping_1                  DAY (field: submission_timestamp)   normalized_channel, sample_id
                                  schemas_build_id:202001230145_be1f11e
      baseline_v1         TABLE   schema_id:glean_ping_1                  DAY (field: submission_timestamp)   normalized_channel, sample_id
                                  schemas_build_id:202001230145_be1f11e
      bookmarks_sync_v1   TABLE   schema_id:glean_ping_1                  DAY (field: submission_timestamp)   normalized_channel, sample_id

The schema_id is derived from the value of the $schema property of each JSON Schema. The schemas_build_id label contains an identifier that includes the timestamp of the generated schema. This label may be used to trace the last deployed commit from generated-schemas.

Updating generated-schemas

[Diagram: the mozilla-schema-generator pulls the main branch of mozilla-pipeline-schemas and pushes to generated-schemas; it runs on a schedule at midnight UTC and can be run manually from workflow.tmo.]

A new push to the generated-schemas branch is made every time the probe-scraper.schema_generator task is run by Airflow. mozilla-schema-generator runs in a container that commits snapshots of generated schemas to the remote repository. Generated schemas may change when the probe-scraper finds new probes in the defined repositories (e.g. hg.mozilla.org or Glean repositories). They may also change when the main branch contains new or updated schemas under the schemas/ directory.

To manually trigger a new push, clear the state of a single task in the workflow admin UI. To update the schedule and dependencies, update the DAG definition.

Deploying schemas to production

[Diagram: new commits to generated-schemas in mozilla-pipeline-schemas are labeled and archived into a schema artifact; terraform runs update the BigQuery tables and views, the Dataflow ingestion jobs are drained and resubmitted in moz-fx-data-shar-nonprod-efed, and the deploy is then promoted to prod.]

Jenkins is used to automate deploys of the pipeline in the nonprod and prod projects. Jenkins polls the generated-schemas branch for new commits. The tip of the branch is labeled and archived into an artifact that is used during deploys. The artifact is first used to update the table schemas in the nonprod project. This staging step will stop on schema-incompatible changes, such as removing a schema or a column in a schema. Once the tables are up to date, the Dataflow job is drained and redeployed so that it writes to the updated tables. Once schemas have successfully deployed to the nonprod project, the deploy may be manually promoted to production by an operator.