For a more gentle introduction to the data platform, please read the Pipeline Overview article.
This article goes into more depth about the architecture and flow of data in the platform.
The full detail of the platform can get quite complex, but at a high level the structure is fairly simple.
Each of these high-level parts of the platform are described in more detail below.
By far most data handled by the Data Platform is produced by Firefox. There are other producers, though, and the eventual aim is to generalize data production using a client SDK or set of standard tools.
Most data is submitted via HTTP POST, but data is also produced in the form of service logs and
If you would like to locally test a new data producer, the
gzipServer project provides a simplified server that makes it easy to inspect submitted messages.
Data arrives as an HTTP POST of an optionally gzipped payload of JSON. See the common Edge Server specification for details.
Submissions hit a load balancer which handles the SSL connection, then forwards to a "tee" server, which may direct some or all submissions to alternate backends. In the past, the tee was used to manage the cutover between different versions of the backend infrastructure. It is implemented as an
From there, the
mozingest HTTP Server receives submissions from the tee and batches and stores data durably on Amazon S3 as a fail-safe (we call this "Landfill"). Data is then passed along via Kafka for validation and further processing. If there is a problem with decoding, validation, or any of the code described in the rest of this section, data can be re-processed from this fail-safe store. The
mozingest server is implemented as an
Validation, at a minimum, ensures that a payload is valid JSON (possibly compressed). Many document types also have a JSONSchema specification, and are further validated against that.
Invalid messages are redirected to a separate "errors" stream for debugging and inspection.
Valid messages proceed for further decoding and processing. This involves things like doing GeoIP lookup and discarding the IP address, and attaching some HTTP header info as annotated metadata.
Validated and annotated messages become available for stream processing.
They are also batched and stored durably for later batch processing and ad-hoc querying.
See also the "generic ingestion" proposal which aims to make ingestion, validation, storage, and querying available as self-serve for platform users.
Hindsight is used for ingestion of logs from applications and services, it supports parsing of log lines and appending similar metadata as the HTTP ingestion above (timestamp, source, and so on).
Statsd messages are ingested in the usual way.
Amazon S3 forms the backbone of the platform storage layer. The primary format used in the Data Lake is parquet, which is a strongly typed columnar storage format that can easily be read and written by Spark, as well as being compatible with SQL interfaces such as Hive and Presto. Some data is also stored in Heka-framed protobuf format. This custom format is usually reserved for data where we do not have a complete JSONSchema specification.
Using S3 for storage avoids the need for an always-on cluster, which means that data at rest is inexpensive. S3 also makes it very easy to automatically expire (delete) objects after a certain period of time, which is helpful for implementing data retention policies.
Once written to S3, the data is typically treated as immutable - data is not appended to existing files, nor is data normally updated in place. The exception here is when data is back-filled, in which case previous data may be overwritten.
There are a number of other types of storage used for more specialized applications, including relational databases (such as PostgreSQL for the Telemetry Aggregates) and NoSQL databases (DynamoDB is used for a backing store for the TAAR project). Reading data from a variety of RDBMS sources is also supported via Re:dash.
Parquet data can be read and written natively from Spark, and many datasets are indexed in a Hive Metastore, making them available through a SQL interface on Re:dash and in notebooks via Spark SQL. Many other SQL data sources are also made available via Re:dash, see this article for more information on accessing data using SQL.
There is a separate data store for self-serve Analysis Outputs, intended to keep ad-hoc, temporary data out of the Data Lake. This is implemented as a separate S3 location, with personal output locations prefixed with each person's user id, similar to the layout of the
/home directory on a Unix system. See the Working with Parquet data cookbook for more details.
Analysis outputs can also be made public using the Public Outputs bucket. This is a web-accessible S3 location for powering public dashboards. This public data is available at
Hindsight allows you to run plugins written in Lua inside a sandbox. This gives a safe, performant way to do self-serve streaming analysis. See this article for an introduction. Hindsight plugins do the initial data validation and decoding, as well as writing out to long-term storage in both Heka-framed protobuf and parquet forms.
Datasets in parquet format can be read natively by Spark, either using Spark SQL or by reading data directly from S3.
Data produced by production jobs go into the Data Lake, while output from ad-hoc jobs go into Analysis Outputs.
Job scheduling and dependency management is done using Airflow. Most jobs run once a day, processing data from "yesterday" on each run. A typical job launches a cluster, which fetches the specified ETL code as part of its bootstrap on startup, runs the ETL code, then shuts down upon completion. If something goes wrong, a job may time out or fail, and in this case it is retried automatically.
Most of the data analysis tooling has been developed with the goal of being "self-serve". This means that people should be able to access and analyze data on their own, without involving data engineers or operations. Thus can data access scale beyond a small set of people with specialized knowledge of the entire pipeline.
The use of these self-serve tools is described in the Getting Started article. This section focuses on how these tools integrate with the platform infrastructure.
ATMO is a service for managing Spark clusters for data analysis. Clusters can be launched on demand, or can be scheduled to run a job on an ongoing basis. These clusters can read from the Data Lake described in the Storage section above, and can write results to either public (web-accessible) or private output locations.
Jupyter or Zeppelin notebooks are the usual interface to getting work done using a cluster, though you get full SSH access to the cluster.
Clusters launched via ATMO are automatically killed after a user-defined period of time (by default, 8 hours), though their lifetime can be extended as needed. Each cluster has a user-specific
EFS volume mounted on the
/home/hadoop directory, which means that data stored locally on the cluster persists from one cluster to the next. This volume is shared by all clusters launched by a given ATMO user.
STMO is a customized Re:dash installation that provides self-serve access to a a variety of different datasets. From here, you can query data in the Parquet Data Lake as well as various RDBMS data sources.
STMO interfaces with the data lake using both Presto and Amazon Athena. Each has its own data source in Re:dash. Since Athena does not support user-defined functions, datasets with HyperLogLog columns, such as
client_count_daily, are only available via Presto..
Different Data Sources in STMO connect to different backends, and each backend might use a slightly different flavor of SQL. You should find a link to the documentation for the expected SQL variant next to the Data Sources list.
Queries can be run just once, or scheduled to run periodically to keep data up-to-date.
There is a command-line interface to STMO called St. Mocli, if you prefer writing SQL using your own editor and tools.
Databricks offers another notebook interface for doing analysis in Scala, SQL, Python and R.
Databricks provides an always-on shared server which is nice for quick data investigations. ATMO clusters take some time to start up, usually on the order of tens of minutes. The shared server allows you to avoid this start-up cost. Prefer ATMO for heavy analyses since you will have dedicated resources.
TMO provides easy visualizations of histogram and scalar measures over time. Time can be in terms of either builds or submission dates. This is the most convenient interface to the Telemetry data, as it does not require any custom code.
There are a number of visualization libraries and tools being used to display data.
The landing page at
telemetry.mozilla.org is a good place to look for existing graphs, notably the measurement dashboard which gives a lot of information about histogram and scalar measures collected on pre-release channels.
Databricks notebooks are also an option.
Re:dash lets you query the data using SQL, but it also supports a number of useful visualizations.
Hindsight's web interface has the ability to visualize time-series data.
Mission Control gives a low-latency view into release health.
Many bespoke visualizations are built using the Metrics Graphics library as a display layer.
There are multiple layers of monitoring and alerting.
At a low level, the system is monitored to ensure that it is functioning as expected. This includes things like machine-level resources (network capacity, disk space, available RAM, CPU load) which are typically monitored using DataDog.
Next, we monitor the "transport" functionality of the system. This includes monitoring incoming submission rates, payload sizes, traffic patterns, schema validation failure rates, and alerting if anomalies are detected. This type of anomaly detection and alerting is handled by Hindsight.
Once data has been safely ingested and stored, we run some automatic regression detection on all Telemetry histogram measures using Cerberus. This code looks for changes in the distribution of a measure, and emails probe owners if a significant change is observed.
Production ETL jobs are run via Airflow, which monitors batch job progress and alerts if there are failures in any job. Self-serve batch jobs run via ATMO also generate alerts upon failure.
Scheduled Re:dash queries may also be configured to generate alerts, which is used to monitor the last-mile user facing status of derived datasets. Re:dash may also be used to monitor and alert on high-level characteristics of the data, or really anything you can think of.
Data is exported from the pipeline to a few other tools and systems. Examples include integration with Amplitude for mobile and product analytics, publishing reports and visualizations to the Mozilla Data Collective, and shipping data to other parts of the Mozilla organization.
There are also a few data sets which are made publicly available, such as the Firefox Hardware Report.
Finally, here is a more detailed view of the entire platform. Some connections are omitted for clarity.