Data Stack Overview

This is a quick overview of the tooling and components in our data stack:

Data Platform
Collection
  • gcp-ingestion - Mozilla's telemetry ingestion system deployed to Google Cloud Platform (GCP)
  • Data Sources:
    • Glean apps (including apps using glean.js)
    • Firefox legacy telemetry clients (Firefox desktop)
    • Fivetran
    • Custom integrations
    • Server side data
Data Warehouse
  • BigQuery
ETL
Orchestration
Observability
  • Custom tooling for data validation and data checks, as part of bigquery-etl
Analysis and Business Intelligence
  • Looker
    • For most reporting, summaries, and ad-hoc data exploration by people who do not work with data full-time
  • Redash
    • For running ad-hoc SQL queries
Reverse ETL
  • None. We don’t send a lot of data out; when we do, it goes through custom integrations, APIs, etc.
Experimentation
  • Nimbus/Experimenter
Governance
  • Firefox Data Governance: Complex and designed for a specific use case that may not be generally applicable. Currently being revisited.
  • Access Control: Relatively fine-grained access control is supported at the BigQuery and Looker level, and an access management and approval process exists.
  • Transparency and Docs: Automated inventory and docs through the Glean Dictionary
Data Catalog
  • Acryl, for data lineage and finding data sets
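
To give a rough feel for the kind of data validation mentioned under Observability, here is a minimal sketch of two common checks (no nulls in required columns, and a minimum row count). It is an illustration only and is not the bigquery-etl implementation: the function names, the in-memory rows, and the column names `client_id` and `submission_date` are all assumptions made for the example.

```python
from typing import Any, Iterable, List, Mapping

Row = Mapping[str, Any]


def check_not_null(rows: Iterable[Row], columns: List[str]) -> List[str]:
    """Return the columns that contain at least one missing (None) value."""
    materialized = list(rows)
    return [
        column
        for column in columns
        if any(row.get(column) is None for row in materialized)
    ]


def check_min_rows(rows: Iterable[Row], minimum: int) -> bool:
    """Return True when the data set has at least `minimum` rows."""
    return len(list(rows)) >= minimum


# Hypothetical sample data standing in for a query result.
sample = [
    {"client_id": "a", "submission_date": "2024-01-01"},
    {"client_id": None, "submission_date": "2024-01-01"},
]

failing_columns = check_not_null(sample, ["client_id", "submission_date"])
has_enough_rows = check_min_rows(sample, 2)
```

In a real pipeline, checks like these would more likely be expressed as SQL run against the warehouse after each ETL step, with a failing check blocking or flagging downstream tasks.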