This is a quick overview of the tooling and components in our data stack:
|
Data Platform |
Collection |
- gcp-ingestion - Mozilla's telemetry ingestion system deployed to Google Cloud Platform (GCP)
- Data Sources:
- Glean apps (including apps using glean.js)
- Firefox legacy telemetry clients (Firefox desktop)
- Fivetran
- Custom integrations
- Server side data
|
Data Warehouse |
BigQuery |
ETL |
|
Orchestration |
|
Observability |
Custom tooling for data validation/data checks as part of bigquery-etl.
|
Analysis and Business Intelligence |
- Looker
- For most reporting, summaries, and ad-hoc data exploration by people who do not work with data full-time
- Redash
- For running ad-hoc SQL queries
|
Reverse ETL |
None. We don’t send much data out; when we do, it has been through custom integrations, APIs, etc. |
Experimentation |
Nimbus/Experimenter |
Governance |
- Firefox Data Governance: Complex and designed for a specific use case that may not be generally applicable. Currently being revisited.
- Access Control: Relatively fine-grained access control is supported at the BigQuery and Looker level, and an access management and approval process exists.
- Transparency and Docs: Automated inventory and docs via the Glean Dictionary
|
Data Catalog |
Acryl - for data lineage and finding data sets
|
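
To illustrate the kind of data validation mentioned under Observability, here is a minimal Python sketch of two common checks (no missing values in required columns, and a minimum row count). It is not the bigquery-etl checks interface: the function names `check_not_null` and `check_min_rows` and the sample rows are hypothetical, and real checks would presumably run against BigQuery tables rather than in-memory rows.

```python
from typing import Any, Mapping, Sequence

# A row is modelled as a simple column-name -> value mapping (an assumption
# made for illustration only).
Row = Mapping[str, Any]


def check_not_null(rows: Sequence[Row], columns: Sequence[str]) -> list[str]:
    """Return one failure message per missing value in the given columns."""
    failures = []
    for i, row in enumerate(rows):
        for col in columns:
            if row.get(col) is None:
                failures.append(f"row {i}: column '{col}' is null")
    return failures


def check_min_rows(rows: Sequence[Row], min_rows: int) -> list[str]:
    """Return a failure message if the table has fewer rows than expected."""
    n = len(rows)
    if n < min_rows:
        return [f"expected at least {min_rows} rows, found {n}"]
    return []


# Hypothetical sample data: the second row is missing its client_id.
sample_rows = [
    {"client_id": "a", "submission_date": "2024-01-01"},
    {"client_id": None, "submission_date": "2024-01-01"},
]

if __name__ == "__main__":
    problems = check_not_null(sample_rows, ["client_id", "submission_date"])
    problems += check_min_rows(sample_rows, 1)
    for message in problems:
        print(message)
```

Returning a list of messages rather than raising on the first failure lets a single run report every problem it finds.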