Connecting Sheets and External Data to BigQuery

Google Sheets and other external data sources can be connected to our Data Warehouse and made available as tables via bigquery-etl. Updates made to the connected data source will be instantly available in the BigQuery table. The created tables can be made available in Looker and restricted in access using Data Access Workgroups.

Connecting Sheets

To connect a Google Sheet to BigQuery, the following steps need to be followed:

  1. Clone the bigquery-etl repository locally. Setting up the bqetl CLI tooling is optional, all the steps here can be done manually.
  2. Create a new destination table configuration in the project and dataset the table should be created in BigQuery:
    • If the spreadsheet should be accessible through the table moz-fx-data-shared-prod.telemetry_derived.insightful_spreadsheet_v1 then create a new folder insightful_spreadsheet_v1 under sql/moz-fx-data-shared-prod/telemetry_derived.
  3. Create a metadata.yaml file.
    • For insightful_spreadsheet_v1 the file would need be created under sql/moz-fx-data-shared-prod/telemetry_derived/insightful_spreadsheet_v1/metadata.yaml
  4. Open the metadata.yaml file and specify the configuration similar to the following:
friendly_name: Insightful Spreadsheet
description: >
  A description of what the data represents
owners:
  - example@mozilla.com
external_data:
  format: google_sheets
  source_uris:
    - https://docs.google.com/spreadsheets/d/Avakdiasl341kdasdf # URL to the spreadsheet
  options:
    skip_leading_rows: 1 # number of rows that should be skipped, e.g if there are header rows
workgroup_access: # the workgroup_access is optional, used for restricting data access
  - role: roles/bigquery.dataViewer
    members:
      - workgroup:secret/gp
  • It is possible to provide multiple URLs to spreadsheets. These spreadsheets need to have the same structure (same columns and types) and will be combined (aka UNIONed) in the BigQuery table.
  • workgroup_access is optional and does not need to be specified for data accessible by Mozilla employees. It only needs to be specified if a subset of people, belonging to a specific workgroup, should have access
  1. Create a schema.yaml file.
    • For insightful_spreadsheet_v1 the file would need be created under sql/moz-fx-data-shared-prod/telemetry_derived/insightful_spreadsheet_v1/schema.yaml
  2. Open the schema.yaml file and specify the structure of the spreadsheet (aka schema) similar to the following:
fields:
  - mode: NULLABLE
    name: first_column_name # this will be the column name in the BigQuery table for the first spreadsheet column
    type: STRING # this will be the data type used for this column in BigQuery
  - mode: NULLABLE
    name: second_column_name
    type: DATE
  - mode: NULLABLE
    name: third_column_name
    type: FLOAT64
  1. Go to the spreadsheet, click on "Share" and add the following service account as Editor: jenkins-node-default@moz-fx-data-terraform-admin.iam.gserviceaccount.com
    • This is necessary to ensure the correct access permissions get applied to the spreadsheet
  2. Open a pull-request against bigquery-etl and tag someone for review.
  3. Once the PR has been reviewed and merged, the table will be available the next day in BigQuery.
    • If the table should be made available immediately, then go to the bqetl_artifact_deployment Airflow DAG and clear the publish_new_tables task. This might need to be done by a data engineer or someone who has permissions to trigger and clear tasks in Airflow. The table will be available as soon as the task finishes.

For confidential data it is generally recommended to add these configurations to private-bigquery-etl. The process and configurations are the same, the only difference is the repository which is not publicly accessible.