Scheduling BigQuery Queries in Airflow

Queries in bigquery-etl can be scheduled in Airflow to run regularly, with the results written to a destination table.

In bigquery-etl

In the bigquery-etl project, queries are written in /sql. The directory structure is based on the destination table: /sql/{dataset_id}/{table_name}. For example, /sql/telemetry_derived/core_clients_last_seen_v1/query.sql is a query that will write results to the core_clients_last_seen_v1 table in the telemetry_derived dataset.
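For the client_ids example used below, the on-disk layout would look like this (metadata.yaml, used for scheduling, is described further down):

sql/
  example/
    client_ids/
      query.sql        # the query itself
      metadata.yaml    # description, ownership, and scheduling metadata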

If we want to create a new table of distinct client_ids each day, called client_ids in the example dataset, we should create /sql/example/client_ids/query.sql:

SELECT
  DISTINCT client_id,
  submission_date
FROM
  telemetry_derived.main_summary_v4
WHERE
  submission_date = @submission_date

@submission_date is a query parameter that Airflow fills in with the date of each scheduled run.
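To try the query by hand before scheduling it, the parameter can be passed through the bq CLI. This is only a sketch, assuming read access to the telemetry_derived dataset; the date is arbitrary, and since no destination table is set, nothing is overwritten:

bq query --use_legacy_sql=false \
  --parameter=submission_date:DATE:2020-04-05 \
  "$(cat sql/example/client_ids/query.sql)"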

To schedule your query, create a /sql/example/client_ids/metadata.yaml file with the following content:

friendly_name: Client IDs
description: Unique client IDs, partitioned by day.
owners:
  - example@mozilla.com
labels:
  application: firefox
  incremental: true     # incremental queries add data to existing tables
scheduling:
  dag_name: bqetl_client_ids

The scheduling section schedules the query as a task in the bqetl_client_ids DAG. Make sure that the bqetl_client_ids DAG is actually defined in dags.yaml with the desired schedule_interval, for example:

bqetl_client_ids:   # name of the DAG; must start with bqetl_
  schedule_interval: 0 2 * * *    # query schedule; every day at 2am
  default_args:
    owner: example@mozilla.com
    start_date: '2020-04-05'  # YYYY-MM-DD
    email: ['example@mozilla.com']
    retries: 2    # number of retries if the query execution fails
    retry_delay: 30m

In this example, the bqetl_client_ids DAG, including the task created for this query, will run daily at 02:00 UTC.

Run ./script/generate_airflow_dags to generate the Airflow DAGs. Task dependencies defined in bigquery-etl and dependencies on stable tables are determined automatically. Generated DAGs are written to the dags/ directory and will be automatically detected and scheduled by Airflow once the changes are committed to master in bigquery-etl.
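For instance, run from the root of a bigquery-etl checkout (the generated file name is illustrative, not guaranteed):

./script/generate_airflow_dags
ls dags/    # the generated DAG for this example would appear here, e.g. dags/bqetl_client_ids.py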

Other considerations

  • The Airflow task will overwrite the destination table partition for the run's submission_date
    • The destination table should be partitioned by submission_date
    • To overwrite the whole table instead of a single partition, set date_partition_parameter to null in the scheduling section of metadata.yaml (see the sketch after this list)
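For example, a scheduling section that overwrites the whole table on every run could look like the following sketch (the rest of metadata.yaml stays as shown earlier):

scheduling:
  dag_name: bqetl_client_ids
  date_partition_parameter: null    # overwrite the whole table instead of a single partition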