Scheduling BigQuery Queries in Airflow
Queries in bigquery-etl can be scheduled in Airflow to be run regularly with the results written to a table.
In the bigquery-etl project, queries are written in `/sql`. The directory structure is based on the destination table: `/sql/{project_id}/{dataset_id}/{table_name}`. For example, `/sql/moz-fx-data-shared-prod/telemetry_derived/core_clients_last_seen_v1/query.sql` is a query that will write results to the `core_clients_last_seen_v1` table in the `telemetry_derived` dataset.
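For that example, the files sit in a layout like this (the `metadata.yaml` file is introduced below):

```
sql/
└── moz-fx-data-shared-prod/           # project_id
    └── telemetry_derived/             # dataset_id
        └── core_clients_last_seen_v1/ # table_name
            ├── query.sql              # the query itself
            └── metadata.yaml          # scheduling metadata (described below)
```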
If we want to create a new table of just the distinct `client_id` values seen each day, called `client_ids`, in the `example` dataset of the `moz-fx-data-shared-prod` project, we should create `/sql/moz-fx-data-shared-prod/example/client_ids/query.sql`:
```sql
SELECT DISTINCT
  client_id,
  submission_date
FROM
  telemetry_derived.main_summary_v4
WHERE
  submission_date = @submission_date
```
`@submission_date` is a parameter that will be filled in by Airflow.
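When testing the query outside of Airflow, you can supply the parameter yourself, for example with the `bq` CLI. This is a minimal sketch; the `DATE` type is an assumption based on the daily partitioning:

```bash
# Run the query once for a specific day, filling in @submission_date by hand
bq query \
  --use_legacy_sql=false \
  --parameter=submission_date:DATE:2020-04-05 \
  < sql/moz-fx-data-shared-prod/example/client_ids/query.sql
```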
To schedule your query, create a `/sql/moz-fx-data-shared-prod/example/client_ids/metadata.yaml` file with the following content:
```yaml
friendly_name: Client IDs
description: Unique client IDs, partitioned by day.
owners:
  - example@mozilla.com
labels:
  application: firefox
  incremental: true # incremental queries add data to existing tables
scheduling:
  dag_name: bqetl_client_ids
```
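Rather than writing these files by hand, the `bqetl` CLI can scaffold them. The exact command shape here is an assumption; check `bqetl query --help` in your checkout:

```bash
# Hypothetical scaffolding step: creates the query directory,
# a query.sql stub, and a metadata.yaml template
bqetl query create example.client_ids
```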
The `scheduling` section schedules the query as a task that is part of the `bqetl_client_ids` DAG. Make sure that the `bqetl_client_ids` DAG is actually defined in `dags.yaml` and has the right schedule interval, for example:
```yaml
bqetl_client_ids: # name of the DAG; must start with bqetl_
  schedule_interval: 0 2 * * * # query schedule; every day at 2am
  default_args:
    owner: example@mozilla.com
    start_date: "2020-04-05" # YYYY-MM-DD
    email: ["example@mozilla.com"]
    retries: 2 # number of retries if the query execution fails
    retry_delay: 30m
```
In this example, the `bqetl_client_ids` DAG and the created query will be executed on a daily basis at 02:00 UTC.
Run `./script/generate_airflow_dags` to generate the Airflow DAG, or generate the DAG using the `bqetl` CLI: `bqetl dag generate bqetl_client_ids`. Task dependencies that are defined in bigquery-etl and dependencies on stable tables are determined automatically. Generated DAGs are written to the `dags/` directory and will be automatically detected and scheduled by Airflow once the changes are committed to master in bigquery-etl.
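Put together, the generation step might look like this from the root of a bigquery-etl checkout (the exact name of the generated file is an assumption):

```bash
# Generate all DAGs defined in dags.yaml
./script/generate_airflow_dags

# Or regenerate just this one DAG with the bqetl CLI
bqetl dag generate bqetl_client_ids

# Generated DAGs land in the dags/ directory
ls dags/
```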
Other considerations

- The Airflow task will overwrite the destination table partition.
- The destination table should be partitioned by `submission_date` (see the DDL sketch below).
- `date_partition_parameter` can be set to `null` in the `scheduling` section of the `metadata.yaml` file to overwrite the whole table instead of a single partition (see the sketch below).
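As an illustration, a destination table partitioned by `submission_date` for this example could be created with DDL along these lines. This is a minimal sketch, not the project's actual table definition; the column types are assumptions:

```sql
-- Hypothetical DDL for the example destination table
CREATE TABLE IF NOT EXISTS
  `moz-fx-data-shared-prod.example.client_ids` (
    client_id STRING,
    submission_date DATE
  )
PARTITION BY
  submission_date
```

And if a query should rebuild the whole table on each run rather than a single partition, the `scheduling` section could look like this (a sketch based on the option described above):

```yaml
scheduling:
  dag_name: bqetl_client_ids
  # null disables per-partition writes; the task overwrites the whole table
  date_partition_parameter: null
```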