Accessing and working with BigQuery

This guide gives a quick introduction to working with data stored in BigQuery.

BigQuery uses a columnar data storage format called Capacitor, which supports semi-structured data.

There is a cost associated with using BigQuery, based on operations. At the moment we use on-demand pricing for queries, which charges according to how much data a query scans. To minimize costs, see Query Optimizations. More detailed pricing information can be found here.

As we transition to GCP, we will use BigQuery as our primary data warehouse and SQL query engine. BigQuery will eventually replace our previous SQL query engines, Presto and Athena, as well as our Parquet data lake.

Access

There are multiple ways to access BigQuery. For most users, the primary interface will be re:dash.

See below for additional interfaces. All other interfaces will require access to be provisioned.

Interfaces

BigQuery datasets and tables can be accessed by the methods described in the sections below.

Access Request

For access to BigQuery via the GCP Console and API, please file a bug here. As part of this request we will add you to the appropriate Google Groups and provision a GCP service account.

From re:dash

All Mozilla users will be able to access BigQuery via re:dash through the following Data Sources:

  • BigQuery (Beta)
  • BigQuery Search (Beta)
    • This data source is restricted to users in the re:dash search group.

Access via re:dash is read-only. You will not be able to create views or tables via re:dash.

GCP BigQuery Console

  • File a bug with Data Operations for access to GCP Console.
  • Visit GCP BigQuery Console
  • Switch to the project provided to you during your access request, e.g. moz-fx-data-bq-<team-name>

See Using the BigQuery web UI in the GCP Console for more details.

GCP BigQuery API Access

  • File a bug with Data Operations for access to GCP BigQuery API Access.

A list of supported BigQuery client libraries can be found here.

Detailed REST reference can be found here.

From bq Command-line Tool

  • Install the GCP SDK
  • Authorize gcloud with either your user account or a provisioned service account. See the documentation here.
    • gcloud auth login
  • Set your Google Cloud project to your team project
    • gcloud config set project moz-fx-data-bq-<team-name>
    • The project name will be provided to you when your account is provisioned.

bq Examples

List tables in a BigQuery dataset

bq ls moz-fx-data-derived-datasets:analysis

Query a table

bq query --nouse_legacy_sql 'select count(*) from `moz-fx-data-derived-datasets.telemetry.main_summary_v4` where submission_date_s3 = "2019-03-01"'

Additional examples and documentation can be found here.

From client SDKs

Client SDKs for various programming languages don't access credentials the same way as the gcloud and bq command-line tools. The client SDKs generally assume that the machine is configured with a service account and will look for JSON-based credentials in several well-known locations rather than looking for user credentials.

If you have service account credentials, you can point client SDKs at them by setting:

export GOOGLE_APPLICATION_CREDENTIALS=/path/to/creds.json

If you don't have appropriate service account credentials, but your GCP user account has sufficient access, you can have your user credentials mimic a service account by running:

gcloud auth application-default login

Once you've followed the browser flow to grant access, you should be able to, for example, access BigQuery from Python:

pip install google-cloud-bigquery
python -c 'from google.cloud import bigquery; print([d.dataset_id for d in bigquery.Client().list_datasets()])'
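
Building on that, here is a minimal sketch (not an official example) of running a query and iterating over the results with the Python client; the project and table names follow the examples used elsewhere in this guide:

from google.cloud import bigquery

# Assumes application-default or service-account credentials are set up as described above.
client = bigquery.Client(project="moz-fx-data-derived-datasets")

query = """
    SELECT os, COUNT(*) AS n
    FROM `moz-fx-data-derived-datasets.telemetry.clients_last_seen`
    WHERE submission_date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
      AND days_since_seen = 0
    GROUP BY os
    ORDER BY n DESC
"""

for row in client.query(query).result():
    print(row.os, row.n)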

From Spark / Databricks

There are two Spark connectors you can use:

Standard SQL connector

This connector is based on spotify/spark-bigquery. It allows you to submit arbitrary SQL queries and get the results back as a DataFrame. It's worth noting that filtering at the DataFrame level won't be pushed down to BigQuery - this is a big difference from what you might be used to if you were using Parquet as your data source. Therefore, offloading as much filtering as possible to the query is a great way to speed up your data loads and lower their cost.

The connector library is added to the shared_serverless_python3 cluster on Databricks - see the example Scala and Python notebooks.

Storage API connector

The Storage API connector uses the BigQuery Storage API. Although both the API and the connector are in Beta, this is the recommended way to access BigQuery from Spark, as it offers better performance and is more cost-effective than the standard connector. This connector has some limited support for pushing down filter predicates; it can also be used together with the BigQuery client, which allows you to run arbitrary SQL queries and load the results into Spark.

The connector library is added to the shared_serverless_python3 cluster on Databricks - see the example Python notebook for more details on the topics mentioned above.
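
As a rough sketch (see the linked notebook for the canonical example), reading a table through the Storage API connector from a Databricks notebook might look like the following; the table name is only an example and the exact option syntax can vary between connector versions:

# Assumes a SparkSession named `spark` (predefined in Databricks notebooks) and that
# the spark-bigquery connector is on the cluster, e.g. shared_serverless_python3.
df = (
    spark.read.format("bigquery")
    .option("table", "moz-fx-data-derived-datasets.telemetry.main_summary_v4")
    .load()
)

# Filter as early as possible; some predicates can be pushed down to the Storage API.
recent = df.where("submission_date_s3 = '2019-03-01'").select("client_id", "os")
recent.show(5)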

From Spark / Dataproc

Querying BigQuery from Dataproc will be faster than from Databricks because it will not involve cross-cloud data transfers.

You can spin up a Dataproc cluster with Jupyter using the following command. Insert your own values for cluster-name, bucket-name, and project-id. Your notebooks will be stored in Cloud Storage under gs://bucket-name/notebooks/jupyter:

gcloud beta dataproc clusters create cluster-name \
    --optional-components=ANACONDA,JUPYTER \
    --image-version=1.4 \
    --enable-component-gateway \
    --properties=^#^spark:spark.jars=gs://spark-lib/bigquery/spark-bigquery-latest.jar,gs://spark-bigquery-dev-test/spark-bigquery-spotify-connector/spark-bigquery-assembly-0.3.0-SNAPSHOT.jar \
    --num-workers=5 \
    --max-idle=3h \
    --bucket bucket-name \
    --project project-id

The Jupyter URL can be retrieved with the following command:

gcloud beta dataproc clusters describe cluster-name --project project-id | grep Jupyter

After you've finished your work, it's a good practice to delete your cluster:

gcloud beta dataproc clusters delete cluster-name --project project-id

From Colaboratory

Colaboratory is a Jupyter notebook environment, managed by Google and running in the cloud. Notebooks are stored in Google Drive and can be shared in a similar way to Google Docs.

Colaboratory can be used to easily access BigQuery and perform interactive analyses. See the Telemetry Hello World notebook.

Note: this is very similar to API Access, so you will need access to your team's GCP project - file a request as described above.
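
As a minimal sketch (the linked notebook remains the canonical example), authenticating and querying from a Colaboratory cell might look like this; the project name is the placeholder from your access request:

# Authenticate as your own user inside the Colab runtime.
from google.colab import auth
auth.authenticate_user()

from google.cloud import bigquery

client = bigquery.Client(project="moz-fx-data-bq-<team-name>")

query = """
    SELECT COUNT(*) AS n
    FROM `moz-fx-data-derived-datasets.telemetry.clients_last_seen`
    WHERE submission_date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
"""
print(list(client.query(query).result()))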

Querying Tables

Projects, Datasets and Tables in BigQuery

In GCP, a project is a way to organize cloud resources. We use multiple projects to maintain our BigQuery datasets. A BigQuery dataset is a top-level container used to organize and control access to tables and views.

Caveats

  • The date partition field (e.g. submission_date_s3, submission_date) is mostly used as a partitioning column, but it has changed from YYYYMMDD form to the more standards-friendly YYYY-MM-DD form.
  • Unqualified queries can become very costly very easily. To prevent accidentally querying "all data for all time", we've placed restrictions on large tables: you must filter on the date partition field for large tables (like main_summary or clients_daily).
  • Please read the Query Optimizations section, which contains advice on how to reduce cost and improve query performance.
  • re:dash BigQuery data sources will have a 10 TB data scanned limit per query. Please let us know in #fx-metrics on Slack if you run into issues!
  • There are no other partitioning fields in BigQuery versions of parquet datasets (e.g. sample_id is no longer a partitioning field and will not necessarily reduce data scanned).
  • There is no native map support in BigQuery. Instead, we are using structs with fields [key, value]. We have provided convenience functions to access these like key-value maps (described below).

Projects with BigQuery datasets

| Project | Dataset | Purpose |
|---|---|---|
| moz-fx-data-derived-datasets | | Imported parquet data, BigQuery ETL, ad-hoc analysis |
| | analysis | User generated tables for analysis |
| | backfill | Temporary staging area for back-fills |
| | blpadi | Blocklist ping derived data (restricted) |
| | search | Search data imported from parquet (restricted) |
| | static | Static data for use with analysis |
| | telemetry | User-facing views on imported parquet data and tables generated from BigQuery ETL |
| | telemetry_raw | Imported parquet data |
| | tmp | Temporary staging area for parquet data loads |
| | validation | Temporary staging area for validation |
| moz-fx-data-shar-nonprod-efed | | Data produced by stage structured ingestion infrastructure |

BigQuery table data in moz-fx-data-derived-datasets is loaded daily via Airflow.

Writing Queries

To query a BigQuery table you will need to specify the dataset and table name. It is good practice to specify the project as well; however, depending on which project the query originates from, this is optional.

SELECT
    col1,
    col2
FROM
    `project.dataset.table`
WHERE
    -- date_partition_field will vary based on the table
    date_partition_field >= DATE_SUB(CURRENT_DATE, INTERVAL 1 MONTH)

An example query from the Clients Last Seen Reference:

SELECT
    submission_date,
    os,
    COUNT(*) AS count
FROM
    telemetry.clients_last_seen
WHERE
    submission_date >= DATE_SUB(CURRENT_DATE, INTERVAL 1 WEEK)
    AND days_since_seen = 0
GROUP BY
    submission_date,
    os
HAVING
    count > 10 -- remove outliers
    AND lower(os) NOT LIKE '%windows%'
ORDER BY
    os,
    submission_date DESC

Check out the BigQuery Standard SQL Functions & Operators for detailed documentation.

Writing query results to a permanent table

You can write query results to a BigQuery table you have access to via GCP BigQuery Console or GCP BigQuery API Access; a sketch using the Python client follows the list below.

  • Use moz-fx-data-derived-datasets.analysis dataset.
    • Prefix your table with your username. If your username is username@mozilla.com, create a table named username_my_table.
  • See Writing query results documentation for detailed steps.
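
For the API route, a minimal sketch with the Python client might look like this; the destination table name follows the username-prefix convention above and the query itself is only illustrative:

from google.cloud import bigquery

client = bigquery.Client(project="moz-fx-data-derived-datasets")

# Write the query results to analysis.username_my_table, replacing it if it exists.
job_config = bigquery.QueryJobConfig()
job_config.destination = client.dataset("analysis").table("username_my_table")
job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE

query = """
    SELECT os, COUNT(*) AS n
    FROM `moz-fx-data-derived-datasets.telemetry.clients_last_seen`
    WHERE submission_date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
    GROUP BY os
"""
client.query(query, job_config=job_config).result()  # blocks until the job finishes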

Creating a View

You can create views in BigQuery if you have access via GCP BigQuery Console or GCP BigQuery API Access; a sketch using the Python client follows the list below.

  • Use moz-fx-data-derived-datasets.analysis dataset.
    • Prefix your view with your username. If your username is username@mozilla.com, create a view named username_my_view.
  • See Creating Views documentation for detailed steps.
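
For the API route, a minimal sketch using a CREATE OR REPLACE VIEW statement run through the Python client; the view name follows the username-prefix convention above and the definition is only illustrative:

from google.cloud import bigquery

client = bigquery.Client(project="moz-fx-data-derived-datasets")

# Create (or replace) a view under the analysis dataset.
ddl = """
    CREATE OR REPLACE VIEW `moz-fx-data-derived-datasets.analysis.username_my_view` AS
    SELECT submission_date, os, COUNT(*) AS n
    FROM `moz-fx-data-derived-datasets.telemetry.clients_last_seen`
    WHERE days_since_seen = 0
    GROUP BY submission_date, os
"""
client.query(ddl).result()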

Accessing map-like fields

BigQuery currently lacks native map support; our workaround is to use a STRUCT type with fields named [key, value]. We've created a user-defined function (UDF) that provides key-based access with the signature udf.get_key(<struct>, <key>). The example below generates a count per reason key in the event_map_values field of the telemetry events table, for Normandy unenrollment events from yesterday.

SELECT udf.get_key(event_map_values, 'reason') AS reason,
       COUNT(*) AS EVENTS
FROM telemetry.events
WHERE submission_date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
  AND event_category='normandy'
  AND event_method='unenroll'
GROUP BY 1
ORDER BY 2 DESC

Query Optimizations

To improve query performance and minimize the cost associated with using BigQuery please see the following query optimizations:

  • Avoid SELECT * by selecting only the columns you need
    • Using SELECT * is the most expensive way to query data. When you use SELECT *, BigQuery does a full scan of every column in the table.
    • Applying a LIMIT clause to a SELECT * query might not affect the amount of data read, depending on the table structure.
      • Many of our tables are configured to use clustering in which case a LIMIT clause does effectively limit the amount of data that needs to be scanned.
      • Tables that include a sample_id field will usually have that as one of the clustering fields and you can efficiently scan random samples of users by specifying WHERE sample_id = 0 (1% sample), WHERE sample_id < 10 (10% sample), etc. This can be especially helpful with main_summary, clients_daily, and clients_last_seen which are very large tables and are all clustered on sample_id.
      • To check whether your LIMIT and WHERE clauses are actually improving performance, you should see a lower value reported for actual "Data Scanned" by a query compared to the prediction ("This query will process X bytes") in STMO or the BigQuery UI.
    • If you are experimenting with data or exploring data, use one of the data preview options instead of SELECT *.
      • Preview support is coming soon to BigQuery data sources in re:dash
  • Limit the amount of data scanned by using a date partition filter
    • Tables that are larger than 1 TB will require that you provide a date partition filter as part of the query.
    • You will receive an error if you attempt to query a table that requires a partition filter.
      • Cannot query over table 'moz-fx-data-derived-datasets.telemetry.main_summary_v4' without a filter over column(s) 'submission_date_s3' that can be used for partition elimination
    • See Writing Queries for examples.
  • Reduce data before using a JOIN
    • Trim the data as early in the query as possible, before the query performs a JOIN. If you reduce data early in the processing cycle, shuffling and other complex operations only execute on the data that you need.
    • Use subqueries with filters, or intermediate tables or views, as a way of decreasing the size of each side of a join before the join itself.
  • Do not treat WITH clauses as prepared statements
    • WITH clauses are used primarily for readability because they are not materialized. For example, placing all your queries in WITH clauses and then running UNION ALL is a misuse of the WITH clause. If a query appears in more than one WITH clause, it executes in each clause.
  • Use approximate aggregation functions
    • If the SQL aggregation function you're using has an equivalent approximation function, the approximation function will yield faster query performance. For example, instead of using COUNT(DISTINCT), use APPROX_COUNT_DISTINCT().
    • See approximate aggregation functions in the standard SQL reference.
  • Reference the data size prediction ("This query will process X bytes") in STMO and the BigQuery UI to help gauge the efficiency of your queries. You should see this number go down as you limit the range of submission_dates or include fewer fields in your SELECT statement. For clustered tables, this estimate won't take into account benefits from LIMITs and WHERE clauses on clustering fields, so you'll need to compare against the actual "Data Scanned" after the query is run. Queries are charged by data scanned at $5/TB, so each 200 GB of data scanned costs $1; it can be useful to keep the data estimate below 200 GB while developing and testing a query to limit cost and query time, then open up to the full range of data you need once you have confidence in the results. A dry run via the BigQuery API reports the same estimate programmatically; see the sketch after this list.
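
If you'd rather check the estimate programmatically, a dry run via the Python client reports the same number without scanning any data. A minimal sketch (table name taken from the bq example above, query shape only illustrative):

from google.cloud import bigquery

client = bigquery.Client(project="moz-fx-data-derived-datasets")

# A dry run returns the bytes-processed estimate without running the query.
job_config = bigquery.QueryJobConfig()
job_config.dry_run = True
job_config.use_query_cache = False

query = """
    SELECT APPROX_COUNT_DISTINCT(client_id) AS dau
    FROM `moz-fx-data-derived-datasets.telemetry.main_summary_v4`
    WHERE submission_date_s3 >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 WEEK)
      AND sample_id = 0  -- 1% sample; clustering can reduce actual data scanned
"""
job = client.query(query, job_config=job_config)
print("This query will process {:.2f} GB".format(job.total_bytes_processed / 1e9))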

A complete list of query optimizations can be found here, and cost optimizations here.