Accessing and working with BigQuery

This guide will give you a quick introduction to working with data stored in BigQuery.

BigQuery uses a columnar data storage format called Capacitor, which supports semi-structured data.

There is a cost associated with using BigQuery, based on operations. Currently we pay on-demand pricing for queries, based on how much data a query scans. To minimize costs, see Query Optimizations. More detailed pricing information can be found here.

As we transition to GCP, we will use BigQuery as our primary data warehouse and SQL query engine. BigQuery will eventually replace our previous SQL query engines, Presto and Athena, as well as our Parquet data lake.

Access

There are multiple ways to access BigQuery. For most users the primary interface will be re:dash.

See below for additional interfaces. All other interfaces will require access to be provisioned.

Interfaces

BigQuery datasets and tables can be accessed by the following methods:

Access Request

For access to BigQuery via the GCP Console and API, please file a bug here. As part of this request, we will add you to the appropriate Google Groups and provision a GCP Service Account.

From re:dash

All Mozilla users will be able to access BigQuery via re:dash through the following Data Sources:

  • BigQuery (Beta)
  • BigQuery Search (Beta)
    • Access to this data source is restricted to users in the re:dash search group.

Access via re:dash is read-only. You will not be able to create views or tables via re:dash.

GCP BigQuery Console

  • File a bug with Data Operations for access to GCP Console.
  • Visit GCP BigQuery Console
  • Switch to the project provided to you during your access request, e.g. moz-fx-data-bq-<team-name>

See Using the BigQuery web UI in the GCP Console for more details.

GCP BigQuery API Access

  • File a bug with Data Operations for access to GCP BigQuery API Access.

A list of supported BigQuery client libraries can be found here.

Detailed REST reference can be found here.
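
For example, here is a minimal sketch of querying from Python with the google-cloud-bigquery client library; the project name is illustrative and should be the one provisioned during your access request:

from google.cloud import bigquery

# Illustrative project; use the project provisioned for your team.
client = bigquery.Client(project="moz-fx-data-bq-<team-name>")

query = """
    SELECT count(*) AS n
    FROM `moz-fx-data-derived-datasets.telemetry.main_summary_v4`
    WHERE submission_date_s3 = "2019-03-01"
"""

# Run the query and iterate over the result rows.
for row in client.query(query).result():
    print(row.n)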

From bq Command-line Tool

  • Install the GCP SDK
  • Authorize gcloud with either your user account or provisioned service account. See documentation here.
    • gcloud auth login
  • Set your Google Cloud project to your team project
    • gcloud config set project moz-fx-data-bq-<team-name>
    • The project name will be provided when your account is provisioned.

bq Examples

List tables in a BigQuery dataset

bq ls moz-fx-data-derived-datasets:analysis

Query a table

bq query --nouse_legacy_sql 'select count(*) from `moz-fx-data-derived-datasets.telemetry.main_summary_v4` where submission_date_s3 = "2019-03-01"'

Additional examples and documentation can be found here.

From Spark / Databricks

There are two Spark connectors you can use:

Standard SQL connector

This connector is based on spotify/spark-bigquery. It allows you to submit arbitrary SQL queries and get the results back as a DataFrame. It's worth noting that filtering at the DataFrame level won't be pushed down to BigQuery; this is quite different from what you may be used to if you were using Parquet as your data source. Therefore, offloading as much filtering as possible to the query itself is a great way to speed up your data loads and lower their cost.

The connector library is added to the shared_serverless_python3 cluster on Databricks; see the example Scala and Python notebooks.

Storage API connector

Connectivity is provided via the BigQuery Spark Connector, which uses the BigQuery Storage API.

This connector is still a work in progress.
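
In the meantime, the sketch below shows what reading a table through the Storage API connector typically looks like, assuming the spark-bigquery-connector library is attached to your cluster; details may change as this work lands:

# `spark` is the SparkSession predefined in Databricks notebooks.
# The table name is illustrative; the Storage API reads it directly,
# without running a SQL query job.
df = spark.read.format("bigquery") \
    .option("table", "moz-fx-data-derived-datasets.telemetry.main_summary_v4") \
    .load()

# Column selection and simple predicates are pushed down to the Storage API.
df.where("submission_date_s3 = '2019-03-01'").select("client_id").count()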

From Colaboratory

Colaboratory is a Jupyter notebook environment, managed by Google and running in the cloud. Notebooks are stored in Google Drive and can be shared in a similar way to Google Docs.

Colaboratory can be used to easily access BigQuery and perform interactive analyses. See example notebook.

Note: this is very similar to API Access, so you will need access to your team's GCP project - file a request as described above.
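
A minimal sketch of authenticating and querying from a Colab notebook (the project name is illustrative):

# Authenticate the notebook user against GCP.
from google.colab import auth
auth.authenticate_user()

from google.cloud import bigquery

# Illustrative project; use your team's provisioned project.
client = bigquery.Client(project="moz-fx-data-bq-<team-name>")

# Run a query and load the results into a pandas DataFrame.
df = client.query("""
    SELECT os, count(*) AS n
    FROM `moz-fx-data-derived-datasets.telemetry.clients_last_seen`
    WHERE submission_date = "2019-03-01"
    GROUP BY os
""").to_dataframe()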

Querying Tables

Projects, Datasets and Tables in BigQuery

In GCP, a project is a way to organize cloud resources. We use multiple projects to maintain our BigQuery datasets. BigQuery datasets are top-level containers used to organize and control access to tables and views.
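
For example, this hierarchy can be walked with the Python client (a sketch, assuming API access as described above):

from google.cloud import bigquery

client = bigquery.Client(project="moz-fx-data-derived-datasets")

# Datasets are containers within a project...
for dataset in client.list_datasets():
    print(dataset.dataset_id)

# ...and tables and views live inside a dataset.
for table in client.list_tables(client.dataset("telemetry")):
    print(table.table_id)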

Caveats

  • The date partition field (e.g. submission_date_s3, submission_date) is mostly used as a partitioning column, but it has changed from the YYYYMMDD form to the more standards-friendly YYYY-MM-DD form.
  • Unqualified queries can become very costly very easily. To keep large tables from accidentally being queried for "all data for all time", we require that queries against large tables (like main_summary or clients_daily) make use of the date partition fields.
  • Please read the Query Optimizations section, which contains advice on how to reduce cost and improve query performance.
  • re:dash BigQuery data sources will have a 10 TB data scanned limit per query. Please let us know in #fx-metrics on Slack if you run into issues!
  • There are no other partitioning fields in BigQuery versions of parquet datasets (e.g. sample_id is no longer a partitioning field and will not necessarily reduce data scanned).

Projects with BigQuery datasets

| Project                       | Dataset    | Purpose                                                     |
|-------------------------------|------------|-------------------------------------------------------------|
| moz-fx-data-derived-datasets  |            | Imported parquet data, BigQuery ETL, ad-hoc analysis        |
|                               | analysis   | User-generated tables for analysis                          |
|                               | backfill   | Temporary staging area for back-fills                       |
|                               | blpadi     | Blocklist ping derived data (restricted)                    |
|                               | search     | Search data imported from parquet (restricted)              |
|                               | static     | Static data for use with analysis                           |
|                               | telemetry  | Imported parquet data and data generated from BigQuery ETL  |
|                               | tmp        | Temporary staging area for parquet data loads               |
|                               | validation | Temporary staging area for validation                       |
| moz-fx-data-shar-nonprod-efed |            | Data produced by stage structured ingestion infrastructure  |

BigQuery table data in moz-fx-data-derived-datasets is loaded daily via Airflow.

Writing Queries

To query a BigQuery table you will need to specify the dataset and table name. It is good practice to also specify the project; however, this is optional when the query originates from the same project as the table.

SELECT
    col1,
    col2
FROM
    `project.dataset.table`
WHERE
    -- date_partition_field will vary based on the table
    date_partition_field >= DATE_SUB(CURRENT_DATE, INTERVAL 1 MONTH)

An example query from the Clients Last Seen Reference:

SELECT
    submission_date,
    os,
    COUNT(*) AS count
FROM
    telemetry.clients_last_seen
WHERE
    submission_date >= DATE_SUB(CURRENT_DATE, INTERVAL 1 WEEK)
    AND days_since_seen = 0
GROUP BY
    submission_date,
    os
HAVING
    count > 10 -- remove outliers
    AND lower(os) NOT LIKE '%windows%'
ORDER BY
    os,
    submission_date DESC

Check out the BigQuery Standard SQL Functions & Operators for detailed documentation.

Writing query results to a permanent table

You can write query results to a BigQuery table you have access to via the GCP BigQuery Console or the GCP BigQuery API; a minimal Python sketch follows the steps below.

  • Use moz-fx-data-derived-datasets.analysis dataset.
    • Prefix your table with your username. If your username is username@mozilla.com, create a table named username_my_table.
  • See Writing query results documentation for detailed steps.
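
A minimal sketch of the API route, using the Python client (the destination table name is illustrative and follows the username prefix convention):

from google.cloud import bigquery

client = bigquery.Client(project="moz-fx-data-derived-datasets")

# Write the query results to analysis.username_my_table.
job_config = bigquery.QueryJobConfig()
job_config.destination = client.dataset("analysis").table("username_my_table")

query_job = client.query(
    'SELECT os, count(*) AS n '
    'FROM `moz-fx-data-derived-datasets.telemetry.clients_last_seen` '
    'WHERE submission_date = "2019-03-01" GROUP BY os',
    job_config=job_config,
)
query_job.result()  # wait for the job to finish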

Creating a View

You can create views in BigQuery if you have access via the GCP BigQuery Console or the GCP BigQuery API; a minimal Python sketch follows the steps below.

  • Use moz-fx-data-derived-datasets.analysis dataset.
    • Prefix your view with your username. If your username is username@mozilla.com, create a view named username_my_view.
  • See Creating Views documentation for detailed steps.
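
A minimal sketch of the API route, using the Python client (the view name is illustrative):

from google.cloud import bigquery

client = bigquery.Client(project="moz-fx-data-derived-datasets")

# Define the view as a saved query over clients_last_seen.
view = bigquery.Table(client.dataset("analysis").table("username_my_view"))
view.view_query = """
    SELECT submission_date, os
    FROM `moz-fx-data-derived-datasets.telemetry.clients_last_seen`
    WHERE days_since_seen = 0
"""
client.create_table(view)  # fails if the view already exists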

Query Optimizations

To improve query performance and minimize the cost associated with using BigQuery, please apply the following query optimizations (a dry-run sketch for estimating a query's scanned bytes follows the list):

  • Avoid SELECT * by selecting only the columns you need
    • Using SELECT * is the most expensive way to query data. When you use SELECT *, BigQuery does a full scan of every column in the table.
    • Applying a LIMIT clause to a SELECT * query does not affect the amount of data read. You are billed for reading all bytes in the entire table.
    • If you are experimenting with data or exploring data, use one of the data preview options instead of SELECT *.
      • Preview support is coming soon to BigQuery data sources in re:dash
  • Limit the amount of data scanned by using a date partition filter
    • Tables that are larger than 1 TB will require that you provide a date partition filter as part of the query.
    • You will receive an error if you attempt to query a table that requires a partition filter.
      • Cannot query over table 'moz-fx-data-derived-datasets.telemetry.main_summary_v4' without a filter over column(s) 'submission_date_s3' that can be used for partition elimination
    • See Writing Queries for examples.
  • Reduce data before using a JOIN
    • Trim the data as early in the query as possible, before the query performs a JOIN. If you reduce data early in the processing cycle, shuffling and other complex operations only execute on the data that you need.
    • Use subqueries with filters, or intermediate tables or views, as a way of decreasing the size of each side of a join prior to the join itself.
  • Do not treat WITH clauses as prepared statements
    • WITH clauses are used primarily for readability because they are not materialized. For example, placing all your queries in WITH clauses and then running UNION ALL is a misuse of the WITH clause. If a query appears in more than one WITH clause, it executes in each clause.
  • Use approximate aggregation functions
    • If the SQL aggregation function you're using has an equivalent approximation function, the approximation function will yield faster query performance. For example, instead of using COUNT(DISTINCT), use APPROX_COUNT_DISTINCT().
    • See approximate aggregation functions in the standard SQL reference.
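
A practical way to check the effect of these optimizations is a dry run, which reports how many bytes a query would scan without running it or incurring cost. A sketch using the Python client:

from google.cloud import bigquery

client = bigquery.Client(project="moz-fx-data-derived-datasets")

# A dry run validates the query and estimates bytes scanned without running it.
job_config = bigquery.QueryJobConfig()
job_config.dry_run = True
job_config.use_query_cache = False

query_job = client.query(
    'SELECT APPROX_COUNT_DISTINCT(client_id) '
    'FROM `moz-fx-data-derived-datasets.telemetry.main_summary_v4` '
    'WHERE submission_date_s3 = "2019-03-01"',
    job_config=job_config,
)
print("This query would scan {} bytes.".format(query_job.total_bytes_processed))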

A complete list of optimizations can be found here, and cost optimizations here.