Accessing BigQuery

There are many methods that you can use to access BigQuery: both interactive and programmatic. This document provides some basic information and pointers on how to get started with each.

Table of Contents

Interfaces

STMO (sql.telemetry.mozilla.org)

Queries made from STMO are read-only: you cannot create views or tables.

All users with access to STMO can access BigQuery using the following data sources:

  • Telemetry (BigQuery)
  • Telemetry Search (BigQuery)

BigQuery Console

This method requires BigQuery Access to be provisioned.

The BigQuery console is similar to STMO, but allows write access to views and tables. Some people also prefer its user interface, though note that results that you get from it can only be shared with others who also have BigQuery access provisioned.

  • Visit GCP BigQuery Console
  • Switch to the project provided to you during your access request e.g moz-fx-data-bq-<team-name>
  • Write and run your queries

Note that if you are trying to query the telemetry dataset, you will need to explicitly specify the project (moz-fx-data-shared-prod) that it lives in, since you're querying from a different one. For example:

SELECT
  client_id
FROM
  `moz-fx-data-shared-prod`.telemetry.main
WHERE
  DATE(submission_timestamp) = '2020-04-20'
  AND sample_id = 42
  AND application.channel='nightly'

For more details, see Google's Documentation on the GCP Console.

Using the bq Command-Line Tool

This method requires BigQuery Access to be provisioned.

Steps to use:

  • Install the GCP SDK
  • Authorize gcloud with either your user account or provisioned service account. See documentation here.
    • gcloud auth login
  • Set your google project to your team project
    • gcloud config set project moz-fx-data-bq-<team-name>
    • project name will be provided for you when your account is provisioned.

Once configured, you can now use the bq command-line client. The following example lists the tables and views in a BigQuery dataset:

bq ls moz-fx-data-derived-datasets:telemetry

And here's another which gets the count of entries in telemetry.main on 2019-08-22 in the nightly channel:

bq query --nouse_legacy_sql 'select count(*) from `moz-fx-data-derived-datasets.telemetry.main` where submission_date = "2019-08-22" and normalized_channel="nightly"'

Additional examples and documentation can be found in the BigQuery command-line reference.

API Access

This method requires BigQuery Access to be provisioned.

For advanced use cases involving programmatic access -- including automated workloads, ETL, BigQuery Storage API.

You can locate a list of supported BigQuery client libraries here.

Although you typically want to use a client library, Google also provides a detailed reference of their underlying REST API.

Service Accounts

Client SDKs do not access credentials the same way as the gcloud and bq command-line tools. The client SDKs generally assume that the machine is configured with a service account and looks for JSON-based credentials in several well-known locations rather than looking for user credentials.

If you have service account credentials, you can point client SDKs at them by setting:

export GOOGLE_APPLICATION_CREDENTIALS=/path/to/creds.json

If you do not have appropriate service account credentials, but your GCP user account has sufficient access, you can have your user credentials mimic a service account by running:

gcloud auth application-default login

Once you've followed the browser flow to grant access, you should be able to, for example, access BigQuery from Python:

pip install google-cloud-bigquery
python -c 'from google.cloud import bigquery; print([d.dataset_id for d in bigquery.Client().list_datasets()])'

Spark

Apache Spark is a data processing engine designed to be fast and easy to use. There are several methods you can use to access BigQuery via Spark, depending on your needs. See Custom Analysis with Spark for more information and examples.

Colaboratory

This method requires BigQuery Access to be provisioned.

Colaboratory (Colab) is Jupyter notebook environment, managed by Google and running in the cloud. Notebooks are stored in Google Drive and can be shared in a similar way to Google Docs.

Colab can be used to easily access BigQuery and perform analyses. See the Telemetry Hello World notebook for an interactive example. Under the hood, it uses the BigQuery API to read and write to BigQuery tables, so access needs to be explicitly provisioned.

BigQuery Access Request

For access to BigQuery when using the GCP Console and API, file a bug. You will be added to the appropriate Google Groups and a GCP Service Account will be provisioned for you.