Search Aggregates

Introduction

search_aggregates is designed to power high level search dashboards. It's quick and easy to query, but the data are coarse. In particular, this dataset allows you to segment by a limited number of client characteristics which are relevant to search markets. However, it is not possible to normalize by client count. If you need fine-grained data, consider using search_clients_engines_sources_daily which breaks down search counts by client, engine, and source.

Contents

Each row of search_aggregates contains the standard search count aggregations for each unique combination of the following columns. Unless otherwise noted, these columns are taken directly from main_summary.

  • submission_date - yyyymmdd
  • engine - e.g. google, bing, yahoo
  • source - The UI component used to issue a search - e.g. urlbar, abouthome
  • country
  • locale
  • addon_version - The installed version of the [followonsearch addon] (before version 61)
  • app_version
  • distribution_id - NULL means the standard Firefox build
  • search_cohort - NULL except for small segments relating to search experimentation
  • default_search_engine
  • default_private_search_engine
  • os - e.g. Linux, Windows_NT, Darwin ...
  • os_version
  • is_default_browser

There are ten aggregation columns: sap, tagged-sap, tagged-follow-on,organic, unknown, ad_click, ad_click_organic, search_with_ads, search_with_ads_organic, and client_count. Each of these columns represent different types of searches. For more details, see the search data documentation.

Gotcha

Although search_aggregates table is created on top of search_clients_engines_sources_daily, you may expect the total search metrics reported to match exactly. It's actually not the case. In case you notice the total number reported in search_aggregates higher than search_clients_engines_sources_daily, it's most likely due to Shredder. search_aggregates table is aggregated beyond the client level, so shredder doesn’t have to touch it. But search_clients_engines_sources_daily contains client_id and is subject to shredder. It's expected to lose up to 1% of rows every month as Firefox responds to clients' deletion requests, which would reduce count in search_clients_engines_soruces but not in search_aggregates. An example query to show such a difference can be found in STMO#84302.

Data Reference

Example Queries

Daily US sap searches

SELECT
    submission_date,
    SUM(SAP) AS search_counts
FROM search.search_aggregates
WHERE
    country = 'US'
    AND submission_date BETWEEN '2019-01-01' AND '2019-01-07'
GROUP BY submission_date
ORDER BY submission_date

STMO#51140

Scheduling

This job is scheduled on airflow to run daily.

Schema

As of 2021-04-29, the current version of search_aggregates is v8, and has a schema as follows. The dataset is backfilled through 2016-03-11.

root
 |-- submission_date: date (nullable = true)
 |-- submission_date_s3: date (nullable = true)
 |-- country: string (nullable = true)
 |-- engine: string (nullable = true)
 |-- normalized_engine: string (nullable = true)
 |-- source: string (nullable = true)
 |-- app_version: string (nullable = true)
 |-- distribution_id: string (nullable = true)
 |-- locale: string (nullable = true)
 |-- search_cohort: string (nullable = true)
 |-- addon_version: string (nullable = true)
 |-- tagged_sap: long (nullable = true)
 |-- tagged_follow_on: long (nullable = true)
 |-- sap: long (nullable = true)
 |-- organic: long (nullable = true)
 |-- search_with_ads: long (nullable = true)
 |-- search_with_ads_organic: long (nullable = true)
 |-- ad_click: long (nullable = true)
 |-- ad_click_organic: long (nullable = true)
 |-- unknown: long (nullable = true)
 |-- client_count: long (nullable = true)
 |-- default_search_engine: string (nullable = true)
 |-- default_private_search_engine: string (nullable = true)
 |-- os: string (nullable = true)
 |-- os_version: string (nullable = true)
 |-- is_default_browser: boolean (nullable = true)

Code Reference

The search_aggregates job is defined in bigquery-etl