Search Clients Last Seen
Introduction
search_clients_last_seen
is designed for individual search analysis for clients over the past year.
This can be useful when wondering about client search behavior over the past year.
NOTE: search_clients_last_seen
is currently a 1% sample, and the first day with a full year of
search activity is 2020-01-01.
Contents
search_clients_last_seen
has just one row per-client, for any client
who was active over the past year. Here we define active as "sent a main
ping"; so if they are present that does not mean they searched.
NOTE: Always choose a single submission_date
when querying search_clients_last_seen
.
The key pieces of this dataset are byte arrays that contain daily information about client activity over the past year. We have these for a variety of activity types:
days_seen_bytes
: Days when we received a main ping from the clientdays_searched_bytes
: Days that the client searched in any formdays_tagged_searched_bytes
: Days that the client performed a tagged searchdays_searched_with_ads_bytes
: Days that the client performed a search that contained adsdays_clicked_ads_bytes
: Days that the client clicked an ad post-search See "Utilizing BYTE columns" for how to use these fields.
There are some convenience columns around these, that give the number of days since the client was last active for that usage criteria:
days_since_seen
days_since_searched
days_since_tagged_searched
days_since_searched_with_ads
days_since_clicked_ad
days_since_created_profile
We also include a variety of dimension information (e.g. os, country, channel, default_search) to aggregate on. The query itself lays out all of the available dimensional fields.
There are, finally, a few fields with daily activity data.
These include active_hours_sum
, organic
for organic searches,
and total_searches
. Please note that these are just for the current day,
and not over the entire year of history contained in the days_*_bytes
columns.
Utilizing BYTE columns
These are stored as BigQuery BYTE
type, so they can be a bit confusing
to use. We have a few convenience functions for using them. For these functions,
anytime we say "was active", we mean "within the usage criteria defined by that
column"; for example, it could be days that clients searched with ads.:
udf.bits_to_days_seen
- The number of days the user was active.
udf.bits_to_days_since_seen
- The number of days since the user was last active.
udf.bits_to_days_since_first_seen
- The number of days since the user's first
active day. Note that this will be at most 365 days, since that is the beginning
of history for this dataset.
udf.bits_to_active_n_weeks_ago
- Returns whether or not the user was active n weeks
ago for the given activity type.
Engine Searches
Warning: This column was designed specifically for use with the revenue data, and probably isn't good for other kinds of analysis.
For each search engine, we store an array that contains the number of searches that user completed each month for the past 12 months. This is across calendar months, so the number of days are not directly comparable.
We have the same data for tagged searches, search with ads, and ad clicks.
Background and Caveats
search_clients_last_seen
does includes
(client_id
submission_date
) pairs
even if we did not receive a ping for that submission_date
.
Any client who was active over the past year will be included.
Accessing the Data
Access the data at search.search_clients_last_seen
.
Data Reference
Example Queries
STMO#70349
gives the WoW retention of users in different segments. Similar to GUD, a user is
considered retained if they do they same activity the next week. Note here the
outage from Armagaddon, and the outage for ad click telemetry in October.
STMO#70348
shows the amount of different search activity taken by clients. We can use it
to determine the % of clients who partake in each activity, regardless of their
baseline amount of activity.
Scheduling
This dataset is scheduled on Airflow (source).
Schema
As of 2020-04-22, the current version of search_clients_last_seee
is v1
,
and has a schema as follows.
It's backfilled through 2020-01-01
root
|- submission_date: date
|- client_id: string
|- sample_id: integer
|- country: string
|- app_version: string
|- distribution_id: string
|- locale: string
|- search_cohort: string
|- addon_version: string
|- os: string
|- channel: string
|- profile_creation_date: integer
|- default_search_engine: string
|- default_search_engine_data_load_path: string
|- default_search_engine_data_submission_url: string
|- profile_age_in_days: integer
|- active_addons_count_mean: float
|- user_pref_browser_search_region: string
|- os_version: string
|- max_concurrent_tab_count_max: integer
|- tab_open_event_count_sum: integer
|- active_hours_sum: float
|- subsession_hours_sum: float
|- sessions_started_on_this_day: integer
|- organic: integer
|- sap: integer
|- unknown: integer
|- tagged_sap: integer
|- tagged_follow_on: integer
|- ad_click: integer
|- search_with_ads: integer
|- total_searches: integer
|- tagged_searches: integer
+- engine_searches: record (repeated)
| |- key: string
| +- value: record
| | |- total_searches: integer (repeated)
| | |- tagged_searches: integer (repeated)
| | |- search_with_ads: integer (repeated)
| | |- ad_click: integer (repeated)
|- days_seen_bytes: bytes
|- days_searched_bytes: bytes
|- days_tagged_searched_bytes: bytes
|- days_searched_with_ads_bytes: bytes
|- days_clicked_ads_bytes: bytes
|- days_created_profile_bytes: bytes
Code Reference
The search_clients_last_seen
job is
defined in bigquery-etl