Heavy Users
As of 2018-05-18, this dataset has been deprecated and is no longer maintained. See Bug 1455314.
Replacement
We've moved to assigning users an active tag based on total_uri_count.
The activity of a user based on active_ticks is available in clients_daily
in the active_hours_sum field, which is computed as sum(active_ticks / 720);
since each active tick represents five seconds of activity, 720 ticks correspond to one hour.
To retrieve a client's 28-day active_hours, use the following query:
```sql
SELECT
  submission_date_s3,
  client_id,
  SUM(active_hours_sum) OVER (
    PARTITION BY client_id
    ORDER BY submission_date_s3 ASC
    ROWS 27 PRECEDING
  ) AS monthly_active_hours
FROM clients_daily
```
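The same rolling 28-day sum can be computed in Spark. The sketch below is a hedged PySpark equivalent of the query above; it assumes clients_daily is available as a table (for example via spark.table) and that there is one row per client per submission_date_s3, which the ROWS-based window relies on.

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

# Assumes clients_daily is registered as a table; adjust the source as needed.
clients_daily = spark.table("clients_daily")

# Rolling 28-day window per client: the current row plus the 27 preceding rows,
# mirroring `ROWS 27 PRECEDING` in the SQL above.
w = (
    Window.partitionBy("client_id")
    .orderBy("submission_date_s3")
    .rowsBetween(-27, Window.currentRow)
)

monthly_active_hours = clients_daily.select(
    "submission_date_s3",
    "client_id",
    F.sum("active_hours_sum").over(w).alias("monthly_active_hours"),
)
```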
Introduction
The heavy_users table provides information about whether a given client_id is
considered a "heavy user" on each day (using submission date).
Contents
The heavy_users table contains one row per client-day, where day is
submission_date. A client has a row for a specific submission_date if
they were active at all in the 28 day window ending on that submission_date.
A user is a "heavy user" as of day N if, for the 28 day period ending
on day N, the sum of their active_ticks is in the 90th percentile (or
above) of all clients during that period. For more analysis on this,
and a discussion of new profiles, see
this link.
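To make the definition concrete, the flag can be sketched in PySpark as follows. This is only an illustration of the logic, not the production job (see the Code Reference section below); it assumes a per-client, per-day source of active_ticks such as main_summary, and uses an approximate percentile where the real implementation may differ.

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

# Assumed source: one active_ticks total per client per submission date,
# aggregated here from main_summary for illustration.
daily = (
    spark.table("main_summary")
    .groupBy("client_id", "submission_date_s3")
    .agg(F.sum("active_ticks").alias("active_ticks"))
)

# 28-day rolling sum of active_ticks per client (day N plus the 27 days before it).
w28 = (
    Window.partitionBy("client_id")
    .orderBy("submission_date_s3")
    .rowsBetween(-27, Window.currentRow)
)
with_period = daily.withColumn("active_ticks_period", F.sum("active_ticks").over(w28))

# 90th-percentile cutoff of the 28-day sums across all clients, per day.
cutoffs = with_period.groupBy("submission_date_s3").agg(
    F.expr("percentile_approx(active_ticks_period, 0.9)").alias("p90")
)

# heavy_user on day N: 28-day sum at or above that day's cutoff.
heavy = with_period.join(cutoffs, "submission_date_s3").withColumn(
    "heavy_user", F.col("active_ticks_period") >= F.col("p90")
)
```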
Background and Caveats
- Data starts at 20170801. There is technically data in the table before this, but the heavy_user column is NULL for those dates because it needed to bootstrap the first 28 day window.
- Because it is the top 10% of clients for each 28 day period, more than 10% of clients active on a given submission_date will be considered heavy users. If you join with another data source (main_summary, for example), you may see a larger proportion of heavy users than expected.
- Each day has a separate, but related, set of heavy users. Initial investigations show that approximately 97.5% of heavy users as of a certain day are still considered heavy users as of the next day.
- There is no "fixing" or weighting of new profiles: days before the profile was created are counted as zero active_ticks. Analyses may need to use the included profile_creation_date field to take this into account (see the sketch after this list).
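If an analysis needs to control for new profiles, one option is to restrict to clients whose profile is old enough to have a full 28-day history. A hedged sketch, assuming profile_creation_date is encoded as days since the Unix epoch and submission_date_s3 as a yyyyMMdd string (check the encodings before relying on this):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

heavy_users = spark.read.parquet("s3://telemetry-parquet/heavy_users/v1")

# Assumption: profile_creation_date is days since the Unix epoch and
# submission_date_s3 is a yyyyMMdd string; adjust if the encodings differ.
with_dates = heavy_users.withColumn(
    "profile_created",
    F.expr("date_add('1970-01-01', cast(profile_creation_date as int))"),
).withColumn(
    "submission_date", F.to_date("submission_date_s3", "yyyyMMdd")
)

# Keep only clients whose profile existed for the full 28-day window.
established = with_dates.where(
    F.datediff("submission_date", "profile_created") >= 27
)
```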
Accessing the Data
The data is available both via sql.t.m.o and Spark.
In Spark:
```python
spark.read.parquet("s3://telemetry-parquet/heavy_users/v1")
```
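For example, once the Parquet data is loaded you might restrict to a single submission date and sample before doing anything heavier (the specific values below are illustrative):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

heavy_users = spark.read.parquet("s3://telemetry-parquet/heavy_users/v1")

# Illustrative filter: one day, one 1% sample, heavy users only.
subset = heavy_users.where(
    (F.col("submission_date_s3") == "20180101")
    & (F.col("sample_id") == 42)
    & F.col("heavy_user")
)
```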
In SQL:
```sql
SELECT * FROM heavy_users LIMIT 3
```
Further Reading
The code responsible for generating this dataset is here.
Data Reference
Example Queries
Example queries:
- Join heavy_users with main_summary to get the distribution of max_concurrent_tab_count for heavy vs. non-heavy users (STMO#47041); a join along these lines is sketched below.
- Join heavy_users with longitudinal to get crash rates for heavy vs. non-heavy users (STMO#47044)
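The STMO queries linked above have the details; the shape of such a join can be sketched in PySpark like this. The engagement scalar column name is an assumption about main_summary's schema, so verify it before relying on the result.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

heavy_users = spark.read.parquet("s3://telemetry-parquet/heavy_users/v1")
main_summary = spark.table("main_summary")

# Assumed name of the max-concurrent-tab engagement scalar in main_summary.
TAB_COUNT = "scalar_parent_browser_engagement_max_concurrent_tab_count"

joined = heavy_users.join(
    main_summary.select("client_id", "submission_date_s3", TAB_COUNT),
    on=["client_id", "submission_date_s3"],
)

# Compare the tab-count distribution for heavy vs. non-heavy users.
summary = joined.groupBy("heavy_user").agg(
    F.expr(
        f"percentile_approx({TAB_COUNT}, array(0.25, 0.5, 0.75, 0.95))"
    ).alias("percentiles"),
    F.avg(TAB_COUNT).alias("mean"),
)
```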
Schema
As of 2017-10-05, the current version of the heavy_users dataset is v1, and has a schema as follows:
```
root
 |-- client_id: string (nullable = true)
 |-- sample_id: integer (nullable = true)
 |-- profile_creation_date: long (nullable = true)
 |-- active_ticks: long (nullable = true)
 |-- active_ticks_period: long (nullable = true)
 |-- heavy_user: boolean (nullable = true)
 |-- prev_year_heavy_user: boolean (nullable = true)
 |-- submission_date_s3: string (nullable = true)
```
Code Reference
This dataset is generated by telemetry-batch-view. Refer to this repository for information on how to run or augment the dataset.