Intro
The active_profiles dataset gives client-level estimates of whether a profile is still an active
user of the browser at a given point in time, as well as probabilistic forecasts of the client's
future activity. These quantities are estimated by a model that attempts to infer, and separate
from one another, a client's latent propensity to leave Firefox and become inactive and their
latent propensity to use the browser while still active. These estimates are currently generated
for release desktop browser profiles only, across all operating systems and geographies.
Model
The model generates predictions for each client by looking at just the recency and frequency of a
client's daily usage within the previous 90-day window. Usage is defined by a daily binary
indicator of whether the client shows up in clients_daily on a given day.
The table contains columns related to these quantities:
submission_date: Day marking the end of the 90-day window. The earliest submission_date that the table covers is '2019-05-13'.
min_day: First day in the window that the client was seen. This could be anywhere between the first day in the window and the last day in the window.
max_day: Last day in the window that the client was seen. The highest value this can be is submission_date.
recency: Age of client in days.
frequency: Number of days in the window that a client has returned to use the browser after min_day.
num_opportunities: Given a first appearance at min_day, the highest number of days a client could have returned. That is, the highest possible value for frequency.
Since the model uses only these two coarse-grained statistics, these columns should make it relatively straightforward to interpret why the model made the predictions it did for a given profile.
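As a rough illustration of how these columns relate to a client's daily history, the sketch below derives them from a list of active dates. The function name window_features and its input format are hypothetical and not taken from the actual job, and the exact window boundary convention is an assumption. recency is left out because its precise computation is not spelled out here.

```python
from datetime import date, timedelta


def window_features(active_days, submission_date, window=90):
    """Summarize a client's active days in the window ending on submission_date.

    Hypothetical helper for illustration only; not the production code.
    """
    # Assumption: the window covers `window` calendar days, ending on submission_date.
    start = submission_date - timedelta(days=window - 1)
    seen = sorted(d for d in set(active_days) if start <= d <= submission_date)
    if not seen:
        return None  # client was not seen in the window at all
    min_day, max_day = seen[0], seen[-1]
    return {
        "min_day": min_day,
        "max_day": max_day,
        # Days the client came back after their first appearance.
        "frequency": len(seen) - 1,
        # Days after min_day on which the client could have come back.
        "num_opportunities": (submission_date - min_day).days,
    }


features = window_features(
    [date(2019, 7, 1), date(2019, 7, 3), date(2019, 7, 10)],
    submission_date=date(2019, 8, 1),
)
print(features)
```

In this example the client first appears on 2019-07-01 and returns on two later days, so frequency is 2 out of 31 opportunities.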
Latent quantities
The model estimates the expected value of two related latent probabilities for a user. The
values in prob_daily_leave give our expectation of the probability that the user will become
inactive on a given day, and prob_daily_usage represents the probability that a user will return
on a given day, given that they are still active.
These quantities could be useful for disentangling usage rate from the likelihood that a user is still using the browser. We could, for example, identify intense users who are at risk of churning, or users who at first glance appear to have churned, but are actually just infrequent users.
prob_active is the expected value of the probability that a user is still active on
submission_date, given their most recent 90 days of activity. 'Inactive' in this sense means
that the profile will not use the browser again, whether because they have uninstalled the
browser or for some other reason.
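To make the idea of "still active given recent history" concrete, here is a minimal sketch that assumes a client's daily leave and usage probabilities are known, fixed numbers. This is a simplification: the table's values are expectations over the model's uncertainty about those probabilities, so this sketch will not reproduce them. The function name prob_still_active and its parameters are hypothetical, and it assumes both probabilities lie strictly between 0 and 1.

```python
def prob_still_active(p_use, p_leave, frequency, last_seen, num_opportunities):
    """P(client has not left yet) after num_opportunities days, for fixed probabilities.

    frequency: number of days the client returned.
    last_seen: index (1..num_opportunities) of the last return day, 0 if none.
    Hypothetical helper; assumes 0 < p_use < 1 and 0 < p_leave < 1.
    """
    x, t, n = frequency, last_seen, num_opportunities
    # Likelihood of the observed history if the client never left.
    alive = p_use**x * (1 - p_use) ** (n - x) * (1 - p_leave) ** n
    # Likelihood if the client left on day t + i + 1, summed over every possible i.
    gone = sum(
        p_use**x * (1 - p_use) ** (t - x + i) * p_leave * (1 - p_leave) ** (t + i)
        for i in range(n - t)
    )
    return alive / (alive + gone)


# Both clients were last seen 5 days before the end of a 30-day history.
intense = prob_still_active(0.8, 0.02, frequency=20, last_seen=25, num_opportunities=30)
infrequent = prob_still_active(0.05, 0.02, frequency=1, last_seen=25, num_opportunities=30)
print(intense, infrequent)
```

Five silent days are strong evidence that a heavy user has left, but say little about a user who rarely shows up anyway, so the first value comes out much lower than the second.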
Predictions
There are several columns of the form e_total_days_in_next_7_days, which give the expected
number of days on which a user will show up in the next 7 days (or 14, 21, 28 days). These
predictions take into account both the likelihood that a user will become inactive in the
future and their daily propensity to use the browser, given that they are still active.
The values in e_total_days_in_next_7_days will be between 0 and 7.
An estimate of the probability that a client will contribute to MAU is available in the
column prob_mau. This is simply the probability that the user will return at any point in
the following 28 days, thereby contributing to MAU. Since it is a probability, the values will
range between 0 and 1, just like prob_daily_leave and prob_daily_usage.
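The shape of these forecasts can be sketched under a simplifying assumption: the client is known to be active today, and their daily leave and usage probabilities are fixed, known numbers. The table's columns are instead expectations over the model's uncertainty (and account for prob_active), so the sketch illustrates the logic rather than reproducing the column values. The function names expected_days and prob_return are hypothetical.

```python
def expected_days(p_use, p_leave, horizon):
    """Expected number of active days in the next `horizon` days, given active today."""
    # To be seen on day k, the client must survive k leave checks and then use the browser.
    return sum(p_use * (1 - p_leave) ** k for k in range(1, horizon + 1))


def prob_return(p_use, p_leave, horizon):
    """Probability of at least one active day in the next `horizon` days, given active today."""
    # q: chance of staying active for a day without using the browser.
    q = (1 - p_leave) * (1 - p_use)
    # No return happens if the client leaves on day k after k - 1 silent days,
    # or stays silent but active for the whole horizon.
    no_return = sum(p_leave * q ** (k - 1) for k in range(1, horizon + 1)) + q**horizon
    return 1 - no_return


print(expected_days(0.3, 0.05, 7))
print(prob_return(0.3, 0.05, 28))
```

The first function can never exceed the horizon, which matches the 0 to 7 range noted above, and the second is always between 0 and 1.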
Attributes
There are several columns that contain attributes of the client, like os, locale,
normalized_channel, normalized_os_version, and country. sample_id is also included,
which can be useful for quicker queries, as the table is clustered by this column in BigQuery.
Remarks on the model
A way to think about the model that infers these quantities is to imagine a simple process
where each client is given two weighted coins when they become a user, and flips them each
day. Since the coins are weighted, the probability of heads won't be 50%, but rather some
probability between 0% and 100%, specific to each client's coin. One coin, called L, comes up
heads with probability prob_daily_leave, and if it ever comes up heads, the client will never
use the browser again. The daily usage coin, U, comes up heads with probability
prob_daily_usage. While they are still active, clients flip this coin to decide whether they
will use the browser on that day, and show up in clients_daily.
The combination of these two coin-flipping processes results in a history of activity that we
can see in clients_daily. While the model is simple, it has very good predictive power and
can tell, in aggregate, how many users will still be active at some point in the future.
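The two-coin story can be simulated directly. The sketch below is only an illustration of the process as described, not code from the model's library. The function name simulate_history is hypothetical, the probabilities are made-up values, and flipping the leave coin before the usage coin each day is an assumption.

```python
import random


def simulate_history(p_use, p_leave, n_days, rng):
    """Return a list of 0/1 daily activity flags from the two-coin process."""
    history = []
    active = True
    for _ in range(n_days):
        # Coin L: once it comes up heads, the client is gone for good.
        if active and rng.random() < p_leave:
            active = False
        # Coin U: only matters while the client is still active.
        used = active and rng.random() < p_use
        history.append(1 if used else 0)
    return history


rng = random.Random(0)  # fixed seed so the run is repeatable
histories = [simulate_history(0.4, 0.02, 90, rng) for _ in range(1000)]
seen_at_all = sum(1 for h in histories if any(h))
print(f"{seen_at_all} of {len(histories)} simulated clients were seen at least once")
```

Each simulated list plays the role of one client's rows in clients_daily over a 90-day window.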
A downside of the model's simplicity, however, is that its predictions are not highly tailored
to an individual client. The very simplified features do not take into account things like
seasonality, or finer grained attributes of their usage (like active hours, addons, etc.).
Further, the predictions in this table only account for existing users that have been seen in
the 90 days of history, and so longer term forecasts of user activity would need to somehow model
new users separately.
Caveats and future work
Due to the lightweight feature space of the model, the predictions perform better at the
population level than at the individual client level, and there will be a lot of client-level
variation in behavior. That is, when grouping clients by different dimensions, say all of the
en-IN users on Darwin, the average MAU prediction should be quite close to the observed value,
but a lot of users' behavior can deviate significantly from the predictions.
The model will also be better at medium- to longer-term forecasts. In particular, the model will not be well suited to give predictions for new users who have appeared only once in the data set very recently. These constitute a disproportionately large share of users, but do not have enough history for this model to make good use of. These single-day profiles are currently the subject of an investigation that will hopefully yield good heuristics for users that only show up for a single day.
Sample query
Here is a sample query that will give averages for predicted MAU, probability that users are still active, and other quantities across different operating systems:
SELECT
  os,
  cast(sum(prob_mau) AS int64) AS predicted_mau,
  count(*) AS n,
  round(avg(prob_active) * 100, 1) AS prob_active,
  round(avg(prob_daily_leave) * 100, 1) AS prob_daily_leave,
  round(avg(prob_daily_usage) * 100, 1) AS prob_daily_usage,
  round(avg(e_total_days_in_next_28_days), 1) AS e_total_days_in_next_28_days
FROM
  `telemetry.active_profiles`
WHERE
  submission_date = '2019-08-01'
  AND sample_id = 1
GROUP BY
  1
HAVING
  count(*) > 100
ORDER BY
  avg(prob_daily_usage) DESC
Scheduling
The code behind the model can be found in the bgbb_lib repo, or on PyPI. The Airflow job is
defined in the bgbb_airflow repo.
The job that fits the model parameters runs weekly, and the table is updated daily.