The active_profiles dataset gives client-level estimates of whether a profile
is still an active user of the browser at a given point in time, as well as probabilistic forecasts
of the client's future activity. These quantities are estimated by a model that attempts to infer
a client's latent propensity to leave Firefox and become inactive, and to decouple it from their
latent propensity to use the browser while still active. These estimates are currently
generated for release desktop browser profiles only, across all operating systems and
The model generates predictions for each client by looking at just the recency and frequency of a
client's daily usage within the previous 90-day window. Usage is defined by the daily-level binary
indicator of whether they show up in clients_daily on a given day.
The table contains columns related to these quantities:
submission_date: Day marking the end of the 90 day window. Earliest submission_date that the table covers is
min_day: First day in the window that the client was seen. This could be anywhere between the first day in the window and the last day in the window.
max_day: Last day in the window the client was seen. The highest value this can be is
recency: Age of client in days.
frequency: Number of days in the window that a client has returned to use the browser after
num_opportunities: Given a first appearance at min_day, the highest number of days on which a client could have returned. That is, what is the highest possible value for
Since the model is only using these 2 coarse-grained statistics, these columns should make it relatively straightforward to interpret why the model made the predictions that it did for a given profile.
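As a rough illustration of how these summary columns relate to a client's daily activity, the sketch below derives them from a list of dates on which a client was seen. It is not the pipeline's code. The function name summarize_window is hypothetical, the window is assumed to be the 90 calendar days ending on submission_date (inclusive), and frequency is assumed to count the active days after min_day. recency is left out because its exact definition is not spelled out here.

```python
from datetime import date, timedelta


def summarize_window(active_days, submission_date, window=90):
    """Summarize one client's activity (hypothetical helper, not the real job).

    Assumes the window covers `window` calendar days ending on
    `submission_date`, inclusive.
    """
    window_start = submission_date - timedelta(days=window - 1)
    # Keep only distinct days that fall inside the window.
    seen = sorted({d for d in active_days if window_start <= d <= submission_date})
    if not seen:
        return None  # client was not seen in the window at all
    min_day, max_day = seen[0], seen[-1]
    return {
        "min_day": min_day,
        "max_day": max_day,
        # Assumption: returns are counted after the first appearance.
        "frequency": len(seen) - 1,
        # Days after min_day on which the client could have returned.
        "num_opportunities": (submission_date - min_day).days,
    }
```

Under these assumptions, frequency can never exceed num_opportunities, which is what makes the pair easy to read as "returned on x of y possible days".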
The model estimates the expected value of 2 related latent probability variables for a user.
prob_daily_leave gives our expectation of the probability that they will become inactive
on a given day, and prob_daily_usage represents the probability that a user will return on a given
day, given that they are still active.
These quantities could be useful for disentangling usage rate from the likelihood that a user is still using the browser. We could, for example, identify intense users who are at risk of churning, or users who at first glance appear to have churned, but are actually just infrequent users.
prob_active is the expected value of the probability that a user is still active on
submission_date, given their most recent 90 days of activity. 'Inactive' in this sense
means that the profile will not use the browser again, whether because the user has uninstalled
the browser or for some other reason.
There are several columns of the form e_total_days_in_next_7_days, which give the expected
number of days on which a user will show up in the next 7 days (or 14, 21, 28 days). These
predictions take into account both the likelihood that a user will become inactive in the
future and their daily propensity to use the browser, given that they are still active.
The values in e_total_days_in_next_7_days will be between 0 and 7.
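To make the interaction between the two propensities concrete, here is a minimal sketch of an expected-days calculation. It is a simplification, not the model's actual computation: it treats prob_active, prob_daily_leave and prob_daily_usage as fixed point values for one client, and assumes that on each future day the client first stays active with probability 1 - prob_daily_leave and then, if still active, uses the browser with probability prob_daily_usage. The function name expected_days is hypothetical, and the result should not be expected to reproduce the table's values exactly.

```python
def expected_days(prob_active, prob_daily_leave, prob_daily_usage, horizon):
    """Expected number of active days over the next `horizon` days.

    Simplifying assumption: all three inputs are fixed point probabilities
    and the "leave" decision is made before the "usage" decision each day.
    """
    stay = 1.0 - prob_daily_leave
    still_active = prob_active  # probability the client is active today
    total = 0.0
    for _ in range(horizon):
        still_active *= stay  # must survive the leave step for this day
        total += still_active * prob_daily_usage
    return total
```

Each day contributes at most 1 to the sum, so the result always lies between 0 and the horizon, matching the range described above.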
An estimate for the probability that a client will contribute to MAU is available in
prob_mau. This is simply the probability that the user will return at any point in
the following 28 days, thereby contributing to MAU. Since it is a probability, the values will
range between 0 and 1, just like
There are several columns that contain attributes of the client, like
sample_id is also included,
which can be useful for quicker queries, as the table is clustered by this column in BigQuery.
Remarks on the model
A way to think about the model that infers these quantities is to imagine a simple process
where each client is given 2 weighted coins when they become a user, and flips them each
day. Since the coins are weighted, the probability of heads won't be 50%, but rather some probability
between 0 and 100%, specific to each client's coin. One coin, called L, comes up heads with
probability prob_daily_leave, and if it ever comes up heads, the client will never use the
browser again. The daily usage coin, U, comes up heads with probability prob_daily_usage. While
they are still active, clients flip this coin to decide whether they will use the browser
on that day, and show up in
The combination of these two coin flipping processes results in a history of activity that we
can see in clients_daily. While the model is simple, it has very good predictive power and
can tell, in aggregate, how many users will still be active at some point in the future.
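The two-coin process described above can be sketched as a small simulation. This is only an illustration of the generative story, not the fitting code: simulate_client is a hypothetical name, the probabilities are passed in as fixed values between 0 and 1, and the L coin is assumed to be flipped before the U coin on each day.

```python
import random


def simulate_client(prob_daily_leave, prob_daily_usage, n_days, seed=None):
    """Simulate one client's daily activity under the two-coin story.

    Returns a list of booleans, one per day, that is True on days the
    client uses the browser. Assumes L is flipped before U each day.
    """
    rng = random.Random(seed)
    active = True
    history = []
    for _ in range(n_days):
        # Coin L: once it comes up heads the client is gone for good.
        if active and rng.random() < prob_daily_leave:
            active = False
        # Coin U: only matters while the client is still active.
        used = active and rng.random() < prob_daily_usage
        history.append(used)
    return history
```

Running this for a client with a low prob_daily_usage and a low prob_daily_leave tends to produce long gaps followed by further activity, which is the kind of history that can look like churn at first glance.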
A downside of the model's simplicity, however, is that its predictions are not highly tailored
to an individual client. The very simplified features do not take into account things like
seasonality, or finer grained attributes of their usage (like active hours, addons, etc.).
Further, the predictions in this table only account for existing users that have been seen in
the 90 days of history, and so longer term forecasts of user activity would need to somehow model
new users separately.
Caveats and future work
Due to the lightweight feature space of the model, the predictions perform better at the
population level than at the individual client level, and there will be a lot of client-level
variation in behavior. That is, when grouping clients by different dimensions, say all of the
en-IN users on Darwin, the average MAU prediction should be quite close to the observed value,
but many individual users' behavior can deviate significantly from the predictions.
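One way to see why group-level numbers are steadier than individual ones is to sum per-client probabilities within a group. The sketch below is illustrative only: predicted_mau_by_group is a hypothetical helper, the input rows are (group, prob_mau) pairs, and the spread it reports assumes that clients behave independently of one another, which is an assumption made here rather than something the model guarantees.

```python
from collections import defaultdict
from math import sqrt


def predicted_mau_by_group(rows):
    """Aggregate (group, prob_mau) pairs into per-group predictions.

    Returns {group: (expected_mau, std_dev)}. The standard deviation
    assumes independent clients (an illustrative assumption).
    """
    expected = defaultdict(float)
    variance = defaultdict(float)
    for group, prob_mau in rows:
        expected[group] += prob_mau
        # Variance of a single yes/no outcome with probability prob_mau.
        variance[group] += prob_mau * (1.0 - prob_mau)
    return {g: (expected[g], sqrt(variance[g])) for g in expected}
```

Under that independence assumption the expected count grows with the number of clients while the spread grows only with its square root, so relative error shrinks for larger groups even though any single client remains hard to predict.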
The model will also be better at medium- to longer-term forecasts. In particular, the model is not well suited to giving predictions for new users who have appeared only once in the data set, very recently. These constitute a disproportionately large share of users, but do not have enough history for the model to make good use of. These single-day profiles are currently the subject of an investigation that will hopefully yield good heuristics for users that only show up for a single day.
Here is a sample query that will give averages for predicted MAU, probability that users are still active, and other quantities across different operating systems:
SELECT
  os,
  cast(sum(prob_mau) AS int64) AS predicted_mau,
  count(*) AS n,
  round(avg(prob_active) * 100, 1) AS prob_active,
  round(avg(prob_daily_leave) * 100, 1) AS prob_daily_leave,
  round(avg(prob_daily_usage) * 100, 1) AS prob_daily_usage,
  round(avg(e_total_days_in_next_28_days), 1) AS e_total_days_in_next_28_days
FROM `telemetry.active_profiles`
WHERE submission_date = '2019-08-01'
  AND sample_id = 1
GROUP BY 1
HAVING count(*) > 100
ORDER BY avg(prob_daily_usage) DESC
The code behind the model can be found in the
or on PyPI. The Airflow job is defined in the
The model to fit the parameters is run weekly, and the table is updated daily.