# Intro

The `active_profiles` dataset gives client-level estimates of whether a profile
is still an active user of the browser at a given point in time, as well as probabilistic forecasts
of the client's future activity. These quantities are estimated by a model that attempts to infer
a client's latent propensity to leave Firefox and become inactive, and to separate it from their
latent propensity to use the browser while still active. These estimates are currently
generated for release desktop browser profiles only, across all operating systems and
geographies.

# Model

The model generates predictions for each client by looking at just the recency and frequency of a
client's daily usage within the previous 90-day window. Usage is defined by the daily binary
indicator of whether they show up in `clients_daily` on a given day.

The table contains columns related to these quantities:

- `submission_date`: Day marking the end of the 90 day window. The earliest `submission_date` that the table covers is `'2019-05-13'`.
- `min_day`: First day in the window that the client was seen. This could be anywhere between the first day in the window and the last day in the window.
- `max_day`: Last day in the window the client was seen. The highest value this can be is `submission_date`.
- `recency`: Age of client in days.
- `frequency`: Number of days in the window that a client has returned to use the browser after `min_day`.
- `num_opportunities`: Given a first appearance at `min_day`, what is the highest number of days a client could have returned. That is, what is the highest possible value for `frequency`.

Since the model is only using these 2 coarse-grained statistics, these columns should make it relatively straightforward to interpret why the model made the predictions that it did for a given profile.
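As a rough illustration of how the window columns relate to a client's raw activity, the sketch below derives `min_day`, `max_day`, `frequency`, and `num_opportunities` from a list of active days. The function name `window_stats` is hypothetical, and the code assumes the window is `submission_date` plus the 89 days before it; the actual job may define the boundaries differently. `recency` is left out because its exact derivation is not spelled out here.

```python
from datetime import date, timedelta

WINDOW_DAYS = 90  # window length described above (assumed inclusive of submission_date)


def window_stats(active_days, submission_date):
    """Summarize a client's activity in the window ending on submission_date.

    active_days: iterable of dates on which the client appeared in clients_daily.
    Returns None if the client was not seen inside the window.
    """
    window_start = submission_date - timedelta(days=WINDOW_DAYS - 1)
    seen = sorted({d for d in active_days if window_start <= d <= submission_date})
    if not seen:
        return None
    min_day, max_day = seen[0], seen[-1]
    return {
        "min_day": min_day,
        "max_day": max_day,
        # Days on which the client came back after first appearing in the window.
        "frequency": len(seen) - 1,
        # Days after min_day on which the client could have come back.
        "num_opportunities": (submission_date - min_day).days,
    }


stats = window_stats(
    [date(2019, 7, 1), date(2019, 7, 3), date(2019, 7, 10)],
    submission_date=date(2019, 8, 1),
)
print(stats)
```

In this example the client is first seen on 2019-07-01 and returns on two later days, so `frequency` is 2 out of 31 opportunities.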

## Latent quantities

The model estimates the expected value for 2 related latent probability variables for a user. The
values in `prob_daily_leave` give our expectation of the probability that they will become inactive
on a given day, and `prob_daily_usage` represents the probability that a user will return on a given
day, *given that they are still active*.

These quantities could be useful for disentangling usage *rate* from the likelihood that a user is
still using the browser. We could, for example, identify intense users who are at risk of
churning, or users who at first glance appear to have churned, but are actually just infrequent
users.
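One way to act on this distinction is to bucket profiles by the two latent probabilities. The sketch below is only illustrative: the function name `segment` and both cutoff values are hypothetical, and sensible thresholds would need to be chosen by looking at the actual distributions in the table.

```python
def segment(prob_daily_leave, prob_daily_usage,
            leave_cutoff=0.05, usage_cutoff=0.5):
    """Label a profile from its two latent probabilities.

    Both cutoffs are placeholder values chosen for illustration only.
    """
    high_usage = prob_daily_usage >= usage_cutoff
    high_leave = prob_daily_leave >= leave_cutoff
    if high_usage and high_leave:
        return "intense, at risk of churning"
    if not high_usage and not high_leave:
        return "infrequent, low risk of churning"
    if high_usage:
        return "intense, low risk of churning"
    return "infrequent, at risk of churning"


print(segment(prob_daily_leave=0.10, prob_daily_usage=0.9))
print(segment(prob_daily_leave=0.001, prob_daily_usage=0.05))
```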

`prob_active` is the expected value of the probability that a user is still active on
`submission_date`, given their most recent 90 days of activity. 'Inactive' in this sense
means that the profile will not use the browser again, whether because they have uninstalled
the browser or for some other reason.

## Predictions

There are several columns of the form `e_total_days_in_next_7_days`, which give the expected
number of times that a user will show up in the next 7 days (or 14, 21, 28 days). These
predictions take into account both the likelihood that a user will become inactive in the
future, as well as their daily propensity to use the browser, given that they are still active.
The values in `e_total_days_in_next_7_days` will be between 0 and 7.
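To see how the two effects combine, here is a simplified sketch that treats a client's daily leave and usage probabilities as fixed, known numbers. This is not how the table's values are computed: the model works with expectations over uncertain latent probabilities, so plugging in point values will generally give different numbers. The function name `expected_days` and the assumption that a client can leave before each day's usage are illustrative choices.

```python
def expected_days(prob_daily_leave, prob_daily_usage, n_days, prob_active=1.0):
    """Expected number of active days in the next n_days, for fixed probabilities.

    Assumes the client may leave before each day's usage, so they are still
    active on future day t with probability
    prob_active * (1 - prob_daily_leave) ** t.
    """
    total = 0.0
    still_active = prob_active
    for _ in range(n_days):
        still_active *= 1.0 - prob_daily_leave  # survive one more day
        total += still_active * prob_daily_usage  # chance of showing up that day
    return total


# A client who never leaves and shows up on half of all days.
print(expected_days(0.0, 0.5, n_days=28))
# A client with a small daily chance of leaving shows up slightly less often.
print(expected_days(0.01, 0.5, n_days=28))
```

With a leave probability of 0 the result is simply `n_days * prob_daily_usage`; any positive leave probability pulls it below that, and the result always stays between 0 and `n_days`.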

An estimate for the probability that a client will contribute to MAU is available in the
column `prob_mau`. This is simply the probability that the user will return at any point in
the following 28 days, thereby contributing to MAU. Since it is a probability, the values will
range between 0 and 1, just like `prob_daily_leave` and `prob_daily_usage`.
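The "return at any point in the following 28 days" idea can be sketched with the same kind of simplification: fixed, known daily probabilities instead of the model's expectations over uncertain ones. The function name `prob_any_return` is hypothetical, and the result is only meant to convey the shape of the calculation, not to reproduce `prob_mau`.

```python
def prob_any_return(prob_daily_leave, prob_daily_usage, n_days=28, prob_active=1.0):
    """Probability of at least one active day in the next n_days, for fixed probabilities.

    Tracks the probability that the client is still active but has not yet
    been seen, assuming they may leave before each day's usage.
    """
    active_unseen = prob_active
    seen = 0.0
    for _ in range(n_days):
        survive = active_unseen * (1.0 - prob_daily_leave)
        seen += survive * prob_daily_usage  # first appearance happens today
        active_unseen = survive * (1.0 - prob_daily_usage)  # active, still unseen
    return seen


# An infrequent but loyal user is still very likely to count toward MAU.
print(prob_any_return(prob_daily_leave=0.001, prob_daily_usage=0.1))
```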

## Attributes

There are several columns that contain attributes of the client, like `os`, `locale`,
`normalized_channel`, `normalized_os_version`, and `country`. `sample_id` is also included,
which can be useful for quicker queries, as the table is clustered by this column in BigQuery.

## Remarks on the model

A way to think about the model that infers these quantities is to imagine a simple process
where each client is given 2 weighted coins when they become users, which they flip each
day. Since they're weighted, the probability of heads won't be 50%, but rather some probability
between 0 and 100%, specific to each client's coin. One coin, called *L*, comes up heads with
probability `prob_daily_leave`, and if it ever comes up heads, the client will never use the
browser again. The daily usage coin, *U*, comes up heads with probability `prob_daily_usage`. *While
they are still active*, clients flip this coin to decide whether they will use the browser
on that day and show up in `clients_daily`.
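The coin-flipping story translates almost directly into a small simulation. The sketch below is a toy version for building intuition, not the model's code: the function name `simulate_history` is hypothetical, and flipping the leave coin before the usage coin each day is an assumption about ordering.

```python
import random


def simulate_history(prob_daily_leave, prob_daily_usage, n_days, rng=None):
    """Simulate one client's daily activity as a list of 0/1 flags."""
    if rng is None:
        rng = random.Random()
    history = []
    active = True
    for _ in range(n_days):
        # Coin L: heads means the client leaves and never comes back.
        if active and rng.random() < prob_daily_leave:
            active = False
        # Coin U: only flipped while the client is still active.
        used = active and rng.random() < prob_daily_usage
        history.append(1 if used else 0)
    return history


# Seeded generator so the example is repeatable.
print(simulate_history(0.02, 0.6, n_days=14, rng=random.Random(0)))
```

Note that a trailing run of zeros in the output is ambiguous: the client may have left, or may simply have had a streak of tails on the usage coin. Resolving that ambiguity is what the model's estimates are for.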

The combination of these two coin flipping processes results in a history of activity that we
can see in `clients_daily`. While the model is simple, it has very good predictive power that
can tell, *in aggregate*, how many users will still be active at some point in the future.
A downside of the model's simplicity, however, is that its predictions are not highly tailored
to an individual client. The very simplified features do not take into account things like
seasonality, or finer grained attributes of their usage (like active hours, addons, etc.).
Further, the predictions in this table only account for existing users that have been seen in
the 90 days of history, and so longer term forecasts of user activity would need to somehow model
new users separately.

# Caveats and future work

Due to the lightweight feature space of the model, the predictions perform better at the
population level than at the individual client level, and there will be a lot of client-level
variation in behavior. That is, when grouping clients by different dimensions, say all of the
`en-IN` users on Darwin, the *average* MAU prediction should be quite close to the observed
value, but many individual users' behavior can deviate significantly from the predictions.

The model will also be better at medium- to longer-term forecasts. In particular, the model is not well suited to giving predictions for new users who have appeared only once in the data set very recently. These constitute a disproportionately large share of users, but do not have enough history for this model to make good use of. These single-day profiles are currently the subject of an investigation that will hopefully yield good heuristics for users that only show up for a single day.

# Sample query

Here is a sample query that will give averages for predicted MAU, probability that users are still active, and other quantities across different operating systems:

```
-- Per-OS averages of the model's estimates for a single sample_id bucket.
select
  ap.os
  -- Summing per-client probabilities gives the expected MAU within this sample.
  , cast(sum(ap.prob_mau) as int64) as predicted_mau
  , count(*) as n
  , round(avg(ap.prob_active) * 100, 1) as prob_active
  , round(avg(ap.prob_daily_leave) * 100, 1) as prob_daily_leave
  , round(avg(ap.prob_daily_usage) * 100, 1) as prob_daily_usage
  , round(avg(ap.e_total_days_in_next_28_days), 1) as e_total_days_in_next_28_days
from `telemetry.active_profiles` ap
where ap.submission_date = '2019-08-01'
  and ap.sample_id = 1
group by 1
-- Skip operating systems with too few profiles for a stable average.
having count(*) > 100
order by avg(ap.prob_daily_usage) desc
```

## Scheduling

The code behind the model can be found in the `bgbb_lib` repo,
or on PyPI. The Airflow job is defined in the
`bgbb_airflow` repo.

The model to fit the parameters is run weekly, and the table is updated daily.