Sampling in Telemetry data
Since the early days of Telemetry, it has been desirable to have a quick and simple way to do analysis on a sample of the full population of Firefox clients.
The mechanism for doing that is encoded in the data itself, namely the
sample_id
field.
This is a field that is computed from the telemetry client_id
using
the CRC hash function.
This CRC hash is then bucketed into 100 possible values from 0 to 99,
each of which represents a roughly 1% uniform sample of the client_id
space.
All ping tables that contain a client id, as well as many derived datasets,
include the sample_id
field.
TL;DR sample_id = crc32(client_id) % 100
An example python implementation:
# USAGE: python cid2sid.py 859c8a32-0b73-b547-a5e7-8ef4ed9c4c2d
# Prints
# Client ID b'859c8a32-0b73-b547-a5e7-8ef4ed9c4c2d' => Sample ID 55
import binascii
import sys
clientid = sys.argv[1].encode()
crc = binascii.crc32(clientid)
sampleid = (crc & 0xFFFFFFFF) % 100
print("Client ID {} => Sample ID {}".format(clientid, sampleid))