Got some new data you want to send to us? How in the world do you send a new ping? Follow this guide to find out.

Write Your Questions

Do not try to implement new pings unless you know specifically what questions you're trying to answer. General questions like "How do users use our product?" won't cut it - these need to be specific, concrete asks that can be translated to data points. This will also make things easier down the line when you start data review.

Create a Schema

Use JSON Schema to start with. See the example schemas in the Mozilla Pipeline Schemas repo. This schema is only used to validate the incoming data; any ping that doesn't match the schema will be removed. Validate your JSON Schema using a validation tool.
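
For a quick local sanity check, something like the following sketch works (the `jsonschema` Python package and the toy schema and ping here are purely illustrative; this is not the pipeline's validation code):

```python
# Validate a hypothetical example ping against a hypothetical JSON Schema.
# Install with: pip install jsonschema
import jsonschema

schema = {
    "type": "object",
    "properties": {
        "clientId": {"type": "string"},
        "payload": {
            "type": "object",
            "properties": {"sessionCount": {"type": "integer"}},
            "required": ["sessionCount"],
        },
    },
    "required": ["clientId", "payload"],
}

example_ping = {"clientId": "1234-abcd", "payload": {"sessionCount": 7}}

# Raises jsonschema.exceptions.ValidationError on mismatch; the pipeline
# similarly drops pings that fail schema validation.
jsonschema.validate(example_ping, schema)
print("ping validates")
```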

We already have automatic deduping based on docId, which catches about 90% of duplicates and removes them from the dataset.

Start a Data Review

Data review for new pings is more complicated than for new probes. See Data Review for Focus-Event Ping as an example. Consider where the data falls in the Data Collection Categories.

Submit Schema to mozilla-services/mozilla-pipeline-schemas

The first schema added should be the JSON Schema made in step 2. Make sure you add at least one example ping that can be validated against the schema. Additionally, a Parquet output schema should be added. This adds a new dataset, available in Re:dash.
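
Every example ping you add is validated against your schema; a rough local equivalent of that check is sketched below. The file paths are illustrative placeholders, not the repo's exact layout - check the repo itself for where schemas and example pings live.

```python
# Sketch: load a schema and an example ping from disk and validate them.
# Paths are hypothetical placeholders for this illustration.
import json

import jsonschema

with open("schemas/my-namespace/my-new-ping/my-new-ping.1.schema.json") as f:
    schema = json.load(f)

with open("validation/my-namespace/my-new-ping.1.sample.json") as f:
    example_ping = json.load(f)

jsonschema.validate(example_ping, schema)  # raises on mismatch
print("example ping validates against the schema")
```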

Parquet output also has a metadata section. These are fields added to the ping at ingestion time; they might come from the URL submitted to the edge server, or from the IP address used to make the request. For now, take a look at the ingestion code to see which fields are added. Note: we deeply apologize for making you look there. Feel free to reach out in the #datapipeline IRC channel with questions.

Test the Schema

Note that this step only applies if data is already being sent and you want to test the schema you're writing against the data currently being ingested.

Test your Parquet output in Hindsight by using an output plugin. See the Core ping output plugin for an example, where the Parquet schema is specified as parquet_schema. If the plugin runs without errors, the schema is most likely correct. Do not use the "Deploy" button to actually deploy the plugin; operations will do that in the next step.

Deploy the Plugin

File a bug to deploy the new schema.

Real-time analysis is key to ensuring your data is being processed and parsed correctly. It should follow the format specified in the Moztelemetry doctype monitor. This lets you check for validation errors, size changes, duplicates, and more. Once you have the numbers set, file a bug to let ops deploy it.

(Telemetry-Specific) Register Doctype

Data Platform Operations takes care of this; to request it, file a bug like Bug 1292493. The ping will then be easier to query using the Dataset API.
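
As a sketch (the doctype and submission date below are placeholders), querying the newly registered ping from a Spark cluster might look like:

```python
# Query the new doctype with the Dataset API from python_moztelemetry.
# "my-new-ping" and the submission date are placeholders.
from moztelemetry.dataset import Dataset
from pyspark import SparkContext

# On ATMO clusters `sc` is already defined; getOrCreate covers other setups.
sc = SparkContext.getOrCreate()

pings = (
    Dataset.from_source("telemetry")
    .where(docType="my-new-ping")
    .where(submissionDate="20170101")
    .records(sc)
)
print(pings.count())
```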

(Non-Telemetry) Add ping name to sources.json

This will make it available to the Dataset API (used with pyspark on ATMO machines). There also needs to be a schema for the layout of the heka files in net-mozaws-prod-us-west-2-pipeline-metadata/&lt;source name&gt;/schema.json, where &lt;source name&gt; is the name specified in sources.json. If you want to do this, talk to :whd or :mreid.
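
For illustration only, an entry might look roughly like the sketch below; the field names here are assumptions based on how the Dataset API reads sources.json, so confirm the real layout with :whd or :mreid.

```python
# Hypothetical sources.json entry for a new ping source.  Field names
# ("bucket", "prefix", "metadata_prefix") are assumptions; verify them
# against the actual sources.json before submitting anything.
import json

new_source = {
    "my-new-ping": {
        "bucket": "net-mozaws-prod-us-west-2-pipeline-data",
        "prefix": "my-new-ping",
        "metadata_prefix": "my-new-ping",
    }
}
print(json.dumps(new_source, indent=2))
```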

Start Sending Data

If you're in a product with built-in Telemetry support, use those APIs: the Gecko Telemetry APIs, the Android Telemetry APIs, or the iOS Telemetry APIs. Otherwise, see here for the endpoint and expected format.
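
If you're submitting directly, a minimal sketch looks like the following; the host and the /submit/&lt;namespace&gt;/&lt;doctype&gt;/&lt;version&gt;/&lt;docId&gt; path layout are assumptions here, so double-check them against the endpoint docs linked above.

```python
# Sketch: POST a JSON ping to the ingestion endpoint.
import json
import uuid

import requests

# A fresh docId per ping lets server-side deduplication work correctly.
doc_id = str(uuid.uuid4())

# Namespace, doctype, and version below are placeholders.
url = ("https://incoming.telemetry.mozilla.org"
       f"/submit/my-namespace/my-new-ping/1/{doc_id}")

ping = {"clientId": "1234-abcd", "payload": {"sessionCount": 7}}
resp = requests.post(url, data=json.dumps(ping),
                     headers={"Content-Type": "application/json"})
resp.raise_for_status()
print(resp.status_code)
```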

Work is happening on a generic endpoint, where pings can be easily registered, sent to our servers, and made automatically available in Re:dash. Please check back later for those docs.

Write ETL Jobs

We have some basic generalized ETL jobs you can use to transform your data on a batch basis - for example, into a Longitudinal or client-count-like dataset. Otherwise, you'll have to write your own.

You can schedule it on Airflow, or run it as a job in ATMO. If the output is Parquet, you can add it to the Hive metastore to make it available in Re:dash. Check the docs on creating your own datasets.
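
As a sketch, a minimal batch job might read the raw pings with the Dataset API, flatten the fields you care about, and write Parquet (the doctype, fields, and output path below are placeholders):

```python
# Sketch of a simple batch ETL job for a hypothetical new ping.
from moztelemetry.dataset import Dataset
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("my-new-ping-etl").getOrCreate()

pings = (
    Dataset.from_source("telemetry")
    .where(docType="my-new-ping")
    .where(submissionDate="20170101")
    .records(spark.sparkContext)
)

# Flatten each ping into a row; field names are placeholders.
rows = pings.map(lambda p: Row(
    client_id=p.get("clientId"),
    session_count=p.get("payload", {}).get("sessionCount"),
))

# Writing Parquet makes the dataset easy to register and query later.
spark.createDataFrame(rows).write.mode("overwrite").parquet(
    "s3://my-output-bucket/my_new_ping/v1/"
)
```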

Build Dashboards Using ATMO or STMO

Last steps! What are you using this data for, anyway?
