1. Create a Spark notebook that does the transformations you need, either on raw data (using the Dataset API) or on Parquet data.
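   A minimal sketch of such a notebook cell (PySpark; the input path and the `country`/`client_id` columns are hypothetical placeholders, not part of any real dataset):

   ```python
   # Sketch only: read a hypothetical Parquet input and apply a simple aggregation.
   from pyspark.sql import SparkSession
   from pyspark.sql import functions as F

   spark = SparkSession.builder.appName("publish-dataset-example").getOrCreate()

   # Hypothetical input location and columns; substitute your own.
   df = spark.read.parquet("s3://example-bucket/input/v1/")
   result = (
       df.filter(F.col("country") == "US")
         .groupBy("client_id")
         .agg(F.count("*").alias("ping_count"))
   )
   ```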
2. Output the results of that notebook to an S3 location, usually
   This partitions the output by submission_date, so each day the job runs, its results are written to a new location in S3. Do NOT also store submission_date as a column inside the Parquet files themselves: a column name cannot also be the name of a partition. Partitioning is optional, but datasets should always include a version in the path.
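   A hedged sketch of the daily write, continuing from the example above (`result` is the DataFrame from step 1; the bucket and dataset names are hypothetical):

   ```python
   # Write this run's output to a dated S3 prefix, with a version in the path.
   submission_date = "20180101"  # normally derived from the current run's date
   output_path = "s3://example-bucket/my_dataset/v1/submission_date={}".format(
       submission_date
   )

   # Drop the partition column before writing: it is already encoded in the
   # path, and a column name cannot also be the name of a partition.
   result.drop("submission_date").write.mode("overwrite").parquet(output_path)
   ```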
3. Using this template, open a bug to publish the dataset (making it available in Spark and Re:dash) with the following attributes:
   - Whiteboard tag: [DataOps]
   - Title: "Publish dataset"
   - Content: the location of the dataset in S3 (from step 2 above) and the desired table name