Creating Your Own Dataset to Query in re:dash
- Create a Spark notebook that does the transformations you need, either on raw data (using the Dataset API) or on Parquet data.
- Output the results of that notebook to an S3 location, usually partitioned by submission_date, so that each daily run is written to a new location in S3. Do NOT also include submission_date as a column inside the Parquet files: a column cannot share its name with a partition. Partitioning is optional, but datasets should include a version in the path.
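As a sketch of the layout the steps above describe, the function below builds a versioned, date-partitioned S3 path. The bucket and dataset names are hypothetical placeholders, not the real locations used for published datasets:

```python
def output_path(bucket, dataset, version, submission_date):
    """Build a versioned, date-partitioned S3 output path for one daily run.

    The version segment (v1, v2, ...) lives in the path, and the
    submission_date appears only as a partition directory, never as a
    column inside the Parquet files themselves.
    """
    return (f"s3://{bucket}/{dataset}/v{version}/"
            f"submission_date={submission_date}")

# Hypothetical bucket/dataset names for illustration only.
print(output_path("telemetry-example", "my_dataset", 1, "20240101"))
# -> s3://telemetry-example/my_dataset/v1/submission_date=20240101
```

Each day's run targets a fresh submission_date=... directory under the same versioned prefix, which is what lets the table be queried by date without rewriting earlier partitions.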
- Using this template, open a bug to publish the dataset (making it available in Spark and re:dash) with the following attributes:
- Add whiteboard tag
- Title: "Publish dataset"
- Content: Location of the dataset in S3 (from step 2 above) and the desired table name