Working with Parquet
Most of our derived datasets
are stored in Parquet files.
You can access these datasets in re:dash,
but you may want to access the data from an ATMO cluster
if SQL isn't powerful enough for your analysis
or if a sample of the data will not suffice.
Table of Contents
- Reading Parquet Tables
- Writing Parquet Tables
- Accessing Parquet Tables from Re:dash
Reading Parquet Tables
Spark provides native support for reading Parquet files.
The result of loading a Parquet file is a DataFrame.
For example, you can load
main_summary with the following snippet:
# Parquet files are self-describing so the schema is preserved.
main_summary = spark.read.parquet('s3://telemetry-parquet/main_summary/v1/')
You can find the S3 path for common datasets in Choosing a Dataset or in the reference documentation.
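Because Parquet is a columnar format, it usually pays to select only the columns you need right after loading. The following is a minimal sketch, not part of the original tooling: load_columns is a hypothetical helper, and it assumes spark is an active SparkSession such as the one a cluster notebook provides.

```python
def load_columns(spark, path, columns):
    """Load a Parquet dataset and keep only the requested columns.

    `spark` is assumed to be an active SparkSession; this helper is
    illustrative and not part of any existing library.
    """
    # spark.read.parquet returns a DataFrame whose schema is read from the files.
    df = spark.read.parquet(path)
    # Selecting columns early lets Spark skip the column data it does not need.
    return df.select(*columns)


# Usage on a cluster (the column names below are placeholders):
# subset = load_columns(
#     spark,
#     's3://telemetry-parquet/main_summary/v1/',
#     ['column_a', 'column_b'],
# )
# subset.printSchema()
```

Calling printSchema() on the result is a quick way to confirm which columns and types were preserved.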
Writing Parquet Tables
Saving a table to parquet is a great way to share an intermediate dataset.
Where to save data
You can save data to a subdirectory of the following bucket:
Use your username for the subdirectory name.
This bucket is available to all ATMO clusters and Airflow.
When your analysis is production-ready, open a PR against python_mozetl.
How to save data
You can save the dataframe test_dataframe
to the my_subdir subdirectory of telemetry-test-bucket with the following command:
test_dataframe.write.mode('error') \
    .parquet('s3://telemetry-test-bucket/my_subdir/table_name')
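The 'error' mode makes the write fail if data already exists at the target path, which protects a shared table from being clobbered by accident. Spark's DataFrameWriter also accepts 'append', 'overwrite', and 'ignore'. The sketch below wraps the command above in two hypothetical helpers, table_path and save_table, which are assumptions for illustration and not part of any existing library.

```python
# Save modes accepted by Spark's DataFrameWriter.mode().
VALID_MODES = ("error", "errorifexists", "append", "overwrite", "ignore")


def table_path(bucket, subdir, table_name):
    """Build an S3 path of the form s3://<bucket>/<subdir>/<table_name>."""
    return "s3://{}/{}/{}".format(
        bucket, subdir.strip("/"), table_name.strip("/")
    )


def save_table(df, path, mode="error"):
    """Write a DataFrame to Parquet, refusing unknown save modes.

    With the default mode 'error', Spark raises an exception if data
    already exists at `path` instead of overwriting it.
    """
    if mode not in VALID_MODES:
        raise ValueError("unknown save mode: {!r}".format(mode))
    df.write.mode(mode).parquet(path)
    return path


# Usage on a cluster, with the names from the command above:
# path = table_path('telemetry-test-bucket', 'my_subdir', 'table_name')
# save_table(test_dataframe, path)               # fails if the table exists
# save_table(test_dataframe, path, 'overwrite')  # replaces existing data
```

Keeping 'error' as the default means that overwriting or appending to a table others may depend on is always an explicit choice.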