Course
data-engineering-zoomcamp
Question
Why does writing a Spark DataFrame after repartitioning create multiple parquet files instead of a single file?
Answer
Spark processes data in partitions. When a DataFrame is written to disk, each partition is written as a separate output file.
For example:
trips.repartition(4).write.parquet("output/")
This writes a directory output/ containing four part files (named part-00000-*.parquet and so on), plus a _SUCCESS marker, because the DataFrame now has four partitions and each partition becomes one file.
Writing one file per partition lets Spark write in parallel, which improves performance on large datasets. If you need a single output file, reduce the DataFrame to one partition (for example with coalesce(1)) before writing, at the cost of that parallelism.
Checklist