[FAQ] Why does Spark write multiple parquet files after repartitioning a dataset? #237

@AsherJD-io

Description

Course

data-engineering-zoomcamp

Question

Why does writing a Spark DataFrame after repartitioning create multiple parquet files instead of a single file?

Answer

Spark processes data in partitions, and each partition is handled independently by an executor task. When a DataFrame is written to disk, each partition is written as its own output file, so the number of output files equals the number of partitions at write time.

For example:

trips.repartition(4).write.parquet("output/")

This writes `output/` as a directory containing four parquet part files (named like `part-00000-*.parquet`), one per partition, plus a `_SUCCESS` marker file.

This behavior lets Spark write all partitions in parallel, which is what makes writes fast on large datasets. If you genuinely need a single file, call `repartition(1)` (or `coalesce(1)`) before writing, but be aware this funnels all data through one task and gives up that parallelism.

Checklist

  • I have searched existing FAQs and this question is not already answered
  • The answer provides accurate, helpful information
  • I have included any relevant code examples or links
