[FAQ] Why does Spark write multiple parquet files after repartitioning a dataset? #237

@AsherJD-io

Description

Course

data-engineering-zoomcamp

Question

Why does writing a Spark DataFrame after repartitioning create multiple parquet files instead of a single file?

Answer

Spark processes data in partitions, and each partition is handled independently by an executor task. When a DataFrame is written to disk, each partition is written as its own output file, so the number of output files equals the number of partitions at write time.

For example:

trips.repartition(4).write.parquet("output/")

This writes `output/` as a directory containing four parquet part files (named like `part-00000-*.parquet`), one per partition, plus a `_SUCCESS` marker file.

This behavior lets Spark write all partitions in parallel, which is what makes writes fast on large datasets. If you genuinely need a single file, call `repartition(1)` (or `coalesce(1)`) before writing, but be aware this funnels all data through one task and gives up that parallelism.

Checklist

  • I have searched existing FAQs and this question is not already answered
  • The answer provides accurate, helpful information
  • I have included any relevant code examples or links
