Skip to content

Spark: Add separate data and metadata file lists to RewriteTablePath#15382

Open
mxm wants to merge 5 commits intoapache:mainfrom
mxm:rewrite-table-path-file-lists
Open

Spark: Add separate data and metadata file lists to RewriteTablePath#15382
mxm wants to merge 5 commits intoapache:mainfrom
mxm:rewrite-table-path-file-lists

Conversation

@mxm
Copy link
Contributor

@mxm mxm commented Feb 20, 2026

Downstream tools often need different copy strategies for data vs metadata files (e.g. different location, additional checks on data vs metadata files). This splits the file list into separate data and metadata lists while keeping the combined list for backward compatibility.

@mxm
Copy link
Contributor Author

mxm commented Feb 20, 2026

Comment on lines +334 to +335
String fileListLocation = saveFileList(copyPlan, RESULT_LOCATION);
String dataFileListLocation = saveFileList(dataCopyPlan, DATA_FILE_LIST_DIR);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we afraid of the cost of duplicated file list write?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We potentially write a lot of files (metadata, manifest lists, manifests), so writing two instead of one txt file looks ok to me.

Comment on lines -510 to +532
@Override
public RewriteContentFileResult append(RewriteResult<ContentFile<?>> r1) {
this.copyPlan().addAll(r1.copyPlan());
this.toRewrite().addAll(r1.toRewrite());

public RewriteContentFileResult append(RewriteContentFileResult other) {
this.copyPlan().addAll(other.copyPlan());
this.toRewrite().addAll(other.toRewrite());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The actual return type from where this is being called it RewriteContentFileResult, so I decided to narrow the type.

RewriteResult relies on generics which are subject to type erasure, see

public RewriteResult<T> append(RewriteResult<T> r1) {
It could be worthwhile to get rid of the types, since the different types aren't treated any different.

@mxm mxm force-pushed the rewrite-table-path-file-lists branch from 6f88d67 to 9577ecd Compare March 12, 2026 11:44
@mxm
Copy link
Contributor Author

mxm commented Mar 12, 2026

Rebased to resolve conflicts with #15381.

@mxm mxm force-pushed the rewrite-table-path-file-lists branch from 9577ecd to 356bebf Compare March 12, 2026 14:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants