Spark 4.1: Add targetTable support to RewriteTablePath#15412

Open
mxm wants to merge 1 commit into apache:main from mxm:rewrite-table-path-target-table

Conversation

@mxm
Contributor

@mxm mxm commented Feb 23, 2026

This enables incremental copy using the target table to automatically determine where to resume. We validate that source and target are in sync at the determined version by comparing snapshot IDs and checking that all manifest files exist.

Example:

```java
Table sourceTable = ...
Table targetTable = ...
actions()
    .rewriteTablePath(sourceTable)
    .rewriteLocationPrefix(sourceLocation, targetLocation)
    // Resolve startVersion from the targetTable
    .targetTable(targetTable)
    .execute();
```

The targetTable() parameter takes precedence over startVersion(..).

@mxm
Contributor Author

mxm commented Feb 23, 2026

@manuzhang manuzhang changed the title from "Spark: Add targetTable support to RewriteTablePath" to "Spark 4.1: Add targetTable support to RewriteTablePath" Feb 23, 2026
@manuzhang
Member

It looks a bit strange to have the concept of targetTable for RewriteTablePath, which merely prepares a table for copying to another location. What's the end-to-end process?

@mxm
Contributor Author

mxm commented Feb 23, 2026

Thanks for asking! This is useful for continuous incremental replication of a source table to a destination table. Rather than doing a one-off full rewrite/copy of a table to a new destination, you want to repeat the process against an existing copy. In order to avoid having to rewrite/copy everything again, you want to rewrite/copy just the files between the last copied version and the current version of the source table. This is already possible today, but it requires setting the startVersion parameter to the last copied version, which is error-prone.
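
To make "copy just the files between the last copied version and the current version" concrete, here is a minimal, self-contained sketch. It does not use the Iceberg API: `IncrementalRangeSketch` and `versionsToCopy` are hypothetical names, and the source table's metadata log is modeled as a plain ordered list of file names.

```java
import java.util.List;

// Hypothetical illustration only; not part of Iceberg.
final class IncrementalRangeSketch {

  // Returns the versions that still need to be copied: everything after
  // startVersion (exclusive) up to the newest version (inclusive).
  // A null startVersion means nothing was copied yet, so all versions are returned.
  static List<String> versionsToCopy(List<String> sourceVersions, String startVersion) {
    if (startVersion == null) {
      return List.copyOf(sourceVersions);
    }
    int start = sourceVersions.indexOf(startVersion);
    if (start < 0) {
      throw new IllegalArgumentException("Unknown start version: " + startVersion);
    }
    return List.copyOf(sourceVersions.subList(start + 1, sourceVersions.size()));
  }

  public static void main(String[] args) {
    List<String> log = List.of("v1.metadata.json", "v2.metadata.json", "v3.metadata.json");
    System.out.println(versionsToCopy(log, "v1.metadata.json"));
  }
}
```

In this model, a mistyped `startVersion` fails and a stale one selects the wrong range, which is the manual step that resolving the version from the target table is meant to remove.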

Comment on lines +234 to +235
Preconditions.checkArgument(
    startVersionName == null, "Cannot set both startVersion and targetTable.");
Contributor

This is better placed in the validateAndSetStartVersion method

Contributor Author

@mxm mxm Feb 26, 2026

I agree, we should move this check.
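
One possible reading of this suggestion: a check inside the `targetTable(...)` setter can only see options that were set before it, whereas a check at validation time sees the final state regardless of call order. The sketch below illustrates that with a hypothetical builder (`RewriteOptionsSketch`, using plain strings instead of `Table`). It is not the PR's code, and `validate()` only stands in for `validateAndSetStartVersion`.

```java
// Hypothetical illustration only; not the PR's implementation.
final class RewriteOptionsSketch {
  private String startVersionName;
  private String targetTableName;

  RewriteOptionsSketch startVersion(String name) {
    this.startVersionName = name;
    return this;
  }

  RewriteOptionsSketch targetTable(String name) {
    this.targetTableName = name;
    return this;
  }

  // Runs once before execution, so it catches the conflict in either call order.
  void validate() {
    if (startVersionName != null && targetTableName != null) {
      throw new IllegalArgumentException("Cannot set both startVersion and targetTable.");
    }
  }
}
```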


private String findVersion(String version, TableMetadata sourceMetadata) {
  String currentSourceMetadataFile = currentMetadataPath(table);
  if (currentSourceMetadataFile.endsWith(version)) {
Contributor

This seems brittle to me. I don't think we should rely on filenames to determine whether the snapshot files are the same. I don't think we have any guarantees there.

Contributor

Maybe relying on snapshot ids would be better?

Contributor Author

@mxm mxm Feb 26, 2026

I agree. The original code compared only file names to decide whether the version/metadata files are identical. Now, this method is only used to select the candidate version. Afterwards, we verify via isSameSnapshot(sourceVersionFullPath, targetVersionFullPath) that the snapshot IDs match.

Contributor

Yeah, but if the filenames have changed, we will not find the snapshot and will do a full copy. So we should probably just skip this check.

Contributor Author

I wonder how likely a rename is, given the implications it has. If we switch to snapshot IDs, we have to read the metadata file of every version until we find the matching one. This is slower than the filename-based search followed by the snapshot ID verification. I've pushed this change with 1163057.
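
The trade-off discussed in this thread can be sketched as a two-phase lookup: a cheap filename match picks a candidate, and only that candidate's snapshot ID is compared. This is a simplified, self-contained model rather than the PR's code. `MetadataVersion`, `findCandidate`, and `resolveStartVersion` are hypothetical names, and reading a metadata file is reduced to a field access.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical model; not the PR's implementation.
final class VersionLookupSketch {

  // Stand-in for a metadata file: its full path and the current snapshot ID it records.
  static final class MetadataVersion {
    final String path;
    final long snapshotId;

    MetadataVersion(String path, long snapshotId) {
      this.path = path;
      this.snapshotId = snapshotId;
    }
  }

  // Phase 1: select a candidate purely by file name, without reading any metadata.
  static Optional<MetadataVersion> findCandidate(List<MetadataVersion> sourceLog, String fileName) {
    return sourceLog.stream().filter(v -> v.path.endsWith(fileName)).findFirst();
  }

  // Phase 2: accept the candidate only if its snapshot ID matches the target's.
  // An empty result means no verified resume point was found.
  static Optional<String> resolveStartVersion(
      List<MetadataVersion> sourceLog, MetadataVersion targetCurrent) {
    String fileName = targetCurrent.path.substring(targetCurrent.path.lastIndexOf('/') + 1);
    return findCandidate(sourceLog, fileName)
        .filter(candidate -> candidate.snapshotId == targetCurrent.snapshotId)
        .map(candidate -> candidate.path);
  }
}
```

In this model, a renamed metadata file makes phase 1 come back empty even when a matching snapshot exists, which is the concern raised above. Scanning by snapshot ID would avoid that at the cost of reading each version's metadata.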

   * @return this for method chaining
   */
  default RewriteTablePath targetTable(Table targetTable) {
    return this;
Contributor

Should this throw?

Contributor Author

Yes, updated this to match the behavior in the other PRs #15381 and #15382.
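
For reference, a default method that throws instead of silently returning `this` typically looks like the sketch below. The interface and class names are hypothetical stand-ins, the parameter is a `String` rather than a `Table`, and the exception type and message are assumptions rather than what the PR actually uses.

```java
// Hypothetical stand-ins; not the Iceberg interfaces.
interface PathRewriterSketch {
  // Implementations that do not support the option inherit this default,
  // so callers get an explicit failure instead of a silently ignored setting.
  default PathRewriterSketch targetTable(String targetTableName) {
    throw new UnsupportedOperationException(
        this.getClass().getName() + " does not support targetTable");
  }
}

// An implementation that does not override targetTable.
final class LegacyRewriterSketch implements PathRewriterSketch {}
```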

@manuzhang
Member

@mxm Thanks for your detailed explanation. I feel targetTable is a parameter more suitable for a larger process like you described than RewriteTablePath alone.

@mxm
Contributor Author

mxm commented Mar 2, 2026

> @mxm Thanks for your detailed explanation. I feel targetTable is a parameter more suitable for a larger process like you described than RewriteTablePath alone.

That makes sense. Would you prefer for the parameter to be renamed? Some suggestions:

  1. resumeFrom(Table t) => Indicates the incremental copy process.
  2. startVersion(Table t) => Overload for startVersion(String versionName), which is an existing parameter to specify the version for incremental copy.

@manuzhang
Member

Is resumeFrom(Table t) or startVersion(Table t) possible with a Spark SQL CALL procedure?

@mxm
Contributor Author

mxm commented Mar 4, 2026

Yes, it is possible. We already pass the source table to the procedure by parsing a TableIdentifier. We can do the same for the target table.

@mxm mxm force-pushed the rewrite-table-path-target-table branch from 1163057 to 38b7124 March 12, 2026 11:54
@mxm
Contributor Author

mxm commented Mar 12, 2026

Rebased to resolve conflicts with #15381.
