Skip to content

API, Core, Spark: Pass FileIO on Spark's read path#15448

Open
nastra wants to merge 2 commits intoapache:mainfrom
nastra:remote-planning-file-io-alternative
Open

API, Core, Spark: Pass FileIO on Spark's read path#15448
nastra wants to merge 2 commits intoapache:mainfrom
nastra:remote-planning-file-io-alternative

Conversation

@nastra
Copy link
Contributor

@nastra nastra commented Feb 26, 2026

The downside of this approach is that the changes are much bigger than in #15368 and we also need

When accessing/reading data files, the codebase is using the Table's FileIO instance through table.io() on Spark's read path. With remote scan planning the FileIO instance is configured with a PlanID + custom storage credentials inside RESTTableScan, but that instance is never propagated to the place(s) that actually perform the read., thus leading to errors.

This PR passes the FileIO obtained during remote/distributed scan planning next to the Table instance on Spark's read path.

This is an alternative to #15368 and requires SerializableFileIOWithSize, which makes sure that the FileIO instance is only closed on the driver and not on executor nodes (similar to SerializableTableWithSize).

@nastra nastra force-pushed the remote-planning-file-io-alternative branch 2 times, most recently from 000673a to dc8a4f7 Compare February 26, 2026 10:44
@nastra nastra marked this pull request as draft March 3, 2026 09:00
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class SerializableFileIOWithSize
Copy link
Contributor Author

@nastra nastra Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this makes sure that the FileIO instance is only closed on the driver and not on executor nodes and is similar to SerializableTableWithSize

@nastra nastra force-pushed the remote-planning-file-io-alternative branch 2 times, most recently from 45a44aa to 122ccc8 Compare March 6, 2026 17:30
@SuppressWarnings("unchecked")
private <T extends RESTResponse> T maybeAddStorageCredential(T response) {
if (response instanceof PlanTableScanResponse resp
&& PlanStatus.COMPLETED == resp.planStatus()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm looking into when we can return a FileIO from the scan and I was surprised to see that storage-credentials is returned for any CompletedPlanningResult rather than CompletedPlanningWithIDResult. The one with ID is used for completed responses from the plan endpoint, while the generic result is returned from both plan and fetch endpoints.

Why can we return storage credentials when fetching tasks? Wouldn't it make more sense to return credentials once per planning operation? That simplifies when we know we have credentials.

That wouldn't solve the problem of needing to wait until after planFiles is called to get the FileIO, but it would at least simplify the protocol so we don't need to try to create a new FileIO after each call to fetch more tasks. (@danielcweeks, any thoughts on this?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess if we want to change this we would at the very least need to update the Spec that we added in #14563

SparkScan(
SparkSession spark,
Table table,
Supplier<FileIO> fileIO,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a supplier because it will be a broadcast?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a supplier because planFiles() will be called later in the read path and so the actual FileIO with the right credentials will only be available later

@nastra nastra force-pushed the remote-planning-file-io-alternative branch from 122ccc8 to 4a69352 Compare March 9, 2026 11:34
@github-actions github-actions bot removed the OPENAPI label Mar 9, 2026
@nastra nastra force-pushed the remote-planning-file-io-alternative branch from 4a69352 to f9e1bde Compare March 9, 2026 13:22
@nastra nastra force-pushed the remote-planning-file-io-alternative branch 2 times, most recently from c13cd0a to b5367d7 Compare March 12, 2026 07:04
@nastra nastra marked this pull request as ready for review March 12, 2026 07:06
@nastra nastra force-pushed the remote-planning-file-io-alternative branch from b5367d7 to 3e3ec04 Compare March 12, 2026 07:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants