
fix: enable native_datafusion Spark SQL tests for #3320, #3401, #3719 (#3718)

Draft
andygrove wants to merge 4 commits into apache:main from
andygrove:fix/enable-native-datafusion-spark-sql-tests-3320-3401

Conversation

@andygrove
Member

@andygrove andygrove commented Mar 17, 2026

Which issue does this PR close?

Closes #3320.
Closes #3401.
Closes #3311.
Closes #3719.

Rationale for this change

Several Spark SQL tests were skipped with IgnoreCometNativeDataFusion but now pass, either without code changes or with a targeted error conversion fix.

What changes are included in this PR?

Enabled 5 tests (tags removed)

#3320 — ParquetFilterSuite (2 tests):

  • SPARK-31026: Parquet predicate pushdown for fields having dots in the names
  • Filters should be pushed down for Parquet readers at row group level

Both already had conditional assertion guards for native execution. The skip tag was overly conservative.

#3401 — StreamingSelfUnionSuite (2 tests):

  • self-union, DSv1, read via DataStreamReader API
  • self-union, DSv1, read via table API

Streaming DSv1 Parquet reads work correctly with native_datafusion.

#3719 — FileBasedDataSourceSuite (1 test):

  • Spark native readers should respect spark.sql.caseSensitive

Fixed by adding error conversion: DataFusion's "Unable to get field named" schema error is now detected in throw_exception and converted to a SparkRuntimeException with error class _LEGACY_ERROR_TEMP_2093 via the JSON error path. The Spark test is updated to assert the error class without checking exact message parameters, since DataFusion reports the field names differently than Spark.

Error conversion changes

  • native/spark-expr/src/error.rs: Added SparkError::DuplicateFieldCaseInsensitive variant with required_field_name and matched_fields params, error class _LEGACY_ERROR_TEMP_2093
  • native/core/src/errors.rs: Added try_convert_duplicate_field_error() to detect DataFusion's schema error pattern and convert to the new SparkError variant
  • ShimSparkErrorConverter.scala (3.4, 3.5, 4.0): Added handler calling QueryExecutionErrors.foundDuplicateFieldInCaseInsensitiveModeError()
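
The detection-and-conversion step can be sketched as follows. This is a simplified, standalone illustration rather than the actual code in native/core/src/errors.rs: the `SparkError` enum here only mirrors the variant shape described above, and the parsing assumes DataFusion's `Unable to get field named "x". Valid fields: [...]` message format.

```rust
// Hedged sketch of the error conversion described above. The real SparkError
// lives in native/spark-expr/src/error.rs; this mini version only mirrors the
// DuplicateFieldCaseInsensitive variant's shape.
#[derive(Debug, PartialEq)]
enum SparkError {
    DuplicateFieldCaseInsensitive {
        required_field_name: String,
        matched_fields: Vec<String>,
    },
}

/// Detect DataFusion's `Unable to get field named "x". Valid fields: [...]`
/// schema error and convert it to the Spark error variant.
/// Returns None when the message does not match the expected pattern.
fn try_convert_duplicate_field_error(error_msg: &str) -> Option<SparkError> {
    // Extract the requested field name from the quoted prefix.
    let rest = error_msg.strip_prefix("Unable to get field named \"")?;
    let (field, rest) = rest.split_once('"')?;
    // Extract the bracketed "Valid fields" list.
    let list = rest.split_once('[')?.1.split_once(']')?.0;
    // Keep only case-insensitive matches of the requested field, per the
    // SPARK-25207 expectation (e.g. [B, b] rather than [A, B, b]).
    let matched_fields: Vec<String> = list
        .split(',')
        .map(|s| s.trim().trim_matches('"').to_string())
        .filter(|f| f.eq_ignore_ascii_case(field))
        .collect();
    if matched_fields.is_empty() {
        return None;
    }
    Some(SparkError::DuplicateFieldCaseInsensitive {
        required_field_name: field.to_string(),
        matched_fields,
    })
}

fn main() {
    let msg = r#"Unable to get field named "b". Valid fields: ["A", "B", "b"]"#;
    // Prints: b -> ["B", "b"]
    if let Some(SparkError::DuplicateFieldCaseInsensitive {
        required_field_name,
        matched_fields,
    }) = try_convert_duplicate_field_error(msg)
    {
        println!("{} -> {:?}", required_field_name, matched_fields);
    }
}
```

On the JVM side, the matched variant would then be routed through the JSON error path to `QueryExecutionErrors.foundDuplicateFieldInCaseInsensitiveModeError()`, as the shim bullet above describes.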

Re-linked 8 tests from #3311 to specific issues

Documentation

Updated docs/source/contributor-guide/parquet_scans.md to reflect that duplicate field errors in physical Parquet files now produce the correct Spark exception type.

How are these changes tested?

All 5 enabled tests were run locally against Spark 3.5.8 with ENABLE_COMET=true ENABLE_COMET_ONHEAP=true -Dspark.comet.scan.impl=native_datafusion and pass. The diff was regenerated following the documented process and verified to apply cleanly.

@andygrove andygrove marked this pull request as draft March 17, 2026 14:08
@andygrove andygrove closed this Mar 17, 2026
@andygrove andygrove reopened this Mar 17, 2026
@andygrove andygrove force-pushed the fix/enable-native-datafusion-spark-sql-tests-3320-3401 branch from 85126d1 to 1204e0e on March 17, 2026 at 14:41
@andygrove andygrove changed the title from "fix: enable native_datafusion Spark SQL tests for #3320 and #3401" to "fix: enable native_datafusion Spark SQL tests for #3320, #3401; re-link #3311 tests to #3719, #3720" Mar 17, 2026
@andygrove andygrove force-pushed the fix/enable-native-datafusion-spark-sql-tests-3320-3401 branch 3 times, most recently from 1204e0e to 82acd9c on March 17, 2026 at 15:36
@andygrove andygrove changed the title from "fix: enable native_datafusion Spark SQL tests for #3320, #3401; re-link #3311 tests to #3719, #3720" to "fix: enable native_datafusion Spark SQL tests for #3320, #3401, #3719" Mar 17, 2026
…#3401, apache#3719

- Remove IgnoreCometNativeDataFusion tags from 5 tests that now pass:
  - ParquetFilterSuite: SPARK-31026 and row group level filter pushdown
  - StreamingSelfUnionSuite: DSv1 self-union tests
  - FileBasedDataSourceSuite: caseSensitive test
- Add SparkError::DuplicateFieldCaseInsensitive to convert DataFusion's
  "Unable to get field named" schema error to SparkRuntimeException
  with error class _LEGACY_ERROR_TEMP_2093, matching Spark's behavior
- Re-link remaining apache#3311 tests to specific issues apache#3719, apache#3720
@andygrove andygrove force-pushed the fix/enable-native-datafusion-spark-sql-tests-3320-3401 branch from 82acd9c to 5ee8e78 on March 17, 2026 at 16:25
Filter DataFusion's valid fields list to only case-insensitive matches
instead of passing all schema fields. This fixes the SPARK-25207 test
in ParquetFilterSuite which expects matched fields [B, b] not [A, B, b].

Also update the Spark test diff to accept both uppercase and lowercase
field names since DataFusion doesn't have access to the original table
schema field name.
DataFusion may deduplicate Parquet columns case-insensitively, reporting
only one variant in "Valid fields" (e.g. ["A", "B"] when file has A, B,
b). Detect this by checking if the requested field has a case-insensitive
match but differs in case, then reconstruct the duplicate list for
Spark's _LEGACY_ERROR_TEMP_2093 error.
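
That reconstruction step can be sketched as below. The helper name and signature are illustrative, not the actual code in the PR; the logic follows the commit message: an exact match in "Valid fields" means no dedup happened, while a case-insensitive-only match implies the file contained duplicates.

```rust
// Hypothetical helper illustrating the dedup detection described above.
// DataFusion may report only one case variant in "Valid fields" (e.g.
// ["A", "B"] when the file has columns A, B, b); if the requested field is
// absent but a case-insensitive variant is present, we reconstruct the
// duplicate list for Spark's _LEGACY_ERROR_TEMP_2093 error.
fn reconstruct_duplicates(requested: &str, valid_fields: &[&str]) -> Option<Vec<String>> {
    if valid_fields.contains(&requested) {
        // Exact match present: the field was not deduplicated away.
        return None;
    }
    let mut dups: Vec<String> = valid_fields
        .iter()
        .filter(|f| f.eq_ignore_ascii_case(requested))
        .map(|f| f.to_string())
        .collect();
    if dups.is_empty() {
        // No case-insensitive match at all: a genuinely unknown field.
        return None;
    }
    // The requested spelling must also have existed in the file.
    dups.push(requested.to_string());
    dups.sort();
    Some(dups)
}

fn main() {
    // Prints: Some(["B", "b"])
    println!("{:?}", reconstruct_duplicates("b", &["A", "B"]));
}
```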

Also regenerate 3.5.8.diff with correct 0.15.0-SNAPSHOT version.
/// DataFusion may deduplicate them and report: Unable to get field named "b". Valid
/// fields: ["A", "B"]. When the requested field has a case-insensitive match among the
/// valid fields, we convert this to Spark's _LEGACY_ERROR_TEMP_2093 error.
fn try_convert_duplicate_field_error(error_msg: &str) -> Option<SparkError> {
Member Author
I'm not sure whether this is overkill. We could instead update the Spark SQL test to allow for different exceptions with Comet.

…3.5.8.diff

Regenerated diff from Spark v3.5.8 checkout to restore tags that were
accidentally removed when the diff was edited directly.