
fix: enable native_datafusion Spark SQL tests for #3320, #3401, #3719 (#3718)

Draft
andygrove wants to merge 4 commits into apache:main from
andygrove:fix/enable-native-datafusion-spark-sql-tests-3320-3401

Conversation

@andygrove
Member

@andygrove andygrove commented Mar 17, 2026

Which issue does this PR close?

Closes #3320.
Closes #3401.
Closes #3311.
Closes #3719.

Rationale for this change

Several Spark SQL tests were skipped with IgnoreCometNativeDataFusion but now pass, either without code changes or with a targeted error conversion fix.

What changes are included in this PR?

Enabled 5 tests (tags removed)

#3320 — ParquetFilterSuite (2 tests):

  • SPARK-31026: Parquet predicate pushdown for fields having dots in the names
  • Filters should be pushed down for Parquet readers at row group level

Both already had conditional assertion guards for native execution. The skip tag was overly conservative.

#3401 — StreamingSelfUnionSuite (2 tests):

  • self-union, DSv1, read via DataStreamReader API
  • self-union, DSv1, read via table API

Streaming DSv1 Parquet reads work correctly with native_datafusion.

#3719 — FileBasedDataSourceSuite (1 test):

  • Spark native readers should respect spark.sql.caseSensitive

Fixed by adding error conversion: DataFusion's "Unable to get field named" schema error is now detected in throw_exception and converted to a SparkRuntimeException with error class _LEGACY_ERROR_TEMP_2093 via the JSON error path. The Spark test is updated to assert the error class without checking exact message parameters, since DataFusion reports the field names differently than Spark.

Error conversion changes

  • native/spark-expr/src/error.rs: Added SparkError::DuplicateFieldCaseInsensitive variant with required_field_name and matched_fields params, error class _LEGACY_ERROR_TEMP_2093
  • native/core/src/errors.rs: Added try_convert_duplicate_field_error() to detect DataFusion's schema error pattern and convert to the new SparkError variant
  • ShimSparkErrorConverter.scala (3.4, 3.5, 4.0): Added handler calling QueryExecutionErrors.foundDuplicateFieldInCaseInsensitiveModeError()
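
The detection-and-conversion step can be sketched as follows. This is a simplified, standalone illustration rather than the actual code in native/core/src/errors.rs: the `SparkError` enum here only mirrors the variant shape described above, and the parsing assumes DataFusion's `Unable to get field named "x". Valid fields: [...]` message format.

```rust
// Hedged sketch of the error conversion described above. The real SparkError
// lives in native/spark-expr/src/error.rs; this mini version only mirrors the
// DuplicateFieldCaseInsensitive variant's shape.
#[derive(Debug, PartialEq)]
enum SparkError {
    DuplicateFieldCaseInsensitive {
        required_field_name: String,
        matched_fields: Vec<String>,
    },
}

/// Detect DataFusion's `Unable to get field named "x". Valid fields: [...]`
/// schema error and convert it to the Spark error variant.
/// Returns None when the message does not match the expected pattern.
fn try_convert_duplicate_field_error(error_msg: &str) -> Option<SparkError> {
    // Extract the requested field name from the quoted prefix.
    let rest = error_msg.strip_prefix("Unable to get field named \"")?;
    let (field, rest) = rest.split_once('"')?;
    // Extract the bracketed "Valid fields" list.
    let list = rest.split_once('[')?.1.split_once(']')?.0;
    // Keep only case-insensitive matches of the requested field, per the
    // SPARK-25207 expectation (e.g. [B, b] rather than [A, B, b]).
    let matched_fields: Vec<String> = list
        .split(',')
        .map(|s| s.trim().trim_matches('"').to_string())
        .filter(|f| f.eq_ignore_ascii_case(field))
        .collect();
    if matched_fields.is_empty() {
        return None;
    }
    Some(SparkError::DuplicateFieldCaseInsensitive {
        required_field_name: field.to_string(),
        matched_fields,
    })
}

fn main() {
    let msg = r#"Unable to get field named "b". Valid fields: ["A", "B", "b"]"#;
    // Prints: b -> ["B", "b"]
    if let Some(SparkError::DuplicateFieldCaseInsensitive {
        required_field_name,
        matched_fields,
    }) = try_convert_duplicate_field_error(msg)
    {
        println!("{} -> {:?}", required_field_name, matched_fields);
    }
}
```

On the JVM side, the matched variant would then be routed through the JSON error path to `QueryExecutionErrors.foundDuplicateFieldInCaseInsensitiveModeError()`, as the shim bullet above describes.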

Re-linked 8 tests from #3311 to specific issues

Documentation

Updated docs/source/contributor-guide/parquet_scans.md to reflect that duplicate field errors in physical Parquet files now produce the correct Spark exception type.

How are these changes tested?

All 5 enabled tests were run locally against Spark 3.5.8 with ENABLE_COMET=true ENABLE_COMET_ONHEAP=true -Dspark.comet.scan.impl=native_datafusion and pass. The diff was regenerated following the documented process and verified to apply cleanly.

@andygrove andygrove marked this pull request as draft March 17, 2026 14:08
@andygrove andygrove closed this Mar 17, 2026
@andygrove andygrove reopened this Mar 17, 2026
@andygrove andygrove force-pushed the fix/enable-native-datafusion-spark-sql-tests-3320-3401 branch from 85126d1 to 1204e0e on March 17, 2026 at 14:41
@andygrove andygrove changed the title from "fix: enable native_datafusion Spark SQL tests for #3320 and #3401" to "fix: enable native_datafusion Spark SQL tests for #3320, #3401; re-link #3311 tests to #3719, #3720" Mar 17, 2026
@andygrove andygrove force-pushed the fix/enable-native-datafusion-spark-sql-tests-3320-3401 branch 3 times, most recently from 1204e0e to 82acd9c on March 17, 2026 at 15:36
@andygrove andygrove changed the title from "fix: enable native_datafusion Spark SQL tests for #3320, #3401; re-link #3311 tests to #3719, #3720" to "fix: enable native_datafusion Spark SQL tests for #3320, #3401, #3719" Mar 17, 2026
…#3401, apache#3719

- Remove IgnoreCometNativeDataFusion tags from 5 tests that now pass:
  - ParquetFilterSuite: SPARK-31026 and row group level filter pushdown
  - StreamingSelfUnionSuite: DSv1 self-union tests
  - FileBasedDataSourceSuite: caseSensitive test
- Add SparkError::DuplicateFieldCaseInsensitive to convert DataFusion's
  "Unable to get field named" schema error to SparkRuntimeException
  with error class _LEGACY_ERROR_TEMP_2093, matching Spark's behavior
- Re-link remaining apache#3311 tests to specific issues apache#3719, apache#3720
@andygrove andygrove force-pushed the fix/enable-native-datafusion-spark-sql-tests-3320-3401 branch from 82acd9c to 5ee8e78 on March 17, 2026 at 16:25
Filter DataFusion's valid fields list to only case-insensitive matches
instead of passing all schema fields. This fixes the SPARK-25207 test
in ParquetFilterSuite which expects matched fields [B, b] not [A, B, b].

Also update the Spark test diff to accept both uppercase and lowercase
field names since DataFusion doesn't have access to the original table
schema field name.
DataFusion may deduplicate Parquet columns case-insensitively, reporting
only one variant in "Valid fields" (e.g. ["A", "B"] when file has A, B,
b). Detect this by checking if the requested field has a case-insensitive
match but differs in case, then reconstruct the duplicate list for
Spark's _LEGACY_ERROR_TEMP_2093 error.
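
That reconstruction step can be sketched as below. The helper name and signature are illustrative, not the actual code in the PR; the logic follows the commit message: an exact match in "Valid fields" means no dedup happened, while a case-insensitive-only match implies the file contained duplicates.

```rust
// Hypothetical helper illustrating the dedup detection described above.
// DataFusion may report only one case variant in "Valid fields" (e.g.
// ["A", "B"] when the file has columns A, B, b); if the requested field is
// absent but a case-insensitive variant is present, we reconstruct the
// duplicate list for Spark's _LEGACY_ERROR_TEMP_2093 error.
fn reconstruct_duplicates(requested: &str, valid_fields: &[&str]) -> Option<Vec<String>> {
    if valid_fields.contains(&requested) {
        // Exact match present: the field was not deduplicated away.
        return None;
    }
    let mut dups: Vec<String> = valid_fields
        .iter()
        .filter(|f| f.eq_ignore_ascii_case(requested))
        .map(|f| f.to_string())
        .collect();
    if dups.is_empty() {
        // No case-insensitive match at all: a genuinely unknown field.
        return None;
    }
    // The requested spelling must also have existed in the file.
    dups.push(requested.to_string());
    dups.sort();
    Some(dups)
}

fn main() {
    // Prints: Some(["B", "b"])
    println!("{:?}", reconstruct_duplicates("b", &["A", "B"]));
}
```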

Also regenerate 3.5.8.diff with correct 0.15.0-SNAPSHOT version.
/// DataFusion may deduplicate them and report: Unable to get field named "b". Valid
/// fields: ["A", "B"]. When the requested field has a case-insensitive match among the
/// valid fields, we convert this to Spark's _LEGACY_ERROR_TEMP_2093 error.
fn try_convert_duplicate_field_error(error_msg: &str) -> Option<SparkError> {
Member Author
I'm not sure whether this is overkill. We could instead update the Spark SQL test to allow for different exceptions with Comet.

…3.5.8.diff

Regenerated diff from Spark v3.5.8 checkout to restore tags that were
accidentally removed when the diff was edited directly.