fix: enable native_datafusion Spark SQL tests for #3320, #3401, #3719 #3718
Draft
andygrove wants to merge 4 commits into apache:main from
Conversation
…#3401, apache#3719

- Remove IgnoreCometNativeDataFusion tags from 5 tests that now pass:
  - ParquetFilterSuite: SPARK-31026 and row group level filter pushdown
  - StreamingSelfUnionSuite: DSv1 self-union tests
  - FileBasedDataSourceSuite: caseSensitive test
- Add SparkError::DuplicateFieldCaseInsensitive to convert DataFusion's "Unable to get field named" schema error to SparkRuntimeException with error class _LEGACY_ERROR_TEMP_2093, matching Spark's behavior
- Re-link remaining apache#3311 tests to specific issues apache#3719, apache#3720
Filter DataFusion's valid fields list to only case-insensitive matches instead of passing all schema fields. This fixes the SPARK-25207 test in ParquetFilterSuite which expects matched fields [B, b] not [A, B, b]. Also update the Spark test diff to accept both uppercase and lowercase field names since DataFusion doesn't have access to the original table schema field name.
DataFusion may deduplicate Parquet columns case-insensitively, reporting only one variant in "Valid fields" (e.g. ["A", "B"] when file has A, B, b). Detect this by checking if the requested field has a case-insensitive match but differs in case, then reconstruct the duplicate list for Spark's _LEGACY_ERROR_TEMP_2093 error. Also regenerate 3.5.8.diff with correct 0.15.0-SNAPSHOT version.
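The detection this commit message describes can be sketched as follows. This is an illustrative stand-alone sketch, not the PR's actual code: the function name `case_insensitive_matches`, its signature, and the exact way the duplicate list is reconstructed are all assumptions for the example.

```rust
/// Hypothetical sketch: given the field name Spark requested and DataFusion's
/// deduplicated "Valid fields" list, return the duplicate-field list that
/// Spark's _LEGACY_ERROR_TEMP_2093 error would report, or None if this is not
/// a case-insensitive duplicate situation.
fn case_insensitive_matches(required: &str, valid_fields: &[&str]) -> Option<Vec<String>> {
    let matches: Vec<String> = valid_fields
        .iter()
        .filter(|f| f.eq_ignore_ascii_case(required))
        .map(|f| f.to_string())
        .collect();
    // Only treat this as a duplicate-field error when a case-insensitive match
    // exists but none of the matches is an exact (case-sensitive) match.
    if !matches.is_empty() && !matches.iter().any(|f| f == required) {
        // Reconstruct the full duplicate list by adding the requested variant
        // back, mirroring the [B, b] style list the Spark test expects.
        let mut all = matches;
        all.push(required.to_string());
        all.sort();
        Some(all)
    } else {
        None
    }
}

fn main() {
    // File has columns A, B, b; DataFusion deduplicated to ["A", "B"] and the
    // query requested "b" with caseSensitive=false.
    assert_eq!(
        case_insensitive_matches("b", &["A", "B"]),
        Some(vec!["B".to_string(), "b".to_string()])
    );
    // An exact match is not an error.
    assert_eq!(case_insensitive_matches("B", &["A", "B"]), None);
}
```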
andygrove
commented
Mar 17, 2026
/// DataFusion may deduplicate them and report: Unable to get field named "b". Valid
/// fields: ["A", "B"]. When the requested field has a case-insensitive match among the
/// valid fields, we convert this to Spark's _LEGACY_ERROR_TEMP_2093 error.
fn try_convert_duplicate_field_error(error_msg: &str) -> Option<SparkError> {
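A minimal stand-alone sketch of the string-pattern step this function's doc comment describes, extracting the requested field name from DataFusion's error text. The helper name `extract_required_field` and the exact prefix handling are assumptions for illustration; the real logic lives in native/core/src/errors.rs and returns a SparkError rather than a String.

```rust
/// Hypothetical sketch: recognize DataFusion's schema error message and pull
/// out the field name that could not be resolved.
fn extract_required_field(error_msg: &str) -> Option<String> {
    // DataFusion's message looks like:
    //   Unable to get field named "b". Valid fields: ["A", "B"].
    let rest = error_msg.strip_prefix("Unable to get field named \"")?;
    let end = rest.find('"')?;
    Some(rest[..end].to_string())
}

fn main() {
    let msg = r#"Unable to get field named "b". Valid fields: ["A", "B"]."#;
    assert_eq!(extract_required_field(msg), Some("b".to_string()));
    // Unrelated errors pass through untouched.
    assert_eq!(extract_required_field("some other error"), None);
}
```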
I'm not sure if this is overkill or not. We could just update the Spark SQL test to allow for different exceptions with Comet.
…3.5.8.diff

Regenerated diff from Spark v3.5.8 checkout to restore tags that were accidentally removed when the diff was edited directly.
Which issue does this PR close?
Closes #3320.
Closes #3401.
Closes #3311.
Closes #3719.
Rationale for this change
Several Spark SQL tests were skipped with IgnoreCometNativeDataFusion but now pass, either without code changes or with a targeted error conversion fix.

What changes are included in this PR?
Enabled 5 tests (tags removed)
#3320 — ParquetFilterSuite (2 tests):
- SPARK-31026: Parquet predicate pushdown for fields having dots in the names
- Filters should be pushed down for Parquet readers at row group level

Both already had conditional assertion guards for native execution. The skip tag was overly conservative.
#3401 — StreamingSelfUnionSuite (2 tests):
- self-union, DSv1, read via DataStreamReader API
- self-union, DSv1, read via table API

Streaming DSv1 Parquet reads work correctly with native_datafusion.

#3719 — FileBasedDataSourceSuite (1 test):
- Spark native readers should respect spark.sql.caseSensitive

Fixed by adding error conversion: DataFusion's "Unable to get field named" schema error is now detected in throw_exception and converted to a SparkRuntimeException with error class _LEGACY_ERROR_TEMP_2093 via the JSON error path. The Spark test is updated to assert the error class without checking exact message parameters, since DataFusion reports the field names differently than Spark.

Error conversion changes
- native/spark-expr/src/error.rs: Added SparkError::DuplicateFieldCaseInsensitive variant with required_field_name and matched_fields params, error class _LEGACY_ERROR_TEMP_2093
- native/core/src/errors.rs: Added try_convert_duplicate_field_error() to detect DataFusion's schema error pattern and convert to the new SparkError variant
- ShimSparkErrorConverter.scala (3.4, 3.5, 4.0): Added handler calling QueryExecutionErrors.foundDuplicateFieldInCaseInsensitiveModeError()

Re-linked 8 tests from #3311 to specific issues
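The "JSON error path" mentioned above can be illustrated with a small stand-alone sketch: the native side serializes an error class plus its message parameters so the JVM side can raise the matching SparkRuntimeException. The field names and JSON shape here are assumptions for illustration, not the PR's actual wire format.

```rust
/// Hypothetical sketch: build a JSON payload carrying the Spark error class
/// and the parameters for _LEGACY_ERROR_TEMP_2093. A real implementation
/// would use a JSON library and proper string escaping.
fn to_spark_error_json(error_class: &str, required: &str, matched: &[&str]) -> String {
    let fields = matched
        .iter()
        .map(|f| format!("\"{}\"", f))
        .collect::<Vec<_>>()
        .join(", ");
    format!(
        "{{\"errorClass\":\"{}\",\"requiredFieldName\":\"{}\",\"matchedFields\":[{}]}}",
        error_class, required, fields
    )
}

fn main() {
    let json = to_spark_error_json("_LEGACY_ERROR_TEMP_2093", "b", &["B", "b"]);
    // The JVM-side converter would parse this and call the matching
    // QueryExecutionErrors helper to raise a SparkRuntimeException.
    assert!(json.contains("\"errorClass\":\"_LEGACY_ERROR_TEMP_2093\""));
    println!("{}", json);
}
```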
Documentation

Updated docs/source/contributor-guide/parquet_scans.md to reflect that duplicate field errors in physical Parquet files now produce the correct Spark exception type.

How are these changes tested?
All 5 enabled tests were run locally against Spark 3.5.8 with ENABLE_COMET=true ENABLE_COMET_ONHEAP=true -Dspark.comet.scan.impl=native_datafusion and pass. The diff was regenerated following the documented process and verified to apply cleanly.