Conversation
9abf7c7 to
1e3e8a7
Compare
data/src/test/java/org/apache/iceberg/data/TestBaseFormatModel.java
Outdated
Show resolved
Hide resolved
| EqualityDeleteWriter<W> writer = | ||
| writerBuilder | ||
| .schema(TestBase.SCHEMA) | ||
| .engineSchema(writeEngineSchema(TestBase.SCHEMA)) |
There was a problem hiding this comment.
Why is this added?
I remember similar issues when I was working on the Spark model, but I also remember fixing it.
Do we need this at this point?
There was a problem hiding this comment.
Tests fail for AVRO without engineSchema with the error java.lang.IllegalArgumentException: Invalid struct: null is not a struct.
When I checked the code:
- For
AVRO,engineSchemais passed directly toSparkAvroWriterwith no null fallback (SparkFormatModels.java line 43) - For
Parquet: SparkParquetWriters.buildWriter has a fallback, if engineSchema is null, it converts from icebergSchema (SparkParquetWriters.java line 89)
This is according to my understanding, please correct me if I am incorrect. Should I keep engineSchema in the tests, or should AVRO have a similar fallback?
There was a problem hiding this comment.
We should create a similar fallback for Avro in an independent PR.
This is why these tests are good!
data/src/test/java/org/apache/iceberg/data/TestBaseFormatModel.java
Outdated
Show resolved
Hide resolved
| InputFile inputFile = encryptedFile.encryptingOutputFile().toInputFile(); | ||
| List<R> readRecords; | ||
| try (CloseableIterable<R> reader = | ||
| FormatModelRegistry.readBuilder(fileFormat, readType(), inputFile) |
There was a problem hiding this comment.
We don't need engine specific reader for the positional deletes. We can just read with the generic reader.
data/src/test/java/org/apache/iceberg/data/TestBaseFormatModel.java
Outdated
Show resolved
Hide resolved
1e3e8a7 to
42ab761
Compare
rambleraptor
left a comment
There was a problem hiding this comment.
Loving the direction this is going!
|
|
||
| @ParameterizedTest | ||
| @FieldSource("FILE_FORMATS") | ||
| public void testDataWriterRoundTrip(FileFormat fileFormat) throws IOException { |
There was a problem hiding this comment.
What would you think about creating a roundTrip method (or possibly several depending on the types)? Most of these roundTrip methods are trying to do the same things.
My gut feeling is that we'd use the roundTrip methods on a lot of different tests.
| public class TestGenericFormatModels { | ||
| private static final List<Record> TEST_RECORDS = | ||
| RandomGenericData.generate(TestBase.SCHEMA, 10, 1L); | ||
| public abstract class TestBaseFormatModel<T> { |
There was a problem hiding this comment.
Make sure that the visibility modifiers as strict as possible for classes, methods, attributes
| format -> | ||
| Arrays.stream(DataGenerators.ALL) | ||
| .map(generator -> Arguments.of(format, generator))) | ||
| .collect(Collectors.toList()); |
There was a problem hiding this comment.
| .collect(Collectors.toList()); | |
| .toList(); |
| fileIO.newOutputFile("test-file"), EncryptionKeyMetadata.EMPTY); | ||
| } | ||
|
|
||
| protected List<T> convertToEngineRecords(List<Record> records, Schema schema) { |
There was a problem hiding this comment.
Private? Will we overwrite this mehtod as we have expose convertToEngine ?
|
|
||
| @ParameterizedTest | ||
| @FieldSource("FORMAT_AND_GENERATOR") | ||
| public void testDataWriterEngineWriteGenericRead( |
There was a problem hiding this comment.
Since we set DataGenerator as package private, this method should we set package private or set DataGenerator public ?
| .spec(PartitionSpec.unpartitioned()) | ||
| .build(); | ||
|
|
||
| Schema schema = dataGenerator.schema(); |
There was a problem hiding this comment.
Do we need this? Or just use this instead of dataGenerator.schema().
| DataWriter<Record> writer = | ||
| writerBuilder.schema(dataGenerator.schema()).spec(PartitionSpec.unpartitioned()).build(); | ||
|
|
||
| Schema schema = dataGenerator.schema(); |
data/src/test/java/org/apache/iceberg/data/BaseFormatModelTests.java
Outdated
Show resolved
Hide resolved
| readRecords = ImmutableList.copyOf(reader); | ||
| } | ||
|
|
||
| assertEqualsGenericToEngine(dataGenerator.schema().asStruct(), genericRecords, readRecords); |
| } | ||
|
|
||
| DataTestHelpers.assertEquals( | ||
| positionDeleteSchema.asStruct(), genericPositionDeletes(positionDeleteSchema), readRecords); |
There was a problem hiding this comment.
Why is genericPositionDeletes instead of positionDeletes?
| protected abstract void assertEqualsEngineToGeneric( | ||
| Types.StructType struct, List<T> expected, List<Record> actual); | ||
|
|
||
| protected abstract void assertEqualsGenericToEngine( | ||
| Types.StructType struct, List<Record> expected, List<T> actual); |
There was a problem hiding this comment.
Do we need these?
I think after the Record -> Engine conversion we will just compare the Record to Record for writes, and the Engine to Engine for the reads.
spark/v4.1/spark/src/test/java/org/apache/iceberg/spark/data/InternalRowConverter.java
Show resolved
Hide resolved
spark/v4.1/spark/src/test/java/org/apache/iceberg/spark/data/InternalRowConverter.java
Show resolved
Hide resolved
…ations with Generic, Spark, and Flink tests
Co-authored-by: pvary <peter.vary.apache@gmail.com>
2798462 to
0894342
Compare
| @FieldSource("FORMAT_AND_GENERATOR") | ||
| void testDataWriterEngineWriteGenericRead(FileFormat fileFormat, DataGenerator dataGenerator) | ||
| throws IOException { | ||
| // Write with engine type T, read with Generic Record |
There was a problem hiding this comment.
This is basically a method comment.
Could we move to the method javadoc?
| @FieldSource("FORMAT_AND_GENERATOR") | ||
| void testDataWriterGenericWriteEngineRead(FileFormat fileFormat, DataGenerator dataGenerator) | ||
| throws IOException { | ||
| // Write with Generic Record, read with engine type T |
There was a problem hiding this comment.
This is basically a method comment.
Could we move to the method javadoc?
| @FieldSource("FORMAT_AND_GENERATOR") | ||
| void testEqualityDeleteWriterEngineWriteGenericRead( | ||
| FileFormat fileFormat, DataGenerator dataGenerator) throws IOException { | ||
| // Write with engine type T, read with Generic Record |
There was a problem hiding this comment.
This is basically a method comment.
Could we move to the method javadoc?
| @FieldSource("FORMAT_AND_GENERATOR") | ||
| void testEqualityDeleteWriterGenericWriteEngineRead( | ||
| FileFormat fileFormat, DataGenerator dataGenerator) throws IOException { | ||
| // Write with Generic Record, read with engine type T |
There was a problem hiding this comment.
This is basically a method comment.
Could we move to the method javadoc?
| @ParameterizedTest | ||
| @FieldSource("FILE_FORMATS") | ||
| void testPositionDeleteWriterEngineWriteGenericRead(FileFormat fileFormat) throws IOException { | ||
| // Write position deletes, read with Generic Record |
There was a problem hiding this comment.
This is basically a method comment.
Could we move to the method javadoc?
| import org.apache.spark.unsafe.types.UTF8String; | ||
|
|
||
| /** Converts Iceberg Record to Spark InternalRow for testing. */ | ||
| public class InternalRowConverter extends CustomOrderSchemaVisitor<Object> { |
There was a problem hiding this comment.
This was my mistake. Could you please revert back to your original converter solution?
|
Can we move this PR to "Ready to review"? |
Adds base test class and tests for
FormatModelimplementations, with aDataGeneratorpattern for testing different schema types.Changes
BaseFormatModelTests<T>- Abstract base test class parameterized by engine typeDataGenerator- Interface for generating test data with schemaDataGenerators- Collection of data generatorsInternalRowConverter- Converts Iceberg Record to Spark InternalRowTestSparkFormatModel- Spark implementation of format model testsTestFlinkFormatModel- Flink implementation of format model testsPart of #15415