Skip to content

Add TCK for File Format API#15441

Open
joyhaldar wants to merge 9 commits intoapache:mainfrom
joyhaldar:file-format-api-tck
Open

Add TCK for File Format API#15441
joyhaldar wants to merge 9 commits intoapache:mainfrom
joyhaldar:file-format-api-tck

Conversation

@joyhaldar
Copy link
Contributor

@joyhaldar joyhaldar commented Feb 25, 2026

Adds base test class and tests for FormatModel implementations, with a DataGenerator pattern for testing different schema types.

Changes

  • BaseFormatModelTests<T> - Abstract base test class parameterized by engine type
  • DataGenerator - Interface for generating test data with schema
  • DataGenerators - Collection of data generators
  • InternalRowConverter - Converts Iceberg Record to Spark InternalRow
  • TestSparkFormatModel - Spark implementation of format model tests
  • TestFlinkFormatModel - Flink implementation of format model tests

Part of #15415

EqualityDeleteWriter<W> writer =
writerBuilder
.schema(TestBase.SCHEMA)
.engineSchema(writeEngineSchema(TestBase.SCHEMA))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this added?
I remember similar issues when I was working on the Spark model, but I also remember fixing it.
Do we need this at this point?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests fail for AVRO without engineSchema with the error java.lang.IllegalArgumentException: Invalid struct: null is not a struct.

When I checked the code:

This is according to my understanding, please correct me if I am incorrect. Should I keep engineSchema in the tests, or should AVRO have a similar fallback?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should create a similar fallback for Avro in an independent PR.
This is why these tests are good!

InputFile inputFile = encryptedFile.encryptingOutputFile().toInputFile();
List<R> readRecords;
try (CloseableIterable<R> reader =
FormatModelRegistry.readBuilder(fileFormat, readType(), inputFile)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need engine specific reader for the positional deletes. We can just read with the generic reader.

Copy link
Contributor

@rambleraptor rambleraptor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Loving the direction this is going!


@ParameterizedTest
@FieldSource("FILE_FORMATS")
public void testDataWriterRoundTrip(FileFormat fileFormat) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would you think about creating a roundTrip method (or possibly several depending on the types)? Most of these roundTrip methods are trying to do the same things.

My gut feeling is that we'd use the roundTrip methods on a lot of different tests.

public class TestGenericFormatModels {
private static final List<Record> TEST_RECORDS =
RandomGenericData.generate(TestBase.SCHEMA, 10, 1L);
public abstract class TestBaseFormatModel<T> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure that the visibility modifiers as strict as possible for classes, methods, attributes

format ->
Arrays.stream(DataGenerators.ALL)
.map(generator -> Arguments.of(format, generator)))
.collect(Collectors.toList());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.collect(Collectors.toList());
.toList();

fileIO.newOutputFile("test-file"), EncryptionKeyMetadata.EMPTY);
}

protected List<T> convertToEngineRecords(List<Record> records, Schema schema) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Private? Will we overwrite this mehtod as we have expose convertToEngine ?


@ParameterizedTest
@FieldSource("FORMAT_AND_GENERATOR")
public void testDataWriterEngineWriteGenericRead(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we set DataGenerator as package private, this method should we set package private or set DataGenerator public ?

.spec(PartitionSpec.unpartitioned())
.build();

Schema schema = dataGenerator.schema();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this? Or just use this instead of dataGenerator.schema().

DataWriter<Record> writer =
writerBuilder.schema(dataGenerator.schema()).spec(PartitionSpec.unpartitioned()).build();

Schema schema = dataGenerator.schema();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same above.

readRecords = ImmutableList.copyOf(reader);
}

assertEqualsGenericToEngine(dataGenerator.schema().asStruct(), genericRecords, readRecords);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above

}

DataTestHelpers.assertEquals(
positionDeleteSchema.asStruct(), genericPositionDeletes(positionDeleteSchema), readRecords);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is genericPositionDeletes instead of positionDeletes?

Comment on lines +64 to +68
protected abstract void assertEqualsEngineToGeneric(
Types.StructType struct, List<T> expected, List<Record> actual);

protected abstract void assertEqualsGenericToEngine(
Types.StructType struct, List<Record> expected, List<T> actual);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need these?

I think after the Record -> Engine conversion we will just compare the Record to Record for writes, and the Engine to Engine for the reads.

@joyhaldar joyhaldar force-pushed the file-format-api-tck branch from 2798462 to 0894342 Compare March 11, 2026 13:31
@FieldSource("FORMAT_AND_GENERATOR")
void testDataWriterEngineWriteGenericRead(FileFormat fileFormat, DataGenerator dataGenerator)
throws IOException {
// Write with engine type T, read with Generic Record
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is basically a method comment.
Could we move to the method javadoc?

@FieldSource("FORMAT_AND_GENERATOR")
void testDataWriterGenericWriteEngineRead(FileFormat fileFormat, DataGenerator dataGenerator)
throws IOException {
// Write with Generic Record, read with engine type T
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is basically a method comment.
Could we move to the method javadoc?

@FieldSource("FORMAT_AND_GENERATOR")
void testEqualityDeleteWriterEngineWriteGenericRead(
FileFormat fileFormat, DataGenerator dataGenerator) throws IOException {
// Write with engine type T, read with Generic Record
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is basically a method comment.
Could we move to the method javadoc?

@FieldSource("FORMAT_AND_GENERATOR")
void testEqualityDeleteWriterGenericWriteEngineRead(
FileFormat fileFormat, DataGenerator dataGenerator) throws IOException {
// Write with Generic Record, read with engine type T
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is basically a method comment.
Could we move to the method javadoc?

@ParameterizedTest
@FieldSource("FILE_FORMATS")
void testPositionDeleteWriterEngineWriteGenericRead(FileFormat fileFormat) throws IOException {
// Write position deletes, read with Generic Record
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is basically a method comment.
Could we move to the method javadoc?

import org.apache.spark.unsafe.types.UTF8String;

/** Converts Iceberg Record to Spark InternalRow for testing. */
public class InternalRowConverter extends CustomOrderSchemaVisitor<Object> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was my mistake. Could you please revert back to your original converter solution?

@pvary
Copy link
Contributor

pvary commented Mar 12, 2026

Can we move this PR to "Ready to review"?

@joyhaldar joyhaldar changed the title Add BaseFormatModelTest for FormatModel implementations Add TCK for File Format API Mar 13, 2026
@joyhaldar joyhaldar marked this pull request as ready for review March 13, 2026 06:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants