Add TCK for File Format API by joyhaldar · Pull Request #15441 · apache/iceberg

joyhaldar · 2026-02-25T11:02:51Z

Adds base test class and tests for FormatModel implementations, with a DataGenerator pattern for testing different schema types.

Changes

BaseFormatModelTests<T> - Abstract base test class parameterized by engine type
DataGenerator - Interface for generating test data with schema
DataGenerators - Collection of data generators
InternalRowConverter - Converts Iceberg Record to Spark InternalRow
TestSparkFormatModel - Spark implementation of format model tests
TestFlinkFormatModel - Flink implementation of format model tests
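
As a rough illustration of the DataGenerator pattern listed above, the sketch below pairs a schema with a way to produce rows for it. Everything in it is a stand-in: SketchSchema, SketchDataGenerator, and SketchDataGenerators are hypothetical names, and plain Java lists replace Iceberg's Schema and Record. The actual interface in this PR may look different.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical stand-in for an Iceberg schema: just an ordered list of column names. */
record SketchSchema(List<String> columns) {}

/** Hypothetical generator contract: a schema plus rows that match it. */
interface SketchDataGenerator {
  SketchSchema schema();

  /** Produces rowCount rows; each row holds one value per schema column. */
  List<List<Object>> generate(int rowCount, long seed);
}

/** Example collection of generators; names and columns are illustrative only. */
final class SketchDataGenerators {
  private SketchDataGenerators() {}

  static final SketchDataGenerator PRIMITIVES =
      new SketchDataGenerator() {
        @Override
        public SketchSchema schema() {
          return new SketchSchema(List.of("id", "name"));
        }

        @Override
        public List<List<Object>> generate(int rowCount, long seed) {
          // A fixed seed keeps the generated rows reproducible across runs.
          Random random = new Random(seed);
          List<List<Object>> rows = new ArrayList<>();
          for (int i = 0; i < rowCount; i++) {
            rows.add(List.of(random.nextLong(), "name-" + i));
          }
          return rows;
        }
      };

  static final SketchDataGenerator[] ALL = {PRIMITIVES};
}
```

A parameterized test can then iterate over SketchDataGenerators.ALL and run the same round-trip logic against each schema shape.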

Part of #15415

pvary · 2026-02-25T11:19:52Z

data/src/test/java/org/apache/iceberg/data/TestBaseFormatModel.java

+    EqualityDeleteWriter<W> writer =
+        writerBuilder
+            .schema(TestBase.SCHEMA)
+            .engineSchema(writeEngineSchema(TestBase.SCHEMA))


Why is this added?
I remember similar issues when I was working on the Spark model, but I also remember fixing it.
Do we need this at this point?

Tests fail for AVRO without engineSchema with the error java.lang.IllegalArgumentException: Invalid struct: null is not a struct.

When I checked the code:

For AVRO, engineSchema is passed directly to SparkAvroWriter with no null fallback (SparkFormatModels.java line 43).

For Parquet, SparkParquetWriters.buildWriter has a fallback: if engineSchema is null, it converts from icebergSchema (SparkParquetWriters.java line 89).

This is my understanding; please correct me if I am wrong. Should I keep engineSchema in the tests, or should AVRO have a similar fallback?

We should create a similar fallback for Avro in an independent PR.
This is why these tests are good!
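
The fallback discussed here could look roughly like the sketch below: use the engine schema when the caller supplies one, otherwise derive it from the Iceberg schema. The class and method names are hypothetical and the schema types are left generic. This is not the code in SparkParquetWriters or SparkFormatModels.

```java
import java.util.Objects;
import java.util.function.Function;

/** Hypothetical helper illustrating a null fallback for an optional engine schema. */
final class EngineSchemaFallback {
  private EngineSchemaFallback() {}

  /**
   * Returns engineSchema when it is set; otherwise converts icebergSchema with the given
   * converter. S is the Iceberg-side schema type and E is the engine-side schema type.
   */
  static <S, E> E resolve(E engineSchema, S icebergSchema, Function<S, E> converter) {
    Objects.requireNonNull(icebergSchema, "icebergSchema must not be null");
    return engineSchema != null ? engineSchema : converter.apply(icebergSchema);
  }
}
```

With a helper of this shape, a test could omit engineSchema and still get a usable writer, which is the behavior the Parquet path is described as having.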

pvary · 2026-02-25T11:25:15Z

data/src/test/java/org/apache/iceberg/data/TestBaseFormatModel.java

+    InputFile inputFile = encryptedFile.encryptingOutputFile().toInputFile();
+    List<R> readRecords;
+    try (CloseableIterable<R> reader =
+        FormatModelRegistry.readBuilder(fileFormat, readType(), inputFile)


We don't need an engine-specific reader for the positional deletes. We can just read with the generic reader.

rambleraptor

Loving the direction this is going!

rambleraptor · 2026-02-25T19:36:52Z

data/src/test/java/org/apache/iceberg/data/TestBaseFormatModel.java

+
+  @ParameterizedTest
+  @FieldSource("FILE_FORMATS")
+  public void testDataWriterRoundTrip(FileFormat fileFormat) throws IOException {


What would you think about creating a roundTrip method (or possibly several depending on the types)? Most of these roundTrip methods are trying to do the same things.

My gut feeling is that we'd use the roundTrip methods on a lot of different tests.
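
One possible shape for such a shared helper is sketched below. It is only an assumption about how the round trip could be factored out: the RoundTrip class and its two functional interfaces are hypothetical, and the real tests would go through the FormatModelRegistry builders rather than these stand-ins.

```java
import java.io.IOException;
import java.util.List;

/** Hypothetical helper: write records with one model, then read them back with another. */
final class RoundTrip {
  private RoundTrip() {}

  /** Writes records of type W and returns a handle F to what was written. */
  @FunctionalInterface
  interface Writer<W, F> {
    F write(List<W> records) throws IOException;
  }

  /** Reads records of type R back from the handle F. */
  @FunctionalInterface
  interface Reader<F, R> {
    List<R> read(F file) throws IOException;
  }

  /** Runs the write step, then the read step, and returns what was read. */
  static <W, F, R> List<R> roundTrip(List<W> records, Writer<W, F> writer, Reader<F, R> reader)
      throws IOException {
    F file = writer.write(records);
    return reader.read(file);
  }
}
```

Each test would then differ only in which writer, reader, and assertion it plugs in, for example an engine writer paired with a generic reader.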

pvary · 2026-02-26T12:55:26Z

data/src/test/java/org/apache/iceberg/data/TestBaseFormatModel.java

-public class TestGenericFormatModels {
-  private static final List<Record> TEST_RECORDS =
-      RandomGenericData.generate(TestBase.SCHEMA, 10, 1L);
+public abstract class TestBaseFormatModel<T> {


Make sure that the visibility modifiers are as strict as possible for classes, methods, and attributes.

Guosmilesmile · 2026-03-02T06:22:41Z

data/src/test/java/org/apache/iceberg/data/BaseFormatModelTests.java

+              format ->
+                  Arrays.stream(DataGenerators.ALL)
+                      .map(generator -> Arguments.of(format, generator)))
+          .collect(Collectors.toList());


Suggested change

-          .collect(Collectors.toList());
+          .toList();
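
For context, the code under review builds the cross product of file formats and data generators with flatMap, and Stream.toList() (available since Java 16) can replace collect(Collectors.toList()) when an unmodifiable result is acceptable. The sketch below uses plain strings and a hypothetical FormatAndGenerator record in place of FileFormat, DataGenerator, and JUnit's Arguments.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for JUnit's Arguments: one (format, generator) combination. */
record FormatAndGenerator(String format, String generator) {}

final class CrossProductSketch {
  private CrossProductSketch() {}

  /** Builds every (format, generator) combination in format-major order. */
  static List<FormatAndGenerator> combinations(String[] formats, String[] generators) {
    return Arrays.stream(formats)
        .flatMap(
            format ->
                Arrays.stream(generators)
                    .map(generator -> new FormatAndGenerator(format, generator)))
        .toList(); // Stream.toList() returns an unmodifiable list (Java 16+)
  }
}
```

Because the result only feeds a parameterized test source and is never mutated, the unmodifiable list should be a safe substitute here.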

Guosmilesmile · 2026-03-02T06:24:26Z

data/src/test/java/org/apache/iceberg/data/BaseFormatModelTests.java

+            fileIO.newOutputFile("test-file"), EncryptionKeyMetadata.EMPTY);
+  }
+
+  protected List<T> convertToEngineRecords(List<Record> records, Schema schema) {


Private? Will we override this method, given that we already expose convertToEngine?

Guosmilesmile · 2026-03-02T06:36:51Z

data/src/test/java/org/apache/iceberg/data/BaseFormatModelTests.java

+
+  @ParameterizedTest
+  @FieldSource("FORMAT_AND_GENERATOR")
+  public void testDataWriterEngineWriteGenericRead(


Since DataGenerator is package-private, should this method be package-private as well, or should DataGenerator be made public?

Guosmilesmile · 2026-03-02T06:55:46Z

data/src/test/java/org/apache/iceberg/data/BaseFormatModelTests.java

+            .spec(PartitionSpec.unpartitioned())
+            .build();
+
+    Schema schema = dataGenerator.schema();


Do we need this variable? If we keep it, use it everywhere instead of calling dataGenerator.schema() again.

Guosmilesmile · 2026-03-02T06:59:45Z

data/src/test/java/org/apache/iceberg/data/BaseFormatModelTests.java

+    DataWriter<Record> writer =
+        writerBuilder.schema(dataGenerator.schema()).spec(PartitionSpec.unpartitioned()).build();
+
+    Schema schema = dataGenerator.schema();


Same as above.

pvary · 2026-03-02T09:54:34Z

data/src/test/java/org/apache/iceberg/data/BaseFormatModelTests.java

+      readRecords = ImmutableList.copyOf(reader);
+    }
+
+    assertEqualsGenericToEngine(dataGenerator.schema().asStruct(), genericRecords, readRecords);


Same as above

pvary · 2026-03-02T09:55:42Z

data/src/test/java/org/apache/iceberg/data/BaseFormatModelTests.java

+    }
+
+    DataTestHelpers.assertEquals(
+        positionDeleteSchema.asStruct(), genericPositionDeletes(positionDeleteSchema), readRecords);


Why is this genericPositionDeletes instead of positionDeletes?

pvary · 2026-03-02T09:59:07Z

data/src/test/java/org/apache/iceberg/data/BaseFormatModelTests.java

+  protected abstract void assertEqualsEngineToGeneric(
+      Types.StructType struct, List<T> expected, List<Record> actual);
+
+  protected abstract void assertEqualsGenericToEngine(
+      Types.StructType struct, List<Record> expected, List<T> actual);


Do we need these?

I think after the Record -> Engine conversion we will just compare the Record to Record for writes, and the Engine to Engine for the reads.
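
The idea can be sketched as follows: convert the generic records into the engine type once, then compare like with like, so no cross-type assertion helpers are needed. The names below are hypothetical and the record types are left generic. In the real tests the Record-to-Record case would presumably rely on an existing helper such as DataTestHelpers.assertEquals.

```java
import java.util.List;
import java.util.function.Function;

/** Hypothetical sketch: checks that only ever compare values of the same type. */
final class SameTypeAssertions {
  private SameTypeAssertions() {}

  /** Engine write, generic read: the rows read back are generic, so compare generic to generic. */
  static <G> boolean matchesAfterEngineWrite(List<G> genericInput, List<G> genericRead) {
    return genericInput.equals(genericRead);
  }

  /** Generic write, engine read: convert the generic input first, then compare engine to engine. */
  static <G, E> boolean matchesAfterEngineRead(
      List<G> genericInput, List<E> engineRead, Function<G, E> convertToEngine) {
    List<E> expected = genericInput.stream().map(convertToEngine).toList();
    return expected.equals(engineRead);
  }
}
```

With this split, each engine only has to supply the conversion function and an equality check for its own row type.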

pvary · 2026-03-12T12:11:27Z

data/src/test/java/org/apache/iceberg/data/BaseFormatModelTests.java

+  @FieldSource("FORMAT_AND_GENERATOR")
+  void testDataWriterEngineWriteGenericRead(FileFormat fileFormat, DataGenerator dataGenerator)
+      throws IOException {
+    // Write with engine type T, read with Generic Record


This is basically a method comment.
Could we move it to the method Javadoc?

pvary · 2026-03-12T12:11:50Z

data/src/test/java/org/apache/iceberg/data/BaseFormatModelTests.java

+  @FieldSource("FORMAT_AND_GENERATOR")
+  void testDataWriterGenericWriteEngineRead(FileFormat fileFormat, DataGenerator dataGenerator)
+      throws IOException {
+    // Write with Generic Record, read with engine type T


This is basically a method comment.
Could we move it to the method Javadoc?

pvary · 2026-03-12T12:12:05Z

data/src/test/java/org/apache/iceberg/data/BaseFormatModelTests.java

+  @FieldSource("FORMAT_AND_GENERATOR")
+  void testEqualityDeleteWriterEngineWriteGenericRead(
+      FileFormat fileFormat, DataGenerator dataGenerator) throws IOException {
+    // Write with engine type T, read with Generic Record


This is basically a method comment.
Could we move it to the method Javadoc?

pvary · 2026-03-12T12:12:18Z

data/src/test/java/org/apache/iceberg/data/BaseFormatModelTests.java

+  @FieldSource("FORMAT_AND_GENERATOR")
+  void testEqualityDeleteWriterGenericWriteEngineRead(
+      FileFormat fileFormat, DataGenerator dataGenerator) throws IOException {
+    // Write with Generic Record, read with engine type T


This is basically a method comment.
Could we move it to the method Javadoc?

pvary · 2026-03-12T12:12:33Z

data/src/test/java/org/apache/iceberg/data/BaseFormatModelTests.java

+  @ParameterizedTest
+  @FieldSource("FILE_FORMATS")
+  void testPositionDeleteWriterEngineWriteGenericRead(FileFormat fileFormat) throws IOException {
+    // Write position deletes, read with Generic Record


This is basically a method comment.
Could we move it to the method Javadoc?

pvary · 2026-03-12T12:13:00Z

spark/v4.1/spark/src/test/java/org/apache/iceberg/spark/data/InternalRowConverter.java

+import org.apache.spark.unsafe.types.UTF8String;
+
+/** Converts Iceberg Record to Spark InternalRow for testing. */
+public class InternalRowConverter extends CustomOrderSchemaVisitor<Object> {


This was my mistake. Could you please revert back to your original converter solution?

pvary · 2026-03-12T15:59:19Z

Can we move this PR to "Ready to review"?

github-actions bot added spark data flink labels Feb 25, 2026

joyhaldar force-pushed the file-format-api-tck branch from 9abf7c7 to 1e3e8a7 Compare February 25, 2026 11:08

joyhaldar force-pushed the file-format-api-tck branch from 1e3e8a7 to 42ab761 Compare February 25, 2026 12:01

joyhaldar and others added 7 commits March 11, 2026 18:59

git push --forceAdd BaseFormatModelTest TCK for FormatModel implement…

b3e3ac0

…ations with Generic, Spark, and Flink tests

Refactor TestBaseFormatModel to single type parameter for Engine <-> …

7533252

…Generic tests

Refactor format model tests to use DataGenerator pattern

9054c02

Address review comments: simplify assertions to compare same types

e1581ea

Co-authored-by: pvary <peter.vary.apache@gmail.com>

Declare the variable when it is set

bbd06b7

Refactor InternalRowConverter to use CustomOrderSchemaVisitor pattern

1d4707f

Fix checkstyle

0894342

joyhaldar force-pushed the file-format-api-tck branch from 2798462 to 0894342 Compare March 11, 2026 13:31

pvary reviewed Mar 12, 2026

View reviewed changes

Address review comments, move inline comments to Javadoc, revert Inte…

2079fef

…rnalRowConverter

joyhaldar changed the title from "Add BaseFormatModelTest for FormatModel implementations" to "Add TCK for File Format API" on Mar 13, 2026

joyhaldar marked this pull request as ready for review March 13, 2026 06:26

Update the switch

88b92d4
