Derive names of generated nested markers from column names by koperagen · Pull Request #1702 · Kotlin/dataframe

koperagen · 2026-02-19T14:37:22Z

First iteration of usability improvements for generated data schemas. One way to make the output more ready-to-use is better names. I think it makes a lot of sense to use the properly cased property name for nested schemas. Besides that, I plan to use this change in the ".cast + exception workflow" we discussed earlier.

Before:

@DataSchema interface _DataFrameType {
    val members: List<_DataFrameType1>
    val repos: List<_DataFrameType2>
}
@DataSchema interface _DataFrameType1 { val login: String; val url: String }
@DataSchema interface _DataFrameType2 {
    val license: _DataFrameType4
    val contributors: List<_DataFrameType3>
}
@DataSchema interface _DataFrameType3 { val contributions: String; val login: String }
@DataSchema interface _DataFrameType4 { val key: String?; val name: String? }

After:

@DataSchema interface _DataFrameType {
    val members: List<Members>
    val repos: List<Repos>
}
@DataSchema interface Members { val login: String; val url: String }
@DataSchema interface Repos {
    val license: License
    val contributors: List<Contributors>
}
@DataSchema interface Contributors { val contributions: String; val login: String }
@DataSchema interface License { val key: String?; val name: String? }
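
The renaming shown above can be sketched as follows. This is only an illustration of the idea (take the column name and capitalize its first character), not the PR's actual implementation; the helper name `markerNameFor` is hypothetical.

```kotlin
// Hypothetical helper: derives a marker (interface) name from a column name
// by upper-casing its first character. Not the library's actual function.
fun markerNameFor(columnName: String): String =
    columnName.replaceFirstChar { it.uppercaseChar() }

fun main() {
    // Column names taken from the example above.
    val columns = listOf("members", "repos", "license", "contributors")
    println(columns.map(::markerNameFor)) // [Members, Repos, License, Contributors]
}
```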

If we want, we can add a parameter to generate names of the prefix + numerical suffix kind, as we used to. The same goes for notebooks: we could do it there. But I checked how it looks, and IMO the new naming is clearer even in notebooks.

Jolanrensen · 2026-02-23T12:33:23Z

If we want, we can add a parameter to generate names of the prefix + numerical suffix kind, as we used to. The same goes for notebooks: we could do it there. But I checked how it looks, and IMO the new naming is clearer even in notebooks.

After running some tests, I think there's already some numerical suffix added in case of clashes :) I was worried about that at first, but it seems to work great in its current state!

For instance, a DF like:

nested1:
    nameAndCity:
        name: String
        city: String?

nested2:
    nameAndCity:
        name: String
        city2: String?

produces something like:

@DataSchema
interface _DataFrameType {
    val nested1: Nested1
    val nested2: Nested2
}

@DataSchema(isOpen = false)
interface Nested1 {
    val nameAndCity: NameAndCity
}
@DataSchema(isOpen = false)
interface NameAndCity {
    val city: String?
    val name: String
}

@DataSchema(isOpen = false)
interface Nested2 {
    val nameAndCity: NameAndCity1
}
@DataSchema(isOpen = false)
interface NameAndCity1 {
    val city2: String?
    val name: String
}

neatly avoiding a NameAndCity name clash :)

(I'm not sure that's exactly what you meant, but I'm at least glad it works correctly :) )
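
A rough sketch of the clash handling observed above, under stated assumptions: the `MarkerNames` class and its representation of a schema as a map of property name to type string are hypothetical and are not the library's internals. A name that is already taken by a different schema gets a numerical suffix, and an identical schema reuses the existing name.

```kotlin
// Hypothetical registry, assumed for illustration only: hands out unique marker
// names and reuses a name when the same schema was already registered under it.
class MarkerNames {
    // marker name -> schema (modelled as property name -> type string)
    private val taken = mutableMapOf<String, Map<String, String>>()

    fun nameFor(baseName: String, schema: Map<String, String>): String {
        var suffix = 0
        var candidate = baseName
        // Skip names that are already taken by a different schema.
        while (candidate in taken && taken[candidate] != schema) {
            suffix++
            candidate = "$baseName$suffix" // NameAndCity1, NameAndCity2, ...
        }
        taken[candidate] = schema
        return candidate
    }
}

fun main() {
    val names = MarkerNames()
    val first = names.nameFor("NameAndCity", mapOf("name" to "String", "city" to "String?"))
    val second = names.nameFor("NameAndCity", mapOf("name" to "String", "city2" to "String?"))
    val again = names.nameFor("NameAndCity", mapOf("name" to "String", "city" to "String?"))
    println(listOf(first, second, again)) // [NameAndCity, NameAndCity1, NameAndCity]
}
```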

core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/api/generateCode.kt

Jolanrensen · 2026-02-23T12:39:37Z

core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/api/constructors.kt

 @Interpretable("PathOf")
-public fun pathOf(vararg columnNames: String): ColumnPath = ColumnPath(columnNames.asList())
+public fun pathOf(vararg columnNames: String): ColumnPath =
+    if (columnNames.isEmpty()) ColumnPath.EMPTY else ColumnPath(columnNames.asList())


I was wondering if we could somehow put this logic in the constructor of ColumnPath, but that likely needs a factory function of some sort, not ideal :/

Yes, it turned out to be too bothersome to include in this PR :( Maybe as a separate issue.
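
To illustrate why a factory function comes up here: a constructor call always creates a new object, so returning a shared empty instance has to happen in a function in front of it. The sketch below uses a hypothetical `NamePath` class as a stand-in; it is not the library's ColumnPath.

```kotlin
// Hypothetical stand-in for a path type; NOT the library's ColumnPath.
class NamePath private constructor(val parts: List<String>) {
    companion object {
        // Shared instance, so empty paths do not allocate a new object each time.
        val EMPTY = NamePath(emptyList())

        // Factory: unlike a constructor, it may return the cached instance.
        fun of(vararg names: String): NamePath =
            if (names.isEmpty()) EMPTY else NamePath(names.asList())
    }
}

fun main() {
    println(NamePath.of() === NamePath.of()) // true: same shared instance
    println(NamePath.of("a", "b").parts)     // [a, b]
}
```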

zaleslaw · 2026-02-25T10:39:07Z

First of all, thank you for the excellent idea. I’m confident that many users will appreciate it and that it will save them time.

I will share my opinion. When I see property-based names, I get the impression that these classes were taken from somewhere in my codebase or generated by AI. When I see a name like DataFrame1, it feels more deterministic, as if it follows an algorithmic approach. That gives me a stronger sense of trust and control.

It would be good to keep the previous option available via a parameter or configuration.

On the other hand, how should we handle situations where generated names clash with existing classes? In the case of DataFrame1, we would only collide with our own classes, since there is no random sequence involved. But with names like Members or Names, we could easily clash with existing domain entities in the application.

Finally, in the JVM ecosystem, class names are usually singular. So it would be Contributor, not Contributors.

@DataSchema interface DataFrameType {
    val members: List<Member>
    val repos: List<Repo>
}

@DataSchema interface Member {
    val login: String
    val url: String
}

@DataSchema interface Repo {
    val license: License
    val contributors: List<Contributor>
}

@DataSchema interface Contributor {
    val contributions: String
    val login: String
}

@DataSchema interface License {
    val key: String?
    val name: String?
}

zaleslaw

Make the naming strategy configurable so the previous deterministic approach (e.g., DataFrame1) remains available and backward compatible.
Introduce a clear collision-avoidance mechanism for property-based names to prevent clashes with existing domain classes.
Align generated class names with JVM conventions by using singular nouns (e.g., Contributor instead of Contributors).

Jolanrensen · 2026-02-25T12:36:18Z

@zaleslaw See https://github.com/Kotlin/dataframe/blob/a0f9654076f9398c347f6b95eb31eef3a1e53ca4/core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/codeGen/MarkerNameProvider.kt for the name logic. I don't see how simply taking the name and capitalizing it could be seen as non-deterministic. Gradle does it all the time, for instance. The trailing s could be removed in theory, but it could also cause unexpected results if your entity is called "virus", "bus", "gas" or idk. Now that's non-deterministic. There you would actually need AI or a dictionary ;)

Also, like I mentioned, I tested name clash behavior. It's baked into our generation already, so you'd get Members, Members1 etc. Or, if the schema exactly matches, the same Members interface will be used for both occurrences.
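
The trailing-s concern can be made concrete with a deliberately naive rule. The function `naiveSingular` is an assumption made up for this illustration; nothing like it exists in the PR.

```kotlin
// Naive singularization, assumed for illustration: drop one trailing "s".
fun naiveSingular(name: String): String = name.removeSuffix("s")

fun main() {
    for (name in listOf("Contributors", "Members", "Virus", "Bus", "Gas")) {
        println("$name -> ${naiveSingular(name)}")
    }
    // Contributors -> Contributor
    // Members -> Member
    // Virus -> Viru
    // Bus -> Bu
    // Gas -> Ga
}
```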

koperagen · 2026-03-02T16:29:40Z

Make the naming strategy configurable so the previous deterministic approach (e.g., DataFrame1) remains available and backward compatible.

Added a nestedMarkerNameProvider parameter to all codegen functions. MarkerNameProvider.PredefinedName will generate our usual Marker1, Marker2 style names. The default is now the new MarkerNameProvider.fromColumnName.

Introduce a clear collision-avoidance mechanism for property-based names to prevent clashes with existing domain classes.

On top of our numerical suffix system, which avoids collisions while generating a single schema, I changed code generation to produce nested declarations, so all nested schemas now live in their own scope. Even if a user has multiple schemas with similarly named structural columns, there will be no conflicts.

Align generated class names with JVM conventions by using singular nouns (e.g., Contributor instead of Contributors).

It doesn't seem possible to achieve the desired result with an algorithmic approach :( But since the declarations are nested, it will be very easy to rename them. Once they are renamed, I plan to have cast preserve user-defined names for nested columns.

@DataSchema
data class MySchema(val name: UserName) {

  @DataSchema
  class UserName(val firstName: String)
}

Run main multiple times. If the schema changes, new code is generated and thrown as an exception. Codegen reuses the names of nested declarations:

fun main() {
  DataFrame.readCsv().cast<MySchema>() // codegen will be triggered if cast fails
}

The regenerated schema preserves UserName:

@DataSchema
data class MySchema(val name: UserName) {

  @DataSchema
  class UserName(val firstName: String?)
}
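
The scoping argument for nested declarations can be sketched in plain Kotlin. Plain interfaces stand in for @DataSchema declarations so that the example compiles without the DataFrame library, and all names (`Organization`, `Team`, `Members`) are made up for illustration.

```kotlin
// Plain interfaces stand in for @DataSchema declarations here, so the
// sketch compiles without the DataFrame library. All names are illustrative.
interface Organization {
    val members: List<Members>

    // Refers to Organization.Members
    interface Members { val login: String }
}

interface Team {
    val members: List<Members>

    // Refers to Team.Members: same simple name, different scope, no clash
    interface Members { val nickname: String }
}

fun main() {
    val orgMember = object : Organization.Members { override val login = "octocat" }
    val teamMember = object : Team.Members { override val nickname = "cat" }
    println("${orgMember.login} ${teamMember.nickname}") // octocat cat
}
```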

Jolanrensen · 2026-03-02T19:16:04Z

core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/codeGen/SchemaProcessorImpl.kt

        if (existingMarker != null) {
            return existingMarker
        }
        val baseName = when (val provider = nestedMarkerNameProvider) {


May I suggest a small refactor to decrease the nested logic?

val baseName = when (val provider = nestedMarkerNameProvider) {
    is MarkerNameProvider.GeneratedName if columnPath.isNotEmpty() -> provider(columnPath)
    else -> namePrefix
}
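
For readers who want to try the decision logic in isolation, here is a self-contained sketch. The `NameProvider` hierarchy and the `baseName` function are hypothetical models, not the library's MarkerNameProvider API, and the branch is written without guard syntax so that it also compiles on Kotlin versions that lack guard conditions in `when`.

```kotlin
// Hypothetical model of the two naming strategies; the type and member names
// are assumptions for this sketch and differ from the library's actual API.
sealed interface NameProvider {
    // Always falls back to the generic prefix.
    object Predefined : NameProvider

    // Derives the name from the column path.
    class Generated(private val derive: (List<String>) -> String) : NameProvider {
        operator fun invoke(path: List<String>): String = derive(path)
    }
}

fun baseName(provider: NameProvider, path: List<String>, namePrefix: String): String =
    when {
        // Same decision as the guarded branch: derive a name only for a
        // generating provider and a non-empty path.
        provider is NameProvider.Generated && path.isNotEmpty() -> provider(path)
        else -> namePrefix
    }

fun main() {
    val fromColumn = NameProvider.Generated { path ->
        path.last().replaceFirstChar { it.uppercaseChar() }
    }
    println(baseName(fromColumn, listOf("repos", "license"), "Marker"))   // License
    println(baseName(fromColumn, emptyList(), "Marker"))                  // Marker
    println(baseName(NameProvider.Predefined, listOf("repos"), "Marker")) // Marker
}
```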

Jolanrensen · 2026-03-02T19:23:12Z

On top of our numerical suffix system, which avoids collisions while generating a single schema, I changed code generation to produce nested declarations, so all nested schemas now live in their own scope. Even if a user has multiple schemas with similarly named structural columns, there will be no conflicts.

How does this behave in notebooks? I don't seem to see any nesting behavior when tracking execution there, though the names of the nested types do seem to appear in top-level dataschema interfaces.
I also wonder how that would treat the polymorphic behavior of the schemas there.

Jolanrensen · 2026-03-02T19:24:41Z

core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/api/generateCode.kt

    visibility: MarkerVisibility = MarkerVisibility.IMPLICIT_PUBLIC,
    useFqNames: Boolean = false,
    nameNormalizer: NameNormalizer = NameNormalizer.default,
+    nestedMarkerNameProvider: MarkerNameProvider = MarkerNameProvider.fromColumnName,


Note, this is a binary-breaking API change!
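
Some background on why this is binary-breaking: adding a parameter, even one with a default value, changes the compiled JVM signature, so callers compiled against the old signature would fail at link time. One common mitigation is to keep the old signature as a hidden overload. The sketch below shows that pattern with made-up functions (`generateName`, `capitalized`); whether the PR should do this is a separate decision, and nothing here is the library's API.

```kotlin
// Hypothetical default naming strategy for this sketch.
fun capitalized(name: String): String = name.replaceFirstChar { it.uppercaseChar() }

// New signature: one more parameter with a default. Source-compatible, but the
// compiled JVM method has a different descriptor than before.
fun generateName(
    column: String,
    prefix: String = "",
    nameProvider: (String) -> String = ::capitalized,
): String = prefix + nameProvider(column)

// Old signature kept only for already-compiled callers. HIDDEN removes it from
// overload resolution in source code, so new code always sees the function above.
@Deprecated("Kept for binary compatibility", level = DeprecationLevel.HIDDEN)
fun generateName(column: String, prefix: String = ""): String =
    generateName(column, prefix, ::capitalized)

fun main() {
    println(generateName("members"))                     // Members
    println(generateName("members", "My"))               // MyMembers
    println(generateName("members", "") { it.uppercase() }) // MEMBERS
}
```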

AndreiKingsley

Wow, this is a very cool feature! I really missed it!

koperagen · 2026-03-03T11:31:23Z

How does this behave in notebooks? I don't seem to see any nesting behavior when tracking execution there, though the names of the nested types do seem to appear in top-level dataschema interfaces.

:)) Sorry, I forgot to mention this detail. Nesting is disabled when "extensionProperties" is enabled. I'm keeping this for notebooks, just like before, so as not to break anything unexpectedly (and also because changing the generation of extension properties to refer to nested types turned out to be very bothersome).

koperagen added 2 commits February 19, 2026 15:24

Add ColumnPath.EMPTY singleton to avoid some of the internal allocations

779b5b6

Derive names of generated nested markers from column names

a0f9654

koperagen added this to the 1.0.0-Beta5 milestone Feb 19, 2026

koperagen requested review from Jolanrensen and zaleslaw February 19, 2026 14:37

koperagen self-assigned this Feb 19, 2026

koperagen added the enhancement New feature or request label Feb 19, 2026

koperagen changed the title from "Nested markers naming" to "Derive names of generated nested markers from column names" Feb 19, 2026

Jolanrensen approved these changes Feb 23, 2026

zaleslaw requested changes Feb 25, 2026

Add nested marker name provider parameter to codegen APIs to preserve previous deterministic name generation

722d577

koperagen requested review from AndreiKingsley and Jolanrensen March 2, 2026 16:30

koperagen added 3 commits March 2, 2026 18:32

Improve toString in codegen impl for debugging

8567a99

Migrate shouldBe -> assertEquals in CodeGenerationTests.kt

37f5600

Make data schema declarations nested to their root to minimize potential name conflicts for "domain" names

7180f04

koperagen force-pushed the nested-markers-naming branch from f6a2442 to 7180f04 Compare March 2, 2026 16:32

Jolanrensen reviewed Mar 2, 2026

AndreiKingsley approved these changes Mar 3, 2026

Derive names of generated nested markers from column names #1702
koperagen wants to merge 6 commits into master from nested-markers-naming

4 participants
