Derive names of generated nested markers from column names #1702

Open
koperagen wants to merge 6 commits into master from nested-markers-naming

@koperagen (Collaborator) commented Feb 19, 2026

First iteration of usability improvements for generated data schemas. One way to make the output more ready to use is better names. I think it makes a lot of sense to use the properly cased property name for nested schemas. Besides that, I plan to use this change in the ".cast + exception workflow" we discussed earlier.

Before:

@DataSchema interface _DataFrameType {
    val members: List<_DataFrameType1>
    val repos: List<_DataFrameType2>
}
@DataSchema interface _DataFrameType1 { val login: String; val url: String }
@DataSchema interface _DataFrameType2 {
    val license: _DataFrameType4
    val contributors: List<_DataFrameType3>
}
@DataSchema interface _DataFrameType3 { val contributions: String; val login: String }
@DataSchema interface _DataFrameType4 { val key: String?; val name: String? }

After:

@DataSchema interface _DataFrameType {
    val members: List<Members>
    val repos: List<Repos>
}
@DataSchema interface Members { val login: String; val url: String }
@DataSchema interface Repos {
    val license: License
    val contributors: List<Contributors>
}
@DataSchema interface Contributors { val contributions: String; val login: String }
@DataSchema interface License { val key: String?; val name: String? }

If we want, we can add a parameter to generate prefix + numerical suffix names, as we used to. The same goes for notebooks: we could do it there. But I checked what it looks like, and IMO the new naming is clearer even in notebooks.
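
A minimal sketch of the naming rule shown in the Before/After listings, assuming the marker name is simply the last column name with its first character capitalized. `markerNameFromColumn` and its `fallback` parameter are hypothetical names used for illustration; the actual logic lives in `MarkerNameProvider` and may normalize names differently.

```kotlin
// Hypothetical helper, not the library's API: derives a marker name from the
// last segment of a column path by capitalizing its first character.
fun markerNameFromColumn(columnPath: List<String>, fallback: String = "_DataFrameType"): String {
    // An empty path has no column name to derive from, so use the fallback.
    val last = columnPath.lastOrNull() ?: return fallback
    return last.replaceFirstChar { it.uppercaseChar() }
}

fun main() {
    println(markerNameFromColumn(listOf("repos", "contributors"))) // Contributors
    println(markerNameFromColumn(listOf("members")))               // Members
    println(markerNameFromColumn(emptyList()))                     // _DataFrameType
}
```

The rule is a pure function of the column path, so the same input schema always yields the same names.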

@koperagen koperagen added this to the 1.0.0-Beta5 milestone Feb 19, 2026
@koperagen koperagen self-assigned this Feb 19, 2026
@koperagen koperagen added the enhancement New feature or request label Feb 19, 2026
@koperagen koperagen changed the title Nested markers naming Derive names of generated nested markers from column names Feb 19, 2026
@Jolanrensen (Collaborator) commented

> If we want, we can add a parameter to generate prefix + numerical suffix names, as we used to. The same goes for notebooks: we could do it there. But I checked what it looks like, and IMO the new naming is clearer even in notebooks.

After running some tests, I think there's already some numerical suffix added in case of clashes :) I was worried about that at first, but it seems to work great in its current state!

For instance, a DF like:

nested1:
    nameAndCity:
        name: String
        city: String?

nested2:
    nameAndCity:
        name: String
        city2: String?

produces something like:

@DataSchema
interface _DataFrameType {
    val nested1: Nested1
    val nested2: Nested2
}

@DataSchema(isOpen = false)
interface Nested1 {
    val nameAndCity: NameAndCity
}
@DataSchema(isOpen = false)
interface NameAndCity {
    val city: String?
    val name: String
}

@DataSchema(isOpen = false)
interface Nested2 {
    val nameAndCity: NameAndCity1
}
@DataSchema(isOpen = false)
interface NameAndCity1 {
    val city2: String?
    val name: String
}

neatly avoiding a NameAndCity name clash :)

(I'm not sure that's exactly what you meant, but I'm at least glad it works correctly :) )

  @Interpretable("PathOf")
- public fun pathOf(vararg columnNames: String): ColumnPath = ColumnPath(columnNames.asList())
+ public fun pathOf(vararg columnNames: String): ColumnPath =
+     if (columnNames.isEmpty()) ColumnPath.EMPTY else ColumnPath(columnNames.asList())

Review comment (Collaborator):

I was wondering if we could somehow put this logic in the constructor of ColumnPath, but that likely needs a factory function of some sort, which is not ideal :/

Reply (Collaborator, Author):

Yes, it turned out to be too bothersome to include in this PR :( Maybe as a separate issue.

@zaleslaw (Collaborator) commented Feb 25, 2026

First of all, thank you for the excellent idea. I’m confident that many users will appreciate it and that it will save them time.

I will share my opinion. When I see property-based names, I get the impression that these classes were taken from somewhere in my codebase or generated by AI. When I see a name like DataFrame1, it feels more deterministic, as if it follows an algorithmic approach. That gives me a stronger sense of trust and control.

It would be good to keep the previous option available via a parameter or configuration.

On the other hand, how should we handle situations where generated names clash with existing classes? In the case of DataFrame1, we would only collide with our own classes, since there is no random sequence involved. But with names like Members or Names, we could easily clash with existing domain entities in the application.

Finally, in the JVM ecosystem, class names are usually singular. So it would be Contributor, not Contributors.

@DataSchema interface DataFrameType {
    val members: List<Member>
    val repos: List<Repo>
}

@DataSchema interface Member {
    val login: String
    val url: String
}

@DataSchema interface Repo {
    val license: License
    val contributors: List<Contributor>
}

@DataSchema interface Contributor {
    val contributions: String
    val login: String
}

@DataSchema interface License {
    val key: String?
    val name: String?
}

@zaleslaw (Collaborator) left a review comment

  1. Make the naming strategy configurable so the previous deterministic approach (e.g., DataFrame1) remains available and backward compatible.
  2. Introduce a clear collision-avoidance mechanism for property-based names to prevent clashes with existing domain classes.
  3. Align generated class names with JVM conventions by using singular nouns (e.g., Contributor instead of Contributors).

@Jolanrensen (Collaborator) commented Feb 25, 2026

@zaleslaw See https://github.com/Kotlin/dataframe/blob/a0f9654076f9398c347f6b95eb31eef3a1e53ca4/core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/codeGen/MarkerNameProvider.kt for the name logic. I don't see how simply taking the name and capitalizing it could be seen as non-deterministic. Gradle does it all the time, for instance. The trailing s could be removed in theory, but it could also cause unexpected results if your entity is called "virus", "bus", "gas" or idk. Now that's non-deterministic. There you would actually need AI or a dictionary ;)
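
The point about the trailing "s" can be illustrated with a short sketch. `naiveSingular` is a hypothetical rule written for this example and is not part of the library.

```kotlin
// Hypothetical naive rule: drop a single trailing "s" if there is one.
fun naiveSingular(name: String): String = name.removeSuffix("s")

fun main() {
    for (column in listOf("contributors", "members", "virus", "bus", "gas")) {
        println("$column -> ${naiveSingular(column)}")
    }
    // contributors -> contributor
    // members -> member
    // virus -> viru
    // bus -> bu
    // gas -> ga
}
```

The rule works for regular plurals and mangles words that merely end in "s", which is why a reliable version would need a dictionary.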

Also, like I mentioned, I tested name clash behavior. It's baked into our generation already, so you'd get Members, Members1 etc. Or, if the schema exactly matches, the same Members interface will be used for both occurrences.
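
The clash behavior described above can be sketched roughly as follows. `MarkerNames` is a hypothetical registry written for this example, and a schema is modeled as a plain map from column name to type name; the library's own generator is structured differently.

```kotlin
// Hypothetical registry, not the library's implementation.
class MarkerNames {
    private val bySchema = mutableMapOf<Map<String, String>, String>()
    private val usedNames = mutableSetOf<String>()

    fun nameFor(baseName: String, schema: Map<String, String>): String {
        // An identical schema reuses the marker that was already generated.
        bySchema[schema]?.let { return it }
        // Otherwise take the first free name: Base, Base1, Base2, ...
        var candidate = baseName
        var suffix = 1
        while (!usedNames.add(candidate)) {
            candidate = "$baseName${suffix++}"
        }
        bySchema[schema] = candidate
        return candidate
    }
}

fun main() {
    val names = MarkerNames()
    val first = mapOf("name" to "String", "city" to "String?")
    val second = mapOf("name" to "String", "city2" to "String?")
    println(names.nameFor("NameAndCity", first))  // NameAndCity
    println(names.nameFor("NameAndCity", second)) // NameAndCity1
    println(names.nameFor("NameAndCity", first))  // NameAndCity
}
```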

@koperagen (Collaborator, Author) commented

> Make the naming strategy configurable so the previous deterministic approach (e.g., DataFrame1) remains available and backward compatible.

Added a nestedMarkerNameProvider parameter to all codegen functions. MarkerNameProvider.PredefinedName generates our usual Marker1, Marker2 style names. The default is now the new MarkerNameProvider.fromColumnName.

> Introduce a clear collision-avoidance mechanism for property-based names to prevent clashes with existing domain classes.

On top of our numerical suffix system, which avoids collisions while generating a single schema, I changed code generation to emit nested declarations, so all nested schemas now live in their own scope. Even if a user has multiple schemas with similarly named structural columns, there will be no conflicts.
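
A minimal sketch of why nesting removes cross-schema conflicts. The `DataSchema` annotation below is a stand-in declared only so the snippet compiles on its own, and `Organization` and `Team` are hypothetical schemas.

```kotlin
// Stand-in for the library's @DataSchema annotation (assumption for this sketch).
annotation class DataSchema

@DataSchema
interface Organization {
    val members: List<Members>

    @DataSchema
    interface Members { val login: String; val url: String }
}

@DataSchema
interface Team {
    val members: List<Members>

    @DataSchema
    interface Members { val name: String; val role: String }
}

fun main() {
    // Both nested types are called Members, yet they are distinct declarations
    // because each one lives in the scope of its enclosing schema.
    println(Organization.Members::class == Team.Members::class) // false
}
```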

> Align generated class names with JVM conventions by using singular nouns (e.g., Contributor instead of Contributors).

It doesn't seem possible to achieve the desired result with an algorithmic approach :( But since the declarations are nested, it will be very easy to rename them. Once they are renamed, I plan to have cast preserve user-defined names for nested columns.

@DataSchema
data class MySchema(val name: UserName) {

  @DataSchema
  class UserName(val firstName: String)
}

Run main multiple times. If the schema changes, new code is generated and thrown as an exception. Codegen reuses the names of nested declarations.

fun main() {
  DataFrame.readCsv().cast<MySchema>() // codegen will be triggered if cast fails
}

The regenerated schema preserves the name UserName:

@DataSchema
data class MySchema(val name: UserName) {

  @DataSchema
  class UserName(val firstName: String?)
}

@koperagen koperagen force-pushed the nested-markers-naming branch from f6a2442 to 7180f04 on March 2, 2026 16:32
if (existingMarker != null) {
return existingMarker
}
val baseName = when (val provider = nestedMarkerNameProvider) {

Review comment (Collaborator):

May I suggest a small refactor to decrease the nested logic?

val baseName =
    when (val provider = nestedMarkerNameProvider) {
        is MarkerNameProvider.GeneratedName if columnPath.isNotEmpty() ->
            provider(columnPath)

        else -> namePrefix
    }
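
The suggestion above relies on guard conditions in `when` (`is X if condition ->`), which, as far as I know, arrived as a preview feature in Kotlin 2.1 and became stable in 2.2. A minimal sketch of an equivalent form without guards follows, for the case where the build targets an older language version. `NameProvider`, `Predefined`, `Generated`, and `baseName` are simplified stand-ins, not the library's types.

```kotlin
// Simplified stand-ins for the provider hierarchy (assumption for this sketch).
sealed interface NameProvider {
    object Predefined : NameProvider

    class Generated(private val derive: (List<String>) -> String) : NameProvider {
        operator fun invoke(columnPath: List<String>): String = derive(columnPath)
    }
}

// Same decision as the guarded `when`, expressed with a smart cast in an `if`.
fun baseName(provider: NameProvider, columnPath: List<String>, namePrefix: String): String =
    if (provider is NameProvider.Generated && columnPath.isNotEmpty()) {
        provider(columnPath) // smart-cast to Generated
    } else {
        namePrefix
    }

fun main() {
    val fromColumn = NameProvider.Generated { path -> path.last().replaceFirstChar { it.uppercaseChar() } }
    println(baseName(fromColumn, listOf("repos"), "_DataFrameType"))              // Repos
    println(baseName(fromColumn, emptyList(), "_DataFrameType"))                  // _DataFrameType
    println(baseName(NameProvider.Predefined, listOf("repos"), "_DataFrameType")) // _DataFrameType
}
```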

@Jolanrensen (Collaborator) commented Mar 2, 2026

> On top of our numerical suffix system, which avoids collisions while generating a single schema, I changed code generation to emit nested declarations, so all nested schemas now live in their own scope. Even if a user has multiple schemas with similarly named structural columns, there will be no conflicts.

How does this behave in notebooks? I don't seem to see any nesting behavior when tracking execution there, though the names of the nested types do seem to appear in top-level dataschema interfaces.
I also wonder how that would treat the polymorphic behavior of the schemas there.

visibility: MarkerVisibility = MarkerVisibility.IMPLICIT_PUBLIC,
useFqNames: Boolean = false,
nameNormalizer: NameNormalizer = NameNormalizer.default,
nestedMarkerNameProvider: MarkerNameProvider = MarkerNameProvider.fromColumnName,

@Jolanrensen (Collaborator) commented Mar 2, 2026

Note, this is a binary-breaking API change!
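
Adding a parameter to an existing function changes its JVM signature even when the parameter has a default value, so callers compiled against the old signature fail at link time. One common way to stay binary compatible is to keep the old signature as a hidden overload. The sketch below uses a hypothetical `generateCode` function with simplified `String` parameters; it does not reflect the library's actual signatures.

```kotlin
// New signature with the added parameter (hypothetical API for illustration).
fun generateCode(
    name: String,
    useFqNames: Boolean = false,
    nestedNames: String = "fromColumnName",
): String = "$name(useFqNames=$useFqNames, nestedNames=$nestedNames)"

// Old signature kept so already-compiled callers still link.
// HIDDEN removes it from overload resolution in newly compiled source code.
@Deprecated("Kept for binary compatibility", level = DeprecationLevel.HIDDEN)
fun generateCode(name: String, useFqNames: Boolean = false): String =
    generateCode(name, useFqNames, "fromColumnName")

fun main() {
    // New source code resolves to the three-parameter overload.
    println(generateCode("Repo")) // Repo(useFqNames=false, nestedNames=fromColumnName)
}
```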

@AndreiKingsley (Collaborator) left a review comment

Wow, this is a very cool feature! I really missed it!

@koperagen (Collaborator, Author) commented

> How does this behave in notebooks? I don't seem to see any nesting behavior when tracking execution there, though the names of the nested types do seem to appear in top-level dataschema interfaces.

:)) Sorry, I forgot to mention this detail. Nesting is disabled when "extensionProperties" are enabled. I'm keeping this for notebooks just like before so as not to break anything unexpectedly (and because changing the generation of extension properties to refer to nested types turned out to be very bothersome).
