Derive names of generated nested markers from column names#1702
Derive names of generated nested markers from column names#1702
Conversation
After running some tests, I think there's already some numerical suffix added in case of clashes :) I was worried about that at first, but it seems to work great in its current state! For instance, a DF like: produces something like: @DataSchema
interface _DataFrameType {
val nested1: Nested1
val nested2: Nested2
}
@DataSchema(isOpen = false)
interface Nested1 {
val nameAndCity: NameAndCity
}
@DataSchema(isOpen = false)
interface NameAndCity {
val city: String?
val name: String
}
@DataSchema(isOpen = false)
interface Nested2 {
val nameAndCity: NameAndCity1
}
@DataSchema(isOpen = false)
interface NameAndCity1 {
val city2: String?
val name: String
}neatly avoiding a (I'm not sure that's exactly what you meant, but I'm at least glad it works correctly :) ) |
core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/api/generateCode.kt
Outdated
Show resolved
Hide resolved
| @Interpretable("PathOf") | ||
| public fun pathOf(vararg columnNames: String): ColumnPath = ColumnPath(columnNames.asList()) | ||
| public fun pathOf(vararg columnNames: String): ColumnPath = | ||
| if (columnNames.isEmpty()) ColumnPath.EMPTY else ColumnPath(columnNames.asList()) |
There was a problem hiding this comment.
I was wondering if we could somehow put this logic in the constructor of ColumnPath but that likely needs a factory function of some sorts, not ideal :/
There was a problem hiding this comment.
Yes, turned out to be too bothersome to include in this PR :( Maybe as another issue
|
First of all, thank you for the excellent idea. I’m confident that many users will appreciate it and that it will save them time. I will share my opinion. When I see property-based names, I get the impression that these classes were taken from somewhere in my codebase or generated by AI. When I see a name like DataFrame1, it feels more deterministic, as if it follows an algorithmic approach. That gives me a stronger sense of trust and control. It would be good to keep the previous option available via a parameter or configuration. On the other hand, how should we handle situations where generated names clash with existing classes? In the case of Finally, in the JVM ecosystem, class names are usually singular. So it would be @DataSchema interface DataFrameType {
val members: List<Member>
val repos: List<Repo>
}
@DataSchema interface Member {
val login: String
val url: String
}
@DataSchema interface Repo {
val license: License
val contributors: List<Contributor>
}
@DataSchema interface Contributor {
val contributions: String
val login: String
}
@DataSchema interface License {
val key: String?
val name: String?
}
|
zaleslaw
left a comment
There was a problem hiding this comment.
- Make the naming strategy configurable so the previous deterministic approach (e.g., DataFrame1) remains available and backward compatible.
- Introduce a clear collision-avoidance mechanism for property-based names to prevent clashes with existing domain classes.
- Align generated class names with JVM conventions by using singular nouns (e.g., Contributor instead of Contributors).
|
@zaleslaw See https://github.com/Kotlin/dataframe/blob/a0f9654076f9398c347f6b95eb31eef3a1e53ca4/core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/codeGen/MarkerNameProvider.kt for the name logic. I don't see how simply taking the name and capitalizing it could be seen as non-deterministic. Gradle does it all the time, for instance. The trailing Also, like I mentioned, I tested name clash behavior. It's baked into our generation already, so you'd get |
… previous deterministic name generation
Added a nestedMarkerNameProvider parameter to all codegen functions.
On top of our numerical suffix system that works to avoid collisions while generating a single schema, i changes code generation to generate nested declarations. So, all nested schemas now live in their own scope. So even if user has multiple schemas with similarly named structural columns, there will be no conflicts.
Doesn't seem possible to achieve desired result with algorithmic approach :( But considering declarations are nested, it will be very easy to rename them. Once renamed, i plan to have Run main multiple times. If schema changes, new code is generated and thrown as an exception. Codegen re-uses names of nested declarations Updated schema generated preserves UserName |
f6a2442 to
7180f04
Compare
| if (existingMarker != null) { | ||
| return existingMarker | ||
| } | ||
| val baseName = when (val provider = nestedMarkerNameProvider) { |
There was a problem hiding this comment.
May I suggest a small refactor to decrease the nested logic?
val baseName =
when (val provider = nestedMarkerNameProvider) {
is MarkerNameProvider.GeneratedName if columnPath.isNotEmpty() ->
provider(columnPath)
else -> namePrefix
}
How does this behave in notebooks? I don't seem to see any nesting behavior when tracking execution there, though the names of the nested types do seem to appear in top-level dataschema interfaces. |
| visibility: MarkerVisibility = MarkerVisibility.IMPLICIT_PUBLIC, | ||
| useFqNames: Boolean = false, | ||
| nameNormalizer: NameNormalizer = NameNormalizer.default, | ||
| nestedMarkerNameProvider: MarkerNameProvider = MarkerNameProvider.fromColumnName, |
There was a problem hiding this comment.
Note, this is a binary-breaking API change!
AndreiKingsley
left a comment
There was a problem hiding this comment.
Wow, this is a very cool feature! I really missed it!
:)) Sorry, i forgot to mention this detail. Nesting is disabled when "extensionProperties" are enabled. Keeping this for notebooks just like before in order to not break anything unexpectedly (plus, because changing generation of extension properties to refer to nested types turned out very bothersome) |
First iteration of generated dataschemas usability improvements. One way to make the output more ready-to-use is better names. I think it makes a lot of sense to use properly cased property name for nested schemas. Besides that, i plan to use this change in ".cast + exception workflow" we discussed earlier
Before:
After:
If we want, we can add a parameter to generate prefix+numerical suffix type of names, as we used to. Same goes for notebooks: we could do it there. But i checked how it looks like and IMO even in notebooks new naming is clearer