Conversation
Hmm, so this makes our code look something like:

```kotlin
df.readCsv("something")
    .require { "name"["firstName"]<String>() }
    .require { "name"["lastName"]<String>() }
    .require { "age"<Int>() }
    .require { "address"<String>() }
```

I don't quite understand why you actually need the reified type parameter at all here. Isn't this enough:

```kotlin
@Refine
@Interpretable("Require0")
public fun <T> DataFrame<T>.require(columns: ColumnsSelector<T, *>): DataFrame<T> {
    // attempts to resolve all columns; throws an exception if any mentioned column is not there
    getColumnsWithPaths(UnresolvedColumnsPolicy.Fail, columns)
    // all columns resolved successfully, so the compiler plugin can safely assume they are present
    return this
}
```

When executing:

```kotlin
val newDf = df.readCsv("something")
    .require { "name"["firstName"]<String>() and "name"["lastName"]<String>() and "age"<Int>() }
```

the compiler plugin would make sure that you can call:

```kotlin
newDf.name.firstName
newDf.age
// ...
```

Slightly different to …
core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/api/require.kt
core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/DataFrameReceiver.kt
Yes and no? We still need to throw an exception if a column is not an Int, for example. I'd say we cannot do it for an arbitrary selector like …
select is very similar, now I get what you mean. If we select a column but the provided type is incorrect, select doesn't immediately throw an exception now. This is a limitation, a problem even. But at least we should be more strict for …
Ahh, you're right, it indeed doesn't check that. It does check the column kind, though. For our …, doing it completely safely may involve modifying all implementations of … But okay, let's look at it from another angle. What if we don't check types at runtime for now, like we already do for …? The String API is already unsafe, and all … So, while runtime checks would be very nice, both for …
The exception is only triggered when a value is pulled from the column:
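For example, a minimal sketch of that lazy failure using the Kotlin DataFrame String API (the column names and data here are made up for illustration): declaring the wrong column type does not fail at resolution time; the `ClassCastException` only surfaces once a value is actually read back as the declared type.

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

// "age" actually holds Ints
val df = dataFrameOf("age")(25, 31)

// Declaring the wrong type does not fail here...
val age = column<String>("age")
val col = df[age] // ...nor here: the column resolves just fine

// Only pulling a value out triggers the failure:
// val first: String = col[0]  // ClassCastException at this point
```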
I agree

| **Related operations**: [](cast.md), [](convertTo) |
1. Please add a Korro sample in the "samples" module.
2. Might be a good idea to show that `peopleDf` originally doesn't have EPs:

```kotlin
// Won't compile
peopleDf.select { name.firstName }

// Declare the column with a runtime check
val df = peopleDf.require { "name"["firstName"]<String>() }

// Use extension properties after `require`
df.select { name.firstName }
```
I cannot do it yet because require is not supported in the compiler plugin :( But I'll do it after we update to 2.4.0-RC or something.
But I'll update the code snippet.
Yeah, I really didn't think about that, sorry 😄! But please create at least a commented function and Korro marks so we don't forget, and an issue.
I think we should take the following approach to “require”: it's a great thing if you need to quickly create EPs for 1-3 columns.
I agree, yes
Why not? :)
Me too, however, I would still put …
Hmm, I mean it's one of the reasons we're looking at #1168 and why we have this comprehensive …
On the other hand, if this is one of the goals, limiting …
As for using it in production: that may be nice :) It's like defining the schema functionally, which can be very versatile:

```kotlin
df.require {
    "user" {
        "name" {
            "firstName"<String>() and "lastName"<String>()
        } and "age"<Int>()
    } and
        "address" {
            "street"<String>() and "city"<String>()
        }
}
```

However, maybe a different DSL would make more sense, as more than half of the CS DSL makes no sense for … Something more similar to …:

```kotlin
df.require {
    "user" {
        "name" {
            "firstName"<String>()
            "lastName"<String>()
        }
        "age"<Int>()
    }
    "address" {
        "street"<String>()
        "city"<String>()
    }
}
```

This would also allow us to build any checks we like into each column definition. So... maybe we could follow the …:

```kotlin
df
    .require(column<Int>("user")) // implemented now
    .require { // implemented later
        // new DSL for multiple columns, notation TBD
    }
```
So, why not add 2 functions? For defining schema …
@AndreiKingsley Yes, I think two would work too, but it's hard for users to tell the difference between the two. Let's say they just want to …:

```kotlin
df.requireColumn { "singleCol"<String>() }
df.defineColumns { "singleCol"<String>() }
```

There's no way for them to tell the difference or see which benefits them more.
If we go with the 2-functions route, one option is …
I'd name it like …
Will it be OK to keep the scope of this issue to …
"define" is a bit of an odd name, though, because it's you, the user, doing the defining, not DataFrame. We usually have names that are "commands" (imperative) to the library, like "select", "filter", "add", etc., or "end goals" (declarative), like "distinct()", "first()", or most of the CS DSL: "colsOf", "all", "named".
Yes, I think so :) but we should decide the name first, as there's no easy way to make …
@Jolanrensen Maybe requireSchema?
@koperagen But then I'd recommend making it … So:
Fine by me
Sorry, I don't agree with this logic. We also have `get(ColumnSelector)`, and at the same time we have `getColumns(ColumnsSelector)` and operations like `after`, `before`, `asGroupBy` with `ColumnSelector`. So the type itself is not the only deciding factor.
That only leaves … So I still think the logic holds up.
Added a commit with the require -> requireColumn rename.
```kotlin
val resolvedColumn = getColumnWithPath(column)
val actualType = resolvedColumn.data.type
require(resolvedColumn.data.isSubtypeOf(type)) {
    "Column '${resolvedColumn.path.joinToString()}' has type '$actualType', which is not subtype of required '$type' type."
}
```
*a subtype of the required '$type' type.
@koperagen please don't merge before my approval
Seems not feasible to have a `ColumnsSelector` version, because we cannot make interface members inline functions with reified type parameters. So, no go for our existing String Column Accessors like `String.invoke`, `col`, and so on. Context parameters are still considered experimental, so it'll have to wait.
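A minimal sketch of the language restriction being referenced (the interface and member names here are made up for illustration): `reified` is only allowed on inline functions, and an interface member with a body is virtual, which an inline function cannot be.

```kotlin
interface ColumnRequirementScope {
    // Won't compile: interface members are virtual, inline functions cannot be,
    // and `reified` is only allowed on inline functions:
    // inline operator fun <reified T> String.invoke(): Boolean

    // A non-inline member compiles, but T is erased at runtime, so the
    // type check the DSL needs is no longer possible:
    // operator fun <T> String.invoke(): Boolean
}
```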
In the meantime, multiple require calls in a chain :)
With this new operation we'll further improve the workflow where all compile-time schema information is derived from operations: require (new), operations with selectors using String Column Accessors, and things like `add`, `toDataFrame`, `map`, etc. This can be a good alternative to the usual `DataSchema` workflow, with good potential for incremental introduction of type safety and a lower entry barrier. Well, you know the idea.