Skip to content

Add require operation#1715

Open
koperagen wants to merge 5 commits intomasterfrom
air/implement-dataframe.require-column-validator-9ce538e8-d
Open

Add require operation#1715
koperagen wants to merge 5 commits intomasterfrom
air/implement-dataframe.require-column-validator-9ce538e8-d

Conversation

@koperagen
Copy link
Collaborator

Seems not feasible to have ColumnsSelector because we cannot make interface members inline functions and have reified type parameters. So, no go for our existing String Column Accessors like String.invoke, col, and so on. context parameters are still considered experimental, so it'll have to wait.
In the meantime, multiple require calls in chain :)
With this new operation we'll further improve workflow where all compile time schema information is derived from operations: require (new), operations with selectors using String Column Accessors, and things like add, toDataFrame, map, etc. This can be a good alternative to usual DataSchema workflow, with good potential for incremental introduction of type safety and lower entry barrier. Well, you know the idea.

@koperagen koperagen added this to the 1.0.0-Beta5 milestone Feb 27, 2026
@koperagen koperagen self-assigned this Feb 27, 2026
@koperagen koperagen added the enhancement New feature or request label Feb 27, 2026
@Jolanrensen
Copy link
Collaborator

Seems not feasible to have ColumnsSelector because we cannot make interface members inline functions and have reified type parameters. So, no go for our existing String Column Accessors like String.invoke, col, and so on. context parameters are still considered experimental, so it'll have to wait.
In the meantime, multiple require calls in chain :)

Hmm, so this makes our code look something like:

df.readCsv("something")
    .require { "name"["firstName"]<String>() }
    .require { "name"["lastName"]<String>() }
    .require { "age"<Int>() }
    .require { "address"<String>() }

I don't quite understand why you actually need the reified type parameter at all here. Isn't require {} just select {} where it checks the columns you mention are there and the original (but refined) DF is returned? The rest is then done on the compiler plugin side. So something like:

@Refine
@Interpretable("Require0")
public fun <T> DataFrame<T>.require(columns: ColumnsSelector<T, *>): DataFrame<T> {
    // attempts to resolve all columns, throw Exception if any column mentioned is not there
    getColumnsWithPaths(UnresolvedColumnsPolicy.Fail, columns)
    // Managed to successfully resolve all columns, compiler plugin can safely assume they are present
    return this
}

When executing

val newDf = df.readCsv("something")
    .require { "name"["firstName"]<String>() and "name"["lastName"]<String>() and "age"<Int>() }

the compiler plugin would make sure that you can call:

newDf.name.firstName
newDf.age
...

Slightly different to .select {} because that would expose newDf.firstName (pulled out of its group), but that's the only difference, right?

@koperagen
Copy link
Collaborator Author

I don't quite understand why you actually need the reified type parameter at all here. Isn't require {} just select {} where it checks the columns you mention are there and the original (but refined) DF is returned? The rest is then done on the compiler plugin side.

Yes and no? We still need to throw an exception if column is not Int, for example. I'd say we cannot do it for an arbitrary selector like "a"<Int>() and "c"<String>() because neither Int nor String are available as KType for us

@koperagen
Copy link
Collaborator Author

I don't quite understand why you actually need the reified type parameter at all here. Isn't require {} just select {} where it checks the columns you mention are there and the original (but refined) DF is returned? The rest is then done on the compiler plugin side. So something like:

select is very similar, now i get what you mean. If we select column but provided type is incorrect, select doesn't immediately throws an exception now. This is a limitation, a problem even. But at least we should be more strict for require as name implies

@Jolanrensen
Copy link
Collaborator

Jolanrensen commented Feb 27, 2026

Yes and no? We still need to throw an exception if column is not Int, for example. I'd say we cannot do it for an arbitrary selector like "a"() and "c"() because neither Int nor String are available as KType for us

select is very similar, now i get what you mean. If we select column but provided type is incorrect, select doesn't immediately throws an exception now. This is a limitation, a problem even. But at least we should be more strict for require as name implies

Ahh, you're right, it indeed doesn't check that. It does check column kind though. For our colGroup(), valueCol(), and frameCol() accessors inside the CS DSL, they call ensureIsValueColumn(), etc., which attach a requirement check .onResolve {}. But indeed, because we cannot put inline functions with reified types in interfaces, we cannot catch and check the types on resolve...

Plus, doing it completely safely may involve modifying all implementation of ColumnAccessor and giving it an expected KType or something...


But okay, let's look at it from another angle. What if we don't check types at runtime for now, like we already do for select {}?
df.select { "a"<Int>() }.a is possible to write already, even if a is String. It will just fail at the .a call instead of .select {}.

The String API is already unsafe, and all require {} wants to achieve is to make a little bridge from the String API to column accessors from the compiler plugin.

So why runtime checks would be very nice, both for require {} as for any other CS DSL function, I don't think it is worth limiting require {} just to one column. For now, that is, of course :)

@koperagen
Copy link
Collaborator Author

koperagen commented Mar 2, 2026

df.select { "a"() }.a is possible to write already, even if a is String. It will just fail at the .a call instead of .select {}.

Exception is only triggered when value is pulled from the column: df.select { "a"<Int>() }.a[0]. Int in DataColumn<Int> itself is an unchecked cast. So error can travel quite a bit before triggering a ClassCast exception. This troubles me
TBF, i'm fine with both approaches. We can choose. My thoughts here:

  1. Maybe we should be more strict because require is more explicitly about ensuring specific schema than select
  2. Multiple require being a bit verbose might not be a bad thing, right?
  3. Changing ColumnSelector to ColumnsSelector will be a source compatible change in the future

The String API is already unsafe, and all require {} wants to achieve is to make a little bridge from the String API to column accessors from the compiler plugin.

I agree


**Related operations**: [](cast.md), [](convertTo)

```kotlin
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Please add a Korro sample in "samples" module
    2)Might be a good idea to show this that peopleDf originally doesn't have EPs:
// Won't compile
peopleDf.select { name.firstName }
// Declare column with a runtime check
val df = peopleDf.require { "name"["firstName"]<String>() }
// Use extension properties after `require`
df.select { name.firstName }

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cannot do it yet because require is not supported in compiler plugin :( But i'll do after we update to 2.4.0-RC or something

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but i'll update the code snippet

Copy link
Collaborator

@AndreiKingsley AndreiKingsley Mar 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I really didn't think about that, sorry 😄 !
But please create at least a commented function and korro marks so we don't forget and an issue.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AndreiKingsley
Copy link
Collaborator

I think we should take the following approach to “require” — it's a great thing if you need to quickly create EPs for 1-3 columns.
However, I would not recommend using it in production code, and would use the classic cast approach instead.
Do you agree? If so, I would put this in the documentation.

@koperagen
Copy link
Collaborator Author

I think we should take the following approach to “require” — it's a great thing if you need to quickly create EPs for 1-3 columns.
However, I would not recommend using it in production code, and would use the classic cast approach instead.
Do you agree? If so, I would put this in the documentation.

it's a great thing if you need to quickly create EPs for 1-3 columns.
would use the classic cast approach instead

I agree, yes

I would not recommend using it in production code

Why not? :)

@Jolanrensen
Copy link
Collaborator

So error can travel quite a bit before triggering a ClassCast exception. This troubles me

Me too, however, I would still put require and select (and any other selecting multiple columns operation for that matter) on the same level of "safeness".
@zaleslaw @AndreiKingsley what do you think?

Multiple require being a bit verbose might not be a bad thing, right?

hmm, I mean it's one of the reasons we're looking at #1168 and why we have this comprehensive add {} or even select {} DSL.
I fear repetition of the same statement over and over again hurts readability in the API.

I think we should take the following approach to “require” — it's a great thing if you need to quickly create EPs for 1-3 columns.
However, I would not recommend using it in production code, and would use the classic cast approach instead.
Do you agree? If so, I would put this in the documentation.

On the other hand, if this is one of the goals, limiting require {} to one column does force people towards cast/convert... cast(verify = true) is a bit safer, as it fails earlier

@Jolanrensen
Copy link
Collaborator

Jolanrensen commented Mar 2, 2026

As for using it in production. That may be nice :) It's like defining the schema functionally, which can be very versitile:

df.require {
    "user" {
        "name" {
            "firstName"<String>() and "lastName"<String>()
        } and "age"<Int>()
    } and
        "address" {
            "street"<String>() and "city"<String>()
        }
}

however, maybe a different DSL would make more sense as more than half of the CS DSL makes no sense for require {} (like colsOf<T>(), except {} etc. etc.) Plus, we don't care about the return value, unlike .select {}

Something more similar to add {}:

df.require {
    "user" {
        "name" {
            "firstName"<String>()
            "lastName"<String>()
        }
        "age"<Int>()
    }
    "address" {
        "street"<String>()
        "city"<String>()
    }
}

This would also allow us to build any checks we like into each column definition.

So... Maybe we could follow the add pattern; have an overload for a single column and one for multiple columns:

df
    .require(column<Int>("user")) // implemented now
    .require { // implemented later
        // new DSL for multiple columns, notation TBD
    }

@AndreiKingsley
Copy link
Collaborator

So, why not add 2 functions?
For a single typed column

df.requireColumn { singleColumn }

For defining schema

df.defineColumns {
     "user" {
        "name" {
            "firstName"<String>()
            "lastName"<String>()
        }
        "age"<Int>()
    }
    "address" {
        "street"<String>()
        "city"<String>()
    }
}

@Jolanrensen
Copy link
Collaborator

@AndreiKingsley Yes, I think two would work too, but it's hard for users to tell the difference between the two. Let's say they just want to require one column. In your case they could write:

df.requireColumn { "singleCol"<String>() }
df.defineColumns { "singleCol"<String>() }

and there's no way for them to tell the difference or see which benefits them more.

@koperagen
Copy link
Collaborator Author

If we go with 2 functions route, one option is require { singleColumn } and cast { DSL here }

@AndreiKingsley
Copy link
Collaborator

I'd name it like
require and defineSchema

@koperagen
Copy link
Collaborator Author

Will it be ok to keep the scope of this issue to require with single column? Let's decide about name and DSL for second one separately?

@Jolanrensen
Copy link
Collaborator

Jolanrensen commented Mar 2, 2026

I'd name it like require and defineSchema

"define" is a bit of an odd name, though. Because it's you, the user, doing the defining, not DataFrame.

We usually have names that are "commands" (imperative) to the library, like "select", "filter", "add", etc. or "end goals" (declarative), like "distinct()", "first()", and or most of the CS DSL: "colsOf", "all", "named"

Will it be ok to keep the scope of this issue to require with single column? Let's decide about name and DSL for second one separately?

Yes, I think so :) but we should decide the name first, as there's no easy way to make require {} allow multiple columns later. So we'll need to decide between require {}, requireCol {}, requireColumn {}, etc.

@koperagen
Copy link
Collaborator Author

koperagen commented Mar 2, 2026

@Jolanrensen
requireColumns?
@AndreiKingsley
For defineSchema question is whether we want to discard previous type information or to keep it?

Maybe requireSchema?

@Jolanrensen
Copy link
Collaborator

Jolanrensen commented Mar 2, 2026

@koperagen But then I'd recommend making it require { multiple columns }, requireColumn { column }.
Because select { }, add {}, convert {}, update {}, etc. also allow multiple columns, and we have getColumn { singleColumn } as only named operation for a single column. However, having "require" and another again hurts discoverability and 'not knowing which variant to choose' if we did have both... unless we have two entirely different names:

So:

  • requireColumn {} for the singular one. It appends "-Column" to not conflict with the other multiple-column operations.
  • cast {}, requireColumns {}, or requireSchema {} for the multiple columns one.
  • Maybe in the future we could figure out a require {} which takes any number of columns and we could drop both variants altogether. However, I would not create require {} if it is limited to just one of the cases.

@koperagen
Copy link
Collaborator Author

requireColumn { column }

Fine by me

Because select { }, add {}, convert {}, update {}, etc. also allow multiple columns, and we have getColumn { singleColumn } as only named operation for a single column.

Sorry, i don't agree with this logic. We also have get(ColumnSelector), and at the same time we have getColumns(ColumnsSelector) and operations like after, before, asGroupBy with ColumnSelector. So type itself is not the only deciding factor

@Jolanrensen
Copy link
Collaborator

Sorry, i don't agree with this logic. We also have get(ColumnSelector), and at the same time we have getColumns(ColumnsSelector) and operations like after, before, asGroupBy with ColumnSelector. So type itself is not the only deciding factor

get allows both ColumnSelector as well as ColumnsSelector, (but that trick doesn't seem to work well with require {} if it's inlined, I tried :/), so that follows select and the other operations in that you can either supply one or multiple columns.

after, before etc. aren't operations. They are selectors inside the Columns Selection DSL.

That only leaves asGroupBy {} as top-level operation asking for a single column. However, I'd say this is more an exception than a rule. You can call asGroupBy() as well. It's just that GroupBy needs a single FrameColumn and if you have multiple, you need to specify which one you mean.

So I still think the logic holds up.

@koperagen koperagen force-pushed the air/implement-dataframe.require-column-validator-9ce538e8-d branch from cf7fdc5 to 62332ee Compare March 2, 2026 16:37
@koperagen koperagen force-pushed the air/implement-dataframe.require-column-validator-9ce538e8-d branch from 62332ee to 48508dd Compare March 2, 2026 16:43
@koperagen
Copy link
Collaborator Author

Added a commit with require -> requireColumn rename

@koperagen koperagen force-pushed the air/implement-dataframe.require-column-validator-9ce538e8-d branch from 48508dd to 0304ba4 Compare March 2, 2026 16:53
val resolvedColumn = getColumnWithPath(column)
val actualType = resolvedColumn.data.type
require(resolvedColumn.data.isSubtypeOf(type)) {
"Column '${resolvedColumn.path.joinToString()}' has type '$actualType', which is not subtype of required '$type' type."
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*a subtype of the required '$type' type.

Copy link
Collaborator

@Jolanrensen Jolanrensen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@zaleslaw
Copy link
Collaborator

zaleslaw commented Mar 3, 2026

@koperagen please don't merge before my approval

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants