Add `require` operation by koperagen · Pull Request #1715 · Kotlin/dataframe

koperagen · 2026-02-27T14:18:46Z

A ColumnsSelector doesn't seem feasible because we cannot make interface members inline functions with reified type parameters. So it's a no-go for our existing String Column Accessors like String.invoke, col, and so on. Context parameters are still considered experimental, so that will have to wait.
In the meantime, multiple require calls in a chain :)
With this new operation we'll further improve the workflow where all compile-time schema information is derived from operations: require (new), operations with selectors using String Column Accessors, and things like add, toDataFrame, map, etc. This can be a good alternative to the usual DataSchema workflow, with good potential for incremental introduction of type safety and a lower entry barrier. Well, you know the idea.
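
For readers following along, the constraint mentioned above can be illustrated with a stdlib-only sketch. `ToyDsl`, `ToyAccessor`, and `typedAccessor` are hypothetical names invented for this illustration and are not part of the DataFrame API. The sketch only shows that a member declared in an interface cannot capture a `KType` for its type argument, while a top-level `inline` function with a `reified` parameter can.

```kotlin
import kotlin.reflect.KType
import kotlin.reflect.typeOf

// Hypothetical accessor that optionally remembers the type it was declared with.
data class ToyAccessor<C>(val name: String, val expectedType: KType?)

// Hypothetical stand-in for a selection DSL scope declared as an interface.
interface ToyDsl {
    // Public interface members are virtual, so they cannot be `inline` with `reified C`.
    // C is erased here, which means no KType is available to store.
    operator fun <C> String.invoke(): ToyAccessor<C> = ToyAccessor(this, expectedType = null)
}

// A top-level function can be inline + reified, so it can capture the KType.
inline fun <reified C> typedAccessor(name: String): ToyAccessor<C> =
    ToyAccessor(name, expectedType = typeOf<C>())

fun main() {
    val dsl = object : ToyDsl {}
    val erased = with(dsl) { "age"<Int>() }   // type argument is lost at runtime
    val typed = typedAccessor<Int>("age")     // type argument is captured

    println(erased.expectedType == null)
    println(typed.expectedType == typeOf<Int>())
}
```

This is why a runtime type check is straightforward for a single column passed through an inline reified function, but not for arbitrary accessors created inside an interface-based DSL.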

Jolanrensen · 2026-02-27T15:38:31Z

A ColumnsSelector doesn't seem feasible because we cannot make interface members inline functions with reified type parameters. So it's a no-go for our existing String Column Accessors like String.invoke, col, and so on. Context parameters are still considered experimental, so that will have to wait.
In the meantime, multiple require calls in a chain :)

Hmm, so this makes our code look something like:

df.readCsv("something")
    .require { "name"["firstName"]<String>() }
    .require { "name"["lastName"]<String>() }
    .require { "age"<Int>() }
    .require { "address"<String>() }

I don't quite understand why you actually need the reified type parameter at all here. Isn't require {} just select {} where it checks the columns you mention are there and the original (but refined) DF is returned? The rest is then done on the compiler plugin side. So something like:

@Refine
@Interpretable("Require0")
public fun <T> DataFrame<T>.require(columns: ColumnsSelector<T, *>): DataFrame<T> {
    // attempts to resolve all columns, throw Exception if any column mentioned is not there
    getColumnsWithPaths(UnresolvedColumnsPolicy.Fail, columns)
    // Managed to successfully resolve all columns, compiler plugin can safely assume they are present
    return this
}

When executing

val newDf = df.readCsv("something")
    .require { "name"["firstName"]<String>() and "name"["lastName"]<String>() and "age"<Int>() }

the compiler plugin would make sure that you can call:

newDf.name.firstName
newDf.age
...

Slightly different to .select {} because that would expose newDf.firstName (pulled out of its group), but that's the only difference, right?

docs/StardustDocs/topics/require.md

core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/api/require.kt

core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/DataFrameReceiver.kt

koperagen · 2026-02-27T16:43:01Z

I don't quite understand why you actually need the reified type parameter at all here. Isn't require {} just select {} where it checks the columns you mention are there and the original (but refined) DF is returned? The rest is then done on the compiler plugin side.

Yes and no? We still need to throw an exception if column is not Int, for example. I'd say we cannot do it for an arbitrary selector like "a"<Int>() and "c"<String>() because neither Int nor String are available as KType for us

koperagen · 2026-02-27T16:46:58Z

I don't quite understand why you actually need the reified type parameter at all here. Isn't require {} just select {} where it checks the columns you mention are there and the original (but refined) DF is returned? The rest is then done on the compiler plugin side. So something like:

select is very similar, now I get what you mean. If we select a column but the provided type is incorrect, select doesn't immediately throw an exception right now. This is a limitation, a problem even. But at least we should be more strict for require, as the name implies

core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/DataFrameReceiver.kt

Jolanrensen · 2026-02-27T18:19:57Z

Yes and no? We still need to throw an exception if column is not Int, for example. I'd say we cannot do it for an arbitrary selector like "a"<Int>() and "c"<String>() because neither Int nor String are available as KType for us

select is very similar, now I get what you mean. If we select a column but the provided type is incorrect, select doesn't immediately throw an exception right now. This is a limitation, a problem even. But at least we should be more strict for require, as the name implies

Ahh, you're right, it indeed doesn't check that. It does check column kind though. For our colGroup(), valueCol(), and frameCol() accessors inside the CS DSL, they call ensureIsValueColumn(), etc., which attach a requirement check .onResolve {}. But indeed, because we cannot put inline functions with reified types in interfaces, we cannot catch and check the types on resolve...

Plus, doing it completely safely may involve modifying all implementations of ColumnAccessor and giving them an expected KType or something...

But okay, let's look at it from another angle. What if we don't check types at runtime for now, like we already do for select {}?
df.select { "a"<Int>() }.a is possible to write already, even if a is String. It will just fail at the .a call instead of .select {}.

The String API is already unsafe, and all require {} wants to achieve is to make a little bridge from the String API to column accessors from the compiler plugin.

So while runtime checks would be very nice, both for require {} and for any other CS DSL function, I don't think it is worth limiting require {} to just one column. For now, that is, of course :)

koperagen · 2026-03-02T10:28:57Z

df.select { "a"<Int>() }.a is possible to write already, even if a is String. It will just fail at the .a call instead of .select {}.

Exception is only triggered when value is pulled from the column: df.select { "a"<Int>() }.a[0]. Int in DataColumn<Int> itself is an unchecked cast. So error can travel quite a bit before triggering a ClassCast exception. This troubles me
TBF, i'm fine with both approaches. We can choose. My thoughts here:

Maybe we should be more strict because require is more explicitly about ensuring specific schema than select
Multiple require being a bit verbose might not be a bad thing, right?
Changing ColumnSelector to ColumnsSelector will be a source compatible change in the future

The String API is already unsafe, and all require {} wants to achieve is to make a little bridge from the String API to column accessors from the compiler plugin.

I agree
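
The delayed failure described above (an unchecked cast that only surfaces when a value is actually read) can be reproduced with plain Kotlin collections. This is a minimal sketch targeting Kotlin/JVM; `uncheckedColumn` is a hypothetical helper standing in for an accessor whose type argument is not verified, and it does not use any DataFrame API.

```kotlin
// Hypothetical helper: pretends the values have type T without verifying it,
// similar in spirit to an accessor whose type argument is an unchecked cast.
@Suppress("UNCHECKED_CAST")
fun <T> uncheckedColumn(values: List<Any?>): List<T> = values as List<T>

fun main() {
    val raw: List<Any?> = listOf("Alice", "Bob")

    // No exception here: generics are erased, so the cast cannot be checked.
    val ints: List<Int> = uncheckedColumn(raw)

    // Still no exception: nothing has been read as an Int yet.
    println(ints.size)

    // The failure only shows up once an element is actually used as an Int.
    val failure = runCatching {
        val first: Int = ints[0]
        first + 1
    }
    println(failure.exceptionOrNull() is ClassCastException)
}
```

The wrongly typed list can be passed around freely before anything fails, which is the concern raised about errors travelling far from their cause.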

docs/StardustDocs/topics/require.md

AndreiKingsley · 2026-03-02T11:45:48Z

docs/StardustDocs/topics/require.md

+
+**Related operations**: [](cast.md), [](convertTo)
+
+```kotlin


1) Please add a Korro sample in the "samples" module.
2) It might be a good idea to show that peopleDf originally doesn't have EPs:

// Won't compile
peopleDf.select { name.firstName }
// Declare column with a runtime check
val df = peopleDf.require { "name"["firstName"]<String>() }
// Use extension properties after `require`
df.select { name.firstName }

I cannot do it yet because require is not supported in compiler plugin :( But i'll do after we update to 2.4.0-RC or something

but i'll update the code snippet

Yeah, I really didn't think about that, sorry 😄 !
But please at least create a commented-out function with Korro marks, plus an issue, so we don't forget.

AndreiKingsley · 2026-03-02T11:53:02Z

I think we should take the following approach to “require” — it's a great thing if you need to quickly create EPs for 1-3 columns.
However, I would not recommend using it in production code, and would use the classic cast approach instead.
Do you agree? If so, I would put this in the documentation.

koperagen · 2026-03-02T11:56:53Z

I think we should take the following approach to “require” — it's a great thing if you need to quickly create EPs for 1-3 columns.
However, I would not recommend using it in production code, and would use the classic cast approach instead.
Do you agree? If so, I would put this in the documentation.

it's a great thing if you need to quickly create EPs for 1-3 columns.
would use the classic cast approach instead

I agree, yes

I would not recommend using it in production code

Why not? :)

Jolanrensen · 2026-03-02T11:57:31Z

So error can travel quite a bit before triggering a ClassCast exception. This troubles me

Me too, however, I would still put require and select (and any other selecting multiple columns operation for that matter) on the same level of "safeness".
@zaleslaw @AndreiKingsley what do you think?

Multiple require being a bit verbose might not be a bad thing, right?

hmm, I mean it's one of the reasons we're looking at #1168 and why we have this comprehensive add {} or even select {} DSL.
I fear repetition of the same statement over and over again hurts readability in the API.

I think we should take the following approach to “require” — it's a great thing if you need to quickly create EPs for 1-3 columns.
However, I would not recommend using it in production code, and would use the classic cast approach instead.
Do you agree? If so, I would put this in the documentation.

On the other hand, if this is one of the goals, limiting require {} to one column does force people towards cast/convert... cast(verify = true) is a bit safer, as it fails earlier

Jolanrensen · 2026-03-02T12:12:05Z

As for using it in production. That may be nice :) It's like defining the schema functionally, which can be very versatile:

df.require {
    "user" {
        "name" {
            "firstName"<String>() and "lastName"<String>()
        } and "age"<Int>()
    } and
        "address" {
            "street"<String>() and "city"<String>()
        }
}

however, maybe a different DSL would make more sense as more than half of the CS DSL makes no sense for require {} (like colsOf<T>(), except {} etc. etc.) Plus, we don't care about the return value, unlike .select {}

Something more similar to add {}:

df.require {
    "user" {
        "name" {
            "firstName"<String>()
            "lastName"<String>()
        }
        "age"<Int>()
    }
    "address" {
        "street"<String>()
        "city"<String>()
    }
}

This would also allow us to build any checks we like into each column definition.

So... Maybe we could follow the add pattern; have an overload for a single column and one for multiple columns:

df
    .require(column<Int>("user")) // implemented now
    .require { // implemented later
        // new DSL for multiple columns, notation TBD
    }
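
To make the add-like notation above more concrete, here is a rough stdlib-only sketch of how such a builder could collect column paths together with their expected types. `RequireBuilder` and `requirements` are hypothetical names made up for this illustration; the actual notation is still TBD in this discussion. The sketch relies on the fact that members of a final class, unlike interface members, may be `inline` with `reified` type parameters.

```kotlin
import kotlin.reflect.KType
import kotlin.reflect.typeOf

// Hypothetical builder; not part of the DataFrame API.
class RequireBuilder(val prefix: List<String> = emptyList()) {
    // Collected (column path, expected type) pairs.
    val requirements = mutableListOf<Pair<List<String>, KType>>()

    // Leaf column: "age"<Int>() records the path and the reified type.
    inline operator fun <reified T> String.invoke() {
        requirements += (prefix + this) to typeOf<T>()
    }

    // Column group: "user" { ... } collects nested requirements under this name.
    operator fun String.invoke(body: RequireBuilder.() -> Unit) {
        val nested = RequireBuilder(prefix + this).apply(body)
        requirements += nested.requirements
    }
}

// Hypothetical entry point that runs the DSL and returns what was declared.
fun requirements(body: RequireBuilder.() -> Unit): List<Pair<List<String>, KType>> =
    RequireBuilder().apply(body).requirements.toList()

fun main() {
    val declared = requirements {
        "user" {
            "name" {
                "firstName"<String>()
                "lastName"<String>()
            }
            "age"<Int>()
        }
        "address" {
            "street"<String>()
        }
    }
    declared.forEach { (path, type) -> println("${path.joinToString("/")}: $type") }
}
```

Because each leaf call has its own `KType`, a per-column runtime check could be attached at that point, which matches the remark about building checks into each column definition.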

AndreiKingsley · 2026-03-02T12:28:47Z

So, why not add 2 functions?
For a single typed column

df.requireColumn { singleColumn }

For defining schema

df.defineColumns {
     "user" {
        "name" {
            "firstName"<String>()
            "lastName"<String>()
        }
        "age"<Int>()
    }
    "address" {
        "street"<String>()
        "city"<String>()
    }
}

Jolanrensen · 2026-03-02T12:36:03Z

@AndreiKingsley Yes, I think two would work too, but it's hard for users to tell the difference between the two. Let's say they just want to require one column. In your case they could write:

df.requireColumn { "singleCol"<String>() }
df.defineColumns { "singleCol"<String>() }

and there's no way for them to tell the difference or see which benefits them more.

koperagen · 2026-03-02T12:39:27Z

If we go with 2 functions route, one option is require { singleColumn } and cast { DSL here }

AndreiKingsley · 2026-03-02T12:41:54Z

I'd name it like
require and defineSchema

koperagen · 2026-03-02T12:51:43Z

Will it be ok to keep the scope of this issue to require with single column? Let's decide about name and DSL for second one separately?

Jolanrensen · 2026-03-02T12:58:47Z

I'd name it like require and defineSchema

"define" is a bit of an odd name, though. Because it's you, the user, doing the defining, not DataFrame.

We usually have names that are "commands" (imperative) to the library, like "select", "filter", "add", etc., or "end goals" (declarative), like "distinct()", "first()", or most of the CS DSL: "colsOf", "all", "named"

Will it be ok to keep the scope of this issue to require with single column? Let's decide about name and DSL for second one separately?

Yes, I think so :) but we should decide the name first, as there's no easy way to make require {} allow multiple columns later. So we'll need to decide between require {}, requireCol {}, requireColumn {}, etc.

koperagen · 2026-03-02T13:13:43Z

@Jolanrensen
requireColumns?
@AndreiKingsley
For defineSchema question is whether we want to discard previous type information or to keep it?

Maybe requireSchema?

Jolanrensen · 2026-03-02T14:41:05Z

@koperagen But then I'd recommend making it require { multiple columns }, requireColumn { column }.
Because select { }, add {}, convert {}, update {}, etc. also allow multiple columns, and we have getColumn { singleColumn } as only named operation for a single column. However, having "require" plus another variant again hurts discoverability and leaves users not knowing which variant to choose, unless we have two entirely different names:

So:

requireColumn {} for the singular one. It appends "-Column" to not conflict with the other multiple-column operations.
cast {}, requireColumns {}, or requireSchema {} for the multiple columns one.
Maybe in the future we could figure out a require {} which takes any number of columns and we could drop both variants altogether. However, I would not create require {} if it is limited to just one of the cases.

koperagen · 2026-03-02T15:00:44Z

requireColumn { column }

Fine by me

Because select { }, add {}, convert {}, update {}, etc. also allow multiple columns, and we have getColumn { singleColumn } as only named operation for a single column.

Sorry, i don't agree with this logic. We also have get(ColumnSelector), and at the same time we have getColumns(ColumnsSelector) and operations like after, before, asGroupBy with ColumnSelector. So type itself is not the only deciding factor

Jolanrensen · 2026-03-02T15:48:41Z

Sorry, i don't agree with this logic. We also have get(ColumnSelector), and at the same time we have getColumns(ColumnsSelector) and operations like after, before, asGroupBy with ColumnSelector. So type itself is not the only deciding factor

get allows both ColumnSelector as well as ColumnsSelector, (but that trick doesn't seem to work well with require {} if it's inlined, I tried :/), so that follows select and the other operations in that you can either supply one or multiple columns.

after, before etc. aren't operations. They are selectors inside the Columns Selection DSL.

That only leaves asGroupBy {} as top-level operation asking for a single column. However, I'd say this is more an exception than a rule. You can call asGroupBy() as well. It's just that GroupBy needs a single FrameColumn and if you have multiple, you need to specify which one you mean.

So I still think the logic holds up.

koperagen · 2026-03-02T16:44:26Z

Added a commit with require -> requireColumn rename

Jolanrensen · 2026-03-02T18:49:17Z

core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/api/require.kt

+    val resolvedColumn = getColumnWithPath(column)
+    val actualType = resolvedColumn.data.type
+    require(resolvedColumn.data.isSubtypeOf(type)) {
+        "Column '${resolvedColumn.path.joinToString()}' has type '$actualType', which is not subtype of required '$type' type."


*a subtype of the required '$type' type.

Jolanrensen

Thanks!
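
As an illustration of the kind of check shown in the diff hunk above, here is a toy, stdlib-only model. `ToyColumn`, `ToyFrame`, and this `requireColumn` are hypothetical stand-ins, not the code from this PR. One deliberate simplification: the sketch compares types for exact equality, whereas the hunk above uses a subtype check.

```kotlin
import kotlin.reflect.KType
import kotlin.reflect.typeOf

// Hypothetical toy types; not the DataFrame API.
class ToyColumn(val path: List<String>, val type: KType)

class ToyFrame(val columns: List<ToyColumn>) {
    fun find(path: List<String>): ToyColumn? = columns.firstOrNull { it.path == path }
}

// Single-column check: inline + reified gives access to the expected KType.
inline fun <reified T> ToyFrame.requireColumn(vararg path: String): ToyFrame {
    val expected = typeOf<T>()
    val column = find(path.toList())
        ?: throw IllegalArgumentException("Column '${path.joinToString("/")}' not found.")
    // Simplification: exact type equality instead of a subtype check.
    require(column.type == expected) {
        "Column '${column.path.joinToString("/")}' has type '${column.type}', " +
            "which is not the required '$expected' type."
    }
    // The frame itself is returned unchanged; only the check has happened.
    return this
}

fun main() {
    val frame = ToyFrame(
        listOf(
            ToyColumn(listOf("name", "firstName"), typeOf<String>()),
            ToyColumn(listOf("age"), typeOf<Int>()),
        ),
    )
    frame.requireColumn<String>("name", "firstName").requireColumn<Int>("age")
    val wrongType = runCatching { frame.requireColumn<String>("age") }
    println(wrongType.exceptionOrNull()?.message)
}
```

Failing at the `requireColumn` call, rather than at the first value access, is the stricter behaviour argued for earlier in the thread.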

zaleslaw · 2026-03-03T08:28:18Z

@koperagen please don't merge before my approval

koperagen added this to the 1.0.0-Beta5 milestone Feb 27, 2026

koperagen requested review from AndreiKingsley, Jolanrensen and zaleslaw February 27, 2026 14:18

koperagen self-assigned this Feb 27, 2026

koperagen added the enhancement New feature or request label Feb 27, 2026

Jolanrensen requested changes Feb 27, 2026

koperagen commented Feb 27, 2026

AndreiKingsley requested changes Mar 2, 2026

koperagen mentioned this pull request Mar 2, 2026

Add korro example for require #1718

Open

koperagen force-pushed the air/implement-dataframe.require-column-validator-9ce538e8-d branch from cf7fdc5 to 62332ee Compare March 2, 2026 16:37

koperagen added 3 commits March 2, 2026 18:43

Add DataFrame.require API for typed selector validation

eab616b

Improve missing column error message in CS DSL

86810a0

Update require.md

df076c5

koperagen force-pushed the air/implement-dataframe.require-column-validator-9ce538e8-d branch from 62332ee to 48508dd Compare March 2, 2026 16:43

koperagen added 2 commits March 2, 2026 18:52

Rename require -> requireColumn

c036ff2

Cross-reference use cases with requireColumn in mind

0304ba4

koperagen force-pushed the air/implement-dataframe.require-column-validator-9ce538e8-d branch from 48508dd to 0304ba4 Compare March 2, 2026 16:53

Jolanrensen reviewed Mar 2, 2026

Jolanrensen approved these changes Mar 2, 2026
