
GH-437: [Format] Specify VARIABLE_SIZE_LIST Logical type #438

Draft
rok wants to merge 1 commit into apache:master from rok:VARIABLE_SIZE_LIST


Member

rok commented Jun 24, 2024

This is to split VARIABLE_SIZE_LIST proposal from #241 as suggested here.

GitHub issue

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

Member

pitrou commented Mar 5, 2026

What's the point of this?

Member Author

rok commented Mar 5, 2026

The intent was to define a variable-size list column type without repetition/definition levels. I suppose a vector repetition level would address exactly this. We could reuse this PR for that purpose or just close it.

Member

pitrou commented Mar 5, 2026

> The intent was to define a variable sized list column type without repetition/definition levels

Why would it be any better than a LIST column? VECTOR is presumably for fixed-size lists...

Member Author

rok commented Mar 5, 2026

We would want a VECTOR-like design that would allow variable-size lists without per-element definition levels.

Member

pitrou commented Mar 5, 2026

> We would want a VECTOR-like design that would allow variable-size lists without per-element definition levels.

I think that's already possible if you have a LIST group node whose child node is REQUIRED.

Member Author

rok commented Mar 5, 2026

Even with required elements, LIST still needs repetition levels, and offsets must be derived by decoding those levels (at least over the target range), rather than read directly?
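
A minimal sketch of what "deriving offsets by decoding levels" involves, assuming a single-level list with required elements and no null or empty lists (so definition levels can be ignored). The helper name `offsets_from_rep_levels` is hypothetical and not part of any Parquet library:

```python
from typing import List


def offsets_from_rep_levels(rep_levels: List[int]) -> List[int]:
    """Hypothetical helper: rebuild list offsets from repetition levels.

    Assumes one level of nesting, required elements, and no null or
    empty lists. A repetition level of 0 marks the first element of a
    new list; a level of 1 continues the current list.
    """
    offsets = []
    for i, level in enumerate(rep_levels):
        if level == 0:
            # A new list starts at value index i.
            offsets.append(i)
    # The final offset closes the last list.
    offsets.append(len(rep_levels))
    return offsets


# Three lists of lengths 3, 1 and 2.
rep_levels = [0, 1, 1, 0, 0, 1]
offsets = offsets_from_rep_levels(rep_levels)
print(offsets)  # [0, 3, 4, 6]
```

The point of the sketch is that every level in the target range has to be scanned to find list boundaries, whereas an explicit offsets array could be read directly.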

Member

pitrou commented Mar 5, 2026

Well, yes, that's how Parquet works. Trying to stuff lists of opaque byte arrays doesn't sound like a tremendous idea to me.

Member Author

rok commented Mar 5, 2026

Right. This would make the format less optimizable at the element level. What would the other downsides be?

Member

pitrou commented Mar 5, 2026

The question is more whether the upsides are worth it. This hasn't been demonstrated.

Member Author

rok commented Mar 5, 2026

@rahil-c posted some performance findings on the ML, e.g. this table (I think it's all about fixed-size lists). It would be nice to have a form for lists similar to your vector proposal.


rahil-c commented Mar 5, 2026

@pitrou @rok These are the details of the experiment I tried locally, writing some vectors to a Parquet file as a LIST of FLOAT versus backing them with a FIXED_LEN_BYTE_ARRAY, as well as playing around with different encodings and compressions. Note that the experiment was done from the perspective of what Parquet users can try today:
https://lists.apache.org/thread/q9b2lbz8h9loodpzso98wnj1x2tcr20h
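
As a rough illustration of the FIXED_LEN_BYTE_ARRAY approach (not the code used in the experiment), a fixed-size float vector can be packed into a fixed-length byte string and unpacked again. The dimension `DIM`, the little-endian float32 layout, and the helper names are assumptions made for this sketch:

```python
import struct

DIM = 4  # assumed vector dimension; each value is a 4-byte float32
FMT = f"<{DIM}f"  # assumed layout: little-endian float32 values


def pack_vector(values):
    """Hypothetical helper: encode DIM floats as DIM * 4 bytes."""
    if len(values) != DIM:
        raise ValueError(f"expected {DIM} values, got {len(values)}")
    return struct.pack(FMT, *values)


def unpack_vector(blob):
    """Hypothetical helper: decode the bytes back into a list of floats."""
    return list(struct.unpack(FMT, blob))


blob = pack_vector([1.0, 2.0, 3.5, -4.0])
print(len(blob))            # 16
print(unpack_vector(blob))  # [1.0, 2.0, 3.5, -4.0]
```

Every value has the same byte length, so no repetition or definition levels are needed per element, but the column contents become opaque bytes to the format.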

Member

pitrou commented Mar 5, 2026

This is off-topic as this PR is for VARIABLE_SIZE_LIST, not FIXED_SIZE_LIST.

Member Author

rok commented Mar 5, 2026

Yes, but the performance gains are likely indicative of what would be possible here. I suppose we had best see the FIXED_SIZE_LIST debate play out first before continuing here.

