GH-437: [Format] Specify VARIABLE_SIZE_LIST Logical type #438
rok wants to merge 1 commit into apache:master
|
What's the point of this? |
|
The intent was to define a variable-size list column type without repetition/definition levels. I suppose vector repetition level would address exactly this. We could reuse this PR for that purpose or just close it. |
Why would it be any better than a LIST column? VECTOR is presumably for fixed-size lists... |
|
We would want a VECTOR-like design that would allow variable-size lists without per-element definition levels. |
I think that's already possible if you have a LIST group node whose child node is REQUIRED. |
|
Even with required elements, LIST still needs repetition levels, and offsets must be derived by decoding those levels (at least over the target range), rather than read directly? |
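
To make the point about deriving offsets concrete, here is a minimal sketch of how list offsets can be reconstructed from levels. It assumes a single-level, non-nullable LIST with REQUIRED elements, so the maximum repetition and definition levels are both 1; the function name `offsets_from_levels` is hypothetical and not part of any Parquet library.

```python
def offsets_from_levels(rep_levels, def_levels, max_def_level=1):
    """Rebuild list offsets for a single-level LIST column.

    Assumed model (not taken from the PR): repetition level 0 starts a
    new list, 1 continues the current one; a definition level equal to
    max_def_level means an element is present, anything lower marks an
    empty list.
    """
    offsets = [0]
    count = 0  # number of element values seen so far
    for i, (rep, dfn) in enumerate(zip(rep_levels, def_levels)):
        if rep == 0 and i > 0:
            # A new list begins, so close the previous one.
            offsets.append(count)
        if dfn == max_def_level:
            count += 1
    if rep_levels:
        # Close the last list.
        offsets.append(count)
    return offsets


# Levels for the lists [[1, 2, 3], [], [4, 5]]
rep = [0, 1, 1, 0, 0, 1]
dfn = [1, 1, 1, 0, 1, 1]
print(offsets_from_levels(rep, dfn))
```

Every level entry has to be visited to find the list boundaries, which is the cost being contrasted with reading stored offsets directly.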
|
Well, yes, that's how Parquet works. Trying to stuff lists of opaque byte arrays doesn't sound like a tremendous idea to me. |
|
Right. This would make the format less optimizable at the element level. What would the other downsides be? |
|
The question is more whether the upsides are worth it. This hasn't been demonstrated. |
|
@rahil-c posted some performance findings on the ML, e.g. this table (I think it's all about fixed-size lists). It would be nice to have a form for lists similar to your vector proposal. |
|
@pitrou @rok These were the details of the experiment I tried locally when writing some vectors to a Parquet file as a LIST of FLOAT vs. having them backed by a FIXED_LEN_BYTE_ARRAY, as well as trying different encodings and compressions. Note that the experiment was done from the perspective of what Parquet users can try today. |
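
For readers unfamiliar with the FIXED_LEN_BYTE_ARRAY approach, the sketch below shows one way a float vector can be packed into a fixed-length byte string and read back. The little-endian float32 layout and the helper names `pack_vector` and `unpack_vector` are assumptions for illustration; the comment above does not say how the experiment encoded its vectors.

```python
import struct


def pack_vector(values):
    """Pack floats into a byte string of len(values) * 4 bytes.

    Assumes little-endian IEEE 754 float32; the actual layout used in
    the experiment is not stated.
    """
    return struct.pack(f"<{len(values)}f", *values)


def unpack_vector(blob):
    """Inverse of pack_vector: decode a byte string back into floats."""
    n = len(blob) // 4  # 4 bytes per float32
    return list(struct.unpack(f"<{n}f", blob))


vec = [1.0, 2.0, 0.5]
blob = pack_vector(vec)
print(len(blob), unpack_vector(blob))
```

Because every value has the same byte width, no repetition or definition levels are needed for the elements, but the column becomes opaque bytes as far as Parquet encodings and statistics are concerned.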
|
This is off-topic as this PR is for VARIABLE_SIZE_LIST, not FIXED_SIZE_LIST. |
|
Yes, but the performance gains are likely indicative of what would be possible here. I suppose we'd best first see the FIXED_SIZE_LIST debate play out before continuing here. |
This splits the VARIABLE_SIZE_LIST proposal out of #241, as suggested here.