PARQUET-2474: Add FIXED_SIZE_LIST logical type #241
rok wants to merge 7 commits into apache:master
Conversation
etseidl left a comment:
Interesting way to get lists without repetition.
LogicalTypes.md (outdated diff):

    ### FIXED_SIZE_LIST

    The `FIXED_SIZE_LIST` annotation represents a fixed-size list of elements
    of a primitive data type. It must annotate a `binary` primitive type.
"binary" means either fixed or variable length, right? I always get confused 😅.
Could you please provide a concrete example of how the list is structured? What about its definition & repetition levels? Intuitively, I thought we should not limit it to the binary type. For example, it would be possible to support something like int[N] or double[N], and even multi-dimensional lists like int[M][N].
I would represent the fixed-size list as a non-nested FIXED_LEN_BYTE_ARRAY + type + num_values. Multidimensional lists/arrays bring much more complexity that I'm not sure makes sense to store as a logical type (see FixedShapeTensor in Arrow). Also see #241 (comment).
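A minimal sketch (hypothetical helper, not part of the spec text) of what that flat representation amounts to: each row's fixed number of elements is packed into one fixed-width byte string, assuming float32 elements and little-endian plain encoding.

```python
import struct

def pack_fixed_size_list(values, num_values, fmt="<f"):
    """Pack one row's fixed-size list into a single FIXED_LEN_BYTE_ARRAY-style
    byte string, assuming little-endian plain encoding of each element."""
    if len(values) != num_values:
        raise ValueError("row must contain exactly num_values elements")
    return b"".join(struct.pack(fmt, v) for v in values)

# a 3-element float32 list occupies 3 * 4 = 12 bytes
row = pack_fixed_size_list([1.0, 2.0, 3.0], num_values=3)
assert len(row) == 3 * struct.calcsize("<f")
```

The point of the flat layout is exactly this: the row is a plain byte slice, with no per-element repetition or definition levels.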
Perhaps use `byte_array` in this PR (see #251).

Done.
One thing to perhaps give thought to is how this might represent nested lists; say you wanted to encode an m by n matrix, would you just encode this as a […]? I had perhaps been anticipating that fixed size list would be a variant of "REPEATED" as opposed to a physical type, that is, just able to avoid incrementing the max_def_level and max_rep_level. This would make it significantly more flexible I think, although I concede it will make it harder to implement.

cc @JFinis
src/main/thrift/parquet.thrift (outdated diff):

    struct EnumType {}    // allowed for BINARY, must be encoded with UTF-8
    struct DateType {}    // allowed for INT32
    struct Float16Type {} // allowed for FIXED[2], must be encoded as raw FLOAT16 bytes
    struct FixedSizeListType {} // see LogicalTypes.md
Something is missing here. Shouldn't this type contain the element type? And the length of the list? The length of the list could be deduced from the size of the underlying fixed_len_byte_array, but at least the element type would be necessary then.
Changed to:
struct FixedSizeListType { // allowed for FIXED_LEN_BYTE_ARRAY[num_values * width of type],
1: required Type type; // see LogicalTypes.md
2: required i32 num_values;
}
struct VariableSizeListType { // allowed for BYTE_ARRAY, see LogicalTypes.md
1: required Type type;
}
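To make the invariant implied by those structs explicit, here is a sketch (hypothetical helper names; element widths assumed from Parquet's fixed-width physical types) of how `num_values` and the element type pin down the FLBA's declared `type_length`:

```python
# Assumed widths (bytes) of fixed-width physical element types;
# BOOLEAN, INT96 and BYTE_ARRAY would not be valid element types here.
ELEMENT_WIDTH = {"INT32": 4, "INT64": 8, "FLOAT": 4, "DOUBLE": 8}

def expected_type_length(element_type: str, num_values: int) -> int:
    """type_length the enclosing FIXED_LEN_BYTE_ARRAY must declare."""
    return ELEMENT_WIDTH[element_type] * num_values

# e.g. a 128-dimensional float32 embedding needs FIXED_LEN_BYTE_ARRAY(512)
assert expected_type_length("FLOAT", 128) == 512
```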
    The sort order for `FLOAT16` is signed (with special handling of NaNs and signed zeros); it uses the same [logic](https://github.com/apache/parquet-format#sort-order) as `FLOAT` and `DOUBLE`.

    ### FIXED_SIZE_LIST
Interesting choice to annotate a binary primitive field instead of a repeated group field. I see pros and cons with this design:
PROs:
- Guarantees zero-copy, as the layout is defined to be just bytes. In contrast, if this annotated a group, a writer could decide to use a fancy per-value encoding (e.g., dictionary) and thus create a list that first has to be "decoded" before it can be used.
- Guarantees that a list is always contained on one page instead of being split over multiple pages. Again, this helps in keeping decoders easy and guaranteeing zero copy.
- This solves the problem of redundant R-Levels. Since it's just a primitive column, no r-level considerations have to be taken into account.
CONs:
- Cannot create fixed size lists of nested types (e.g., list of structs). I see that this isn't necessary for tensors or embedding vectors, but shouldn't the feature be extensible for other scenarios as well? This limits the composability of the feature. I can now create a struct of fixed size lists, but not a fixed size list of structs.
- Cannot have null elements in fixed size lists. This might not be desired for all lists, but there can be use cases where having null values in them is preferable.
- Parquet has a concept for (non-fixed size) lists. It is conceptually weird that fixed size lists are totally different from (non-fixed size) lists.
I think the PROs outweigh the CONs here, so this is fine with me. I just want everyone to be aware of the ramifications.
cc @tustvold, as you also brought up this point. I agree that having a new property of a repeated group would be more flexible, but it also comes at some cost, as outlined above. Also, it couldn't be just a logical type in this case, as a logical type cannot change the handling of R-Levels.
I'm now feeling that maybe wrapping a Vector[PrimitiveType, Size] is also OK, but currently representing this is a bit weird in the model. May I ask, could a Vector hold data like the below?
1. [1, 1, 1], [null, 1, 1] <-- data with null
2. null, [1, 1, 1] <-- null vector
And could a vector contain a "nested" vector?
- This solves the problem of redundant R-Levels. Since it's just a primitive column, no r-level considerations have to be taken into account.
This is the main reason I'd like to propose this type, see apache/arrow#34510.
- Cannot create fixed size lists of nested types (e.g., list of structs). I see that this isn't necessary for tensors or embedding vectors, but shouldn't the feature be extensible for other scenarios as well? This limits the composability of the feature. I can now create a struct of fixed size lists, but not a fixed size list of structs.
Lack of composability is a downside, but I think it's still worth the compromise. I've not seen a need for fixed_size_list(struct) in tensor computing, but that's probably just because it's not available.
- Cannot have null elements in fixed size lists. This might not be desired for all lists, but there can be use cases where having null values in them is preferable.
In tensor computation this is usually addressed with bitmasks, which can be stored as a fixed_size_list(binary, num_values).
- Parquet has a concept for (non-fixed size) lists. It is conceptually weird that fixed size lists are totally different from (non-fixed size) lists.
Perhaps we should call this type FixedSizeArray to disambiguate?
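The bitmask workaround mentioned above could look roughly like the following sketch (purely illustrative; helper name hypothetical): a sibling fixed-size column carries one validity bit per element, packed into bytes.

```python
def pack_validity_bitmask(valid, num_values):
    """Pack per-element validity flags into a fixed-width byte string
    (LSB-first bit order, as in Arrow), suitable for storage in a
    sibling fixed_size_list(binary, ...) column."""
    assert len(valid) == num_values
    nbytes = (num_values + 7) // 8
    buf = bytearray(nbytes)
    for i, v in enumerate(valid):
        if v:
            buf[i // 8] |= 1 << (i % 8)
    return bytes(buf)

# elements 0 and 2 valid -> bits 0 and 2 set -> 0b00000101
mask = pack_validity_bitmask([True, False, True], num_values=3)
assert mask == b"\x05"
```

This keeps nullability out of the list type itself, matching the position that case 1 below should be expressed outside the type.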
I'm now feeling that maybe wrapping a Vector[PrimitiveType, Size] is also OK, but currently representing this is a bit weird in the model. May I ask, could a Vector hold data like the below?
1. [1, 1, 1], [null, 1, 1] <-- data with null
2. null, [1, 1, 1] <-- null vector
And could a vector contain a "nested" vector?
I think case 2. is ok, but case 1. should be expressed with a separate null bitmask that's not part of the type.
I am not even sure what a "fixed-size list of structs" even means. Would it mean that each struct has a known size (so that each element is fixed size 🤔)? How would that work for a fixed-size list of structs where one of the struct's fields was a (non-fixed-size) list 🤔?
In other words, I am not sure the composability of fixed-size list into different element types makes a lot of sense
In other words, I am not sure the composability of fixed-size list into different element types makes a lot of sense

+1 to this. I think this comes up as theoretical compatibility with Arrow, which places no such limitations.
Apologies for taking a while to reply. I've split this into two cases:
We could start with a more general multidimensional array definition and have list be a 1-dimensional array. The additional metadata required would not be that bad. I'm just a bit scared of validation and striding logic bleeding into Parquet implementations. Do we have any other inputs/opinions?
That's interesting. What would you expect performance-wise with this approach?
etseidl left a comment:
Looking good to me. Just a few questions/comments. Thanks!
LogicalTypes.md (outdated diff):

    The `FIXED_LEN_BYTE_ARRAY` data is interpreted as a fixed size sequence of
    elements of the same primitive data type.
Should the encoding be defined as well, for instance the elements of the array are encoded in the same manner as PLAIN encoding?
Yes, that seems like a thing to specify. Changed to:
The `FIXED_LEN_BYTE_ARRAY` data is interpreted as a fixed size sequence of
elements of the same primitive data type encoded with plain encoding.
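As a sketch of what that plain-encoded interpretation means in practice (hypothetical helper; little-endian INT32 elements assumed, matching Parquet's PLAIN encoding):

```python
import struct

def decode_fixed_size_list(flba: bytes, num_values: int, fmt_char: str = "i"):
    """Interpret one FIXED_LEN_BYTE_ARRAY value as num_values plain-encoded
    (little-endian) elements of a fixed-width primitive type."""
    width = struct.calcsize("<" + fmt_char)
    if len(flba) != num_values * width:
        raise ValueError("FLBA length must equal num_values * element width")
    return list(struct.unpack(f"<{num_values}{fmt_char}", flba))

# 3 little-endian int32 values packed into a 12-byte FLBA
flba = struct.pack("<3i", 10, 20, 30)
assert decode_fixed_size_list(flba, 3) == [10, 20, 30]
```

Because the bytes are already laid out as consecutive plain-encoded elements, a reader can reinterpret the buffer in place rather than decoding element by element.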
LogicalTypes.md (outdated diff):

    ### FIXED_SIZE_LIST

    The `FIXED_SIZE_LIST` annotation represents a fixed-size list of elements
    of a primitive data type. It must annotate a `FIXED_LEN_BYTE_ARRAY` primitive type.
As written, the elements can themselves be arrays. Is this intended? Or should it be "non-array primitive data type"?
I didn't really consider the possibility of elements being arrays, and I think the non-array limitation makes sense. Changed to:

The `FIXED_SIZE_LIST` annotation represents a fixed-size list of elements
of a non-array primitive data type. It must annotate a `FIXED_LEN_BYTE_ARRAY` primitive type.
Thanks for the review @etseidl! I've updated this with your suggestions.

@ritchie46 would this be useful for your new polars Array type?
@rok is there anything I can help with? @mapleFU I saw your questions above. Are you satisfied with the answers? @coastalwhite I see you are familiar with Parquet and Array in Polars. Do you think this proposal is useful for your project?
I like the general idea of moving […]. The one potentially large upside I can imagine of this is getting dictionary encoding for arrays, but I am not sure how common that will be in real-world scenarios. In general, I would say I am in favor, although I am not 100% convinced yet that the added complexity will result in significant performance, file size, or other benefits.
@coastalwhite there is a 10x penalty in Polars 1.9.0 Parquet reading as well when using this snippet: apache/arrow#34510 (comment)
@alippai thanks for pinging. I was advised on the Parquet sync call to re-open a ML discussion on this, but I need a couple of weeks to get to it. If you'd like, you can start it now; here's the existing thread: https://lists.apache.org/thread/xot5f3ghhtc82n1bf0wdl9zqwlrzqks3
Thank you for bringing that to my attention. Still, I feel like that is more of a bug than an inherent performance problem in the Parquet file format. However, it is probably easier to optimize for what is proposed in this PR.
@rok based on the ML discussion we should add the fast path in the cases of polars, arrow and arrow-rs where we know the fixed size already (from the schema stored in the metadata, or if it's provided by the consumer). This is more fragile and less universal, but maybe a good first step forward.
@alippai are you sure we have a strong enough consensus yet to start implementing fast paths? I would really like to have some more discussion before committing.
@rok Sorry, wrong phrasing. I meant that exploring this was the recommendation on the ML and by @coastalwhite. I didn't see objections to adding this feature to the Parquet format, or commitments to adding the fast path to any of the libraries (arrow cpp actually noted it's a non-trivial part of the codebase).
Sorry for my abundance of caution @alippai. I'll try to summarize this thread to the ML and ask for some more input ASAP. It would be nice to actually start some work on this.
Some points in no particular order: […]

That's all to say that providing a way to encode fixed-size lists seems like a very useful capability. That being said, it does seem to be a bit of a hack to make this a logical type, and it will potentially limit the options for encodings, statistics, sort orders, etc. In particular, the lack of dictionary encoding I could see being a non-trivial sacrifice.

1. In fact I think arrow-rs may be one of the few readers that actually implements it
Ping here for visibility since this PR was recently mentioned on the mailing list. I'd be happy to push this forward in whatever way.
Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
I still believe this is very useful:
@mhaseeb123 @Matt711 Is this interesting for cuDF?
This is quite an old PR so I am reading earlier comments to understand what's going on. Please correct me if I am wrong anywhere or missed something. This PR adds a new […]

Couple of quick questions: […]

Assuming all of the above is true, from cuDF's perspective we would definitely get some speed boost from not having to decode (and write) levels data for such types, and viewing the data as lists would also be trivial. To that effect, if this type is added to Parquet, I would prefer that it support more than just FLBAs to make the effort to support it worthwhile. CC'ing @pmattione-nvidia as he can speak more on the overhead incurred from decoding levels data in the last libcudf version.
Thanks @rok for revisiting this; if you need any help on this, let me know.
Hey @mhaseeb123!
Thanks for taking the time!
The [num_values x FLBA] rows themselves would be nullable with definition levels. What we wouldn't have is intra-row nullability.
FLBA is meant as the physical type (container) for an arbitrary FIXED_SIZE_LIST(type, size) logical type. All fixed-length physical types are allowed.
Encoding here is meant for byte layout within the FLBA. FIXED_SIZE_LIST columns can use any encoding that supports FLBA (plain, dictionary, delta_byte_array, byte_stream_split).
As stated above, this already supports all fixed-width primitive types as elements — FLBA is just the container. Glad to hear this would be useful for cuDF!
Based on the discussion I made the language a bit more verbose and explicit: bc3df18. I'd love to hear more feedback; will also ping the ML.
    struct DateType {}    // allowed for INT32
    struct Float16Type {} // allowed for FIXED[2], must be encoded as raw FLOAT16 bytes (see LogicalTypes.md)
    struct FixedSizeListType { // allowed for FIXED_LEN_BYTE_ARRAY; see LogicalTypes.md
      1: required Type type; // element type (fixed-width primitive; must not be BOOLEAN, INT96, or BYTE_ARRAY)
It might make sense to introduce a new enum for the list element types. The Type enum does not distinguish smaller integer types, signed/unsigned types or the float16 type.
Good point, decimal is another type we'd lose annotation for. To avoid a new enum, how about optional LogicalType:
struct FixedSizeListType {
1: required Type type; // element type (fixed-width primitive)
2: required i32 num_values;
3: optional LogicalType element_logical_type; // optional semantic annotation of elements
}
Adding a logical type could work, and it would then even support nested lists or matrices. It's not immediately obvious, but Type could not support that since the length of FIXED_LEN_BYTE_ARRAY is stored in SchemaElement.
What I don't like is that here the logical type is used to influence the physical layout, whereas elsewhere a PLAIN-encoded INT32 with logical type INT_8 would still be stored using 4 bytes.
Hm, thinking out loud a bit, the physical width is already defined by type_length of FIXED_LEN_BYTE_ARRAY / num_values. The logical type should then be enough to interpret these bytes, without the Type field. The only blocker for that is that there is no logical type annotation to indicate FLOAT or DOUBLE.
The only blocker for that is that there is no logical type annotation to indicate FLOAT or DOUBLE.
Yes, I think we need either Type and Enum as you originally suggested or Type and optional LogicalType. I slightly prefer LogicalType because we already define it. Shall I update the language to sketch the LogicalType path?
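The width derivation discussed above can be sketched as follows (hypothetical helper, not part of the proposal): the per-element width falls out of the FLBA's type_length and num_values, so a logical type would only need to disambiguate equal-width types.

```python
def element_width(type_length: int, num_values: int) -> int:
    """Derive the element width (bytes) implied by the enclosing
    FIXED_LEN_BYTE_ARRAY's declared type_length."""
    if num_values <= 0 or type_length % num_values != 0:
        raise ValueError("type_length must be a positive multiple of num_values")
    return type_length // num_values

# e.g. FIXED_LEN_BYTE_ARRAY(32) with num_values=8 implies 4-byte elements
# (FLOAT or INT32; a logical type annotation would disambiguate the two)
assert element_width(32, 8) == 4
```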
For the record, we have a new proposal on the mailing list and here that is relevant for this discussion. It would be great to get eyes on it and decide if we would rather go with it. cc @tustvold @JFinis @jhorstmann @wgtmac @alippai @coastalwhite @etseidl @mapleFU @mhaseeb123 @rahil-c @adamreeve @emkornfield
I like the new proposal. This previous proposal means readers could fall back to reading the primitive values if they don't understand the new logical type, whereas there wouldn't be a fallback path for readers that don't understand the new repetition type, but I think that's acceptable given it allows the use of encodings that are better suited to the element values (e.g. byte stream split for floats, or maybe ALP in the future).

I think the vectors themselves would be non-optional, but the child elements should be allowed to be optional?
@pitrou's proposal would increase neither the repetition nor definition levels, so that implies vectors and their elements are non-optional. If we want to allow optional vectors and elements, we'd need some kind of 3-level structure like what currently exists for lists. This would eliminate some repetition level decoding, but would still require extra definition level handling, so I wonder if we'd see as much of a decoding speed improvement.
Ah yes, you're right, sorry. And I didn't mean to suggest that nullable elements should be supported; I'd just misunderstood how this would work. For the scenarios where I see this being used, zero or NaN is often used in place of nulls, so I think it's fine to only support required vectors and elements.
The application layer can also easily implement its own null mask with a boolean vector under the new proposal.
As proposed in apache/arrow#34510 and on the ML: PARQUET-2474.
Arrow recently introduced FixedShapeTensor and VariableShapeTensor canonical extension types that use FixedSizeList and StructArray(List, FixedSizeList) as storage, respectively. These are targeted at machine learning and scientific applications that deal with large datasets and would benefit from using Parquet as on-disk storage.
However, FixedSizeList is currently stored as List in Parquet, which adds significant conversion overhead when reading and writing, as discussed here. It would therefore be beneficial to introduce a FIXED_SIZE_LIST logical type to Parquet.