feat(arrow-avro): Configurable Arrow timezone ID for Avro timestamps#9280
feat(arrow-avro): Configurable Arrow timezone ID for Avro timestamps#9280mzabaluev wants to merge 15 commits intoapache:mainfrom
Conversation
In the ReaderBuilder API, add a new method `with_tz` that allows users to specify the timezone ID for Avro logical types that represent UTC timestamps. The choices are between "+00:00" and "UTC" and can be selected by the new Tz enumeration.
|
This makes more sense in my opinion than using "+00:00" exclusively |
3d58167 to
c61f7e2
Compare
jecsand838
left a comment
There was a problem hiding this comment.
@mzabaluev LMGTM. Really like this improvement to arrow-avro!
For me the big things are:
- An e2e test in
reader/mod.rsusing the newwith_tz. - Better documentation on the public
with_tzmethod.
| /// Sets the timezone representation for Avro timestamp fields. | ||
| pub fn with_tz(mut self, tz: Tz) -> Self { | ||
| self.use_tz = tz; | ||
| self | ||
| } | ||
|
|
There was a problem hiding this comment.
Would you mind adding an e2e test using this?
Also may help to explain in the doc comment what the default value is.
| Codec::TimestampMillis(Some(Tz::OffsetZero)) => Self::TimestampMillisUtc, | ||
| Codec::TimestampMillis(Some(Tz::Utc)) => Self::TimestampMillisUtc, | ||
| Codec::TimestampMillis(None) => Self::TimestampMillisLocal, | ||
| Codec::TimestampMicros(Some(Tz::OffsetZero)) => Self::TimestampMicrosUtc, | ||
| Codec::TimestampMicros(Some(Tz::Utc)) => Self::TimestampMicrosUtc, | ||
| Codec::TimestampMicros(None) => Self::TimestampMicrosLocal, | ||
| Codec::TimestampNanos(Some(Tz::OffsetZero)) => Self::TimestampNanosUtc, | ||
| Codec::TimestampNanos(Some(Tz::Utc)) => Self::TimestampNanosUtc, | ||
| Codec::TimestampNanos(None) => Self::TimestampNanosLocal, |
There was a problem hiding this comment.
This maybe a bit cleaner.
| Codec::TimestampMillis(Some(Tz::OffsetZero)) => Self::TimestampMillisUtc, | |
| Codec::TimestampMillis(Some(Tz::Utc)) => Self::TimestampMillisUtc, | |
| Codec::TimestampMillis(None) => Self::TimestampMillisLocal, | |
| Codec::TimestampMicros(Some(Tz::OffsetZero)) => Self::TimestampMicrosUtc, | |
| Codec::TimestampMicros(Some(Tz::Utc)) => Self::TimestampMicrosUtc, | |
| Codec::TimestampMicros(None) => Self::TimestampMicrosLocal, | |
| Codec::TimestampNanos(Some(Tz::OffsetZero)) => Self::TimestampNanosUtc, | |
| Codec::TimestampNanos(Some(Tz::Utc)) => Self::TimestampNanosUtc, | |
| Codec::TimestampNanos(None) => Self::TimestampNanosLocal, | |
| Codec::TimestampMillis(Some(_)) => Self::TimestampMillisUtc, | |
| Codec::TimestampMillis(None) => Self::TimestampMillisLocal, | |
| Codec::TimestampMicros(Some(_)) => Self::TimestampMicrosUtc, | |
| Codec::TimestampMicros(None) => Self::TimestampMicrosLocal, | |
| Codec::TimestampNanos(Some(_)) => Self::TimestampNanosUtc, | |
| Codec::TimestampNanos(None) => Self::TimestampNanosLocal, |
There was a problem hiding this comment.
I was wary of matching Some(_) because if Tz ever gets non-UTC variants, the suggested code will still compile, but will be wrong.
| Self::TimestampMillis(tz) => DataType::Timestamp( | ||
| TimeUnit::Millisecond, | ||
| tz.as_ref().map(|tz| tz.to_string().into()), | ||
| ), | ||
| Self::TimestampMicros(tz) => DataType::Timestamp( | ||
| TimeUnit::Microsecond, | ||
| tz.as_ref().map(|tz| tz.to_string().into()), | ||
| ), | ||
| Self::TimestampNanos(tz) => DataType::Timestamp( | ||
| TimeUnit::Nanosecond, | ||
| tz.as_ref().map(|tz| tz.to_string().into()), | ||
| ), |
There was a problem hiding this comment.
Then down here you could do:
| Self::TimestampMillis(tz) => DataType::Timestamp( | |
| TimeUnit::Millisecond, | |
| tz.as_ref().map(|tz| tz.to_string().into()), | |
| ), | |
| Self::TimestampMicros(tz) => DataType::Timestamp( | |
| TimeUnit::Microsecond, | |
| tz.as_ref().map(|tz| tz.to_string().into()), | |
| ), | |
| Self::TimestampNanos(tz) => DataType::Timestamp( | |
| TimeUnit::Nanosecond, | |
| tz.as_ref().map(|tz| tz.to_string().into()), | |
| ), | |
| Self::TimestampMillis(tz) => DataType::Timestamp( | |
| TimeUnit::Millisecond, | |
| tz.as_ref().map(|tz| tz.as_str().into()), | |
| ), | |
| Self::TimestampMicros(tz) => DataType::Timestamp( | |
| TimeUnit::Microsecond, | |
| tz.as_ref().map(|tz| tz.as_str().into()), | |
| ), | |
| Self::TimestampNanos(tz) => DataType::Timestamp( | |
| TimeUnit::Nanosecond, | |
| tz.as_ref().map(|tz| tz.as_str().into()), | |
| ), |
|
Marking as draft as I think this PR is no longer waiting on feedback and I am trying to make it easier to find PRs in need of review. Please mark it as ready for review when it is ready for another look |
Add the as_str method to avro Tz as a more performant way to get the timezone ID string. Co-authored-by: Connor Sanders <170039284+jecsand838@users.noreply.github.com>
Add with_tz method to AsyncAvroFileReaderBuilder to allow configuring the time zone representation for Avro timestamp fields the same way as the synchronous Avro reader.
|
@alamb I have added tests, the same builder option is also added to the async reader now that it's merged. |
Which issue does this PR close?
Rationale for this change
Enable an alternative representation of UTC timestamp data types with the "UTC" timezone ID, which is useful for interoperability with applications preferring that form.
What changes are included in this PR?
In the
ReaderBuilderAPI, add a new methodwith_tzthat allows users to specify the timezone ID for Avro logical types that represent UTC timestamps. The choices are between "+00:00" and "UTC" and can be selected by the newTzenumeration.Are these changes tested?
Added unit tests to verify the representation with different
Tzparameter values.Are there any user-facing changes?
A new
with_tzmethod is added toarrow_avro::reader::Builder.