feat: Parse new, merged data files from electricity consumption data (#262)
martyngigg merged 5 commits into main
Conversation
Allow parsing a file that is a concatenation of several copies of the single-file format. Removed the skip_rows option in favour of inferring the start position.
Load the new batch daily files
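The concatenated format described above can be handled by treating each repeated header row as a section boundary rather than skipping a fixed number of rows. The sketch below illustrates the idea; the `DateTime` header prefix and the function names are assumptions for illustration, not the actual identifiers from the ingestion script.

```python
# Sketch: split a file that is several single-file sections concatenated
# together, using each repeated header row as a section boundary.
# "DateTime" and the function names are illustrative assumptions.
from io import StringIO

import pandas as pd


def split_sections(lines: list[str], header_prefix: str = "DateTime") -> list[list[str]]:
    """Return one list of lines per section, each starting at a header row."""
    starts = [i for i, line in enumerate(lines) if line.startswith(header_prefix)]
    if not starts:
        raise ValueError("no header row found; cannot infer start position")
    ends = starts[1:] + [len(lines)]
    return [lines[s:e] for s, e in zip(starts, ends)]


def read_concatenated_csv(text: str) -> pd.DataFrame:
    """Parse every section and stack them into a single DataFrame."""
    sections = split_sections(text.splitlines())
    frames = [pd.read_csv(StringIO("\n".join(sec))) for sec in sections]
    return pd.concat(frames, ignore_index=True)
```

Because the header row itself marks the start of each section, this removes the need for a caller-supplied skip_rows value.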
📝 Walkthrough
This pull request migrates electricity SharePoint data ingestion from the accelerator domain to the estates domain. The accelerator implementation is removed entirely, whilst a new estates-specific implementation is introduced with enhanced functionality. Supporting dbt transformation models and source mappings are updated to reflect the estates domain, and a Trino execution utility is added for infrastructure operations.
Actionable comments posted: 3
🧹 Nitpick comments (1)
infra/local/trino-execute.sh (1)
26-26: Consider noting the version dependency on docker-compose.yml.
The Trino image version `trinodb/trino:477` is hardcoded here but also defined in `docker-compose.yml` (as `x-trino-image`). If the version is updated in one place, the other may be overlooked. Consider adding a comment noting this dependency, or extracting the version to a shared location (e.g., an environment variable in `env-local`).
env-local).🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@infra/local/trino-execute.sh` at line 26, The Trino image version is hardcoded as "trinodb/trino:477" in the script while the same version is defined as x-trino-image in docker-compose.yml; update the script to either add a clear comment above the "trinodb/trino:477" reference noting that this must match the docker-compose.yml x-trino-image entry or refactor to read the version from a shared source (e.g., an environment variable defined in env-local) so both places stay in sync; locate the "trinodb/trino:477" string in infra/local/trino-execute.sh and either add the explanatory comment referencing docker-compose.yml and x-trino-image or change the image string to use the shared environment variable.
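One way to keep the two version definitions in sync is sketched below. The `TRINO_IMAGE` variable name and the `env-local` sourcing pattern are assumptions; the fallback tag must still be kept in step with `x-trino-image` in `docker-compose.yml`.

```shell
#!/usr/bin/env sh
# Sketch: read the Trino image from a shared environment file, falling back
# to the pinned tag. TRINO_IMAGE and the env-local path are assumptions;
# the fallback should match x-trino-image in docker-compose.yml.
[ -f ./env-local ] && . ./env-local
TRINO_IMAGE="${TRINO_IMAGE:-trinodb/trino:477}"
echo "using image: $TRINO_IMAGE"
# docker run --rm "$TRINO_IMAGE" ...   # actual invocation elided
```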
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@warehouses/facility_ops_landing/ingest/estates/electricity_sharepoint/.dlt/config.toml`:
- Line 2: The config currently sets backfill = false which prevents rdm_data()
from reading historical files (it only reads **/*.xlsx and **/*-daily.csv when
backfill is true); change the config to set backfill = true for the initial
rollout (or alternatively seed/migrate the existing accelerator table before
repointing dbt to the new estates landing source) so that rdm_data() ingests the
historical **/*.xlsx and **/*-daily.csv slices and you don’t expose only the
live *-ISIS.csv data on first run.
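For the initial rollout, the change suggested above amounts to something like the following in `.dlt/config.toml`. Only the `backfill` flag comes from the review comment; any surrounding key layout in the real file is not shown here.

```toml
# Sketch of the suggested change: enable backfill once so the historical
# **/*.xlsx and **/*-daily.csv files are ingested, then revert to false.
backfill = true
```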
In
`@warehouses/facility_ops_landing/ingest/estates/electricity_sharepoint/electricity_sharepoint.py`:
- Around line 135-138: The XLSX branch currently only renames "Time" to
"DateTime" so the power column "Total Power (MW)" doesn't match the CSV branch's
"ISIS Elec Total Power {MW}" header; update the code in
electricity_sharepoint.py (the df assignment/rename block where df is created
and to_utc is applied) to also rename "Total Power (MW)" to "ISIS Elec Total
Power {MW}" (or add a conditional rename if the source column exists) so
downstream mapping (stg_electricity_sharepoint_rdm_data.sql expecting
isis_elec_total_power_mwx) receives the same column name as the CSV path.
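A conditional rename along the lines the comment describes could look like this sketch. The column names are taken from the comment; the mapping dict and function names are illustrative.

```python
# Sketch: align the XLSX branch's column names with the CSV branch before
# downstream mapping. Renames are applied only when the source column exists.
import pandas as pd

XLSX_TO_CSV_COLUMNS = {
    "Time": "DateTime",
    "Total Power (MW)": "ISIS Elec Total Power {MW}",
}


def align_xlsx_columns(df: pd.DataFrame) -> pd.DataFrame:
    present = {old: new for old, new in XLSX_TO_CSV_COLUMNS.items() if old in df.columns}
    return df.rename(columns=present)
```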
- Around line 59-61: The to_utc function currently localizes naive timestamps
without handling DST edge cases; update to_utc(ts: pd.Series) so tz_localize
includes explicit DST handling (e.g., ambiguous='infer' and
nonexistent='shift_forward' or another suitable policy) before converting to
UTC, and then remove or replace the incorrect exception handler that catches
pytz.exceptions.AmbiguousTimeError in csv_section_to_df() /
read_power_consumption_excel()—either delete that try/except block or change it
to catch ValueError (the exception pandas raises for ambiguous/nonexistent
times) so DST boundary cases no longer crash at runtime.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 3813427c-ffe9-408a-bcaa-9a0381825a33
⛔ Files ignored due to path filters (1)
`warehouses/facility_ops_landing/ingest/estates/electricity_sharepoint/electricity_sharepoint.py.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (9)
- infra/local/trino-execute.sh
- warehouses/facility_ops/transform/dbt_project.yml
- warehouses/facility_ops/transform/models/staging/estates/electricity_sharepoint/_electricity_sharepoint__sources.yml
- warehouses/facility_ops/transform/models/staging/estates/electricity_sharepoint/stg_electricity_sharepoint_rdm_data.sql
- warehouses/facility_ops/transform/models/staging/estates/electricity_sharepoint/stg_electricity_sharepoint_rdm_data.yml
- warehouses/facility_ops_landing/ingest/accelerator/electricity_sharepoint/.dlt/config.toml
- warehouses/facility_ops_landing/ingest/accelerator/electricity_sharepoint/electricity_sharepoint.py
- warehouses/facility_ops_landing/ingest/estates/electricity_sharepoint/.dlt/config.toml
- warehouses/facility_ops_landing/ingest/estates/electricity_sharepoint/electricity_sharepoint.py
💤 Files with no reviewable changes (2)
- warehouses/facility_ops_landing/ingest/accelerator/electricity_sharepoint/electricity_sharepoint.py
- warehouses/facility_ops_landing/ingest/accelerator/electricity_sharepoint/.dlt/config.toml
Summary
An automation has been implemented in the sharepoint source storing the electricity data. It combines the separate chunks into daily files to make working with the files more manageable over time. This updates the ingestion script to deal with it.
The schema for the source has been moved to the estates domain.
A small script to run a SQL script against the local docker setup has been added to aid with testing.
Fixes #252