feat: Parse new, merged data files from electricity consumption data #262

Merged
martyngigg merged 5 commits into main from
electricity-sharepoint-parse-merged-data
Mar 17, 2026
Conversation


@martyngigg martyngigg commented Mar 17, 2026

Summary

An automation has been implemented in the SharePoint source that stores the electricity data. It combines the separate chunks into daily files, making them more manageable to work with over time. This PR updates the ingestion script to handle the new format.

The schema for the source has been moved to the estates domain.

A small script to run a SQL script against the local docker setup has been added to aid with testing.

Fixes #252

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced electricity consumption data pipeline for the estates domain with enhanced capabilities for historical data backfilling and incremental ingestion.
  • Chores

    • Added utility script for local Trino command execution.

Allow parsing a file that is a concatenation of the
single-file format. Removed the skip_rows option in favour
of inferring the start position.
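The inferred-start-position approach from the commit message could be sketched roughly as follows. This is a minimal sketch only: the header marker, column names, and function name are assumptions for illustration, not the actual implementation in electricity_sharepoint.py.

```python
import io

import pandas as pd

# Assumed header marker; the real files' first column name may differ.
HEADER_PREFIX = "DateTime"


def split_concatenated_csv(text: str) -> list[pd.DataFrame]:
    """Split a file that is several single-file CSVs concatenated together.

    Instead of a fixed skip_rows, the start of each section is inferred by
    locating the lines that begin with the header marker; everything before
    the first header (preamble, metadata) is skipped.
    """
    lines = text.splitlines()
    starts = [i for i, line in enumerate(lines) if line.startswith(HEADER_PREFIX)]
    sections = []
    for i, start in enumerate(starts):
        end = starts[i + 1] if i + 1 < len(starts) else len(lines)
        block = "\n".join(lines[start:end])
        sections.append(pd.read_csv(io.StringIO(block)))
    return sections
```

Each section is then a self-contained DataFrame that can be parsed exactly like the original single-file format.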

coderabbitai bot commented Mar 17, 2026

📝 Walkthrough

Walkthrough

This pull request migrates electricity SharePoint data ingestion from the accelerator domain to the estates domain. The accelerator implementation is removed entirely, whilst a new estates-specific implementation is introduced with enhanced functionality. Supporting dbt transformation models and source mappings are updated to reflect the estates domain, and a Trino execution utility is added for infrastructure operations.

Changes

  • Infrastructure Utilities
    Files: infra/local/trino-execute.sh
    New Bash script enabling Trino command execution through Traefik with TLS termination, accepting catalog and command arguments.
  • dbt Transform Configuration
    Files: warehouses/facility_ops/transform/dbt_project.yml, warehouses/facility_ops/transform/models/staging/estates/electricity_sharepoint/*
    Added estates schema mapping in dbt project configuration; renamed electricity_sharepoint source from accelerator to estates; updated SQL model to reference estates source.
  • Data Ingestion - Accelerator (Removed)
    Files: warehouses/facility_ops_landing/ingest/accelerator/electricity_sharepoint/.dlt/config.toml, warehouses/facility_ops_landing/ingest/accelerator/electricity_sharepoint/electricity_sharepoint.py
    Removed accelerator electricity_sharepoint M365 configuration and entire Python module (170 lines), eliminating transformer, resource definitions, and CLI orchestration for this data path.
  • Data Ingestion - Estates (New)
    Files: warehouses/facility_ops_landing/ingest/estates/electricity_sharepoint/.dlt/config.toml, warehouses/facility_ops_landing/ingest/estates/electricity_sharepoint/electricity_sharepoint.py
    Added estates electricity_sharepoint configuration with SharePoint site and root directory; introduced new Python module (251 lines) with enhanced DLT pipeline including CSV/Excel parsing, parallel file processing, optional backfill logic, and incremental ingestion via latest timestamp detection.

Poem

🐰 From accelerator to estates we hop,
Electricity data shall never stop,
New backfill magic, enhanced with care,
SharePoint files merge in the air! ⚡
A migration swift, the future looks bright!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 66.67%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (4 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Linked Issues check: ✅ Passed. The PR implements the core requirement from #252: updating the ingestion to handle merged/concatenated files that estates now produce to reduce cost.
  • Out of Scope Changes check: ✅ Passed. All changes are directly related to supporting merged electricity data files: updating the parser, adding configuration, introducing a utility script, and moving the pipeline from accelerator to estates.
  • Title check: ✅ Passed. The title accurately describes the main objective of the changeset: updating the electricity consumption data ingestion to handle new merged/daily data files from SharePoint.



@martyngigg changed the title from "feat: Parse new, merged data files" to "feat: Parse new, merged data files from electricity consumption data" on Mar 17, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (1)
infra/local/trino-execute.sh (1)

26-26: Consider noting the version dependency on docker-compose.yml.

The Trino image version trinodb/trino:477 is hardcoded here but also defined in docker-compose.yml (as x-trino-image). If the version is updated in one place, the other may be overlooked.

Consider adding a comment noting this dependency, or extracting the version to a shared location (e.g., an environment variable in env-local).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@infra/local/trino-execute.sh` at line 26, The Trino image version is
hardcoded as "trinodb/trino:477" in the script while the same version is defined
as x-trino-image in docker-compose.yml; update the script to either add a clear
comment above the "trinodb/trino:477" reference noting that this must match the
docker-compose.yml x-trino-image entry or refactor to read the version from a
shared source (e.g., an environment variable defined in env-local) so both
places stay in sync; locate the "trinodb/trino:477" string in
infra/local/trino-execute.sh and either add the explanatory comment referencing
docker-compose.yml and x-trino-image or change the image string to use the
shared environment variable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@warehouses/facility_ops_landing/ingest/estates/electricity_sharepoint/.dlt/config.toml`:
- Line 2: The config currently sets backfill = false which prevents rdm_data()
from reading historical files (it only reads **/*.xlsx and **/*-daily.csv when
backfill is true); change the config to set backfill = true for the initial
rollout (or alternatively seed/migrate the existing accelerator table before
repointing dbt to the new estates landing source) so that rdm_data() ingests the
historical **/*.xlsx and **/*-daily.csv slices and you don’t expose only the
live *-ISIS.csv data on first run.
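
As a hedged sketch of the suggested change (the section and key names are assumptions based on this comment; the actual config.toml layout may differ):

```toml
# warehouses/facility_ops_landing/ingest/estates/electricity_sharepoint/.dlt/config.toml
# Hypothetical layout; key names are assumptions, not the real file.
[sources.electricity_sharepoint]
backfill = true  # read the historical **/*.xlsx and **/*-daily.csv slices on first run
```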

In
`@warehouses/facility_ops_landing/ingest/estates/electricity_sharepoint/electricity_sharepoint.py`:
- Around line 135-138: The XLSX branch currently only renames "Time" to
"DateTime" so the power column "Total Power (MW)" doesn't match the CSV branch's
"ISIS Elec Total Power {MW}" header; update the code in
electricity_sharepoint.py (the df assignment/rename block where df is created
and to_utc is applied) to also rename "Total Power (MW)" to "ISIS Elec Total
Power {MW}" (or add a conditional rename if the source column exists) so
downstream mapping (stg_electricity_sharepoint_rdm_data.sql expecting
isis_elec_total_power_mwx) receives the same column name as the CSV path.
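
The suggested conditional rename could look roughly like this (a sketch only; the column names are taken from the review comment above and the helper name is hypothetical):

```python
import pandas as pd

# Mapping from XLSX headers to the CSV branch's headers; names are taken
# from the review comment and are assumptions about the real files.
COLUMN_MAP = {
    "Time": "DateTime",
    "Total Power (MW)": "ISIS Elec Total Power {MW}",
}


def harmonise_xlsx_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Rename only the XLSX columns present in df, so both the XLSX and CSV
    branches emit identical headers for the downstream staging model."""
    present = {old: new for old, new in COLUMN_MAP.items() if old in df.columns}
    return df.rename(columns=present)
```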
- Around line 59-61: The to_utc function currently localizes naive timestamps
without handling DST edge cases; update to_utc(ts: pd.Series) so tz_localize
includes explicit DST handling (e.g., ambiguous='infer' and
nonexistent='shift_forward' or another suitable policy) before converting to
UTC, and then remove or replace the incorrect exception handler that catches
pytz.exceptions.AmbiguousTimeError in csv_section_to_df() /
read_power_consumption_excel()—either delete that try/except block or change it
to catch ValueError (the exception pandas raises for ambiguous/nonexistent
times) so DST boundary cases no longer crash at runtime.

---

Nitpick comments:
In `@infra/local/trino-execute.sh`:
- Line 26: The Trino image version is hardcoded as "trinodb/trino:477" in the
script while the same version is defined as x-trino-image in docker-compose.yml;
update the script to either add a clear comment above the "trinodb/trino:477"
reference noting that this must match the docker-compose.yml x-trino-image entry
or refactor to read the version from a shared source (e.g., an environment
variable defined in env-local) so both places stay in sync; locate the
"trinodb/trino:477" string in infra/local/trino-execute.sh and either add the
explanatory comment referencing docker-compose.yml and x-trino-image or change
the image string to use the shared environment variable.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3813427c-ffe9-408a-bcaa-9a0381825a33

📥 Commits

Reviewing files that changed from the base of the PR and between 121607e and 862e859.

⛔ Files ignored due to path filters (1)
  • warehouses/facility_ops_landing/ingest/estates/electricity_sharepoint/electricity_sharepoint.py.lock is excluded by !**/*.lock
📒 Files selected for processing (9)
  • infra/local/trino-execute.sh
  • warehouses/facility_ops/transform/dbt_project.yml
  • warehouses/facility_ops/transform/models/staging/estates/electricity_sharepoint/_electricity_sharepoint__sources.yml
  • warehouses/facility_ops/transform/models/staging/estates/electricity_sharepoint/stg_electricity_sharepoint_rdm_data.sql
  • warehouses/facility_ops/transform/models/staging/estates/electricity_sharepoint/stg_electricity_sharepoint_rdm_data.yml
  • warehouses/facility_ops_landing/ingest/accelerator/electricity_sharepoint/.dlt/config.toml
  • warehouses/facility_ops_landing/ingest/accelerator/electricity_sharepoint/electricity_sharepoint.py
  • warehouses/facility_ops_landing/ingest/estates/electricity_sharepoint/.dlt/config.toml
  • warehouses/facility_ops_landing/ingest/estates/electricity_sharepoint/electricity_sharepoint.py
💤 Files with no reviewable changes (2)
  • warehouses/facility_ops_landing/ingest/accelerator/electricity_sharepoint/electricity_sharepoint.py
  • warehouses/facility_ops_landing/ingest/accelerator/electricity_sharepoint/.dlt/config.toml

@martyngigg martyngigg merged commit 8721aa6 into main Mar 17, 2026
2 checks passed
@martyngigg martyngigg deleted the electricity-sharepoint-parse-merged-data branch March 17, 2026 09:39


Linked issue: Ask for RDM electricity data to be merged (#252)