DRC-2820: PoC scenario detection and base mode classification by even-wei · Pull Request #2 · DataRecce/dbt-tpch

even-wei · 2026-02-25T06:33:19Z

Summary

Add detection heuristic prototype that analyzes manifest.json to recommend shared vs isolated base mode
Add row count comparison script to validate false alarm patterns
Add test incremental model demonstrating the data divergence scenario

Changes

scripts/detect_base_mode.py — Parses manifest for incremental models, event_time coverage, materialization mix, and project scale to classify the project
scripts/compare_environments.py — Compares row counts between base/current PostgreSQL schemas
models/metrics/metrics_daily_shipments.sql — Incremental model (delete+insert) for testing the isolated base scenario
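
As a rough illustration of the first-pass heuristic described above, the sketch below classifies a project from an already-loaded manifest. It is not the code in scripts/detect_base_mode.py: the function name `classify_base_mode` and the output keys are hypothetical, and it only looks at the materialization signal (no event_time coverage or project-scale checks). It assumes the standard manifest.json layout in which each entry under `nodes` carries `resource_type` and `config.materialized`.

```python
import json


def classify_base_mode(manifest: dict) -> dict:
    """Recommend 'shared' or 'isolated' base mode from a dbt manifest dict.

    Hypothetical sketch: any incremental model flips the recommendation
    to 'isolated'; otherwise 'shared' is recommended.
    """
    models = [
        node
        for node in manifest.get("nodes", {}).values()
        if node.get("resource_type") == "model"
    ]
    incremental = sorted(
        node["name"]
        for node in models
        if node.get("config", {}).get("materialized") == "incremental"
    )
    return {
        "mode": "isolated" if incremental else "shared",
        "model_count": len(models),
        "incremental_models": incremental,
    }


if __name__ == "__main__":
    # Assumed location of the manifest after `dbt compile` or `dbt run`.
    with open("target/manifest.json") as fh:
        print(json.dumps(classify_base_mode(json.load(fh)), indent=2))
```

Returning a plain dict keeps the result easy to print as JSON for machine-readable use.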

Test plan

Detection script recommends "shared base" for 70-model project with no incremental models
Detection script flips to "isolated base" when 1 incremental model is added
Row count comparison shows 51/51 match with shared base (no incremental)
Row count comparison shows 1 mismatch (-50.7%) when incremental model has divergent data
JSON output mode works for machine-readable integration
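
The row count checks in the test plan boil down to a per-table comparison. The sketch below shows one way to compute it once the counts have been fetched. It is not scripts/compare_environments.py: the real script queries PostgreSQL, whereas here the counts are passed in as dicts, and the function name, the sign convention (current relative to base), and the sample numbers in the usage note are assumptions.

```python
def compare_row_counts(base: dict, current: dict) -> list:
    """Compare per-table row counts between a base and a current schema.

    Returns one record per table with a status of 'match', 'mismatch',
    or 'missing', plus the percentage change relative to base.
    """
    results = []
    for table in sorted(set(base) | set(current)):
        b, c = base.get(table), current.get(table)
        if b is None or c is None:
            status, pct = "missing", None
        elif b == c:
            status, pct = "match", 0.0
        else:
            status = "mismatch"
            # Percentage change is undefined when the base table is empty.
            pct = round((c - b) / b * 100, 1) if b else None
        results.append(
            {"table": table, "base": b, "current": c, "status": status, "pct_diff": pct}
        )
    return results
```

With hypothetical counts of 1000 rows in base and 493 in current, the record for that table has status `mismatch` and a `pct_diff` of -50.7.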

Size

+485/-0 across 3 files

Refs: DRC-2820

🤖 Generated with Claude Code
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…tion

Add detection heuristic prototype that analyzes dbt manifest.json to recommend shared base vs isolated base mode for Recce CI. The core insight: projects with incremental models need isolated base to avoid false alarm row count diffs.

- detect_base_mode.py: parses manifest, classifies by materialization type, event_time coverage, and project scale
- compare_environments.py: compares row counts between base/current schemas to validate false alarm patterns
- metrics_daily_shipments.sql: test incremental model demonstrating the data divergence scenario

Validated: 51/51 tables match with shared base (no incremental), 1 mismatch (-50.7%) when incremental model present.

Refs: DRC-2820
Signed-off-by: even-wei <evenwei@infuseai.io>

…lation

The root cause of false alarms is NOT "incremental models accumulate data." It is conditional logic (is_incremental(), current_date(), target.name, {{ this }}) that produces different SQL depending on build context. Two environments built under different conditions run different queries against the same source → different results → false alarm diffs.

- detect_base_mode.py: reframed signal explanations from "data accumulation" to "conditional logic produces non-deterministic SQL"; added snapshot model detection alongside incremental
- metrics_daily_shipments.sql: added target-dependent else branch (pg-base gets 365 days, others get 90 days) to demonstrate the real-world conditional fork pattern seen in fct_cmab_strategy_reward

Validated: pg-base (365d) vs pg-current (90d) produces -4.8% row diff on the conditional model, while all 50 deterministic models match.

Refs: DRC-2820
Signed-off-by: even-wei <evenwei@infuseai.io>

…ialization

Add models demonstrating that false alarms are caused by non-deterministic SQL patterns (target.name, current_date), not by materialization type:

- metrics_regional_revenue (table): target.name date window → -68.8% mismatch
- metrics_shipping_efficiency (table): target.name branching → -68.8% mismatch
- metrics_order_summary (view): target.name date window → -78.7% mismatch
- metrics_daily_orders (incremental, deterministic else): 0% diff, row counts match (safe)

Detection script v2: scans raw SQL for non-deterministic patterns instead of checking materialization type. Correctly flags 4/4 problematic models and correctly marks the safe incremental model as safe. Zero false positives, zero false negatives.

Compare script: now includes views in row count comparison.

Signed-off-by: even-wei <evenwei@infuseai.io>
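
The raw-SQL scan described in this commit could look roughly like the sketch below. It is an illustration, not the v2 detection script: the pattern set is limited to the two patterns the commit message names (target.name and current_date), the function name is hypothetical, and is_incremental() on its own is deliberately not flagged so that an incremental model with a deterministic else branch stays safe. It assumes the manifest stores the uncompiled model source under `raw_code` (dbt 1.3 and later) or `raw_sql` (earlier versions).

```python
import re

# Patterns named in the commit message; the real script may check more.
NONDETERMINISTIC_PATTERNS = {
    "target.name": re.compile(r"\btarget\.name\b"),
    "current_date": re.compile(r"\bcurrent_date\b", re.IGNORECASE),
}


def find_nondeterministic_models(manifest: dict) -> dict:
    """Map model name -> list of non-deterministic patterns found in its raw SQL."""
    flagged = {}
    for node in manifest.get("nodes", {}).values():
        if node.get("resource_type") != "model":
            continue
        # Older manifests use raw_sql, newer ones raw_code.
        raw = node.get("raw_code") or node.get("raw_sql") or ""
        hits = [
            name
            for name, pattern in NONDETERMINISTIC_PATTERNS.items()
            if pattern.search(raw)
        ]
        if hits:
            flagged[node["name"]] = hits
    return flagged
```

Because the scan ignores `config.materialized` entirely, a table or view that branches on target.name is flagged just like an incremental model would be.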

Prototype two detection methods for non-deterministic dbt models:

1. compiled_sql_diff.py: compiles under two targets, normalizes schema names and batch metadata, and diffs the remaining SQL. Requires --full-refresh to catch incremental else branches.
2. compare_detection_approaches.py: runs both Jinja scanning and compiled SQL diffing side-by-side, comparing accuracy against ground truth.

Key findings:

- Both approaches achieve 100% accuracy on dbt-tpch (73 models)
- Compiled SQL diff needs --full-refresh for incremental models
- Schema normalization must be precise (db.schema.table only)
- dbt_batch_id/ts must be stripped as compile-time artifacts

Relates to DRC-2863
Signed-off-by: even-wei <evenwei@infuseai.io>

even-wei added 4 commits February 25, 2026 14:30

even-wei wants to merge 4 commits into master from feature/drc-2820-poc-scenario-detection