193 changes: 193 additions & 0 deletions docs/2-getting-started/environment-best-practices.md
@@ -0,0 +1,193 @@
---
title: Environment Best Practices
---

# Environment Best Practices

Unreliable comparison environments produce misleading validation results. When source data drifts, branches fall behind, or environments collide, you cannot trust what Recce reports.

This guide covers strategies to prepare reliable, efficient environments for Recce data validation. Recce compares a *base environment* (production or staging, representing your main branch) against a *current environment* (representing your pull request branch).

## When to use this guide

- Setting up CI/CD for Recce for the first time
- Seeing inconsistent diff results across PRs
- Managing warehouse costs from accumulated PR environments
- Troubleshooting validation results that don't match expectations

## Challenges this guide addresses

Several factors can affect comparison accuracy:

- Source data updates continuously
- Transformations take time to run
- Other pull requests (PRs) merge into the base branch
- Generated environments accumulate in the warehouse

## Use per-PR schemas

Each PR should have its own isolated schema. This prevents interference between concurrent PRs and makes cleanup straightforward.

```yaml
# profiles.yml
ci:
  schema: "{{ env_var('CI_SCHEMA') }}"

# CI workflow
env:
  CI_SCHEMA: "pr_${{ github.event.pull_request.number }}"
```

Benefits:

- Complete isolation between PRs
- Parallel validation without conflicts
- Easy cleanup by dropping the schema

See [Environment Setup](environment-setup.md) for detailed configuration.

## Prepare a single base environment

Use one consistent base environment for all PRs to compare against. Options:

| Base Environment | Characteristics | Best For |
|------------------|-----------------|----------|
| Production | Latest merged code, full data | Accurate production comparison |
| Staging | Latest merged code, limited data | Faster comparisons, lower cost |

If using staging as base:

- Ensure transformed results reflect the latest commit of the base branch
- Use the same source data as PR environments
- Use the same transformation logic as PR environments

The staging environment should match PR environments as closely as possible, differing only in git commit.

## Limit source data range

Most data is temporal. Using only recent data reduces transformation time while still validating correctness.

**Strategy:** Use data from the last month, excluding the current week. Dropping the current week keeps partially loaded data out of the comparison, so results depend far less on when transformations run.

```sql
SELECT *
FROM {{ source('your_source_name', 'orders') }}
{% if target.name != 'prod' %}
WHERE
    order_date >= DATEADD(month, -1, CURRENT_DATE)
    AND order_date < DATE_TRUNC('week', CURRENT_DATE)
{% endif %}
```

![Diagram showing how limiting data to the previous month excluding current week creates consistent comparison windows](../assets/images/7-cicd/prep-env-limit-data-range.png){: .shadow}

Benefits:

- Faster transformation execution
- Consistent comparison results
- Reduced warehouse costs
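
To make the window concrete, here is a minimal Python sketch of the same date arithmetic. It is an illustration, not part of Recce or dbt: the helper names are hypothetical, it assumes weeks start on Monday, and it assumes month subtraction clamps to the last day of a shorter month.

```python
from __future__ import annotations

import calendar
from datetime import date, timedelta


def one_month_before(day: date) -> date:
    """Subtract one calendar month, clamping to the last day of a shorter month."""
    if day.month > 1:
        year, month = day.year, day.month - 1
    else:
        year, month = day.year - 1, 12
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(day.day, last_day))


def comparison_window(today: date) -> tuple[date, date]:
    """Return (start, end) where start is inclusive and end is exclusive."""
    start = one_month_before(today)
    # Assumption: weeks start on Monday, so step back to the most recent Monday.
    end = today - timedelta(days=today.weekday())
    return start, end


start, end = comparison_window(date(2024, 3, 13))
print(start, end)  # 2024-02-13 2024-03-11
```

Note that the lower bound still moves forward one day at a time. If base and PR environments are built on different days, the oldest rows may differ slightly unless both builds use the same reference date.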

## Reduce source data volatility

If source data updates frequently (hourly or more), comparison results can vary based on timing rather than code changes.

**Strategies:**

- **Zero-copy clone** (Snowflake, BigQuery, Databricks): Freeze source data at a specific point in time
- **Weekly snapshots**: Update source data weekly to reduce variability

![Diagram showing zero-copy clone creating a frozen snapshot of source data for consistent CI comparisons](../assets/images/7-cicd/prep-env-clone-source.png){: .shadow}

## Keep base environment current

The base environment can become outdated in two scenarios:

1. **New source data**: If you update data weekly, update the base environment at least weekly
2. **PRs merged to main**: Trigger base environment update on merge events

Configure your CD workflow to run:

- On merge to main (immediate update)
- On schedule (e.g., daily at 2 AM UTC)

See [Setup CD](setup-cd.md) for workflow configuration.

## Obtain artifacts for environments

Recce uses base and current environment artifacts (`manifest.json`, `catalog.json`) to find corresponding tables in the data warehouse for comparison.

**Recommended approaches:**

- **Recce Cloud** - Automatic artifact management via `recce-cloud upload`. See [Setup CD](setup-cd.md) and [Setup CI](setup-ci.md).
- **dbt Cloud** - Download artifacts from dbt Cloud jobs. See dbt Cloud Setup (separate guide).

**Alternative approaches** (for custom setups):

- **Cloud storage** - Upload artifacts to S3, GCS, or Azure Blob in CI
- **GitHub Actions artifacts** - Use `gh run download` to retrieve from workflow runs
- **Stateless** - Check out the base branch and run `dbt docs generate` on-demand
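
As a rough illustration of how artifacts point to warehouse tables, the Python sketch below pairs models from two `manifest.json` files by their `unique_id`. This is not Recce's implementation. The function names are hypothetical, and the sketch relies only on the `nodes`, `resource_type`, `database`, `schema`, `alias`, and `name` fields of the dbt manifest.

```python
from __future__ import annotations

import json
from pathlib import Path


def load_manifest(path: str) -> dict:
    """Read a dbt manifest.json file from disk."""
    return json.loads(Path(path).read_text())


def model_relations(manifest: dict) -> dict[str, str]:
    """Map each model's unique_id to its database.schema.identifier."""
    relations = {}
    for unique_id, node in manifest.get("nodes", {}).items():
        if node.get("resource_type") != "model":
            continue  # skip tests, seeds, snapshots, and so on
        identifier = node.get("alias") or node["name"]
        relations[unique_id] = f"{node['database']}.{node['schema']}.{identifier}"
    return relations


def pair_relations(base: dict, current: dict) -> dict[str, tuple[str, str]]:
    """Pair base and current relations for models present in both manifests."""
    base_rel = model_relations(base)
    current_rel = model_relations(current)
    shared = sorted(base_rel.keys() & current_rel.keys())
    return {uid: (base_rel[uid], current_rel[uid]) for uid in shared}


# Inline manifests stand in for files loaded with load_manifest().
base = {"nodes": {"model.shop.orders": {
    "resource_type": "model", "database": "analytics",
    "schema": "public", "name": "orders", "alias": "orders"}}}
current = {"nodes": {"model.shop.orders": {
    "resource_type": "model", "database": "analytics",
    "schema": "pr_123", "name": "orders", "alias": "orders"}}}

print(pair_relations(base, current))
```

Models that exist in only one manifest (added or removed in the PR) are left out of the pairing here, since there is no counterpart table to compare.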

## Keep PR branch in sync with base

If a PR runs after other PRs merge to main, the comparison mixes:

- Changes from the current PR
- Changes from other merged PRs

This produces comparison results that don't accurately reflect the current PR's impact.

![Diagram showing how an outdated PR branch mixes changes from other merged PRs into comparison results](../assets/images/7-cicd/prep-env-pr-outdated.png){: .shadow}

**GitHub**: Enable [branch protection](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/keeping-your-pull-request-in-sync-with-the-base-branch) to show when PRs are outdated.

**CI check**: Add a workflow step to verify the PR is up-to-date:

```yaml
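- name: Check if PR is up-to-date
  if: github.event_name == 'pull_request'
  run: |
    UPSTREAM=${GITHUB_BASE_REF:-main}
    HEAD=${GITHUB_HEAD_REF:-${GITHUB_REF#refs/heads/}}
    # Assumes the checkout step fetched full history and that ${HEAD} resolves to the PR branch
    git fetch origin "${UPSTREAM}"
    # Count commits on the base branch that the PR branch does not contain yet
    if [ "$(git rev-list --right-only --count "${HEAD}...origin/${UPSTREAM}")" -eq 0 ]; then
      echo "Branch is up-to-date"
    else
      echo "Branch is not up-to-date"
      exit 1
    fi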
```
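
The idea behind this check is to count commits on each side of the two branches. A minimal Python sketch of that logic follows, assuming the tab-separated output of `git rev-list --left-right --count <pr-branch>...<base-branch>`. The function names are hypothetical, and `rev_list_counts` assumes both refs exist in the local repository.

```python
from __future__ import annotations

import subprocess


def parse_left_right(output: str) -> tuple[int, int]:
    """Parse the two whitespace-separated counts printed by git."""
    left, right = output.split()
    return int(left), int(right)


def is_up_to_date(output: str) -> bool:
    """The PR branch (left) is current when the base (right) has no extra commits."""
    _ahead, behind = parse_left_right(output)
    return behind == 0


def rev_list_counts(head: str, upstream: str) -> str:
    """Run git in the current repository and return its raw output."""
    result = subprocess.run(
        ["git", "rev-list", "--left-right", "--count", f"{head}...{upstream}"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


print(is_up_to_date("3\t0\n"))  # True: three PR commits, nothing missing from base
print(is_up_to_date("3\t2\n"))  # False: base has two commits the PR lacks
```

Only the right-hand count matters for this check. The left-hand count is the number of commits the PR adds, which is expected to be greater than zero.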

## Clean up PR environments

As PRs accumulate, so do generated schemas. Implement cleanup to manage warehouse storage.

**On PR close**: Create a workflow that drops the PR schema when the PR closes.

```jinja
{% macro clear_schema(schema_name) %}
    {% set drop_schema_command = "DROP SCHEMA IF EXISTS " ~ schema_name ~ " CASCADE;" %}
    {% do run_query(drop_schema_command) %}
{% endmacro %}
```

Run the cleanup:

```shell
dbt run-operation clear_schema --args "{'schema_name': 'pr_123'}"
```

**Scheduled cleanup**: Remove schemas not used for a week.
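
One possible shape for a scheduled cleanup is sketched below in Python. It assumes you can obtain a last-used timestamp per schema from your warehouse, which is warehouse-specific and not shown here. The function names are hypothetical. The sketch only selects schemas matching `pr_<number>`, so `public` and `staging` are never candidates.

```python
from __future__ import annotations

import json
import re
from datetime import datetime, timedelta

PR_SCHEMA = re.compile(r"^pr_\d+$")


def stale_pr_schemas(
    last_used: dict[str, datetime],
    now: datetime,
    max_age: timedelta = timedelta(days=7),
) -> list[str]:
    """Return PR schemas whose last use is older than max_age."""
    cutoff = now - max_age
    return sorted(
        schema
        for schema, used_at in last_used.items()
        if PR_SCHEMA.match(schema.lower()) and used_at < cutoff
    )


def cleanup_command(schema: str) -> list[str]:
    """Build the dbt invocation for the clear_schema macro as an argument list."""
    # JSON is valid YAML, so it can be passed to --args.
    return ["dbt", "run-operation", "clear_schema",
            "--args", json.dumps({"schema_name": schema})]


last_used = {
    "pr_101": datetime(2024, 3, 1),
    "pr_102": datetime(2024, 3, 14),
    "public": datetime(2024, 1, 1),
}
for schema in stale_pr_schemas(last_used, now=datetime(2024, 3, 15)):
    print(" ".join(cleanup_command(schema)))
```

Returning an argument list rather than a single string keeps the command safe to pass to `subprocess.run` without shell quoting.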

## Example configuration

| Environment | Schema | When to Run | Count | Data Range |
|-------------|--------|-------------|-------|------------|
| Production | `public` | Daily | 1 | All |
| Staging | `staging` | Daily + on merge | 1 | 1 month, excluding current week |
| PR | `pr_<number>` | On push | # of open PRs | 1 month, excluding current week |

## Next steps

- [Environment Setup](environment-setup.md) - Technical configuration for profiles.yml and CI/CD
- [Setup CD](setup-cd.md) - Configure automatic baseline updates
- [Setup CI](setup-ci.md) - Configure PR validation
195 changes: 195 additions & 0 deletions docs/2-getting-started/environment-setup.md
@@ -0,0 +1,195 @@
---
title: Environment Setup
---

# Environment Setup

Configure your dbt profiles and CI/CD environment variables for Recce data validation.

## Goal

Set up isolated schemas for base vs current comparison. After completing this guide, your CI/CD workflows automatically create per-PR schemas and compare them against production.

## Prerequisites

- [x] **dbt project**: A working dbt project with `profiles.yml` configured
- [x] **CI/CD platform**: GitHub Actions, GitLab CI, or similar
- [x] **Warehouse access**: Credentials with permissions to create schemas dynamically

## Why separate schemas matter

Recce compares two sets of data to validate changes:

- **Base**: The production state (main branch)
- **Current**: The PR branch with your changes

For accurate validation, these must point to different schemas in your warehouse. Without separation, you would compare identical data and miss meaningful differences.

## How CI/CD works with Recce

Recce uses both continuous delivery (CD) and continuous integration (CI) to automate data validation:

- **CD (Continuous Delivery)**: Runs after merge to main. Updates baseline artifacts with latest production state.
- **CI (Continuous Integration)**: Runs on PR. Validates proposed changes against baseline.

**Set up CD first**, then CI. CD establishes your baseline (production artifacts), which CI uses for comparison.

## Configure profiles.yml

Your `profiles.yml` file defines how dbt connects to your warehouse. Add a `ci` target with a dynamic schema for PR isolation.

```yaml
jaffle_shop:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: "{{ env_var('SNOWFLAKE_ACCOUNT') }}"
      user: "{{ env_var('SNOWFLAKE_USER') }}"
      password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
      database: analytics
      warehouse: COMPUTE_WH
      schema: dev
      threads: 4

    # CI environment with dynamic schema per PR
    ci:
      type: snowflake
      account: "{{ env_var('SNOWFLAKE_ACCOUNT') }}"
      user: "{{ env_var('SNOWFLAKE_USER') }}"
      password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
      database: analytics
      warehouse: COMPUTE_WH
      schema: "{{ env_var('CI_SCHEMA') }}"
      threads: 4

    prod:
      type: snowflake
      account: "{{ env_var('SNOWFLAKE_ACCOUNT') }}"
      user: "{{ env_var('SNOWFLAKE_USER') }}"
      password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
      database: analytics
      warehouse: COMPUTE_WH
      schema: public
      threads: 4
```

After saving, your profile supports three targets: `dev` for local development, `ci` for PR validation, and `prod` for production.

Key points:

- The `ci` target uses `env_var('CI_SCHEMA')` for dynamic schema assignment
- The `prod` target uses a fixed schema (`public`) for consistency
- Adapt this pattern for other warehouses (BigQuery uses `dataset` instead of `schema`)
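
Because the profile reads every credential and the schema through `env_var(...)` without defaults, a missing variable makes dbt fail when it parses the profile. A small preflight check can surface that earlier. The Python sketch below is one way to do it. The variable names come from the profile above, while the function name is hypothetical.

```python
from __future__ import annotations

import os
from typing import Iterable, Mapping, Optional

# Variables the `ci` target above reads through env_var().
REQUIRED_FOR_CI = (
    "SNOWFLAKE_ACCOUNT",
    "SNOWFLAKE_USER",
    "SNOWFLAKE_PASSWORD",
    "CI_SCHEMA",
)


def missing_env_vars(
    required: Iterable[str],
    environ: Optional[Mapping[str, str]] = None,
) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


example_env = {"SNOWFLAKE_ACCOUNT": "acme", "SNOWFLAKE_USER": "ci_bot"}
print(missing_env_vars(REQUIRED_FOR_CI, example_env))
```

Running a check like this as the first CI step gives a clearer error than a failed connection later in the job.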

## Set CI/CD environment variables

Your CI/CD workflow sets the schema dynamically for each PR. The key configuration:

**GitHub Actions:**

```yaml
env:
  CI_SCHEMA: "pr_${{ github.event.pull_request.number }}"
```

**GitLab CI:**

```yaml
variables:
  CI_SCHEMA: "mr_${CI_MERGE_REQUEST_IID}"
```

This creates one schema per PR automatically, such as `pr_123` on GitHub or `mr_123` on GitLab. When a PR opens, the workflow sets `CI_SCHEMA` and dbt writes to that isolated schema.

For complete workflow examples, see [Setup CD](setup-cd.md) and [Setup CI](setup-ci.md).
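
The naming rule itself is simple enough to express as a function. The Python sketch below mirrors the two snippets above. The `ci_schema` helper and the `PREFIXES` mapping are hypothetical and exist only to show the convention.

```python
# Prefix per platform, matching the GitHub Actions and GitLab CI snippets.
PREFIXES = {"github": "pr", "gitlab": "mr"}


def ci_schema(platform: str, number: int) -> str:
    """Build the per-PR schema name, for example pr_123 or mr_123."""
    if platform not in PREFIXES:
        raise ValueError(f"unsupported platform: {platform!r}")
    if number <= 0:
        raise ValueError("PR or MR number must be positive")
    return f"{PREFIXES[platform]}_{number}"


print(ci_schema("github", 123))  # pr_123
print(ci_schema("gitlab", 123))  # mr_123
```

Keeping the prefix and number as the only inputs makes the schema name predictable, which is what allows cleanup jobs to find and drop it later.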

## Recommended pattern: Schema-per-PR

Create an isolated schema for each PR. This is the recommended approach for teams.

| Base Schema | Current Schema | Example |
|-------------|----------------|---------|
| `public` (prod) | `pr_123` | PR #123 gets its own schema |

**Why this pattern:**

- Complete isolation between PRs
- Multiple PRs can run validation in parallel without conflicts
- Easy cleanup by dropping the schema when PR closes
- Clear audit trail of what data each PR produced

## Alternative patterns

### Using staging as base

Instead of comparing against production, compare against a staging environment with limited data.

| Base Schema | Current Schema | Use Case |
|-------------|----------------|----------|
| `staging` | `pr_123` | Teams wanting faster comparisons |

**Pros:**

- Faster diffs with limited data ranges
- Consistent source data between base and current
- Reduced warehouse costs

**Cons:**

- Staging may drift from production
- Issues caught in staging might not reflect production behavior
- Requires maintaining an additional environment

See [Environment Best Practices](environment-best-practices.md) for strategies on limiting data ranges.

### Shared development schema (not recommended)

Using a single `dev` schema for all development work.

| Base Schema | Current Schema | Use Case |
|-------------|----------------|----------|
| `public` (prod) | `dev` | Solo developers only |

**Why this is not recommended:**

- Multiple PRs overwrite each other's data
- Cannot run parallel validations
- Comparison results may include changes from other work
- Difficult to isolate issues to specific PRs

Only use this pattern for individual local development, not for CI/CD automation.

## Verification

After configuring your setup, verify that both base and current schemas are accessible.

### Check configuration locally

```shell
dbt debug --target ci
```

### Verify in Recce interface

Launch Recce and check **Environment Info** in the top-right corner. You should see:

- **Base**: Your production schema (e.g., `public`)
- **Current**: Your PR-specific schema (e.g., `pr_123`)
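
If you prefer a scripted check, the same comparison can be made from the two `manifest.json` files. The Python sketch below is a hypothetical helper, not a Recce feature. It collects the schemas that models resolve to in each manifest and reports any overlap.

```python
from __future__ import annotations


def model_schemas(manifest: dict) -> set[str]:
    """Collect the lowercased schemas that models in a dbt manifest resolve to."""
    return {
        node["schema"].lower()
        for node in manifest.get("nodes", {}).values()
        if node.get("resource_type") == "model"
    }


def shared_schemas(base: dict, current: dict) -> set[str]:
    """Schemas used by both environments; an empty set means they are isolated."""
    return model_schemas(base) & model_schemas(current)


# Inline manifests stand in for the parsed base and current manifest.json files.
base_manifest = {"nodes": {"model.shop.orders": {
    "resource_type": "model", "schema": "public"}}}
current_manifest = {"nodes": {"model.shop.orders": {
    "resource_type": "model", "schema": "pr_123"}}}

print(sorted(shared_schemas(base_manifest, current_manifest)))  # []
```

A non-empty result suggests both environments were built with the same target, which matches the "Base and current show same schema" case in the troubleshooting table.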

## Troubleshooting

| Issue | Solution |
|-------|----------|
| Schema creation fails | Verify your CI credentials have `CREATE SCHEMA` permissions |
| Environment variable not found | Check that secrets are configured in your CI/CD platform settings |
| Base and current show same schema | Ensure `--target ci` is used in CI, not `--target dev` |
| Profile not found | Verify `profiles.yml` is accessible in CI (check path or use `DBT_PROFILES_DIR`) |
| Connection timeout | Check warehouse IP allowlists include CI runner IP ranges |

## Next steps

- [Get Started with Recce Cloud](start-free-with-cloud.md) - Complete onboarding guide
- [Environment Best Practices](environment-best-practices.md) - Strategies for source data and schema management
- [Setup CD](setup-cd.md) - CD workflow for GitHub Actions and GitLab CI
- [Setup CI](setup-ci.md) - CI workflow for GitHub Actions and GitLab CI