Add Polars compare tool#591

Open
paddymul wants to merge 3 commits into main from feat/polars-compare
@paddymul
Collaborator

Summary

  • Adds buckaroo/polars_compare.py with col_join_dfs for Polars DataFrames, mirroring the pandas compare module from #589 (Fix BuckarooCompare for arbitrary join keys)
  • Uses pl.DataFrame.join() with coalesce=False to detect membership via null patterns on join keys (polars has no merge indicator)
  • Maps "outer" to "full" for polars join API compatibility
  • Uses eq_missing/ne_missing for null-aware comparisons (polars != returns null when either operand is null)
  • Validates join key uniqueness via pl.struct().is_unique()

Test plan

  • 11 tests in tests/unit/polars_compare_test.py covering:
    • Single join key
    • Multi-key joins
    • Outer join membership (df1-only, df2-only, both)
    • Reordered rows
    • One-sided extra columns
    • String join_columns normalization
    • Sentinel column name rejection
    • Inner join
    • Null-heavy comparisons
    • Duplicate key rejection
    • outer/full how alias
  • All 11 tests pass locally

🤖 Generated with Claude Code

Polars equivalent of the pandas compare module. Key differences:
- Uses pl.DataFrame.join() with coalesce=False to detect membership
  via null patterns on join keys (polars has no merge indicator)
- Maps 'outer' to 'full' for polars join API compatibility
- Uses eq_missing/ne_missing for null-aware comparisons
- Validates join key uniqueness via pl.struct().is_unique()

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0c2ca5af1a


Comment on lines 61 to 64
left_key = join_columns[0]
right_key = f"{left_key}{df2_suffix}"
m_df = m_df.with_columns(
    pl.when(pl.col(left_key).is_not_null() & pl.col(right_key).is_not_null())


P1 Badge Track row origin without nullable join keys

membership is derived from whether join_columns[0] and its suffixed counterpart are null, but that logic fails when the first join key itself can be null. In an outer/full join, a df1-only row with a null first key will have both key columns null after the join and gets labeled as 2 (df2-only) instead of 1, which miscolors row provenance and can distort downstream comparisons that depend on membership. Derive origin from explicit per-side marker columns added before the join (or another non-null indicator) rather than key nullness.


@github-actions

github-actions bot commented Feb 25, 2026

📦 TestPyPI package published

pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.12.12.dev22427399358

or with uv:

uv pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.12.12.dev22427399358

MCP server for Claude Code

claude mcp add buckaroo-table -- uvx --from "buckaroo[mcp]==0.12.12.dev22427399358" --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo-table

paddymul and others added 2 commits February 25, 2026 23:07
Addresses Codex review: null join keys broke membership detection when
relying on key null patterns. Now adds non-null boolean marker columns
(__bk_left, __bk_right) before the join, derives membership from those,
then drops them. This is immune to null join keys.

Adds test for nullable join key edge case.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>