Add LLM judge evaluator factory #39
base: main
Conversation
Pull request overview
Adds a reusable LLM-as-a-judge evaluator factory to the evaluation harness with shared OpenAI-parse utilities, rubric/prompt rendering, deterministic error metrics, and unit tests for success/error scenarios.
Changes:
- Introduces `make_llm_as_judge_evaluator` with structured response parsing and mapping to `Evaluation` objects (a rough sketch follows this list)
- Adds shared grader utilities for structured parse calls (with retry), markdown loading, prompt rendering, and prompt serialization
- Adds pytest coverage for successful mapping, default/custom rubric behavior, and deterministic error metrics
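The diff itself is not reproduced on this page, but based on the description above, the factory's shape is roughly the sketch below. Only `make_llm_as_judge_evaluator`, `LLMJudgeMetric`, and `Evaluation` are names taken from the PR; everything else (field names, signature, the `_call_judge` helper) is assumed for illustration.

```python
from dataclasses import dataclass
from typing import Callable

from pydantic import BaseModel


@dataclass
class Evaluation:
    """Hypothetical stand-in for the harness's Evaluation object."""

    name: str
    value: float
    comment: str | None = None


class LLMJudgeMetric(BaseModel):
    """Assumed shape of one rubric dimension scored by the judge."""

    name: str
    score: float
    reasoning: str


class JudgeResponse(BaseModel):
    """Structured response schema the judge model is asked to return."""

    metrics: list[LLMJudgeMetric]


def make_llm_as_judge_evaluator(
    metric_names: list[str],
    rubric: str | None = None,
) -> Callable[[str, str], list[Evaluation]]:
    """Build an evaluator that grades (input, output) pairs with an LLM judge."""

    def evaluate(task_input: str, task_output: str) -> list[Evaluation]:
        response = _call_judge(task_input, task_output, metric_names, rubric)
        # Map each judged metric onto an Evaluation the harness can record.
        return [
            Evaluation(name=m.name, value=m.score, comment=m.reasoning)
            for m in response.metrics
        ]

    return evaluate


def _call_judge(
    task_input: str, task_output: str, metric_names: list[str], rubric: str | None
) -> JudgeResponse:
    # Placeholder: the real utilities render the rubric/prompt and make a
    # structured parse call against the judge model (with retries).
    raise NotImplementedError
```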
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| aieng-eval-agents/aieng/agent_evals/evaluation/graders/llm_judge.py | Adds the LLM-judge evaluator factory and response-to-evaluations mapping |
| aieng-eval-agents/aieng/agent_evals/evaluation/graders/_utils.py | Adds shared helpers for structured parse calls, retry logic, rubric rendering, and serialization (a rough sketch of the retry pattern follows the table) |
| aieng-eval-agents/aieng/agent_evals/evaluation/graders/config.py | Introduces request/retry configuration for LLM grader calls |
| aieng-eval-agents/aieng/agent_evals/evaluation/graders/__init__.py | Exposes grader factory/types at the package level |
| aieng-eval-agents/tests/aieng/agent_evals/evaluation/graders/test_llm_judge.py | Adds comprehensive unit tests for the evaluator factory and helpers |
| aieng-eval-agents/tests/aieng/agent_evals/evaluation/__init__.py | Establishes test package structure for evaluation modules |
| aieng-eval-agents/tests/aieng/agent_evals/evaluation/graders/__init__.py | Establishes test package structure for grader tests |
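As context for the `_utils.py` and `config.py` rows, a retry-wrapped structured parse call typically follows the pattern sketched below. None of these names are taken from the PR; `GraderRequestConfig` and `call_with_retry` are placeholders for whatever the shared utilities actually define.

```python
import time
from dataclasses import dataclass
from typing import Callable, TypeVar

T = TypeVar("T")


@dataclass
class GraderRequestConfig:
    """Assumed request/retry settings for LLM grader calls."""

    max_retries: int = 3
    backoff_seconds: float = 1.0


def call_with_retry(parse_call: Callable[[], T], config: GraderRequestConfig) -> T:
    """Run a structured parse call, retrying failures with exponential backoff."""
    last_error: Exception | None = None
    for attempt in range(config.max_retries):
        try:
            return parse_call()
        except Exception as err:  # a real helper would narrow this to transient errors
            last_error = err
            time.sleep(config.backoff_seconds * (2**attempt))
    raise RuntimeError("Structured parse call failed after retries") from last_error
```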
Resolved (outdated) review threads:
- aieng-eval-agents/aieng/agent_evals/evaluation/graders/__init__.py
- aieng-eval-agents/aieng/agent_evals/evaluation/graders/llm_judge.py
- aieng-eval-agents/aieng/agent_evals/evaluation/graders/_utils.py
- aieng-eval-agents/aieng/agent_evals/evaluation/graders/llm_judge.py
amrit110
left a comment
Looks good, but I'd also wait for @lotif to review this one.
metrics: list[LLMJudgeMetric]

def make_llm_as_judge_evaluator(
Nitpick: use `create_*` instead, since I've used that naming before in the codebase.
Summary
This PR adds a reusable LLM-as-a-judge evaluator factory for the evaluation harness, including request/retry configuration, shared grader utilities, and comprehensive unit tests for success and error paths.
Clickup Ticket(s): N/A
Type of Change
Changes Made
- Adds `make_llm_as_judge_evaluator` with structured response parsing and mapping to Langfuse-compatible `Evaluation` objects (the error-handling fallback is sketched below).
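The "deterministic error metrics" mentioned in the overview presumably mean that a failed judge call still yields one evaluation per configured metric rather than raising. A hedged illustration of that fallback (names assumed, not taken from the PR):

```python
from dataclasses import dataclass


@dataclass
class Evaluation:
    """Same hypothetical stand-in used in the sketch earlier on this page."""

    name: str
    value: float
    comment: str | None = None


def error_evaluations(metric_names: list[str], error: Exception) -> list[Evaluation]:
    # Deterministic fallback: one zero-score Evaluation per metric when the judge
    # call or structured parse fails, so downstream aggregation sees no missing keys.
    return [
        Evaluation(name=name, value=0.0, comment=f"judge error: {error}")
        for name in metric_names
    ]
```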
Testing
- `uv run pytest tests/`
- `uv run mypy <src_dir>`
- `uv run ruff check src_dir/`

Manual testing details:
N/A
Screenshots/Recordings
N/A
Related Issues
N/A
Deployment Notes
N/A
Checklist