Add LLM judge evaluator factory #39
base: main
Conversation
Pull request overview
Adds a reusable LLM-as-a-judge evaluator factory to the evaluation harness with shared OpenAI-parse utilities, rubric/prompt rendering, deterministic error metrics, and unit tests for success/error scenarios.
Changes:
- Introduces `make_llm_as_judge_evaluator` with structured response parsing and mapping to `Evaluation` objects (a rough sketch follows this list)
- Adds shared grader utilities for structured parse calls (with retry), markdown loading, prompt rendering, and prompt serialization
- Adds pytest coverage for successful mapping, default/custom rubric behavior, and deterministic error metrics
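The diff itself is not reproduced on this page, but based on the description above, the factory's shape is roughly the sketch below. Only `make_llm_as_judge_evaluator`, `LLMJudgeMetric`, and `Evaluation` are names taken from the PR; everything else (field names, signature, the `_call_judge` helper) is assumed for illustration.

```python
from dataclasses import dataclass
from typing import Callable

from pydantic import BaseModel


@dataclass
class Evaluation:
    """Hypothetical stand-in for the harness's Evaluation object."""

    name: str
    value: float
    comment: str | None = None


class LLMJudgeMetric(BaseModel):
    """Assumed shape of one rubric dimension scored by the judge."""

    name: str
    score: float
    reasoning: str


class JudgeResponse(BaseModel):
    """Structured response schema the judge model is asked to return."""

    metrics: list[LLMJudgeMetric]


def make_llm_as_judge_evaluator(
    metric_names: list[str],
    rubric: str | None = None,
) -> Callable[[str, str], list[Evaluation]]:
    """Build an evaluator that grades (input, output) pairs with an LLM judge."""

    def evaluate(task_input: str, task_output: str) -> list[Evaluation]:
        response = _call_judge(task_input, task_output, metric_names, rubric)
        # Map each judged metric onto an Evaluation the harness can record.
        return [
            Evaluation(name=m.name, value=m.score, comment=m.reasoning)
            for m in response.metrics
        ]

    return evaluate


def _call_judge(
    task_input: str, task_output: str, metric_names: list[str], rubric: str | None
) -> JudgeResponse:
    # Placeholder: the real utilities render the rubric/prompt and make a
    # structured parse call against the judge model (with retries).
    raise NotImplementedError
```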
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| aieng-eval-agents/aieng/agent_evals/evaluation/graders/llm_judge.py | Adds the LLM-judge evaluator factory and response-to-evaluations mapping |
| aieng-eval-agents/aieng/agent_evals/evaluation/graders/_utils.py | Adds shared helpers for structured parse calls, retry logic, rubric rendering, and serialization (a rough sketch of the retry pattern follows the table) |
| aieng-eval-agents/aieng/agent_evals/evaluation/graders/config.py | Introduces request/retry configuration for LLM grader calls |
| aieng-eval-agents/aieng/agent_evals/evaluation/graders/__init__.py | Exposes grader factory/types at the package level |
| aieng-eval-agents/tests/aieng/agent_evals/evaluation/graders/test_llm_judge.py | Adds comprehensive unit tests for the evaluator factory and helpers |
| aieng-eval-agents/tests/aieng/agent_evals/evaluation/__init__.py | Establishes test package structure for evaluation modules |
| aieng-eval-agents/tests/aieng/agent_evals/evaluation/graders/__init__.py | Establishes test package structure for grader tests |
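As context for the `_utils.py` and `config.py` rows, a retry-wrapped structured parse call typically follows the pattern sketched below. None of these names are taken from the PR; `GraderRequestConfig` and `call_with_retry` are placeholders for whatever the shared utilities actually define.

```python
import time
from dataclasses import dataclass
from typing import Callable, TypeVar

T = TypeVar("T")


@dataclass
class GraderRequestConfig:
    """Assumed request/retry settings for LLM grader calls."""

    max_retries: int = 3
    backoff_seconds: float = 1.0


def call_with_retry(parse_call: Callable[[], T], config: GraderRequestConfig) -> T:
    """Run a structured parse call, retrying failures with exponential backoff."""
    last_error: Exception | None = None
    for attempt in range(config.max_retries):
        try:
            return parse_call()
        except Exception as err:  # a real helper would narrow this to transient errors
            last_error = err
            time.sleep(config.backoff_seconds * (2**attempt))
    raise RuntimeError("Structured parse call failed after retries") from last_error
```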
Resolved (outdated) review threads:
- aieng-eval-agents/aieng/agent_evals/evaluation/graders/__init__.py
- aieng-eval-agents/aieng/agent_evals/evaluation/graders/llm_judge.py
- aieng-eval-agents/aieng/agent_evals/evaluation/graders/_utils.py
- aieng-eval-agents/aieng/agent_evals/evaluation/graders/llm_judge.py
amrit110
left a comment
Looks good, but I'd also wait for @lotif to review this one.
metrics: list[LLMJudgeMetric]

def make_llm_as_judge_evaluator(
Nitpick: use `create_*` instead, since I've used that naming before in the codebase.
Summary
This PR adds a reusable LLM-as-a-judge evaluator factory for the evaluation harness, including request/retry configuration, shared grader utilities, and comprehensive unit tests for success and error paths.
Clickup Ticket(s): N/A
Type of Change
Changes Made
- Adds `make_llm_as_judge_evaluator` with structured response parsing and mapping to Langfuse-compatible `Evaluation` objects (the error-handling fallback is sketched below).
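The "deterministic error metrics" mentioned in the overview presumably mean that a failed judge call still yields one evaluation per configured metric rather than raising. A hedged illustration of that fallback (names assumed, not taken from the PR):

```python
from dataclasses import dataclass


@dataclass
class Evaluation:
    """Same hypothetical stand-in used in the sketch earlier on this page."""

    name: str
    value: float
    comment: str | None = None


def error_evaluations(metric_names: list[str], error: Exception) -> list[Evaluation]:
    # Deterministic fallback: one zero-score Evaluation per metric when the judge
    # call or structured parse fails, so downstream aggregation sees no missing keys.
    return [
        Evaluation(name=name, value=0.0, comment=f"judge error: {error}")
        for name in metric_names
    ]
```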
Testing
- `uv run pytest tests/`
- `uv run mypy <src_dir>`
- `uv run ruff check src_dir/`

Manual testing details:
N/A
Screenshots/Recordings
N/A
Related Issues
N/A
Deployment Notes
N/A
Checklist