A specialized agent that evaluates and scores outputs from other agents, providing consistent, objective evaluation and quality control.
## Overview
Agent Judge uses a dedicated agent to evaluate outputs, select between alternatives, or make decisions based on defined criteria. This pattern separates execution from evaluation, enabling consistent quality assessment and automated decision-making.
## When to Use
- Output evaluation: Assess quality of agent outputs
- Selection tasks: Choose best option from multiple candidates
- Quality gates: Validate outputs meet standards
- A/B testing: Compare different approaches
- Scoring and ranking: Order results by quality
- Automated review: Consistent evaluation criteria
## Basic Usage
```python
from azcore import AgentJudge, Agent

# Create judge agent
judge = AgentJudge(
    agent_name="Judge",
    system_prompt="""Evaluate outputs objectively.

Criteria:
- Accuracy (0-10)
- Clarity (0-10)
- Completeness (0-10)
- Usefulness (0-10)

Provide scores and justification.
Format: SCORE: [total]/40""",
    llm=llm,
)

# Evaluate an output
output = "Some agent output to evaluate..."
evaluation = judge.evaluate(output)

print(f"Score: {evaluation.score}")
print(f"Feedback: {evaluation.feedback}")
```
## Configuration Options

### evaluation_criteria

Define what to evaluate:
```python
judge = AgentJudge(
    system_prompt="""Evaluate based on:
1. Technical Accuracy (weight: 40%)
2. Practical Applicability (weight: 30%)
3. Clarity of Expression (weight: 20%)
4. Innovation (weight: 10%)

Total score: 0-100""",
)
```
### scoring_method

How to score:
```python
# Numeric scoring
judge = AgentJudge(scoring_method="numeric", score_range=(0, 10))

# Letter grades
judge = AgentJudge(scoring_method="letter", grades=["A", "B", "C", "D", "F"])

# Pass/Fail
judge = AgentJudge(scoring_method="binary", threshold=0.7)

# Custom scoring
judge = AgentJudge(
    scoring_method="custom",
    score_extractor=lambda text: extract_custom_score(text),
)
```
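The custom `score_extractor` above delegates to an `extract_custom_score` helper that this guide does not define. A minimal sketch of one, assuming the judge reports its result on a line such as `SCORE: 34/40`, could look like this:

```python
import re
from typing import Optional

def extract_custom_score(text: str) -> Optional[float]:
    """Pull a numeric score out of judge output formatted as 'SCORE: <n>/<max>'.

    Returns the score normalized to the 0-1 range, or None if no score line is found.
    """
    match = re.search(r"SCORE:\s*(\d+(?:\.\d+)?)\s*/\s*(\d+(?:\.\d+)?)", text)
    if not match:
        return None
    score, maximum = float(match.group(1)), float(match.group(2))
    return score / maximum if maximum else None
```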
### comparison_mode

For comparing multiple outputs:
```python
judge = AgentJudge(
    comparison_mode=True,
    system_prompt="Compare the outputs and select the best one.",
)

best = judge.select_best([output1, output2, output3])
```
## Advanced Examples

### Code Quality Evaluation
```python
from azcore import AgentJudge

# Code quality judge
code_judge = AgentJudge(
    agent_name="Code Quality Judge",
    system_prompt="""Evaluate code quality comprehensively.

Criteria (each 0-10):
1. Correctness - Does it work?
2. Readability - Is it clear?
3. Efficiency - Is it optimized?
4. Maintainability - Is it sustainable?
5. Best Practices - Does it follow conventions?
6. Error Handling - Are errors managed?
7. Documentation - Is it documented?
8. Testing - Is it testable?

Provide:
- Individual scores
- Total score (0-80)
- Specific feedback for each criterion
- Overall recommendation

Format:
SCORES: [list scores]
TOTAL: [sum]/80
GRADE: [A/B/C/D/F]""",
    llm=llm,
)

# Evaluate code
code_sample = """
def calculate_average(numbers):
    return sum(numbers) / len(numbers)
"""

evaluation = code_judge.evaluate(code_sample)

print(f"Code Grade: {evaluation.grade}")
print(f"Total Score: {evaluation.score}/80")
print(f"\nFeedback:\n{evaluation.feedback}")
```
### Content Quality Assessment
```python
# Content quality judge
content_judge = AgentJudge(
    agent_name="Content Judge",
    system_prompt="""Evaluate content quality.

Dimensions (each 0-10):
- Accuracy: Facts correct?
- Clarity: Easy to understand?
- Engagement: Interesting to read?
- Structure: Well-organized?
- Grammar: Proper language?
- Depth: Sufficient detail?
- Sources: Well-cited?
- Originality: Fresh perspective?

Total: 0-80
Verdict: [PUBLISH/REVISE/REJECT]""",
    llm=llm,
)

# Evaluate article
article = """[article text here]"""

evaluation = content_judge.evaluate(article)

print(f"Verdict: {evaluation.verdict}")
print(f"Score: {evaluation.score}/80")
print(f"\nStrengths: {evaluation.strengths}")
print(f"Improvements: {evaluation.improvements}")
```
### Solution Comparison
```python
# Comparison judge
comparison_judge = AgentJudge(
    agent_name="Solution Comparator",
    system_prompt="""Compare multiple solutions and select the best.

For each solution, evaluate:
1. Effectiveness (0-10)
2. Feasibility (0-10)
3. Cost (0-10)
4. Time to implement (0-10)
5. Risk level (0-10)

Select winner and explain why.

Format:
SOLUTION 1: [scores]
SOLUTION 2: [scores]
SOLUTION 3: [scores]
WINNER: Solution [n]
REASON: [explanation]""",
    llm=llm,
)

# Compare solutions
solutions = [
    "Solution 1: Build in-house from scratch...",
    "Solution 2: Use existing open-source...",
    "Solution 3: Purchase commercial product...",
]

result = comparison_judge.select_best(solutions)

print(f"Best solution: {result.winner}")
print(f"Reason: {result.reason}")
print(f"\nAll scores: {result.all_scores}")
```
### Quality Gate
```python
# Quality gate judge
quality_gate = AgentJudge(
    agent_name="Quality Gate",
    system_prompt="""Determine if output meets quality standards.

Requirements:
- Accuracy: Must be factually correct
- Completeness: Must address all requirements
- Clarity: Must be understandable
- Safety: Must not contain harmful content

Check each requirement:
✓ Pass
✗ Fail

Final verdict: PASS/FAIL
If FAIL, list reasons.""",
    llm=llm,
)

# Check quality gate
output = """Some output to validate..."""
result = quality_gate.evaluate(output)

if result.verdict == "PASS":
    print("Quality gate passed!")
    proceed_with_output(output)
else:
    print(f"Quality gate failed: {result.failures}")
    handle_failure(output, result.failures)
```
### Automated Grading
```python
# Assignment grader
grader = AgentJudge(
    agent_name="Assignment Grader",
    system_prompt="""Grade the student submission.

Rubric:
1. Understanding (25 points) - Demonstrates comprehension
2. Application (25 points) - Applies concepts correctly
3. Analysis (25 points) - Critical thinking evident
4. Communication (15 points) - Clear presentation
5. Citations (10 points) - Proper references

Total: 100 points
Letter grade: A (90-100), B (80-89), C (70-79), D (60-69), F (<60)

Provide:
- Points for each criterion
- Total score
- Letter grade
- Constructive feedback""",
    llm=llm,
)

# Grade submission
submission = """Student's assignment text..."""
grade = grader.evaluate(submission)

print(f"Grade: {grade.letter} ({grade.score}/100)")
print(f"\nBreakdown:")
for criterion, points in grade.breakdown.items():
    print(f"  {criterion}: {points}")
print(f"\nFeedback:\n{grade.feedback}")
```
### Risk Assessment
```python
# Risk assessor
risk_judge = AgentJudge(
    agent_name="Risk Assessor",
    system_prompt="""Assess risk level of the proposal.

Risk Categories:
1. Technical Risk (0-10)
2. Financial Risk (0-10)
3. Schedule Risk (0-10)
4. Reputation Risk (0-10)
5. Regulatory Risk (0-10)

For each:
- Likelihood (Low/Medium/High)
- Impact (Low/Medium/High)
- Risk score = Likelihood × Impact

Overall Risk: [Low/Medium/High/Critical]
Recommendation: [APPROVE/CONDITIONAL/REJECT]""",
    llm=llm,
)

# Assess proposal
proposal = """Proposal details..."""
assessment = risk_judge.evaluate(proposal)

print(f"Overall Risk: {assessment.risk_level}")
print(f"Recommendation: {assessment.recommendation}")
print(f"\nRisk Breakdown:")
for category, risk in assessment.risks.items():
    print(f"  {category}: {risk.score} ({risk.likelihood} × {risk.impact})")
```
## Evaluation Patterns

### Rubric-Based

Clear criteria with weights:
rubric = """
Evaluate using this rubric:
| Criterion | Weight | Score (0-10) |
|-----------|--------|--------------|
| Accuracy | 40% | |
| Clarity | 30% | |
| Depth | 20% | |
| Style | 10% | |
Weighted Total: [0-10]
"""
### Comparative

Rank multiple options:
judge_prompt = """Compare all options:
For each option:
1. List pros and cons
2. Score on key criteria
3. Identify trade-offs
Rank from best to worst.
Explain reasoning.
"""
### Binary

Pass/Fail decision:
judge_prompt = """Does the output meet requirements?
Requirements:
- [List requirements]
Check each:
✓ Met
✗ Not met
Verdict: PASS if all met, FAIL otherwise
"""
### Calibrated

Compare against reference:
judge_prompt = """Compare output to reference standard.
Reference: {reference}
Output: {output}
How does output compare?
- Better than reference
- Equal to reference
- Worse than reference
Score relative quality: -10 to +10
"""
## Best Practices

### 1. Clear Evaluation Criteria

Be specific about what to evaluate:
```python
judge = AgentJudge(
    system_prompt="""Evaluate on these specific criteria:

1. Technical Accuracy (0-10)
   - Facts are correct
   - Methodology is sound
   - No logical errors

2. Practical Usefulness (0-10)
   - Actionable recommendations
   - Real-world applicable
   - Clear next steps

[etc...]""",
)
```
### 2. Consistent Scoring Format

Enforce consistent output:
judge_prompt = """MANDATORY FORMAT:
SCORES:
- Criterion 1: [score]/10
- Criterion 2: [score]/10
- Criterion 3: [score]/10
TOTAL: [sum]/30
VERDICT: [PASS/FAIL]
Deviation from this format is not acceptable.
"""
### 3. Provide Examples

Show good vs. bad examples:
judge_prompt = """Evaluate code quality.
Good example:
```python
def calculate_mean(numbers: List[float]) -> float:
'''Calculate arithmetic mean.'''
return sum(numbers) / len(numbers)
Score: 8/10 - Clear, typed, documented
Bad example:
def calc(n):
return sum(n)/len(n)
Score: 3/10 - Unclear name, no types, no docs """
### 4. Calibration
Validate judge consistency:
```python
# Test on known examples: (output, expected score)
test_cases = [
    ("excellent output", 9),
    ("poor output", 3),
    ("mediocre output", 5),
]

for output, expected in test_cases:
    result = judge.evaluate(output)
    assert abs(result.score - expected) < 2, "Judge not calibrated"
```
### 5. Multiple Judges

Use an ensemble of judges:
```python
import statistics

judges = [judge1, judge2, judge3]
evaluations = [j.evaluate(output) for j in judges]
final_score = statistics.mean([e.score for e in evaluations])
```
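Mean aggregation is sensitive to a single outlier judge. A slightly more robust variant, sketched below rather than provided by azcore, takes the median and flags strong disagreement for human review:

```python
import statistics

scores = [e.score for e in evaluations]
final_score = statistics.median(scores)

# If the judges disagree strongly, escalate rather than trust the aggregate.
if statistics.stdev(scores) > 2.0:
    print("Judges disagree; route this output to human review")
```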
## Performance Considerations

### Latency
```python
from concurrent.futures import ThreadPoolExecutor

# Single evaluation: 1 LLM call
evaluation = judge.evaluate(output)

# Multiple evaluations: N LLM calls
evaluations = [judge.evaluate(o) for o in outputs]

# Parallel evaluation
with ThreadPoolExecutor() as executor:
    evaluations = list(executor.map(judge.evaluate, outputs))
```
### Cost
```python
# Cost = num_evaluations × judge_cost
# Judge prompts are typically long (detailed criteria)

# Optimize:
# 1. Cache evaluations
# 2. Use cheaper model for simple judgments
# 3. Batch similar evaluations
```
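A minimal cache for point 1 keys evaluations on a hash of the output text. This sketch assumes evaluations are deterministic enough to reuse (for example, the judge runs at temperature 0):

```python
import hashlib

_evaluation_cache = {}

def cached_evaluate(judge, output: str):
    """Reuse a previous evaluation when the exact same output is judged again."""
    key = hashlib.sha256(output.encode("utf-8")).hexdigest()
    if key not in _evaluation_cache:
        _evaluation_cache[key] = judge.evaluate(output)
    return _evaluation_cache[key]
```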
### Consistency
```python
import statistics

# Check judge consistency by scoring the same output several times
same_output_scores = [judge.evaluate(output) for _ in range(5)]
score_stddev = statistics.stdev([s.score for s in same_output_scores])

if score_stddev > 1.0:
    print("Warning: Judge is inconsistent")
    # Consider: lower temperature, clearer criteria, better prompt
```
## Error Handling

Handle evaluation failures:
```python
try:
    evaluation = judge.evaluate(output)
except ScoringError as e:
    # Couldn't extract score
    evaluation = retry_with_clearer_format()
except EvaluationError as e:
    # Judge refused to evaluate
    evaluation = use_fallback_judge()
```
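The `retry_with_clearer_format` and `use_fallback_judge` helpers above are placeholders. One way to flesh them out, shown as a sketch with hypothetical names and a deliberately broad `except`, is a retry loop that appends a stricter format reminder before falling back to a second judge:

```python
FORMAT_REMINDER = "\n\nRespond ONLY in the format: SCORE: <number>/10"

def evaluate_with_fallback(primary_judge, fallback_judge, output: str, retries: int = 2):
    """Retry the primary judge with tighter format instructions, then fall back."""
    prompt = output
    for _ in range(retries):
        try:
            return primary_judge.evaluate(prompt)
        except Exception:  # In practice, narrow this to the library's scoring/evaluation errors.
            prompt = output + FORMAT_REMINDER
    return fallback_judge.evaluate(output)
```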
## Debugging

### Inspect Evaluations
```python
evaluation = judge.evaluate(output)

print(f"Score: {evaluation.score}")
print(f"Reasoning: {evaluation.reasoning}")
print(f"Criteria breakdown:")
for criterion, score in evaluation.breakdown.items():
    print(f"  {criterion}: {score}")
```
### Compare Judges
```python
import statistics

judges = [judge1, judge2, judge3]

for judge in judges:
    evaluation = judge.evaluate(output)
    print(f"{judge.agent_name}: {evaluation.score}")

# Check inter-judge agreement
scores = [judge.evaluate(output).score for judge in judges]
agreement = 1 - (statistics.stdev(scores) / statistics.mean(scores))
print(f"Inter-judge agreement: {agreement:.1%}")
```
## Integration Patterns

### With Self-Consistency
```python
# Generate multiple candidates
candidates = [agent.run(task) for _ in range(5)]

# Judge each one
evaluations = [judge.evaluate(c) for c in candidates]

# Select the best candidate by judge score
best, best_eval = max(zip(candidates, evaluations), key=lambda pair: pair[1].score)
```
### With Reflexion
```python
# Use judge for evaluation in reflexion loop
reflexion = ReflexionAgent(
    agent=agent,
    evaluator=lambda output: judge.evaluate(output).score,
    max_iterations=5,
)
```
### With Reasoning Duo
```python
# Judge acts as critic
duo = ReasoningDuoAgent(
    proposer=proposer,
    critic=judge,  # Judge provides critique
    acceptance_criteria=lambda c: c.score >= 8,
)
```
## Limitations

### Subjectivity
Judges can be biased:
```python
# Mitigation: Multiple judges, clear criteria, calibration
```
### Context Limitations

The judge may lack the context needed to evaluate accurately:
```python
# Mitigation: Provide full context in evaluation prompt
```
### Scoring Drift

Scores may drift over time:

```python
# Mitigation: Regular calibration with reference examples
```
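A lightweight way to implement this mitigation, sketched here with hypothetical reference data, is to periodically re-score a fixed set of reference examples and alert when the judge's scores move away from the values recorded at calibration time:

```python
import statistics

# Hypothetical reference set: (example text, score recorded when the judge was calibrated).
REFERENCE_EXAMPLES = [
    ("known excellent output...", 9),
    ("known mediocre output...", 5),
    ("known poor output...", 2),
]

def check_for_drift(judge, tolerance: float = 1.5) -> bool:
    """Return True if the judge's scores have drifted beyond the tolerance on average."""
    deviations = [
        abs(judge.evaluate(text).score - expected)
        for text, expected in REFERENCE_EXAMPLES
    ]
    mean_deviation = statistics.mean(deviations)
    if mean_deviation > tolerance:
        print(f"Judge drift detected (mean deviation {mean_deviation:.1f}); recalibrate")
        return True
    return False
```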