A specialized agent that evaluates and scores outputs from other agents, providing consistent, objective evaluation and quality control.
## Overview
Agent Judge uses a dedicated agent to evaluate outputs, select between alternatives, or make decisions based on defined criteria. This pattern separates execution from evaluation, enabling consistent quality assessment and automated decision-making.
## When to Use
- Output evaluation: Assess quality of agent outputs
- Selection tasks: Choose best option from multiple candidates
- Quality gates: Validate outputs meet standards
- A/B testing: Compare different approaches
- Scoring and ranking: Order results by quality
- Automated review: Consistent evaluation criteria
## Basic Usage
```python
from azcore import AgentJudge, Agent

# Create judge agent
judge = AgentJudge(
    agent_name="Judge",
    system_prompt="""Evaluate outputs objectively.

Criteria:
- Accuracy (0-10)
- Clarity (0-10)
- Completeness (0-10)
- Usefulness (0-10)

Provide scores and justification.
Format: SCORE: [total]/40""",
    llm=llm,
)

# Evaluate an output
output = "Some agent output to evaluate..."
evaluation = judge.evaluate(output)

print(f"Score: {evaluation.score}")
print(f"Feedback: {evaluation.feedback}")
```
## Configuration Options

### evaluation_criteria

Define what to evaluate:
```python
judge = AgentJudge(
    system_prompt="""Evaluate based on:
1. Technical Accuracy (weight: 40%)
2. Practical Applicability (weight: 30%)
3. Clarity of Expression (weight: 20%)
4. Innovation (weight: 10%)

Total score: 0-100""",
)
```
### scoring_method

How to score:
```python
# Numeric scoring
judge = AgentJudge(scoring_method="numeric", score_range=(0, 10))

# Letter grades
judge = AgentJudge(scoring_method="letter", grades=["A", "B", "C", "D", "F"])

# Pass/Fail
judge = AgentJudge(scoring_method="binary", threshold=0.7)

# Custom scoring
judge = AgentJudge(
    scoring_method="custom",
    score_extractor=lambda text: extract_custom_score(text),
)
```
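The custom `score_extractor` above delegates to an `extract_custom_score` helper that this guide does not define. A minimal sketch of one, assuming the judge reports its result on a line such as `SCORE: 34/40`, could look like this:

```python
import re
from typing import Optional

def extract_custom_score(text: str) -> Optional[float]:
    """Pull a numeric score out of judge output formatted as 'SCORE: <n>/<max>'.

    Returns the score normalized to the 0-1 range, or None if no score line is found.
    """
    match = re.search(r"SCORE:\s*(\d+(?:\.\d+)?)\s*/\s*(\d+(?:\.\d+)?)", text)
    if not match:
        return None
    score, maximum = float(match.group(1)), float(match.group(2))
    return score / maximum if maximum else None
```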
### comparison_mode

For comparing multiple outputs:
```python
judge = AgentJudge(
    comparison_mode=True,
    system_prompt="Compare the outputs and select the best one.",
)

best = judge.select_best([output1, output2, output3])
```
## Advanced Examples

### Code Quality Evaluation
```python
from azcore import AgentJudge

# Code quality judge
code_judge = AgentJudge(
    agent_name="Code Quality Judge",
    system_prompt="""Evaluate code quality comprehensively.

Criteria (each 0-10):
1. Correctness - Does it work?
2. Readability - Is it clear?
3. Efficiency - Is it optimized?
4. Maintainability - Is it sustainable?
5. Best Practices - Does it follow conventions?
6. Error Handling - Are errors managed?
7. Documentation - Is it documented?
8. Testing - Is it testable?

Provide:
- Individual scores
- Total score (0-80)
- Specific feedback for each criterion
- Overall recommendation

Format:
SCORES: [list scores]
TOTAL: [sum]/80
GRADE: [A/B/C/D/F]""",
    llm=llm,
)

# Evaluate code
code_sample = """
def calculate_average(numbers):
    return sum(numbers) / len(numbers)
"""

evaluation = code_judge.evaluate(code_sample)

print(f"Code Grade: {evaluation.grade}")
print(f"Total Score: {evaluation.score}/80")
print(f"\nFeedback:\n{evaluation.feedback}")
```
### Content Quality Assessment
```python
# Content quality judge
content_judge = AgentJudge(
    agent_name="Content Judge",
    system_prompt="""Evaluate content quality.

Dimensions (each 0-10):
- Accuracy: Facts correct?
- Clarity: Easy to understand?
- Engagement: Interesting to read?
- Structure: Well-organized?
- Grammar: Proper language?
- Depth: Sufficient detail?
- Sources: Well-cited?
- Originality: Fresh perspective?

Total: 0-80
Verdict: [PUBLISH/REVISE/REJECT]""",
    llm=llm,
)

# Evaluate article
article = """[article text here]"""

evaluation = content_judge.evaluate(article)

print(f"Verdict: {evaluation.verdict}")
print(f"Score: {evaluation.score}/80")
print(f"\nStrengths: {evaluation.strengths}")
print(f"Improvements: {evaluation.improvements}")
```
### Solution Comparison
```python
# Comparison judge
comparison_judge = AgentJudge(
    agent_name="Solution Comparator",
    system_prompt="""Compare multiple solutions and select the best.

For each solution, evaluate:
1. Effectiveness (0-10)
2. Feasibility (0-10)
3. Cost (0-10)
4. Time to implement (0-10)
5. Risk level (0-10)

Select winner and explain why.

Format:
SOLUTION 1: [scores]
SOLUTION 2: [scores]
SOLUTION 3: [scores]
WINNER: Solution [n]
REASON: [explanation]""",
    llm=llm,
)

# Compare solutions
solutions = [
    "Solution 1: Build in-house from scratch...",
    "Solution 2: Use existing open-source...",
    "Solution 3: Purchase commercial product...",
]

result = comparison_judge.select_best(solutions)

print(f"Best solution: {result.winner}")
print(f"Reason: {result.reason}")
print(f"\nAll scores: {result.all_scores}")
```
### Quality Gate
```python
# Quality gate judge
quality_gate = AgentJudge(
    agent_name="Quality Gate",
    system_prompt="""Determine if output meets quality standards.

Requirements:
- Accuracy: Must be factually correct
- Completeness: Must address all requirements
- Clarity: Must be understandable
- Safety: Must not contain harmful content

Check each requirement:
✓ Pass
✗ Fail

Final verdict: PASS/FAIL
If FAIL, list reasons.""",
    llm=llm,
)

# Check quality gate
output = """Some output to validate..."""
result = quality_gate.evaluate(output)

if result.verdict == "PASS":
    print("Quality gate passed!")
    proceed_with_output(output)
else:
    print(f"Quality gate failed: {result.failures}")
    handle_failure(output, result.failures)
```
### Automated Grading
```python
# Assignment grader
grader = AgentJudge(
    agent_name="Assignment Grader",
    system_prompt="""Grade the student submission.

Rubric:
1. Understanding (25 points) - Demonstrates comprehension
2. Application (25 points) - Applies concepts correctly
3. Analysis (25 points) - Critical thinking evident
4. Communication (15 points) - Clear presentation
5. Citations (10 points) - Proper references

Total: 100 points
Letter grade: A (90-100), B (80-89), C (70-79), D (60-69), F (<60)

Provide:
- Points for each criterion
- Total score
- Letter grade
- Constructive feedback""",
    llm=llm,
)

# Grade submission
submission = """Student's assignment text..."""
grade = grader.evaluate(submission)

print(f"Grade: {grade.letter} ({grade.score}/100)")
print(f"\nBreakdown:")
for criterion, points in grade.breakdown.items():
    print(f"  {criterion}: {points}")
print(f"\nFeedback:\n{grade.feedback}")
```
### Risk Assessment
```python
# Risk assessor
risk_judge = AgentJudge(
    agent_name="Risk Assessor",
    system_prompt="""Assess risk level of the proposal.

Risk Categories:
1. Technical Risk (0-10)
2. Financial Risk (0-10)
3. Schedule Risk (0-10)
4. Reputation Risk (0-10)
5. Regulatory Risk (0-10)

For each:
- Likelihood (Low/Medium/High)
- Impact (Low/Medium/High)
- Risk score = Likelihood × Impact

Overall Risk: [Low/Medium/High/Critical]
Recommendation: [APPROVE/CONDITIONAL/REJECT]""",
    llm=llm,
)

# Assess proposal
proposal = """Proposal details..."""
assessment = risk_judge.evaluate(proposal)

print(f"Overall Risk: {assessment.risk_level}")
print(f"Recommendation: {assessment.recommendation}")
print(f"\nRisk Breakdown:")
for category, risk in assessment.risks.items():
    print(f"  {category}: {risk.score} ({risk.likelihood} × {risk.impact})")
```
## Evaluation Patterns

### Rubric-Based

Clear criteria with weights:
rubric = """
Evaluate using this rubric:
| Criterion | Weight | Score (0-10) |
|-----------|--------|--------------|
| Accuracy | 40% | |
| Clarity | 30% | |
| Depth | 20% | |
| Style | 10% | |
Weighted Total: [0-10]
"""
### Comparative

Rank multiple options:
judge_prompt = """Compare all options:
For each option:
1. List pros and cons
2. Score on key criteria
3. Identify trade-offs
Rank from best to worst.
Explain reasoning.
"""
### Binary

Pass/Fail decision:
judge_prompt = """Does the output meet requirements?
Requirements:
- [List requirements]
Check each:
✓ Met
✗ Not met
Verdict: PASS if all met, FAIL otherwise
"""
### Calibrated

Compare against reference:
judge_prompt = """Compare output to reference standard.
Reference: {reference}
Output: {output}
How does output compare?
- Better than reference
- Equal to reference
- Worse than reference
Score relative quality: -10 to +10
"""
## Best Practices

### 1. Clear Evaluation Criteria

Be specific about what to evaluate:
```python
judge = AgentJudge(
    system_prompt="""Evaluate on these specific criteria:

1. Technical Accuracy (0-10)
   - Facts are correct
   - Methodology is sound
   - No logical errors

2. Practical Usefulness (0-10)
   - Actionable recommendations
   - Real-world applicable
   - Clear next steps

[etc...]""",
)
```
### 2. Consistent Scoring Format

Enforce consistent output:
judge_prompt = """MANDATORY FORMAT:
SCORES:
- Criterion 1: [score]/10
- Criterion 2: [score]/10
- Criterion 3: [score]/10
TOTAL: [sum]/30
VERDICT: [PASS/FAIL]
Deviation from this format is not acceptable.
"""
### 3. Provide Examples

Show good vs. bad examples:
judge_prompt = """Evaluate code quality.
Good example:
```python
def calculate_mean(numbers: List[float]) -> float:
'''Calculate arithmetic mean.'''
return sum(numbers) / len(numbers)
Score: 8/10 - Clear, typed, documented
Bad example:
def calc(n):
return sum(n)/len(n)
Score: 3/10 - Unclear name, no types, no docs """
### 4. Calibration
Validate judge consistency:
```python
# Test on known examples: (output, expected score)
test_cases = [
    ("excellent output", 9),
    ("poor output", 3),
    ("mediocre output", 5),
]

for output, expected in test_cases:
    result = judge.evaluate(output)
    assert abs(result.score - expected) < 2, "Judge not calibrated"
```
### 5. Multiple Judges

Use an ensemble of judges:
```python
import statistics

judges = [judge1, judge2, judge3]
evaluations = [j.evaluate(output) for j in judges]
final_score = statistics.mean([e.score for e in evaluations])
```
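Mean aggregation is sensitive to a single outlier judge. A slightly more robust variant, sketched below rather than provided by azcore, takes the median and flags strong disagreement for human review:

```python
import statistics

scores = [e.score for e in evaluations]
final_score = statistics.median(scores)

# If the judges disagree strongly, escalate rather than trust the aggregate.
if statistics.stdev(scores) > 2.0:
    print("Judges disagree; route this output to human review")
```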
## Performance Considerations

### Latency
```python
from concurrent.futures import ThreadPoolExecutor

# Single evaluation: 1 LLM call
evaluation = judge.evaluate(output)

# Multiple evaluations: N LLM calls
evaluations = [judge.evaluate(o) for o in outputs]

# Parallel evaluation
with ThreadPoolExecutor() as executor:
    evaluations = list(executor.map(judge.evaluate, outputs))
```
### Cost
```python
# Cost = num_evaluations × judge_cost
# Judge prompts are typically long (detailed criteria)

# Optimize:
# 1. Cache evaluations
# 2. Use cheaper model for simple judgments
# 3. Batch similar evaluations
```
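A minimal cache for point 1 keys evaluations on a hash of the output text. This sketch assumes evaluations are deterministic enough to reuse (for example, the judge runs at temperature 0):

```python
import hashlib

_evaluation_cache = {}

def cached_evaluate(judge, output: str):
    """Reuse a previous evaluation when the exact same output is judged again."""
    key = hashlib.sha256(output.encode("utf-8")).hexdigest()
    if key not in _evaluation_cache:
        _evaluation_cache[key] = judge.evaluate(output)
    return _evaluation_cache[key]
```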
### Consistency
```python
import statistics

# Check judge consistency by scoring the same output several times
same_output_scores = [judge.evaluate(output) for _ in range(5)]
score_stddev = statistics.stdev([s.score for s in same_output_scores])

if score_stddev > 1.0:
    print("Warning: Judge is inconsistent")
    # Consider: lower temperature, clearer criteria, better prompt
```
## Error Handling

Handle evaluation failures:
```python
try:
    evaluation = judge.evaluate(output)
except ScoringError as e:
    # Couldn't extract score
    evaluation = retry_with_clearer_format()
except EvaluationError as e:
    # Judge refused to evaluate
    evaluation = use_fallback_judge()
```
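The `retry_with_clearer_format` and `use_fallback_judge` helpers above are placeholders. One way to flesh them out, shown as a sketch with hypothetical names and a deliberately broad `except`, is a retry loop that appends a stricter format reminder before falling back to a second judge:

```python
FORMAT_REMINDER = "\n\nRespond ONLY in the format: SCORE: <number>/10"

def evaluate_with_fallback(primary_judge, fallback_judge, output: str, retries: int = 2):
    """Retry the primary judge with tighter format instructions, then fall back."""
    prompt = output
    for _ in range(retries):
        try:
            return primary_judge.evaluate(prompt)
        except Exception:  # In practice, narrow this to the library's scoring/evaluation errors.
            prompt = output + FORMAT_REMINDER
    return fallback_judge.evaluate(output)
```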
## Debugging

### Inspect Evaluations
```python
evaluation = judge.evaluate(output)

print(f"Score: {evaluation.score}")
print(f"Reasoning: {evaluation.reasoning}")
print(f"Criteria breakdown:")
for criterion, score in evaluation.breakdown.items():
    print(f"  {criterion}: {score}")
```
### Compare Judges
```python
import statistics

judges = [judge1, judge2, judge3]

for judge in judges:
    evaluation = judge.evaluate(output)
    print(f"{judge.agent_name}: {evaluation.score}")

# Check inter-judge agreement
scores = [judge.evaluate(output).score for judge in judges]
agreement = 1 - (statistics.stdev(scores) / statistics.mean(scores))
print(f"Inter-judge agreement: {agreement:.1%}")
```
## Integration Patterns

### With Self-Consistency
```python
# Generate multiple candidates
candidates = [agent.run(task) for _ in range(5)]

# Judge each one
evaluations = [judge.evaluate(c) for c in candidates]

# Select the best candidate by judge score
best, best_eval = max(zip(candidates, evaluations), key=lambda pair: pair[1].score)
```
### With Reflexion
```python
# Use judge for evaluation in reflexion loop
reflexion = ReflexionAgent(
    agent=agent,
    evaluator=lambda output: judge.evaluate(output).score,
    max_iterations=5,
)
```
### With Reasoning Duo
```python
# Judge acts as critic
duo = ReasoningDuoAgent(
    proposer=proposer,
    critic=judge,  # Judge provides critique
    acceptance_criteria=lambda c: c.score >= 8,
)
```
## Limitations

### Subjectivity
Judges can be biased:
```python
# Mitigation: Multiple judges, clear criteria, calibration
```
### Context Limitations

The judge may lack the context needed to evaluate accurately:
```python
# Mitigation: Provide full context in evaluation prompt
```
### Scoring Drift

Scores may drift over time:

```python
# Mitigation: Regular calibration with reference examples
```
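A lightweight way to implement this mitigation, sketched here with hypothetical reference data, is to periodically re-score a fixed set of reference examples and alert when the judge's scores move away from the values recorded at calibration time:

```python
import statistics

# Hypothetical reference set: (example text, score recorded when the judge was calibrated).
REFERENCE_EXAMPLES = [
    ("known excellent output...", 9),
    ("known mediocre output...", 5),
    ("known poor output...", 2),
]

def check_for_drift(judge, tolerance: float = 1.5) -> bool:
    """Return True if the judge's scores have drifted beyond the tolerance on average."""
    deviations = [
        abs(judge.evaluate(text).score - expected)
        for text, expected in REFERENCE_EXAMPLES
    ]
    mean_deviation = statistics.mean(deviations)
    if mean_deviation > tolerance:
        print(f"Judge drift detected (mean deviation {mean_deviation:.1f}); recalibrate")
        return True
    return False
```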