Advanced Agent Patterns

Agent Judge Pattern

Use a specialized judge agent to evaluate and rank outputs from multiple agents.

A dedicated agent evaluates and scores the outputs of other agents, providing consistent, objective evaluation and quality control.

Overview

Agent Judge uses a dedicated agent to evaluate outputs, select between alternatives, or make decisions based on defined criteria. This pattern separates execution from evaluation, enabling consistent quality assessment and automated decision-making.
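
A minimal sketch of that separation, assuming an Agent constructed like the AgentJudge below and the agent.run / judge.select_best calls used later on this page (llm is a configured model, as in the other examples):

from azcore import Agent, AgentJudge

# Execution side: a worker agent produces candidate outputs (assumed constructor).
worker = Agent(agent_name="Writer", system_prompt="Draft a concise answer.", llm=llm)
candidates = [worker.run("Explain caching in one paragraph.") for _ in range(3)]

# Evaluation side: a dedicated judge compares the candidates and picks one.
judge = AgentJudge(
    agent_name="Judge",
    system_prompt="Score each candidate 0-10 for accuracy and clarity, then select the best.",
    comparison_mode=True,
    llm=llm,
)
best = judge.select_best(candidates)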

When to Use

  • Output evaluation: Assess quality of agent outputs
  • Selection tasks: Choose best option from multiple candidates
  • Quality gates: Validate outputs meet standards
  • A/B testing: Compare different approaches
  • Scoring and ranking: Order results by quality
  • Automated review: Consistent evaluation criteria

Basic Usage

from azcore import AgentJudge, Agent

# Create judge agent
judge = AgentJudge(
    agent_name="Judge",
    system_prompt="""Evaluate outputs objectively.

    Criteria:
    - Accuracy (0-10)
    - Clarity (0-10)
    - Completeness (0-10)
    - Usefulness (0-10)

    Provide scores and justification.
    Format: SCORE: [total]/40""",
    llm=llm,
)

# Evaluate an output
output = "Some agent output to evaluate..."
evaluation = judge.evaluate(output)

print(f"Score: {evaluation.score}")
print(f"Feedback: {evaluation.feedback}")

Configuration Options

evaluation_criteria

Define what to evaluate:

judge = AgentJudge(
    system_prompt="""Evaluate based on:

    1. Technical Accuracy (weight: 40%)
    2. Practical Applicability (weight: 30%)
    3. Clarity of Expression (weight: 20%)
    4. Innovation (weight: 10%)

    Total score: 0-100""",
)

scoring_method

How to score:

# Numeric scoring
judge = AgentJudge(scoring_method="numeric", score_range=(0, 10))

# Letter grades
judge = AgentJudge(scoring_method="letter", grades=["A", "B", "C", "D", "F"])

# Pass/Fail
judge = AgentJudge(scoring_method="binary", threshold=0.7)

# Custom scoring
judge = AgentJudge(
    scoring_method="custom",
    score_extractor=lambda text: extract_custom_score(text)
)
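
The extract_custom_score helper above is not defined by the library; one possible (hypothetical) implementation, assuming the judge replies with a "SCORE: [total]/[max]" line as in the Basic Usage prompt:

import re

def extract_custom_score(text: str) -> float:
    """Hypothetical extractor: pull the total from a 'SCORE: <n>/<max>' line
    in the judge's reply. Adjust the pattern to match your prompt's format."""
    match = re.search(r"SCORE:\s*(\d+(?:\.\d+)?)\s*/\s*\d+", text)
    if match is None:
        raise ValueError("No score found in judge output")
    return float(match.group(1))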

comparison_mode

For comparing multiple outputs:

judge = AgentJudge(
    comparison_mode=True,
    system_prompt="Compare the outputs and select the best one."
)

best = judge.select_best([output1, output2, output3])

Advanced Examples

Code Quality Evaluation

from azcore import AgentJudge

# Code quality judge
code_judge = AgentJudge(
    agent_name="Code Quality Judge",
    system_prompt="""Evaluate code quality comprehensively.

    Criteria (each 0-10):
    1. Correctness - Does it work?
    2. Readability - Is it clear?
    3. Efficiency - Is it optimized?
    4. Maintainability - Is it sustainable?
    5. Best Practices - Does it follow conventions?
    6. Error Handling - Are errors managed?
    7. Documentation - Is it documented?
    8. Testing - Is it testable?

    Provide:
    - Individual scores
    - Total score (0-80)
    - Specific feedback for each criterion
    - Overall recommendation

    Format:
    SCORES: [list scores]
    TOTAL: [sum]/80
    GRADE: [A/B/C/D/F]""",
    llm=llm,
)

# Evaluate code
code_sample = """
def calculate_average(numbers):
    return sum(numbers) / len(numbers)
"""

evaluation = code_judge.evaluate(code_sample)
print(f"Code Grade: {evaluation.grade}")
print(f"Total Score: {evaluation.score}/80")
print(f"\nFeedback:\n{evaluation.feedback}")

Content Quality Assessment

# Content quality judge
content_judge = AgentJudge(
    agent_name="Content Judge",
    system_prompt="""Evaluate content quality.

    Dimensions (each 0-10):
    - Accuracy: Facts correct?
    - Clarity: Easy to understand?
    - Engagement: Interesting to read?
    - Structure: Well-organized?
    - Grammar: Proper language?
    - Depth: Sufficient detail?
    - Sources: Well-cited?
    - Originality: Fresh perspective?

    Total: 0-80
    Verdict: [PUBLISH/REVISE/REJECT]""",
    llm=llm,
)

# Evaluate article
article = """[article text here]"""
evaluation = content_judge.evaluate(article)

print(f"Verdict: {evaluation.verdict}")
print(f"Score: {evaluation.score}/80")
print(f"\nStrengths: {evaluation.strengths}")
print(f"Improvements: {evaluation.improvements}")

Solution Comparison

# Comparison judge
comparison_judge = AgentJudge(
    agent_name="Solution Comparator",
    system_prompt="""Compare multiple solutions and select the best.

    For each solution, evaluate:
    1. Effectiveness (0-10)
    2. Feasibility (0-10)
    3. Cost (0-10)
    4. Time to implement (0-10)
    5. Risk level (0-10)

    Select winner and explain why.

    Format:
    SOLUTION 1: [scores]
    SOLUTION 2: [scores]
    SOLUTION 3: [scores]
    WINNER: Solution [n]
    REASON: [explanation]""",
    llm=llm,
)

# Compare solutions
solutions = [
    "Solution 1: Build in-house from scratch...",
    "Solution 2: Use existing open-source...",
    "Solution 3: Purchase commercial product...",
]

result = comparison_judge.select_best(solutions)
print(f"Best solution: {result.winner}")
print(f"Reason: {result.reason}")
print(f"\nAll scores: {result.all_scores}")

Quality Gate

# Quality gate judge
quality_gate = AgentJudge(
    agent_name="Quality Gate",
    system_prompt="""Determine if output meets quality standards.

    Requirements:
    - Accuracy: Must be factually correct
    - Completeness: Must address all requirements
    - Clarity: Must be understandable
    - Safety: Must not contain harmful content

    Check each requirement:
    ✓ Pass
    ✗ Fail

    Final verdict: PASS/FAIL
    If FAIL, list reasons.""",
    llm=llm,
)

# Check quality gate
output = """Some output to validate..."""
result = quality_gate.evaluate(output)

if result.verdict == "PASS":
    print("Quality gate passed!")
    proceed_with_output(output)
else:
    print(f"Quality gate failed: {result.failures}")
    handle_failure(output, result.failures)

Automated Grading

# Assignment grader
grader = AgentJudge(
    agent_name="Assignment Grader",
    system_prompt="""Grade the student submission.

    Rubric:
    1. Understanding (25 points) - Demonstrates comprehension
    2. Application (25 points) - Applies concepts correctly
    3. Analysis (25 points) - Critical thinking evident
    4. Communication (15 points) - Clear presentation
    5. Citations (10 points) - Proper references

    Total: 100 points
    Letter grade: A (90-100), B (80-89), C (70-79), D (60-69), F (<60)

    Provide:
    - Points for each criterion
    - Total score
    - Letter grade
    - Constructive feedback""",
    llm=llm,
)

# Grade submission
submission = """Student's assignment text..."""
grade = grader.evaluate(submission)

print(f"Grade: {grade.letter} ({grade.score}/100)")
print(f"\nBreakdown:")
for criterion, points in grade.breakdown.items():
    print(f"  {criterion}: {points}")
print(f"\nFeedback:\n{grade.feedback}")

Risk Assessment

# Risk assessor
risk_judge = AgentJudge(
    agent_name="Risk Assessor",
    system_prompt="""Assess risk level of the proposal.

    Risk Categories:
    1. Technical Risk (0-10)
    2. Financial Risk (0-10)
    3. Schedule Risk (0-10)
    4. Reputation Risk (0-10)
    5. Regulatory Risk (0-10)

    For each:
    - Likelihood (Low/Medium/High)
    - Impact (Low/Medium/High)
    - Risk score = Likelihood × Impact

    Overall Risk: [Low/Medium/High/Critical]
    Recommendation: [APPROVE/CONDITIONAL/REJECT]""",
    llm=llm,
)

# Assess proposal
proposal = """Proposal details..."""
assessment = risk_judge.evaluate(proposal)

print(f"Overall Risk: {assessment.risk_level}")
print(f"Recommendation: {assessment.recommendation}")
print(f"\nRisk Breakdown:")
for category, risk in assessment.risks.items():
    print(f"  {category}: {risk.score} ({risk.likelihood} × {risk.impact})")

Evaluation Patterns

Rubric-Based

Clear criteria with weights:

rubric = """
Evaluate using this rubric:

| Criterion | Weight | Score (0-10) |
|-----------|--------|--------------|
| Accuracy  | 40%    |              |
| Clarity   | 30%    |              |
| Depth     | 20%    |              |
| Style     | 10%    |              |

Weighted Total: [0-10]
"""

Comparative

Rank multiple options:

judge_prompt = """Compare all options:

For each option:
1. List pros and cons
2. Score on key criteria
3. Identify trade-offs

Rank from best to worst.
Explain reasoning.
"""

Binary

Pass/Fail decision:

judge_prompt = """Does the output meet requirements?

Requirements:
- [List requirements]

Check each:
✓ Met
✗ Not met

Verdict: PASS if all met, FAIL otherwise
"""

Calibrated

Compare against reference:

judge_prompt = """Compare output to reference standard.

Reference: {reference}
Output: {output}

How does output compare?
- Better than reference
- Equal to reference
- Worse than reference

Score relative quality: -10 to +10
"""

Best Practices

1. Clear Evaluation Criteria

Be specific about what to evaluate:

judge = AgentJudge(
    system_prompt="""Evaluate on these specific criteria:

    1. Technical Accuracy (0-10)
       - Facts are correct
       - Methodology is sound
       - No logical errors

    2. Practical Usefulness (0-10)
       - Actionable recommendations
       - Real-world applicable
       - Clear next steps

    [etc...]""",
)

2. Consistent Scoring Format

Enforce consistent output:

judge_prompt = """MANDATORY FORMAT:

SCORES:
- Criterion 1: [score]/10
- Criterion 2: [score]/10
- Criterion 3: [score]/10

TOTAL: [sum]/30
VERDICT: [PASS/FAIL]

Deviation from this format is not acceptable.
"""

3. Provide Examples

Show good vs. bad examples:

judge_prompt = """Evaluate code quality.

Good example:
```python
def calculate_mean(numbers: List[float]) -> float:
    '''Calculate arithmetic mean.'''
    return sum(numbers) / len(numbers)

Score: 8/10 - Clear, typed, documented

Bad example:

def calc(n):
    return sum(n)/len(n)

Score: 3/10 - Unclear name, no types, no docs """


### 4. Calibration

Validate judge consistency:

# Test on known examples
test_cases = [
    ("excellent output", 9),  # (text, expected score)
    ("poor output", 3),
    ("mediocre output", 5),
]

for output, expected in test_cases:
    result = judge.evaluate(output)
    assert abs(result.score - expected) < 2, "Judge not calibrated"

5. Multiple Judges

Use ensemble of judges:

import statistics

judges = [judge1, judge2, judge3]
evaluations = [j.evaluate(output) for j in judges]
final_score = statistics.mean([e.score for e in evaluations])

Performance Considerations

Latency

# Single evaluation: 1 LLM call
evaluation = judge.evaluate(output)

# Multiple evaluations: N LLM calls
evaluations = [judge.evaluate(o) for o in outputs]

# Parallel evaluation
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor() as executor:
    evaluations = list(executor.map(judge.evaluate, outputs))

Cost

# Cost = num_evaluations × judge_cost
# Judge prompts are typically long (detailed criteria)

# Optimize:
# 1. Cache evaluations
# 2. Use cheaper model for simple judgments
# 3. Batch similar evaluations
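
A minimal caching sketch for the first point, assuming outputs are plain strings (so they can serve directly as cache keys) and reusing the judge and outputs from the snippets above:

from functools import lru_cache

@lru_cache(maxsize=1024)
def cached_evaluate(output: str):
    # Identical text is only sent to the judge once; later calls hit the cache.
    return judge.evaluate(output)

scores = [cached_evaluate(o).score for o in outputs]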

Consistency

# Check judge consistency
import statistics

same_output_scores = [judge.evaluate(output) for _ in range(5)]
score_spread = statistics.stdev([s.score for s in same_output_scores])

if score_spread > 1.0:
    print("Warning: Judge is inconsistent")
    # Consider: lower temperature, clearer criteria, better prompt

Error Handling

Handle evaluation failures:

try:
    evaluation = judge.evaluate(output)
except ScoringError as e:
    # Couldn't extract score
    evaluation = retry_with_clearer_format()
except EvaluationError as e:
    # Judge refused to evaluate
    evaluation = use_fallback_judge()

Debugging

Inspect Evaluations

evaluation = judge.evaluate(output)

print(f"Score: {evaluation.score}")
print(f"Reasoning: {evaluation.reasoning}")
print(f"Criteria breakdown:")
for criterion, score in evaluation.breakdown.items():
    print(f"  {criterion}: {score}")

Compare Judges

import statistics

judges = [judge1, judge2, judge3]

for judge in judges:
    evaluation = judge.evaluate(output)
    print(f"{judge.agent_name}: {evaluation.score}")

# Check inter-judge agreement
scores = [judge.evaluate(output).score for judge in judges]
agreement = 1 - (statistics.stdev(scores) / statistics.mean(scores))
print(f"Inter-judge agreement: {agreement:.1%}")

Integration Patterns

With Self-Consistency

# Generate multiple candidates
candidates = [agent.run(task) for _ in range(5)]

# Judge each one
evaluations = [judge.evaluate(c) for c in candidates]

# Select best
best = max(zip(candidates, evaluations), key=lambda x: x[1].score)

With Reflexion

# Use judge for evaluation in reflexion loop
reflexion = ReflexionAgent(
    agent=agent,
    evaluator=lambda output: judge.evaluate(output).score,
    max_iterations=5,
)

With Reasoning Duo

# Judge acts as critic
duo = ReasoningDuoAgent(
    proposer=proposer,
    critic=judge,  # Judge provides critique
    acceptance_criteria=lambda c: c.score >= 8,
)

Limitations

Subjectivity

Judges can be biased:

# Mitigation: Multiple judges, clear criteria, calibration

Context Limitations

May lack context to evaluate:

# Mitigation: Provide full context in evaluation prompt
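
A sketch of that mitigation, bundling the task and source material into the text handed to the judge (task_description and source_material are illustrative variables, not library fields):

evaluation_input = f"""TASK: {task_description}

CONTEXT:
{source_material}

OUTPUT TO EVALUATE:
{output}"""

evaluation = judge.evaluate(evaluation_input)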

Scoring Drift

Scores may drift over time:

# Mitigation: Regular calibration with reference examples
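
A sketch of such a recalibration check, reusing the idea from the Calibration section above (the reference examples and tolerance are illustrative):

reference_examples = [("known excellent output", 9), ("known poor output", 3)]

def check_for_drift(judge, tolerance: float = 2.0) -> bool:
    """Re-score reference examples and flag drift beyond the tolerance."""
    drifted = False
    for text, expected in reference_examples:
        score = judge.evaluate(text).score
        if abs(score - expected) > tolerance:
            print(f"Drift detected: expected ~{expected}, got {score}")
            drifted = True
    return drifted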