Reward calculators compute feedback signals from agent execution results, providing the learning signal for RL optimization. Azcore includes multiple built-in calculators and supports custom implementations.
🎯 RewardCalculator Interface
All reward calculators implement the RewardCalculator abstract class:
from azcore.rl.rewards import RewardCalculator
class RewardCalculator(ABC):
    @abstractmethod
    def calculate(
        self,
        state: Dict[str, Any],
        result: Any,
        user_query: str,
        **kwargs
    ) -> float:
        """
        Calculate reward from execution result.

        Returns:
            Reward value (typically -1.0 to +1.0)
        """
        pass
📊 Built-in Calculators
1. HeuristicRewardCalculator
Rule-based reward calculation using heuristics.
from azcore.rl.rewards import HeuristicRewardCalculator
calculator = HeuristicRewardCalculator(
success_reward=1.0,
failure_reward=-0.5,
empty_penalty=-0.3,
error_patterns=["Error:", "Failed", "Exception"],
min_content_length=10
)
reward = calculator.calculate(state, result, query)
How it works:
- Output is empty or insufficient → empty_penalty
- Error patterns found in the output → failure_reward
- Tool execution errors → failure_reward
- Otherwise → success_reward
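The check order above can be sketched roughly as follows (an illustrative sketch only, not the library's actual code; the content string and tool_error flag are assumed to be extracted from the execution result beforehand):

def heuristic_reward(content: str, tool_error: bool) -> float:
    # Defaults mirror the constructor example above.
    success_reward, failure_reward, empty_penalty = 1.0, -0.5, -0.3
    error_patterns = ["Error:", "Failed", "Exception"]
    min_content_length = 10

    if not content or len(content) < min_content_length:
        return empty_penalty          # empty or insufficient output
    if any(pattern in content for pattern in error_patterns):
        return failure_reward         # error pattern found in output
    if tool_error:
        return failure_reward         # tool execution error
    return success_reward             # no problems detected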
Use cases:
- Quick setup without external dependencies
- Clear success/failure criteria
- Fast execution (no API calls)
2. LLMRewardCalculator
LLM-based quality evaluation.
from azcore.rl.rewards import LLMRewardCalculator
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
calculator = LLMRewardCalculator(
llm=llm,
score_min=0,
score_max=100,
reward_min=-1.0,
reward_max=1.0
)
reward = calculator.calculate(state, result, query)
How it works:
- Extracts assistant response
- Prompts LLM to score 0-100 based on quality
- Normalizes score to reward range (-1.0 to +1.0)
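The score-to-reward step is presumably a linear rescaling from [score_min, score_max] onto [reward_min, reward_max]; a minimal sketch under that assumption:

def normalize_score(score: float, score_min: float = 0, score_max: float = 100,
                    reward_min: float = -1.0, reward_max: float = 1.0) -> float:
    # Linear rescaling; e.g. an LLM score of 85 maps to a reward of 0.7.
    fraction = (score - score_min) / (score_max - score_min)
    return reward_min + fraction * (reward_max - reward_min)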
Custom evaluation prompt:
custom_prompt = """
Rate the response quality (0-100):
Query: {query}
Response: {response}
Consider:
- Accuracy
- Completeness
- Clarity
Score:"""
calculator = LLMRewardCalculator(
llm=llm,
evaluation_prompt_template=custom_prompt
)
Use cases:
- Nuanced quality assessment
- Complex evaluation criteria
- When heuristics are insufficient
Trade-offs:
- More accurate than heuristics
- Slower (requires LLM call)
- Higher cost
3. UserFeedbackRewardCalculator
Derives rewards from explicit user feedback (thumbs up/down, ratings).
from azcore.rl.rewards import UserFeedbackRewardCalculator
calculator = UserFeedbackRewardCalculator(
positive_reward=1.0,
negative_reward=-1.0,
neutral_reward=0.0
)
# Boolean feedback
reward = calculator.calculate(
state, result, query,
user_feedback=True # Thumbs up
)
# String feedback
reward = calculator.calculate(
state, result, query,
user_feedback="positive"
)
# Numeric rating (1-5)
calculator = UserFeedbackRewardCalculator(
use_rating_scale=True,
rating_min=1,
rating_max=5
)
reward = calculator.calculate(
state, result, query,
user_feedback=4 # 4/5 stars
)
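The rating-to-reward mapping is presumably linear as well; for the defaults above, a sketch (not the library's exact code) looks like this:

def rating_to_reward(rating: float, rating_min: float = 1, rating_max: float = 5) -> float:
    # Map a 1-5 rating onto [-1.0, +1.0]; e.g. 4/5 stars -> 0.5.
    fraction = (rating - rating_min) / (rating_max - rating_min)
    return -1.0 + 2.0 * fraction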
Use cases:
- Human-in-the-loop RL
- Direct user preferences
- A/B testing
- Production feedback loops
4. CompositeRewardCalculator
Combines multiple calculators with weights.
from azcore.rl.rewards import CompositeRewardCalculator
calculator = CompositeRewardCalculator([
(HeuristicRewardCalculator(), 0.3), # 30% weight
(LLMRewardCalculator(llm), 0.5), # 50% weight
(UserFeedbackRewardCalculator(), 0.2) # 20% weight
])
reward = calculator.calculate(
state, result, query,
user_feedback="positive" # Optional feedback
)
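Conceptually, the composite reward is a weight-normalized sum of the individual rewards. A sketch of that combination (the library may handle calculators that need user feedback differently when none is provided):

def combine_rewards(rewards_and_weights):
    # rewards_and_weights: list of (reward, weight) pairs.
    total_weight = sum(weight for _, weight in rewards_and_weights)
    return sum(reward * weight for reward, weight in rewards_and_weights) / total_weight

# Example: heuristic 1.0, LLM 0.6, user feedback 1.0 with the weights above -> 0.8
combine_rewards([(1.0, 0.3), (0.6, 0.5), (1.0, 0.2)])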
Use cases:
- Balanced evaluation
- Multiple signal sources
- Gradual transition from heuristics to LLM
- Combining automated and human feedback
🔧 Custom Reward Calculators
Basic Custom Calculator
from azcore.rl.rewards import RewardCalculator
from typing import Dict, Any
class CustomRewardCalculator(RewardCalculator):
    def __init__(self, success_keywords, failure_keywords):
        self.success_keywords = success_keywords
        self.failure_keywords = failure_keywords

    def calculate(
        self,
        state: Dict[str, Any],
        result: Any,
        user_query: str,
        **kwargs
    ) -> float:
        # Extract content
        content = self._extract_content(result)

        # Check keywords
        content_lower = content.lower()
        for keyword in self.success_keywords:
            if keyword in content_lower:
                return 1.0
        for keyword in self.failure_keywords:
            if keyword in content_lower:
                return -1.0

        return 0.0  # Neutral

    def _extract_content(self, result):
        if isinstance(result, str):
            return result
        if isinstance(result, dict):
            messages = result.get("messages", [])
            if messages:
                return messages[-1].content
        return str(result)
# Use custom calculator
calculator = CustomRewardCalculator(
success_keywords=["completed", "success", "done"],
failure_keywords=["error", "failed", "timeout"]
)
Task-Specific Calculator
class CodeExecutionRewardCalculator(RewardCalculator):
    """Reward calculator for code execution tasks."""

    def calculate(self, state, result, user_query, **kwargs):
        # Extract output
        output = self._extract_output(result)

        # Check for syntax errors
        if "SyntaxError" in output or "IndentationError" in output:
            return -1.0

        # Check for runtime errors
        if "Error" in output or "Exception" in output:
            return -0.5

        # Check for successful execution
        if "exit code: 0" in output.lower():
            return 1.0

        # Check for expected output
        expected = kwargs.get("expected_output")
        if expected and expected in output:
            return 1.0

        return 0.0  # Uncertain

    def _extract_output(self, result):
        # Same idea as _extract_content above: pull the final text out of the result.
        if isinstance(result, str):
            return result
        if isinstance(result, dict):
            messages = result.get("messages", [])
            if messages:
                return messages[-1].content
        return str(result)
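For instance, the expected output can be supplied through **kwargs at call time (state, result, and the expected value here are illustrative placeholders):

calculator = CodeExecutionRewardCalculator()
reward = calculator.calculate(
    state,
    result,
    user_query="Run the script and print 6 * 7",
    expected_output="42"  # hypothetical expected result
)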
Multi-Criteria Calculator
class MultiCriteriaRewardCalculator(RewardCalculator):
    """Evaluates multiple quality criteria."""

    def calculate(self, state, result, user_query, **kwargs):
        # Assumes an _extract_content helper like the one defined earlier.
        content = self._extract_content(result)
        scores = []

        # Criterion 1: Length (completeness)
        length_score = self._score_length(content)
        scores.append(length_score)

        # Criterion 2: Relevance
        relevance_score = self._score_relevance(content, user_query)
        scores.append(relevance_score)

        # Criterion 3: Correctness
        correctness_score = self._score_correctness(content)
        scores.append(correctness_score)

        # Average scores
        return sum(scores) / len(scores)

    def _score_length(self, content):
        # Reward appropriate length
        length = len(content)
        if length < 50:
            return -0.5  # Too short
        elif length > 500:
            return 0.5   # Maybe too long
        else:
            return 1.0   # Good length

    def _score_relevance(self, content, query):
        # Simple keyword matching
        query_words = set(query.lower().split())
        content_words = set(content.lower().split())
        overlap = len(query_words & content_words)
        return min(overlap / max(len(query_words), 1), 1.0)

    def _score_correctness(self, content):
        # Check for error indicators
        errors = ["error", "failed", "incorrect"]
        if any(e in content.lower() for e in errors):
            return -0.5
        return 1.0
🎯 Reward Design Best Practices
1. Appropriate Scale
# ✅ GOOD: Clear scale (-1 to +1)
HeuristicRewardCalculator(
success_reward=1.0,
failure_reward=-0.5,
empty_penalty=-0.3
)
# ❌ BAD: Inconsistent scale
HeuristicRewardCalculator(
success_reward=100,
failure_reward=-1,
empty_penalty=-0.001
)
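When writing custom calculators, one simple safeguard is to clamp the final value to the chosen range before returning it (a generic helper, not part of the Azcore API):

def clamp_reward(reward: float, low: float = -1.0, high: float = 1.0) -> float:
    # Keep rewards on a consistent, bounded scale.
    return max(low, min(high, reward))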
2. Balanced Rewards
# ✅ GOOD: Balanced positive/negative
success_reward=1.0
failure_reward=-0.5 # Less harsh
# ❌ BAD: Heavily skewed
success_reward=0.1
failure_reward=-10.0 # Too harsh!
3. Intermediate Rewards
# ✅ GOOD: Gradual rewards
class GradualRewardCalculator(RewardCalculator):
    def calculate(self, state, result, query, **kwargs):
        # Assumes an _extract_content helper like the one defined earlier.
        content = self._extract_content(result)
        if "perfect" in content.lower():
            return 1.0   # Perfect
        elif "good" in content.lower():
            return 0.7   # Good
        elif "okay" in content.lower():
            return 0.3   # Okay
        elif "error" in content.lower():
            return -0.5  # Error
        else:
            return 0.0   # Neutral
4. Reward Shaping
class ShapedRewardCalculator(RewardCalculator):
    """Provides intermediate rewards for partial progress."""

    def calculate(self, state, result, query, **kwargs):
        reward = 0.0

        # Base reward for any output
        if result:
            reward += 0.2

        # Bonus for using the correct tool (task-specific helper to implement)
        if self._used_correct_tool(state):
            reward += 0.3

        # Bonus for correct output (task-specific helper to implement)
        if self._output_correct(result):
            reward += 0.5

        return reward
📊 Combining Calculators
Sequential Evaluation
class SequentialRewardCalculator(RewardCalculator):
    """Try multiple calculators in order."""

    def __init__(self, calculators):
        self.calculators = calculators

    def calculate(self, state, result, query, **kwargs):
        for calculator in self.calculators:
            reward = calculator.calculate(state, result, query, **kwargs)
            if reward != 0.0:  # First non-neutral reward
                return reward
        return 0.0  # All neutral
Majority Vote
from collections import Counter

class VotingRewardCalculator(RewardCalculator):
    """Use majority vote from multiple calculators."""

    def __init__(self, calculators):
        self.calculators = calculators

    def calculate(self, state, result, query, **kwargs):
        votes = []
        for calc in self.calculators:
            reward = calc.calculate(state, result, query, **kwargs)
            # Quantize to positive/neutral/negative
            if reward > 0.3:
                votes.append(1)
            elif reward < -0.3:
                votes.append(-1)
            else:
                votes.append(0)

        # Return majority vote
        vote_counts = Counter(votes)
        majority = vote_counts.most_common(1)[0][0]
        return float(majority)
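Either wrapper is used like any other calculator; for example, reusing calculators defined earlier (state, result, and query are placeholders as before):

voting = VotingRewardCalculator([
    HeuristicRewardCalculator(),
    LLMRewardCalculator(llm=llm),
    CustomRewardCalculator(
        success_keywords=["completed", "success", "done"],
        failure_keywords=["error", "failed", "timeout"]
    )
])
reward = voting.calculate(state, result, query)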
🚀 Complete Example
from azcore.rl.rewards import (
HeuristicRewardCalculator,
LLMRewardCalculator,
UserFeedbackRewardCalculator,
CompositeRewardCalculator
)
from azcore.rl.rl_manager import RLManager
from langchain_openai import ChatOpenAI
# Create individual calculators
heuristic = HeuristicRewardCalculator(
success_reward=1.0,
failure_reward=-0.5
)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
llm_calc = LLMRewardCalculator(llm=llm)
user_feedback = UserFeedbackRewardCalculator()
# Combine with weights
reward_calc = CompositeRewardCalculator([
(heuristic, 0.4), # Fast, always available
(llm_calc, 0.4), # Accurate evaluation
(user_feedback, 0.2) # Human feedback when available
])
# Use with RL Manager
rl_manager = RLManager(
tool_names=tools,
q_table_path="rl_data/agent.pkl"
)
# Training loop
state = {"messages": [...]}
result = agent.invoke(state)
# Calculate reward
reward = reward_calc.calculate(
state=state,
result=result,
user_query="original query",
user_feedback=None # Optional
)
# Update RL
rl_manager.update(state_key, tool_name, reward)
🎓 Summary
Reward calculators provide:
- HeuristicRewardCalculator: Fast, rule-based feedback
- LLMRewardCalculator: AI-powered quality assessment
- UserFeedbackRewardCalculator: Human-in-the-loop learning
- CompositeRewardCalculator: Combine multiple signals
- Custom Calculators: Task-specific feedback mechanisms
Choose the right calculator(s) based on your accuracy requirements, latency constraints, and available feedback sources.