
Reinforcement Learning

Reward Calculators

Feedback mechanisms for RL in Azcore.

Reward calculators compute feedback signals from agent execution results, providing the learning signal for RL optimization. Azcore includes multiple built-in calculators and supports custom implementations.

🎯 RewardCalculator Interface

All reward calculators implement the RewardCalculator abstract class:

from abc import ABC, abstractmethod
from typing import Any, Dict

from azcore.rl.rewards import RewardCalculator

# The abstract interface (simplified):
class RewardCalculator(ABC):
    @abstractmethod
    def calculate(
        self,
        state: Dict[str, Any],
        result: Any,
        user_query: str,
        **kwargs
    ) -> float:
        """
        Calculate reward from execution result.

        Returns:
            Reward value (typically -1.0 to +1.0)
        """
        pass

📊 Built-in Calculators

1. HeuristicRewardCalculator

Rule-based reward calculation using heuristics.

from azcore.rl.rewards import HeuristicRewardCalculator

calculator = HeuristicRewardCalculator(
    success_reward=1.0,
    failure_reward=-0.5,
    empty_penalty=-0.3,
    error_patterns=["Error:", "Failed", "Exception"],
    min_content_length=10
)

reward = calculator.calculate(state, result, query)

How it works (a minimal sketch follows this list):

  1. Checks if output is empty/insufficient → empty_penalty
  2. Checks for error patterns in output → failure_reward
  3. Checks for tool execution errors → failure_reward
  4. Otherwise → success_reward
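
The decision order above can be sketched in a few lines. This is an illustration of the logic only, not Azcore's actual implementation; output (the final assistant text) and tool_errors (a flag derived from the execution result) are assumed inputs:

def heuristic_reward(output: str, tool_errors: bool) -> float:
    success_reward, failure_reward, empty_penalty = 1.0, -0.5, -0.3
    error_patterns = ["Error:", "Failed", "Exception"]
    min_content_length = 10

    if not output or len(output.strip()) < min_content_length:
        return empty_penalty      # 1. empty or insufficient output
    if any(p in output for p in error_patterns):
        return failure_reward     # 2. error pattern in output
    if tool_errors:
        return failure_reward     # 3. tool execution error
    return success_reward         # 4. otherwise, success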

Use cases:

  • Quick setup without external dependencies
  • Clear success/failure criteria
  • Fast execution (no API calls)

2. LLMRewardCalculator

LLM-based quality evaluation.

from azcore.rl.rewards import LLMRewardCalculator
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

calculator = LLMRewardCalculator(
    llm=llm,
    score_min=0,
    score_max=100,
    reward_min=-1.0,
    reward_max=1.0
)

reward = calculator.calculate(state, result, query)

How it works (score normalization sketched below):

  1. Extracts assistant response
  2. Prompts LLM to score 0-100 based on quality
  3. Normalizes score to reward range (-1.0 to +1.0)
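
The normalization in step 3 is a linear rescale from the score range to the reward range. A minimal sketch, assuming the default 0-100 score range and -1.0 to +1.0 reward range (the library's exact code may differ):

def normalize_score(
    score: float,
    score_min: float = 0.0,
    score_max: float = 100.0,
    reward_min: float = -1.0,
    reward_max: float = 1.0,
) -> float:
    """Linearly rescale an LLM quality score into the reward range."""
    fraction = (score - score_min) / (score_max - score_min)
    return reward_min + fraction * (reward_max - reward_min)

normalize_score(50)   # -> 0.0
normalize_score(85)   # -> approximately 0.7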

Custom evaluation prompt:

custom_prompt = """
Rate the response quality (0-100):

Query: {query}
Response: {response}

Consider:
- Accuracy
- Completeness
- Clarity

Score:"""

calculator = LLMRewardCalculator(
    llm=llm,
    evaluation_prompt_template=custom_prompt
)

Use cases:

  • Nuanced quality assessment
  • Complex evaluation criteria
  • When heuristics are insufficient

Trade-offs:

  • More accurate than heuristics
  • Slower (requires LLM call)
  • Higher cost

3. UserFeedbackRewardCalculator

Explicit user feedback (thumbs up/down, ratings).

from azcore.rl.rewards import UserFeedbackRewardCalculator

calculator = UserFeedbackRewardCalculator(
    positive_reward=1.0,
    negative_reward=-1.0,
    neutral_reward=0.0
)

# Boolean feedback
reward = calculator.calculate(
    state, result, query,
    user_feedback=True  # Thumbs up
)

# String feedback
reward = calculator.calculate(
    state, result, query,
    user_feedback="positive"
)

# Numeric rating (1-5)
calculator = UserFeedbackRewardCalculator(
    use_rating_scale=True,
    rating_min=1,
    rating_max=5
)
reward = calculator.calculate(
    state, result, query,
    user_feedback=4  # 4/5 stars
)
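
With use_rating_scale=True, the rating is presumably rescaled linearly into the reward range, analogous to the LLM score normalization above. A plausible mapping, shown here as an assumption rather than the library's verified behavior:

def rating_to_reward(rating: float, rating_min: int = 1, rating_max: int = 5) -> float:
    """Map a star rating linearly onto the -1.0 to +1.0 reward range."""
    fraction = (rating - rating_min) / (rating_max - rating_min)
    return -1.0 + 2.0 * fraction

rating_to_reward(4)   # -> 0.5
rating_to_reward(1)   # -> -1.0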

Use cases:

  • Human-in-the-loop RL
  • Direct user preferences
  • A/B testing
  • Production feedback loops

4. CompositeRewardCalculator

Combines multiple calculators with weights.

from azcore.rl.rewards import CompositeRewardCalculator

calculator = CompositeRewardCalculator([
    (HeuristicRewardCalculator(), 0.3),       # 30% weight
    (LLMRewardCalculator(llm), 0.5),          # 50% weight
    (UserFeedbackRewardCalculator(), 0.2)     # 20% weight
])

reward = calculator.calculate(
    state, result, query,
    user_feedback="positive"  # Optional feedback
)
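
Conceptually, the composite reward is a weighted combination of the individual rewards. A minimal sketch of that combination (an illustration only; whether Azcore normalizes by the total weight is an assumption here):

def combine_rewards(rewards_and_weights: list[tuple[float, float]]) -> float:
    """Weighted average of (reward, weight) pairs."""
    total_weight = sum(w for _, w in rewards_and_weights)
    return sum(r * w for r, w in rewards_and_weights) / total_weight

combine_rewards([(1.0, 0.3), (0.6, 0.5), (1.0, 0.2)])   # -> approximately 0.8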

Use cases:

  • Balanced evaluation
  • Multiple signal sources
  • Gradual transition from heuristics to LLM
  • Combining automated and human feedback

🔧 Custom Reward Calculators

Basic Custom Calculator

from azcore.rl.rewards import RewardCalculator
from typing import Dict, Any

class CustomRewardCalculator(RewardCalculator):
    def __init__(self, success_keywords, failure_keywords):
        self.success_keywords = success_keywords
        self.failure_keywords = failure_keywords

    def calculate(
        self,
        state: Dict[str, Any],
        result: Any,
        user_query: str,
        **kwargs
    ) -> float:
        # Extract content
        content = self._extract_content(result)

        # Check keywords
        content_lower = content.lower()

        for keyword in self.success_keywords:
            if keyword in content_lower:
                return 1.0

        for keyword in self.failure_keywords:
            if keyword in content_lower:
                return -1.0

        return 0.0  # Neutral

    def _extract_content(self, result):
        if isinstance(result, str):
            return result
        if isinstance(result, dict):
            messages = result.get("messages", [])
            if messages:
                return messages[-1].content
        return str(result)

# Use custom calculator
calculator = CustomRewardCalculator(
    success_keywords=["completed", "success", "done"],
    failure_keywords=["error", "failed", "timeout"]
)

Task-Specific Calculator

class CodeExecutionRewardCalculator(RewardCalculator):
    """Reward calculator for code execution tasks."""

    def calculate(self, state, result, user_query, **kwargs):
        # Extract output text (a helper analogous to _extract_content
        # in the basic custom calculator above)
        output = self._extract_output(result)

        # Check for syntax errors
        if "SyntaxError" in output or "IndentationError" in output:
            return -1.0

        # Check for runtime errors
        if "Error" in output or "Exception" in output:
            return -0.5

        # Check for successful execution
        if "exit code: 0" in output.lower():
            return 1.0

        # Check for expected output
        expected = kwargs.get("expected_output")
        if expected and expected in output:
            return 1.0

        return 0.0  # Uncertain

Multi-Criteria Calculator

class MultiCriteriaRewardCalculator(RewardCalculator):
    """Evaluates multiple quality criteria."""

    def calculate(self, state, result, user_query, **kwargs):
        content = self._extract_content(result)

        scores = []

        # Criterion 1: Length (completeness)
        length_score = self._score_length(content)
        scores.append(length_score)

        # Criterion 2: Relevance
        relevance_score = self._score_relevance(content, user_query)
        scores.append(relevance_score)

        # Criterion 3: Correctness
        correctness_score = self._score_correctness(content)
        scores.append(correctness_score)

        # Average scores
        return sum(scores) / len(scores)

    def _score_length(self, content):
        # Reward appropriate length
        length = len(content)
        if length < 50:
            return -0.5  # Too short
        elif length > 500:
            return 0.5  # Maybe too long
        else:
            return 1.0  # Good length

    def _score_relevance(self, content, query):
        # Simple keyword matching
        query_words = set(query.lower().split())
        content_words = set(content.lower().split())
        overlap = len(query_words & content_words)
        return min(overlap / max(len(query_words), 1), 1.0)

    def _score_correctness(self, content):
        # Check for error indicators
        errors = ["error", "failed", "incorrect"]
        if any(e in content.lower() for e in errors):
            return -0.5
        return 1.0

🎯 Reward Design Best Practices

1. Appropriate Scale

# ✅ GOOD: Clear scale (-1 to +1)
HeuristicRewardCalculator(
    success_reward=1.0,
    failure_reward=-0.5,
    empty_penalty=-0.3
)

# ❌ BAD: Inconsistent scale
HeuristicRewardCalculator(
    success_reward=100,
    failure_reward=-1,
    empty_penalty=-0.001
)

2. Balanced Rewards

# ✅ GOOD: Balanced positive/negative
success_reward=1.0
failure_reward=-0.5  # Less harsh

# ❌ BAD: Heavily skewed
success_reward=0.1
failure_reward=-10.0  # Too harsh!

3. Intermediate Rewards

# ✅ GOOD: Gradual rewards
class GradualRewardCalculator(RewardCalculator):
    def calculate(self, state, result, query, **kwargs):
        content = self._extract_content(result)

        if "perfect" in content.lower():
            return 1.0  # Perfect
        elif "good" in content.lower():
            return 0.7  # Good
        elif "okay" in content.lower():
            return 0.3  # Okay
        elif "error" in content.lower():
            return -0.5  # Error
        else:
            return 0.0  # Neutral

4. Reward Shaping

class ShapedRewardCalculator(RewardCalculator):
    """Provides intermediate rewards for partial progress."""

    def calculate(self, state, result, query, **kwargs):
        reward = 0.0

        # Base reward for any output
        if result:
            reward += 0.2

        # Bonus for using correct tool
        if self._used_correct_tool(state):
            reward += 0.3

        # Bonus for correct output
        if self._output_correct(result):
            reward += 0.5

        return reward

📊 Combining Calculators

Sequential Evaluation

class SequentialRewardCalculator(RewardCalculator):
    """Try multiple calculators in order."""

    def __init__(self, calculators):
        self.calculators = calculators

    def calculate(self, state, result, query, **kwargs):
        for calculator in self.calculators:
            reward = calculator.calculate(state, result, query, **kwargs)
            if reward != 0.0:  # First non-neutral reward
                return reward
        return 0.0  # All neutral

Majority Vote

class VotingRewardCalculator(RewardCalculator):
    """Use majority vote from multiple calculators."""

    def __init__(self, calculators):
        self.calculators = calculators

    def calculate(self, state, result, query, **kwargs):
        votes = []
        for calc in self.calculators:
            reward = calc.calculate(state, result, query, **kwargs)
            # Quantize to positive/neutral/negative
            if reward > 0.3:
                votes.append(1)
            elif reward < -0.3:
                votes.append(-1)
            else:
                votes.append(0)

        # Return majority vote
        from collections import Counter
        vote_counts = Counter(votes)
        majority = vote_counts.most_common(1)[0][0]

        return float(majority)

🚀 Complete Example

from azcore.rl.rewards import (
    HeuristicRewardCalculator,
    LLMRewardCalculator,
    UserFeedbackRewardCalculator,
    CompositeRewardCalculator
)
from azcore.rl.rl_manager import RLManager
from langchain_openai import ChatOpenAI

# Create individual calculators
heuristic = HeuristicRewardCalculator(
    success_reward=1.0,
    failure_reward=-0.5
)

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
llm_calc = LLMRewardCalculator(llm=llm)

user_feedback = UserFeedbackRewardCalculator()

# Combine with weights
reward_calc = CompositeRewardCalculator([
    (heuristic, 0.4),      # Fast, always available
    (llm_calc, 0.4),       # Accurate evaluation
    (user_feedback, 0.2)   # Human feedback when available
])

# Use with RL Manager ("tools" is assumed to be the agent's list of tool names)
rl_manager = RLManager(
    tool_names=tools,
    q_table_path="rl_data/agent.pkl"
)

# Training loop ("agent" is the agent being trained)
state = {"messages": [...]}
result = agent.invoke(state)

# Calculate reward
reward = reward_calc.calculate(
    state=state,
    result=result,
    user_query="original query",
    user_feedback=None  # Optional
)

# Update RL (state_key and tool_name identify the state and the tool chosen for this step)
rl_manager.update(state_key, tool_name, reward)

🎓 Summary

Reward calculators provide:

  • HeuristicRewardCalculator: Fast, rule-based feedback
  • LLMRewardCalculator: AI-powered quality assessment
  • UserFeedbackRewardCalculator: Human-in-the-loop learning
  • CompositeRewardCalculator: Combine multiple signals
  • Custom Calculators: Task-specific feedback mechanisms

Choose the right calculator(s) based on your accuracy requirements, latency constraints, and available feedback sources.
