Reinforcement Learning Overview

Introduction to RL capabilities in Azcore.

Azcore includes a sophisticated Reinforcement Learning (RL) system that enables agents and teams to learn and improve their tool selection over time. Through Q-learning and adaptive exploration strategies, your AI systems become more efficient and effective with experience.

🎯 What is RL in Azcore?

Reinforcement Learning in Azcore allows agents to:

  • Learn from experience which tools work best for different queries
  • Optimize tool selection automatically over time
  • Adapt to changing environments and user preferences
  • Reduce costs by selecting the right tools efficiently
  • Improve performance through continual learning

Key Concepts

  • Q-Learning: Value-based RL algorithm that learns action quality
  • State: Query or context that determines tool selection
  • Action: Selection of specific tools to use
  • Reward: Feedback signal indicating success/failure
  • Policy: Strategy for selecting tools (exploration vs exploitation)
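
The heart of Q-learning is a simple update rule: after each action, the stored value for a (state, action) pair is nudged toward the observed reward. A minimal sketch, assuming a tabular store and a one-step update with learning rate alpha (Azcore's internals may differ):

# Minimal tabular Q-update sketch (illustrative, not Azcore's implementation)
q_table = {}  # maps (state_key, tool_name) -> learned value estimate

def q_update(state_key, tool_name, reward, alpha=0.1):
    """Move the stored Q-value a fraction alpha toward the observed reward."""
    old_q = q_table.get((state_key, tool_name), 0.0)
    q_table[(state_key, tool_name)] = old_q + alpha * (reward - old_q)

q_update("weather query", "weather_tool", reward=1.0)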

šŸ—ļø Architecture

User Query
    ↓
RLManager.select_tools()
    ├─ State Representation (Embeddings)
    ├─ Q-Table Lookup
    └─ Exploration Strategy
    ↓
Selected Tools
    ↓
Agent Execution
    ↓
Result + Reward Calculation
    ↓
RLManager.update()
    └─ Q-Value Update

Components

  1. RLManager: Core Q-learning engine for tool selection
  2. Reward Calculators: Compute feedback signals from results
  3. State Representation: Semantic embeddings for generalization
  4. Exploration Strategies: Balance exploration vs exploitation
  5. Q-Table Persistence: Save and load learned knowledge
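
These components come together in a select-execute-reward-update loop. A minimal sketch of that loop; select_tools and update_batch follow the names shown on this page, while run_agent and calculate() stand in for your agent execution and reward logic:

# Rough wiring sketch (run_agent and calculate() are assumed names)
query = "What's the weather in Paris?"

selected_tools = rl_manager.select_tools(query)   # 1. state -> tool selection
result = run_agent(query, tools=selected_tools)   # 2. agent executes with chosen tools
reward = reward_calculator.calculate(result)      # 3. result -> feedback signal
rl_manager.update_batch(                          # 4. Q-values updated for chosen tools
    state_key=query,
    actions=selected_tools,
    reward=reward,
)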

💡 Why Use RL?

Without RL

# Agent always uses all tools
agent = factory.create_react_agent(
    name="agent",
    tools=[tool1, tool2, tool3, tool4, tool5]  # Always uses all!
)

# Problems:
# - Unnecessary API calls
# - Higher costs
# - Slower response times
# - No optimization over time

With RL

# Agent learns optimal tool selection
rl_agent = factory.create_react_agent(
    name="rl_agent",
    tools=[tool1, tool2, tool3, tool4, tool5],
    rl_enabled=True,
    rl_manager=rl_manager,
    reward_calculator=reward_calc
)

# Benefits:
# - Learns which tools work for which queries
# - Reduces unnecessary tool calls
# - Lower costs through optimization
# - Faster responses
# - Improves automatically over time

📊 How It Works

1. Initial State (Random Exploration)

# First few queries - exploring randomly
Query: "What's the weather?"
→ Selects: [weather_tool, search_tool]  # Random exploration
Result: Success!
Reward: +1.0

2. Learning Phase (Building Knowledge)

# After 10-20 queries - learning patterns
Query: "Weather in NYC"
→ Q-Table: {weather_tool: 0.8, search_tool: 0.2}
→ Selects: [weather_tool]  # Starting to learn!
Result: Success!
Reward: +1.0
→ Q-Table updated: {weather_tool: 0.85, search_tool: 0.2}
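
The jump from 0.8 to 0.85 is exactly what the update rule produces if the learning rate happens to be 0.25 (the actual value is configurable):

old_q, reward, alpha = 0.8, 1.0, 0.25  # alpha = 0.25 assumed for illustration
new_q = old_q + alpha * (reward - old_q)
print(new_q)  # 0.85; search_tool was not selected, so its value stays at 0.2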

3. Exploitation Phase (Using Knowledge)

# After 100+ queries - optimized selection
Query: "Temperature in London"
→ Q-Table: {weather_tool: 0.95, search_tool: 0.15}
→ Selects: [weather_tool]  # Confident choice!
Result: Success!

🎮 Exploration vs Exploitation

Epsilon-Greedy Strategy

rl_manager = RLManager(
    tool_names=["tool1", "tool2", "tool3"],
    exploration_rate=0.2  # 20% exploration, 80% exploitation
)

# 80% of the time: Use best tools (exploit)
# 20% of the time: Try random tools (explore)
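
Under the hood, epsilon-greedy selection is only a few lines; here is an illustrative sketch (not Azcore's internal code):

import random

def epsilon_greedy(q_values, exploration_rate=0.2):
    """Pick a random tool with probability epsilon, else the best-known tool."""
    if random.random() < exploration_rate:
        return random.choice(list(q_values))   # explore
    return max(q_values, key=q_values.get)     # exploit

epsilon_greedy({"weather_tool": 0.95, "search_tool": 0.15})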

Exploration Strategies

Epsilon-Greedy (Default)

  • Simple, effective
  • Fixed exploration rate
  • Good for most use cases

Epsilon-Decay

  • Starts high, gradually decreases
  • More exploration early, less later
  • Good for continuous learning

UCB (Upper Confidence Bound)

  • Balances exploration intelligently
  • Favors under-explored tools
  • Good for systematic exploration

Thompson Sampling

  • Probabilistic approach
  • Samples from a Bayesian posterior over tool rewards
  • Good for multi-armed-bandit settings
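
To make the trade-off concrete, here is a sketch of the standard UCB1 score (illustrative, not Azcore's implementation): a tool's Q-value is augmented with a bonus that grows the less the tool has been tried.

import math

def ucb_score(q_value, pulls, total_pulls, c=1.0):
    """UCB1 score: q_value + c * sqrt(ln(total_pulls) / pulls)."""
    if pulls == 0:
        return float("inf")  # untried tools are always worth one attempt
    return q_value + c * math.sqrt(math.log(total_pulls) / pulls)

ucb_score(q_value=0.3, pulls=2, total_pulls=100)   # ~1.82: big exploration bonus
ucb_score(q_value=0.9, pulls=90, total_pulls=100)  # ~1.13: mostly its own Q-value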

🔑 Key Features

Semantic State Matching

# Learns from similar queries
Query 1: "What's the weather in Paris?"
→ Selects: [weather_tool]
→ Success! Reward: +1.0

Query 2: "Temperature in London?"  # Similar to Query 1
→ Uses knowledge from similar past queries!
→ Selects: [weather_tool]  # Smart generalization
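
A sketch of the idea behind this generalization, assuming queries are embedded into vectors and compared by cosine similarity (the embedding model and threshold are illustrative):

import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest_state(query_vec, known_states, threshold=0.85):
    """Return the key of the most similar past query, if similar enough."""
    best_key, best_sim = None, threshold
    for key, vec in known_states.items():
        sim = cosine_similarity(query_vec, vec)
        if sim >= best_sim:
            best_key, best_sim = key, sim
    return best_key  # reuse that state's Q-values instead of starting from zero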

Persistent Learning

# Q-table saved to disk
rl_manager = RLManager(
    tool_names=tools,
    q_table_path="rl_data/agent_qtable.pkl"
)

# Knowledge persists across sessions!
# - Day 1: Learns from 100 queries
# - Day 2: Continues from Day 1 knowledge
# - Day 30: Highly optimized tool selection

Batch Updates

# Update Q-values for multiple tools at once
rl_manager.update_batch(
    state_key="query_key",
    actions=["tool1", "tool2"],
    reward=1.0
)
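
Batch updates fit the common case where one reward signal applies to every tool used in a single run: each listed action receives the same feedback in one call.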

📈 Benefits

Cost Reduction

# Without RL: 5 tools × $0.01 = $0.05 per query
# With RL: 2 tools × $0.01 = $0.02 per query
# Savings: 60% reduction in API costs!

Performance Improvement

# Without RL: 5 tools × 200ms = 1000ms
# With RL: 2 tools × 200ms = 400ms
# Improvement: 60% faster responses!

Automatic Optimization

No manual intervention needed - the system learns automatically from experience.

🚀 Quick Example

from azcore.rl.rl_manager import RLManager
from azcore.rl.rewards import HeuristicRewardCalculator
from azcore.agents.agent_factory import AgentFactory
from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI

# Setup LLM
llm = ChatOpenAI(model="gpt-4")

# Define tools
tools = [search_web, calculate, fetch_weather, send_email]  # tool functions defined elsewhere

# Create RL components
rl_manager = RLManager(
    tool_names=[t.name for t in tools],
    q_table_path="rl_data/agent.pkl",
    exploration_rate=0.15,
    use_embeddings=True  # Semantic matching
)

reward_calculator = HeuristicRewardCalculator(
    success_reward=1.0,
    failure_penalty=-0.5
)

# Create RL-enabled agent
factory = AgentFactory(default_llm=llm)
agent = factory.create_react_agent(
    name="rl_agent",
    tools=tools,
    rl_enabled=True,
    rl_manager=rl_manager,
    reward_calculator=reward_calculator
)

# Use agent - it learns automatically!
queries = ["What's the weather in Paris?", "Temperature in London?"]
for query in queries:
    result = agent.invoke({"messages": [HumanMessage(content=query)]})
    # Agent learns from each interaction

🎓 Summary

Azcore's RL system provides:

  • Q-Learning: Industry-standard value-based RL
  • Semantic Matching: Generalization across similar queries
  • Multiple Strategies: Epsilon-greedy, UCB, Thompson sampling
  • Persistent Storage: Knowledge preserved across sessions
  • Easy Integration: Drop-in enhancement for agents/teams
  • Automatic Learning: No manual tuning required

Transform your agents from static tool users to intelligent, adaptive systems that continuously improve through experience.
