Azcore includes a sophisticated Reinforcement Learning (RL) system that enables agents and teams to learn and improve their tool selection over time. Through Q-learning and adaptive exploration strategies, your AI systems become more efficient and effective with experience.
What is RL in Azcore?
Reinforcement Learning in Azcore allows agents to:
- Learn from experience which tools work best for different queries
- Optimize tool selection automatically over time
- Adapt to changing environments and user preferences
- Reduce costs by selecting the right tools efficiently
- Improve performance through continual learning
Key Concepts
- Q-Learning: Value-based RL algorithm that learns action quality
- State: Query or context that determines tool selection
- Action: Selection of specific tools to use
- Reward: Feedback signal indicating success/failure
- Policy: Strategy for selecting tools (exploration vs exploitation)
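To make these terms concrete, here is a toy illustration (not Azcore's internal representation) of how state, action, reward, and policy relate through a Q-table:

```python
# Toy illustration of the concepts above - not Azcore's internal representation.
# State + action index into the Q-table; the policy reads it to pick a tool.
q_table = {
    ("weather query", "weather_tool"): 0.9,   # learned: works well for this state
    ("weather query", "search_tool"):  0.2,   # learned: rarely helps here
}

def policy(state: str, tools: list[str]) -> str:
    """Greedy policy: choose the action (tool) with the highest Q-value."""
    return max(tools, key=lambda t: q_table.get((state, t), 0.0))

print(policy("weather query", ["weather_tool", "search_tool"]))  # -> weather_tool
# After execution, the reward for (state, chosen tool) is fed back to update Q.
```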
Architecture
User Query
    ↓
RLManager.select_tools()
    ├── State Representation (Embeddings)
    ├── Q-Table Lookup
    └── Exploration Strategy
    ↓
Selected Tools
    ↓
Agent Execution
    ↓
Result + Reward Calculation
    ↓
RLManager.update()
    └── Q-Value Update
Components
- RLManager: Core Q-learning engine for tool selection
- Reward Calculators: Compute feedback signals from results
- State Representation: Semantic embeddings for generalization
- Exploration Strategies: Balance exploration vs exploitation
- Q-Table Persistence: Save and load learned knowledge
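Put together, these components form a simple select → execute → update loop. The toy sketch below re-implements that loop in plain Python purely to show the data flow; this is not the Azcore API (RLManager and the RL-enabled agent handle all of it for you):

```python
# Toy re-implementation of the pipeline above, only to show the data flow.
# This is NOT the Azcore API - RLManager and the RL-enabled agent handle it.
import random

q_table: dict[tuple[str, str], float] = {}     # (state, tool) -> Q-value
ALPHA, EPSILON = 0.1, 0.2                      # learning rate, exploration rate

def select_tools(state: str, tools: list[str]) -> list[str]:
    """Q-table lookup + epsilon-greedy exploration."""
    if random.random() < EPSILON:                                    # explore
        return [random.choice(tools)]
    return [max(tools, key=lambda t: q_table.get((state, t), 0.0))]  # exploit

def update(state: str, chosen: list[str], reward: float) -> None:
    """Nudge Q-values for the chosen tools toward the observed reward."""
    for tool in chosen:
        q = q_table.get((state, tool), 0.0)
        q_table[(state, tool)] = q + ALPHA * (reward - q)

state = "weather query"
chosen = select_tools(state, ["weather_tool", "search_tool"])
reward = 1.0                       # in practice, computed from the execution result
update(state, chosen, reward)
print(q_table)
```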
Why Use RL?
Without RL
# Agent always uses all tools
agent = factory.create_react_agent(
    name="agent",
    tools=[tool1, tool2, tool3, tool4, tool5]  # Always uses all!
)
# Problems:
# - Unnecessary API calls
# - Higher costs
# - Slower response times
# - No optimization over time
With RL
# Agent learns optimal tool selection
rl_agent = factory.create_react_agent(
    name="rl_agent",
    tools=[tool1, tool2, tool3, tool4, tool5],
    rl_enabled=True,
    rl_manager=rl_manager,
    reward_calculator=reward_calc
)
# Benefits:
# - Learns which tools work for which queries
# - Reduces unnecessary tool calls
# - Lower costs through optimization
# - Faster responses
# - Improves automatically over time
How It Works
1. Initial State (Random Exploration)
# First few queries - exploring randomly
Query: "What's the weather?"
→ Selects: [weather_tool, search_tool]  # Random exploration
Result: Success!
Reward: +1.0
2. Learning Phase (Building Knowledge)
# After 10-20 queries - learning patterns
Query: "Weather in NYC"
→ Q-Table: {weather_tool: 0.8, search_tool: 0.2}
→ Selects: [weather_tool]  # Starting to learn!
Result: Success!
Reward: +1.0
→ Q-Table updated: {weather_tool: 0.85, search_tool: 0.2}
3. Exploitation Phase (Using Knowledge)
# After 100+ queries - optimized selection
Query: "Temperature in London"
→ Q-Table: {weather_tool: 0.95, search_tool: 0.15}
→ Selects: [weather_tool]  # Confident choice!
Result: Success!
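The jump from 0.80 to 0.85 in the learning phase is consistent with a simple one-step update Q ← Q + α · (reward − Q) using a learning rate of 0.25. RLManager's actual learning rate and update rule are configurable, so treat the numbers below as illustrative:

```python
# Illustrative arithmetic only - reproduces the 0.80 -> 0.85 step from the trace
# above with a one-step update; RLManager's actual rate/rule may differ.
alpha = 0.25
q_old, reward = 0.80, 1.0
q_new = q_old + alpha * (reward - q_old)
print(q_new)  # 0.85
```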
Exploration vs Exploitation
Epsilon-Greedy Strategy
rl_manager = RLManager(
    tool_names=["tool1", "tool2", "tool3"],
    exploration_rate=0.2  # 20% exploration, 80% exploitation
)
# 80% of the time: Use best tools (exploit)
# 20% of the time: Try random tools (explore)
Exploration Strategies
Epsilon-Greedy (Default)
- Simple, effective
- Fixed exploration rate
- Good for most use cases
Epsilon-Decay
- Starts high, gradually decreases
- More exploration early, less later
- Good for continuous learning
UCB (Upper Confidence Bound)
- Balances exploration intelligently
- Favors under-explored tools
- Good for systematic exploration
Thompson Sampling
- Probabilistic approach
- Samples from a Bayesian posterior over expected rewards
- Good for multi-armed bandits
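To illustrate how UCB favors under-explored tools, it can be pictured as adding an uncertainty bonus to each tool's Q-value. Below is a generic UCB1-style scoring sketch, not the RLManager implementation:

```python
# Generic UCB1 scoring - not the RLManager implementation, just the idea:
# tools that have been tried less get a larger exploration bonus.
import math

def ucb_score(q_value: float, times_tried: int, total_steps: int, c: float = 1.4) -> float:
    if times_tried == 0:
        return float("inf")              # always try a tool at least once
    return q_value + c * math.sqrt(math.log(total_steps) / times_tried)

# weather_tool: strong value, tried often; search_tool: weaker value, barely tried
print(ucb_score(0.9, times_tried=50, total_steps=60))  # ~1.30
print(ucb_score(0.3, times_tried=2,  total_steps=60))  # ~2.30 -> gets explored next
```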
Key Features
Semantic State Matching
# Learns from similar queries
Query 1: "What's the weather in Paris?"
→ Selects: [weather_tool]
→ Success! Reward: +1.0
Query 2: "Temperature in London?"  # Similar to Query 1
→ Uses knowledge from similar past queries!
→ Selects: [weather_tool]  # Smart generalization
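This kind of generalization typically works by embedding the query and reusing Q-values from the most similar previously seen state. The sketch below shows the idea only; it is not Azcore's actual embedding pipeline (which you enable via use_embeddings=True):

```python
# Rough idea behind semantic state matching - not Azcore's actual pipeline.
# A new query is mapped to the most similar previously seen state, so its
# learned Q-values can be reused.
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Pretend embeddings (real ones come from an embedding model)
seen_states = {
    "What's the weather in Paris?": [0.9, 0.1, 0.0],
    "Send an email to Bob":         [0.0, 0.2, 0.9],
}
new_query_embedding = [0.85, 0.15, 0.05]    # "Temperature in London?"

best_match = max(seen_states, key=lambda s: cosine(seen_states[s], new_query_embedding))
print(best_match)   # "What's the weather in Paris?" -> reuse its Q-values
```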
Persistent Learning
# Q-table saved to disk
rl_manager = RLManager(
    tool_names=tools,
    q_table_path="rl_data/agent_qtable.pkl"
)
# Knowledge persists across sessions!
# - Day 1: Learns from 100 queries
# - Day 2: Continues from Day 1 knowledge
# - Day 30: Highly optimized tool selection
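Conceptually, persistence just means serializing the learned Q-values to the configured path and loading them on the next startup. Here is a minimal sketch of that idea using pickle; RLManager manages this for you when q_table_path is set, so you normally never do it by hand:

```python
# Minimal sketch of what Q-table persistence amounts to - RLManager handles
# this automatically when q_table_path is set; shown here only for intuition.
import pickle
from pathlib import Path

path = Path("rl_data/agent_qtable.pkl")
path.parent.mkdir(parents=True, exist_ok=True)

q_table = {("weather query", "weather_tool"): 0.85}
path.write_bytes(pickle.dumps(q_table))        # end of session: save

restored = pickle.loads(path.read_bytes())     # next session: load and keep learning
print(restored)
```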
Batch Updates
# Update Q-values for multiple tools at once
rl_manager.update_batch(
    state_key="query_key",
    actions=["tool1", "tool2"],
    reward=1.0
)
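Batch updates are handy when replaying logged interactions, for example during offline training. The loop below reuses the rl_manager configured above and assumes each log entry has already been reduced to a state key, the tools used, and a reward:

```python
# Replaying logged interactions through update_batch (offline training).
# Assumes each entry is already reduced to (state_key, tools, reward).
logged_interactions = [
    ("weather query", ["weather_tool"], 1.0),
    ("email query",   ["send_email", "search_web"], -0.5),
]

for state_key, tools_used, reward in logged_interactions:
    rl_manager.update_batch(
        state_key=state_key,
        actions=tools_used,
        reward=reward,
    )
```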
Benefits
Cost Reduction
# Without RL: 5 tools × $0.01 = $0.05 per query
# With RL: 2 tools × $0.01 = $0.02 per query
# Savings: 60% reduction in API costs!
Performance Improvement
# Without RL: 5 tools × 200ms = 1000ms
# With RL: 2 tools × 200ms = 400ms
# Improvement: 60% faster responses!
Automatic Optimization
No manual intervention needed - the system learns automatically from experience.
Quick Example
from azcore.rl.rl_manager import RLManager
from azcore.rl.rewards import HeuristicRewardCalculator
from azcore.agents.agent_factory import AgentFactory
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage
# Setup LLM
llm = ChatOpenAI(model="gpt-4")
# Define tools
tools = [search_web, calculate, fetch_weather, send_email]
# Create RL components
rl_manager = RLManager(
    tool_names=[t.name for t in tools],
    q_table_path="rl_data/agent.pkl",
    exploration_rate=0.15,
    use_embeddings=True  # Semantic matching
)
reward_calculator = HeuristicRewardCalculator(
    success_reward=1.0,
    failure_penalty=-0.5
)
# Create RL-enabled agent
factory = AgentFactory(default_llm=llm)
agent = factory.create_react_agent(
    name="rl_agent",
    tools=tools,
    rl_enabled=True,
    rl_manager=rl_manager,
    reward_calculator=reward_calculator
)
# Use agent - it learns automatically!
for query in queries:
    result = agent.invoke({"messages": [HumanMessage(content=query)]})
    # Agent learns from each interaction
What's Next?
- Getting Started: Set up your first RL system
- RL Manager: Deep dive into RLManager
- Reward Calculators: Feedback mechanisms
- Training Workflows: Training best practices
- Production: Deploy RL systems in production
Summary
Azcore's RL system provides:
- Q-Learning: Industry-standard value-based RL
- Semantic Matching: Generalization across similar queries
- Multiple Strategies: Epsilon-greedy, UCB, Thompson sampling
- Persistent Storage: Knowledge preserved across sessions
- Easy Integration: Drop-in enhancement for agents/teams
- Automatic Learning: No manual tuning required
Transform your agents from static tool users to intelligent, adaptive systems that continuously improve through experience.