
Reinforcement Learning

Exploration Strategies

Balancing exploration and exploitation in Azcore RL.

Exploration strategies determine how the RL system balances trying new tools (exploration) against using the best-known tools (exploitation). Azcore supports multiple strategies, each with different trade-offs.

🎯 The Exploration-Exploitation Dilemma

The Problem

# Scenario: Agent has 3 tools
tools = ["search", "calculate", "weather"]

# After initial learning:
Q("What's 2+2?", "calculate") = 0.9  # High Q-value
Q("What's 2+2?", "search") = 0.1     # Low Q-value
Q("What's 2+2?", "weather") = 0.05   # Low Q-value

# Dilemma:
# - Exploit: Always use "calculate" (known good)
# - Explore: Try "search" or "weather" (might be better!)

Why Exploration Matters

Pure Exploitation (always using the best-known tools):

  • ❌ Gets stuck in local optima
  • ❌ Never discovers better strategies
  • ❌ Can't adapt to changes

Pure Exploration (always selecting randomly):

  • ❌ Ignores learned knowledge
  • ❌ Poor performance
  • ❌ Wastes resources

Balanced Approach:

  • ✅ Uses learned knowledge (exploit)
  • ✅ Discovers improvements (explore)
  • ✅ Adapts over time

🎮 Exploration Strategies

Azcore provides 4 strategies:

from azcore.rl.rl_manager import RLManager, ExplorationStrategy

# 1. Epsilon-Greedy
rl_manager = RLManager(
    tool_names=tools,
    exploration_strategy=ExplorationStrategy.EPSILON_GREEDY
)

# 2. Epsilon-Decay
rl_manager = RLManager(
    tool_names=tools,
    exploration_strategy=ExplorationStrategy.EPSILON_DECAY
)

# 3. UCB (Upper Confidence Bound)
rl_manager = RLManager(
    tool_names=tools,
    exploration_strategy=ExplorationStrategy.UCB
)

# 4. Thompson Sampling
rl_manager = RLManager(
    tool_names=tools,
    exploration_strategy=ExplorationStrategy.THOMPSON_SAMPLING
)

1️⃣ Epsilon-Greedy (Default)

Simple, effective baseline strategy.

How It Works

if random.random() < epsilon:
    # EXPLORE: select a random subset of tools
    selected = random.sample(tools, k=random.randint(1, 3))
else:
    # EXPLOIT: select the tools with the highest Q-values
    selected = top_k_tools_by_q_value(k=3)
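
To make the mechanism concrete, here is a minimal, self-contained sketch of epsilon-greedy selection over a toy Q-table. It does not use the Azcore API; the tool names and Q-values are purely illustrative.

import random

# Toy Q-table with illustrative values (not produced by Azcore)
q_values = {"search": 0.1, "calculate": 0.9, "weather": 0.05}
epsilon = 0.15  # probability of exploring

def select_tool(q_values, epsilon):
    if random.random() < epsilon:
        # Explore: pick any tool uniformly at random
        return random.choice(list(q_values))
    # Exploit: pick the tool with the highest Q-value
    return max(q_values, key=q_values.get)

# Over many selections, roughly 15% are random and the rest are "calculate"
counts = {tool: 0 for tool in q_values}
for _ in range(1000):
    counts[select_tool(q_values, epsilon)] += 1
print(counts)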

Configuration

rl_manager = RLManager(
    tool_names=tools,
    exploration_strategy=ExplorationStrategy.EPSILON_GREEDY,
    exploration_rate=0.15  # 15% exploration, 85% exploitation
)

# Behavior:
# - 15% of time: Random tool selection
# - 85% of time: Best known tools

When to Use

Good for:

  • Simple, predictable behavior
  • Stable environments
  • Quick prototyping
  • Baseline comparisons

Not ideal for:

  • Scenarios that need adaptive exploration
  • Continuous learning scenarios

Example

# Setup
rl_manager = RLManager(
    tool_names=["tool1", "tool2", "tool3"],
    exploration_strategy=ExplorationStrategy.EPSILON_GREEDY,
    exploration_rate=0.2  # 20% exploration
)

# 100 tool selections:
# - ~20 will be random (exploration)
# - ~80 will be best tools (exploitation)

for i in range(100):
    selected, _ = rl_manager.select_tools(f"Query {i}", top_n=2)
    # Consistent 20/80 split throughout

2️⃣ Epsilon-Decay

Starts with high exploration, gradually decreases.

How It Works

With exploration_rate=0.3, epsilon_decay_rate=0.995, min_exploration_rate=0.01:

Episode 0:     epsilon = 0.300  (30% exploration)
Episode 100:   epsilon ≈ 0.182  (18% exploration)
Episode 500:   epsilon ≈ 0.024  (2% exploration)
Episode 1000+: epsilon = 0.010  (1% exploration, the configured minimum)

Configuration

rl_manager = RLManager(
    tool_names=tools,
    exploration_strategy=ExplorationStrategy.EPSILON_DECAY,
    exploration_rate=0.3,          # Starting rate (30%)
    epsilon_decay_rate=0.995,      # Decay factor per episode
    min_exploration_rate=0.01      # Floor (1%)
)

# Decay formula: epsilon = max(min_rate, epsilon * decay_rate)

Decay Schedule Example

epsilon = 0.3
decay_rate = 0.995
min_rate = 0.01

for episode in [0, 50, 100, 200, 500, 1000]:
    epsilon = max(min_rate, 0.3 * (decay_rate ** episode))
    print(f"Episode {episode:4d}: epsilon = {epsilon:.3f}")

# Output:
# Episode    0: epsilon = 0.300
# Episode   50: epsilon = 0.233
# Episode  100: epsilon = 0.182
# Episode  200: epsilon = 0.110
# Episode  500: epsilon = 0.024
# Episode 1000: epsilon = 0.010

Manual Decay

# Manually trigger decay
rl_manager.anneal_exploration(
    decay_rate=0.99,
    min_rate=0.01
)

# Use in training loop
for epoch in range(100):
    train_epoch(rl_manager)

    # Decay after each epoch
    rl_manager.anneal_exploration(decay_rate=0.95)

When to Use

Good for:

  • Continuous learning
  • Production systems
  • Long-running agents
  • Adaptive behavior

Not ideal for:

  • Fixed exploration needs
  • Short training periods

Example

# Setup with decay
rl_manager = RLManager(
    tool_names=tools,
    exploration_strategy=ExplorationStrategy.EPSILON_DECAY,
    exploration_rate=0.5,      # Start high
    epsilon_decay_rate=0.99,
    min_exploration_rate=0.01
)

# Training loop
for episode in range(1000):
    query = f"Training query {episode}"
    selected, state_key = rl_manager.select_tools(query)

    # Train...
    reward = get_reward()
    for tool in selected:
        rl_manager.update(state_key, tool, reward)

    # Exploration rate decreases automatically
    if episode % 100 == 0:
        print(f"Episode {episode}: epsilon = {rl_manager.exploration_rate:.3f}")

# Output shows decreasing exploration

3️⃣ UCB (Upper Confidence Bound)

Systematically explores under-explored tools.

How It Works

# UCB score for each tool (log = natural logarithm):
#
#   UCB(tool) = Q(tool) + c * sqrt(log(total_visits) / tool_visits)
#                  |                        |
#            exploitation           exploration bonus
#
# Select the tools with the highest UCB scores

Configuration

rl_manager = RLManager(
    tool_names=tools,
    exploration_strategy=ExplorationStrategy.UCB,
    ucb_c=2.0  # Exploration constant (higher = more exploration)
)

# Common c values:
# - c = 1.0: Conservative exploration
# - c = 2.0: Balanced (recommended)
# - c = 3.0: Aggressive exploration

How UCB Balances

# Example state with 3 tools, 200 total visits, c = 2:
Tool 1: Q=0.8, visits=150  → UCB = 0.8 + 2*sqrt(ln(200)/150) = 1.18
Tool 2: Q=0.6, visits=40   → UCB = 0.6 + 2*sqrt(ln(200)/40)  = 1.33 ✓ Selected!
Tool 3: Q=0.4, visits=10   → UCB = 0.4 + 2*sqrt(ln(200)/10)  = 1.86 ✓ Selected!

# Tools 2 and 3 get a large exploration bonus for being under-explored,
# so they are selected even though Tool 1 has the highest Q-value
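
A quick way to check these numbers, and to see how ucb_c shifts the balance, is to compute the scores directly. This is a standalone sketch of the UCB formula above, not a call into Azcore; the Q-values and visit counts are the illustrative figures from the example.

import math

c = 2.0  # exploration constant
tools = {            # tool: (Q-value, visit count) — illustrative values
    "tool1": (0.8, 150),
    "tool2": (0.6, 40),
    "tool3": (0.4, 10),
}
total_visits = sum(visits for _, visits in tools.values())

# UCB score = exploitation term + exploration bonus
scores = {
    name: q + c * math.sqrt(math.log(total_visits) / visits)
    for name, (q, visits) in tools.items()
}

# Select the two tools with the highest UCB scores
top_2 = sorted(scores, key=scores.get, reverse=True)[:2]
print({name: round(score, 2) for name, score in scores.items()})
# {'tool1': 1.18, 'tool2': 1.33, 'tool3': 1.86}
print(top_2)  # ['tool3', 'tool2']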

When to Use

Good for:

  • Systematic exploration
  • Multi-armed bandit problems
  • Problems that need theoretical guarantees
  • Balanced exploration/exploitation

Not ideal for:

  • Simplicity requirements
  • Very large action spaces

Example

# Setup UCB
rl_manager = RLManager(
    tool_names=["tool1", "tool2", "tool3", "tool4"],
    exploration_strategy=ExplorationStrategy.UCB,
    ucb_c=2.0
)

# Simulate learning
for i in range(100):
    selected, state_key = rl_manager.select_tools(f"Query {i}", top_n=2)

    # UCB automatically balances:
    # - Early: Explores all tools
    # - Middle: Focuses on promising tools
    # - Late: Exploits best tools (but still explores occasionally)

    # Simulate rewards
    rewards = {"tool1": 0.9, "tool2": 0.5, "tool3": 0.7, "tool4": 0.3}
    for tool in selected:
        rl_manager.update(state_key, tool, rewards[tool])

# Check visit distribution
print("Visit counts:")
for tool in rl_manager.tool_names:
    visits = sum(
        rl_manager.visit_counts[state][tool]
        for state in rl_manager.q_table.keys()
    )
    print(f"  {tool}: {visits} visits")

# Output shows balanced exploration across all tools

4️⃣ Thompson Sampling

Bayesian probabilistic approach.

How It Works

# For each tool, maintain:
# - alpha: successes count
# - beta: failures count

# Sample from Beta distribution:
sampled_value = beta_distribution(alpha, beta)

# Select tools with highest sampled values

Configuration

rl_manager = RLManager(
    tool_names=tools,
    exploration_strategy=ExplorationStrategy.THOMPSON_SAMPLING
)

# No additional parameters needed
# Alpha and beta are updated automatically based on rewards

How Thompson Sampling Learns

# Initial state (uniform prior):
alpha = 1.0, beta = 1.0  # No knowledge

# After positive reward (+1.0):
alpha += 1.0  → alpha = 2.0, beta = 1.0

# After negative reward (-0.5):
beta += 0.5  → alpha = 2.0, beta = 1.5

# The Beta posterior becomes more peaked around the tool's true success rate
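
The update-and-sample loop can be sketched with Python's built-in random.betavariate. This is a standalone Beta-Bernoulli illustration of the idea, not Azcore's internal implementation; the tool names and success rates are made up.

import random

tools = ["tool1", "tool2", "tool3"]
true_rates = {"tool1": 0.9, "tool2": 0.5, "tool3": 0.7}  # unknown to the agent

# Uniform Beta(1, 1) prior for every tool
alpha = {t: 1.0 for t in tools}
beta = {t: 1.0 for t in tools}

for _ in range(500):
    # Sample a plausible success rate for each tool, pick the highest sample
    samples = {t: random.betavariate(alpha[t], beta[t]) for t in tools}
    chosen = max(samples, key=samples.get)

    # Observe a success/failure and update that tool's posterior
    if random.random() < true_rates[chosen]:
        alpha[chosen] += 1.0
    else:
        beta[chosen] += 1.0

for t in tools:
    estimate = alpha[t] / (alpha[t] + beta[t])
    print(f"{t}: estimated success rate = {estimate:.2f}")
# Estimates converge toward the true rates, and "tool1" is chosen most often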

When to Use

Good for:

  • Optimal exploration (Bayesian optimal)
  • Multi-armed bandits
  • Reward uncertainty quantification
  • Advanced use cases

Not ideal for:

  • Simplicity requirements
  • Interpretability needs

Example

# Setup Thompson Sampling
rl_manager = RLManager(
    tool_names=["tool1", "tool2", "tool3"],
    exploration_strategy=ExplorationStrategy.THOMPSON_SAMPLING
)

# Simulate learning
for i in range(100):
    selected, state_key = rl_manager.select_tools(f"Query {i}", top_n=1)

    # True success rates (unknown to agent):
    true_rates = {"tool1": 0.9, "tool2": 0.5, "tool3": 0.7}

    # Simulate reward based on true rate
    import random
    for tool in selected:
        reward = 1.0 if random.random() < true_rates[tool] else -1.0
        rl_manager.update(state_key, tool, reward)

# Check learned distributions
state_key = "Query 0"
for tool in rl_manager.tool_names:
    alpha = rl_manager.alpha_params[state_key][tool]
    beta = rl_manager.beta_params[state_key][tool]
    estimated_rate = alpha / (alpha + beta)
    print(f"{tool}: α={alpha:.1f}, β={beta:.1f}, "
          f"estimated_rate={estimated_rate:.2f}")

# Output shows convergence to true rates

📊 Strategy Comparison

| Strategy       | Complexity  | Performance   | Adaptability    | Use Case                         |
|----------------|-------------|---------------|-----------------|----------------------------------|
| Epsilon-Greedy | ⭐ Low       | ⭐⭐⭐ Good      | ⭐⭐ Fixed        | Simple, stable                   |
| Epsilon-Decay  | ⭐⭐ Low      | ⭐⭐⭐⭐ Better   | ⭐⭐⭐⭐ Adaptive   | Production, continuous learning  |
| UCB            | ⭐⭐⭐ Medium  | ⭐⭐⭐⭐ Better   | ⭐⭐⭐ Systematic  | Balanced exploration             |
| Thompson       | ⭐⭐⭐⭐ High   | ⭐⭐⭐⭐⭐ Best    | ⭐⭐⭐⭐ Bayesian   | Optimal exploration              |

🔄 Changing Strategies

Runtime Strategy Change

# Start with epsilon-decay
rl_manager = RLManager(
    tool_names=tools,
    exploration_strategy=ExplorationStrategy.EPSILON_DECAY
)

# Train for 100 episodes
train(rl_manager, episodes=100)

# Switch to epsilon-greedy for stable production
rl_manager.set_exploration_strategy(ExplorationStrategy.EPSILON_GREEDY)
rl_manager.exploration_rate = 0.05  # Low exploration

# Use in production
deploy(rl_manager)

Strategy Selection Guide

def recommend_strategy(scenario):
    if scenario == "prototyping":
        return ExplorationStrategy.EPSILON_GREEDY, 0.15

    elif scenario == "continuous_learning":
        return ExplorationStrategy.EPSILON_DECAY, 0.3

    elif scenario == "systematic_exploration":
        return ExplorationStrategy.UCB, None

    elif scenario == "optimal_performance":
        return ExplorationStrategy.THOMPSON_SAMPLING, None

    elif scenario == "production":
        return ExplorationStrategy.EPSILON_DECAY, 0.05

# Usage
strategy, rate = recommend_strategy("production")
rl_manager = RLManager(
    tool_names=tools,
    exploration_strategy=strategy,
    exploration_rate=rate if rate is not None else 0.15
)

🎯 Best Practices

1. Start with Epsilon-Greedy

# ✅ Begin with simple baseline
rl_manager = RLManager(
    tool_names=tools,
    exploration_strategy=ExplorationStrategy.EPSILON_GREEDY,
    exploration_rate=0.15
)

2. Use Epsilon-Decay for Production

# ✅ Adapt exploration over time
rl_manager = RLManager(
    tool_names=tools,
    exploration_strategy=ExplorationStrategy.EPSILON_DECAY,
    exploration_rate=0.2,
    epsilon_decay_rate=0.995,
    min_exploration_rate=0.01
)

3. Tune Exploration Rate

# Development/Training: Higher exploration
exploration_rate = 0.3

# Production: Lower exploration
exploration_rate = 0.05

# A/B testing: Vary exploration
exploration_rates = [0.05, 0.1, 0.15, 0.2]

4. Monitor Exploration

# Track exploration rate over time
exploration_history = []

for episode in range(1000):
    train_episode(rl_manager)
    exploration_history.append(rl_manager.exploration_rate)

# Plot decay curve
import matplotlib.pyplot as plt
plt.plot(exploration_history)
plt.xlabel("Episode")
plt.ylabel("Exploration Rate")
plt.title("Exploration Decay")
plt.show()

🚀 Complete Example

from azcore.rl.rl_manager import RLManager, ExplorationStrategy
from azcore.rl.rewards import HeuristicRewardCalculator

# Compare strategies
strategies = [
    (ExplorationStrategy.EPSILON_GREEDY, "Epsilon-Greedy"),
    (ExplorationStrategy.EPSILON_DECAY, "Epsilon-Decay"),
    (ExplorationStrategy.UCB, "UCB"),
    (ExplorationStrategy.THOMPSON_SAMPLING, "Thompson Sampling")
]

results = {}

for strategy, name in strategies:
    print(f"\n=== Training with {name} ===")

    # Create RL manager
    rl_manager = RLManager(
        tool_names=["tool1", "tool2", "tool3"],
        exploration_strategy=strategy,
        exploration_rate=0.2 if strategy != ExplorationStrategy.UCB else 0.0,
        q_table_path=f"rl_data/{name.lower().replace(' ', '_')}.pkl"
    )

    # Train
    correct = 0
    for i in range(100):
        query = f"Query {i}"
        selected, state_key = rl_manager.select_tools(query, top_n=1)

        # Simulate reward (tool1 is best)
        reward = 1.0 if "tool1" in selected else -0.5
        if "tool1" in selected:
            correct += 1

        for tool in selected:
            rl_manager.update(state_key, tool, reward)

    accuracy = correct / 100
    results[name] = accuracy
    print(f"Accuracy: {accuracy:.2%}")

# Compare results
print("\n=== Strategy Comparison ===")
for name, accuracy in sorted(results.items(), key=lambda x: x[1], reverse=True):
    print(f"{name:20s}: {accuracy:.2%}")

🎓 Summary

Exploration strategies in Azcore:

  • Epsilon-Greedy: Simple, effective baseline
  • Epsilon-Decay: Adaptive exploration for production
  • UCB: Systematic, balanced exploration
  • Thompson Sampling: Optimal Bayesian exploration

Choose based on your needs:

  • Simplicity → Epsilon-Greedy
  • Production → Epsilon-Decay
  • Balanced → UCB
  • Optimal → Thompson Sampling

All strategies are production-ready and can be changed at runtime.
