Generate multiple independent solutions and select the most consistent answer through voting or aggregation. Improves accuracy and reliability for tasks with verifiable answers.
Overview
Self-Consistency is a powerful technique where an agent generates multiple independent solutions to the same problem, then selects the most consistent or frequent answer. This pattern is particularly effective for tasks with objective answers where multiple reasoning paths can lead to the same conclusion.
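Conceptually, the loop is simple: sample several independent completions, extract a canonical answer from each, and return the one that appears most often. A minimal, framework-agnostic sketch (the ask_llm callable is a placeholder for whatever model call you use, not part of azcore):
import re
from collections import Counter

def self_consistent_answer(question, ask_llm, num_generations=5):
    """Sample several independent solutions and return the majority answer."""
    answers = []
    for _ in range(num_generations):
        output = ask_llm(question)  # one independent reasoning path
        match = re.search(r"ANSWER:\s*(.+)", output)
        if match:
            answers.append(match.group(1).strip())
    if not answers:
        return None, 0.0
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / len(answers)  # majority answer and agreement rate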
When to Use
- Mathematical reasoning: Math problems, calculations, logic puzzles
- Factual questions: Questions with definitive answers
- Classification: Multiple-choice or categorization tasks
- Fact verification: Checking claims against knowledge
- Code generation: When tests can validate correctness
- Decision making: Selecting between discrete options
Basic Usage
from azcore import SelfConsistencyAgent, Agent
# Create base agent
base_agent = Agent(
agent_name="Reasoner",
system_prompt="Solve the problem step by step. Show your reasoning.",
llm=llm,
)
# Create self-consistency agent
self_consistency = SelfConsistencyAgent(
agent=base_agent,
num_generations=5, # Generate 5 solutions
voting_strategy="majority", # Use majority voting
)
# Generate multiple solutions and select best
result = self_consistency.run("What is 15% of 240?")
print(f"Answer: {result.final_answer}")
print(f"Confidence: {result.confidence}")
print(f"All answers: {result.all_answers}")
Configuration Options
num_generations
Number of independent solutions to generate:
self_consistency = SelfConsistencyAgent(
agent=base_agent,
num_generations=10, # More generations = higher confidence
)
Recommendations:
- Simple tasks: 3-5 generations
- Medium complexity: 5-7 generations
- Complex/critical: 10+ generations
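If cost is a concern, the sample count does not have to be fixed up front: you can generate solutions one at a time and stop as soon as one answer holds a clear majority. The sketch below illustrates that pattern against a plain agent.run() call with a caller-supplied answer extractor; it is not a built-in azcore feature.
from collections import Counter

def sample_until_consensus(agent, task, extract,
                           min_samples=3, max_samples=10, threshold=0.6):
    """Generate solutions one at a time; stop once agreement passes the threshold."""
    answers = []
    for _ in range(max_samples):
        answer = extract(agent.run(task))  # one independent solution
        if answer is None:
            continue  # unparseable output; skip it
        answers.append(answer)
        if len(answers) >= min_samples:
            top, votes = Counter(answers).most_common(1)[0]
            if votes / len(answers) >= threshold:
                return top, votes / len(answers)
    if not answers:
        return None, 0.0
    top, votes = Counter(answers).most_common(1)[0]  # no early consensus: plain majority
    return top, votes / len(answers)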
voting_strategy
How to select the final answer:
# Majority voting (default)
self_consistency = SelfConsistencyAgent(
agent=base_agent,
voting_strategy="majority", # Most frequent answer wins
)
# Weighted voting by confidence
self_consistency = SelfConsistencyAgent(
agent=base_agent,
voting_strategy="weighted", # Consider agent confidence scores
)
# Unanimous consensus
self_consistency = SelfConsistencyAgent(
agent=base_agent,
voting_strategy="unanimous", # All must agree
)
# Custom voting function
def custom_vote(answers):
    # Your voting logic goes here; this placeholder simply picks the first answer
    selected_answer = answers[0]
    return selected_answer
self_consistency = SelfConsistencyAgent(
agent=base_agent,
voting_strategy="custom",
voting_function=custom_vote,
)
temperature
Control diversity of solutions:
self_consistency = SelfConsistencyAgent(
agent=base_agent,
num_generations=5,
temperature=0.8, # Higher = more diverse solutions
)
parallel_execution
Run generations in parallel:
self_consistency = SelfConsistencyAgent(
agent=base_agent,
num_generations=10,
parallel_execution=True, # Faster but more resource-intensive
)
Advanced Examples
Mathematical Reasoning
import re
from azcore import Agent, SelfConsistencyAgent
# Math-focused agent
math_agent = Agent(
agent_name="Math Solver",
system_prompt="""Solve math problems step by step.
Format your final answer as: ANSWER: [number]
Show all work:
1. Identify the problem type
2. Set up equations
3. Solve step by step
4. Verify the answer""",
llm=llm,
)
# Self-consistency for math
math_solver = SelfConsistencyAgent(
agent=math_agent,
num_generations=7,
voting_strategy="majority",
answer_extractor=lambda text: (m.group(1) if (m := re.search(r'ANSWER:\s*\$?(\d+\.?\d*)', text)) else None),
)
# Solve math problem
problem = """
A store is having a 25% off sale. If an item originally costs $80,
and you have a coupon for an additional $10 off the sale price,
how much will you pay?
"""
result = math_solver.run(problem)
print(f"Answer: ${result.final_answer}")
print(f"Agreement: {result.agreement_rate:.1%}")
print(f"\nAll solutions:")
for i, answer in enumerate(result.all_answers):
print(f" {i+1}. ${answer}")
Multiple Choice Questions
# Multiple choice agent
mc_agent = Agent(
agent_name="Multiple Choice",
system_prompt="""Answer multiple choice questions.
Think through each option carefully.
Eliminate incorrect answers.
Select the best answer.
Format: ANSWER: [A/B/C/D]""",
llm=llm,
)
# Self-consistency for MCQ
mc_solver = SelfConsistencyAgent(
agent=mc_agent,
num_generations=5,
voting_strategy="majority",
answer_extractor=lambda text: (m.group(1) if (m := re.search(r'ANSWER:\s*([A-D])', text)) else None),
)
# Answer question
question = """
Which of the following is NOT a benefit of using microservices architecture?
A) Independent deployment of services
B) Technology diversity across services
C) Reduced overall system complexity
D) Better fault isolation
Provide your reasoning and answer.
"""
result = mc_solver.run(question)
print(f"Selected answer: {result.final_answer}")
print(f"Confidence: {result.confidence:.1%}")
Code Generation with Verification
# Code generation agent
code_agent = Agent(
agent_name="Code Generator",
system_prompt="""Generate Python code to solve the problem.
Requirements:
- Write clean, readable code
- Include error handling
- Add docstrings
- Provide example usage""",
llm=llm,
tools=[code_execution_tool],
)
# Self-consistency for code
code_generator = SelfConsistencyAgent(
agent=code_agent,
num_generations=5,
voting_strategy="test_passing", # Select code that passes tests
test_suite=test_cases,
)
# Generate code
problem = """
Write a function that finds the longest common subsequence of two strings.
Example:
lcs("ABCDGH", "AEDFHR") should return "ADH"
"""
result = code_generator.run(problem)
print("Generated code:")
print(result.final_answer)
print(f"\nTests passed: {result.tests_passed}/{result.total_tests}")
Fact Verification
# Fact checker agent
fact_checker = Agent(
agent_name="Fact Checker",
system_prompt="""Verify if the claim is true or false.
Research the claim thoroughly.
Cite reliable sources.
Consider counter-evidence.
Format: VERDICT: [TRUE/FALSE/UNCERTAIN]
Confidence: [HIGH/MEDIUM/LOW]""",
llm=llm,
tools=[search_tool, wikipedia_tool],
)
# Self-consistency for facts
fact_verifier = SelfConsistencyAgent(
agent=fact_checker,
num_generations=5,
voting_strategy="weighted", # Weight by stated confidence
confidence_extractor=lambda text: (
"HIGH" if "Confidence: HIGH" in text else
"MEDIUM" if "Confidence: MEDIUM" in text else
"LOW"
),
)
# Verify claim
claim = "The Great Wall of China is visible from space with the naked eye."
result = fact_verifier.run(f"Verify this claim: {claim}")
print(f"Verdict: {result.final_answer}")
print(f"Consensus: {result.agreement_rate:.1%}")
print(f"\nIndividual verdicts:")
for i, verdict in enumerate(result.all_answers):
print(f" Attempt {i+1}: {verdict}")
Classification Task
# Sentiment classifier
sentiment_agent = Agent(
agent_name="Sentiment Analyzer",
system_prompt="""Classify the sentiment of the text.
Categories: POSITIVE, NEGATIVE, NEUTRAL
Consider:
- Overall tone
- Emotional words
- Context
- Sarcasm or irony
Format: SENTIMENT: [category]""",
llm=llm,
)
# Self-consistency for classification
classifier = SelfConsistencyAgent(
agent=sentiment_agent,
num_generations=5,
voting_strategy="majority",
answer_extractor=lambda text: (m.group(1) if (m := re.search(r'SENTIMENT:\s*(\w+)', text)) else None),
)
# Classify sentiment
text = """
The product arrived late and the packaging was damaged, but the customer
service team was incredibly helpful and immediately sent a replacement.
"""
result = classifier.run(text)
print(f"Sentiment: {result.final_answer}")
print(f"Agreement: {result.agreement_rate:.1%}")
print(f"\nVote distribution:")
for sentiment, count in result.vote_distribution.items():
print(f" {sentiment}: {count} votes")
Voting Strategies
Majority Voting
Simple and effective - most frequent answer wins:
def majority_vote(answers):
from collections import Counter
counts = Counter(answers)
return counts.most_common(1)[0][0]
Best for:
- Discrete answers (A/B/C, True/False)
- Classification tasks
- Multiple choice questions
Weighted Voting
Consider confidence scores:
def weighted_vote(answers_with_confidence):
scores = {}
for answer, confidence in answers_with_confidence:
scores[answer] = scores.get(answer, 0) + confidence
return max(scores, key=scores.get)
Best for:
- When agents provide confidence scores
- Nuanced decision making
- Combining outputs of varying quality
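For example, if the agent only reports HIGH / MEDIUM / LOW confidence (as in the fact-verification example above), map those labels to numeric weights before voting. The weight values below are illustrative assumptions, and weighted_vote is the helper defined above:
CONFIDENCE_WEIGHTS = {"HIGH": 1.0, "MEDIUM": 0.6, "LOW": 0.3}  # illustrative weights

answers_with_confidence = [
    ("FALSE", CONFIDENCE_WEIGHTS["HIGH"]),
    ("FALSE", CONFIDENCE_WEIGHTS["MEDIUM"]),
    ("TRUE", CONFIDENCE_WEIGHTS["LOW"]),
]

print(weighted_vote(answers_with_confidence))  # FALSE (1.6 vs. 0.3)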
Unanimous Consensus
All generations must agree:
def unanimous_vote(answers):
if len(set(answers)) == 1:
return answers[0]
else:
raise NoConsensusError("Generations do not agree")
Best for:
- Critical decisions
- Safety-critical applications
- High-stakes scenarios
Threshold-Based
Require minimum agreement:
def threshold_vote(answers, threshold=0.6):
from collections import Counter
counts = Counter(answers)
most_common, count = counts.most_common(1)[0]
if count / len(answers) >= threshold:
return most_common
else:
raise InsufficientConsensusError(f"Agreement below {threshold:.0%}")
Best for:
- Balancing confidence and diversity
- Quality control
- Adjustable strictness
Answer Extraction
Extract the final answer from agent outputs:
Pattern Matching
import re
def extract_answer(text):
# Look for "ANSWER: X" pattern
match = re.search(r'ANSWER:\s*(.+)', text, re.IGNORECASE)
if match:
return match.group(1).strip()
return None
Last Line
def extract_last_line(text):
lines = text.strip().split('\n')
return lines[-1] if lines else None
JSON Extraction
import json
import re
def extract_json_answer(text):
# Extract JSON object
match = re.search(r'\{.*\}', text, re.DOTALL)
if match:
data = json.loads(match.group(0))
return data.get('answer')
return None
Numerical Extraction
def extract_number(text):
# Find all numbers
numbers = re.findall(r'-?\d+\.?\d*', text)
# Return last number (usually the answer)
return float(numbers[-1]) if numbers else None
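In practice no single pattern covers every output, so it is common to chain extractors from most specific to most permissive and take the first hit. A sketch composed from the helpers above (it assumes the imports shown in those snippets):
def extract_any(text):
    """Try each extractor in order of specificity; return the first usable answer."""
    extractors = (extract_answer, extract_json_answer, extract_number, extract_last_line)
    for extractor in extractors:
        try:
            answer = extractor(text)
        except (ValueError, json.JSONDecodeError):
            continue  # malformed JSON or number; fall through to the next strategy
        if answer is not None:
            return answer
    return None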
Best Practices
1. Clear Answer Format
Instruct agents to format answers consistently:
system_prompt = """
Solve the problem and provide your answer in this format:
FINAL ANSWER: [your answer here]
This format is mandatory for answer extraction.
"""
2. Appropriate num_generations
Balance quality vs. cost:
# Simple task: fewer generations
easy_task = SelfConsistencyAgent(agent=agent, num_generations=3)
# Complex task: more generations
hard_task = SelfConsistencyAgent(agent=agent, num_generations=10)
# Critical task: many generations
critical_task = SelfConsistencyAgent(agent=agent, num_generations=20)
3. Temperature Settings
Higher temperature increases diversity:
# Low diversity (fast convergence)
agent_low = Agent(temperature=0.3)
sc_low = SelfConsistencyAgent(agent=agent_low, num_generations=5)
# High diversity (explore more paths)
agent_high = Agent(temperature=0.9)
sc_high = SelfConsistencyAgent(agent=agent_high, num_generations=5)
4. Monitor Agreement Rate
Low agreement indicates an uncertain answer:
result = self_consistency.run(task)
if result.agreement_rate < 0.5:
print("Warning: Low agreement. Answer may be unreliable.")
# Consider increasing num_generations or reformulating task
5. Use Parallel Execution
For speed when possible:
self_consistency = SelfConsistencyAgent(
agent=agent,
num_generations=10,
parallel_execution=True, # Faster
max_workers=5, # Control parallelism
)
Performance Considerations
Latency
# Sequential: latency = num_generations × agent_time
sc_sequential = SelfConsistencyAgent(
agent=agent,
num_generations=10,
parallel_execution=False,
)
# Latency: ~10x single agent
# Parallel: latency ≈ agent_time
sc_parallel = SelfConsistencyAgent(
agent=agent,
num_generations=10,
parallel_execution=True,
)
# Latency: ~1x single agent (with more resources)
Cost
# Cost = num_generations × per_agent_cost
# Monitor token usage
result = self_consistency.run(task)
print(f"Total tokens: {result.total_tokens}")
print(f"Cost: ${result.estimated_cost}")
Quality vs. Generations
Returns diminish after a certain point: agreement climbs steeply over the first few generations and then flattens. Going from 1 to 5 samples usually gives a large gain, 5 to 10 a modest one, and beyond roughly 10-15 the marginal improvement rarely justifies the extra cost.
Optimal: 5-10 generations for most tasks
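The diminishing returns are easy to see with a small standalone simulation of majority voting over independent samples (not part of azcore; the 70% per-sample accuracy is an arbitrary assumption):
import random
from collections import Counter

def simulate_majority_accuracy(per_sample_accuracy=0.7, num_generations=5, trials=2000):
    """Estimate how often majority voting lands on the correct answer."""
    wins = 0
    for _ in range(trials):
        # Each generation is either correct or one of two distinct wrong answers.
        samples = [
            "correct" if random.random() < per_sample_accuracy
            else random.choice(["wrong_a", "wrong_b"])
            for _ in range(num_generations)
        ]
        if Counter(samples).most_common(1)[0][0] == "correct":
            wins += 1
    return wins / trials

for n in (1, 3, 5, 10, 20):
    print(f"{n:2d} generations -> ~{simulate_majority_accuracy(num_generations=n):.0%} correct")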
Error Handling
Handle cases where consensus isn't reached:
try:
result = self_consistency.run(task)
except NoConsensusError:
# Fallback strategy
result = fallback_agent.run(task)
except InsufficientGenerationsError:
# Too few valid generations
result = retry_with_more_generations()
Debugging
Inspect All Generations
result = self_consistency.run(task)
print("All generations:")
for i, generation in enumerate(result.generations):
print(f"\n=== Generation {i+1} ===")
print(generation.output)
print(f"Extracted answer: {generation.extracted_answer}")
Analyze Vote Distribution
print("\nVote distribution:")
for answer, count in sorted(result.vote_distribution.items(),
key=lambda x: x[1], reverse=True):
percentage = (count / result.num_generations) * 100
print(f"{answer}: {count} votes ({percentage:.1f}%)")
Check Reasoning Paths
# Group generations by answer
from collections import defaultdict
by_answer = defaultdict(list)
for gen in result.generations:
by_answer[gen.extracted_answer].append(gen.reasoning)
print(f"\nAnswer: {result.final_answer}")
print("Supporting reasoning paths:")
for reasoning in by_answer[result.final_answer]:
print(f"\n- {reasoning[:200]}...")
Limitations
Not Suitable For:
- Open-ended generation: No "correct" answer to vote on
- Creative tasks: Diversity is the goal, not consensus
- Very simple tasks: Overkill and wasteful
- Tasks without extractable answers: Can't identify agreement
Better Alternatives:
- Open-ended → Use Reflexion or Reasoning Duo
- Creative → Use single agent or Mixture of Agents
- Simple → Use single agent call
- Complex reasoning → Combine with Reflexion
Research Background
Based on "Self-Consistency Improves Chain of Thought Reasoning in Language Models":
- Paper: arxiv.org/abs/2203.11171
- Authors: Wang et al., 2022 (published at ICLR 2023)
- Key finding: sampling a diverse set of reasoning paths and selecting the most consistent answer (marginalizing out the individual reasoning paths) significantly improves accuracy on arithmetic and commonsense reasoning benchmarks