Agent Evaluation
Test, measure, and improve AI agent performance systematically
TL;DR
You can't improve what you don't measure. Create benchmark suites (standard tasks, edge cases, adversarial tests), run your agent against them, and track metrics: success rate, time, quality score. Compare across agent versions to catch regressions.
| Difficulty | Advanced |
| Time | ~2 days |
| Code Size | ~500 LOC |
| Prerequisites | Tool Calling Agent |
Why Evaluate Agents?
Without systematic evaluation, you are guessing whether your agent actually works. Manual spot-checking creates a false sense of confidence: your agent handles the 5 prompts you tested, but silently fails on the 500 it will see in production.
Evaluation answers concrete questions:
- Is a ReAct agent better than a basic chain? Run both against the same benchmark suite and compare success rates, latency, and token cost.
- Does your agent pick the right tools? Track `required_tools` vs actual tool calls to measure selection accuracy.
- Did your last change break something? Regression testing catches performance degradation before users do.
Testing Approaches
Vibes-Based Testing
Run a few prompts manually, eyeball the output, ship it. No record of what was tested, no way to detect regressions, no confidence in edge cases.
Systematic Evaluation
Recommended. Define benchmark suites with validation functions, run them automatically, and track metrics over time. Every agent version gets a score you can compare against the last.
What You'll Learn
- Designing agent evaluation benchmarks
- Measuring task completion and quality
- Testing agent robustness and safety
- Continuous evaluation in production
Tech Stack
| Component | Technology | Why |
|---|---|---|
| Testing | pytest + custom | Combine standard test runner with agent-specific assertions |
| Metrics | Custom framework | Tailored success/quality/efficiency metrics for agent behavior |
| Tracing | LangSmith | Trace every LLM call, tool use, and reasoning step |
| Monitoring | Prometheus | Time-series metrics for production regression detection |
Evaluation Framework
Agent Evaluation Pipeline
Project Structure
agent-evaluation/
├── src/
│ ├── __init__.py
│ ├── benchmarks/
│ │ ├── __init__.py
│ │ ├── task_suite.py # Standard tasks
│ │ ├── edge_cases.py # Edge case tests
│ │ └── adversarial.py # Adversarial tests
│ ├── metrics/
│ │ ├── __init__.py
│ │ ├── success.py # Success metrics
│ │ ├── efficiency.py # Cost/time metrics
│ │ └── quality.py # Output quality
│ ├── evaluator.py # Main evaluator
│ └── reporter.py # Results reporting
├── tests/
├── benchmarks/ # Benchmark data
└── requirements.txt
Implementation
Step 1: Setup
openai>=1.0.0
langchain>=0.1.0
pytest>=7.0.0
pandas>=2.0.0
numpy>=1.24.0
langsmith>=0.0.60
pydantic>=2.0.0
rich>=13.0.0
Step 2: Benchmark Definition
"""
Benchmark definitions for agent evaluation.
"""
from dataclasses import dataclass, field
from typing import Optional, Callable, Any
from enum import Enum
class Difficulty(Enum):
EASY = "easy"
MEDIUM = "medium"
HARD = "hard"
class Category(Enum):
RESEARCH = "research"
REASONING = "reasoning"
CODING = "coding"
WRITING = "writing"
MATH = "math"
MULTI_STEP = "multi_step"
@dataclass
class BenchmarkTask:
"""A single evaluation task."""
id: str
name: str
description: str
input: str
expected_output: Optional[str] = None
validation_fn: Optional[Callable[[str], bool]] = None
category: Category = Category.REASONING
difficulty: Difficulty = Difficulty.MEDIUM
max_time_seconds: int = 60
required_tools: list[str] = field(default_factory=list)
metadata: dict = field(default_factory=dict)
@dataclass
class BenchmarkResult:
"""Result of running a benchmark task."""
task_id: str
success: bool
output: str
time_seconds: float
tokens_used: int
tool_calls: int
error: Optional[str] = None
score: float = 0.0 # 0-1 quality score
class BenchmarkSuite:
"""Collection of benchmark tasks."""
def __init__(self, name: str):
self.name = name
self.tasks: list[BenchmarkTask] = []
def add_task(self, task: BenchmarkTask) -> None:
self.tasks.append(task)
def get_by_category(self, category: Category) -> list[BenchmarkTask]:
return [t for t in self.tasks if t.category == category]
def get_by_difficulty(self, difficulty: Difficulty) -> list[BenchmarkTask]:
        return [t for t in self.tasks if t.difficulty == difficulty]
Understanding Benchmark Design:
Anatomy of a Good Benchmark Task
- Identity: `id`, `name`, `description` — how the task shows up in reports
- Input/Output: `input`, `expected_output`, `validation_fn` — what the agent receives and how its answer is checked
- Metadata: `category`, `difficulty`, `max_time_seconds`, `required_tools` — how results are sliced and constrained
Validation Function Strategies:
Ways to Validate Agent Output
Exact Match (strict)
expected_output="Paris" — Best for simple factual questions.
Keyword Check (flexible)
validation_fn=lambda x: "paris" in x.lower() — Best for questions with multiple valid phrasings.
Multiple Keywords (comprehensive)
validation_fn=lambda x: "guido" in x.lower() and "1991" in x — Best for multi-fact questions.
Regex Pattern (precise)
validation_fn=lambda x: re.search(r'\$?\d+\.?\d*', x) is not None — Best for numeric or formatted outputs.
LLM-as-Judge (semantic)
Recommended. validation_fn=lambda x: llm_judge(x, expected) > 0.8 — Best for open-ended or creative tasks.
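The first four strategies are plain Python; LLM-as-judge needs a helper. A minimal sketch, assuming `complete` is any callable that sends a prompt to a model and returns its text reply (both `llm_judge` and `make_judge_validator` are illustrative names, not part of a library):

```python
# Sketch of an LLM-as-judge validator. `complete` is injected so the
# judge stays model-agnostic and is easy to stub in tests.
from typing import Callable

JUDGE_PROMPT = """Rate how well ANSWER satisfies EXPECTED.
EXPECTED: {expected}
ANSWER: {answer}
Reply with a single number between 0.0 and 1.0 and nothing else."""

def llm_judge(answer: str, expected: str, complete: Callable[[str], str]) -> float:
    """Score an answer 0-1 using an LLM as the judge."""
    reply = complete(JUDGE_PROMPT.format(expected=expected, answer=answer))
    try:
        return max(0.0, min(1.0, float(reply.strip())))
    except ValueError:
        return 0.0  # unparseable judge reply counts as a failed check

def make_judge_validator(expected: str, complete: Callable[[str], str],
                         threshold: float = 0.8) -> Callable[[str], bool]:
    """Build a validation_fn compatible with BenchmarkTask."""
    return lambda output: llm_judge(output, expected, complete) >= threshold
```

A judge prompt that demands "a single number and nothing else" keeps parsing trivial; anything else is treated as a failed check rather than guessed at.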
| Field | Purpose | Example |
|---|---|---|
| `category` | Group tasks by type | RESEARCH, REASONING, CODING |
| `difficulty` | Track performance by complexity | EASY, MEDIUM, HARD |
| `max_time_seconds` | Prevent infinite loops | 30-120 seconds typical |
| `required_tools` | Verify correct tool usage | `["search", "calculate"]` |
Step 3: Task Suite
A good benchmark suite covers a range of categories and difficulty levels so you can pinpoint exactly where your agent struggles. Each task is self-contained: it has an input, a way to verify the output, and metadata for slicing results.
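Step 3 covers the standard suite; the project structure also reserves edge_cases.py and adversarial.py, where the success criterion is often inverted: the correct behavior is to refuse or flag an impossible request rather than invent an answer. A minimal sketch of such a validator (the marker list is illustrative, not exhaustive):

```python
# Sketch of a validation_fn for adversarial/edge-case tasks: "success"
# means the agent declined or flagged the request instead of
# hallucinating an answer.
def refuses_or_flags(output: str) -> bool:
    """Return True if the output declines or flags an impossible request."""
    markers = ("cannot", "can't", "not possible", "no such",
               "does not exist", "unable")
    lowered = output.lower()
    return any(marker in lowered for marker in markers)
```

For example, a task like "What is the capital of Atlantis?" would pass with `validation_fn=refuses_or_flags` only when the agent points out that no such country exists.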
"""
Standard benchmark tasks for agent evaluation.
"""
from . import BenchmarkTask, BenchmarkSuite, Category, Difficulty
def create_standard_suite() -> BenchmarkSuite:
"""Create the standard evaluation benchmark suite."""
suite = BenchmarkSuite("standard")
# Research tasks
suite.add_task(BenchmarkTask(
id="research_001",
name="Simple Fact Lookup",
description="Find a simple factual answer",
input="What is the capital of France?",
expected_output="Paris",
validation_fn=lambda x: "paris" in x.lower(),
category=Category.RESEARCH,
difficulty=Difficulty.EASY,
max_time_seconds=30
))
suite.add_task(BenchmarkTask(
id="research_002",
name="Multi-fact Research",
description="Combine multiple facts",
input="Who created Python and what year was it first released?",
validation_fn=lambda x: "guido" in x.lower() and ("1991" in x or "1989" in x),
category=Category.RESEARCH,
difficulty=Difficulty.MEDIUM,
max_time_seconds=60
))
# Reasoning tasks
suite.add_task(BenchmarkTask(
id="reasoning_001",
name="Simple Logic",
description="Basic logical reasoning",
input="If all cats are mammals, and Whiskers is a cat, is Whiskers a mammal?",
validation_fn=lambda x: "yes" in x.lower() or "mammal" in x.lower(),
category=Category.REASONING,
difficulty=Difficulty.EASY
))
suite.add_task(BenchmarkTask(
id="reasoning_002",
name="Multi-step Reasoning",
description="Chain of reasoning steps",
input="""
Alice is taller than Bob.
Bob is taller than Charlie.
David is shorter than Charlie.
Who is the tallest?
""",
validation_fn=lambda x: "alice" in x.lower(),
category=Category.REASONING,
difficulty=Difficulty.MEDIUM
))
# Math tasks
suite.add_task(BenchmarkTask(
id="math_001",
name="Basic Calculation",
description="Simple arithmetic",
input="What is 247 + 589?",
expected_output="836",
validation_fn=lambda x: "836" in x,
category=Category.MATH,
difficulty=Difficulty.EASY,
required_tools=["calculate"]
))
suite.add_task(BenchmarkTask(
id="math_002",
name="Word Problem",
description="Math word problem",
input="A store sells apples for $2 each. If I buy 15 apples and pay with $50, how much change do I get?",
validation_fn=lambda x: "20" in x or "twenty" in x.lower(),
category=Category.MATH,
difficulty=Difficulty.MEDIUM
))
# Multi-step tasks
suite.add_task(BenchmarkTask(
id="multi_001",
name="Research and Calculate",
description="Combine research with calculation",
input="How many years ago was Python first released? (Current year is 2024)",
validation_fn=lambda x: "33" in x or "35" in x, # 1991 or 1989
category=Category.MULTI_STEP,
difficulty=Difficulty.MEDIUM,
required_tools=["search", "calculate"]
))
suite.add_task(BenchmarkTask(
id="multi_002",
name="Complex Research Task",
description="Multi-step research and synthesis",
input="Compare the population of Tokyo and New York City. Which is larger and by approximately how much?",
validation_fn=lambda x: "tokyo" in x.lower() and any(c.isdigit() for c in x),
category=Category.MULTI_STEP,
difficulty=Difficulty.HARD,
max_time_seconds=120
))
    return suite
Benchmark Design Principles:
┌─────────────────────────────────────────────────────┐
│ A Well-Designed Benchmark Suite │
├─────────────────────────────────────────────────────┤
│ │
│ Categories (breadth) Difficulty (depth) │
│ ├── RESEARCH ├── EASY │
│ ├── REASONING │ └── Simple facts │
│ ├── MATH ├── MEDIUM │
│ └── MULTI_STEP │ └── Multi-fact │
│ └── HARD │
│ └── Synthesis tasks │
│ │
│ Good suite has 5-10 tasks per category, │
│ spread across all difficulty levels. │
└─────────────────────────────────────────────────────┘
| Design Choice | Reasoning |
|---|---|
| Mix easy and hard tasks | Easy tasks establish a baseline; hard tasks differentiate agents |
| Use `required_tools` | Verifies the agent chose the right tool, not just the right answer |
| Keep `max_time_seconds` tight | Forces efficient behavior; an agent that takes 2 minutes per query is unusable |
| Flexible `validation_fn` | Keyword checks are more robust than exact match for natural language output |
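`required_tools` only pays off if you compare it against what the agent actually called. The evaluator in Step 5 records only a count, so this sketch assumes your agent can also report the names of the tools it invoked (e.g. from a trace); the function names are illustrative:

```python
# Sketch: tool-selection accuracy, an extension beyond the raw
# tool_calls count recorded by the evaluator.
def tool_selection_accuracy(required: list[str], actual: list[str]) -> float:
    """Fraction of required tools the agent actually used (1.0 if none required)."""
    if not required:
        return 1.0
    used = set(actual)
    return sum(1 for tool in required if tool in used) / len(required)

def suite_tool_accuracy(tasks, traces: dict[str, list[str]]) -> float:
    """Average selection accuracy over tasks; traces maps task id -> tools called."""
    scores = [tool_selection_accuracy(t.required_tools, traces.get(t.id, []))
              for t in tasks]
    return sum(scores) / len(scores) if scores else 0.0
```

A score below 1.0 on a task the agent nonetheless "passed" is a useful signal: the answer may have been right for the wrong reason (e.g. recalled from training data instead of computed).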
Step 4: Metrics
The metrics module turns raw results into actionable numbers. The key insight: a single "accuracy" number hides important detail. Breaking metrics down by category and difficulty reveals where your agent actually fails.
"""
Evaluation metrics for agents.
"""
from dataclasses import dataclass
from typing import Optional
import numpy as np
@dataclass
class EvaluationMetrics:
"""Comprehensive evaluation metrics."""
# Success metrics
success_rate: float
task_completion_rate: float
# Efficiency metrics
avg_time_seconds: float
avg_tokens: int
avg_tool_calls: float
# Quality metrics
avg_quality_score: float
# Breakdown by category
category_scores: dict[str, float]
difficulty_scores: dict[str, float]
# Failure analysis
failure_reasons: dict[str, int]
def calculate_metrics(results: list) -> EvaluationMetrics:
"""Calculate comprehensive metrics from results."""
if not results:
return EvaluationMetrics(
success_rate=0, task_completion_rate=0,
avg_time_seconds=0, avg_tokens=0, avg_tool_calls=0,
avg_quality_score=0, category_scores={},
difficulty_scores={}, failure_reasons={}
)
successes = [r for r in results if r.success]
# Success metrics
success_rate = len(successes) / len(results)
# Efficiency
times = [r.time_seconds for r in results]
tokens = [r.tokens_used for r in results]
tools = [r.tool_calls for r in results]
# Quality
quality_scores = [r.score for r in results if r.score > 0]
# Category breakdown
category_scores = {}
for r in results:
cat = r.task_id.split("_")[0]
if cat not in category_scores:
category_scores[cat] = []
category_scores[cat].append(1 if r.success else 0)
category_scores = {k: np.mean(v) for k, v in category_scores.items()}
# Failure analysis
failure_reasons = {}
for r in results:
if not r.success and r.error:
reason = r.error[:50]
failure_reasons[reason] = failure_reasons.get(reason, 0) + 1
return EvaluationMetrics(
success_rate=success_rate,
task_completion_rate=success_rate,
avg_time_seconds=np.mean(times),
avg_tokens=int(np.mean(tokens)),
avg_tool_calls=np.mean(tools),
avg_quality_score=np.mean(quality_scores) if quality_scores else 0,
category_scores=category_scores,
difficulty_scores={},
failure_reasons=failure_reasons
    )
Understanding the Metrics:
┌──────────────────────────────────────────────────────┐
│ Metric Categories │
├──────────────────────────────────────────────────────┤
│ │
│ Success Efficiency Quality │
│ ────── ────────── ─────── │
│ success_rate avg_time_seconds avg_quality │
│ (binary: did (wall clock per (0-1 score, │
│ it work?) task) semantic) │
│ avg_tokens │
│ (cost proxy) │
│ avg_tool_calls │
│ (complexity) │
│ │
│ An agent with 95% success but 60s avg_time │
│ may be worse than one with 90% success and 5s. │
└──────────────────────────────────────────────────────┘
How to Interpret Results:
| Metric | Good Sign | Red Flag |
|---|---|---|
| `success_rate` | Consistent across categories | High overall but 0% in one category |
| `avg_time_seconds` | Under 10s for simple tasks | Over 60s for easy tasks (likely looping) |
| `avg_tool_calls` | 1-3 per task | 10+ per task (agent is thrashing) |
| `failure_reasons` | Few unique reasons | Same error repeated (systematic bug) |
| `category_scores` | Even distribution | One category far below others |
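Note that `calculate_metrics` above returns `difficulty_scores` as an empty dict, because `BenchmarkResult` carries no difficulty field. A sketch of recovering the breakdown by looking difficulty up from the suite's tasks (the function name is illustrative):

```python
# Sketch: filling in the difficulty_scores that calculate_metrics
# leaves empty, by mapping each result's task_id back to its task.
def difficulty_breakdown(tasks, results) -> dict[str, float]:
    """Success rate per difficulty level ('easy', 'medium', 'hard')."""
    levels = {t.id: t.difficulty.value for t in tasks}
    buckets: dict[str, list[int]] = {}
    for r in results:
        level = levels.get(r.task_id, "unknown")
        buckets.setdefault(level, []).append(1 if r.success else 0)
    return {level: sum(v) / len(v) for level, v in buckets.items()}
```

Passing the suite alongside the results keeps `BenchmarkResult` lean while still allowing any task-level metadata (difficulty, category, required tools) to be joined back in at reporting time.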
Step 5: Evaluator
"""
Main agent evaluator.
"""
import time
from typing import Callable, Optional
from dataclasses import dataclass
from .benchmarks import BenchmarkSuite, BenchmarkTask, BenchmarkResult
from .metrics import calculate_metrics, EvaluationMetrics
@dataclass
class EvaluationConfig:
"""Configuration for evaluation run."""
timeout_per_task: int = 120
retry_failed: bool = False
max_retries: int = 1
parallel: bool = False
verbose: bool = True
class AgentEvaluator:
"""
Evaluates agent performance on benchmark suites.
"""
def __init__(
self,
agent_fn: Callable[[str], str],
        config: Optional[EvaluationConfig] = None
):
"""
Initialize evaluator.
Args:
agent_fn: Function that takes input string and returns output
config: Evaluation configuration
"""
self.agent_fn = agent_fn
self.config = config or EvaluationConfig()
self.results: list[BenchmarkResult] = []
def evaluate(self, suite: BenchmarkSuite) -> EvaluationMetrics:
"""
Run evaluation on a benchmark suite.
Args:
suite: The benchmark suite to evaluate
Returns:
Comprehensive evaluation metrics
"""
self.results = []
if self.config.verbose:
print(f"\n📊 Evaluating on {suite.name} ({len(suite.tasks)} tasks)")
print("=" * 50)
for task in suite.tasks:
result = self._evaluate_task(task)
self.results.append(result)
if self.config.verbose:
status = "✅" if result.success else "❌"
print(f"{status} {task.name}: {result.score:.2f}")
metrics = calculate_metrics(self.results)
if self.config.verbose:
self._print_summary(metrics)
return metrics
def _evaluate_task(self, task: BenchmarkTask) -> BenchmarkResult:
"""Evaluate a single task."""
start_time = time.time()
try:
# Run the agent
output = self.agent_fn(task.input)
elapsed = time.time() - start_time
# Validate output
success = False
score = 0.0
if task.validation_fn:
success = task.validation_fn(output)
score = 1.0 if success else 0.0
elif task.expected_output:
success = task.expected_output.lower() in output.lower()
score = 1.0 if success else 0.0
else:
# No validation - assume success if we got output
success = len(output) > 0
score = 0.5
return BenchmarkResult(
task_id=task.id,
success=success,
output=output,
time_seconds=elapsed,
tokens_used=0, # Would need agent to report this
tool_calls=0,
score=score
)
except Exception as e:
return BenchmarkResult(
task_id=task.id,
success=False,
output="",
time_seconds=time.time() - start_time,
tokens_used=0,
tool_calls=0,
error=str(e),
score=0.0
)
def _print_summary(self, metrics: EvaluationMetrics) -> None:
"""Print evaluation summary."""
print("\n" + "=" * 50)
print("📈 EVALUATION SUMMARY")
print("=" * 50)
print(f"Success Rate: {metrics.success_rate:.1%}")
print(f"Avg Time: {metrics.avg_time_seconds:.2f}s")
print(f"Avg Quality: {metrics.avg_quality_score:.2f}")
if metrics.category_scores:
print("\nBy Category:")
for cat, score in metrics.category_scores.items():
print(f" {cat}: {score:.1%}")
if metrics.failure_reasons:
print("\nTop Failure Reasons:")
sorted_failures = sorted(
metrics.failure_reasons.items(),
key=lambda x: x[1],
reverse=True
)[:3]
for reason, count in sorted_failures:
            print(f"  {reason}: {count}")
Understanding the Evaluation Flow:
Evaluation Execution Flow (For Each Task in Suite)
The Agent-Agnostic Pattern:
# The evaluator doesn't care HOW your agent works
# It only needs a function: input → output
def my_react_agent(input: str) -> str:
return react_loop(input)
def my_planning_agent(input: str) -> str:
return plan_and_execute(input)
# Both work with the same evaluator!
evaluator = AgentEvaluator(agent_fn=my_react_agent)
evaluator = AgentEvaluator(agent_fn=my_planning_agent)
| Design Choice | Benefit |
|---|---|
| Function interface | Any agent implementation works |
| Timeout handling | Catches infinite loops |
| Exception catching | Agent crashes don't crash evaluator |
| Category tracking | Identifies weak areas |
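One caveat on the timeout row: `EvaluationConfig` defines `timeout_per_task`, but `_evaluate_task` as written never enforces it. One way to add enforcement, sketched with a thread pool (this bounds how long the evaluator waits; a truly hung agent thread is not killed and may linger until it finishes):

```python
# Sketch: enforcing timeout_per_task, which EvaluationConfig defines
# but the evaluator above does not apply.
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def run_with_timeout(agent_fn, task_input: str, timeout_s: float) -> str:
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(agent_fn, task_input)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        raise TimeoutError(f"agent exceeded {timeout_s}s")
    finally:
        pool.shutdown(wait=False)  # don't block on a hung worker thread
```

Inside `_evaluate_task` you would replace the direct call with `output = run_with_timeout(self.agent_fn, task.input, self.config.timeout_per_task)`; the existing `except` branch then records the timeout as a normal failure with an error message.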
Step 6: Reporter
"""
Generate evaluation reports.
"""
from dataclasses import dataclass
from typing import Optional
import json
from datetime import datetime
from .metrics import EvaluationMetrics
from .benchmarks import BenchmarkResult
@dataclass
class EvaluationReport:
"""Complete evaluation report."""
agent_name: str
timestamp: str
metrics: EvaluationMetrics
results: list[BenchmarkResult]
config: dict
class ReportGenerator:
"""Generate various report formats."""
def __init__(self, agent_name: str):
self.agent_name = agent_name
def generate(
self,
metrics: EvaluationMetrics,
results: list[BenchmarkResult],
config: dict = None
) -> EvaluationReport:
"""Generate a complete evaluation report."""
return EvaluationReport(
agent_name=self.agent_name,
timestamp=datetime.now().isoformat(),
metrics=metrics,
results=results,
config=config or {}
)
def to_markdown(self, report: EvaluationReport) -> str:
"""Generate markdown report."""
md = f"""# Agent Evaluation Report
**Agent:** {report.agent_name}
**Date:** {report.timestamp}
## Summary Metrics
| Metric | Value |
|--------|-------|
| Success Rate | {report.metrics.success_rate:.1%} |
| Avg Time | {report.metrics.avg_time_seconds:.2f}s |
| Avg Quality | {report.metrics.avg_quality_score:.2f} |
## Category Breakdown
| Category | Success Rate |
|----------|--------------|
"""
for cat, score in report.metrics.category_scores.items():
md += f"| {cat} | {score:.1%} |\n"
md += "\n## Individual Results\n\n"
for r in report.results:
status = "✅" if r.success else "❌"
md += f"- {status} **{r.task_id}**: {r.score:.2f} ({r.time_seconds:.1f}s)\n"
return md
def to_json(self, report: EvaluationReport) -> str:
"""Generate JSON report."""
return json.dumps({
"agent_name": report.agent_name,
"timestamp": report.timestamp,
"metrics": {
"success_rate": report.metrics.success_rate,
"avg_time": report.metrics.avg_time_seconds,
"avg_quality": report.metrics.avg_quality_score,
"category_scores": report.metrics.category_scores
},
"results": [
{
"task_id": r.task_id,
"success": r.success,
"score": r.score,
"time": r.time_seconds
}
for r in report.results
]
        }, indent=2)
Why Two Output Formats?
| Format | Use Case |
|---|---|
| Markdown | Human-readable reports for code reviews, PRs, and team communication |
| JSON | Machine-readable for CI/CD pipelines, dashboards, and automated regression detection |
In practice, you would save the JSON after every evaluation run and compare it programmatically against previous runs to detect regressions before merging code.
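That comparison can be a few lines of code. A sketch that diffs two reports in the `to_json` format above (`detect_regressions` and the 5% tolerance are illustrative choices, not part of the framework):

```python
# Sketch: programmatic regression detection between two saved JSON
# reports produced by ReportGenerator.to_json. Returns a list of
# problems; an empty list means no regression beyond `tolerance`.
import json

def detect_regressions(old_json: str, new_json: str,
                       tolerance: float = 0.05) -> list[str]:
    old, new = json.loads(old_json), json.loads(new_json)
    problems = []
    drop = old["metrics"]["success_rate"] - new["metrics"]["success_rate"]
    if drop > tolerance:
        problems.append(f"overall success rate dropped {drop:.1%}")
    for cat, old_score in old["metrics"]["category_scores"].items():
        new_score = new["metrics"]["category_scores"].get(cat, 0.0)
        if old_score - new_score > tolerance:
            problems.append(f"category '{cat}' dropped {old_score - new_score:.1%}")
    return problems
```

Checking per-category scores as well as the overall rate matters: a small overall dip can hide a complete collapse in one category that the aggregate number averages away.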
Running Evaluations
from src.evaluator import AgentEvaluator, EvaluationConfig
from src.benchmarks.task_suite import create_standard_suite
from src.reporter import ReportGenerator
# Your agent function
def my_agent(input_text: str) -> str:
# Your agent implementation
return agent.run(input_text)
# Create evaluator
evaluator = AgentEvaluator(
agent_fn=my_agent,
config=EvaluationConfig(verbose=True)
)
# Run evaluation
suite = create_standard_suite()
metrics = evaluator.evaluate(suite)
# Generate report
reporter = ReportGenerator("MyAgent v1.0")
report = reporter.generate(metrics, evaluator.results)
print(reporter.to_markdown(report))
Key Metrics
| Metric | Description | Target |
|---|---|---|
| Success Rate | Tasks completed correctly | > 80% |
| Avg Time | Time per task | < 30s |
| Quality Score | Output quality rating | > 0.8 |
| Tool Efficiency | Tools used per task | < 5 |
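These targets are easiest to enforce as an automated gate. A sketch with a hypothetical `check_targets` helper; in CI you would run `AgentEvaluator` on the standard suite and then assert, from a pytest test, that the returned list is empty:

```python
# Sketch: turning the target table above into a CI gate. The names
# Targets and check_targets are illustrative.
from dataclasses import dataclass

@dataclass
class Targets:
    min_success_rate: float = 0.80
    max_avg_time_s: float = 30.0
    min_quality: float = 0.80
    max_tool_calls: float = 5.0

def check_targets(metrics, t: Targets = Targets()) -> list[str]:
    """Return the list of violated targets (empty list = gate passes)."""
    failures = []
    if metrics.success_rate < t.min_success_rate:
        failures.append(f"success rate {metrics.success_rate:.1%} below {t.min_success_rate:.0%}")
    if metrics.avg_time_seconds > t.max_avg_time_s:
        failures.append(f"avg time {metrics.avg_time_seconds:.1f}s above {t.max_avg_time_s:.0f}s")
    if metrics.avg_quality_score < t.min_quality:
        failures.append(f"quality {metrics.avg_quality_score:.2f} below {t.min_quality:.2f}")
    if metrics.avg_tool_calls > t.max_tool_calls:
        failures.append(f"tool calls {metrics.avg_tool_calls:.1f} above {t.max_tool_calls:.0f}")
    return failures
```

In a pytest test this collapses to `assert not check_targets(metrics), check_targets(metrics)`, which fails the build with the full list of violations.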
Key Concepts Recap
| Concept | What It Is | Why It Matters |
|---|---|---|
| Benchmark Suite | Collection of test tasks | Standardized way to measure agents |
| Validation Function | Checks if output is correct | Automates success determination |
| Success Rate | % of tasks completed correctly | Primary performance metric |
| Quality Score | How good the output is (0-1) | Measures beyond pass/fail |
| Category Breakdown | Success by task type | Identifies agent strengths/weaknesses |
| Regression Testing | Compare across versions | Catches performance degradation |
| Failure Analysis | Why tasks failed | Guides improvement efforts |
Summary
You've built a comprehensive agent evaluation framework that:
- Defines standardized benchmark tasks
- Measures success, efficiency, and quality
- Generates detailed reports
- Enables regression testing across agent versions
Agent Security & Safe Deployment
Build secure autonomous agents with threat modeling, guardrails, and safe tool access
Bull vs Bear Market Analyst
Build an adversarial debate agent where Bull and Bear personas argue opposing investment theses, with a Judge scoring arguments and a Synthesizer providing balanced risk-aware recommendations