SLM Evaluation & Benchmarking
Compare and evaluate different small language models
Build a comprehensive benchmarking suite to evaluate and compare small language models across quality, speed, and resource usage metrics. Learn to make data-driven decisions about which model best fits your use case.
TL;DR
Benchmark SLMs across three dimensions: quality (accuracy, F1), speed (latency, throughput), and resources (memory, CPU). Key metrics: tokens/second (20-50 typical), time-to-first-token (<500ms good), and peak memory. Use standard datasets (MMLU, HellaSwag) for comparability. The "best" model depends on your tradeoff: Phi-3 for accuracy, Gemma for speed, SmolLM for constrained environments.
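The tokens/second figure above is simply completion tokens divided by wall-clock generation time. A minimal sketch (the helper name is illustrative, not part of the project code):

```python
def tokens_per_second(completion_tokens: int, seconds: float) -> float:
    """Throughput = completion tokens / wall-clock generation time."""
    return completion_tokens / seconds if seconds > 0 else 0.0

# 128 tokens generated in 4 seconds -> 32 tok/s, inside the 20-50 typical range
print(tokens_per_second(128, 4.0))  # 32.0
```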
Project Overview
| Aspect | Details |
|---|---|
| Difficulty | Beginner |
| Time | 3-4 hours |
| Prerequisites | Local SLM Setup |
| What You'll Build | Model benchmarking framework with automated evaluation and reporting |
What You'll Learn
- Standard benchmark datasets (MMLU, HellaSwag, TruthfulQA)
- Quality metrics (accuracy, perplexity, F1 score)
- Performance metrics (latency, throughput, memory usage)
- Building automated benchmarking pipelines
- Visualization and comparison reports
- Task-specific evaluation strategies
Architecture Overview
┌─────────────────────────────────────────────────────────────────────────────┐
│ Benchmarking Framework Architecture │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ Benchmark Suite │ │
│ │ ┌──────────────┐ ┌───────────────┐ ┌──────────────┐ │ │
│ │ │ Datasets │ │Task Definitions│ │Model Registry│ │ │
│ │ │ (MMLU, etc.) │ │ (MC, QA, etc) │ │ (Phi, Qwen) │ │ │
│ │ └──────┬───────┘ └───────┬───────┘ └──────┬───────┘ │ │
│ └───────────┼──────────────────┼─────────────────┼──────────────────────┘ │
│ └──────────────────┼─────────────────┘ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ Evaluation Engine │ │
│ │ │ │
│ │ ┌──────────────────────────────────────────────────────────┐ │ │
│ │ │ Test Runner │ │ │
│ │ │ Run samples ──▶ Collect responses ──▶ Measure │ │ │
│ │ └─────────────────────────┬────────────────────────────────┘ │ │
│ │ │ │ │
│ │ ┌───────────────┴───────────────┐ │ │
│ │ ▼ ▼ │ │
│ │ ┌─────────────────┐ ┌─────────────────┐ │ │
│ │ │Metrics Collector│ │Resource Profiler│ │ │
│ │ │ • Accuracy │ │ • Memory (MB) │ │ │
│ │ │ • F1 Score │ │ • CPU % │ │ │
│ │ │ • Latency │ │ • GPU usage │ │ │
│ │ └────────┬────────┘ └────────┬────────┘ │ │
│ └─────────────┼──────────────────────────────┼──────────────────────────┘ │
│ └─────────────┬────────────────┘ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ Analysis & Reporting │ │
│ │ │ Results │ │ Visualizer │ │ Report │ │ │
│ │ │ Aggregator │─▶│ (charts) │─▶│ Generator │ │ │
│ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Project Setup
Install Dependencies
# Create project directory
mkdir slm-benchmarking && cd slm-benchmarking
# Create virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# Install dependencies
pip install ollama datasets evaluate pandas numpy matplotlib seaborn
pip install psutil gputil tqdm rich tabulate scikit-learn
Pull Models to Benchmark
# Pull various SLMs for comparison
ollama pull phi3:mini # 2.3GB - Microsoft
ollama pull qwen2.5:3b # 2.0GB - Alibaba
ollama pull gemma2:2b # 1.6GB - Google
ollama pull llama3.2:3b # 2.0GB - Meta
ollama pull smollm2:1.7b # 1.0GB - HuggingFace
Part 1: Benchmark Framework Core
Model Registry
Create a registry to manage models being evaluated.
# models.py
from dataclasses import dataclass
from typing import Optional, List, Dict, Any
import ollama
@dataclass
class ModelInfo:
"""Information about a model to benchmark."""
name: str
provider: str
parameters: str
quantization: str = "Q4_K_M"
context_length: int = 4096
ollama_name: Optional[str] = None
@property
def id(self) -> str:
return self.ollama_name or self.name
# Model registry
MODELS = {
"phi3-mini": ModelInfo(
name="Phi-3 Mini",
provider="Microsoft",
parameters="3.8B",
ollama_name="phi3:mini",
context_length=4096
),
"qwen2.5-3b": ModelInfo(
name="Qwen 2.5",
provider="Alibaba",
parameters="3B",
ollama_name="qwen2.5:3b",
context_length=8192
),
"gemma2-2b": ModelInfo(
name="Gemma 2",
provider="Google",
parameters="2B",
ollama_name="gemma2:2b",
context_length=8192
),
"llama3.2-3b": ModelInfo(
name="Llama 3.2",
provider="Meta",
parameters="3B",
ollama_name="llama3.2:3b",
context_length=8192
),
"smollm2-1.7b": ModelInfo(
name="SmolLM 2",
provider="HuggingFace",
parameters="1.7B",
ollama_name="smollm2:1.7b",
context_length=2048
),
}
class ModelClient:
"""Wrapper for model inference."""
def __init__(self, model_info: ModelInfo):
self.model_info = model_info
def generate(
self,
prompt: str,
max_tokens: int = 256,
temperature: float = 0.0
) -> Dict[str, Any]:
"""Generate a response and return with metadata."""
import time
start_time = time.perf_counter()
response = ollama.chat(
model=self.model_info.id,
messages=[{"role": "user", "content": prompt}],
options={
"temperature": temperature,
"num_predict": max_tokens,
}
)
end_time = time.perf_counter()
return {
"content": response["message"]["content"],
"latency_ms": (end_time - start_time) * 1000,
"prompt_tokens": response.get("prompt_eval_count", 0),
"completion_tokens": response.get("eval_count", 0),
"total_tokens": response.get("prompt_eval_count", 0) + response.get("eval_count", 0)
}
def check_available(self) -> bool:
"""Check if model is available locally."""
try:
models = ollama.list()
available = [m["name"] for m in models.get("models", [])]
return any(self.model_info.id in m for m in available)
except Exception:
return False
def get_model(model_key: str) -> ModelClient:
"""Get a model client by key."""
if model_key not in MODELS:
raise ValueError(f"Unknown model: {model_key}")
return ModelClient(MODELS[model_key])
def list_models() -> List[ModelInfo]:
"""List all registered models."""
return list(MODELS.values())
What's Happening Here?
The Model Registry provides a unified interface for managing different SLMs:
┌─────────────────────────────────────────────────────────────────────────────┐
│ Model Registry Architecture │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ MODELS Dictionary (Model Catalog) │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ "phi3-mini" → ModelInfo(name="Phi-3 Mini", params="3.8B", ...) │ │
│ │ "qwen2.5-3b" → ModelInfo(name="Qwen 2.5", params="3B", ...) │ │
│ │ "gemma2-2b" → ModelInfo(name="Gemma 2", params="2B", ...) │ │
│ │ "llama3.2-3b" → ModelInfo(name="Llama 3.2", params="3B", ...) │ │
│ │ "smollm2-1.7b" → ModelInfo(name="SmolLM 2", params="1.7B", ...) │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ModelClient (Unified Interface) │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ generate(prompt) ─────────────────────────────────────────────────────► │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────────────────────────────────┐ │ │
│ │ │ 1. Start timer │ │ │
│ │ │ 2. Call ollama.chat(model=..., messages=[...]) │ │ │
│ │ │ 3. Stop timer │ │ │
│ │ │ 4. Return {content, latency_ms, tokens} │ │ │
│ │ └─────────────────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ Same interface regardless of which model you use! │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Model Characteristics Explained:
| Model | Strengths | Weaknesses | Best Use Case |
|---|---|---|---|
| Phi-3 Mini (3.8B) | Strong reasoning, math | Slower inference | Complex tasks, coding |
| Qwen 2.5 (3B) | Multilingual, extraction | Higher memory | International apps |
| Gemma 2 (2B) | Fast, efficient | Lower accuracy | Real-time applications |
| Llama 3.2 (3B) | Balanced performance | Moderate speed | General purpose |
| SmolLM 2 (1.7B) | Very fast, tiny | Limited capability | Edge devices, simple tasks |
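The registry pattern above can be exercised standalone. This sketch trims the registry to two entries and adds a hypothetical `smallest_model` helper (not part of the project code) to show how the metadata supports model selection before any inference runs:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ModelInfo:
    name: str
    parameters: str            # e.g. "1.7B"
    context_length: int = 4096
    ollama_name: Optional[str] = None

    @property
    def id(self) -> str:
        # Fall back to the display name when no Ollama tag is registered
        return self.ollama_name or self.name

# Trimmed registry mirroring the full MODELS dict above
MODELS = {
    "gemma2-2b":    ModelInfo("Gemma 2", "2B", 8192, "gemma2:2b"),
    "smollm2-1.7b": ModelInfo("SmolLM 2", "1.7B", 2048, "smollm2:1.7b"),
}

def smallest_model() -> ModelInfo:
    # Rank by parameter count: "1.7B" -> 1.7
    return min(MODELS.values(), key=lambda m: float(m.parameters.rstrip("B")))

print(smallest_model().id)  # smollm2:1.7b
```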
Why Use ModelInfo Dataclass?
┌─────────────────────────────────────────────────────────────────────────────┐
│ Consistent metadata for every model │
│ │
│ @dataclass │
│ class ModelInfo: │
│ name: str ← Human-readable display name │
│ provider: str ← Company/org that made it │
│ parameters: str ← Model size (affects quality & speed) │
│ quantization: str ← Compression level (Q4_K_M = good balance) │
│ context_length: int ← Max tokens model can process at once │
│ ollama_name: str ← Exact name to pass to Ollama │
│ │
│ This metadata helps with: │
│ • Generating reports with consistent model names │
│ • Checking if model fits your hardware (context_length × 2 ≈ RAM needed) │
│ • Understanding tradeoffs before running benchmarks │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Benchmark Dataset Manager
Create a dataset loader for standard benchmarks.
# datasets_manager.py
from dataclasses import dataclass
from typing import List, Dict, Any, Optional, Iterator
from enum import Enum
import json
import random
class TaskType(str, Enum):
MULTIPLE_CHOICE = "multiple_choice"
TEXT_GENERATION = "text_generation"
CLASSIFICATION = "classification"
QA = "qa"
SUMMARIZATION = "summarization"
@dataclass
class BenchmarkSample:
"""A single benchmark sample."""
id: str
prompt: str
expected: str
task_type: TaskType
category: Optional[str] = None
metadata: Optional[Dict[str, Any]] = None
@dataclass
class BenchmarkDataset:
"""A benchmark dataset."""
name: str
description: str
task_type: TaskType
samples: List[BenchmarkSample]
def __len__(self) -> int:
return len(self.samples)
def __iter__(self) -> Iterator[BenchmarkSample]:
return iter(self.samples)
def sample(self, n: int, seed: int = 42) -> "BenchmarkDataset":
"""Get a random sample of the dataset."""
random.seed(seed)
sampled = random.sample(self.samples, min(n, len(self.samples)))
return BenchmarkDataset(
name=self.name,
description=self.description,
task_type=self.task_type,
samples=sampled
)
# Built-in benchmark datasets
def create_mmlu_subset() -> BenchmarkDataset:
"""Create MMLU-style multiple choice questions."""
samples = [
BenchmarkSample(
id="mmlu_1",
prompt="""Question: What is the capital of France?
A) London
B) Berlin
C) Paris
D) Madrid
Answer with just the letter (A, B, C, or D):""",
expected="C",
task_type=TaskType.MULTIPLE_CHOICE,
category="geography"
),
BenchmarkSample(
id="mmlu_2",
prompt="""Question: Which planet is known as the Red Planet?
A) Venus
B) Mars
C) Jupiter
D) Saturn
Answer with just the letter (A, B, C, or D):""",
expected="B",
task_type=TaskType.MULTIPLE_CHOICE,
category="astronomy"
),
BenchmarkSample(
id="mmlu_3",
prompt="""Question: What is the chemical symbol for gold?
A) Ag
B) Fe
C) Au
D) Cu
Answer with just the letter (A, B, C, or D):""",
expected="C",
task_type=TaskType.MULTIPLE_CHOICE,
category="chemistry"
),
BenchmarkSample(
id="mmlu_4",
prompt="""Question: Who wrote "Romeo and Juliet"?
A) Charles Dickens
B) William Shakespeare
C) Jane Austen
D) Mark Twain
Answer with just the letter (A, B, C, or D):""",
expected="B",
task_type=TaskType.MULTIPLE_CHOICE,
category="literature"
),
BenchmarkSample(
id="mmlu_5",
prompt="""Question: What is the derivative of x^2?
A) x
B) 2x
C) x^2
D) 2
Answer with just the letter (A, B, C, or D):""",
expected="B",
task_type=TaskType.MULTIPLE_CHOICE,
category="math"
),
BenchmarkSample(
id="mmlu_6",
prompt="""Question: What is the time complexity of binary search?
A) O(n)
B) O(n^2)
C) O(log n)
D) O(1)
Answer with just the letter (A, B, C, or D):""",
expected="C",
task_type=TaskType.MULTIPLE_CHOICE,
category="computer_science"
),
BenchmarkSample(
id="mmlu_7",
prompt="""Question: What is the largest organ in the human body?
A) Heart
B) Brain
C) Liver
D) Skin
Answer with just the letter (A, B, C, or D):""",
expected="D",
task_type=TaskType.MULTIPLE_CHOICE,
category="biology"
),
BenchmarkSample(
id="mmlu_8",
prompt="""Question: In which year did World War II end?
A) 1943
B) 1944
C) 1945
D) 1946
Answer with just the letter (A, B, C, or D):""",
expected="C",
task_type=TaskType.MULTIPLE_CHOICE,
category="history"
),
]
return BenchmarkDataset(
name="MMLU-Mini",
description="Subset of MMLU-style questions",
task_type=TaskType.MULTIPLE_CHOICE,
samples=samples
)
def create_commonsense_qa() -> BenchmarkDataset:
"""Create commonsense reasoning questions."""
samples = [
BenchmarkSample(
id="csqa_1",
prompt="""Question: Where would you put a plant that needs sunlight?
A) In a closet
B) Under a bed
C) Near a window
D) In a basement
Answer with just the letter:""",
expected="C",
task_type=TaskType.MULTIPLE_CHOICE,
category="commonsense"
),
BenchmarkSample(
id="csqa_2",
prompt="""Question: What do you use to cut paper?
A) Hammer
B) Scissors
C) Spoon
D) Pillow
Answer with just the letter:""",
expected="B",
task_type=TaskType.MULTIPLE_CHOICE,
category="commonsense"
),
BenchmarkSample(
id="csqa_3",
prompt="""Question: If it's raining, what should you bring outside?
A) Sunglasses
B) Umbrella
C) Sunscreen
D) Ice cream
Answer with just the letter:""",
expected="B",
task_type=TaskType.MULTIPLE_CHOICE,
category="commonsense"
),
BenchmarkSample(
id="csqa_4",
prompt="""Question: What happens to water when it freezes?
A) It becomes a gas
B) It becomes solid
C) It disappears
D) It gets warmer
Answer with just the letter:""",
expected="B",
task_type=TaskType.MULTIPLE_CHOICE,
category="commonsense"
),
]
return BenchmarkDataset(
name="CommonsenseQA-Mini",
description="Commonsense reasoning questions",
task_type=TaskType.MULTIPLE_CHOICE,
samples=samples
)
def create_coding_benchmark() -> BenchmarkDataset:
"""Create coding knowledge questions."""
samples = [
BenchmarkSample(
id="code_1",
prompt="""What does this Python code output?
```python
x = [1, 2, 3]
print(len(x))
```
Answer with just the output:""",
expected="3",
task_type=TaskType.QA,
category="python"
),
BenchmarkSample(
id="code_2",
prompt="""What does this Python code output?
```python
x = "hello"
print(x.upper())
```
Answer with just the output:""",
expected="HELLO",
task_type=TaskType.QA,
category="python"
),
BenchmarkSample(
id="code_3",
prompt="""What does this Python code output?
```python
x = [1, 2, 3]
x.append(4)
print(x[-1])
```
Answer with just the output:""",
expected="4",
task_type=TaskType.QA,
category="python"
),
BenchmarkSample(
id="code_4",
prompt="""What does this Python code output?
```python
x = {"a": 1, "b": 2}
print(x.get("c", 0))
```
Answer with just the output:""",
expected="0",
task_type=TaskType.QA,
category="python"
),
]
return BenchmarkDataset(
name="CodingQA-Mini",
description="Basic coding knowledge questions",
task_type=TaskType.QA,
samples=samples
)
def create_classification_benchmark() -> BenchmarkDataset:
"""Create sentiment classification samples."""
samples = [
BenchmarkSample(
id="sent_1",
prompt="""Classify the sentiment of this text as positive, negative, or neutral.
Text: "I absolutely loved this movie! The acting was superb."
Sentiment:""",
expected="positive",
task_type=TaskType.CLASSIFICATION,
category="sentiment"
),
BenchmarkSample(
id="sent_2",
prompt="""Classify the sentiment of this text as positive, negative, or neutral.
Text: "This product broke after one day. Complete waste of money."
Sentiment:""",
expected="negative",
task_type=TaskType.CLASSIFICATION,
category="sentiment"
),
BenchmarkSample(
id="sent_3",
prompt="""Classify the sentiment of this text as positive, negative, or neutral.
Text: "The meeting is scheduled for 3pm tomorrow."
Sentiment:""",
expected="neutral",
task_type=TaskType.CLASSIFICATION,
category="sentiment"
),
BenchmarkSample(
id="sent_4",
prompt="""Classify the sentiment of this text as positive, negative, or neutral.
Text: "Worst customer service I've ever experienced."
Sentiment:""",
expected="negative",
task_type=TaskType.CLASSIFICATION,
category="sentiment"
),
BenchmarkSample(
id="sent_5",
prompt="""Classify the sentiment of this text as positive, negative, or neutral.
Text: "This restaurant exceeded all my expectations!"
Sentiment:""",
expected="positive",
task_type=TaskType.CLASSIFICATION,
category="sentiment"
),
]
return BenchmarkDataset(
name="Sentiment-Mini",
description="Sentiment classification benchmark",
task_type=TaskType.CLASSIFICATION,
samples=samples
)
# Dataset registry
DATASETS = {
"mmlu": create_mmlu_subset,
"commonsense": create_commonsense_qa,
"coding": create_coding_benchmark,
"sentiment": create_classification_benchmark,
}
def get_dataset(name: str) -> BenchmarkDataset:
"""Get a dataset by name."""
if name not in DATASETS:
raise ValueError(f"Unknown dataset: {name}")
return DATASETS[name]()
def list_datasets() -> List[str]:
"""List all available datasets."""
return list(DATASETS.keys())
What's Happening Here?
The Dataset Manager provides standardized benchmarks for consistent evaluation:
┌─────────────────────────────────────────────────────────────────────────────┐
│ Benchmark Dataset Structure │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ BenchmarkDataset │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ name: "MMLU-Mini" │ │
│ │ description: "Subset of MMLU-style questions" │ │
│ │ task_type: TaskType.MULTIPLE_CHOICE │ │
│ │ samples: [ │ │
│ │ BenchmarkSample( │ │
│ │ id: "mmlu_1", │ │
│ │ prompt: "Question: What is the capital of France?\nA) London...", │ │
│ │ expected: "C", │ │
│ │ category: "geography" │ │
│ │ ), │ │
│ │ BenchmarkSample(...), │ │
│ │ ... │ │
│ │ ] │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │
│ Why This Structure? │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ • id: Track which questions the model got right/wrong │ │
│ │ • prompt: Exact text sent to model (includes formatting!) │ │
│ │ • expected: Ground truth for automatic scoring │ │
│ │ • task_type: Determines which metric to use (accuracy vs F1) │ │
│ │ • category: Enable per-topic analysis (math vs history) │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Task Types and Their Purposes:
| Task Type | Example | Metric Used | What It Tests |
|---|---|---|---|
| MULTIPLE_CHOICE | MMLU, HellaSwag | MC Accuracy | Knowledge, reasoning |
| TEXT_GENERATION | Summarization | ROUGE, BLEU | Fluency, coherence |
| CLASSIFICATION | Sentiment | F1, Precision, Recall | Categorization |
| QA | Coding questions | Exact match | Factual accuracy |
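The routing from task type to primary metric can be sketched with a plain lookup table. This mirrors the idea behind `get_metric_for_task` in Part 2, with metric names as illustrative strings:

```python
from enum import Enum

class TaskType(str, Enum):
    MULTIPLE_CHOICE = "multiple_choice"
    TEXT_GENERATION = "text_generation"
    CLASSIFICATION = "classification"
    QA = "qa"

# Primary metric per task type, matching the table above
PRIMARY_METRIC = {
    TaskType.MULTIPLE_CHOICE: "mc_accuracy",
    TaskType.TEXT_GENERATION: "rouge",
    TaskType.CLASSIFICATION: "f1",
    TaskType.QA: "exact_match",
}

def metric_for(task: TaskType) -> str:
    # Fall back to exact match for unknown task types
    return PRIMARY_METRIC.get(task, "exact_match")

print(metric_for(TaskType.CLASSIFICATION))  # f1
```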
Understanding MMLU-Style Prompts:
┌─────────────────────────────────────────────────────────────────────────────┐
│ Prompt Engineering for Benchmarks │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Prompt Format (CRITICAL for reliable extraction): │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ Question: What is the capital of France? │ │
│ │ A) London ◄── Options clearly labeled │ │
│ │ B) Berlin │ │
│ │ C) Paris │ │
│ │ D) Madrid │ │
│ │ │ │
│ │ Answer with just the letter (A, B, C, or D): ◄── EXPLICIT instruction │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │
│ Without the explicit instruction, models might respond: │
│ ❌ "The capital of France is Paris" │
│ ❌ "C) Paris is the capital" │
│ ❌ "Based on my knowledge, the answer is C..." │
│ │
│ With the instruction, models respond: │
│ ✓ "C" │
│ ✓ "C)" │
│ ✓ "C." │
│ │
│ All of these can be parsed by the MultipleChoiceAccuracy metric! │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Part 2: Metrics and Evaluation
Metrics Collectors
Implement various evaluation metrics.
# metrics.py
from typing import List, Dict, Any, Optional
from dataclasses import dataclass, field
import re
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
@dataclass
class MetricResult:
"""Result of a metric calculation."""
name: str
value: float
details: Dict[str, Any] = field(default_factory=dict)
class MetricCalculator:
"""Base class for metric calculators."""
def calculate(
self,
predictions: List[str],
references: List[str]
) -> MetricResult:
raise NotImplementedError
class AccuracyMetric(MetricCalculator):
"""Exact match accuracy."""
def __init__(self, normalize: bool = True):
self.normalize = normalize
def _normalize(self, text: str) -> str:
"""Normalize text for comparison."""
if not self.normalize:
return text
# Lowercase, strip whitespace, remove punctuation
text = text.lower().strip()
text = re.sub(r'[^\w\s]', '', text)
return text
def calculate(
self,
predictions: List[str],
references: List[str]
) -> MetricResult:
correct = 0
total = len(predictions)
for pred, ref in zip(predictions, references):
norm_pred = self._normalize(pred)
norm_ref = self._normalize(ref)
# Check for exact match or if reference is contained
if norm_pred == norm_ref or norm_ref in norm_pred:
correct += 1
accuracy = correct / total if total > 0 else 0
return MetricResult(
name="accuracy",
value=accuracy,
details={"correct": correct, "total": total}
)
class MultipleChoiceAccuracy(MetricCalculator):
"""Accuracy for multiple choice questions."""
def _extract_answer(self, text: str) -> str:
"""Extract the answer letter from model output."""
text = text.strip().upper()
# Direct match
if text in ['A', 'B', 'C', 'D', 'E']:
return text
# Look for patterns like "A)", "A.", "(A)"
patterns = [
r'^([A-E])\)',
r'^([A-E])\.',
r'^\(([A-E])\)',
r'^([A-E])\s',
r'answer[:\s]+([A-E])',
r'([A-E])\s*$',
]
for pattern in patterns:
match = re.search(pattern, text, re.IGNORECASE)
if match:
return match.group(1).upper()
# First letter if it's A-E
if text and text[0] in 'ABCDE':
return text[0]
return ""
def calculate(
self,
predictions: List[str],
references: List[str]
) -> MetricResult:
correct = 0
total = len(predictions)
details_list = []
for pred, ref in zip(predictions, references):
extracted = self._extract_answer(pred)
is_correct = extracted == ref.upper()
if is_correct:
correct += 1
details_list.append({
"predicted": extracted,
"expected": ref,
"correct": is_correct
})
accuracy = correct / total if total > 0 else 0
return MetricResult(
name="mc_accuracy",
value=accuracy,
details={
"correct": correct,
"total": total,
"breakdown": details_list
}
)
class ClassificationMetrics(MetricCalculator):
"""Classification metrics (precision, recall, F1)."""
def __init__(self, labels: Optional[List[str]] = None):
self.labels = labels or ["positive", "negative", "neutral"]
def _normalize_label(self, text: str) -> str:
"""Normalize predicted label."""
text = text.lower().strip()
for label in self.labels:
if label in text:
return label
return text
def calculate(
self,
predictions: List[str],
references: List[str]
) -> MetricResult:
# Normalize predictions
norm_preds = [self._normalize_label(p) for p in predictions]
norm_refs = [r.lower().strip() for r in references]
# Calculate metrics
accuracy = accuracy_score(norm_refs, norm_preds)
precision = precision_score(
norm_refs, norm_preds, average='weighted', zero_division=0
)
recall = recall_score(
norm_refs, norm_preds, average='weighted', zero_division=0
)
f1 = f1_score(
norm_refs, norm_preds, average='weighted', zero_division=0
)
return MetricResult(
name="classification",
value=f1, # Primary metric
details={
"accuracy": accuracy,
"precision": precision,
"recall": recall,
"f1": f1
}
)
class LatencyMetrics:
"""Calculate latency statistics."""
@staticmethod
def calculate(latencies_ms: List[float]) -> MetricResult:
if not latencies_ms:
return MetricResult(
name="latency",
value=0,
details={}
)
arr = np.array(latencies_ms)
return MetricResult(
name="latency",
value=float(np.mean(arr)),
details={
"mean_ms": float(np.mean(arr)),
"median_ms": float(np.median(arr)),
"p95_ms": float(np.percentile(arr, 95)),
"p99_ms": float(np.percentile(arr, 99)),
"min_ms": float(np.min(arr)),
"max_ms": float(np.max(arr)),
"std_ms": float(np.std(arr))
}
)
class ThroughputMetrics:
"""Calculate throughput statistics."""
@staticmethod
def calculate(
total_tokens: int,
total_time_seconds: float
) -> MetricResult:
tokens_per_second = total_tokens / total_time_seconds if total_time_seconds > 0 else 0
return MetricResult(
name="throughput",
value=tokens_per_second,
details={
"tokens_per_second": tokens_per_second,
"total_tokens": total_tokens,
"total_time_seconds": total_time_seconds
}
)
# Metric registry by task type
def get_metric_for_task(task_type: str) -> MetricCalculator:
"""Get appropriate metric calculator for a task type."""
from datasets_manager import TaskType
if task_type == TaskType.MULTIPLE_CHOICE:
return MultipleChoiceAccuracy()
elif task_type == TaskType.CLASSIFICATION:
return ClassificationMetrics()
else:
return AccuracyMetric()
What's Happening Here?
The Metrics module handles the tricky task of extracting and scoring model outputs:
┌─────────────────────────────────────────────────────────────────────────────┐
│ Answer Extraction Challenge │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Model outputs are messy! The same answer can appear many ways: │
│ │
│ All of these mean "C": │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ "C" ◄── Clean single letter │ │
│ │ "C)" ◄── Letter with parenthesis │ │
│ │ "C." ◄── Letter with period │ │
│ │ "(C)" ◄── Letter in parentheses │ │
│ │ "The answer is C" ◄── Sentence with letter │ │
│ │ "Based on my analysis, C is..." ◄── Verbose explanation │ │
│ │ "c" ◄── Lowercase letter │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │
│ MultipleChoiceAccuracy._extract_answer() handles all these: │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ 1. Uppercase the text │ │
│ │ 2. Check if it's a single letter A-E │ │
│ │ 3. Try regex patterns: r'^([A-E])\)', r'answer[:\s]+([A-E])', etc. │ │
│ │ 4. Fall back to first character if A-E │ │
│ │ 5. Return empty string if no match (counted as wrong) │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Understanding Latency Metrics:
┌─────────────────────────────────────────────────────────────────────────────┐
│ Latency Distribution Explained │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Why measure multiple latency values? │
│ │
│ Sample latencies: [120, 125, 130, 128, 122, 450, 127, 124, 126, 123] ms │
│ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ Metric │ Value │ What It Tells You │ │
│ ├─────────────────────────────────────────────────────────────────────────┤ │
│ │ mean_ms │ 157.5 │ Average experience (includes outliers) │ │
│ │ median_ms │ 125.5 │ Typical experience (ignores outliers) │ │
│ │ p95_ms │ 450 │ 5% of requests slower than this │ │
│ │ p99_ms │ 450 │ 1% of requests slower than this │ │
│ │ min_ms │ 120 │ Best case scenario │ │
│ │ max_ms │ 450 │ Worst case scenario (cold start, etc.) │ │
│ │ std_ms │ 102.8 │ How consistent is performance? │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │
│ For SLA definitions: │
│ • Use median for "typical user experience" │
│ • Use P95 for "worst acceptable experience" │
│ • Use P99 for capacity planning (tail latency) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Classification Metrics Deep Dive:
| Metric | Formula | When to Use |
|---|---|---|
| Accuracy | correct / total | Balanced classes |
| Precision | TP / (TP + FP) | Cost of false positives high |
| Recall | TP / (TP + FN) | Cost of false negatives high |
| F1 | 2 × (P × R) / (P + R) | Imbalanced classes, need balance |
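The formulas in the table can be checked with a worked example. The confusion counts here are illustrative, and `prf1` is a throwaway helper rather than project code (the project uses scikit-learn's weighted averages):

```python
def prf1(tp: int, fp: int, fn: int) -> tuple:
    """Precision, recall, and F1 from single-class confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# 8 true positives, 2 false positives, 4 false negatives
p, r, f1 = prf1(tp=8, fp=2, fn=4)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.8 0.667 0.727
```

Note how F1 (0.727) sits between precision (0.8) and recall (0.667), penalizing whichever is lower.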
Resource Profiler
Monitor memory and resource usage during evaluation.
# profiler.py
import psutil
import time
from dataclasses import dataclass, field
from typing import List, Dict, Any, Optional
from contextlib import contextmanager
import threading
@dataclass
class ResourceSnapshot:
"""A snapshot of resource usage."""
timestamp: float
cpu_percent: float
memory_mb: float
memory_percent: float
gpu_memory_mb: Optional[float] = None
gpu_utilization: Optional[float] = None
@dataclass
class ResourceProfile:
"""Complete resource profile for a run."""
snapshots: List[ResourceSnapshot] = field(default_factory=list)
peak_memory_mb: float = 0
avg_cpu_percent: float = 0
duration_seconds: float = 0
def summarize(self) -> Dict[str, Any]:
"""Get summary statistics."""
if not self.snapshots:
return {}
memories = [s.memory_mb for s in self.snapshots]
cpus = [s.cpu_percent for s in self.snapshots]
return {
"duration_seconds": self.duration_seconds,
"peak_memory_mb": max(memories) if memories else 0,
"avg_memory_mb": sum(memories) / len(memories) if memories else 0,
"avg_cpu_percent": sum(cpus) / len(cpus) if cpus else 0,
"sample_count": len(self.snapshots)
}
class ResourceProfiler:
"""Profile resource usage during model execution."""
def __init__(self, interval_seconds: float = 0.1):
self.interval = interval_seconds
self._snapshots: List[ResourceSnapshot] = []
self._running = False
self._thread: Optional[threading.Thread] = None
self._process = psutil.Process()
def _get_gpu_stats(self) -> tuple:
"""Get GPU stats if available."""
try:
import GPUtil
gpus = GPUtil.getGPUs()
if gpus:
gpu = gpus[0]
return gpu.memoryUsed, gpu.load * 100
except Exception:  # GPUtil not installed or GPU query failed
pass
return None, None
def _sample(self):
"""Take a resource snapshot."""
gpu_mem, gpu_util = self._get_gpu_stats()
return ResourceSnapshot(
timestamp=time.time(),
cpu_percent=self._process.cpu_percent(),
memory_mb=self._process.memory_info().rss / (1024 * 1024),
memory_percent=self._process.memory_percent(),
gpu_memory_mb=gpu_mem,
gpu_utilization=gpu_util
)
def _monitor_loop(self):
"""Background monitoring loop."""
while self._running:
self._snapshots.append(self._sample())
time.sleep(self.interval)
def start(self):
"""Start profiling."""
self._snapshots = []
self._running = True
# Get initial CPU reading
self._process.cpu_percent()
self._thread = threading.Thread(target=self._monitor_loop)
self._thread.start()
def stop(self) -> ResourceProfile:
"""Stop profiling and return results."""
self._running = False
if self._thread:
self._thread.join()
profile = ResourceProfile(snapshots=self._snapshots)
if self._snapshots:
profile.peak_memory_mb = max(s.memory_mb for s in self._snapshots)
profile.avg_cpu_percent = sum(s.cpu_percent for s in self._snapshots) / len(self._snapshots)
profile.duration_seconds = self._snapshots[-1].timestamp - self._snapshots[0].timestamp
return profile
@contextmanager
def profile(self):
"""Context manager for profiling."""
self.start()
try:
yield self
finally:
self._profile = self.stop()
def get_profile(self) -> ResourceProfile:
"""Get the last profile."""
return getattr(self, '_profile', ResourceProfile())
What's Happening Here?
The Resource Profiler monitors system usage during model inference:
┌─────────────────────────────────────────────────────────────────────────────┐
│ Resource Profiling Flow │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ profiler.start() │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ Background Thread Started │ │
│ │ │ │
│ │ while running: │ │
│ │ snapshot = { │ │
│ │ timestamp: time.time(), │ │
│ │ cpu_percent: process.cpu_percent(), ◄── How much CPU? │ │
│ │ memory_mb: process.memory_info().rss, ◄── How much RAM? │ │
│ │ gpu_memory_mb: GPUtil.getGPUs()[0].memoryUsed ◄── GPU VRAM? │ │
│ │ } │ │
│ │ snapshots.append(snapshot) │ │
│ │ sleep(0.1 seconds) ◄── Sample 10x per second │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ │ (model inference runs in main thread) │
│ │ │
│ ▼ │
│ profiler.stop() │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ Aggregate Snapshots: │ │
│ │ • peak_memory_mb = max(all memory readings) │ │
│ │ • avg_cpu_percent = mean(all CPU readings) │ │
│ │ • duration_seconds = last_timestamp - first_timestamp │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘

Why Peak Memory Matters:
┌─────────────────────────────────────────────────────────────────────────────┐
│ Memory Usage Patterns │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Memory over time for model inference: │
│ │
│ Memory (MB) │
│ ^ │
│ 2500 │ ┌────────────┐ ◄── Peak: Model fully loaded │
│ │ /│ │ │
│ 2000 │ / │ │ │
│ │ / │ │\ │
│ 1500 │ / │ │ \ │
│ │ / │ Running │ \ │
│ 1000 │ / │ │ \ │
│ │ / │ │ \ │
│ 500 │___/ │ │ \___ │
│ │ Load │ │ Unload │
│ 0 └───────────┴────────────┴─────────────────► Time │
│ Model Inference Cleanup │
│ Loading │
│ │
│ Peak memory determines: │
│ • Minimum RAM/VRAM needed to run the model │
│ • Whether model fits on your hardware │
│ • How many concurrent requests you can handle │
│ │
└─────────────────────────────────────────────────────────────────────────────┘

Resource Requirements by Model Size:
| Model Size | Typical RAM | Min GPU VRAM | Concurrent Limit (8GB RAM) |
|---|---|---|---|
| 1-2B params | 1-2 GB | 2-4 GB | 4-6 instances |
| 3-4B params | 2-3 GB | 4-6 GB | 2-3 instances |
| 7B params | 4-6 GB | 8-12 GB | 1 instance |
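The RAM column can be sanity-checked from parameter count alone: at Q4 quantization a weight costs roughly half a byte, plus runtime overhead for the KV cache and buffers. A back-of-envelope sketch (the 1.3x overhead factor is an assumption, not a measured constant):

```python
def estimate_ram_gb(params_billion: float, bytes_per_param: float = 0.5,
                    overhead: float = 1.3) -> float:
    """Rough RAM estimate: weight storage at the given precision
    (Q4 ~ 0.5 bytes/param, FP16 = 2.0) plus ~30% overhead for the
    KV cache and runtime buffers (the 1.3 factor is a guess)."""
    return params_billion * bytes_per_param * overhead

for label, billions in [("1.7B", 1.7), ("3.8B", 3.8), ("7B", 7.0)]:
    print(f"{label} @ Q4: ~{estimate_ram_gb(billions):.1f} GB")
```

These land at roughly 1.1, 2.5, and 4.6 GB, consistent with the table's ranges; double `bytes_per_param` for Q8, or use 2.0 for unquantized FP16.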
Part 3: Benchmark Runner
Complete Benchmark Runner
Combine all components into a unified runner.
# benchmark_runner.py
from dataclasses import dataclass, field
from typing import List, Dict, Any, Optional
from datetime import datetime
import json
import time
from tqdm import tqdm
from rich.console import Console
from rich.table import Table
from models import ModelClient, ModelInfo, get_model, MODELS
from datasets_manager import BenchmarkDataset, BenchmarkSample, TaskType, get_dataset
from metrics import (
MetricResult, AccuracyMetric, MultipleChoiceAccuracy,
ClassificationMetrics, LatencyMetrics, ThroughputMetrics,
get_metric_for_task
)
from profiler import ResourceProfiler, ResourceProfile
@dataclass
class SampleResult:
"""Result for a single sample."""
sample_id: str
prompt: str
expected: str
prediction: str
correct: bool
latency_ms: float
tokens: int
@dataclass
class BenchmarkResult:
"""Complete benchmark result for a model on a dataset."""
model_name: str
dataset_name: str
timestamp: str
sample_results: List[SampleResult]
quality_metrics: Dict[str, MetricResult]
performance_metrics: Dict[str, MetricResult]
resource_profile: Optional[Dict[str, Any]] = None
def to_dict(self) -> Dict[str, Any]:
"""Convert to dictionary."""
return {
"model": self.model_name,
"dataset": self.dataset_name,
"timestamp": self.timestamp,
"samples": len(self.sample_results),
"quality_metrics": {
k: {"value": v.value, "details": v.details}
for k, v in self.quality_metrics.items()
},
"performance_metrics": {
k: {"value": v.value, "details": v.details}
for k, v in self.performance_metrics.items()
},
"resource_profile": self.resource_profile
}
class BenchmarkRunner:
"""Run benchmarks on models."""
def __init__(
self,
profile_resources: bool = True,
verbose: bool = True
):
self.profile_resources = profile_resources
self.verbose = verbose
self.console = Console()
self.results: List[BenchmarkResult] = []
def run_single(
self,
model_key: str,
dataset_name: str,
max_samples: Optional[int] = None
) -> BenchmarkResult:
"""Run benchmark for a single model on a single dataset."""
model = get_model(model_key)
dataset = get_dataset(dataset_name)
if max_samples:
dataset = dataset.sample(max_samples)
if self.verbose:
self.console.print(f"\n[bold]Running {model.model_info.name} on {dataset.name}[/bold]")
# Initialize profiler
profiler = ResourceProfiler() if self.profile_resources else None
sample_results = []
latencies = []
total_tokens = 0
start_time = time.time()
if profiler:
profiler.start()
# Run evaluation
iterator = tqdm(dataset.samples, desc="Evaluating") if self.verbose else dataset.samples
for sample in iterator:
response = model.generate(sample.prompt)
prediction = response["content"]
latency = response["latency_ms"]
tokens = response["total_tokens"]
latencies.append(latency)
total_tokens += tokens
# Check correctness based on task type
metric = get_metric_for_task(dataset.task_type)
result = metric.calculate([prediction], [sample.expected])
sample_results.append(SampleResult(
sample_id=sample.id,
prompt=sample.prompt[:100] + "...",
expected=sample.expected,
prediction=prediction[:200],
correct=result.value > 0.5,
latency_ms=latency,
tokens=tokens
))
end_time = time.time()
if profiler:
resource_profile = profiler.stop().summarize()
else:
resource_profile = None
        # Calculate aggregate quality metrics. Note: predictions stored in
        # SampleResult are truncated to 200 chars, which is fine for short
        # answers (e.g. "A") but should be lifted for long-form tasks.
        predictions = [r.prediction for r in sample_results]
        references = [r.expected for r in sample_results]
        quality_metric = get_metric_for_task(dataset.task_type)
        quality_result = quality_metric.calculate(predictions, references)
latency_result = LatencyMetrics.calculate(latencies)
throughput_result = ThroughputMetrics.calculate(
total_tokens, end_time - start_time
)
result = BenchmarkResult(
model_name=model.model_info.name,
dataset_name=dataset.name,
timestamp=datetime.now().isoformat(),
sample_results=sample_results,
quality_metrics={quality_result.name: quality_result},
performance_metrics={
"latency": latency_result,
"throughput": throughput_result
},
resource_profile=resource_profile
)
self.results.append(result)
return result
def run_comparison(
self,
model_keys: List[str],
dataset_name: str,
max_samples: Optional[int] = None
) -> List[BenchmarkResult]:
"""Run benchmark comparing multiple models."""
results = []
for model_key in model_keys:
result = self.run_single(model_key, dataset_name, max_samples)
results.append(result)
return results
def run_full_suite(
self,
model_keys: List[str],
dataset_names: List[str],
max_samples: Optional[int] = None
) -> List[BenchmarkResult]:
"""Run full benchmark suite."""
results = []
for dataset_name in dataset_names:
for model_key in model_keys:
result = self.run_single(model_key, dataset_name, max_samples)
results.append(result)
return results
    def print_comparison(self, results: Optional[List[BenchmarkResult]] = None):
"""Print comparison table."""
results = results or self.results
if not results:
self.console.print("[yellow]No results to display[/yellow]")
return
# Group by dataset
datasets = {}
for r in results:
if r.dataset_name not in datasets:
datasets[r.dataset_name] = []
datasets[r.dataset_name].append(r)
for dataset_name, dataset_results in datasets.items():
table = Table(title=f"Results: {dataset_name}")
table.add_column("Model", style="cyan")
table.add_column("Accuracy", justify="right")
table.add_column("Latency (ms)", justify="right")
table.add_column("Throughput (tok/s)", justify="right")
table.add_column("Memory (MB)", justify="right")
for r in dataset_results:
                # Use the first quality metric as the primary value
                quality_value = next(iter(r.quality_metrics.values())).value if r.quality_metrics else 0.0
latency = r.performance_metrics.get("latency")
throughput = r.performance_metrics.get("throughput")
latency_str = f"{latency.details['mean_ms']:.0f}" if latency else "N/A"
throughput_str = f"{throughput.value:.1f}" if throughput else "N/A"
memory_str = f"{r.resource_profile['peak_memory_mb']:.0f}" if r.resource_profile else "N/A"
table.add_row(
r.model_name,
f"{quality_value:.1%}",
latency_str,
throughput_str,
memory_str
)
self.console.print(table)
def save_results(self, filepath: str):
"""Save results to JSON."""
data = [r.to_dict() for r in self.results]
with open(filepath, 'w') as f:
json.dump(data, f, indent=2)
self.console.print(f"[green]Results saved to {filepath}[/green]")
# Example usage
if __name__ == "__main__":
runner = BenchmarkRunner(profile_resources=True)
# Run on multiple models
models = ["phi3-mini", "qwen2.5-3b", "gemma2-2b"]
datasets = ["mmlu", "sentiment"]
results = runner.run_full_suite(models, datasets, max_samples=10)
# Print comparison
runner.print_comparison()
# Save results
    runner.save_results("benchmark_results.json")

Part 4: Visualization and Reporting
Visualization Module
Create charts and visualizations for benchmark results.
# visualizer.py
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from typing import List, Dict, Any, Optional
from pathlib import Path
class BenchmarkVisualizer:
"""Create visualizations for benchmark results."""
def __init__(self, results: List[Dict[str, Any]], output_dir: str = "reports"):
self.results = results
self.output_dir = Path(output_dir)
self.output_dir.mkdir(exist_ok=True)
# Set style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("husl")
def _results_to_df(self) -> pd.DataFrame:
"""Convert results to DataFrame."""
rows = []
for r in self.results:
row = {
"model": r["model"],
"dataset": r["dataset"],
}
# Quality metrics
for name, metric in r.get("quality_metrics", {}).items():
row[f"quality_{name}"] = metric["value"]
# Performance metrics
for name, metric in r.get("performance_metrics", {}).items():
if name == "latency":
row["latency_mean_ms"] = metric["details"].get("mean_ms", 0)
row["latency_p95_ms"] = metric["details"].get("p95_ms", 0)
elif name == "throughput":
row["throughput_tps"] = metric["value"]
# Resource metrics
if r.get("resource_profile"):
row["peak_memory_mb"] = r["resource_profile"].get("peak_memory_mb", 0)
row["avg_cpu_percent"] = r["resource_profile"].get("avg_cpu_percent", 0)
rows.append(row)
return pd.DataFrame(rows)
    def plot_accuracy_comparison(self, save: bool = True) -> Optional[plt.Figure]:
"""Plot accuracy comparison across models and datasets."""
df = self._results_to_df()
# Find accuracy column
accuracy_col = [c for c in df.columns if "accuracy" in c.lower()]
if not accuracy_col:
accuracy_col = [c for c in df.columns if "quality" in c.lower()]
if not accuracy_col:
print("No accuracy column found")
return None
accuracy_col = accuracy_col[0]
fig, ax = plt.subplots(figsize=(10, 6))
pivot = df.pivot(index="model", columns="dataset", values=accuracy_col)
pivot.plot(kind="bar", ax=ax, width=0.8)
ax.set_ylabel("Accuracy")
ax.set_xlabel("Model")
ax.set_title("Model Accuracy by Dataset")
ax.set_ylim(0, 1)
ax.legend(title="Dataset", bbox_to_anchor=(1.02, 1))
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right')
# Add value labels
for container in ax.containers:
ax.bar_label(container, fmt='%.2f', fontsize=8)
plt.tight_layout()
if save:
fig.savefig(self.output_dir / "accuracy_comparison.png", dpi=150, bbox_inches='tight')
return fig
def plot_latency_comparison(self, save: bool = True) -> plt.Figure:
"""Plot latency comparison."""
df = self._results_to_df()
fig, ax = plt.subplots(figsize=(10, 6))
# Create grouped bar chart
        # One group of bars per model, one bar per dataset; size the bar
        # width from the dataset count so groups don't overlap
        models = df["model"].unique()
        datasets = df["dataset"].unique()
        x = np.arange(len(models))
        width = 0.8 / max(len(datasets), 1)
for i, dataset in enumerate(datasets):
data = df[df["dataset"] == dataset]
values = [data[data["model"] == m]["latency_mean_ms"].values[0]
if len(data[data["model"] == m]) > 0 else 0
for m in models]
ax.bar(x + i * width, values, width, label=dataset)
ax.set_ylabel("Latency (ms)")
ax.set_xlabel("Model")
ax.set_title("Mean Latency by Model and Dataset")
        ax.set_xticks(x + width * (len(datasets) - 1) / 2)
ax.set_xticklabels(models, rotation=45, ha='right')
ax.legend(title="Dataset")
plt.tight_layout()
if save:
fig.savefig(self.output_dir / "latency_comparison.png", dpi=150, bbox_inches='tight')
return fig
    def plot_tradeoff(self, save: bool = True) -> Optional[plt.Figure]:
"""Plot accuracy vs latency tradeoff."""
df = self._results_to_df()
# Find accuracy column
accuracy_col = [c for c in df.columns if "accuracy" in c.lower() or "quality" in c.lower()]
accuracy_col = accuracy_col[0] if accuracy_col else None
if not accuracy_col or "latency_mean_ms" not in df.columns:
print("Required columns not found")
return None
fig, ax = plt.subplots(figsize=(10, 6))
# Get unique models
models = df["model"].unique()
colors = sns.color_palette("husl", len(models))
model_colors = dict(zip(models, colors))
for _, row in df.iterrows():
ax.scatter(
row["latency_mean_ms"],
row[accuracy_col],
c=[model_colors[row["model"]]],
s=100,
alpha=0.7
)
ax.annotate(
f'{row["model"]}\n({row["dataset"]})',
(row["latency_mean_ms"], row[accuracy_col]),
textcoords="offset points",
xytext=(5, 5),
fontsize=8
)
ax.set_xlabel("Latency (ms)")
ax.set_ylabel("Accuracy")
ax.set_title("Accuracy vs Latency Tradeoff")
# Add legend
handles = [plt.scatter([], [], c=[c], label=m) for m, c in model_colors.items()]
ax.legend(handles=handles, title="Model", bbox_to_anchor=(1.02, 1))
plt.tight_layout()
if save:
fig.savefig(self.output_dir / "tradeoff.png", dpi=150, bbox_inches='tight')
return fig
    def plot_resource_usage(self, save: bool = True) -> Optional[plt.Figure]:
"""Plot resource usage comparison."""
df = self._results_to_df()
if "peak_memory_mb" not in df.columns:
print("No resource data available")
return None
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Memory usage
ax1 = axes[0]
df_grouped = df.groupby("model")["peak_memory_mb"].mean().sort_values()
df_grouped.plot(kind="barh", ax=ax1, color=sns.color_palette("husl", len(df_grouped)))
ax1.set_xlabel("Peak Memory (MB)")
ax1.set_title("Peak Memory Usage by Model")
# CPU usage
ax2 = axes[1]
if "avg_cpu_percent" in df.columns:
df_grouped = df.groupby("model")["avg_cpu_percent"].mean().sort_values()
df_grouped.plot(kind="barh", ax=ax2, color=sns.color_palette("husl", len(df_grouped)))
ax2.set_xlabel("Average CPU (%)")
ax2.set_title("Average CPU Usage by Model")
plt.tight_layout()
if save:
fig.savefig(self.output_dir / "resource_usage.png", dpi=150, bbox_inches='tight')
return fig
    def generate_radar_chart(self, save: bool = True) -> Optional[plt.Figure]:
"""Generate radar chart comparing models across metrics."""
df = self._results_to_df()
# Aggregate by model
models = df["model"].unique()
# Metrics to include (normalize each)
metrics = []
if "quality_mc_accuracy" in df.columns:
metrics.append(("quality_mc_accuracy", "Accuracy", False))
if "latency_mean_ms" in df.columns:
metrics.append(("latency_mean_ms", "Speed", True)) # Inverse
if "throughput_tps" in df.columns:
metrics.append(("throughput_tps", "Throughput", False))
if "peak_memory_mb" in df.columns:
metrics.append(("peak_memory_mb", "Memory Eff.", True)) # Inverse
if len(metrics) < 3:
print("Not enough metrics for radar chart")
return None
# Prepare data
num_metrics = len(metrics)
angles = np.linspace(0, 2 * np.pi, num_metrics, endpoint=False).tolist()
angles += angles[:1] # Complete the loop
fig, ax = plt.subplots(figsize=(8, 8), subplot_kw=dict(polar=True))
for model in models:
model_data = df[df["model"] == model]
values = []
for col, _, inverse in metrics:
if col in model_data.columns:
val = model_data[col].mean()
# Normalize to 0-1
col_min = df[col].min()
col_max = df[col].max()
if col_max > col_min:
normalized = (val - col_min) / (col_max - col_min)
else:
normalized = 0.5
if inverse:
normalized = 1 - normalized
values.append(normalized)
else:
values.append(0)
values += values[:1] # Complete the loop
ax.plot(angles, values, 'o-', linewidth=2, label=model)
ax.fill(angles, values, alpha=0.1)
# Labels
metric_labels = [m[1] for m in metrics]
ax.set_xticks(angles[:-1])
ax.set_xticklabels(metric_labels)
ax.set_ylim(0, 1)
ax.legend(loc='upper right', bbox_to_anchor=(1.3, 1.0))
ax.set_title("Model Comparison Radar", y=1.08)
plt.tight_layout()
if save:
fig.savefig(self.output_dir / "radar_chart.png", dpi=150, bbox_inches='tight')
return fig
def generate_report(self) -> str:
"""Generate a complete HTML report."""
# Generate all plots
self.plot_accuracy_comparison()
self.plot_latency_comparison()
self.plot_tradeoff()
self.plot_resource_usage()
        self.generate_radar_chart()
df = self._results_to_df()
html = f"""
<!DOCTYPE html>
<html>
<head>
<title>SLM Benchmark Report</title>
<style>
body {{ font-family: Arial, sans-serif; margin: 40px; }}
h1 {{ color: #333; }}
h2 {{ color: #666; margin-top: 30px; }}
table {{ border-collapse: collapse; width: 100%; margin: 20px 0; }}
th, td {{ border: 1px solid #ddd; padding: 12px; text-align: left; }}
th {{ background-color: #4CAF50; color: white; }}
tr:nth-child(even) {{ background-color: #f2f2f2; }}
img {{ max-width: 100%; margin: 20px 0; border: 1px solid #ddd; }}
.metric {{ font-size: 24px; font-weight: bold; color: #4CAF50; }}
.summary {{ display: flex; gap: 20px; flex-wrap: wrap; }}
.summary-card {{ background: #f9f9f9; padding: 20px; border-radius: 8px; flex: 1; min-width: 200px; }}
</style>
</head>
<body>
<h1>SLM Benchmark Report</h1>
<p>Generated: {pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S')}</p>
<h2>Summary</h2>
<div class="summary">
<div class="summary-card">
<div class="metric">{len(df['model'].unique())}</div>
<div>Models Tested</div>
</div>
<div class="summary-card">
<div class="metric">{len(df['dataset'].unique())}</div>
<div>Datasets Used</div>
</div>
<div class="summary-card">
<div class="metric">{len(self.results)}</div>
<div>Total Benchmarks</div>
</div>
</div>
<h2>Results Table</h2>
{df.to_html(index=False, float_format='%.3f')}
<h2>Accuracy Comparison</h2>
<img src="accuracy_comparison.png" alt="Accuracy Comparison">
<h2>Latency Comparison</h2>
<img src="latency_comparison.png" alt="Latency Comparison">
<h2>Accuracy vs Latency Tradeoff</h2>
<img src="tradeoff.png" alt="Tradeoff Chart">
<h2>Resource Usage</h2>
<img src="resource_usage.png" alt="Resource Usage">
<h2>Model Comparison Radar</h2>
<img src="radar_chart.png" alt="Radar Chart">
</body>
</html>
"""
report_path = self.output_dir / "report.html"
with open(report_path, 'w') as f:
f.write(html)
print(f"Report saved to {report_path}")
        return str(report_path)

Part 5: Complete CLI Application
Main Application
Create a command-line interface for running benchmarks.
# main.py
import argparse
import json
from pathlib import Path
from rich.console import Console
from rich.panel import Panel
from models import list_models, MODELS
from datasets_manager import list_datasets
from benchmark_runner import BenchmarkRunner
from visualizer import BenchmarkVisualizer
console = Console()
def list_available():
"""List available models and datasets."""
console.print("\n[bold cyan]Available Models:[/bold cyan]")
for key, info in MODELS.items():
console.print(f" • {key}: {info.name} ({info.parameters}, {info.provider})")
console.print("\n[bold cyan]Available Datasets:[/bold cyan]")
for name in list_datasets():
console.print(f" • {name}")
def run_benchmark(args):
"""Run benchmark with specified parameters."""
console.print(Panel.fit(
"[bold]SLM Benchmarking Suite[/bold]",
border_style="cyan"
))
# Parse models and datasets
models = args.models.split(",") if args.models else list(MODELS.keys())
datasets = args.datasets.split(",") if args.datasets else list_datasets()
console.print(f"\n[cyan]Models:[/cyan] {', '.join(models)}")
console.print(f"[cyan]Datasets:[/cyan] {', '.join(datasets)}")
console.print(f"[cyan]Max samples:[/cyan] {args.max_samples or 'All'}")
# Run benchmarks
runner = BenchmarkRunner(
profile_resources=not args.no_profile,
verbose=not args.quiet
)
results = runner.run_full_suite(
model_keys=models,
dataset_names=datasets,
max_samples=args.max_samples
)
# Print comparison
runner.print_comparison()
# Save raw results
if args.output:
runner.save_results(args.output)
# Generate report
if args.report:
results_data = [r.to_dict() for r in results]
visualizer = BenchmarkVisualizer(results_data, output_dir=args.report_dir)
report_path = visualizer.generate_report()
console.print(f"\n[green]Report generated: {report_path}[/green]")
return results
def compare_models(args):
"""Quick comparison of specified models."""
console.print(f"\n[bold]Comparing models on {args.dataset}[/bold]\n")
models = args.models.split(",")
runner = BenchmarkRunner(profile_resources=True)
results = runner.run_comparison(
model_keys=models,
dataset_name=args.dataset,
max_samples=args.max_samples
)
runner.print_comparison()
# Find best model
best_accuracy = 0
best_model = None
for r in results:
for metric in r.quality_metrics.values():
if metric.value > best_accuracy:
best_accuracy = metric.value
best_model = r.model_name
console.print(f"\n[bold green]Best model: {best_model} ({best_accuracy:.1%} accuracy)[/bold green]")
def main():
parser = argparse.ArgumentParser(
description="SLM Benchmarking Suite",
formatter_class=argparse.RawDescriptionHelpFormatter
)
subparsers = parser.add_subparsers(dest="command", help="Commands")
# List command
list_parser = subparsers.add_parser("list", help="List available models and datasets")
# Run command
run_parser = subparsers.add_parser("run", help="Run benchmarks")
run_parser.add_argument("-m", "--models", help="Comma-separated model keys")
run_parser.add_argument("-d", "--datasets", help="Comma-separated dataset names")
run_parser.add_argument("-n", "--max-samples", type=int, help="Max samples per dataset")
run_parser.add_argument("-o", "--output", help="Output JSON file")
run_parser.add_argument("--report", action="store_true", help="Generate HTML report")
run_parser.add_argument("--report-dir", default="reports", help="Report output directory")
run_parser.add_argument("--no-profile", action="store_true", help="Disable resource profiling")
run_parser.add_argument("-q", "--quiet", action="store_true", help="Quiet mode")
# Compare command
compare_parser = subparsers.add_parser("compare", help="Quick model comparison")
compare_parser.add_argument("-m", "--models", required=True, help="Comma-separated models")
compare_parser.add_argument("-d", "--dataset", default="mmlu", help="Dataset to use")
compare_parser.add_argument("-n", "--max-samples", type=int, default=20)
args = parser.parse_args()
if args.command == "list":
list_available()
elif args.command == "run":
run_benchmark(args)
elif args.command == "compare":
compare_models(args)
else:
parser.print_help()
if __name__ == "__main__":
    main()

Example Usage
# List available options
python main.py list
# Run full benchmark suite
python main.py run --report
# Run specific models on specific datasets
python main.py run -m phi3-mini,qwen2.5-3b -d mmlu,sentiment -n 50
# Quick comparison
python main.py compare -m phi3-mini,qwen2.5-3b,gemma2-2b -d mmlu -n 30
# Save results to file
python main.py run -m phi3-mini -d mmlu -o results.json

Model Selection Guide
Based on typical benchmark results:
| Model | Best For | Accuracy | Speed | Memory |
|---|---|---|---|---|
| Phi-3 Mini | Reasoning, Math | High | Medium | 2.3GB |
| Qwen 2.5 3B | Multilingual, Extraction | High | Medium | 2.0GB |
| Gemma 2 2B | Generation, Fast tasks | Medium | Fast | 1.6GB |
| Llama 3.2 3B | General purpose | High | Medium | 2.0GB |
| SmolLM 2 1.7B | Resource-constrained | Medium | Very Fast | 1.0GB |
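There is no single "best" row in this table: the right pick depends on how heavily you weight accuracy against speed and memory. One way to make that tradeoff explicit is a weighted score over min-max-normalized metrics, the same normalization the radar chart uses. A minimal sketch; the figures below are illustrative placeholders, not measured benchmarks:

```python
def min_max(values):
    """Normalize a list of numbers to [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.5 for v in values]

# Illustrative figures only: (accuracy, mean latency ms, peak memory GB)
candidates = {
    "phi3-mini":    (0.78, 850, 2.3),
    "gemma2-2b":    (0.65, 420, 1.6),
    "smollm2-1.7b": (0.58, 300, 1.0),
}
names = list(candidates)
acc = min_max([candidates[n][0] for n in names])
# Invert latency and memory so that higher always means better
lat = [1 - v for v in min_max([candidates[n][1] for n in names])]
mem = [1 - v for v in min_max([candidates[n][2] for n in names])]

weights = (0.5, 0.3, 0.2)  # accuracy, speed, memory: tune to your use case
scores = {n: weights[0] * a + weights[1] * l + weights[2] * m
          for n, a, l, m in zip(names, acc, lat, mem)}
best = max(scores, key=scores.get)
print(f"best under these weights: {best} ({scores[best]:.3f})")
```

With these weights the speed/memory middle ground wins; shift them toward accuracy (e.g. 0.8/0.1/0.1) and Phi-3 comes out on top instead, which is exactly the decision the table is asking you to make.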
Exercises
- Add More Datasets: Implement loading from HuggingFace datasets like MMLU, HellaSwag, or ARC
- Custom Metrics: Create a custom metric for your specific use case (e.g., code execution accuracy)
- Automated Model Selection: Build a recommender that suggests the best model based on task requirements
- Continuous Benchmarking: Set up a scheduled job to track model performance over time
Loading HuggingFace Datasets
# huggingface_datasets.py
from datasets import load_dataset
from datasets_manager import BenchmarkDataset, BenchmarkSample, TaskType
def load_mmlu_from_hf(subject: str = "elementary_mathematics", split: str = "test") -> BenchmarkDataset:
"""Load MMLU dataset from HuggingFace."""
dataset = load_dataset("cais/mmlu", subject, split=split)
samples = []
for i, item in enumerate(dataset):
choices = item["choices"]
prompt = f"""Question: {item["question"]}
A) {choices[0]}
B) {choices[1]}
C) {choices[2]}
D) {choices[3]}
Answer with just the letter (A, B, C, or D):"""
answer_map = {0: "A", 1: "B", 2: "C", 3: "D"}
expected = answer_map[item["answer"]]
samples.append(BenchmarkSample(
id=f"mmlu_{subject}_{i}",
prompt=prompt,
expected=expected,
task_type=TaskType.MULTIPLE_CHOICE,
category=subject
))
return BenchmarkDataset(
name=f"MMLU-{subject}",
description=f"MMLU {subject} subset",
task_type=TaskType.MULTIPLE_CHOICE,
samples=samples
    )

Key Concepts Recap
| Concept | What It Is | Why It Matters |
|---|---|---|
| MMLU | Massive Multitask Language Understanding benchmark | Standard test for general knowledge |
| Accuracy | % of correct answers | Primary quality metric |
| F1 Score | Harmonic mean of precision and recall | Better for imbalanced classes |
| Latency | Time from prompt to response complete | User experience metric |
| Time-to-First-Token | Time until first token generated | Perceived responsiveness |
| Throughput | Tokens per second generated | Capacity planning metric |
| P95/P99 Latency | 95th/99th percentile latency | Worst-case performance |
| Peak Memory | Maximum RAM used during inference | Hardware requirements |
| Quantization Impact | Quality loss from Q8→Q4→Q2 | Usually under 5% for Q4_K_M |
| Radar Chart | Multi-dimensional model comparison | Visualize tradeoffs at once |
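Two of the rows above deserve a concrete illustration: a healthy-looking mean latency can hide a bad tail, which is why P95/P99 get their own row. A quick sketch (the sample values are made up):

```python
import numpy as np

# Made-up latency samples (ms) with two slow outliers
latencies_ms = [120, 135, 128, 450, 132, 140, 125, 900, 130, 138]

print(f"mean: {np.mean(latencies_ms):6.1f} ms")
print(f"p50:  {np.percentile(latencies_ms, 50):6.1f} ms")
print(f"p95:  {np.percentile(latencies_ms, 95):6.1f} ms")
# The outliers pull the mean far above the median, while p95
# surfaces the tail behavior a mean-only report would hide.
```

Here the mean sits well above the median but far below p95: report all three, since a single number can misrepresent how the model actually feels to users.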
Next Steps
- SLM Fine-tuning - Customize models for better task performance
- SLM-Powered RAG - Combine SLMs with retrieval
- Production SLM System - Deploy SLMs at scale