SLM Evaluation & Benchmarking
Compare and evaluate different small language models
Build a comprehensive benchmarking suite to evaluate and compare small language models across quality, speed, and resource usage metrics. Learn to make data-driven decisions about which model best fits your use case.
TL;DR
Benchmark SLMs across three dimensions: quality (accuracy, F1), speed (latency, throughput), and resources (memory, CPU). Key metrics: tokens/second (20-50 typical), time-to-first-token (<500ms good), and peak memory. Use standard datasets (MMLU, HellaSwag) for comparability. The "best" model depends on your tradeoff: Phi-3 for accuracy, Gemma for speed, SmolLM for constrained environments.
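The tokens/second figure above is simply completion tokens divided by wall-clock generation time. A minimal sketch (the helper name is illustrative, not part of the project code):

```python
def tokens_per_second(completion_tokens: int, seconds: float) -> float:
    """Throughput = completion tokens / wall-clock generation time."""
    return completion_tokens / seconds if seconds > 0 else 0.0

# 128 tokens generated in 4 seconds -> 32 tok/s, inside the 20-50 typical range
print(tokens_per_second(128, 4.0))  # 32.0
```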
Project Overview
| Aspect | Details |
|---|---|
| Difficulty | Beginner |
| Time | 3-4 hours |
| Prerequisites | Local SLM Setup |
| What You'll Build | Model benchmarking framework with automated evaluation and reporting |
What You'll Learn
- Standard benchmark datasets (MMLU, HellaSwag, TruthfulQA)
- Quality metrics (accuracy, perplexity, F1 score)
- Performance metrics (latency, throughput, memory usage)
- Building automated benchmarking pipelines
- Visualization and comparison reports
- Task-specific evaluation strategies
Architecture Overview
┌─────────────────────────────────────────────────────────────────────────────┐
│ Benchmarking Framework Architecture │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ Benchmark Suite │ │
│ │ ┌──────────────┐ ┌───────────────┐ ┌──────────────┐ │ │
│ │ │ Datasets │ │Task Definitions│ │Model Registry│ │ │
│ │ │ (MMLU, etc.) │ │ (MC, QA, etc) │ │ (Phi, Qwen) │ │ │
│ │ └──────┬───────┘ └───────┬───────┘ └──────┬───────┘ │ │
│ └───────────┼──────────────────┼─────────────────┼──────────────────────┘ │
│ └──────────────────┼─────────────────┘ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ Evaluation Engine │ │
│ │ │ │
│ │ ┌──────────────────────────────────────────────────────────┐ │ │
│ │ │ Test Runner │ │ │
│ │ │ Run samples ──▶ Collect responses ──▶ Measure │ │ │
│ │ └─────────────────────────┬────────────────────────────────┘ │ │
│ │ │ │ │
│ │ ┌───────────────┴───────────────┐ │ │
│ │ ▼ ▼ │ │
│ │ ┌─────────────────┐ ┌─────────────────┐ │ │
│ │ │Metrics Collector│ │Resource Profiler│ │ │
│ │ │ • Accuracy │ │ • Memory (MB) │ │ │
│ │ │ • F1 Score │ │ • CPU % │ │ │
│ │ │ • Latency │ │ • GPU usage │ │ │
│ │ └────────┬────────┘ └────────┬────────┘ │ │
│ └─────────────┼──────────────────────────────┼──────────────────────────┘ │
│ └─────────────┬────────────────┘ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ Analysis & Reporting │ │
│ │ │ Results │ │ Visualizer │ │ Report │ │ │
│ │ │ Aggregator │─▶│ (charts) │─▶│ Generator │ │ │
│ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Project Setup
Install Dependencies
# Create project directory
mkdir slm-benchmarking && cd slm-benchmarking
# Create virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# Install dependencies
pip install ollama datasets evaluate pandas numpy matplotlib seaborn
pip install psutil gputil tqdm rich tabulate scikit-learn
Pull Models to Benchmark
# Pull various SLMs for comparison
ollama pull phi3:mini # 2.3GB - Microsoft
ollama pull qwen2.5:3b # 2.0GB - Alibaba
ollama pull gemma2:2b # 1.6GB - Google
ollama pull llama3.2:3b # 2.0GB - Meta
ollama pull smollm2:1.7b # 1.0GB - HuggingFace
Part 1: Benchmark Framework Core
Model Registry
Create a registry to manage models being evaluated.
# models.py
from dataclasses import dataclass
from typing import Optional, List, Dict, Any
import ollama
@dataclass
class ModelInfo:
"""Information about a model to benchmark."""
name: str
provider: str
parameters: str
quantization: str = "Q4_K_M"
context_length: int = 4096
ollama_name: Optional[str] = None
@property
def id(self) -> str:
return self.ollama_name or self.name
# Model registry
MODELS = {
"phi3-mini": ModelInfo(
name="Phi-3 Mini",
provider="Microsoft",
parameters="3.8B",
ollama_name="phi3:mini",
context_length=4096
),
"qwen2.5-3b": ModelInfo(
name="Qwen 2.5",
provider="Alibaba",
parameters="3B",
ollama_name="qwen2.5:3b",
context_length=8192
),
"gemma2-2b": ModelInfo(
name="Gemma 2",
provider="Google",
parameters="2B",
ollama_name="gemma2:2b",
context_length=8192
),
"llama3.2-3b": ModelInfo(
name="Llama 3.2",
provider="Meta",
parameters="3B",
ollama_name="llama3.2:3b",
context_length=8192
),
"smollm2-1.7b": ModelInfo(
name="SmolLM 2",
provider="HuggingFace",
parameters="1.7B",
ollama_name="smollm2:1.7b",
context_length=2048
),
}
class ModelClient:
"""Wrapper for model inference."""
def __init__(self, model_info: ModelInfo):
self.model_info = model_info
def generate(
self,
prompt: str,
max_tokens: int = 256,
temperature: float = 0.0
) -> Dict[str, Any]:
"""Generate a response and return with metadata."""
import time
start_time = time.perf_counter()
response = ollama.chat(
model=self.model_info.id,
messages=[{"role": "user", "content": prompt}],
options={
"temperature": temperature,
"num_predict": max_tokens,
}
)
end_time = time.perf_counter()
return {
"content": response["message"]["content"],
"latency_ms": (end_time - start_time) * 1000,
"prompt_tokens": response.get("prompt_eval_count", 0),
"completion_tokens": response.get("eval_count", 0),
"total_tokens": response.get("prompt_eval_count", 0) + response.get("eval_count", 0)
}
def check_available(self) -> bool:
"""Check if model is available locally."""
try:
models = ollama.list()
available = [m["name"] for m in models.get("models", [])]
return any(self.model_info.id in m for m in available)
except Exception:
return False
def get_model(model_key: str) -> ModelClient:
"""Get a model client by key."""
if model_key not in MODELS:
raise ValueError(f"Unknown model: {model_key}")
return ModelClient(MODELS[model_key])
def list_models() -> List[ModelInfo]:
"""List all registered models."""
return list(MODELS.values())
What's Happening Here?
The Model Registry provides a unified interface for managing different SLMs:
┌─────────────────────────────────────────────────────────────────────────────┐
│ Model Registry Architecture │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ MODELS Dictionary (Model Catalog) │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ "phi3-mini" → ModelInfo(name="Phi-3 Mini", params="3.8B", ...) │ │
│ │ "qwen2.5-3b" → ModelInfo(name="Qwen 2.5", params="3B", ...) │ │
│ │ "gemma2-2b" → ModelInfo(name="Gemma 2", params="2B", ...) │ │
│ │ "llama3.2-3b" → ModelInfo(name="Llama 3.2", params="3B", ...) │ │
│ │ "smollm2-1.7b" → ModelInfo(name="SmolLM 2", params="1.7B", ...) │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ModelClient (Unified Interface) │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ generate(prompt) ─────────────────────────────────────────────────────► │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────────────────────────────────┐ │ │
│ │ │ 1. Start timer │ │ │
│ │ │ 2. Call ollama.chat(model=..., messages=[...]) │ │ │
│ │ │ 3. Stop timer │ │ │
│ │ │ 4. Return {content, latency_ms, tokens} │ │ │
│ │ └─────────────────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ Same interface regardless of which model you use! │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Model Characteristics Explained:
| Model | Strengths | Weaknesses | Best Use Case |
|---|---|---|---|
| Phi-3 Mini (3.8B) | Strong reasoning, math | Slower inference | Complex tasks, coding |
| Qwen 2.5 (3B) | Multilingual, extraction | Higher memory | International apps |
| Gemma 2 (2B) | Fast, efficient | Lower accuracy | Real-time applications |
| Llama 3.2 (3B) | Balanced performance | Moderate speed | General purpose |
| SmolLM 2 (1.7B) | Very fast, tiny | Limited capability | Edge devices, simple tasks |
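The registry pattern above can be exercised standalone. This sketch trims the registry to two entries and adds a hypothetical `smallest_model` helper (not part of the project code) to show how the metadata supports model selection before any inference runs:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ModelInfo:
    name: str
    parameters: str            # e.g. "1.7B"
    context_length: int = 4096
    ollama_name: Optional[str] = None

    @property
    def id(self) -> str:
        # Fall back to the display name when no Ollama tag is registered
        return self.ollama_name or self.name

# Trimmed registry mirroring the full MODELS dict above
MODELS = {
    "gemma2-2b":    ModelInfo("Gemma 2", "2B", 8192, "gemma2:2b"),
    "smollm2-1.7b": ModelInfo("SmolLM 2", "1.7B", 2048, "smollm2:1.7b"),
}

def smallest_model() -> ModelInfo:
    # Rank by parameter count: "1.7B" -> 1.7
    return min(MODELS.values(), key=lambda m: float(m.parameters.rstrip("B")))

print(smallest_model().id)  # smollm2:1.7b
```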
Why Use ModelInfo Dataclass?
┌─────────────────────────────────────────────────────────────────────────────┐
│ Consistent metadata for every model │
│ │
│ @dataclass │
│ class ModelInfo: │
│ name: str ← Human-readable display name │
│ provider: str ← Company/org that made it │
│ parameters: str ← Model size (affects quality & speed) │
│ quantization: str ← Compression level (Q4_K_M = good balance) │
│ context_length: int ← Max tokens model can process at once │
│ ollama_name: str ← Exact name to pass to Ollama │
│ │
│ This metadata helps with: │
│ • Generating reports with consistent model names │
│ • Checking if model fits your hardware (context_length × 2 ≈ RAM needed) │
│ • Understanding tradeoffs before running benchmarks │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Benchmark Dataset Manager
Create a dataset loader for standard benchmarks.
# datasets_manager.py
from dataclasses import dataclass
from typing import List, Dict, Any, Optional, Iterator
from enum import Enum
import json
import random
class TaskType(str, Enum):
MULTIPLE_CHOICE = "multiple_choice"
TEXT_GENERATION = "text_generation"
CLASSIFICATION = "classification"
QA = "qa"
SUMMARIZATION = "summarization"
@dataclass
class BenchmarkSample:
"""A single benchmark sample."""
id: str
prompt: str
expected: str
task_type: TaskType
category: Optional[str] = None
metadata: Optional[Dict[str, Any]] = None
@dataclass
class BenchmarkDataset:
"""A benchmark dataset."""
name: str
description: str
task_type: TaskType
samples: List[BenchmarkSample]
def __len__(self) -> int:
return len(self.samples)
def __iter__(self) -> Iterator[BenchmarkSample]:
return iter(self.samples)
def sample(self, n: int, seed: int = 42) -> "BenchmarkDataset":
"""Get a random sample of the dataset."""
random.seed(seed)
sampled = random.sample(self.samples, min(n, len(self.samples)))
return BenchmarkDataset(
name=self.name,
description=self.description,
task_type=self.task_type,
samples=sampled
)
# Built-in benchmark datasets
def create_mmlu_subset() -> BenchmarkDataset:
"""Create MMLU-style multiple choice questions."""
samples = [
BenchmarkSample(
id="mmlu_1",
prompt="""Question: What is the capital of France?
A) London
B) Berlin
C) Paris
D) Madrid
Answer with just the letter (A, B, C, or D):""",
expected="C",
task_type=TaskType.MULTIPLE_CHOICE,
category="geography"
),
BenchmarkSample(
id="mmlu_2",
prompt="""Question: Which planet is known as the Red Planet?
A) Venus
B) Mars
C) Jupiter
D) Saturn
Answer with just the letter (A, B, C, or D):""",
expected="B",
task_type=TaskType.MULTIPLE_CHOICE,
category="astronomy"
),
BenchmarkSample(
id="mmlu_3",
prompt="""Question: What is the chemical symbol for gold?
A) Ag
B) Fe
C) Au
D) Cu
Answer with just the letter (A, B, C, or D):""",
expected="C",
task_type=TaskType.MULTIPLE_CHOICE,
category="chemistry"
),
BenchmarkSample(
id="mmlu_4",
prompt="""Question: Who wrote "Romeo and Juliet"?
A) Charles Dickens
B) William Shakespeare
C) Jane Austen
D) Mark Twain
Answer with just the letter (A, B, C, or D):""",
expected="B",
task_type=TaskType.MULTIPLE_CHOICE,
category="literature"
),
BenchmarkSample(
id="mmlu_5",
prompt="""Question: What is the derivative of x^2?
A) x
B) 2x
C) x^2
D) 2
Answer with just the letter (A, B, C, or D):""",
expected="B",
task_type=TaskType.MULTIPLE_CHOICE,
category="math"
),
BenchmarkSample(
id="mmlu_6",
prompt="""Question: What is the time complexity of binary search?
A) O(n)
B) O(n^2)
C) O(log n)
D) O(1)
Answer with just the letter (A, B, C, or D):""",
expected="C",
task_type=TaskType.MULTIPLE_CHOICE,
category="computer_science"
),
BenchmarkSample(
id="mmlu_7",
prompt="""Question: What is the largest organ in the human body?
A) Heart
B) Brain
C) Liver
D) Skin
Answer with just the letter (A, B, C, or D):""",
expected="D",
task_type=TaskType.MULTIPLE_CHOICE,
category="biology"
),
BenchmarkSample(
id="mmlu_8",
prompt="""Question: In which year did World War II end?
A) 1943
B) 1944
C) 1945
D) 1946
Answer with just the letter (A, B, C, or D):""",
expected="C",
task_type=TaskType.MULTIPLE_CHOICE,
category="history"
),
]
return BenchmarkDataset(
name="MMLU-Mini",
description="Subset of MMLU-style questions",
task_type=TaskType.MULTIPLE_CHOICE,
samples=samples
)
def create_commonsense_qa() -> BenchmarkDataset:
"""Create commonsense reasoning questions."""
samples = [
BenchmarkSample(
id="csqa_1",
prompt="""Question: Where would you put a plant that needs sunlight?
A) In a closet
B) Under a bed
C) Near a window
D) In a basement
Answer with just the letter:""",
expected="C",
task_type=TaskType.MULTIPLE_CHOICE,
category="commonsense"
),
BenchmarkSample(
id="csqa_2",
prompt="""Question: What do you use to cut paper?
A) Hammer
B) Scissors
C) Spoon
D) Pillow
Answer with just the letter:""",
expected="B",
task_type=TaskType.MULTIPLE_CHOICE,
category="commonsense"
),
BenchmarkSample(
id="csqa_3",
prompt="""Question: If it's raining, what should you bring outside?
A) Sunglasses
B) Umbrella
C) Sunscreen
D) Ice cream
Answer with just the letter:""",
expected="B",
task_type=TaskType.MULTIPLE_CHOICE,
category="commonsense"
),
BenchmarkSample(
id="csqa_4",
prompt="""Question: What happens to water when it freezes?
A) It becomes a gas
B) It becomes solid
C) It disappears
D) It gets warmer
Answer with just the letter:""",
expected="B",
task_type=TaskType.MULTIPLE_CHOICE,
category="commonsense"
),
]
return BenchmarkDataset(
name="CommonsenseQA-Mini",
description="Commonsense reasoning questions",
task_type=TaskType.MULTIPLE_CHOICE,
samples=samples
)
def create_coding_benchmark() -> BenchmarkDataset:
"""Create coding knowledge questions."""
samples = [
BenchmarkSample(
id="code_1",
prompt="""What does this Python code output?
```python
x = [1, 2, 3]
print(len(x))
```
Answer with just the output:""",
expected="3",
task_type=TaskType.QA,
category="python"
),
BenchmarkSample(
id="code_2",
prompt="""What does this Python code output?
```python
x = "hello"
print(x.upper())
```
Answer with just the output:""",
expected="HELLO",
task_type=TaskType.QA,
category="python"
),
BenchmarkSample(
id="code_3",
prompt="""What does this Python code output?
```python
x = [1, 2, 3]
x.append(4)
print(x[-1])
```
Answer with just the output:""",
expected="4",
task_type=TaskType.QA,
category="python"
),
BenchmarkSample(
id="code_4",
prompt="""What does this Python code output?
```python
x = {"a": 1, "b": 2}
print(x.get("c", 0))
```
Answer with just the output:""",
expected="0",
task_type=TaskType.QA,
category="python"
),
]
return BenchmarkDataset(
name="CodingQA-Mini",
description="Basic coding knowledge questions",
task_type=TaskType.QA,
samples=samples
)
def create_classification_benchmark() -> BenchmarkDataset:
"""Create sentiment classification samples."""
samples = [
BenchmarkSample(
id="sent_1",
prompt="""Classify the sentiment of this text as positive, negative, or neutral.
Text: "I absolutely loved this movie! The acting was superb."
Sentiment:""",
expected="positive",
task_type=TaskType.CLASSIFICATION,
category="sentiment"
),
BenchmarkSample(
id="sent_2",
prompt="""Classify the sentiment of this text as positive, negative, or neutral.
Text: "This product broke after one day. Complete waste of money."
Sentiment:""",
expected="negative",
task_type=TaskType.CLASSIFICATION,
category="sentiment"
),
BenchmarkSample(
id="sent_3",
prompt="""Classify the sentiment of this text as positive, negative, or neutral.
Text: "The meeting is scheduled for 3pm tomorrow."
Sentiment:""",
expected="neutral",
task_type=TaskType.CLASSIFICATION,
category="sentiment"
),
BenchmarkSample(
id="sent_4",
prompt="""Classify the sentiment of this text as positive, negative, or neutral.
Text: "Worst customer service I've ever experienced."
Sentiment:""",
expected="negative",
task_type=TaskType.CLASSIFICATION,
category="sentiment"
),
BenchmarkSample(
id="sent_5",
prompt="""Classify the sentiment of this text as positive, negative, or neutral.
Text: "This restaurant exceeded all my expectations!"
Sentiment:""",
expected="positive",
task_type=TaskType.CLASSIFICATION,
category="sentiment"
),
]
return BenchmarkDataset(
name="Sentiment-Mini",
description="Sentiment classification benchmark",
task_type=TaskType.CLASSIFICATION,
samples=samples
)
# Dataset registry
DATASETS = {
"mmlu": create_mmlu_subset,
"commonsense": create_commonsense_qa,
"coding": create_coding_benchmark,
"sentiment": create_classification_benchmark,
}
def get_dataset(name: str) -> BenchmarkDataset:
"""Get a dataset by name."""
if name not in DATASETS:
raise ValueError(f"Unknown dataset: {name}")
return DATASETS[name]()
def list_datasets() -> List[str]:
"""List all available datasets."""
return list(DATASETS.keys())
What's Happening Here?
The Dataset Manager provides standardized benchmarks for consistent evaluation:
┌─────────────────────────────────────────────────────────────────────────────┐
│ Benchmark Dataset Structure │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ BenchmarkDataset │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ name: "MMLU-Mini" │ │
│ │ description: "Subset of MMLU-style questions" │ │
│ │ task_type: TaskType.MULTIPLE_CHOICE │ │
│ │ samples: [ │ │
│ │ BenchmarkSample( │ │
│ │ id: "mmlu_1", │ │
│ │ prompt: "Question: What is the capital of France?\nA) London...", │ │
│ │ expected: "C", │ │
│ │ category: "geography" │ │
│ │ ), │ │
│ │ BenchmarkSample(...), │ │
│ │ ... │ │
│ │ ] │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │
│ Why This Structure? │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ • id: Track which questions the model got right/wrong │ │
│ │ • prompt: Exact text sent to model (includes formatting!) │ │
│ │ • expected: Ground truth for automatic scoring │ │
│ │ • task_type: Determines which metric to use (accuracy vs F1) │ │
│ │ • category: Enable per-topic analysis (math vs history) │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Task Types and Their Purposes:
| Task Type | Example | Metric Used | What It Tests |
|---|---|---|---|
| MULTIPLE_CHOICE | MMLU, HellaSwag | MC Accuracy | Knowledge, reasoning |
| TEXT_GENERATION | Summarization | ROUGE, BLEU | Fluency, coherence |
| CLASSIFICATION | Sentiment | F1, Precision, Recall | Categorization |
| QA | Coding questions | Exact match | Factual accuracy |
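The routing from task type to primary metric can be sketched with a plain lookup table. This mirrors the idea behind `get_metric_for_task` in Part 2, with metric names as illustrative strings:

```python
from enum import Enum

class TaskType(str, Enum):
    MULTIPLE_CHOICE = "multiple_choice"
    TEXT_GENERATION = "text_generation"
    CLASSIFICATION = "classification"
    QA = "qa"

# Primary metric per task type, matching the table above
PRIMARY_METRIC = {
    TaskType.MULTIPLE_CHOICE: "mc_accuracy",
    TaskType.TEXT_GENERATION: "rouge",
    TaskType.CLASSIFICATION: "f1",
    TaskType.QA: "exact_match",
}

def metric_for(task: TaskType) -> str:
    # Fall back to exact match for unknown task types
    return PRIMARY_METRIC.get(task, "exact_match")

print(metric_for(TaskType.CLASSIFICATION))  # f1
```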
Understanding MMLU-Style Prompts:
┌─────────────────────────────────────────────────────────────────────────────┐
│ Prompt Engineering for Benchmarks │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Prompt Format (CRITICAL for reliable extraction): │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ Question: What is the capital of France? │ │
│ │ A) London ◄── Options clearly labeled │ │
│ │ B) Berlin │ │
│ │ C) Paris │ │
│ │ D) Madrid │ │
│ │ │ │
│ │ Answer with just the letter (A, B, C, or D): ◄── EXPLICIT instruction │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │
│ Without the explicit instruction, models might respond: │
│ ❌ "The capital of France is Paris" │
│ ❌ "C) Paris is the capital" │
│ ❌ "Based on my knowledge, the answer is C..." │
│ │
│ With the instruction, models respond: │
│ ✓ "C" │
│ ✓ "C)" │
│ ✓ "C." │
│ │
│ All of these can be parsed by the MultipleChoiceAccuracy metric! │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Part 2: Metrics and Evaluation
Metrics Collectors
Implement various evaluation metrics.
# metrics.py
from typing import List, Dict, Any, Optional
from dataclasses import dataclass, field
import re
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
@dataclass
class MetricResult:
"""Result of a metric calculation."""
name: str
value: float
details: Dict[str, Any] = field(default_factory=dict)
class MetricCalculator:
"""Base class for metric calculators."""
def calculate(
self,
predictions: List[str],
references: List[str]
) -> MetricResult:
raise NotImplementedError
class AccuracyMetric(MetricCalculator):
"""Exact match accuracy."""
def __init__(self, normalize: bool = True):
self.normalize = normalize
def _normalize(self, text: str) -> str:
"""Normalize text for comparison."""
if not self.normalize:
return text
# Lowercase, strip whitespace, remove punctuation
text = text.lower().strip()
text = re.sub(r'[^\w\s]', '', text)
return text
def calculate(
self,
predictions: List[str],
references: List[str]
) -> MetricResult:
correct = 0
total = len(predictions)
for pred, ref in zip(predictions, references):
norm_pred = self._normalize(pred)
norm_ref = self._normalize(ref)
# Check for exact match or if reference is contained
if norm_pred == norm_ref or norm_ref in norm_pred:
correct += 1
accuracy = correct / total if total > 0 else 0
return MetricResult(
name="accuracy",
value=accuracy,
details={"correct": correct, "total": total}
)
class MultipleChoiceAccuracy(MetricCalculator):
"""Accuracy for multiple choice questions."""
def _extract_answer(self, text: str) -> str:
"""Extract the answer letter from model output."""
text = text.strip().upper()
# Direct match
if text in ['A', 'B', 'C', 'D', 'E']:
return text
# Look for patterns like "A)", "A.", "(A)"
patterns = [
r'^([A-E])\)',
r'^([A-E])\.',
r'^\(([A-E])\)',
r'^([A-E])\s',
r'answer[:\s]+([A-E])',
r'([A-E])\s*$',
]
for pattern in patterns:
match = re.search(pattern, text, re.IGNORECASE)
if match:
return match.group(1).upper()
# First letter if it's A-E
if text and text[0] in 'ABCDE':
return text[0]
return ""
def calculate(
self,
predictions: List[str],
references: List[str]
) -> MetricResult:
correct = 0
total = len(predictions)
details_list = []
for pred, ref in zip(predictions, references):
extracted = self._extract_answer(pred)
is_correct = extracted == ref.upper()
if is_correct:
correct += 1
details_list.append({
"predicted": extracted,
"expected": ref,
"correct": is_correct
})
accuracy = correct / total if total > 0 else 0
return MetricResult(
name="mc_accuracy",
value=accuracy,
details={
"correct": correct,
"total": total,
"breakdown": details_list
}
)
class ClassificationMetrics(MetricCalculator):
"""Classification metrics (precision, recall, F1)."""
def __init__(self, labels: Optional[List[str]] = None):
self.labels = labels or ["positive", "negative", "neutral"]
def _normalize_label(self, text: str) -> str:
"""Normalize predicted label."""
text = text.lower().strip()
for label in self.labels:
if label in text:
return label
return text
def calculate(
self,
predictions: List[str],
references: List[str]
) -> MetricResult:
# Normalize predictions
norm_preds = [self._normalize_label(p) for p in predictions]
norm_refs = [r.lower().strip() for r in references]
# Calculate metrics
accuracy = accuracy_score(norm_refs, norm_preds)
precision = precision_score(
norm_refs, norm_preds, average='weighted', zero_division=0
)
recall = recall_score(
norm_refs, norm_preds, average='weighted', zero_division=0
)
f1 = f1_score(
norm_refs, norm_preds, average='weighted', zero_division=0
)
return MetricResult(
name="classification",
value=f1, # Primary metric
details={
"accuracy": accuracy,
"precision": precision,
"recall": recall,
"f1": f1
}
)
class LatencyMetrics:
"""Calculate latency statistics."""
@staticmethod
def calculate(latencies_ms: List[float]) -> MetricResult:
if not latencies_ms:
return MetricResult(
name="latency",
value=0,
details={}
)
arr = np.array(latencies_ms)
return MetricResult(
name="latency",
value=float(np.mean(arr)),
details={
"mean_ms": float(np.mean(arr)),
"median_ms": float(np.median(arr)),
"p95_ms": float(np.percentile(arr, 95)),
"p99_ms": float(np.percentile(arr, 99)),
"min_ms": float(np.min(arr)),
"max_ms": float(np.max(arr)),
"std_ms": float(np.std(arr))
}
)
class ThroughputMetrics:
"""Calculate throughput statistics."""
@staticmethod
def calculate(
total_tokens: int,
total_time_seconds: float
) -> MetricResult:
tokens_per_second = total_tokens / total_time_seconds if total_time_seconds > 0 else 0
return MetricResult(
name="throughput",
value=tokens_per_second,
details={
"tokens_per_second": tokens_per_second,
"total_tokens": total_tokens,
"total_time_seconds": total_time_seconds
}
)
# Metric registry by task type
def get_metric_for_task(task_type: str) -> MetricCalculator:
"""Get appropriate metric calculator for a task type."""
from datasets_manager import TaskType
if task_type == TaskType.MULTIPLE_CHOICE:
return MultipleChoiceAccuracy()
elif task_type == TaskType.CLASSIFICATION:
return ClassificationMetrics()
else:
return AccuracyMetric()
What's Happening Here?
The Metrics module handles the tricky task of extracting and scoring model outputs:
┌─────────────────────────────────────────────────────────────────────────────┐
│ Answer Extraction Challenge │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Model outputs are messy! The same answer can appear many ways: │
│ │
│ All of these mean "C": │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ "C" ◄── Clean single letter │ │
│ │ "C)" ◄── Letter with parenthesis │ │
│ │ "C." ◄── Letter with period │ │
│ │ "(C)" ◄── Letter in parentheses │ │
│ │ "The answer is C" ◄── Sentence with letter │ │
│ │ "Based on my analysis, C is..." ◄── Verbose explanation │ │
│ │ "c" ◄── Lowercase letter │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │
│ MultipleChoiceAccuracy._extract_answer() handles all these: │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ 1. Uppercase the text │ │
│ │ 2. Check if it's a single letter A-E │ │
│ │ 3. Try regex patterns: r'^([A-E])\)', r'answer[:\s]+([A-E])', etc. │ │
│ │ 4. Fall back to first character if A-E │ │
│ │ 5. Return empty string if no match (counted as wrong) │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Understanding Latency Metrics:
┌─────────────────────────────────────────────────────────────────────────────┐
│ Latency Distribution Explained │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Why measure multiple latency values? │
│ │
│ Sample latencies: [120, 125, 130, 128, 122, 450, 127, 124, 126, 123] ms │
│ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ Metric │ Value │ What It Tells You │ │
│ ├─────────────────────────────────────────────────────────────────────────┤ │
│ │ mean_ms │ 157.5 │ Average experience (includes outliers) │ │
│ │ median_ms │ 125.5 │ Typical experience (ignores outliers) │ │
│ │ p95_ms │ 450 │ 5% of requests slower than this │ │
│ │ p99_ms │ 450 │ 1% of requests slower than this │ │
│ │ min_ms │ 120 │ Best case scenario │ │
│ │ max_ms │ 450 │ Worst case scenario (cold start, etc.) │ │
│ │ std_ms │ 102.8 │ How consistent is performance? │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │
│ For SLA definitions: │
│ • Use median for "typical user experience" │
│ • Use P95 for "worst acceptable experience" │
│ • Use P99 for capacity planning (tail latency) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Classification Metrics Deep Dive:
| Metric | Formula | When to Use |
|---|---|---|
| Accuracy | correct / total | Balanced classes |
| Precision | TP / (TP + FP) | Cost of false positives high |
| Recall | TP / (TP + FN) | Cost of false negatives high |
| F1 | 2 × (P × R) / (P + R) | Imbalanced classes, need balance |
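The formulas in the table can be checked with a worked example. The confusion counts here are illustrative, and `prf1` is a throwaway helper rather than project code (the project uses scikit-learn's weighted averages):

```python
def prf1(tp: int, fp: int, fn: int) -> tuple:
    """Precision, recall, and F1 from single-class confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# 8 true positives, 2 false positives, 4 false negatives
p, r, f1 = prf1(tp=8, fp=2, fn=4)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.8 0.667 0.727
```

Note how F1 (0.727) sits between precision (0.8) and recall (0.667), penalizing whichever is lower.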
Resource Profiler
Monitor memory and resource usage during evaluation.
# profiler.py
import psutil
import time
from dataclasses import dataclass, field
from typing import List, Dict, Any, Optional
from contextlib import contextmanager
import threading
@dataclass
class ResourceSnapshot:
"""A snapshot of resource usage."""
timestamp: float
cpu_percent: float
memory_mb: float
memory_percent: float
gpu_memory_mb: Optional[float] = None
gpu_utilization: Optional[float] = None
@dataclass
class ResourceProfile:
"""Complete resource profile for a run."""
snapshots: List[ResourceSnapshot] = field(default_factory=list)
peak_memory_mb: float = 0
avg_cpu_percent: float = 0
duration_seconds: float = 0
def summarize(self) -> Dict[str, Any]:
"""Get summary statistics."""
if not self.snapshots:
return {}
memories = [s.memory_mb for s in self.snapshots]
cpus = [s.cpu_percent for s in self.snapshots]
return {
"duration_seconds": self.duration_seconds,
"peak_memory_mb": max(memories) if memories else 0,
"avg_memory_mb": sum(memories) / len(memories) if memories else 0,
"avg_cpu_percent": sum(cpus) / len(cpus) if cpus else 0,
"sample_count": len(self.snapshots)
}
class ResourceProfiler:
"""Profile resource usage during model execution."""
def __init__(self, interval_seconds: float = 0.1):
self.interval = interval_seconds
self._snapshots: List[ResourceSnapshot] = []
self._running = False
self._thread: Optional[threading.Thread] = None
self._process = psutil.Process()
def _get_gpu_stats(self) -> tuple:
"""Get GPU stats if available."""
try:
import GPUtil
gpus = GPUtil.getGPUs()
if gpus:
gpu = gpus[0]
return gpu.memoryUsed, gpu.load * 100
except Exception:  # GPUtil not installed or GPU query failed
pass
return None, None
def _sample(self):
"""Take a resource snapshot."""
gpu_mem, gpu_util = self._get_gpu_stats()
return ResourceSnapshot(
timestamp=time.time(),
cpu_percent=self._process.cpu_percent(),
memory_mb=self._process.memory_info().rss / (1024 * 1024),
memory_percent=self._process.memory_percent(),
gpu_memory_mb=gpu_mem,
gpu_utilization=gpu_util
)
def _monitor_loop(self):
"""Background monitoring loop."""
while self._running:
self._snapshots.append(self._sample())
time.sleep(self.interval)
def start(self):
"""Start profiling."""
self._snapshots = []
self._running = True
# Get initial CPU reading
self._process.cpu_percent()
self._thread = threading.Thread(target=self._monitor_loop)
self._thread.start()
def stop(self) -> ResourceProfile:
"""Stop profiling and return results."""
self._running = False
if self._thread:
self._thread.join()
profile = ResourceProfile(snapshots=self._snapshots)
if self._snapshots:
profile.peak_memory_mb = max(s.memory_mb for s in self._snapshots)
profile.avg_cpu_percent = sum(s.cpu_percent for s in self._snapshots) / len(self._snapshots)
profile.duration_seconds = self._snapshots[-1].timestamp - self._snapshots[0].timestamp
return profile
@contextmanager
def profile(self):
"""Context manager for profiling."""
self.start()
try:
yield self
finally:
self._profile = self.stop()
def get_profile(self) -> ResourceProfile:
"""Get the last profile."""
return getattr(self, '_profile', ResourceProfile())
What's Happening Here?
The Resource Profiler monitors system usage during model inference:
┌─────────────────────────────────────────────────────────────────────────────┐
│ Resource Profiling Flow │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ profiler.start() │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ Background Thread Started │ │
│ │ │ │
│ │ while running: │ │
│ │ snapshot = { │ │
│ │ timestamp: time.time(), │ │
│ │ cpu_percent: process.cpu_percent(), ◄── How much CPU? │ │
│ │ memory_mb: process.memory_info().rss, ◄── How much RAM? │ │
│ │ gpu_memory_mb: GPUtil.getGPUs()[0].memoryUsed ◄── GPU VRAM? │ │
│ │ } │ │
│ │ snapshots.append(snapshot) │ │
│ │ sleep(0.1 seconds) ◄── Sample 10x per second │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ │ (model inference runs in main thread) │
│ │ │
│ ▼ │
│ profiler.stop() │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ Aggregate Snapshots: │ │
│ │ • peak_memory_mb = max(all memory readings) │ │
│ │ • avg_cpu_percent = mean(all CPU readings) │ │
│ │ • duration_seconds = last_timestamp - first_timestamp │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘

Why Peak Memory Matters:
┌─────────────────────────────────────────────────────────────────────────────┐
│ Memory Usage Patterns │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Memory over time for model inference: │
│ │
│ Memory (MB) │
│ ^ │
│ 2500 │ ┌────────────┐ ◄── Peak: Model fully loaded │
│ │ /│ │ │
│ 2000 │ / │ │ │
│ │ / │ │\ │
│ 1500 │ / │ │ \ │
│ │ / │ Running │ \ │
│ 1000 │ / │ │ \ │
│ │ / │ │ \ │
│ 500 │___/ │ │ \___ │
│ │ Load │ │ Unload │
│ 0 └───────────┴────────────┴─────────────────► Time │
│ Model Inference Cleanup │
│ Loading │
│ │
│ Peak memory determines: │
│ • Minimum RAM/VRAM needed to run the model │
│ • Whether model fits on your hardware │
│ • How many concurrent requests you can handle │
│ │
└─────────────────────────────────────────────────────────────────────────────┘

Resource Requirements by Model Size:
| Model Size | Typical RAM | Min GPU VRAM | Concurrent Limit (8GB RAM) |
|---|---|---|---|
| 1-2B params | 1-2 GB | 2-4 GB | 4-6 instances |
| 3-4B params | 2-3 GB | 4-6 GB | 2-3 instances |
| 7B params | 4-6 GB | 8-12 GB | 1 instance |
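The RAM column can be sanity-checked from parameter count alone: at Q4 quantization a weight costs roughly half a byte, plus runtime overhead for the KV cache and buffers. A back-of-envelope sketch (the 1.3x overhead factor is an assumption, not a measured constant):

```python
def estimate_ram_gb(params_billion: float, bytes_per_param: float = 0.5,
                    overhead: float = 1.3) -> float:
    """Rough RAM estimate: weight storage at the given precision
    (Q4 ~ 0.5 bytes/param, FP16 = 2.0) plus ~30% overhead for the
    KV cache and runtime buffers (the 1.3 factor is a guess)."""
    return params_billion * bytes_per_param * overhead

for label, billions in [("1.7B", 1.7), ("3.8B", 3.8), ("7B", 7.0)]:
    print(f"{label} @ Q4: ~{estimate_ram_gb(billions):.1f} GB")
```

These land at roughly 1.1, 2.5, and 4.6 GB, consistent with the table's ranges; double `bytes_per_param` for Q8, or use 2.0 for unquantized FP16.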
Part 3: Benchmark Runner
Complete Benchmark Runner
Combine all components into a unified runner.
# benchmark_runner.py
from dataclasses import dataclass, field
from typing import List, Dict, Any, Optional
from datetime import datetime
import json
import time
from tqdm import tqdm
from rich.console import Console
from rich.table import Table
from models import ModelClient, ModelInfo, get_model, MODELS
from datasets_manager import BenchmarkDataset, BenchmarkSample, TaskType, get_dataset
from metrics import (
MetricResult, AccuracyMetric, MultipleChoiceAccuracy,
ClassificationMetrics, LatencyMetrics, ThroughputMetrics,
get_metric_for_task
)
from profiler import ResourceProfiler, ResourceProfile
@dataclass
class SampleResult:
"""Result for a single sample."""
sample_id: str
prompt: str
expected: str
prediction: str
correct: bool
latency_ms: float
tokens: int
@dataclass
class BenchmarkResult:
"""Complete benchmark result for a model on a dataset."""
model_name: str
dataset_name: str
timestamp: str
sample_results: List[SampleResult]
quality_metrics: Dict[str, MetricResult]
performance_metrics: Dict[str, MetricResult]
resource_profile: Optional[Dict[str, Any]] = None
def to_dict(self) -> Dict[str, Any]:
"""Convert to dictionary."""
return {
"model": self.model_name,
"dataset": self.dataset_name,
"timestamp": self.timestamp,
"samples": len(self.sample_results),
"quality_metrics": {
k: {"value": v.value, "details": v.details}
for k, v in self.quality_metrics.items()
},
"performance_metrics": {
k: {"value": v.value, "details": v.details}
for k, v in self.performance_metrics.items()
},
"resource_profile": self.resource_profile
}
class BenchmarkRunner:
"""Run benchmarks on models."""
def __init__(
self,
profile_resources: bool = True,
verbose: bool = True
):
self.profile_resources = profile_resources
self.verbose = verbose
self.console = Console()
self.results: List[BenchmarkResult] = []
def run_single(
self,
model_key: str,
dataset_name: str,
max_samples: Optional[int] = None
) -> BenchmarkResult:
"""Run benchmark for a single model on a single dataset."""
model = get_model(model_key)
dataset = get_dataset(dataset_name)
if max_samples:
dataset = dataset.sample(max_samples)
if self.verbose:
self.console.print(f"\n[bold]Running {model.model_info.name} on {dataset.name}[/bold]")
# Initialize profiler
profiler = ResourceProfiler() if self.profile_resources else None
sample_results = []
latencies = []
total_tokens = 0
start_time = time.time()
if profiler:
profiler.start()
# Run evaluation
iterator = tqdm(dataset.samples, desc="Evaluating") if self.verbose else dataset.samples
for sample in iterator:
response = model.generate(sample.prompt)
prediction = response["content"]
latency = response["latency_ms"]
tokens = response["total_tokens"]
latencies.append(latency)
total_tokens += tokens
# Check correctness based on task type
metric = get_metric_for_task(dataset.task_type)
result = metric.calculate([prediction], [sample.expected])
sample_results.append(SampleResult(
sample_id=sample.id,
prompt=sample.prompt[:100] + "...",
expected=sample.expected,
prediction=prediction[:200],
correct=result.value > 0.5,
latency_ms=latency,
tokens=tokens
))
end_time = time.time()
if profiler:
resource_profile = profiler.stop().summarize()
else:
resource_profile = None
        # Calculate aggregate quality metrics. Note: predictions stored in
        # SampleResult are truncated to 200 chars, which is fine for short
        # answers (e.g. "A") but should be lifted for long-form tasks.
        predictions = [r.prediction for r in sample_results]
        references = [r.expected for r in sample_results]
        quality_metric = get_metric_for_task(dataset.task_type)
        quality_result = quality_metric.calculate(predictions, references)
latency_result = LatencyMetrics.calculate(latencies)
throughput_result = ThroughputMetrics.calculate(
total_tokens, end_time - start_time
)
result = BenchmarkResult(
model_name=model.model_info.name,
dataset_name=dataset.name,
timestamp=datetime.now().isoformat(),
sample_results=sample_results,
quality_metrics={quality_result.name: quality_result},
performance_metrics={
"latency": latency_result,
"throughput": throughput_result
},
resource_profile=resource_profile
)
self.results.append(result)
return result
def run_comparison(
self,
model_keys: List[str],
dataset_name: str,
max_samples: Optional[int] = None
) -> List[BenchmarkResult]:
"""Run benchmark comparing multiple models."""
results = []
for model_key in model_keys:
result = self.run_single(model_key, dataset_name, max_samples)
results.append(result)
return results
def run_full_suite(
self,
model_keys: List[str],
dataset_names: List[str],
max_samples: Optional[int] = None
) -> List[BenchmarkResult]:
"""Run full benchmark suite."""
results = []
for dataset_name in dataset_names:
for model_key in model_keys:
result = self.run_single(model_key, dataset_name, max_samples)
results.append(result)
return results
    def print_comparison(self, results: Optional[List[BenchmarkResult]] = None):
"""Print comparison table."""
results = results or self.results
if not results:
self.console.print("[yellow]No results to display[/yellow]")
return
# Group by dataset
datasets = {}
for r in results:
if r.dataset_name not in datasets:
datasets[r.dataset_name] = []
datasets[r.dataset_name].append(r)
for dataset_name, dataset_results in datasets.items():
table = Table(title=f"Results: {dataset_name}")
table.add_column("Model", style="cyan")
table.add_column("Accuracy", justify="right")
table.add_column("Latency (ms)", justify="right")
table.add_column("Throughput (tok/s)", justify="right")
table.add_column("Memory (MB)", justify="right")
for r in dataset_results:
                # Use the first quality metric as the primary value
                quality_value = next(iter(r.quality_metrics.values())).value if r.quality_metrics else 0.0
latency = r.performance_metrics.get("latency")
throughput = r.performance_metrics.get("throughput")
latency_str = f"{latency.details['mean_ms']:.0f}" if latency else "N/A"
throughput_str = f"{throughput.value:.1f}" if throughput else "N/A"
memory_str = f"{r.resource_profile['peak_memory_mb']:.0f}" if r.resource_profile else "N/A"
table.add_row(
r.model_name,
f"{quality_value:.1%}",
latency_str,
throughput_str,
memory_str
)
self.console.print(table)
def save_results(self, filepath: str):
"""Save results to JSON."""
data = [r.to_dict() for r in self.results]
with open(filepath, 'w') as f:
json.dump(data, f, indent=2)
self.console.print(f"[green]Results saved to {filepath}[/green]")
# Example usage
if __name__ == "__main__":
runner = BenchmarkRunner(profile_resources=True)
# Run on multiple models
models = ["phi3-mini", "qwen2.5-3b", "gemma2-2b"]
datasets = ["mmlu", "sentiment"]
results = runner.run_full_suite(models, datasets, max_samples=10)
# Print comparison
runner.print_comparison()
# Save results
    runner.save_results("benchmark_results.json")

Part 4: Visualization and Reporting
Visualization Module
Create charts and visualizations for benchmark results.
# visualizer.py
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from typing import List, Dict, Any, Optional
from pathlib import Path
class BenchmarkVisualizer:
"""Create visualizations for benchmark results."""
def __init__(self, results: List[Dict[str, Any]], output_dir: str = "reports"):
self.results = results
self.output_dir = Path(output_dir)
self.output_dir.mkdir(exist_ok=True)
# Set style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("husl")
def _results_to_df(self) -> pd.DataFrame:
"""Convert results to DataFrame."""
rows = []
for r in self.results:
row = {
"model": r["model"],
"dataset": r["dataset"],
}
# Quality metrics
for name, metric in r.get("quality_metrics", {}).items():
row[f"quality_{name}"] = metric["value"]
# Performance metrics
for name, metric in r.get("performance_metrics", {}).items():
if name == "latency":
row["latency_mean_ms"] = metric["details"].get("mean_ms", 0)
row["latency_p95_ms"] = metric["details"].get("p95_ms", 0)
elif name == "throughput":
row["throughput_tps"] = metric["value"]
# Resource metrics
if r.get("resource_profile"):
row["peak_memory_mb"] = r["resource_profile"].get("peak_memory_mb", 0)
row["avg_cpu_percent"] = r["resource_profile"].get("avg_cpu_percent", 0)
rows.append(row)
return pd.DataFrame(rows)
    def plot_accuracy_comparison(self, save: bool = True) -> Optional[plt.Figure]:
"""Plot accuracy comparison across models and datasets."""
df = self._results_to_df()
# Find accuracy column
accuracy_col = [c for c in df.columns if "accuracy" in c.lower()]
if not accuracy_col:
accuracy_col = [c for c in df.columns if "quality" in c.lower()]
if not accuracy_col:
print("No accuracy column found")
return None
accuracy_col = accuracy_col[0]
fig, ax = plt.subplots(figsize=(10, 6))
pivot = df.pivot(index="model", columns="dataset", values=accuracy_col)
pivot.plot(kind="bar", ax=ax, width=0.8)
ax.set_ylabel("Accuracy")
ax.set_xlabel("Model")
ax.set_title("Model Accuracy by Dataset")
ax.set_ylim(0, 1)
ax.legend(title="Dataset", bbox_to_anchor=(1.02, 1))
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right')
# Add value labels
for container in ax.containers:
ax.bar_label(container, fmt='%.2f', fontsize=8)
plt.tight_layout()
if save:
fig.savefig(self.output_dir / "accuracy_comparison.png", dpi=150, bbox_inches='tight')
return fig
def plot_latency_comparison(self, save: bool = True) -> plt.Figure:
"""Plot latency comparison."""
df = self._results_to_df()
fig, ax = plt.subplots(figsize=(10, 6))
# Create grouped bar chart
        # One group of bars per model, one bar per dataset; size the bar
        # width from the dataset count so groups don't overlap
        models = df["model"].unique()
        datasets = df["dataset"].unique()
        x = np.arange(len(models))
        width = 0.8 / max(len(datasets), 1)
for i, dataset in enumerate(datasets):
data = df[df["dataset"] == dataset]
values = [data[data["model"] == m]["latency_mean_ms"].values[0]
if len(data[data["model"] == m]) > 0 else 0
for m in models]
ax.bar(x + i * width, values, width, label=dataset)
ax.set_ylabel("Latency (ms)")
ax.set_xlabel("Model")
ax.set_title("Mean Latency by Model and Dataset")
        ax.set_xticks(x + width * (len(datasets) - 1) / 2)
ax.set_xticklabels(models, rotation=45, ha='right')
ax.legend(title="Dataset")
plt.tight_layout()
if save:
fig.savefig(self.output_dir / "latency_comparison.png", dpi=150, bbox_inches='tight')
return fig
    def plot_tradeoff(self, save: bool = True) -> Optional[plt.Figure]:
"""Plot accuracy vs latency tradeoff."""
df = self._results_to_df()
# Find accuracy column
accuracy_col = [c for c in df.columns if "accuracy" in c.lower() or "quality" in c.lower()]
accuracy_col = accuracy_col[0] if accuracy_col else None
if not accuracy_col or "latency_mean_ms" not in df.columns:
print("Required columns not found")
return None
fig, ax = plt.subplots(figsize=(10, 6))
# Get unique models
models = df["model"].unique()
colors = sns.color_palette("husl", len(models))
model_colors = dict(zip(models, colors))
for _, row in df.iterrows():
ax.scatter(
row["latency_mean_ms"],
row[accuracy_col],
c=[model_colors[row["model"]]],
s=100,
alpha=0.7
)
ax.annotate(
f'{row["model"]}\n({row["dataset"]})',
(row["latency_mean_ms"], row[accuracy_col]),
textcoords="offset points",
xytext=(5, 5),
fontsize=8
)
ax.set_xlabel("Latency (ms)")
ax.set_ylabel("Accuracy")
ax.set_title("Accuracy vs Latency Tradeoff")
# Add legend
handles = [plt.scatter([], [], c=[c], label=m) for m, c in model_colors.items()]
ax.legend(handles=handles, title="Model", bbox_to_anchor=(1.02, 1))
plt.tight_layout()
if save:
fig.savefig(self.output_dir / "tradeoff.png", dpi=150, bbox_inches='tight')
return fig
    def plot_resource_usage(self, save: bool = True) -> Optional[plt.Figure]:
"""Plot resource usage comparison."""
df = self._results_to_df()
if "peak_memory_mb" not in df.columns:
print("No resource data available")
return None
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Memory usage
ax1 = axes[0]
df_grouped = df.groupby("model")["peak_memory_mb"].mean().sort_values()
df_grouped.plot(kind="barh", ax=ax1, color=sns.color_palette("husl", len(df_grouped)))
ax1.set_xlabel("Peak Memory (MB)")
ax1.set_title("Peak Memory Usage by Model")
# CPU usage
ax2 = axes[1]
if "avg_cpu_percent" in df.columns:
df_grouped = df.groupby("model")["avg_cpu_percent"].mean().sort_values()
df_grouped.plot(kind="barh", ax=ax2, color=sns.color_palette("husl", len(df_grouped)))
ax2.set_xlabel("Average CPU (%)")
ax2.set_title("Average CPU Usage by Model")
plt.tight_layout()
if save:
fig.savefig(self.output_dir / "resource_usage.png", dpi=150, bbox_inches='tight')
return fig
    def generate_radar_chart(self, save: bool = True) -> Optional[plt.Figure]:
"""Generate radar chart comparing models across metrics."""
df = self._results_to_df()
# Aggregate by model
models = df["model"].unique()
# Metrics to include (normalize each)
metrics = []
if "quality_mc_accuracy" in df.columns:
metrics.append(("quality_mc_accuracy", "Accuracy", False))
if "latency_mean_ms" in df.columns:
metrics.append(("latency_mean_ms", "Speed", True)) # Inverse
if "throughput_tps" in df.columns:
metrics.append(("throughput_tps", "Throughput", False))
if "peak_memory_mb" in df.columns:
metrics.append(("peak_memory_mb", "Memory Eff.", True)) # Inverse
if len(metrics) < 3:
print("Not enough metrics for radar chart")
return None
# Prepare data
num_metrics = len(metrics)
angles = np.linspace(0, 2 * np.pi, num_metrics, endpoint=False).tolist()
angles += angles[:1] # Complete the loop
fig, ax = plt.subplots(figsize=(8, 8), subplot_kw=dict(polar=True))
for model in models:
model_data = df[df["model"] == model]
values = []
for col, _, inverse in metrics:
if col in model_data.columns:
val = model_data[col].mean()
# Normalize to 0-1
col_min = df[col].min()
col_max = df[col].max()
if col_max > col_min:
normalized = (val - col_min) / (col_max - col_min)
else:
normalized = 0.5
if inverse:
normalized = 1 - normalized
values.append(normalized)
else:
values.append(0)
values += values[:1] # Complete the loop
ax.plot(angles, values, 'o-', linewidth=2, label=model)
ax.fill(angles, values, alpha=0.1)
# Labels
metric_labels = [m[1] for m in metrics]
ax.set_xticks(angles[:-1])
ax.set_xticklabels(metric_labels)
ax.set_ylim(0, 1)
ax.legend(loc='upper right', bbox_to_anchor=(1.3, 1.0))
ax.set_title("Model Comparison Radar", y=1.08)
plt.tight_layout()
if save:
fig.savefig(self.output_dir / "radar_chart.png", dpi=150, bbox_inches='tight')
return fig
def generate_report(self) -> str:
"""Generate a complete HTML report."""
# Generate all plots
self.plot_accuracy_comparison()
self.plot_latency_comparison()
self.plot_tradeoff()
self.plot_resource_usage()
        self.generate_radar_chart()
df = self._results_to_df()
html = f"""
<!DOCTYPE html>
<html>
<head>
<title>SLM Benchmark Report</title>
<style>
body {{ font-family: Arial, sans-serif; margin: 40px; }}
h1 {{ color: #333; }}
h2 {{ color: #666; margin-top: 30px; }}
table {{ border-collapse: collapse; width: 100%; margin: 20px 0; }}
th, td {{ border: 1px solid #ddd; padding: 12px; text-align: left; }}
th {{ background-color: #4CAF50; color: white; }}
tr:nth-child(even) {{ background-color: #f2f2f2; }}
img {{ max-width: 100%; margin: 20px 0; border: 1px solid #ddd; }}
.metric {{ font-size: 24px; font-weight: bold; color: #4CAF50; }}
.summary {{ display: flex; gap: 20px; flex-wrap: wrap; }}
.summary-card {{ background: #f9f9f9; padding: 20px; border-radius: 8px; flex: 1; min-width: 200px; }}
</style>
</head>
<body>
<h1>SLM Benchmark Report</h1>
<p>Generated: {pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S')}</p>
<h2>Summary</h2>
<div class="summary">
<div class="summary-card">
<div class="metric">{len(df['model'].unique())}</div>
<div>Models Tested</div>
</div>
<div class="summary-card">
<div class="metric">{len(df['dataset'].unique())}</div>
<div>Datasets Used</div>
</div>
<div class="summary-card">
<div class="metric">{len(self.results)}</div>
<div>Total Benchmarks</div>
</div>
</div>
<h2>Results Table</h2>
{df.to_html(index=False, float_format='%.3f')}
<h2>Accuracy Comparison</h2>
<img src="accuracy_comparison.png" alt="Accuracy Comparison">
<h2>Latency Comparison</h2>
<img src="latency_comparison.png" alt="Latency Comparison">
<h2>Accuracy vs Latency Tradeoff</h2>
<img src="tradeoff.png" alt="Tradeoff Chart">
<h2>Resource Usage</h2>
<img src="resource_usage.png" alt="Resource Usage">
<h2>Model Comparison Radar</h2>
<img src="radar_chart.png" alt="Radar Chart">
</body>
</html>
"""
report_path = self.output_dir / "report.html"
with open(report_path, 'w') as f:
f.write(html)
print(f"Report saved to {report_path}")
        return str(report_path)

Part 5: Complete CLI Application
Main Application
Create a command-line interface for running benchmarks.
# main.py
import argparse
import json
from pathlib import Path
from rich.console import Console
from rich.panel import Panel
from models import list_models, MODELS
from datasets_manager import list_datasets
from benchmark_runner import BenchmarkRunner
from visualizer import BenchmarkVisualizer
console = Console()
def list_available():
"""List available models and datasets."""
console.print("\n[bold cyan]Available Models:[/bold cyan]")
for key, info in MODELS.items():
console.print(f" • {key}: {info.name} ({info.parameters}, {info.provider})")
console.print("\n[bold cyan]Available Datasets:[/bold cyan]")
for name in list_datasets():
console.print(f" • {name}")
def run_benchmark(args):
"""Run benchmark with specified parameters."""
console.print(Panel.fit(
"[bold]SLM Benchmarking Suite[/bold]",
border_style="cyan"
))
# Parse models and datasets
models = args.models.split(",") if args.models else list(MODELS.keys())
datasets = args.datasets.split(",") if args.datasets else list_datasets()
console.print(f"\n[cyan]Models:[/cyan] {', '.join(models)}")
console.print(f"[cyan]Datasets:[/cyan] {', '.join(datasets)}")
console.print(f"[cyan]Max samples:[/cyan] {args.max_samples or 'All'}")
# Run benchmarks
runner = BenchmarkRunner(
profile_resources=not args.no_profile,
verbose=not args.quiet
)
results = runner.run_full_suite(
model_keys=models,
dataset_names=datasets,
max_samples=args.max_samples
)
# Print comparison
runner.print_comparison()
# Save raw results
if args.output:
runner.save_results(args.output)
# Generate report
if args.report:
results_data = [r.to_dict() for r in results]
visualizer = BenchmarkVisualizer(results_data, output_dir=args.report_dir)
report_path = visualizer.generate_report()
console.print(f"\n[green]Report generated: {report_path}[/green]")
return results
def compare_models(args):
"""Quick comparison of specified models."""
console.print(f"\n[bold]Comparing models on {args.dataset}[/bold]\n")
models = args.models.split(",")
runner = BenchmarkRunner(profile_resources=True)
results = runner.run_comparison(
model_keys=models,
dataset_name=args.dataset,
max_samples=args.max_samples
)
runner.print_comparison()
# Find best model
best_accuracy = 0
best_model = None
for r in results:
for metric in r.quality_metrics.values():
if metric.value > best_accuracy:
best_accuracy = metric.value
best_model = r.model_name
console.print(f"\n[bold green]Best model: {best_model} ({best_accuracy:.1%} accuracy)[/bold green]")
def main():
parser = argparse.ArgumentParser(
description="SLM Benchmarking Suite",
formatter_class=argparse.RawDescriptionHelpFormatter
)
subparsers = parser.add_subparsers(dest="command", help="Commands")
# List command
list_parser = subparsers.add_parser("list", help="List available models and datasets")
# Run command
run_parser = subparsers.add_parser("run", help="Run benchmarks")
run_parser.add_argument("-m", "--models", help="Comma-separated model keys")
run_parser.add_argument("-d", "--datasets", help="Comma-separated dataset names")
run_parser.add_argument("-n", "--max-samples", type=int, help="Max samples per dataset")
run_parser.add_argument("-o", "--output", help="Output JSON file")
run_parser.add_argument("--report", action="store_true", help="Generate HTML report")
run_parser.add_argument("--report-dir", default="reports", help="Report output directory")
run_parser.add_argument("--no-profile", action="store_true", help="Disable resource profiling")
run_parser.add_argument("-q", "--quiet", action="store_true", help="Quiet mode")
# Compare command
compare_parser = subparsers.add_parser("compare", help="Quick model comparison")
compare_parser.add_argument("-m", "--models", required=True, help="Comma-separated models")
compare_parser.add_argument("-d", "--dataset", default="mmlu", help="Dataset to use")
compare_parser.add_argument("-n", "--max-samples", type=int, default=20)
args = parser.parse_args()
if args.command == "list":
list_available()
elif args.command == "run":
run_benchmark(args)
elif args.command == "compare":
compare_models(args)
else:
parser.print_help()
if __name__ == "__main__":
    main()

Example Usage
# List available options
python main.py list
# Run full benchmark suite
python main.py run --report
# Run specific models on specific datasets
python main.py run -m phi3-mini,qwen2.5-3b -d mmlu,sentiment -n 50
# Quick comparison
python main.py compare -m phi3-mini,qwen2.5-3b,gemma2-2b -d mmlu -n 30
# Save results to file
python main.py run -m phi3-mini -d mmlu -o results.json

Model Selection Guide
Based on typical benchmark results:
| Model | Best For | Accuracy | Speed | Memory |
|---|---|---|---|---|
| Phi-3 Mini | Reasoning, Math | High | Medium | 2.3GB |
| Qwen 2.5 3B | Multilingual, Extraction | High | Medium | 2.0GB |
| Gemma 2 2B | Generation, Fast tasks | Medium | Fast | 1.6GB |
| Llama 3.2 3B | General purpose | High | Medium | 2.0GB |
| SmolLM 2 1.7B | Resource-constrained | Medium | Very Fast | 1.0GB |
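There is no single "best" row in this table: the right pick depends on how heavily you weight accuracy against speed and memory. One way to make that tradeoff explicit is a weighted score over min-max-normalized metrics, the same normalization the radar chart uses. A minimal sketch; the figures below are illustrative placeholders, not measured benchmarks:

```python
def min_max(values):
    """Normalize a list of numbers to [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.5 for v in values]

# Illustrative figures only: (accuracy, mean latency ms, peak memory GB)
candidates = {
    "phi3-mini":    (0.78, 850, 2.3),
    "gemma2-2b":    (0.65, 420, 1.6),
    "smollm2-1.7b": (0.58, 300, 1.0),
}
names = list(candidates)
acc = min_max([candidates[n][0] for n in names])
# Invert latency and memory so that higher always means better
lat = [1 - v for v in min_max([candidates[n][1] for n in names])]
mem = [1 - v for v in min_max([candidates[n][2] for n in names])]

weights = (0.5, 0.3, 0.2)  # accuracy, speed, memory: tune to your use case
scores = {n: weights[0] * a + weights[1] * l + weights[2] * m
          for n, a, l, m in zip(names, acc, lat, mem)}
best = max(scores, key=scores.get)
print(f"best under these weights: {best} ({scores[best]:.3f})")
```

With these weights the speed/memory middle ground wins; shift them toward accuracy (e.g. 0.8/0.1/0.1) and Phi-3 comes out on top instead, which is exactly the decision the table is asking you to make.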
Exercises
- Add More Datasets: Implement loading from HuggingFace datasets like MMLU, HellaSwag, or ARC
- Custom Metrics: Create a custom metric for your specific use case (e.g., code execution accuracy)
- Automated Model Selection: Build a recommender that suggests the best model based on task requirements
- Continuous Benchmarking: Set up a scheduled job to track model performance over time
Loading HuggingFace Datasets
# huggingface_datasets.py
from datasets import load_dataset
from datasets_manager import BenchmarkDataset, BenchmarkSample, TaskType
def load_mmlu_from_hf(subject: str = "elementary_mathematics", split: str = "test") -> BenchmarkDataset:
"""Load MMLU dataset from HuggingFace."""
dataset = load_dataset("cais/mmlu", subject, split=split)
samples = []
for i, item in enumerate(dataset):
choices = item["choices"]
prompt = f"""Question: {item["question"]}
A) {choices[0]}
B) {choices[1]}
C) {choices[2]}
D) {choices[3]}
Answer with just the letter (A, B, C, or D):"""
answer_map = {0: "A", 1: "B", 2: "C", 3: "D"}
expected = answer_map[item["answer"]]
samples.append(BenchmarkSample(
id=f"mmlu_{subject}_{i}",
prompt=prompt,
expected=expected,
task_type=TaskType.MULTIPLE_CHOICE,
category=subject
))
return BenchmarkDataset(
name=f"MMLU-{subject}",
description=f"MMLU {subject} subset",
task_type=TaskType.MULTIPLE_CHOICE,
samples=samples
    )

Key Concepts Recap
| Concept | What It Is | Why It Matters |
|---|---|---|
| MMLU | Massive Multitask Language Understanding benchmark | Standard test for general knowledge |
| Accuracy | % of correct answers | Primary quality metric |
| F1 Score | Harmonic mean of precision and recall | Better for imbalanced classes |
| Latency | Time from prompt to response complete | User experience metric |
| Time-to-First-Token | Time until first token generated | Perceived responsiveness |
| Throughput | Tokens per second generated | Capacity planning metric |
| P95/P99 Latency | 95th/99th percentile latency | Worst-case performance |
| Peak Memory | Maximum RAM used during inference | Hardware requirements |
| Quantization Impact | Quality loss from Q8→Q4→Q2 | Usually under 5% for Q4_K_M |
| Radar Chart | Multi-dimensional model comparison | Visualize tradeoffs at once |
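Two of the rows above deserve a concrete illustration: a healthy-looking mean latency can hide a bad tail, which is why P95/P99 get their own row. A quick sketch (the sample values are made up):

```python
import numpy as np

# Made-up latency samples (ms) with two slow outliers
latencies_ms = [120, 135, 128, 450, 132, 140, 125, 900, 130, 138]

print(f"mean: {np.mean(latencies_ms):6.1f} ms")
print(f"p50:  {np.percentile(latencies_ms, 50):6.1f} ms")
print(f"p95:  {np.percentile(latencies_ms, 95):6.1f} ms")
# The outliers pull the mean far above the median, while p95
# surfaces the tail behavior a mean-only report would hide.
```

Here the mean sits well above the median but far below p95: report all three, since a single number can misrepresent how the model actually feels to users.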
Next Steps
- SLM Fine-tuning - Customize models for better task performance
- SLM-Powered RAG - Combine SLMs with retrieval
- Production SLM System - Deploy SLMs at scale