HuggingFace Ecosystem · Intermediate
Model Evaluation & Benchmarks
Comprehensive model evaluation with standard and custom metrics using the evaluate library
TL;DR
Use the HuggingFace evaluate library to measure model performance with standard metrics (BLEU, ROUGE, BERTScore, accuracy, F1), build custom evaluation pipelines, compare models side-by-side, and generate model cards with evaluation results.
Build comprehensive model evaluation pipelines using the HuggingFace evaluate library, covering NLP metrics, custom metrics, model comparison, and model cards.
What You'll Learn
- Standard NLP metrics (BLEU, ROUGE, BERTScore)
- Classification metrics (accuracy, F1, precision, recall)
- Custom metric creation and combination
- Model comparison pipelines
- Model cards with evaluation results
- Benchmark suite design
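As a warm-up for the classification metrics listed above, here is a hand-rolled sketch of precision, recall, and F1 on a toy imbalanced label set. This is illustrative only; the pipelines below delegate to the evaluate library instead.

```python
def prf1(predictions: list[int], references: list[int]) -> dict:
    """Hand-rolled binary precision/recall/F1, to make the definitions concrete."""
    tp = sum(p == r == 1 for p, r in zip(predictions, references))
    fp = sum(p == 1 and r == 0 for p, r in zip(predictions, references))
    fn = sum(p == 0 and r == 1 for p, r in zip(predictions, references))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# One true positive, one false positive, no misses:
# precision 0.5, recall 1.0, F1 = 2/3 — accuracy alone (5/6) would hide the noise
scores = prf1([1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0])
```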
Tech Stack
| Component | Technology |
|---|---|
| Metrics | evaluate |
| Models | transformers |
| Hub | huggingface_hub |
| Python | 3.10+ |
Architecture
┌──────────────────────────────────────────────────────────────────────────────┐
│ EVALUATION PIPELINE │
├──────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ ┌────────────────┐ ┌──────────────────┐ ┌───────────┐ │
│ │ Model │──▶│ Predictions │──▶│ Metric Suite │──▶│ Report │ │
│ │ Outputs │ │ (generated │ │ │ │ │ │
│ │ │ │ text or │ │ • BLEU │ │ • Scores │ │
│ │ │ │ labels) │ │ • ROUGE │ │ • Compare │ │
│ └──────────┘ └────────────────┘ │ • BERTScore │ │ • Card │ │
│ │ • Accuracy │ └───────────┘ │
│ ┌──────────┐ │ • F1 │ │
│ │ Ground │──────────────────────▶│ • Custom │ │
│ │ Truth │ │ │ │
│ └──────────┘ └──────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────────────┘

Project Structure
model-evaluation/
├── src/
│ ├── __init__.py
│ ├── metrics.py # Standard metrics with evaluate
│ ├── custom_metrics.py # Custom metric creation
│ ├── comparison.py # Model comparison pipeline
│ ├── model_card.py # Generate model cards with eval results
│ └── benchmark.py # Benchmark suite
├── examples/
│ └── evaluate_models.py
├── requirements.txt
└── README.md

Implementation
Step 1: Dependencies
evaluate>=0.4.0
transformers>=4.40.0
datasets>=2.19.0
huggingface_hub>=0.23.0
rouge-score>=0.1.2
bert-score>=0.3.13
nltk>=3.8.0
scikit-learn>=1.4.0

Step 2: Standard Metrics
"""Standard NLP and classification metrics using the evaluate library."""
import evaluate
import numpy as np
class MetricSuite:
"""
Collection of standard metrics for model evaluation.
The evaluate library provides 100+ metrics with consistent APIs.
Metrics are loaded by name and called with predictions + references.
"""
def __init__(self):
# Load metrics (downloads metric code from Hub on first use)
self._accuracy = evaluate.load("accuracy")
self._f1 = evaluate.load("f1")
self._precision = evaluate.load("precision")
self._recall = evaluate.load("recall")
def classification_report(
self,
predictions: list[int],
references: list[int],
average: str = "weighted",
) -> dict:
"""Compute all classification metrics."""
results = {}
results.update(
self._accuracy.compute(
predictions=predictions,
references=references,
)
)
results.update(
self._f1.compute(
predictions=predictions,
references=references,
average=average,
)
)
results.update(
self._precision.compute(
predictions=predictions,
references=references,
average=average,
)
)
results.update(
self._recall.compute(
predictions=predictions,
references=references,
average=average,
)
)
return results
class GenerationMetrics:
"""Metrics for evaluating text generation quality."""
def __init__(self):
self._bleu = evaluate.load("bleu")
self._rouge = evaluate.load("rouge")
self._bertscore = evaluate.load("bertscore")
def compute_bleu(
self,
predictions: list[str],
references: list[list[str]],
) -> dict:
"""
BLEU: measures n-gram overlap between prediction and reference.
Scores range from 0 to 1. Good scores:
- Machine translation: 0.30-0.50
- Summarization: 0.15-0.30
- General text generation: 0.10-0.25
"""
return self._bleu.compute(
predictions=predictions,
references=references,
)
def compute_rouge(
self,
predictions: list[str],
references: list[str],
) -> dict:
"""
ROUGE: measures recall of n-grams from reference in prediction.
Variants:
- rouge1: Unigram overlap
- rouge2: Bigram overlap
- rougeL: Longest common subsequence
- rougeLsum: rougeL over sentences
"""
return self._rouge.compute(
predictions=predictions,
references=references,
)
def compute_bertscore(
self,
predictions: list[str],
references: list[str],
lang: str = "en",
) -> dict:
"""
BERTScore: semantic similarity using contextual embeddings.
Unlike BLEU/ROUGE which count exact n-gram matches,
BERTScore measures meaning similarity:
- "The cat sat on the mat" vs "A feline rested on the rug"
- BLEU/ROUGE: low score (different words)
- BERTScore: high score (same meaning)
"""
results = self._bertscore.compute(
predictions=predictions,
references=references,
lang=lang,
)
return {
"bertscore_precision": float(np.mean(results["precision"])),
"bertscore_recall": float(np.mean(results["recall"])),
"bertscore_f1": float(np.mean(results["f1"])),
}
def full_report(
self,
predictions: list[str],
references: list[str],
) -> dict:
"""Compute all generation metrics."""
results = {}
# ROUGE
results.update(self.compute_rouge(predictions, references))
# BERTScore
results.update(self.compute_bertscore(predictions, references))
# BLEU (needs list-of-lists for references)
bleu_refs = [[ref] for ref in references]
bleu_result = self.compute_bleu(predictions, bleu_refs)
results["bleu"] = bleu_result["bleu"]
return resultsMetric Comparison:
┌─────────────────────────────────────────────────────────────────┐
│ WHEN TO USE WHICH METRIC │
├─────────────────────────────────────────────────────────────────┤
│ │
│ CLASSIFICATION (discrete labels) │
│ ┌──────────────┬────────────────────────────────────────┐ │
│ │ Accuracy │ Overall % correct. Misleading if │ │
│ │ │ classes are imbalanced. │ │
│ │ F1 (macro) │ Harmonic mean of precision & recall. │ │
│ │ │ Good for imbalanced classes. │ │
│ │ Precision │ Of predicted positives, how many right? │ │
│ │ Recall │ Of actual positives, how many found? │ │
│ └──────────────┴────────────────────────────────────────┘ │
│ │
│ GENERATION (free-form text) │
│ ┌──────────────┬────────────────────────────────────────┐ │
│ │ BLEU │ n-gram precision. Good for translation. │ │
│ │ ROUGE │ n-gram recall. Good for summarization. │ │
│ │ BERTScore │ Semantic similarity with embeddings. │ │
│ │ │ Best for paraphrase and meaning. │ │
│ └──────────────┴────────────────────────────────────────┘ │
│ │
│ Rule of thumb: │
│ • Translation → BLEU + BERTScore │
│ • Summarization → ROUGE + BERTScore │
│ • Classification → F1 (weighted or macro) │
│ • QA → Exact Match + F1 (token-level) │
│ │
└─────────────────────────────────────────────────────────────────┘

Step 3: Custom Metrics
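Before writing custom metrics, it helps to see concretely why surface-overlap metrics punish paraphrases (the cat/mat example from the BERTScore docstring in Step 2). Below is a hand-rolled clipped unigram overlap, a simplified stand-in for ROUGE-1/BLEU-1 rather than the library implementation:

```python
from collections import Counter

def unigram_overlap(prediction: str, reference: str) -> dict:
    """Clipped unigram counts: the core of ROUGE-1 / BLEU-1 (simplified)."""
    pred = Counter(prediction.lower().split())
    ref = Counter(reference.lower().split())
    # Clip each word's count by its count in the reference
    overlap = sum(min(pred[w], ref[w]) for w in pred)
    return {
        "precision": overlap / max(sum(pred.values()), 1),
        "recall": overlap / max(sum(ref.values()), 1),
    }

# Same meaning, almost no shared words: only "on" and "the" overlap,
# so precision = recall = 2/6. BERTScore would rate this pair much higher.
scores = unigram_overlap("A feline rested on the rug", "The cat sat on the mat")
```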
"""Create custom evaluation metrics."""
import evaluate
import numpy as np
class ConsistencyMetric:
"""
Custom metric: measure response consistency.
Given the same question asked differently, do we get
the same answer? Useful for chatbot evaluation.
"""
def compute(
self,
predictions_a: list[str],
predictions_b: list[str],
) -> dict:
"""Compare predictions from paraphrased inputs."""
bertscore = evaluate.load("bertscore")
results = bertscore.compute(
predictions=predictions_a,
references=predictions_b,
lang="en",
)
return {
"consistency_score": float(np.mean(results["f1"])),
}
class LengthPenaltyMetric:
"""
Custom metric: penalize overly long or short responses.
Useful for ensuring summarization quality or checking
that generation stays within expected bounds.
"""
def compute(
self,
predictions: list[str],
target_length: int = 100,
tolerance: float = 0.3,
) -> dict:
"""Score based on how close to target length."""
lengths = [len(p.split()) for p in predictions]
penalties = []
for length in lengths:
ratio = length / target_length
if 1 - tolerance <= ratio <= 1 + tolerance:
penalties.append(1.0)
else:
penalties.append(max(0, 1 - abs(ratio - 1)))
return {
"length_score": float(np.mean(penalties)),
"avg_length": float(np.mean(lengths)),
"length_std": float(np.std(lengths)),
}
def combine_metrics(
predictions: list[str],
references: list[str],
weights: dict[str, float] | None = None,
) -> dict:
"""
Combine multiple metrics into a single evaluation report.
Args:
predictions: Model outputs
references: Ground truth
weights: Metric weights for combined score
"""
if weights is None:
weights = {"rouge1": 0.3, "bertscore_f1": 0.5, "length_score": 0.2}
gen_metrics = evaluate.combine([
evaluate.load("rouge"),
evaluate.load("bleu"),
])
results = gen_metrics.compute(
predictions=predictions,
references=references,
)
# Add BERTScore separately (different API)
bertscore = evaluate.load("bertscore")
bs_results = bertscore.compute(
predictions=predictions,
references=references,
lang="en",
)
results["bertscore_f1"] = float(np.mean(bs_results["f1"]))
# Add length metric
length_metric = LengthPenaltyMetric()
results.update(length_metric.compute(predictions))
# Compute weighted combined score
combined = sum(
results.get(k, 0) * v
for k, v in weights.items()
)
results["combined_score"] = combined
return resultsStep 4: Model Comparison
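The combined score at the end of Step 3 is a plain weighted sum over selected metric keys. As a quick standalone sanity check of that arithmetic (same default weights as combine_metrics, with illustrative metric values):

```python
def weighted_score(results: dict, weights: dict[str, float]) -> float:
    """Weighted sum over selected metric keys; missing keys count as 0."""
    return sum(results.get(k, 0) * v for k, v in weights.items())

weights = {"rouge1": 0.3, "bertscore_f1": 0.5, "length_score": 0.2}
results = {"rouge1": 0.4, "bertscore_f1": 0.8, "length_score": 1.0, "bleu": 0.2}
# 0.4*0.3 + 0.8*0.5 + 1.0*0.2 = 0.72; "bleu" is ignored (not in weights)
combined = weighted_score(results, weights)
```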
"""Compare multiple models on the same benchmark."""
from transformers import pipeline
from datasets import load_dataset
from .metrics import MetricSuite, GenerationMetrics
def compare_classification_models(
model_names: list[str],
dataset_name: str = "imdb",
split: str = "test[:200]",
) -> dict:
"""Compare classification models on the same test set."""
dataset = load_dataset(dataset_name, split=split)
metrics = MetricSuite()
results = {}
for model_name in model_names:
print(f"Evaluating {model_name}...")
pipe = pipeline("text-classification", model=model_name, device=0)
predictions = []
for example in dataset:
result = pipe(example["text"], truncation=True, max_length=512)
# Map label string to int
pred = 1 if result[0]["label"] == "POSITIVE" else 0
predictions.append(pred)
references = dataset["label"]
scores = metrics.classification_report(predictions, references)
results[model_name] = scores
print(f" Accuracy: {scores['accuracy']:.4f}, F1: {scores['f1']:.4f}")
return results
def compare_summarization_models(
model_names: list[str],
dataset_name: str = "cnn_dailymail",
dataset_config: str = "3.0.0",
num_samples: int = 100,
) -> dict:
"""Compare summarization models on the same test set."""
dataset = load_dataset(
dataset_name, dataset_config, split=f"test[:{num_samples}]"
)
gen_metrics = GenerationMetrics()
results = {}
for model_name in model_names:
print(f"Evaluating {model_name}...")
pipe = pipeline("summarization", model=model_name, device=0)
predictions = []
for example in dataset:
summary = pipe(
example["article"],
max_length=130,
min_length=30,
truncation=True,
)
predictions.append(summary[0]["summary_text"])
references = dataset["highlights"]
scores = gen_metrics.full_report(predictions, references)
results[model_name] = scores
print(f" ROUGE-1: {scores['rouge1']:.4f}, BERTScore: {scores['bertscore_f1']:.4f}")
return results
def format_comparison_table(results: dict) -> str:
"""Format comparison results as a markdown table."""
if not results:
return ""
models = list(results.keys())
metrics = list(results[models[0]].keys())
# Header
header = "| Model | " + " | ".join(metrics) + " |"
separator = "|" + "|".join(["---"] * (len(metrics) + 1)) + "|"
rows = []
for model in models:
values = [f"{results[model][m]:.4f}" for m in metrics]
rows.append(f"| {model} | " + " | ".join(values) + " |")
return "\n".join([header, separator] + rows)Step 5: Model Cards
"""Generate model cards with evaluation results."""
from huggingface_hub import ModelCard, ModelCardData
def create_model_card(
model_name: str,
base_model: str,
eval_results: dict,
task: str = "text-classification",
dataset: str = "imdb",
training_details: dict | None = None,
) -> str:
"""
Generate a model card with evaluation results.
Model cards are standardized documentation for ML models,
required by HuggingFace Hub for good practices.
"""
card_data = ModelCardData(
model_name=model_name,
language="en",
license="mit",
library_name="transformers",
tags=["text-classification", "fine-tuned"],
datasets=[dataset],
metrics=list(eval_results.keys()),
eval_results=[
{
"task": {"type": task},
"dataset": {"type": dataset, "name": dataset},
"metrics": [
{"type": k, "value": v}
for k, v in eval_results.items()
],
}
],
)
training_info = ""
if training_details:
training_info = "\n## Training Details\n\n"
for key, value in training_details.items():
training_info += f"- **{key}**: {value}\n"
card_content = f"""---
{card_data.to_yaml()}
---
# {model_name}
Fine-tuned from [{base_model}](https://huggingface.co/{base_model}) on {dataset}.
## Evaluation Results
| Metric | Score |
|--------|-------|
""" + "\n".join(f"| {k} | {v:.4f} |" for k, v in eval_results.items())
card_content += training_info
return card_contentStep 6: Benchmark Suite
"""Design reusable benchmark suites."""
from dataclasses import dataclass
from typing import Callable
from datasets import load_dataset
from .metrics import MetricSuite, GenerationMetrics
@dataclass
class BenchmarkTask:
"""A single benchmark task definition."""
name: str
dataset: str
dataset_config: str | None
split: str
input_column: str
target_column: str
task_type: str # "classification" or "generation"
num_samples: int = 200
class BenchmarkSuite:
"""Run a suite of benchmarks against any model."""
STANDARD_TASKS = [
BenchmarkTask(
name="Sentiment (IMDB)",
dataset="imdb",
dataset_config=None,
split="test[:200]",
input_column="text",
target_column="label",
task_type="classification",
),
BenchmarkTask(
name="Summarization (CNN/DM)",
dataset="cnn_dailymail",
dataset_config="3.0.0",
split="test[:50]",
input_column="article",
target_column="highlights",
task_type="generation",
),
]
def __init__(self, tasks: list[BenchmarkTask] | None = None):
self.tasks = tasks or self.STANDARD_TASKS
self.classification_metrics = MetricSuite()
self.generation_metrics = GenerationMetrics()
def run(
self,
predict_fn: Callable[[list[str]], list],
task_filter: list[str] | None = None,
) -> dict:
"""
Run all benchmark tasks.
Args:
predict_fn: Function that takes list[str] and returns predictions
task_filter: Only run tasks with these names
"""
results = {}
for task in self.tasks:
if task_filter and task.name not in task_filter:
continue
print(f"Running: {task.name}")
dataset = load_dataset(
task.dataset,
task.dataset_config,
split=task.split,
)
inputs = dataset[task.input_column]
targets = dataset[task.target_column]
predictions = predict_fn(inputs)
if task.task_type == "classification":
scores = self.classification_metrics.classification_report(
predictions, targets,
)
else:
scores = self.generation_metrics.full_report(
predictions, targets,
)
results[task.name] = scores
print(f" {scores}")
return resultsRunning the Project
# Install dependencies
pip install -r requirements.txt
# Evaluate a classification model
python -c "
from src.metrics import MetricSuite
metrics = MetricSuite()
results = metrics.classification_report(
predictions=[1, 0, 1, 1, 0],
references=[1, 0, 0, 1, 0],
)
print(results)
"
# Compare summarization models
python -c "
from src.comparison import compare_summarization_models
results = compare_summarization_models(
model_names=['facebook/bart-large-cnn', 't5-small'],
num_samples=20,
)
"
# Generate a model card
python -c "
from src.model_card import create_model_card
card = create_model_card(
model_name='my-sentiment-model',
base_model='bert-base-uncased',
eval_results={'accuracy': 0.92, 'f1': 0.91},
)
print(card)
"Key Concepts Recap
| Concept | What It Is | Why It Matters |
|---|---|---|
| BLEU | N-gram precision score for generation | Standard for machine translation evaluation |
| ROUGE | N-gram recall score for generation | Standard for summarization evaluation |
| BERTScore | Semantic similarity via embeddings | Captures meaning beyond exact word match |
| F1 Score | Harmonic mean of precision and recall | Balanced metric for classification |
| evaluate.combine | Run multiple metrics at once | Efficient multi-metric evaluation |
| Model Card | Standardized model documentation | Required for responsible model sharing |
| Benchmark Suite | Reusable multi-task evaluation | Consistent model comparison |
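The rule of thumb in Step 2 recommends Exact Match + token-level F1 for QA, which none of the steps implement. A simplified hand-rolled version is below; real QA evaluation (e.g. the SQuAD metric) also normalizes articles and punctuation:

```python
from collections import Counter

def qa_scores(prediction: str, reference: str) -> dict:
    """Simplified SQuAD-style Exact Match and token-level F1."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    # Multiset intersection of tokens
    common = sum((Counter(pred) & Counter(ref)).values())
    if common == 0:
        return {"exact_match": 0.0, "f1": 0.0}
    precision = common / len(pred)
    recall = common / len(ref)
    return {
        "exact_match": float(pred == ref),
        "f1": 2 * precision * recall / (precision + recall),
    }

# Partial answer: exact_match 0.0, but token F1 = 0.8 (precision 1.0, recall 2/3)
scores = qa_scores("the cat", "the black cat")
```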
Next Steps
- Preference Alignment with TRL — Evaluate alignment quality
- Production AI Workbench — Build an evaluation dashboard