HuggingFace Ecosystem · Intermediate
Model Evaluation & Benchmarks
Comprehensive model evaluation with standard and custom metrics using the evaluate library
TL;DR
Use the HuggingFace evaluate library to measure model performance with standard metrics (BLEU, ROUGE, BERTScore, accuracy, F1), build custom evaluation pipelines, compare models side-by-side, and generate model cards with evaluation results.
Build comprehensive model evaluation pipelines using the HuggingFace evaluate library, covering NLP metrics, custom metrics, model comparison, and model cards.
What You'll Learn
- Standard NLP metrics (BLEU, ROUGE, BERTScore)
- Classification metrics (accuracy, F1, precision, recall)
- Custom metric creation and combination
- Model comparison pipelines
- Model cards with evaluation results
- Benchmark suite design
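As a warm-up for the classification metrics listed above, here is a hand-rolled sketch of precision, recall, and F1 on a toy imbalanced label set. This is illustrative only; the pipelines below delegate to the evaluate library instead.

```python
def prf1(predictions: list[int], references: list[int]) -> dict:
    """Hand-rolled binary precision/recall/F1, to make the definitions concrete."""
    tp = sum(p == r == 1 for p, r in zip(predictions, references))
    fp = sum(p == 1 and r == 0 for p, r in zip(predictions, references))
    fn = sum(p == 0 and r == 1 for p, r in zip(predictions, references))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# One true positive, one false positive, no misses:
# precision 0.5, recall 1.0, F1 = 2/3 — accuracy alone (5/6) would hide the noise
scores = prf1([1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0])
```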
Tech Stack
| Component | Technology |
|---|---|
| Metrics | evaluate |
| Models | transformers |
| Hub | huggingface_hub |
| Python | 3.10+ |
Architecture
┌──────────────────────────────────────────────────────────────────────────────┐
│ EVALUATION PIPELINE │
├──────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ ┌────────────────┐ ┌──────────────────┐ ┌───────────┐ │
│ │ Model │──▶│ Predictions │──▶│ Metric Suite │──▶│ Report │ │
│ │ Outputs │ │ (generated │ │ │ │ │ │
│ │ │ │ text or │ │ • BLEU │ │ • Scores │ │
│ │ │ │ labels) │ │ • ROUGE │ │ • Compare │ │
│ └──────────┘ └────────────────┘ │ • BERTScore │ │ • Card │ │
│ │ • Accuracy │ └───────────┘ │
│ ┌──────────┐ │ • F1 │ │
│ │ Ground │──────────────────────▶│ • Custom │ │
│ │ Truth │ │ │ │
│ └──────────┘ └──────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────────────┘

Project Structure
model-evaluation/
├── src/
│ ├── __init__.py
│ ├── metrics.py # Standard metrics with evaluate
│ ├── custom_metrics.py # Custom metric creation
│ ├── comparison.py # Model comparison pipeline
│ ├── model_card.py # Generate model cards with eval results
│ └── benchmark.py # Benchmark suite
├── examples/
│ └── evaluate_models.py
├── requirements.txt
└── README.md

Implementation
Step 1: Dependencies
evaluate>=0.4.0
transformers>=4.40.0
datasets>=2.19.0
huggingface_hub>=0.23.0
rouge-score>=0.1.2
bert-score>=0.3.13
nltk>=3.8.0
scikit-learn>=1.4.0

Step 2: Standard Metrics
"""Standard NLP and classification metrics using the evaluate library."""
import evaluate
import numpy as np
class MetricSuite:
"""
Collection of standard metrics for model evaluation.
The evaluate library provides 100+ metrics with consistent APIs.
Metrics are loaded by name and called with predictions + references.
"""
def __init__(self):
# Load metrics (downloads metric code from Hub on first use)
self._accuracy = evaluate.load("accuracy")
self._f1 = evaluate.load("f1")
self._precision = evaluate.load("precision")
self._recall = evaluate.load("recall")
def classification_report(
self,
predictions: list[int],
references: list[int],
average: str = "weighted",
) -> dict:
"""Compute all classification metrics."""
results = {}
results.update(
self._accuracy.compute(
predictions=predictions,
references=references,
)
)
results.update(
self._f1.compute(
predictions=predictions,
references=references,
average=average,
)
)
results.update(
self._precision.compute(
predictions=predictions,
references=references,
average=average,
)
)
results.update(
self._recall.compute(
predictions=predictions,
references=references,
average=average,
)
)
return results
class GenerationMetrics:
"""Metrics for evaluating text generation quality."""
def __init__(self):
self._bleu = evaluate.load("bleu")
self._rouge = evaluate.load("rouge")
self._bertscore = evaluate.load("bertscore")
def compute_bleu(
self,
predictions: list[str],
references: list[list[str]],
) -> dict:
"""
BLEU: measures n-gram overlap between prediction and reference.
Scores range from 0 to 1. Good scores:
- Machine translation: 0.30-0.50
- Summarization: 0.15-0.30
- General text generation: 0.10-0.25
"""
return self._bleu.compute(
predictions=predictions,
references=references,
)
def compute_rouge(
self,
predictions: list[str],
references: list[str],
) -> dict:
"""
ROUGE: measures recall of n-grams from reference in prediction.
Variants:
- rouge1: Unigram overlap
- rouge2: Bigram overlap
- rougeL: Longest common subsequence
- rougeLsum: rougeL over sentences
"""
return self._rouge.compute(
predictions=predictions,
references=references,
)
def compute_bertscore(
self,
predictions: list[str],
references: list[str],
lang: str = "en",
) -> dict:
"""
BERTScore: semantic similarity using contextual embeddings.
Unlike BLEU/ROUGE which count exact n-gram matches,
BERTScore measures meaning similarity:
- "The cat sat on the mat" vs "A feline rested on the rug"
- BLEU/ROUGE: low score (different words)
- BERTScore: high score (same meaning)
"""
results = self._bertscore.compute(
predictions=predictions,
references=references,
lang=lang,
)
return {
"bertscore_precision": float(np.mean(results["precision"])),
"bertscore_recall": float(np.mean(results["recall"])),
"bertscore_f1": float(np.mean(results["f1"])),
}
def full_report(
self,
predictions: list[str],
references: list[str],
) -> dict:
"""Compute all generation metrics."""
results = {}
# ROUGE
results.update(self.compute_rouge(predictions, references))
# BERTScore
results.update(self.compute_bertscore(predictions, references))
# BLEU (needs list-of-lists for references)
bleu_refs = [[ref] for ref in references]
bleu_result = self.compute_bleu(predictions, bleu_refs)
results["bleu"] = bleu_result["bleu"]
return resultsMetric Comparison:
┌─────────────────────────────────────────────────────────────────┐
│ WHEN TO USE WHICH METRIC │
├─────────────────────────────────────────────────────────────────┤
│ │
│ CLASSIFICATION (discrete labels) │
│ ┌──────────────┬────────────────────────────────────────┐ │
│ │ Accuracy │ Overall % correct. Misleading if │ │
│ │ │ classes are imbalanced. │ │
│ │ F1 (macro) │ Harmonic mean of precision & recall. │ │
│ │ │ Good for imbalanced classes. │ │
│ │ Precision │ Of predicted positives, how many right? │ │
│ │ Recall │ Of actual positives, how many found? │ │
│ └──────────────┴────────────────────────────────────────┘ │
│ │
│ GENERATION (free-form text) │
│ ┌──────────────┬────────────────────────────────────────┐ │
│ │ BLEU │ n-gram precision. Good for translation. │ │
│ │ ROUGE │ n-gram recall. Good for summarization. │ │
│ │ BERTScore │ Semantic similarity with embeddings. │ │
│ │ │ Best for paraphrase and meaning. │ │
│ └──────────────┴────────────────────────────────────────┘ │
│ │
│ Rule of thumb: │
│ • Translation → BLEU + BERTScore │
│ • Summarization → ROUGE + BERTScore │
│ • Classification → F1 (weighted or macro) │
│ • QA → Exact Match + F1 (token-level) │
│ │
└─────────────────────────────────────────────────────────────────┘

Step 3: Custom Metrics
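Before writing custom metrics, it helps to see concretely why surface-overlap metrics punish paraphrases (the cat/mat example from the BERTScore docstring in Step 2). Below is a hand-rolled clipped unigram overlap, a simplified stand-in for ROUGE-1/BLEU-1 rather than the library implementation:

```python
from collections import Counter

def unigram_overlap(prediction: str, reference: str) -> dict:
    """Clipped unigram counts: the core of ROUGE-1 / BLEU-1 (simplified)."""
    pred = Counter(prediction.lower().split())
    ref = Counter(reference.lower().split())
    # Clip each word's count by its count in the reference
    overlap = sum(min(pred[w], ref[w]) for w in pred)
    return {
        "precision": overlap / max(sum(pred.values()), 1),
        "recall": overlap / max(sum(ref.values()), 1),
    }

# Same meaning, almost no shared words: only "on" and "the" overlap,
# so precision = recall = 2/6. BERTScore would rate this pair much higher.
scores = unigram_overlap("A feline rested on the rug", "The cat sat on the mat")
```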
"""Create custom evaluation metrics."""
import evaluate
import numpy as np
class ConsistencyMetric:
"""
Custom metric: measure response consistency.
Given the same question asked differently, do we get
the same answer? Useful for chatbot evaluation.
"""
def compute(
self,
predictions_a: list[str],
predictions_b: list[str],
) -> dict:
"""Compare predictions from paraphrased inputs."""
bertscore = evaluate.load("bertscore")
results = bertscore.compute(
predictions=predictions_a,
references=predictions_b,
lang="en",
)
return {
"consistency_score": float(np.mean(results["f1"])),
}
class LengthPenaltyMetric:
"""
Custom metric: penalize overly long or short responses.
Useful for ensuring summarization quality or checking
that generation stays within expected bounds.
"""
def compute(
self,
predictions: list[str],
target_length: int = 100,
tolerance: float = 0.3,
) -> dict:
"""Score based on how close to target length."""
lengths = [len(p.split()) for p in predictions]
penalties = []
for length in lengths:
ratio = length / target_length
if 1 - tolerance <= ratio <= 1 + tolerance:
penalties.append(1.0)
else:
penalties.append(max(0, 1 - abs(ratio - 1)))
return {
"length_score": float(np.mean(penalties)),
"avg_length": float(np.mean(lengths)),
"length_std": float(np.std(lengths)),
}
def combine_metrics(
predictions: list[str],
references: list[str],
weights: dict[str, float] | None = None,
) -> dict:
"""
Combine multiple metrics into a single evaluation report.
Args:
predictions: Model outputs
references: Ground truth
weights: Metric weights for combined score
"""
if weights is None:
weights = {"rouge1": 0.3, "bertscore_f1": 0.5, "length_score": 0.2}
gen_metrics = evaluate.combine([
evaluate.load("rouge"),
evaluate.load("bleu"),
])
results = gen_metrics.compute(
predictions=predictions,
references=references,
)
# Add BERTScore separately (different API)
bertscore = evaluate.load("bertscore")
bs_results = bertscore.compute(
predictions=predictions,
references=references,
lang="en",
)
results["bertscore_f1"] = float(np.mean(bs_results["f1"]))
# Add length metric
length_metric = LengthPenaltyMetric()
results.update(length_metric.compute(predictions))
# Compute weighted combined score
combined = sum(
results.get(k, 0) * v
for k, v in weights.items()
)
results["combined_score"] = combined
return resultsStep 4: Model Comparison
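The combined score at the end of Step 3 is a plain weighted sum over selected metric keys. As a quick standalone sanity check of that arithmetic (same default weights as combine_metrics, with illustrative metric values):

```python
def weighted_score(results: dict, weights: dict[str, float]) -> float:
    """Weighted sum over selected metric keys; missing keys count as 0."""
    return sum(results.get(k, 0) * v for k, v in weights.items())

weights = {"rouge1": 0.3, "bertscore_f1": 0.5, "length_score": 0.2}
results = {"rouge1": 0.4, "bertscore_f1": 0.8, "length_score": 1.0, "bleu": 0.2}
# 0.4*0.3 + 0.8*0.5 + 1.0*0.2 = 0.72; "bleu" is ignored (not in weights)
combined = weighted_score(results, weights)
```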
"""Compare multiple models on the same benchmark."""
from transformers import pipeline
from datasets import load_dataset
from .metrics import MetricSuite, GenerationMetrics
def compare_classification_models(
model_names: list[str],
dataset_name: str = "imdb",
split: str = "test[:200]",
) -> dict:
"""Compare classification models on the same test set."""
dataset = load_dataset(dataset_name, split=split)
metrics = MetricSuite()
results = {}
for model_name in model_names:
print(f"Evaluating {model_name}...")
pipe = pipeline("text-classification", model=model_name, device=0)
predictions = []
for example in dataset:
result = pipe(example["text"], truncation=True, max_length=512)
# Map label string to int
pred = 1 if result[0]["label"] == "POSITIVE" else 0
predictions.append(pred)
references = dataset["label"]
scores = metrics.classification_report(predictions, references)
results[model_name] = scores
print(f" Accuracy: {scores['accuracy']:.4f}, F1: {scores['f1']:.4f}")
return results
def compare_summarization_models(
model_names: list[str],
dataset_name: str = "cnn_dailymail",
dataset_config: str = "3.0.0",
num_samples: int = 100,
) -> dict:
"""Compare summarization models on the same test set."""
dataset = load_dataset(
dataset_name, dataset_config, split=f"test[:{num_samples}]"
)
gen_metrics = GenerationMetrics()
results = {}
for model_name in model_names:
print(f"Evaluating {model_name}...")
pipe = pipeline("summarization", model=model_name, device=0)
predictions = []
for example in dataset:
summary = pipe(
example["article"],
max_length=130,
min_length=30,
truncation=True,
)
predictions.append(summary[0]["summary_text"])
references = dataset["highlights"]
scores = gen_metrics.full_report(predictions, references)
results[model_name] = scores
print(f" ROUGE-1: {scores['rouge1']:.4f}, BERTScore: {scores['bertscore_f1']:.4f}")
return results
def format_comparison_table(results: dict) -> str:
"""Format comparison results as a markdown table."""
if not results:
return ""
models = list(results.keys())
metrics = list(results[models[0]].keys())
# Header
header = "| Model | " + " | ".join(metrics) + " |"
separator = "|" + "|".join(["---"] * (len(metrics) + 1)) + "|"
rows = []
for model in models:
values = [f"{results[model][m]:.4f}" for m in metrics]
rows.append(f"| {model} | " + " | ".join(values) + " |")
return "\n".join([header, separator] + rows)Step 5: Model Cards
"""Generate model cards with evaluation results."""
from huggingface_hub import ModelCard, ModelCardData
def create_model_card(
model_name: str,
base_model: str,
eval_results: dict,
task: str = "text-classification",
dataset: str = "imdb",
training_details: dict | None = None,
) -> str:
"""
Generate a model card with evaluation results.
Model cards are standardized documentation for ML models,
required by HuggingFace Hub for good practices.
"""
card_data = ModelCardData(
model_name=model_name,
language="en",
license="mit",
library_name="transformers",
tags=["text-classification", "fine-tuned"],
datasets=[dataset],
metrics=list(eval_results.keys()),
eval_results=[
{
"task": {"type": task},
"dataset": {"type": dataset, "name": dataset},
"metrics": [
{"type": k, "value": v}
for k, v in eval_results.items()
],
}
],
)
training_info = ""
if training_details:
training_info = "\n## Training Details\n\n"
for key, value in training_details.items():
training_info += f"- **{key}**: {value}\n"
card_content = f"""---
{card_data.to_yaml()}
---
# {model_name}
Fine-tuned from [{base_model}](https://huggingface.co/{base_model}) on {dataset}.
## Evaluation Results
| Metric | Score |
|--------|-------|
""" + "\n".join(f"| {k} | {v:.4f} |" for k, v in eval_results.items())
card_content += training_info
return card_contentStep 6: Benchmark Suite
"""Design reusable benchmark suites."""
from dataclasses import dataclass
from typing import Callable
from datasets import load_dataset
from .metrics import MetricSuite, GenerationMetrics
@dataclass
class BenchmarkTask:
"""A single benchmark task definition."""
name: str
dataset: str
dataset_config: str | None
split: str
input_column: str
target_column: str
task_type: str # "classification" or "generation"
num_samples: int = 200
class BenchmarkSuite:
"""Run a suite of benchmarks against any model."""
STANDARD_TASKS = [
BenchmarkTask(
name="Sentiment (IMDB)",
dataset="imdb",
dataset_config=None,
split="test[:200]",
input_column="text",
target_column="label",
task_type="classification",
),
BenchmarkTask(
name="Summarization (CNN/DM)",
dataset="cnn_dailymail",
dataset_config="3.0.0",
split="test[:50]",
input_column="article",
target_column="highlights",
task_type="generation",
),
]
def __init__(self, tasks: list[BenchmarkTask] | None = None):
self.tasks = tasks or self.STANDARD_TASKS
self.classification_metrics = MetricSuite()
self.generation_metrics = GenerationMetrics()
def run(
self,
predict_fn: Callable[[list[str]], list],
task_filter: list[str] | None = None,
) -> dict:
"""
Run all benchmark tasks.
Args:
predict_fn: Function that takes list[str] and returns predictions
task_filter: Only run tasks with these names
"""
results = {}
for task in self.tasks:
if task_filter and task.name not in task_filter:
continue
print(f"Running: {task.name}")
dataset = load_dataset(
task.dataset,
task.dataset_config,
split=task.split,
)
inputs = dataset[task.input_column]
targets = dataset[task.target_column]
predictions = predict_fn(inputs)
if task.task_type == "classification":
scores = self.classification_metrics.classification_report(
predictions, targets,
)
else:
scores = self.generation_metrics.full_report(
predictions, targets,
)
results[task.name] = scores
print(f" {scores}")
return resultsRunning the Project
# Install dependencies
pip install -r requirements.txt
# Evaluate a classification model
python -c "
from src.metrics import MetricSuite
metrics = MetricSuite()
results = metrics.classification_report(
predictions=[1, 0, 1, 1, 0],
references=[1, 0, 0, 1, 0],
)
print(results)
"
# Compare summarization models
python -c "
from src.comparison import compare_summarization_models
results = compare_summarization_models(
model_names=['facebook/bart-large-cnn', 't5-small'],
num_samples=20,
)
"
# Generate a model card
python -c "
from src.model_card import create_model_card
card = create_model_card(
model_name='my-sentiment-model',
base_model='bert-base-uncased',
eval_results={'accuracy': 0.92, 'f1': 0.91},
)
print(card)
"Key Concepts Recap
| Concept | What It Is | Why It Matters |
|---|---|---|
| BLEU | N-gram precision score for generation | Standard for machine translation evaluation |
| ROUGE | N-gram recall score for generation | Standard for summarization evaluation |
| BERTScore | Semantic similarity via embeddings | Captures meaning beyond exact word match |
| F1 Score | Harmonic mean of precision and recall | Balanced metric for classification |
| evaluate.combine | Run multiple metrics at once | Efficient multi-metric evaluation |
| Model Card | Standardized model documentation | Required for responsible model sharing |
| Benchmark Suite | Reusable multi-task evaluation | Consistent model comparison |
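The rule of thumb in Step 2 recommends Exact Match + token-level F1 for QA, which none of the steps implement. A simplified hand-rolled version is below; real QA evaluation (e.g. the SQuAD metric) also normalizes articles and punctuation:

```python
from collections import Counter

def qa_scores(prediction: str, reference: str) -> dict:
    """Simplified SQuAD-style Exact Match and token-level F1."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    # Multiset intersection of tokens
    common = sum((Counter(pred) & Counter(ref)).values())
    if common == 0:
        return {"exact_match": 0.0, "f1": 0.0}
    precision = common / len(pred)
    recall = common / len(ref)
    return {
        "exact_match": float(pred == ref),
        "f1": 2 * precision * recall / (precision + recall),
    }

# Partial answer: exact_match 0.0, but token F1 = 0.8 (precision 1.0, recall 2/3)
scores = qa_scores("the cat", "the black cat")
```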
Next Steps
- Preference Alignment with TRL — Evaluate alignment quality
- Production AI Workbench — Build an evaluation dashboard