HuggingFace Ecosystem · Beginner
Pipelines & Hub
Use pre-trained models via pipelines and interact with the HuggingFace Hub API
TL;DR
Use HuggingFace pipeline() to run pre-trained models for 20+ NLP tasks in a single line of code, and the huggingface_hub library to search, download, and manage models programmatically.
Build a multi-task NLP application using HuggingFace Pipelines, and learn to interact with the Hub API for model discovery and management.
What You'll Learn
- Running inference with pipeline() for text, image, and audio tasks
- AutoModel and AutoTokenizer patterns for direct model loading
- Hub API for searching, downloading, and uploading models
- Device placement and dtype configuration
- Batched inference for throughput
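To set expectations, here is a minimal sketch of that one-line pattern, including the batched-call form (the model id is this project's default for sentiment; the first call downloads and caches the weights):

```python
from transformers import pipeline

# One pipeline per task; pipeline() wires up the tokenizer, model, and
# post-processing from the model's config automatically.
classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

# Single input -> list of {"label", "score"} dicts
print(classifier("I love this library!"))

# Batched input: pass a list plus batch_size for higher throughput
texts = ["Great docs.", "Terrible latency.", "Works fine."]
print(classifier(texts, batch_size=8))
```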
Tech Stack
| Component | Technology |
|---|---|
| Pipelines | transformers |
| Hub API | huggingface_hub |
| API | FastAPI |
| Python | 3.10+ |
Architecture
┌──────────────────────────────────────────────────────────────────────────────┐
│ PIPELINE-BASED NLP SERVICE │
├──────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌────────────────────────────────────────────┐ │
│ │ FastAPI │ │ Pipeline Registry │ │
│ │ Endpoints │───▶│ │ │
│ │ │ │ ┌────────────┐ ┌────────────┐ │ │
│ │ /classify │ │ │ Sentiment │ │ NER │ │ │
│ │ /ner │ │ │ Pipeline │ │ Pipeline │ │ │
│ │ /summarize │ │ └────────────┘ └────────────┘ │ │
│ │ /generate │ │ ┌────────────┐ ┌────────────┐ │ │
│ │ /qa │ │ │ Summarize │ │ Text Gen │ │ │
│ │ /translate │ │ │ Pipeline │ │ Pipeline │ │ │
│ └─────────────┘ │ └────────────┘ └────────────┘ │ │
│ └────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────────┐ │
│ │ HuggingFace Hub │ │
│ │ • Model discovery (list_models) │ │
│ │ • Model metadata (model_info) │ │
│ │ • Weight download (hf_hub_download) │ │
│ └────────────────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────────────┘
Project Structure
pipelines-and-hub/
├── src/
│ ├── __init__.py
│ ├── pipelines.py # Pipeline wrappers for each task
│ ├── hub_client.py # Hub API interactions
│ ├── auto_models.py # AutoModel/AutoTokenizer patterns
│ └── config.py # Device and model configuration
├── api/
│ └── main.py # FastAPI application
├── examples/
│ └── demo.py # Interactive demo script
├── requirements.txt
└── README.md
Implementation
Step 1: Dependencies
transformers>=4.40.0
huggingface_hub>=0.23.0
torch>=2.0.0
fastapi>=0.111.0
uvicorn>=0.30.0
Step 2: Configuration
"""Device and model configuration."""
import torch
def get_device() -> str:
"""Select the best available device."""
if torch.cuda.is_available():
return "cuda"
elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
return "mps"
return "cpu"
def get_dtype(device: str) -> torch.dtype:
"""Select optimal dtype for the device."""
if device == "cuda":
return torch.float16
return torch.float32
DEVICE = get_device()
DTYPE = get_dtype(DEVICE)
# Default models for each task
DEFAULT_MODELS = {
"text-classification": "distilbert-base-uncased-finetuned-sst-2-english",
"ner": "dslim/bert-base-NER",
"summarization": "facebook/bart-large-cnn",
"translation": "Helsinki-NLP/opus-mt-en-fr",
"text-generation": "gpt2",
"question-answering": "distilbert-base-cased-distilled-squad",
"zero-shot-classification": "facebook/bart-large-mnli",
"fill-mask": "bert-base-uncased",
}
Understanding Device Selection:
┌─────────────────────────────────────────────────────────────────┐
│ DEVICE PRIORITY │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Check CUDA (NVIDIA GPU) │
│ │ │
│ ├── Available? ──► Use "cuda" + float16 │
│ │ (fastest, half-precision saves VRAM) │
│ │ │
│ └── No ──► Check MPS (Apple Silicon) │
│ │ │
│ ├── Available? ──► Use "mps" + float32 │
│ │ (GPU accel on Mac) │
│ │ │
│ └── No ──► Use "cpu" + float32 │
│ (universal fallback) │
│ │
│ Why float16 on CUDA only: │
│ • Halves memory usage (fits larger models) │
│ • NVIDIA Tensor Cores accelerate fp16 natively │
│ • MPS/CPU don't benefit as much from fp16 │
│ │
└─────────────────────────────────────────────────────────────────┘
Step 3: Pipeline Wrappers
"""HuggingFace Pipeline wrappers for multiple NLP tasks."""
from transformers import pipeline
from typing import Any
from .config import DEVICE, DEFAULT_MODELS
class PipelineRegistry:
"""
Lazy-loading registry for HuggingFace pipelines.
Pipelines are created on first use and cached for subsequent calls.
This avoids loading all models at startup.
"""
def __init__(self):
self._pipelines: dict[str, Any] = {}
def get(self, task: str, model: str | None = None) -> Any:
"""Get or create a pipeline for the given task."""
model = model or DEFAULT_MODELS.get(task)
cache_key = f"{task}:{model}"
if cache_key not in self._pipelines:
self._pipelines[cache_key] = pipeline(
task=task,
model=model,
device=DEVICE,
)
return self._pipelines[cache_key]
def classify(self, text: str) -> list[dict]:
"""Sentiment / text classification."""
pipe = self.get("text-classification")
return pipe(text)
def extract_entities(self, text: str) -> list[dict]:
"""Named entity recognition."""
pipe = self.get("ner")
results = pipe(text)
# Merge subword tokens into complete entities
merged = []
current = None
for entity in results:
if entity["word"].startswith("##"):
# Subword continuation — append to current entity
if current:
current["word"] += entity["word"][2:]
current["end"] = entity["end"]
current["score"] = min(current["score"], entity["score"])
else:
if current:
merged.append(current)
current = {**entity}
if current:
merged.append(current)
return merged
def summarize(
self,
text: str,
max_length: int = 130,
min_length: int = 30,
) -> str:
"""Text summarization."""
pipe = self.get("summarization")
result = pipe(text, max_length=max_length, min_length=min_length)
return result[0]["summary_text"]
def generate(
self,
prompt: str,
max_new_tokens: int = 100,
temperature: float = 0.7,
top_p: float = 0.9,
) -> str:
"""Text generation."""
pipe = self.get("text-generation")
result = pipe(
prompt,
max_new_tokens=max_new_tokens,
temperature=temperature,
top_p=top_p,
do_sample=True,
)
return result[0]["generated_text"]
def answer_question(self, question: str, context: str) -> dict:
"""Extractive question answering."""
pipe = self.get("question-answering")
return pipe(question=question, context=context)
def translate(self, text: str, model: str | None = None) -> str:
"""Translation (default: English to French)."""
pipe = self.get("translation", model=model)
result = pipe(text)
return result[0]["translation_text"]
def zero_shot_classify(
self,
text: str,
candidate_labels: list[str],
) -> dict:
"""Zero-shot classification with candidate labels."""
pipe = self.get("zero-shot-classification")
return pipe(text, candidate_labels=candidate_labels)
def fill_mask(self, text: str) -> list[dict]:
"""Fill masked tokens in text."""
pipe = self.get("fill-mask")
return pipe(text)
# Global registry instance
registry = PipelineRegistry()
How pipeline() Works Under the Hood:
┌─────────────────────────────────────────────────────────────────┐
│ pipeline("text-classification", model="distilbert-base-...") │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. Resolve model ──► Download from Hub (cached locally) │
│ │
│ 2. Load components: │
│ ┌──────────────┐ ┌───────────────┐ ┌────────────────┐ │
│ │ Tokenizer │ │ Model │ │ Post-process │ │
│ │ (AutoToken- │ │ (AutoModel- │ │ (softmax, │ │
│ │ izer) │ │ ForSeqClass)│ │ label map) │ │
│ └──────┬───────┘ └───────┬───────┘ └────────┬───────┘ │
│ │ │ │ │
│ 3. __call__(text): │ │ │
│ text ──► tokenize ──► model.forward() ──► post-process │
│ │ │
│ 4. Return: [{"label": "POSITIVE", "score": 0.99}] │
│ │
│ Key insight: pipeline() auto-selects the correct AutoModel │
│ subclass based on the task (ForSequenceClassification, │
│ ForTokenClassification, ForSeq2SeqLM, etc.) │
│ │
└─────────────────────────────────────────────────────────────────┘
Pipeline Task Reference:
| Task | What It Does | Example Model |
|---|---|---|
| text-classification | Sentiment, topic labels | distilbert-base-uncased-finetuned-sst-2-english |
| ner | Named entity extraction | dslim/bert-base-NER |
| summarization | Condense long text | facebook/bart-large-cnn |
| translation_xx_to_yy | Language translation | Helsinki-NLP/opus-mt-en-fr |
| text-generation | Continue a prompt | gpt2 |
| question-answering | Extract answer from context | distilbert-base-cased-distilled-squad |
| zero-shot-classification | Classify without training | facebook/bart-large-mnli |
| fill-mask | Predict masked tokens | bert-base-uncased |
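To make the "under the hood" stages concrete, the sketch below reproduces them by hand for the same default sentiment model — pipeline() does roughly this resolution, forward pass, and post-processing for you:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# 1. Tokenize
inputs = tokenizer("This is amazing!", return_tensors="pt")

# 2. Forward pass
with torch.no_grad():
    logits = model(**inputs).logits

# 3. Post-process: softmax + the id-to-label map stored in config.json
probs = torch.softmax(logits, dim=-1)[0]
best = int(probs.argmax())
result = {"label": model.config.id2label[best], "score": float(probs[best])}
print(result)
```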
Step 4: Hub API Client
"""HuggingFace Hub API client for model discovery and management."""
# Note: ModelFilter was deprecated and later removed from huggingface_hub;
# task/library filters are now passed directly to list_models().
from huggingface_hub import HfApi, hf_hub_download
from dataclasses import dataclass
@dataclass
class ModelSummary:
"""Compact model information."""
model_id: str
author: str
downloads: int
likes: int
pipeline_tag: str | None
tags: list[str]
class HubClient:
"""Client for HuggingFace Hub operations."""
def __init__(self, token: str | None = None):
self.api = HfApi(token=token)
def search_models(
self,
query: str | None = None,
task: str | None = None,
library: str | None = None,
sort: str = "downloads",
limit: int = 10,
) -> list[ModelSummary]:
"""
Search for models on the Hub.
Args:
query: Free-text search query
task: Filter by pipeline task (e.g., "text-classification")
library: Filter by library (e.g., "pytorch", "transformers")
sort: Sort by "downloads", "likes", or "lastModified"
limit: Maximum results to return
"""
        models = list(
            self.api.list_models(
                search=query,
                task=task,
                library=library,
                sort=sort,
                direction=-1,
                limit=limit,
            )
        )
return [
ModelSummary(
model_id=m.id,
author=m.author or "unknown",
downloads=m.downloads or 0,
likes=m.likes or 0,
pipeline_tag=m.pipeline_tag,
tags=m.tags or [],
)
for m in models
]
def get_model_info(self, model_id: str) -> dict:
"""Get detailed information about a model."""
        # files_metadata=True populates sibling sizes (otherwise size is None)
        info = self.api.model_info(model_id, files_metadata=True)
return {
"model_id": info.id,
"author": info.author,
"sha": info.sha,
"pipeline_tag": info.pipeline_tag,
"tags": info.tags,
"downloads": info.downloads,
"likes": info.likes,
"library_name": info.library_name,
"languages": getattr(info, "languages", []),
"siblings": [
{"filename": s.rfilename, "size": s.size}
for s in (info.siblings or [])
],
}
def download_file(
self,
model_id: str,
filename: str,
cache_dir: str | None = None,
) -> str:
"""
Download a specific file from a model repository.
Returns the local path to the downloaded file.
"""
return hf_hub_download(
repo_id=model_id,
filename=filename,
cache_dir=cache_dir,
)
def list_model_files(self, model_id: str) -> list[str]:
"""List all files in a model repository."""
info = self.api.model_info(model_id)
        return [s.rfilename for s in (info.siblings or [])]
Hub API Operations:
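The class above wraps huggingface_hub; here is a standalone sketch of the same three operations against the live Hub (results vary over time, and network access is required):

```python
from huggingface_hub import HfApi, hf_hub_download

api = HfApi()

# DISCOVER: top text-classification models by download count
for m in api.list_models(task="text-classification", sort="downloads",
                         direction=-1, limit=3):
    print(m.id, m.downloads)

# INSPECT: metadata and file listing for one repository
info = api.model_info("distilbert-base-uncased-finetuned-sst-2-english")
print(info.pipeline_tag, [s.rfilename for s in info.siblings][:4])

# USE: fetch a single file into the local cache and get its path
path = hf_hub_download(
    repo_id="distilbert-base-uncased-finetuned-sst-2-english",
    filename="config.json",
)
print(path)
```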
┌─────────────────────────────────────────────────────────────────┐
│ HUGGINGFACE HUB WORKFLOW │
├─────────────────────────────────────────────────────────────────┤
│ │
│ DISCOVER │
│ list_models(task="text-classification", sort="downloads") │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Results: │ │
│ │ 1. distilbert-base-uncased-finetuned-sst-2 (50M downloads)│ │
│ │ 2. cardiffnlp/twitter-roberta-base-sentiment (10M) │ │
│ │ 3. nlptown/bert-base-multilingual-uncased-sentiment (5M) │ │
│ └────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ INSPECT │
│ model_info("distilbert-base-uncased-finetuned-sst-2-english") │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Files: config.json, model.safetensors, tokenizer.json, ... │ │
│ │ Size: ~268 MB │ │
│ │ Library: transformers │ │
│ └────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ USE │
│ pipeline("text-classification", model="distilbert-base-...") │
│ (auto-downloads and caches weights on first call) │
│ │
└─────────────────────────────────────────────────────────────────┘
Step 5: AutoModel Patterns
"""Direct model loading with AutoModel and AutoTokenizer."""
import torch
from transformers import (
AutoTokenizer,
AutoModel,
AutoModelForSequenceClassification,
AutoModelForTokenClassification,
AutoModelForCausalLM,
)
from .config import DEVICE, DTYPE
def load_for_embeddings(model_name: str = "bert-base-uncased"):
"""
Load a model for extracting embeddings.
Uses AutoModel (no task-specific head) to get hidden states.
"""
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(
model_name,
torch_dtype=DTYPE,
).to(DEVICE)
model.eval()
return tokenizer, model
def get_embeddings(
texts: list[str],
tokenizer,
model,
pooling: str = "mean",
) -> torch.Tensor:
"""
Extract embeddings from texts.
Args:
texts: List of input texts
tokenizer: HuggingFace tokenizer
model: HuggingFace model
pooling: "mean", "cls", or "max"
"""
inputs = tokenizer(
texts,
padding=True,
truncation=True,
max_length=512,
return_tensors="pt",
).to(DEVICE)
with torch.no_grad():
outputs = model(**inputs)
hidden_states = outputs.last_hidden_state # [batch, seq_len, hidden_dim]
attention_mask = inputs["attention_mask"]
if pooling == "cls":
# Use the [CLS] token representation
embeddings = hidden_states[:, 0, :]
elif pooling == "mean":
# Mean of all non-padding tokens
mask = attention_mask.unsqueeze(-1).float()
embeddings = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1)
    elif pooling == "max":
        # Max pooling over the sequence dimension; mask padding to -inf
        # first (masked_fill avoids mutating hidden_states in place, and
        # -inf is safe in float16 where a literal -1e9 would overflow)
        masked = hidden_states.masked_fill(
            attention_mask.unsqueeze(-1) == 0, float("-inf")
        )
        embeddings = masked.max(dim=1).values
else:
raise ValueError(f"Unknown pooling: {pooling}")
return embeddings
def load_for_classification(
model_name: str,
num_labels: int = 2,
):
"""Load a model with a classification head."""
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
model_name,
num_labels=num_labels,
torch_dtype=DTYPE,
).to(DEVICE)
return tokenizer, model
def load_for_generation(
model_name: str = "gpt2",
):
"""Load a causal language model for text generation."""
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=DTYPE,
).to(DEVICE)
# GPT-2 doesn't have a pad token by default
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
    return tokenizer, model
AutoModel Class Hierarchy:
┌─────────────────────────────────────────────────────────────────┐
│ AUTOMODEL SELECTION │
├─────────────────────────────────────────────────────────────────┤
│ │
│ AutoModel (base — no task head) │
│ └── Returns raw hidden states [batch, seq, hidden_dim] │
│ Use for: embeddings, custom heads │
│ │
│ AutoModelForSequenceClassification │
│ └── Adds Linear(hidden_dim, num_labels) on top of [CLS] │
│ Use for: sentiment, topic classification │
│ │
│ AutoModelForTokenClassification │
│ └── Adds Linear(hidden_dim, num_labels) on every token │
│ Use for: NER, POS tagging │
│ │
│ AutoModelForCausalLM │
│ └── Adds LM head for next-token prediction │
│ Use for: text generation (GPT-style) │
│ │
│ AutoModelForSeq2SeqLM │
│ └── Encoder-decoder with generation head │
│ Use for: translation, summarization (T5, BART) │
│ │
│ The "Auto" prefix means: automatically detect the correct │
│ model architecture from config.json in the model repo. │
│ │
└─────────────────────────────────────────────────────────────────┘
Pooling Strategy Comparison:
| Strategy | How It Works | Best For |
|---|---|---|
| cls | Use the [CLS] token embedding | Models fine-tuned with [CLS] as aggregate |
| mean | Average all non-padding token embeddings | General-purpose similarity, most robust |
| max | Max value across each dimension | Capturing strongest feature activations |
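Putting mean pooling to work end to end — a self-contained sketch (it inlines the loading instead of importing src.auto_models, and the similarity ordering is what raw BERT typically produces, not a guarantee):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

texts = [
    "A cat sat on the mat.",
    "A kitten rested on the rug.",
    "Stocks fell sharply today.",
]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # [batch, seq_len, hidden_dim]

# Mean pooling over non-padding tokens only
mask = inputs["attention_mask"].unsqueeze(-1).float()
emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Cosine similarity between the first sentence and the other two
emb = torch.nn.functional.normalize(emb, dim=-1)
sims = emb @ emb.T
print(f"cat vs kitten: {sims[0, 1]:.3f}, cat vs stocks: {sims[0, 2]:.3f}")
```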
Step 6: FastAPI Application
"""FastAPI application exposing pipeline endpoints."""
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field
from src.pipelines import registry
from src.hub_client import HubClient
app = FastAPI(title="HuggingFace Pipelines API")
hub = HubClient()
class TextInput(BaseModel):
text: str = Field(..., min_length=1)
class QAInput(BaseModel):
question: str
context: str
class ZeroShotInput(BaseModel):
text: str
labels: list[str]
class GenerateInput(BaseModel):
prompt: str
max_new_tokens: int = 100
temperature: float = 0.7
class SearchModelsInput(BaseModel):
query: str | None = None
task: str | None = None
limit: int = 10
@app.get("/health")
async def health():
return {"status": "healthy"}
@app.post("/classify")
async def classify(req: TextInput):
return registry.classify(req.text)
@app.post("/ner")
async def ner(req: TextInput):
entities = registry.extract_entities(req.text)
return {"entities": entities}
@app.post("/summarize")
async def summarize(req: TextInput):
summary = registry.summarize(req.text)
return {"summary": summary}
@app.post("/generate")
async def generate(req: GenerateInput):
text = registry.generate(
req.prompt,
max_new_tokens=req.max_new_tokens,
temperature=req.temperature,
)
return {"generated_text": text}
@app.post("/qa")
async def question_answering(req: QAInput):
return registry.answer_question(req.question, req.context)
@app.post("/zero-shot")
async def zero_shot(req: ZeroShotInput):
return registry.zero_shot_classify(req.text, req.labels)
@app.post("/hub/search")
async def search_models(req: SearchModelsInput):
results = hub.search_models(
query=req.query,
task=req.task,
limit=req.limit,
)
return {"models": [r.__dict__ for r in results]}
@app.get("/hub/model/{model_id:path}")
async def get_model(model_id: str):
try:
return hub.get_model_info(model_id)
except Exception as e:
        raise HTTPException(status_code=404, detail=str(e))
Step 7: Interactive Demo
"""Interactive demo showing all pipeline capabilities."""
from src.pipelines import registry
from src.hub_client import HubClient
def demo_classification():
"""Sentiment analysis."""
texts = [
"I love this product! It works perfectly.",
"Terrible experience. Would not recommend.",
"It's okay, nothing special.",
]
print("=== Text Classification ===")
for text in texts:
result = registry.classify(text)
print(f" '{text[:50]}...' → {result[0]['label']} ({result[0]['score']:.3f})")
def demo_ner():
"""Named entity recognition."""
text = "Elon Musk founded SpaceX in Hawthorne, California in 2002."
print("\n=== Named Entity Recognition ===")
print(f" Text: {text}")
entities = registry.extract_entities(text)
for ent in entities:
print(f" {ent['word']:20s} → {ent['entity']:10s} (score: {ent['score']:.3f})")
def demo_summarization():
"""Text summarization."""
text = (
"The tower is 324 metres (1,063 ft) tall, about the same height as an "
"81-storey building, and the tallest structure in Paris. Its base is square, "
"measuring 125 metres (410 ft) on each side. During its construction, the "
"Eiffel Tower surpassed the Washington Monument to become the tallest "
"man-made structure in the world, a title it held for 41 years until the "
"Chrysler Building in New York City was finished in 1930."
)
print("\n=== Summarization ===")
summary = registry.summarize(text)
print(f" Summary: {summary}")
def demo_qa():
"""Question answering."""
context = (
"Python is a programming language created by Guido van Rossum and first "
"released in 1991. Python's design philosophy emphasizes code readability "
"with the use of significant indentation."
)
question = "Who created Python?"
print("\n=== Question Answering ===")
result = registry.answer_question(question, context)
print(f" Q: {question}")
print(f" A: {result['answer']} (score: {result['score']:.3f})")
def demo_zero_shot():
"""Zero-shot classification."""
text = "The stock market crashed today after the Federal Reserve raised rates."
labels = ["politics", "finance", "sports", "technology"]
print("\n=== Zero-Shot Classification ===")
result = registry.zero_shot_classify(text, labels)
for label, score in zip(result["labels"], result["scores"]):
print(f" {label:15s} → {score:.3f}")
def demo_hub():
"""Hub model search."""
hub = HubClient()
print("\n=== Hub Search: Top Sentiment Models ===")
models = hub.search_models(task="text-classification", limit=5)
for m in models:
print(f" {m.model_id:50s} ({m.downloads:>10,} downloads)")
if __name__ == "__main__":
demo_classification()
demo_ner()
demo_summarization()
demo_qa()
demo_zero_shot()
    demo_hub()
Running the Project
# Install dependencies
pip install -r requirements.txt
# Run the interactive demo
python examples/demo.py
# Start the API server
uvicorn api.main:app --reload
# Test endpoints
curl -X POST http://localhost:8000/classify \
-H "Content-Type: application/json" \
-d '{"text": "This is amazing!"}'
curl -X POST http://localhost:8000/hub/search \
-H "Content-Type: application/json" \
  -d '{"task": "text-classification", "limit": 5}'
Key Concepts Recap
| Concept | What It Is | Why It Matters |
|---|---|---|
| pipeline() | One-line inference for 20+ tasks | Fastest way from zero to working model |
| AutoModel | Automatic model architecture detection | Load any model without knowing its class |
| AutoTokenizer | Automatic tokenizer loading | Matches the correct tokenizer to any model |
| HfApi | Programmatic Hub access | Search, download, and manage models in code |
| Lazy Loading | Create pipelines on first use | Avoid loading all models into memory at startup |
| Device Placement | .to(device) for GPU/CPU | Run inference on the fastest available hardware |
| Subword Merging | Combine ## tokens in NER | Models tokenize words into pieces; merge them back |
Next Steps
- Tokenizers Deep Dive — Understand what happens inside the tokenizer
- Datasets Mastery — Load and process data for model training