HuggingFace Ecosystem · Beginner
Pipelines & Hub
Use pre-trained models via pipelines and interact with the HuggingFace Hub API
TL;DR
Use HuggingFace pipeline() to run pre-trained models for 20+ NLP tasks in a single line of code, and the huggingface_hub library to search, download, and manage models programmatically.
Build a multi-task NLP application using HuggingFace Pipelines, and learn to interact with the Hub API for model discovery and management.
What You'll Learn
- Running inference with pipeline() for text, image, and audio tasks
- AutoModel and AutoTokenizer patterns for direct model loading
- Hub API for searching, downloading, and uploading models
- Device placement and dtype configuration
- Batched inference for throughput
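To set expectations, here is a minimal sketch of that one-line pattern, including the batched-call form (the model id is this project's default for sentiment; the first call downloads and caches the weights):

```python
from transformers import pipeline

# One pipeline per task; pipeline() wires up the tokenizer, model, and
# post-processing from the model's config automatically.
classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

# Single input -> list of {"label", "score"} dicts
print(classifier("I love this library!"))

# Batched input: pass a list plus batch_size for higher throughput
texts = ["Great docs.", "Terrible latency.", "Works fine."]
print(classifier(texts, batch_size=8))
```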
Tech Stack
| Component | Technology |
|---|---|
| Pipelines | transformers |
| Hub API | huggingface_hub |
| API | FastAPI |
| Python | 3.10+ |
Architecture
┌──────────────────────────────────────────────────────────────────────────────┐
│ PIPELINE-BASED NLP SERVICE │
├──────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌────────────────────────────────────────────┐ │
│ │ FastAPI │ │ Pipeline Registry │ │
│ │ Endpoints │───▶│ │ │
│ │ │ │ ┌────────────┐ ┌────────────┐ │ │
│ │ /classify │ │ │ Sentiment │ │ NER │ │ │
│ │ /ner │ │ │ Pipeline │ │ Pipeline │ │ │
│ │ /summarize │ │ └────────────┘ └────────────┘ │ │
│ │ /generate │ │ ┌────────────┐ ┌────────────┐ │ │
│ │ /qa │ │ │ Summarize │ │ Text Gen │ │ │
│ │ /translate │ │ │ Pipeline │ │ Pipeline │ │ │
│ └─────────────┘ │ └────────────┘ └────────────┘ │ │
│ └────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────────┐ │
│ │ HuggingFace Hub │ │
│ │ • Model discovery (list_models) │ │
│ │ • Model metadata (model_info) │ │
│ │ • Weight download (hf_hub_download) │ │
│ └────────────────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────────────┘
Project Structure
pipelines-and-hub/
├── src/
│ ├── __init__.py
│ ├── pipelines.py # Pipeline wrappers for each task
│ ├── hub_client.py # Hub API interactions
│ ├── auto_models.py # AutoModel/AutoTokenizer patterns
│ └── config.py # Device and model configuration
├── api/
│ └── main.py # FastAPI application
├── examples/
│ └── demo.py # Interactive demo script
├── requirements.txt
└── README.md
Implementation
Step 1: Dependencies
transformers>=4.40.0
huggingface_hub>=0.23.0
torch>=2.0.0
fastapi>=0.111.0
uvicorn>=0.30.0
Step 2: Configuration
"""Device and model configuration."""
import torch
def get_device() -> str:
"""Select the best available device."""
if torch.cuda.is_available():
return "cuda"
elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
return "mps"
return "cpu"
def get_dtype(device: str) -> torch.dtype:
"""Select optimal dtype for the device."""
if device == "cuda":
return torch.float16
return torch.float32
DEVICE = get_device()
DTYPE = get_dtype(DEVICE)
# Default models for each task
DEFAULT_MODELS = {
"text-classification": "distilbert-base-uncased-finetuned-sst-2-english",
"ner": "dslim/bert-base-NER",
"summarization": "facebook/bart-large-cnn",
"translation": "Helsinki-NLP/opus-mt-en-fr",
"text-generation": "gpt2",
"question-answering": "distilbert-base-cased-distilled-squad",
"zero-shot-classification": "facebook/bart-large-mnli",
"fill-mask": "bert-base-uncased",
}
Understanding Device Selection:
┌─────────────────────────────────────────────────────────────────┐
│ DEVICE PRIORITY │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Check CUDA (NVIDIA GPU) │
│ │ │
│ ├── Available? ──► Use "cuda" + float16 │
│ │ (fastest, half-precision saves VRAM) │
│ │ │
│ └── No ──► Check MPS (Apple Silicon) │
│ │ │
│ ├── Available? ──► Use "mps" + float32 │
│ │ (GPU accel on Mac) │
│ │ │
│ └── No ──► Use "cpu" + float32 │
│ (universal fallback) │
│ │
│ Why float16 on CUDA only: │
│ • Halves memory usage (fits larger models) │
│ • NVIDIA Tensor Cores accelerate fp16 natively │
│ • MPS/CPU don't benefit as much from fp16 │
│ │
└─────────────────────────────────────────────────────────────────┘
Step 3: Pipeline Wrappers
"""HuggingFace Pipeline wrappers for multiple NLP tasks."""
from transformers import pipeline
from typing import Any
from .config import DEVICE, DEFAULT_MODELS
class PipelineRegistry:
"""
Lazy-loading registry for HuggingFace pipelines.
Pipelines are created on first use and cached for subsequent calls.
This avoids loading all models at startup.
"""
def __init__(self):
self._pipelines: dict[str, Any] = {}
def get(self, task: str, model: str | None = None) -> Any:
"""Get or create a pipeline for the given task."""
model = model or DEFAULT_MODELS.get(task)
cache_key = f"{task}:{model}"
if cache_key not in self._pipelines:
self._pipelines[cache_key] = pipeline(
task=task,
model=model,
device=DEVICE,
)
return self._pipelines[cache_key]
def classify(self, text: str) -> list[dict]:
"""Sentiment / text classification."""
pipe = self.get("text-classification")
return pipe(text)
def extract_entities(self, text: str) -> list[dict]:
"""Named entity recognition."""
pipe = self.get("ner")
results = pipe(text)
# Merge subword tokens into complete entities
merged = []
current = None
for entity in results:
if entity["word"].startswith("##"):
# Subword continuation — append to current entity
if current:
current["word"] += entity["word"][2:]
current["end"] = entity["end"]
current["score"] = min(current["score"], entity["score"])
else:
if current:
merged.append(current)
current = {**entity}
if current:
merged.append(current)
return merged
def summarize(
self,
text: str,
max_length: int = 130,
min_length: int = 30,
) -> str:
"""Text summarization."""
pipe = self.get("summarization")
result = pipe(text, max_length=max_length, min_length=min_length)
return result[0]["summary_text"]
def generate(
self,
prompt: str,
max_new_tokens: int = 100,
temperature: float = 0.7,
top_p: float = 0.9,
) -> str:
"""Text generation."""
pipe = self.get("text-generation")
result = pipe(
prompt,
max_new_tokens=max_new_tokens,
temperature=temperature,
top_p=top_p,
do_sample=True,
)
return result[0]["generated_text"]
def answer_question(self, question: str, context: str) -> dict:
"""Extractive question answering."""
pipe = self.get("question-answering")
return pipe(question=question, context=context)
def translate(self, text: str, model: str | None = None) -> str:
"""Translation (default: English to French)."""
pipe = self.get("translation", model=model)
result = pipe(text)
return result[0]["translation_text"]
def zero_shot_classify(
self,
text: str,
candidate_labels: list[str],
) -> dict:
"""Zero-shot classification with candidate labels."""
pipe = self.get("zero-shot-classification")
return pipe(text, candidate_labels=candidate_labels)
def fill_mask(self, text: str) -> list[dict]:
"""Fill masked tokens in text."""
pipe = self.get("fill-mask")
return pipe(text)
# Global registry instance
registry = PipelineRegistry()
How pipeline() Works Under the Hood:
┌─────────────────────────────────────────────────────────────────┐
│ pipeline("text-classification", model="distilbert-base-...") │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. Resolve model ──► Download from Hub (cached locally) │
│ │
│ 2. Load components: │
│ ┌──────────────┐ ┌───────────────┐ ┌────────────────┐ │
│ │ Tokenizer │ │ Model │ │ Post-process │ │
│ │ (AutoToken- │ │ (AutoModel- │ │ (softmax, │ │
│ │ izer) │ │ ForSeqClass)│ │ label map) │ │
│ └──────┬───────┘ └───────┬───────┘ └────────┬───────┘ │
│ │ │ │ │
│ 3. __call__(text): │ │ │
│ text ──► tokenize ──► model.forward() ──► post-process │
│ │ │
│ 4. Return: [{"label": "POSITIVE", "score": 0.99}] │
│ │
│ Key insight: pipeline() auto-selects the correct AutoModel │
│ subclass based on the task (ForSequenceClassification, │
│ ForTokenClassification, ForSeq2SeqLM, etc.) │
│ │
└─────────────────────────────────────────────────────────────────┘
Pipeline Task Reference:
| Task | What It Does | Example Model |
|---|---|---|
| text-classification | Sentiment, topic labels | distilbert-base-uncased-finetuned-sst-2-english |
| ner | Named entity extraction | dslim/bert-base-NER |
| summarization | Condense long text | facebook/bart-large-cnn |
| translation_xx_to_yy | Language translation | Helsinki-NLP/opus-mt-en-fr |
| text-generation | Continue a prompt | gpt2 |
| question-answering | Extract answer from context | distilbert-base-cased-distilled-squad |
| zero-shot-classification | Classify without training | facebook/bart-large-mnli |
| fill-mask | Predict masked tokens | bert-base-uncased |
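To make the "under the hood" stages concrete, the sketch below reproduces them by hand for the same default sentiment model — pipeline() does roughly this resolution, forward pass, and post-processing for you:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# 1. Tokenize
inputs = tokenizer("This is amazing!", return_tensors="pt")

# 2. Forward pass
with torch.no_grad():
    logits = model(**inputs).logits

# 3. Post-process: softmax + the id-to-label map stored in config.json
probs = torch.softmax(logits, dim=-1)[0]
best = int(probs.argmax())
result = {"label": model.config.id2label[best], "score": float(probs[best])}
print(result)
```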
Step 4: Hub API Client
"""HuggingFace Hub API client for model discovery and management."""
# Note: ModelFilter was deprecated and later removed from huggingface_hub;
# task/library filters are now passed directly to list_models().
from huggingface_hub import HfApi, hf_hub_download
from dataclasses import dataclass
@dataclass
class ModelSummary:
"""Compact model information."""
model_id: str
author: str
downloads: int
likes: int
pipeline_tag: str | None
tags: list[str]
class HubClient:
"""Client for HuggingFace Hub operations."""
def __init__(self, token: str | None = None):
self.api = HfApi(token=token)
def search_models(
self,
query: str | None = None,
task: str | None = None,
library: str | None = None,
sort: str = "downloads",
limit: int = 10,
) -> list[ModelSummary]:
"""
Search for models on the Hub.
Args:
query: Free-text search query
task: Filter by pipeline task (e.g., "text-classification")
library: Filter by library (e.g., "pytorch", "transformers")
sort: Sort by "downloads", "likes", or "lastModified"
limit: Maximum results to return
"""
        models = list(
            self.api.list_models(
                search=query,
                task=task,
                library=library,
                sort=sort,
                direction=-1,
                limit=limit,
            )
        )
return [
ModelSummary(
model_id=m.id,
author=m.author or "unknown",
downloads=m.downloads or 0,
likes=m.likes or 0,
pipeline_tag=m.pipeline_tag,
tags=m.tags or [],
)
for m in models
]
def get_model_info(self, model_id: str) -> dict:
"""Get detailed information about a model."""
        # files_metadata=True populates sibling sizes (otherwise size is None)
        info = self.api.model_info(model_id, files_metadata=True)
return {
"model_id": info.id,
"author": info.author,
"sha": info.sha,
"pipeline_tag": info.pipeline_tag,
"tags": info.tags,
"downloads": info.downloads,
"likes": info.likes,
"library_name": info.library_name,
"languages": getattr(info, "languages", []),
"siblings": [
{"filename": s.rfilename, "size": s.size}
for s in (info.siblings or [])
],
}
def download_file(
self,
model_id: str,
filename: str,
cache_dir: str | None = None,
) -> str:
"""
Download a specific file from a model repository.
Returns the local path to the downloaded file.
"""
return hf_hub_download(
repo_id=model_id,
filename=filename,
cache_dir=cache_dir,
)
def list_model_files(self, model_id: str) -> list[str]:
"""List all files in a model repository."""
info = self.api.model_info(model_id)
        return [s.rfilename for s in (info.siblings or [])]
Hub API Operations:
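The class above wraps huggingface_hub; here is a standalone sketch of the same three operations against the live Hub (results vary over time, and network access is required):

```python
from huggingface_hub import HfApi, hf_hub_download

api = HfApi()

# DISCOVER: top text-classification models by download count
for m in api.list_models(task="text-classification", sort="downloads",
                         direction=-1, limit=3):
    print(m.id, m.downloads)

# INSPECT: metadata and file listing for one repository
info = api.model_info("distilbert-base-uncased-finetuned-sst-2-english")
print(info.pipeline_tag, [s.rfilename for s in info.siblings][:4])

# USE: fetch a single file into the local cache and get its path
path = hf_hub_download(
    repo_id="distilbert-base-uncased-finetuned-sst-2-english",
    filename="config.json",
)
print(path)
```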
┌─────────────────────────────────────────────────────────────────┐
│ HUGGINGFACE HUB WORKFLOW │
├─────────────────────────────────────────────────────────────────┤
│ │
│ DISCOVER │
│ list_models(task="text-classification", sort="downloads") │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Results: │ │
│ │ 1. distilbert-base-uncased-finetuned-sst-2 (50M downloads)│ │
│ │ 2. cardiffnlp/twitter-roberta-base-sentiment (10M) │ │
│ │ 3. nlptown/bert-base-multilingual-uncased-sentiment (5M) │ │
│ └────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ INSPECT │
│ model_info("distilbert-base-uncased-finetuned-sst-2-english") │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Files: config.json, model.safetensors, tokenizer.json, ... │ │
│ │ Size: ~268 MB │ │
│ │ Library: transformers │ │
│ └────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ USE │
│ pipeline("text-classification", model="distilbert-base-...") │
│ (auto-downloads and caches weights on first call) │
│ │
└─────────────────────────────────────────────────────────────────┘
Step 5: AutoModel Patterns
"""Direct model loading with AutoModel and AutoTokenizer."""
import torch
from transformers import (
AutoTokenizer,
AutoModel,
AutoModelForSequenceClassification,
AutoModelForTokenClassification,
AutoModelForCausalLM,
)
from .config import DEVICE, DTYPE
def load_for_embeddings(model_name: str = "bert-base-uncased"):
"""
Load a model for extracting embeddings.
Uses AutoModel (no task-specific head) to get hidden states.
"""
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(
model_name,
torch_dtype=DTYPE,
).to(DEVICE)
model.eval()
return tokenizer, model
def get_embeddings(
texts: list[str],
tokenizer,
model,
pooling: str = "mean",
) -> torch.Tensor:
"""
Extract embeddings from texts.
Args:
texts: List of input texts
tokenizer: HuggingFace tokenizer
model: HuggingFace model
pooling: "mean", "cls", or "max"
"""
inputs = tokenizer(
texts,
padding=True,
truncation=True,
max_length=512,
return_tensors="pt",
).to(DEVICE)
with torch.no_grad():
outputs = model(**inputs)
hidden_states = outputs.last_hidden_state # [batch, seq_len, hidden_dim]
attention_mask = inputs["attention_mask"]
if pooling == "cls":
# Use the [CLS] token representation
embeddings = hidden_states[:, 0, :]
elif pooling == "mean":
# Mean of all non-padding tokens
mask = attention_mask.unsqueeze(-1).float()
embeddings = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1)
    elif pooling == "max":
        # Max pooling over the sequence dimension; mask padding to -inf
        # first (masked_fill avoids mutating hidden_states in place, and
        # -inf is safe in float16 where a literal -1e9 would overflow)
        masked = hidden_states.masked_fill(
            attention_mask.unsqueeze(-1) == 0, float("-inf")
        )
        embeddings = masked.max(dim=1).values
else:
raise ValueError(f"Unknown pooling: {pooling}")
return embeddings
def load_for_classification(
model_name: str,
num_labels: int = 2,
):
"""Load a model with a classification head."""
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
model_name,
num_labels=num_labels,
torch_dtype=DTYPE,
).to(DEVICE)
return tokenizer, model
def load_for_generation(
model_name: str = "gpt2",
):
"""Load a causal language model for text generation."""
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=DTYPE,
).to(DEVICE)
# GPT-2 doesn't have a pad token by default
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
    return tokenizer, model
AutoModel Class Hierarchy:
┌─────────────────────────────────────────────────────────────────┐
│ AUTOMODEL SELECTION │
├─────────────────────────────────────────────────────────────────┤
│ │
│ AutoModel (base — no task head) │
│ └── Returns raw hidden states [batch, seq, hidden_dim] │
│ Use for: embeddings, custom heads │
│ │
│ AutoModelForSequenceClassification │
│ └── Adds Linear(hidden_dim, num_labels) on top of [CLS] │
│ Use for: sentiment, topic classification │
│ │
│ AutoModelForTokenClassification │
│ └── Adds Linear(hidden_dim, num_labels) on every token │
│ Use for: NER, POS tagging │
│ │
│ AutoModelForCausalLM │
│ └── Adds LM head for next-token prediction │
│ Use for: text generation (GPT-style) │
│ │
│ AutoModelForSeq2SeqLM │
│ └── Encoder-decoder with generation head │
│ Use for: translation, summarization (T5, BART) │
│ │
│ The "Auto" prefix means: automatically detect the correct │
│ model architecture from config.json in the model repo. │
│ │
└─────────────────────────────────────────────────────────────────┘
Pooling Strategy Comparison:
| Strategy | How It Works | Best For |
|---|---|---|
| cls | Use the [CLS] token embedding | Models fine-tuned with [CLS] as aggregate |
| mean | Average all non-padding token embeddings | General-purpose similarity, most robust |
| max | Max value across each dimension | Capturing strongest feature activations |
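Putting mean pooling to work end to end — a self-contained sketch (it inlines the loading instead of importing src.auto_models, and the similarity ordering is what raw BERT typically produces, not a guarantee):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

texts = [
    "A cat sat on the mat.",
    "A kitten rested on the rug.",
    "Stocks fell sharply today.",
]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # [batch, seq_len, hidden_dim]

# Mean pooling over non-padding tokens only
mask = inputs["attention_mask"].unsqueeze(-1).float()
emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Cosine similarity between the first sentence and the other two
emb = torch.nn.functional.normalize(emb, dim=-1)
sims = emb @ emb.T
print(f"cat vs kitten: {sims[0, 1]:.3f}, cat vs stocks: {sims[0, 2]:.3f}")
```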
Step 6: FastAPI Application
"""FastAPI application exposing pipeline endpoints."""
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field
from src.pipelines import registry
from src.hub_client import HubClient
app = FastAPI(title="HuggingFace Pipelines API")
hub = HubClient()
class TextInput(BaseModel):
text: str = Field(..., min_length=1)
class QAInput(BaseModel):
question: str
context: str
class ZeroShotInput(BaseModel):
text: str
labels: list[str]
class GenerateInput(BaseModel):
prompt: str
max_new_tokens: int = 100
temperature: float = 0.7
class SearchModelsInput(BaseModel):
query: str | None = None
task: str | None = None
limit: int = 10
@app.get("/health")
async def health():
return {"status": "healthy"}
@app.post("/classify")
async def classify(req: TextInput):
return registry.classify(req.text)
@app.post("/ner")
async def ner(req: TextInput):
entities = registry.extract_entities(req.text)
return {"entities": entities}
@app.post("/summarize")
async def summarize(req: TextInput):
summary = registry.summarize(req.text)
return {"summary": summary}
@app.post("/generate")
async def generate(req: GenerateInput):
text = registry.generate(
req.prompt,
max_new_tokens=req.max_new_tokens,
temperature=req.temperature,
)
return {"generated_text": text}
@app.post("/qa")
async def question_answering(req: QAInput):
return registry.answer_question(req.question, req.context)
@app.post("/zero-shot")
async def zero_shot(req: ZeroShotInput):
return registry.zero_shot_classify(req.text, req.labels)
@app.post("/hub/search")
async def search_models(req: SearchModelsInput):
results = hub.search_models(
query=req.query,
task=req.task,
limit=req.limit,
)
return {"models": [r.__dict__ for r in results]}
@app.get("/hub/model/{model_id:path}")
async def get_model(model_id: str):
try:
return hub.get_model_info(model_id)
except Exception as e:
        raise HTTPException(status_code=404, detail=str(e))
Step 7: Interactive Demo
"""Interactive demo showing all pipeline capabilities."""
from src.pipelines import registry
from src.hub_client import HubClient
def demo_classification():
"""Sentiment analysis."""
texts = [
"I love this product! It works perfectly.",
"Terrible experience. Would not recommend.",
"It's okay, nothing special.",
]
print("=== Text Classification ===")
for text in texts:
result = registry.classify(text)
print(f" '{text[:50]}...' → {result[0]['label']} ({result[0]['score']:.3f})")
def demo_ner():
"""Named entity recognition."""
text = "Elon Musk founded SpaceX in Hawthorne, California in 2002."
print("\n=== Named Entity Recognition ===")
print(f" Text: {text}")
entities = registry.extract_entities(text)
for ent in entities:
print(f" {ent['word']:20s} → {ent['entity']:10s} (score: {ent['score']:.3f})")
def demo_summarization():
"""Text summarization."""
text = (
"The tower is 324 metres (1,063 ft) tall, about the same height as an "
"81-storey building, and the tallest structure in Paris. Its base is square, "
"measuring 125 metres (410 ft) on each side. During its construction, the "
"Eiffel Tower surpassed the Washington Monument to become the tallest "
"man-made structure in the world, a title it held for 41 years until the "
"Chrysler Building in New York City was finished in 1930."
)
print("\n=== Summarization ===")
summary = registry.summarize(text)
print(f" Summary: {summary}")
def demo_qa():
"""Question answering."""
context = (
"Python is a programming language created by Guido van Rossum and first "
"released in 1991. Python's design philosophy emphasizes code readability "
"with the use of significant indentation."
)
question = "Who created Python?"
print("\n=== Question Answering ===")
result = registry.answer_question(question, context)
print(f" Q: {question}")
print(f" A: {result['answer']} (score: {result['score']:.3f})")
def demo_zero_shot():
"""Zero-shot classification."""
text = "The stock market crashed today after the Federal Reserve raised rates."
labels = ["politics", "finance", "sports", "technology"]
print("\n=== Zero-Shot Classification ===")
result = registry.zero_shot_classify(text, labels)
for label, score in zip(result["labels"], result["scores"]):
print(f" {label:15s} → {score:.3f}")
def demo_hub():
"""Hub model search."""
hub = HubClient()
print("\n=== Hub Search: Top Sentiment Models ===")
models = hub.search_models(task="text-classification", limit=5)
for m in models:
print(f" {m.model_id:50s} ({m.downloads:>10,} downloads)")
if __name__ == "__main__":
demo_classification()
demo_ner()
demo_summarization()
demo_qa()
demo_zero_shot()
    demo_hub()
Running the Project
# Install dependencies
pip install -r requirements.txt
# Run the interactive demo
python examples/demo.py
# Start the API server
uvicorn api.main:app --reload
# Test endpoints
curl -X POST http://localhost:8000/classify \
-H "Content-Type: application/json" \
-d '{"text": "This is amazing!"}'
curl -X POST http://localhost:8000/hub/search \
-H "Content-Type: application/json" \
  -d '{"task": "text-classification", "limit": 5}'
Key Concepts Recap
| Concept | What It Is | Why It Matters |
|---|---|---|
| pipeline() | One-line inference for 20+ tasks | Fastest way from zero to working model |
| AutoModel | Automatic model architecture detection | Load any model without knowing its class |
| AutoTokenizer | Automatic tokenizer loading | Matches the correct tokenizer to any model |
| HfApi | Programmatic Hub access | Search, download, and manage models in code |
| Lazy Loading | Create pipelines on first use | Avoid loading all models into memory at startup |
| Device Placement | .to(device) for GPU/CPU | Run inference on the fastest available hardware |
| Subword Merging | Combine ## tokens in NER | Models tokenize words into pieces; merge them back |
Next Steps
- Tokenizers Deep Dive — Understand what happens inside the tokenizer
- Datasets Mastery — Load and process data for model training