Production AI Workbench
Capstone: Full Gradio app with text gen, search, image gen, and evaluation
TL;DR
Build a complete AI workbench with Gradio that unifies text generation, semantic search, image generation, and model evaluation in a single tabbed interface. Deploy to HuggingFace Spaces for free hosting, using safetensors for model loading and the HF Inference API as a fallback.
This capstone project brings together the entire HuggingFace ecosystem into a production-ready application with a Gradio web interface, multiple AI capabilities, model caching, and HuggingFace Spaces deployment.
What You'll Learn
- Building multi-tab Gradio applications
- Integrating transformers, sentence-transformers, and diffusers
- HuggingFace Inference API for serverless models
- Model caching and resource management
- safetensors for fast, safe model loading
- Deploying to HuggingFace Spaces
Tech Stack
| Component | Technology |
|---|---|
| UI | gradio |
| Text Generation | transformers |
| Search | sentence-transformers |
| Image Gen | diffusers |
| Evaluation | evaluate |
| Storage | safetensors |
| Hub | huggingface_hub |
| Python | 3.10+ |
Architecture
┌──────────────────────────────────────────────────────────────────────────────┐
│ PRODUCTION AI WORKBENCH │
├──────────────────────────────────────────────────────────────────────────────┤
│ │
│ GRADIO UI │
│ ┌────────────┬─────────────┬─────────────┬────────────┬──────────────┐ │
│ │ Text Gen │ Semantic │ Image Gen │ Evaluation │ Settings │ │
│ │ │ Search │ │ │ │ │
│ │ • Chat │ • Index │ • txt2img │ • BLEU │ • Model │ │
│ │ • Complete │ • Query │ • img2img │ • ROUGE │ • Device │ │
│ │ • Params │ • Upload │ • Inpaint │ • Compare │ • API Key │ │
│ └─────┬──────┴──────┬──────┴──────┬──────┴─────┬──────┴──────────────┘ │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ MODEL MANAGER │ │
│ │ ┌──────────────┐ ┌──────────┐ ┌────────────────────┐ │ │
│ │ │ Local Models │ │ Cache │ │ HF Inference API │ │ │
│ │ │ (safetensors) │ │ (LRU) │ │ (serverless) │ │ │
│ │ └──────────────┘ └──────────┘ └────────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────────────┘

Project Structure
production-workbench/
├── app.py # Main Gradio application
├── src/
│ ├── __init__.py
│ ├── model_manager.py # Model loading, caching, and fallback
│ ├── text_gen.py # Text generation tab
│ ├── search.py # Semantic search tab
│ ├── image_gen.py # Image generation tab
│ ├── evaluation.py # Model evaluation tab
│ └── inference_api.py # HF Inference API client
├── requirements.txt
├── Dockerfile # For HF Spaces deployment
└── README.md

Implementation
Step 1: Dependencies
gradio>=4.31.0
transformers>=4.40.0
sentence-transformers>=3.0.0
diffusers>=0.28.0
evaluate>=0.4.0
safetensors>=0.4.0
huggingface_hub>=0.23.0
accelerate>=0.30.0
torch>=2.0.0
Pillow>=10.0.0
rouge-score>=0.1.2
faiss-cpu>=1.8.0
numpy>=1.26.0

Step 2: Model Manager
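The model manager below bounds its cache and evicts in insertion order. A stricter least-recently-used policy, which also reorders entries on cache hits, can be sketched with `OrderedDict`; the `LRUCache` name and `loader` callback here are illustrative, not part of the project code:

```python
from collections import OrderedDict


class LRUCache:
    """Minimal LRU sketch: reorder on access, evict the least recently used."""

    def __init__(self, max_items: int = 3):
        self._items: OrderedDict = OrderedDict()
        self._max = max_items

    def get(self, key, loader):
        if key in self._items:
            self._items.move_to_end(key)  # mark as most recently used
            return self._items[key]
        while len(self._items) >= self._max:
            self._items.popitem(last=False)  # drop the least recently used
        self._items[key] = loader()  # load only on a cache miss
        return self._items[key]
```

The same `get(key, loader)` shape drops straight into the manager's `get_text_model`-style methods, with the expensive `from_pretrained` call as the loader.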
"""Centralized model loading, caching, and resource management."""
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from sentence_transformers import SentenceTransformer
from typing import Any
class ModelManager:
"""
Manages model lifecycle: loading, caching, and resource cleanup.
Design decisions:
- Models are loaded lazily (only when first needed)
- LRU cache prevents reloading recently used models
- Falls back to HF Inference API when GPU memory is low
- Uses safetensors format for secure, fast loading
"""
def __init__(
self,
device: str | None = None,
max_cached_models: int = 3,
hf_token: str | None = None,
):
if device is None:
if torch.cuda.is_available():
self.device = "cuda"
elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
self.device = "mps"
else:
self.device = "cpu"
else:
self.device = device
self.hf_token = hf_token
self._cache: dict[str, Any] = {}
self._max_cache = max_cached_models
def get_text_model(
self,
model_name: str = "gpt2",
) -> tuple:
"""Load a text generation model."""
cache_key = f"text:{model_name}"
if cache_key not in self._cache:
self._evict_if_needed()
tokenizer = AutoTokenizer.from_pretrained(model_name)
            model = AutoModelForCausalLM.from_pretrained(
                model_name,
                # from_pretrained prefers *.safetensors weights when the repo has them
                torch_dtype=torch.float16 if self.device == "cuda" else torch.float32,
                device_map="auto" if self.device == "cuda" else None,
            )
if self.device != "cuda":
model = model.to(self.device)
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
self._cache[cache_key] = (model, tokenizer)
return self._cache[cache_key]
def get_embedding_model(
self,
model_name: str = "all-MiniLM-L6-v2",
) -> SentenceTransformer:
"""Load a sentence-transformers model."""
cache_key = f"embed:{model_name}"
if cache_key not in self._cache:
self._evict_if_needed()
model = SentenceTransformer(model_name, device=self.device)
self._cache[cache_key] = model
return self._cache[cache_key]
def get_pipeline(
self,
task: str,
model_name: str | None = None,
) -> Any:
"""Get a HuggingFace pipeline."""
cache_key = f"pipe:{task}:{model_name}"
if cache_key not in self._cache:
self._evict_if_needed()
pipe = pipeline(
task,
model=model_name,
device=self.device if self.device != "cuda" else 0,
)
self._cache[cache_key] = pipe
return self._cache[cache_key]
def _evict_if_needed(self):
"""Remove oldest model if cache is full."""
while len(self._cache) >= self._max_cache:
oldest_key = next(iter(self._cache))
del self._cache[oldest_key]
if self.device == "cuda":
torch.cuda.empty_cache()
def clear_cache(self):
"""Clear all cached models."""
self._cache.clear()
if self.device == "cuda":
torch.cuda.empty_cache()
@property
def status(self) -> dict:
"""Return current cache and device status."""
info = {
"device": self.device,
"cached_models": list(self._cache.keys()),
"cache_size": len(self._cache),
}
if self.device == "cuda":
info["gpu_memory_allocated"] = f"{torch.cuda.memory_allocated() / 1e9:.2f} GB"
            info["gpu_memory_total"] = f"{torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB"
        return info

Model Caching Strategy:
┌─────────────────────────────────────────────────────────────────┐
│ MODEL LIFECYCLE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Request: "Generate text with GPT-2" │
│ │ │
│ ▼ │
│ In cache? ──► Yes ──► Return cached model (instant) │
│ │ │
│ No │
│ │ │
│ ▼ │
│ Cache full? ──► Yes ──► Evict oldest model ──► Free GPU memory │
│ │ │
│ No │
│ │ │
│ ▼ │
│ Local weights available? ──► Yes ──► Load from safetensors │
│ │ │
│ No │
│ │ │
│ ▼ │
│ Download from Hub ──► Cache locally ──► Load to device │
│ │
│ Fallback: If GPU OOM, use HF Inference API (serverless) │
│ │
└─────────────────────────────────────────────────────────────────┘

Step 3: HF Inference API Client
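The fallback path in the caching diagram (local model first, serverless API on failure) is not wired into the manager above. Stripped of any model code, the pattern is just a wrapper; all names here are illustrative:

```python
def with_fallback(primary, fallback, errors=(RuntimeError, MemoryError)):
    """Return a callable that tries `primary` and falls back on failure.

    torch raises torch.cuda.OutOfMemoryError (a RuntimeError subclass) on OOM,
    so catching RuntimeError covers that case without importing torch here.
    """
    def run(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except errors:
            return fallback(*args, **kwargs)
    return run
```

In the workbench this would wrap local generation, with `InferenceAPIClient.generate_text` (shown below) as the fallback.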
"""HuggingFace Inference API for serverless model execution."""
from huggingface_hub import InferenceClient
class InferenceAPIClient:
"""
Client for HuggingFace's serverless Inference API.
Use when:
- No local GPU available
- Model is too large for local hardware
- Quick prototyping without downloads
- Production with auto-scaling
"""
def __init__(self, token: str | None = None):
self.client = InferenceClient(token=token)
def generate_text(
self,
prompt: str,
model: str = "meta-llama/Llama-3.1-8B-Instruct",
max_new_tokens: int = 256,
temperature: float = 0.7,
) -> str:
"""Generate text using the Inference API."""
response = self.client.text_generation(
prompt,
model=model,
max_new_tokens=max_new_tokens,
temperature=temperature,
)
return response
def compute_embeddings(
self,
texts: list[str],
model: str = "sentence-transformers/all-MiniLM-L6-v2",
) -> list[list[float]]:
"""Compute embeddings via the API."""
response = self.client.feature_extraction(
texts,
model=model,
)
return response
def generate_image(
self,
prompt: str,
model: str = "stabilityai/stable-diffusion-xl-base-1.0",
):
"""Generate an image via the API."""
image = self.client.text_to_image(
prompt,
model=model,
)
return image
def classify_text(
self,
text: str,
model: str = "distilbert-base-uncased-finetuned-sst-2-english",
) -> list[dict]:
"""Classify text via the API."""
        return self.client.text_classification(text, model=model)

Step 4: Text Generation Tab
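The tab below exposes temperature and top-p sliders. Top-p (nucleus) sampling keeps the smallest set of highest-probability tokens whose cumulative mass reaches p, then renormalizes. A minimal numpy sketch of the idea — illustrative only, not the transformers implementation:

```python
import numpy as np


def top_p_filter(probs: np.ndarray, top_p: float = 0.9) -> np.ndarray:
    """Zero out tokens outside the nucleus and renormalize the rest."""
    order = np.argsort(probs)[::-1]           # most probable token first
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, top_p) + 1  # smallest set with mass >= top_p
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()
```

With `probs = [0.5, 0.3, 0.15, 0.05]` and `top_p=0.9`, the nucleus is the first three tokens; the fourth is never sampled.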
"""Text generation tab for the workbench."""
import gradio as gr
from .model_manager import ModelManager
def create_text_gen_tab(manager: ModelManager) -> gr.Tab:
"""Create the text generation tab."""
def generate(
prompt: str,
model_name: str,
max_tokens: int,
temperature: float,
top_p: float,
) -> str:
        import torch

        model, tokenizer = manager.get_text_model(model_name)
        inputs = tokenizer(prompt, return_tensors="pt").to(manager.device)
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=max_tokens,
                temperature=temperature,
                top_p=top_p,
                do_sample=True,
                pad_token_id=tokenizer.pad_token_id,
            )
return tokenizer.decode(outputs[0], skip_special_tokens=True)
with gr.Tab("Text Generation") as tab:
gr.Markdown("## Text Generation")
with gr.Row():
with gr.Column(scale=2):
prompt = gr.Textbox(
label="Prompt",
placeholder="Enter your prompt...",
lines=4,
)
output = gr.Textbox(label="Output", lines=10)
with gr.Column(scale=1):
model_name = gr.Dropdown(
choices=["gpt2", "gpt2-medium", "gpt2-large"],
value="gpt2",
label="Model",
)
max_tokens = gr.Slider(10, 500, value=100, label="Max Tokens")
temperature = gr.Slider(0.1, 2.0, value=0.7, label="Temperature")
top_p = gr.Slider(0.1, 1.0, value=0.9, label="Top-p")
generate_btn = gr.Button("Generate", variant="primary")
generate_btn.click(
generate,
inputs=[prompt, model_name, max_tokens, temperature, top_p],
outputs=output,
)
    return tab

Step 5: Semantic Search Tab
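The search tab uses a FAISS inner-product index (`IndexFlatIP`) over L2-normalized vectors, which makes the returned scores cosine similarities. The equivalence in plain numpy, with toy 2-D vectors standing in for real embeddings:

```python
import numpy as np

docs = np.array([[1.0, 2.0], [2.0, 1.0]], dtype="float32")
query = np.array([[1.0, 2.0]], dtype="float32")

# L2-normalize rows (what faiss.normalize_L2 does in place)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query /= np.linalg.norm(query, axis=1, keepdims=True)

# Inner product of unit vectors == cosine similarity
scores = query @ docs.T
```

The query matches itself with score 1.0 and the other document with 0.8, exactly the numbers `IndexFlatIP` would return after `normalize_L2`.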
"""Semantic search tab for the workbench."""
import gradio as gr
import numpy as np
import faiss
from .model_manager import ModelManager
class SearchState:
"""Manages the search index state."""
def __init__(self):
self.index = None
self.documents = []
self.dimension = None
    def build_index(self, embeddings: np.ndarray, documents: list[str]):
        # FAISS requires contiguous float32; convert before normalizing
        embeddings = np.ascontiguousarray(embeddings, dtype="float32")
        self.dimension = embeddings.shape[1]
        self.index = faiss.IndexFlatIP(self.dimension)
        # Normalize in place so inner-product scores are cosine similarities
        faiss.normalize_L2(embeddings)
        self.index.add(embeddings)
        self.documents = documents
    def search(self, query_embedding: np.ndarray, k: int = 5) -> list[tuple]:
        if self.index is None:
            return []
        query_embedding = np.ascontiguousarray(query_embedding, dtype="float32")
        faiss.normalize_L2(query_embedding)
        scores, indices = self.index.search(query_embedding, k)
return [
(self.documents[idx], float(score))
for score, idx in zip(scores[0], indices[0])
if idx != -1
]
def create_search_tab(manager: ModelManager) -> gr.Tab:
"""Create the semantic search tab."""
state = SearchState()
def index_documents(text: str, model_name: str) -> str:
documents = [line.strip() for line in text.split("\n") if line.strip()]
if not documents:
return "No documents to index."
embed_model = manager.get_embedding_model(model_name)
embeddings = embed_model.encode(documents)
state.build_index(embeddings, documents)
return f"Indexed {len(documents)} documents with {model_name}"
def search(query: str, model_name: str, k: int) -> str:
if state.index is None:
return "Please index documents first."
embed_model = manager.get_embedding_model(model_name)
query_emb = embed_model.encode([query])
results = state.search(query_emb, k=k)
output = ""
for i, (doc, score) in enumerate(results, 1):
output += f"**{i}. (score: {score:.4f})**\n{doc}\n\n"
return output or "No results found."
with gr.Tab("Semantic Search") as tab:
gr.Markdown("## Semantic Search")
model_name = gr.Dropdown(
choices=["all-MiniLM-L6-v2", "all-mpnet-base-v2"],
value="all-MiniLM-L6-v2",
label="Embedding Model",
)
with gr.Row():
with gr.Column():
docs_input = gr.Textbox(
label="Documents (one per line)",
lines=10,
placeholder="Enter documents to index...",
)
index_btn = gr.Button("Index Documents")
index_status = gr.Textbox(label="Status")
with gr.Column():
query_input = gr.Textbox(label="Search Query")
k_slider = gr.Slider(1, 20, value=5, step=1, label="Results")
search_btn = gr.Button("Search", variant="primary")
results_output = gr.Markdown(label="Results")
index_btn.click(index_documents, [docs_input, model_name], index_status)
search_btn.click(search, [query_input, model_name, k_slider], results_output)
    return tab

Step 6: Evaluation Tab
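BLEU is built from clipped n-gram precision; the unigram case shows why clipping matters (a prediction cannot be rewarded for repeating a reference word). A toy sketch of the building block, not the `evaluate` implementation:

```python
from collections import Counter


def clipped_unigram_precision(prediction: str, reference: str) -> float:
    """Count each predicted word at most as often as it appears in the reference."""
    pred = prediction.split()
    ref_counts = Counter(reference.split())
    clipped = sum(min(count, ref_counts[word])
                  for word, count in Counter(pred).items())
    return clipped / len(pred)
```

`"the the the"` against `"the cat"` scores 1/3, not 1.0: the repeated "the" is clipped to the single occurrence in the reference.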
"""Model evaluation tab for the workbench."""
import gradio as gr
import evaluate
def create_evaluation_tab() -> gr.Tab:
"""Create the model evaluation tab."""
def compute_metrics(
predictions_text: str,
references_text: str,
metrics: list[str],
) -> str:
predictions = [line.strip() for line in predictions_text.split("\n") if line.strip()]
references = [line.strip() for line in references_text.split("\n") if line.strip()]
if len(predictions) != len(references):
return f"Mismatch: {len(predictions)} predictions vs {len(references)} references"
results = {}
if "ROUGE" in metrics:
rouge = evaluate.load("rouge")
results.update(rouge.compute(predictions=predictions, references=references))
if "BLEU" in metrics:
bleu = evaluate.load("bleu")
refs = [[r] for r in references]
bleu_result = bleu.compute(predictions=predictions, references=refs)
results["bleu"] = bleu_result["bleu"]
output = "## Evaluation Results\n\n"
output += "| Metric | Score |\n|--------|-------|\n"
for metric, score in results.items():
if isinstance(score, float):
output += f"| {metric} | {score:.4f} |\n"
return output
with gr.Tab("Evaluation") as tab:
gr.Markdown("## Model Evaluation")
metrics_select = gr.CheckboxGroup(
choices=["ROUGE", "BLEU"],
value=["ROUGE"],
label="Metrics",
)
with gr.Row():
preds_input = gr.Textbox(
label="Predictions (one per line)",
lines=8,
)
refs_input = gr.Textbox(
label="References (one per line)",
lines=8,
)
eval_btn = gr.Button("Evaluate", variant="primary")
results_output = gr.Markdown()
eval_btn.click(
compute_metrics,
[preds_input, refs_input, metrics_select],
results_output,
)
    return tab

Step 7: Main Application
"""Main Gradio application — Production AI Workbench."""
import gradio as gr
from src.model_manager import ModelManager
from src.text_gen import create_text_gen_tab
from src.search import create_search_tab
from src.evaluation import create_evaluation_tab
def create_app() -> gr.Blocks:
"""Create the complete Gradio application."""
manager = ModelManager()
with gr.Blocks(
title="AI Workbench",
theme=gr.themes.Soft(),
) as app:
gr.Markdown(
"# AI Workbench\n"
"A unified interface for text generation, semantic search, "
"and model evaluation powered by the HuggingFace ecosystem."
)
create_text_gen_tab(manager)
create_search_tab(manager)
create_evaluation_tab()
with gr.Tab("Settings"):
gr.Markdown("## System Status")
status_output = gr.JSON(label="Status")
refresh_btn = gr.Button("Refresh Status")
refresh_btn.click(lambda: manager.status, outputs=status_output)
            clear_result = gr.Textbox(label="Result")
            clear_btn = gr.Button("Clear Model Cache", variant="stop")

            def clear_cache() -> str:
                manager.clear_cache()
                return "Cache cleared"

            clear_btn.click(clear_cache, outputs=clear_result)
return app
if __name__ == "__main__":
app = create_app()
app.launch(
server_name="0.0.0.0",
server_port=7860,
share=False,
    )

Step 8: HuggingFace Spaces Deployment
FROM python:3.11-slim
WORKDIR /app
# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
&& rm -rf /var/lib/apt/lists/*
# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy application code
COPY . .
# Expose Gradio port
EXPOSE 7860
# Run the app
CMD ["python", "app.py"]

Deploying to HuggingFace Spaces:
# 1. Create a Space on huggingface.co/new-space
# Select "Gradio" as the SDK
# 2. Clone the Space repo
git clone https://huggingface.co/spaces/YOUR_USERNAME/ai-workbench
# 3. Copy your project files
cp -r production-workbench/* ai-workbench/
# 4. Push to deploy
cd ai-workbench
git add .
git commit -m "Initial deployment"
git push
# The Space will build and deploy automatically.
# Free tier: 2 vCPU, 16GB RAM (CPU-only)
# Upgraded: T4 GPU ($0.60/hr) or A10G ($1.05/hr)

Deployment Architecture:
┌─────────────────────────────────────────────────────────────────┐
│ HUGGINGFACE SPACES DEPLOYMENT │
├─────────────────────────────────────────────────────────────────┤
│ │
│ your-username/ai-workbench.hf.space │
│ │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ HF Spaces Infrastructure │ │
│ │ │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────────────────┐ │ │
│ │ │ Docker │───▶│ Gradio │───▶│ Public URL │ │ │
│ │ │ Build │ │ Server │ │ (HTTPS, auto-SSL) │ │ │
│ │ └──────────┘ └──────────┘ └──────────────────────┘ │ │
│ │ │ │
│ │ Features: │ │
│ │ • Auto-build from Dockerfile or requirements.txt │ │
│ │ • Git-based deployment (push to deploy) │ │
│ │ • Sleep after inactivity (free tier) │ │
│ │ • Persistent storage (optional, paid) │ │
│ │ • GPU upgrade available │ │
│ │ • Custom domain support │ │
│ └────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘

Running the Project
# Install dependencies
pip install -r requirements.txt
# Run locally
python app.py
# Open http://localhost:7860
# Run with GPU
CUDA_VISIBLE_DEVICES=0 python app.py
# Deploy to Spaces
# Follow the deployment steps above

Key Concepts Recap
| Concept | What It Is | Why It Matters |
|---|---|---|
| Gradio Blocks | Flexible UI framework for ML apps | Build complex multi-tab interfaces |
| Model Manager | Centralized model loading with LRU cache | Prevents OOM, shares models across tabs |
| HF Inference API | Serverless model execution | Run large models without local GPU |
| safetensors | Secure tensor file format | 2-5x faster loading, no code execution risk |
| HuggingFace Spaces | Free ML app hosting | Deploy Gradio apps with one git push |
| FAISS | Similarity search library | Sub-millisecond search for the search tab |
| evaluate | Standardized metrics library | Consistent evaluation across models |
Next Steps
Congratulations on completing the HuggingFace Ecosystem category! Consider:
- Deep Learning — Understand the foundations under these abstractions
- RAG Projects — Apply embeddings and search to RAG pipelines
- AI Agents — Build autonomous agents using HuggingFace models