Production AI Workbench
Capstone: Full Gradio app with text gen, search, image gen, and evaluation
TL;DR
Build a complete AI workbench with Gradio that unifies text generation, semantic search, image generation, and model evaluation in a single tabbed interface. Deploy to HuggingFace Spaces for free hosting, using safetensors for model loading and the HF Inference API as a fallback.
This capstone project brings together the entire HuggingFace ecosystem into a production-ready application with a Gradio web interface, multiple AI capabilities, model caching, and HuggingFace Spaces deployment.
What You'll Learn
- Building multi-tab Gradio applications
- Integrating transformers, sentence-transformers, and diffusers
- HuggingFace Inference API for serverless models
- Model caching and resource management
- safetensors for fast, safe model loading
- Deploying to HuggingFace Spaces
Tech Stack
| Component | Technology |
|---|---|
| UI | gradio |
| Text Generation | transformers |
| Search | sentence-transformers |
| Image Gen | diffusers |
| Evaluation | evaluate |
| Storage | safetensors |
| Hub | huggingface_hub |
| Python | 3.10+ |
Architecture
┌──────────────────────────────────────────────────────────────────────────────┐
│ PRODUCTION AI WORKBENCH │
├──────────────────────────────────────────────────────────────────────────────┤
│ │
│ GRADIO UI │
│ ┌────────────┬─────────────┬─────────────┬────────────┬──────────────┐ │
│ │ Text Gen │ Semantic │ Image Gen │ Evaluation │ Settings │ │
│ │ │ Search │ │ │ │ │
│ │ • Chat │ • Index │ • txt2img │ • BLEU │ • Model │ │
│ │ • Complete │ • Query │ • img2img │ • ROUGE │ • Device │ │
│ │ • Params │ • Upload │ • Inpaint │ • Compare │ • API Key │ │
│ └─────┬──────┴──────┬──────┴──────┬──────┴─────┬──────┴──────────────┘ │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ MODEL MANAGER │ │
│ │ ┌──────────────┐ ┌──────────┐ ┌────────────────────┐ │ │
│ │ │ Local Models │ │ Cache │ │ HF Inference API │ │ │
│ │ │ (safetensors) │ │ (LRU) │ │ (serverless) │ │ │
│ │ └──────────────┘ └──────────┘ └────────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────────────┘

Project Structure
production-workbench/
├── app.py # Main Gradio application
├── src/
│ ├── __init__.py
│ ├── model_manager.py # Model loading, caching, and fallback
│ ├── text_gen.py # Text generation tab
│ ├── search.py # Semantic search tab
│ ├── image_gen.py # Image generation tab
│ ├── evaluation.py # Model evaluation tab
│ └── inference_api.py # HF Inference API client
├── requirements.txt
├── Dockerfile # For HF Spaces deployment
└── README.md

Implementation
Step 1: Dependencies
gradio>=4.31.0
transformers>=4.40.0
sentence-transformers>=3.0.0
diffusers>=0.28.0
evaluate>=0.4.0
safetensors>=0.4.0
huggingface_hub>=0.23.0
accelerate>=0.30.0
torch>=2.0.0
Pillow>=10.0.0
rouge-score>=0.1.2
faiss-cpu>=1.8.0
numpy>=1.26.0

Step 2: Model Manager
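The model manager below bounds its cache and evicts in insertion order. A stricter least-recently-used policy, which also reorders entries on cache hits, can be sketched with `OrderedDict`; the `LRUCache` name and `loader` callback here are illustrative, not part of the project code:

```python
from collections import OrderedDict


class LRUCache:
    """Minimal LRU sketch: reorder on access, evict the least recently used."""

    def __init__(self, max_items: int = 3):
        self._items: OrderedDict = OrderedDict()
        self._max = max_items

    def get(self, key, loader):
        if key in self._items:
            self._items.move_to_end(key)  # mark as most recently used
            return self._items[key]
        while len(self._items) >= self._max:
            self._items.popitem(last=False)  # drop the least recently used
        self._items[key] = loader()  # load only on a cache miss
        return self._items[key]
```

The same `get(key, loader)` shape drops straight into the manager's `get_text_model`-style methods, with the expensive `from_pretrained` call as the loader.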
"""Centralized model loading, caching, and resource management."""
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from sentence_transformers import SentenceTransformer
from typing import Any
class ModelManager:
"""
Manages model lifecycle: loading, caching, and resource cleanup.
Design decisions:
- Models are loaded lazily (only when first needed)
- LRU cache prevents reloading recently used models
- Falls back to HF Inference API when GPU memory is low
- Uses safetensors format for secure, fast loading
"""
def __init__(
self,
device: str | None = None,
max_cached_models: int = 3,
hf_token: str | None = None,
):
if device is None:
if torch.cuda.is_available():
self.device = "cuda"
elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
self.device = "mps"
else:
self.device = "cpu"
else:
self.device = device
self.hf_token = hf_token
self._cache: dict[str, Any] = {}
self._max_cache = max_cached_models
def get_text_model(
self,
model_name: str = "gpt2",
) -> tuple:
"""Load a text generation model."""
cache_key = f"text:{model_name}"
if cache_key not in self._cache:
self._evict_if_needed()
tokenizer = AutoTokenizer.from_pretrained(model_name)
            model = AutoModelForCausalLM.from_pretrained(
                model_name,
                # from_pretrained prefers *.safetensors weights when the repo has them
                torch_dtype=torch.float16 if self.device == "cuda" else torch.float32,
                device_map="auto" if self.device == "cuda" else None,
            )
if self.device != "cuda":
model = model.to(self.device)
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
self._cache[cache_key] = (model, tokenizer)
return self._cache[cache_key]
def get_embedding_model(
self,
model_name: str = "all-MiniLM-L6-v2",
) -> SentenceTransformer:
"""Load a sentence-transformers model."""
cache_key = f"embed:{model_name}"
if cache_key not in self._cache:
self._evict_if_needed()
model = SentenceTransformer(model_name, device=self.device)
self._cache[cache_key] = model
return self._cache[cache_key]
def get_pipeline(
self,
task: str,
model_name: str | None = None,
) -> Any:
"""Get a HuggingFace pipeline."""
cache_key = f"pipe:{task}:{model_name}"
if cache_key not in self._cache:
self._evict_if_needed()
pipe = pipeline(
task,
model=model_name,
device=self.device if self.device != "cuda" else 0,
)
self._cache[cache_key] = pipe
return self._cache[cache_key]
def _evict_if_needed(self):
"""Remove oldest model if cache is full."""
while len(self._cache) >= self._max_cache:
oldest_key = next(iter(self._cache))
del self._cache[oldest_key]
if self.device == "cuda":
torch.cuda.empty_cache()
def clear_cache(self):
"""Clear all cached models."""
self._cache.clear()
if self.device == "cuda":
torch.cuda.empty_cache()
@property
def status(self) -> dict:
"""Return current cache and device status."""
info = {
"device": self.device,
"cached_models": list(self._cache.keys()),
"cache_size": len(self._cache),
}
if self.device == "cuda":
info["gpu_memory_allocated"] = f"{torch.cuda.memory_allocated() / 1e9:.2f} GB"
            info["gpu_memory_total"] = f"{torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB"
        return info

Model Caching Strategy:
┌─────────────────────────────────────────────────────────────────┐
│ MODEL LIFECYCLE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Request: "Generate text with GPT-2" │
│ │ │
│ ▼ │
│ In cache? ──► Yes ──► Return cached model (instant) │
│ │ │
│ No │
│ │ │
│ ▼ │
│ Cache full? ──► Yes ──► Evict oldest model ──► Free GPU memory │
│ │ │
│ No │
│ │ │
│ ▼ │
│ Local weights available? ──► Yes ──► Load from safetensors │
│ │ │
│ No │
│ │ │
│ ▼ │
│ Download from Hub ──► Cache locally ──► Load to device │
│ │
│ Fallback: If GPU OOM, use HF Inference API (serverless) │
│ │
└─────────────────────────────────────────────────────────────────┘

Step 3: HF Inference API Client
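The fallback path in the caching diagram (local model first, serverless API on failure) is not wired into the manager above. Stripped of any model code, the pattern is just a wrapper; all names here are illustrative:

```python
def with_fallback(primary, fallback, errors=(RuntimeError, MemoryError)):
    """Return a callable that tries `primary` and falls back on failure.

    torch raises torch.cuda.OutOfMemoryError (a RuntimeError subclass) on OOM,
    so catching RuntimeError covers that case without importing torch here.
    """
    def run(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except errors:
            return fallback(*args, **kwargs)
    return run
```

In the workbench this would wrap local generation, with `InferenceAPIClient.generate_text` (shown below) as the fallback.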
"""HuggingFace Inference API for serverless model execution."""
from huggingface_hub import InferenceClient
class InferenceAPIClient:
"""
Client for HuggingFace's serverless Inference API.
Use when:
- No local GPU available
- Model is too large for local hardware
- Quick prototyping without downloads
- Production with auto-scaling
"""
def __init__(self, token: str | None = None):
self.client = InferenceClient(token=token)
def generate_text(
self,
prompt: str,
model: str = "meta-llama/Llama-3.1-8B-Instruct",
max_new_tokens: int = 256,
temperature: float = 0.7,
) -> str:
"""Generate text using the Inference API."""
response = self.client.text_generation(
prompt,
model=model,
max_new_tokens=max_new_tokens,
temperature=temperature,
)
return response
def compute_embeddings(
self,
texts: list[str],
model: str = "sentence-transformers/all-MiniLM-L6-v2",
) -> list[list[float]]:
"""Compute embeddings via the API."""
response = self.client.feature_extraction(
texts,
model=model,
)
return response
def generate_image(
self,
prompt: str,
model: str = "stabilityai/stable-diffusion-xl-base-1.0",
):
"""Generate an image via the API."""
image = self.client.text_to_image(
prompt,
model=model,
)
return image
def classify_text(
self,
text: str,
model: str = "distilbert-base-uncased-finetuned-sst-2-english",
) -> list[dict]:
"""Classify text via the API."""
        return self.client.text_classification(text, model=model)

Step 4: Text Generation Tab
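The tab below exposes temperature and top-p sliders. Top-p (nucleus) sampling keeps the smallest set of highest-probability tokens whose cumulative mass reaches p, then renormalizes. A minimal numpy sketch of the idea — illustrative only, not the transformers implementation:

```python
import numpy as np


def top_p_filter(probs: np.ndarray, top_p: float = 0.9) -> np.ndarray:
    """Zero out tokens outside the nucleus and renormalize the rest."""
    order = np.argsort(probs)[::-1]           # most probable token first
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, top_p) + 1  # smallest set with mass >= top_p
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()
```

With `probs = [0.5, 0.3, 0.15, 0.05]` and `top_p=0.9`, the nucleus is the first three tokens; the fourth is never sampled.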
"""Text generation tab for the workbench."""
import gradio as gr
from .model_manager import ModelManager
def create_text_gen_tab(manager: ModelManager) -> gr.Tab:
"""Create the text generation tab."""
def generate(
prompt: str,
model_name: str,
max_tokens: int,
temperature: float,
top_p: float,
) -> str:
        import torch

        model, tokenizer = manager.get_text_model(model_name)
        inputs = tokenizer(prompt, return_tensors="pt").to(manager.device)
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=max_tokens,
                temperature=temperature,
                top_p=top_p,
                do_sample=True,
                pad_token_id=tokenizer.pad_token_id,
            )
return tokenizer.decode(outputs[0], skip_special_tokens=True)
with gr.Tab("Text Generation") as tab:
gr.Markdown("## Text Generation")
with gr.Row():
with gr.Column(scale=2):
prompt = gr.Textbox(
label="Prompt",
placeholder="Enter your prompt...",
lines=4,
)
output = gr.Textbox(label="Output", lines=10)
with gr.Column(scale=1):
model_name = gr.Dropdown(
choices=["gpt2", "gpt2-medium", "gpt2-large"],
value="gpt2",
label="Model",
)
max_tokens = gr.Slider(10, 500, value=100, label="Max Tokens")
temperature = gr.Slider(0.1, 2.0, value=0.7, label="Temperature")
top_p = gr.Slider(0.1, 1.0, value=0.9, label="Top-p")
generate_btn = gr.Button("Generate", variant="primary")
generate_btn.click(
generate,
inputs=[prompt, model_name, max_tokens, temperature, top_p],
outputs=output,
)
    return tab

Step 5: Semantic Search Tab
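The search tab uses a FAISS inner-product index (`IndexFlatIP`) over L2-normalized vectors, which makes the returned scores cosine similarities. The equivalence in plain numpy, with toy 2-D vectors standing in for real embeddings:

```python
import numpy as np

docs = np.array([[1.0, 2.0], [2.0, 1.0]], dtype="float32")
query = np.array([[1.0, 2.0]], dtype="float32")

# L2-normalize rows (what faiss.normalize_L2 does in place)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query /= np.linalg.norm(query, axis=1, keepdims=True)

# Inner product of unit vectors == cosine similarity
scores = query @ docs.T
```

The query matches itself with score 1.0 and the other document with 0.8, exactly the numbers `IndexFlatIP` would return after `normalize_L2`.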
"""Semantic search tab for the workbench."""
import gradio as gr
import numpy as np
import faiss
from .model_manager import ModelManager
class SearchState:
"""Manages the search index state."""
def __init__(self):
self.index = None
self.documents = []
self.dimension = None
    def build_index(self, embeddings: np.ndarray, documents: list[str]):
        # FAISS requires contiguous float32; convert before normalizing
        embeddings = np.ascontiguousarray(embeddings, dtype="float32")
        self.dimension = embeddings.shape[1]
        self.index = faiss.IndexFlatIP(self.dimension)
        # Normalize in place so inner-product scores are cosine similarities
        faiss.normalize_L2(embeddings)
        self.index.add(embeddings)
        self.documents = documents
    def search(self, query_embedding: np.ndarray, k: int = 5) -> list[tuple]:
        if self.index is None:
            return []
        query_embedding = np.ascontiguousarray(query_embedding, dtype="float32")
        faiss.normalize_L2(query_embedding)
        scores, indices = self.index.search(query_embedding, k)
return [
(self.documents[idx], float(score))
for score, idx in zip(scores[0], indices[0])
if idx != -1
]
def create_search_tab(manager: ModelManager) -> gr.Tab:
"""Create the semantic search tab."""
state = SearchState()
def index_documents(text: str, model_name: str) -> str:
documents = [line.strip() for line in text.split("\n") if line.strip()]
if not documents:
return "No documents to index."
embed_model = manager.get_embedding_model(model_name)
embeddings = embed_model.encode(documents)
state.build_index(embeddings, documents)
return f"Indexed {len(documents)} documents with {model_name}"
def search(query: str, model_name: str, k: int) -> str:
if state.index is None:
return "Please index documents first."
embed_model = manager.get_embedding_model(model_name)
query_emb = embed_model.encode([query])
results = state.search(query_emb, k=k)
output = ""
for i, (doc, score) in enumerate(results, 1):
output += f"**{i}. (score: {score:.4f})**\n{doc}\n\n"
return output or "No results found."
with gr.Tab("Semantic Search") as tab:
gr.Markdown("## Semantic Search")
model_name = gr.Dropdown(
choices=["all-MiniLM-L6-v2", "all-mpnet-base-v2"],
value="all-MiniLM-L6-v2",
label="Embedding Model",
)
with gr.Row():
with gr.Column():
docs_input = gr.Textbox(
label="Documents (one per line)",
lines=10,
placeholder="Enter documents to index...",
)
index_btn = gr.Button("Index Documents")
index_status = gr.Textbox(label="Status")
with gr.Column():
query_input = gr.Textbox(label="Search Query")
k_slider = gr.Slider(1, 20, value=5, step=1, label="Results")
search_btn = gr.Button("Search", variant="primary")
results_output = gr.Markdown(label="Results")
index_btn.click(index_documents, [docs_input, model_name], index_status)
search_btn.click(search, [query_input, model_name, k_slider], results_output)
    return tab

Step 6: Evaluation Tab
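BLEU is built from clipped n-gram precision; the unigram case shows why clipping matters (a prediction cannot be rewarded for repeating a reference word). A toy sketch of the building block, not the `evaluate` implementation:

```python
from collections import Counter


def clipped_unigram_precision(prediction: str, reference: str) -> float:
    """Count each predicted word at most as often as it appears in the reference."""
    pred = prediction.split()
    ref_counts = Counter(reference.split())
    clipped = sum(min(count, ref_counts[word])
                  for word, count in Counter(pred).items())
    return clipped / len(pred)
```

`"the the the"` against `"the cat"` scores 1/3, not 1.0: the repeated "the" is clipped to the single occurrence in the reference.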
"""Model evaluation tab for the workbench."""
import gradio as gr
import evaluate
def create_evaluation_tab() -> gr.Tab:
"""Create the model evaluation tab."""
def compute_metrics(
predictions_text: str,
references_text: str,
metrics: list[str],
) -> str:
predictions = [line.strip() for line in predictions_text.split("\n") if line.strip()]
references = [line.strip() for line in references_text.split("\n") if line.strip()]
if len(predictions) != len(references):
return f"Mismatch: {len(predictions)} predictions vs {len(references)} references"
results = {}
if "ROUGE" in metrics:
rouge = evaluate.load("rouge")
results.update(rouge.compute(predictions=predictions, references=references))
if "BLEU" in metrics:
bleu = evaluate.load("bleu")
refs = [[r] for r in references]
bleu_result = bleu.compute(predictions=predictions, references=refs)
results["bleu"] = bleu_result["bleu"]
output = "## Evaluation Results\n\n"
output += "| Metric | Score |\n|--------|-------|\n"
for metric, score in results.items():
if isinstance(score, float):
output += f"| {metric} | {score:.4f} |\n"
return output
with gr.Tab("Evaluation") as tab:
gr.Markdown("## Model Evaluation")
metrics_select = gr.CheckboxGroup(
choices=["ROUGE", "BLEU"],
value=["ROUGE"],
label="Metrics",
)
with gr.Row():
preds_input = gr.Textbox(
label="Predictions (one per line)",
lines=8,
)
refs_input = gr.Textbox(
label="References (one per line)",
lines=8,
)
eval_btn = gr.Button("Evaluate", variant="primary")
results_output = gr.Markdown()
eval_btn.click(
compute_metrics,
[preds_input, refs_input, metrics_select],
results_output,
)
    return tab

Step 7: Main Application
"""Main Gradio application — Production AI Workbench."""
import gradio as gr
from src.model_manager import ModelManager
from src.text_gen import create_text_gen_tab
from src.search import create_search_tab
from src.evaluation import create_evaluation_tab
def create_app() -> gr.Blocks:
"""Create the complete Gradio application."""
manager = ModelManager()
with gr.Blocks(
title="AI Workbench",
theme=gr.themes.Soft(),
) as app:
gr.Markdown(
"# AI Workbench\n"
"A unified interface for text generation, semantic search, "
"and model evaluation powered by the HuggingFace ecosystem."
)
create_text_gen_tab(manager)
create_search_tab(manager)
create_evaluation_tab()
with gr.Tab("Settings"):
gr.Markdown("## System Status")
status_output = gr.JSON(label="Status")
refresh_btn = gr.Button("Refresh Status")
refresh_btn.click(lambda: manager.status, outputs=status_output)
            clear_result = gr.Textbox(label="Result")
            clear_btn = gr.Button("Clear Model Cache", variant="stop")

            def clear_cache() -> str:
                manager.clear_cache()
                return "Cache cleared"

            clear_btn.click(clear_cache, outputs=clear_result)
return app
if __name__ == "__main__":
app = create_app()
app.launch(
server_name="0.0.0.0",
server_port=7860,
share=False,
    )

Step 8: HuggingFace Spaces Deployment
FROM python:3.11-slim
WORKDIR /app
# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
&& rm -rf /var/lib/apt/lists/*
# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy application code
COPY . .
# Expose Gradio port
EXPOSE 7860
# Run the app
CMD ["python", "app.py"]

Deploying to HuggingFace Spaces:
# 1. Create a Space on huggingface.co/new-space
# Select "Gradio" as the SDK
# 2. Clone the Space repo
git clone https://huggingface.co/spaces/YOUR_USERNAME/ai-workbench
# 3. Copy your project files
cp -r production-workbench/* ai-workbench/
# 4. Push to deploy
cd ai-workbench
git add .
git commit -m "Initial deployment"
git push
# The Space will build and deploy automatically.
# Free tier: 2 vCPU, 16GB RAM (CPU-only)
# Upgraded: T4 GPU ($0.60/hr) or A10G ($1.05/hr)

Deployment Architecture:
┌─────────────────────────────────────────────────────────────────┐
│ HUGGINGFACE SPACES DEPLOYMENT │
├─────────────────────────────────────────────────────────────────┤
│ │
│ your-username/ai-workbench.hf.space │
│ │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ HF Spaces Infrastructure │ │
│ │ │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────────────────┐ │ │
│ │ │ Docker │───▶│ Gradio │───▶│ Public URL │ │ │
│ │ │ Build │ │ Server │ │ (HTTPS, auto-SSL) │ │ │
│ │ └──────────┘ └──────────┘ └──────────────────────┘ │ │
│ │ │ │
│ │ Features: │ │
│ │ • Auto-build from Dockerfile or requirements.txt │ │
│ │ • Git-based deployment (push to deploy) │ │
│ │ • Sleep after inactivity (free tier) │ │
│ │ • Persistent storage (optional, paid) │ │
│ │ • GPU upgrade available │ │
│ │ • Custom domain support │ │
│ └────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘

Running the Project
# Install dependencies
pip install -r requirements.txt
# Run locally
python app.py
# Open http://localhost:7860
# Run with GPU
CUDA_VISIBLE_DEVICES=0 python app.py
# Deploy to Spaces
# Follow the deployment steps above

Key Concepts Recap
| Concept | What It Is | Why It Matters |
|---|---|---|
| Gradio Blocks | Flexible UI framework for ML apps | Build complex multi-tab interfaces |
| Model Manager | Centralized model loading with LRU cache | Prevents OOM, shares models across tabs |
| HF Inference API | Serverless model execution | Run large models without local GPU |
| safetensors | Secure tensor file format | 2-5x faster loading, no code execution risk |
| HuggingFace Spaces | Free ML app hosting | Deploy Gradio apps with one git push |
| FAISS | Similarity search library | Sub-millisecond search for the search tab |
| evaluate | Standardized metrics library | Consistent evaluation across models |
Next Steps
Congratulations on completing the HuggingFace Ecosystem category! Consider:
- Deep Learning — Understand the foundations under these abstractions
- RAG Projects — Apply embeddings and search to RAG pipelines
- AI Agents — Build autonomous agents using HuggingFace models