HuggingFace Ecosystem · Intermediate
Text Embeddings & Semantic Search
Build a semantic search engine with sentence-transformers and FAISS
TL;DR
Build a semantic search engine using sentence-transformers for embedding generation and FAISS for fast similarity search. Learn model selection from the MTEB leaderboard, embedding fine-tuning, and evaluation with NDCG/MAP/MRR metrics.
What You'll Learn
- Generating embeddings with sentence-transformers
- Model selection from the MTEB leaderboard
- FAISS index types and configuration
- Fine-tuning embeddings for your domain
- Evaluation metrics (NDCG, MAP, MRR)
- FastAPI search API with batched inference
Tech Stack
| Component | Technology |
|---|---|
| Embeddings | sentence-transformers |
| Vector Index | faiss-cpu |
| Evaluation | MTEB metrics |
| API | FastAPI |
| Python | 3.10+ |
Architecture
┌──────────────────────────────────────────────────────────────────────────────┐
│ SEMANTIC SEARCH ENGINE │
├──────────────────────────────────────────────────────────────────────────────┤
│ │
│ INDEXING PIPELINE │
│ ┌───────────┐ ┌──────────────────┐ ┌────────────────┐ │
│ │ Documents │──▶│ sentence- │──▶│ FAISS Index │ │
│ │ (corpus) │ │ transformers │ │ (IVF + PQ) │ │
│ └───────────┘ │ .encode() │ └────────────────┘ │
│ └──────────────────┘ │
│ │
│ QUERY PIPELINE │
│ ┌───────────┐ ┌──────────────────┐ ┌────────────────┐ ┌──────────┐ │
│ │ Query │──▶│ sentence- │──▶│ FAISS Search │──▶│ Results │ │
│ │ │ │ transformers │ │ (k nearest) │ │ + Scores │ │
│ └───────────┘ │ .encode() │ └────────────────┘ └──────────┘ │
│ └──────────────────┘ │
│ │
│ FINE-TUNING PIPELINE │
│ ┌───────────────┐ ┌─────────────────┐ ┌─────────────────────────────┐ │
│ │ Training Pairs │──▶│ Contrastive │──▶│ Domain-adapted Embeddings │ │
│ │ (query, pos, │ │ Loss Function │ │ (better for your data) │ │
│ │ neg) │ └─────────────────┘ └─────────────────────────────┘ │
│ └───────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────────────┘

Project Structure
embeddings-search/
├── src/
│ ├── __init__.py
│ ├── embeddings.py # Embedding generation and model loading
│ ├── index.py # FAISS index management
│ ├── search.py # Search engine combining embeddings + index
│ ├── finetune.py # Fine-tune embeddings for your domain
│ └── evaluate.py # NDCG, MAP, MRR evaluation
├── api/
│ └── main.py # FastAPI search application
├── data/
│ └── corpus.jsonl
├── requirements.txt
└── README.md

Implementation
Step 1: Dependencies
sentence-transformers>=3.0.0
faiss-cpu>=1.8.0
transformers>=4.40.0
datasets>=2.19.0
fastapi>=0.111.0
uvicorn>=0.30.0
numpy>=1.26.0

Step 2: Embedding Generation
"""Embedding generation with sentence-transformers."""
from sentence_transformers import SentenceTransformer
import numpy as np
class EmbeddingModel:
"""
Wrapper around sentence-transformers for embedding generation.
sentence-transformers differs from raw transformers by:
1. Adding mean pooling over token embeddings by default
2. Normalizing embeddings to unit length (for cosine similarity)
3. Optimized batch encoding with progress bars
"""
# Top models from the MTEB leaderboard (as of 2025)
RECOMMENDED_MODELS = {
"fast": "all-MiniLM-L6-v2", # 384-dim, 80MB, fast
"balanced": "all-mpnet-base-v2", # 768-dim, 420MB, good quality
"quality": "BAAI/bge-large-en-v1.5", # 1024-dim, 1.3GB, best quality
"multilingual": "intfloat/multilingual-e5-large", # 1024-dim, multi-lang
}
def __init__(
self,
model_name: str = "all-MiniLM-L6-v2",
device: str | None = None,
):
self.model = SentenceTransformer(model_name, device=device)
self.dimension = self.model.get_sentence_embedding_dimension()
self.model_name = model_name
def encode(
self,
texts: list[str],
batch_size: int = 64,
normalize: bool = True,
show_progress: bool = True,
) -> np.ndarray:
"""
Encode texts into embeddings.
Args:
texts: List of texts to encode
batch_size: Encoding batch size
normalize: L2 normalize embeddings (required for cosine similarity)
show_progress: Show encoding progress bar
Returns:
numpy array of shape [len(texts), dimension]
"""
embeddings = self.model.encode(
texts,
batch_size=batch_size,
normalize_embeddings=normalize,
show_progress_bar=show_progress,
)
return embeddings
def similarity(
self,
texts_a: list[str],
texts_b: list[str],
) -> np.ndarray:
"""Compute pairwise cosine similarity between two text lists."""
emb_a = self.encode(texts_a, show_progress=False)
emb_b = self.encode(texts_b, show_progress=False)
return np.dot(emb_a, emb_b.T)Model Selection Guide:
| Model | Dimensions | Size | Speed | Quality | Use Case |
|---|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | 80MB | Fast | Good | Prototyping, low-resource |
| all-mpnet-base-v2 | 768 | 420MB | Medium | Better | Production general-purpose |
| BAAI/bge-large-en-v1.5 | 1024 | 1.3GB | Slow | Best | Quality-critical applications |
| intfloat/multilingual-e5-large | 1024 | 1.3GB | Slow | Best | Multi-language support |
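Whichever model you pick, keep `normalize=True`: for unit-length vectors the inner product equals cosine similarity, which is what lets the inner-product FAISS index below double as a cosine index. A quick numpy check of that identity (random vectors standing in for embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=384), rng.normal(size=384)  # stand-ins for two raw embeddings

# Cosine similarity of the raw vectors
cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# After L2 normalization, a plain inner product gives the same number
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
assert np.isclose(a_unit @ b_unit, cos)
```

This is also why the index must be rebuilt if you switch between normalized and unnormalized embeddings: the scores stop being comparable.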
Step 3: FAISS Index
"""FAISS index management for fast similarity search."""
import faiss
import numpy as np
from pathlib import Path
class FAISSIndex:
"""
FAISS index for fast nearest-neighbor search.
Index types:
- Flat: Exact search (brute-force). Best for <100K vectors.
- IVF: Inverted file index. Partitions space into clusters.
- HNSW: Hierarchical navigable small world graph. Fast, good recall.
- PQ: Product quantization. Compresses vectors for memory savings.
"""
def __init__(self, dimension: int, index_type: str = "flat"):
self.dimension = dimension
self.index_type = index_type
self.index = self._create_index(dimension, index_type)
self.documents: list[dict] = []
def _create_index(self, dim: int, index_type: str) -> faiss.Index:
"""Create a FAISS index of the specified type."""
if index_type == "flat":
# Exact search — best quality, O(n) per query
return faiss.IndexFlatIP(dim) # Inner product (cosine for normalized vecs)
elif index_type == "ivf":
# Approximate search — partition into 100 clusters
nlist = 100
quantizer = faiss.IndexFlatIP(dim)
index = faiss.IndexIVFFlat(quantizer, dim, nlist, faiss.METRIC_INNER_PRODUCT)
return index
elif index_type == "hnsw":
# Graph-based approximate search
index = faiss.IndexHNSWFlat(dim, 32) # 32 neighbors per node
index.hnsw.efConstruction = 200 # Build-time quality
index.hnsw.efSearch = 64 # Search-time quality
return index
elif index_type == "ivf_pq":
# Memory-efficient approximate search
nlist = 100
m = 8 # Number of sub-quantizers
quantizer = faiss.IndexFlatIP(dim)
index = faiss.IndexIVFPQ(
quantizer, dim, nlist, m, 8, faiss.METRIC_INNER_PRODUCT,
)
return index
else:
raise ValueError(f"Unknown index type: {index_type}")
def add(
self,
embeddings: np.ndarray,
documents: list[dict],
):
"""Add embeddings and their associated documents to the index."""
if self.index_type in ("ivf", "ivf_pq") and not self.index.is_trained:
self.index.train(embeddings)
self.index.add(embeddings.astype("float32"))
self.documents.extend(documents)
def search(
self,
query_embedding: np.ndarray,
k: int = 10,
) -> list[dict]:
"""Search for k nearest neighbors."""
if query_embedding.ndim == 1:
query_embedding = query_embedding.reshape(1, -1)
scores, indices = self.index.search(query_embedding.astype("float32"), k)
results = []
for score, idx in zip(scores[0], indices[0]):
if idx == -1: # FAISS returns -1 for missing results
continue
result = {**self.documents[idx], "score": float(score)}
results.append(result)
return results
def save(self, path: str):
"""Save index to disk."""
faiss.write_index(self.index, f"{path}.index")
import json
with open(f"{path}.meta.json", "w") as f:
json.dump(self.documents, f)
def load(self, path: str):
"""Load index from disk."""
self.index = faiss.read_index(f"{path}.index")
import json
with open(f"{path}.meta.json") as f:
self.documents = json.load(f)FAISS Index Type Comparison:
┌─────────────────────────────────────────────────────────────────┐
│ FAISS INDEX TYPES │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Flat (exact) IVF (clusters) HNSW (graph) │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ • • • • • • │ │ C1: • • • │ │ •─•─• │ │
│ │ • • • • • • │ │ C2: • • │ │ │╲│╲│ │ │
│ │ • • • • • • │ │ C3: • • • •│ │ •─•─• │ │
│ │ • • • • • • │ │ C4: • • │ │ ╲│╱ │ │
│ └─────────────┘ └─────────────┘ │ • │ │
│ Compare to ALL Search only └─────────────┘ │
│ 100% recall nearby clusters Navigate graph │
│ Slow for >100K Fast, tunable Fast, high recall │
│ │
│ Speed: Flat < IVF < HNSW │
│ Recall: Flat > HNSW > IVF │
│ Memory: Flat = HNSW > IVF_PQ │
│ │
└─────────────────────────────────────────────────────────────────┘

Step 4: Search Engine
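Before composing embeddings and index into a `SearchEngine`, it helps to see what the flat index actually computes: exhaustive inner products followed by a top-k selection. A numpy equivalent using random unit vectors (illustrative only, not the FAISS API):

```python
import numpy as np

rng = np.random.default_rng(42)
corpus = rng.normal(size=(1000, 64)).astype("float32")
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)  # unit length, as encode() would produce

# A query vector deliberately close to document 123
query = corpus[123] + 0.01 * rng.normal(size=64)
query /= np.linalg.norm(query)

scores = corpus @ query            # inner products == cosine similarities here
top5 = np.argsort(-scores)[:5]     # indices of the 5 nearest documents
assert top5[0] == 123
```

`IndexFlatIP` does exactly this, just with optimized SIMD kernels; the approximate index types trade some of this exactness for speed by searching only a subset of the corpus.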
"""Search engine combining embeddings and FAISS index."""
from .embeddings import EmbeddingModel
from .index import FAISSIndex
class SearchEngine:
"""Semantic search engine with indexing and querying."""
def __init__(
self,
model_name: str = "all-MiniLM-L6-v2",
index_type: str = "flat",
):
self.embed_model = EmbeddingModel(model_name)
self.index = FAISSIndex(
dimension=self.embed_model.dimension,
index_type=index_type,
)
def index_documents(self, documents: list[dict], text_field: str = "text"):
"""Index a list of documents."""
texts = [doc[text_field] for doc in documents]
embeddings = self.embed_model.encode(texts)
self.index.add(embeddings, documents)
print(f"Indexed {len(documents)} documents")
def search(self, query: str, k: int = 10) -> list[dict]:
"""Search for documents similar to the query."""
query_emb = self.embed_model.encode([query], show_progress=False)
return self.index.search(query_emb[0], k=k)
def batch_search(
self,
queries: list[str],
k: int = 10,
) -> list[list[dict]]:
"""Search for multiple queries at once."""
query_embs = self.embed_model.encode(queries, show_progress=False)
return [
self.index.search(emb, k=k)
for emb in query_embs
]Step 5: Fine-tuning Embeddings
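Both fine-tuning helpers below consume plain Python tuples. A toy dataset showing the two formats (made-up strings, purely illustrative — real supervision would come from click logs or labeled query-document judgments):

```python
# (text_a, text_b, similarity_score) pairs for CosineSimilarityLoss
train_pairs = [
    ("how to reset my password", "Visit settings and choose 'Reset password'.", 0.9),
    ("how to reset my password", "Our offices are closed on public holidays.", 0.1),
]

# (anchor, positive, negative) triplets for TripletLoss
train_triplets = [
    (
        "how to reset my password",                     # anchor (the query)
        "Visit settings and choose 'Reset password'.",  # positive (relevant doc)
        "Our offices are closed on public holidays.",   # negative (irrelevant doc)
    ),
]
```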
"""Fine-tune sentence-transformers for domain-specific search."""
from sentence_transformers import (
SentenceTransformer,
InputExample,
losses,
evaluation,
)
from torch.utils.data import DataLoader
def finetune_with_pairs(
model_name: str = "all-MiniLM-L6-v2",
train_pairs: list[tuple[str, str, float]],
val_pairs: list[tuple[str, str, float]] | None = None,
epochs: int = 3,
batch_size: int = 16,
output_path: str = "models/finetuned-embeddings",
):
"""
Fine-tune an embedding model with query-document pairs.
Args:
model_name: Base model to fine-tune
train_pairs: List of (text_a, text_b, similarity_score) tuples
val_pairs: Validation pairs for evaluation
epochs: Training epochs
batch_size: Training batch size
output_path: Where to save the fine-tuned model
"""
model = SentenceTransformer(model_name)
# Convert to InputExamples
train_examples = [
InputExample(texts=[a, b], label=score)
for a, b, score in train_pairs
]
train_loader = DataLoader(
train_examples,
shuffle=True,
batch_size=batch_size,
)
# CosineSimilarityLoss: learns to match predicted cosine similarity to labels
train_loss = losses.CosineSimilarityLoss(model)
# Optional: evaluation during training
evaluator = None
if val_pairs:
sentences1 = [p[0] for p in val_pairs]
sentences2 = [p[1] for p in val_pairs]
scores = [p[2] for p in val_pairs]
evaluator = evaluation.EmbeddingSimilarityEvaluator(
sentences1, sentences2, scores,
)
# Train
model.fit(
train_objectives=[(train_loader, train_loss)],
epochs=epochs,
evaluator=evaluator,
evaluation_steps=100,
output_path=output_path,
show_progress_bar=True,
)
return model
def finetune_with_triplets(
model_name: str = "all-MiniLM-L6-v2",
triplets: list[tuple[str, str, str]],
epochs: int = 3,
output_path: str = "models/finetuned-triplet",
):
"""
Fine-tune with triplet loss: (anchor, positive, negative).
The model learns to push anchor closer to positive
and farther from negative.
"""
model = SentenceTransformer(model_name)
train_examples = [
InputExample(texts=[anchor, pos, neg])
for anchor, pos, neg in triplets
]
train_loader = DataLoader(
train_examples,
shuffle=True,
batch_size=16,
)
# TripletLoss: distance(anchor, positive) < distance(anchor, negative) + margin
train_loss = losses.TripletLoss(model)
model.fit(
train_objectives=[(train_loader, train_loss)],
epochs=epochs,
output_path=output_path,
)
return modelFine-tuning Loss Functions:
| Loss | Input Format | What It Learns |
|---|---|---|
| CosineSimilarityLoss | (text_a, text_b, score) | Match cosine similarity to target score |
| TripletLoss | (anchor, positive, negative) | Push positive close, negative far |
| MultipleNegativesRankingLoss | (query, positive) | Treats other batch items as negatives |
| ContrastiveLoss | (text_a, text_b, label) | Binary: similar (1) or dissimilar (0) |
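MultipleNegativesRankingLoss deserves special mention because it needs no negative labels: within each batch of (query, positive) pairs, every other positive serves as a negative, and each row of the in-batch similarity matrix becomes a softmax classification problem whose correct answer is the diagonal. A numpy sketch of that objective (random vectors standing in for embeddings; the library additionally scales similarities before the softmax):

```python
import numpy as np

rng = np.random.default_rng(1)
queries = rng.normal(size=(4, 32))
positives = queries + 0.05 * rng.normal(size=(4, 32))  # row i is the match for query i
queries /= np.linalg.norm(queries, axis=1, keepdims=True)
positives /= np.linalg.norm(positives, axis=1, keepdims=True)

# In-batch similarity matrix: diagonal = true pairs, off-diagonal = implicit negatives
sim = queries @ positives.T        # shape [4, 4]
assert sim.argmax(axis=1).tolist() == [0, 1, 2, 3]

# Cross-entropy over each row — this is what the loss drives toward zero
probs = np.exp(sim) / np.exp(sim).sum(axis=1, keepdims=True)
loss = -np.mean(np.log(np.diag(probs)))
```

Because the negatives come for free, this loss is the usual choice when you only have (query, relevant document) pairs, e.g. from search logs.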
Step 6: Evaluation
"""Evaluate search quality with standard IR metrics."""
import numpy as np
def compute_ndcg(relevance_scores: list[int], k: int = 10) -> float:
"""
Normalized Discounted Cumulative Gain (NDCG@k).
Measures ranking quality — rewards relevant results at higher positions.
"""
relevance = np.array(relevance_scores[:k])
dcg = np.sum(relevance / np.log2(np.arange(2, len(relevance) + 2)))
# Ideal DCG (perfect ranking)
ideal = np.sort(relevance)[::-1]
idcg = np.sum(ideal / np.log2(np.arange(2, len(ideal) + 2)))
return float(dcg / idcg) if idcg > 0 else 0.0
def compute_mrr(results_relevant: list[bool]) -> float:
"""
Mean Reciprocal Rank.
1/position of the first relevant result. Higher = better.
"""
for i, is_relevant in enumerate(results_relevant):
if is_relevant:
return 1.0 / (i + 1)
return 0.0
def compute_map(relevance_scores: list[bool], k: int = 10) -> float:
"""
Mean Average Precision (MAP@k).
Average of precision at each relevant result position.
"""
relevant = np.array(relevance_scores[:k])
if relevant.sum() == 0:
return 0.0
precisions = []
relevant_count = 0
for i, is_rel in enumerate(relevant):
if is_rel:
relevant_count += 1
precisions.append(relevant_count / (i + 1))
return float(np.mean(precisions))
def evaluate_search(
search_engine,
queries: list[str],
ground_truth: list[list[str]],
k: int = 10,
) -> dict:
"""
Evaluate a search engine against ground truth.
Args:
search_engine: SearchEngine instance
queries: List of query strings
ground_truth: For each query, list of relevant document IDs
k: Number of results to evaluate
"""
ndcg_scores = []
mrr_scores = []
map_scores = []
for query, relevant_ids in zip(queries, ground_truth):
results = search_engine.search(query, k=k)
result_ids = [r.get("id", "") for r in results]
is_relevant = [rid in relevant_ids for rid in result_ids]
relevance = [1 if rel else 0 for rel in is_relevant]
ndcg_scores.append(compute_ndcg(relevance, k))
mrr_scores.append(compute_mrr(is_relevant))
map_scores.append(compute_map(is_relevant, k))
return {
f"NDCG@{k}": float(np.mean(ndcg_scores)),
f"MAP@{k}": float(np.mean(map_scores)),
"MRR": float(np.mean(mrr_scores)),
}Step 7: FastAPI Application
"""FastAPI semantic search application."""
from fastapi import FastAPI
from pydantic import BaseModel, Field
from src.search import SearchEngine
app = FastAPI(title="Semantic Search API")
engine = SearchEngine(model_name="all-MiniLM-L6-v2", index_type="flat")
class IndexRequest(BaseModel):
documents: list[dict]
text_field: str = "text"
class SearchRequest(BaseModel):
query: str
k: int = Field(default=10, ge=1, le=100)
@app.post("/index")
async def index_documents(req: IndexRequest):
engine.index_documents(req.documents, req.text_field)
return {"indexed": len(req.documents)}
@app.post("/search")
async def search(req: SearchRequest):
results = engine.search(req.query, k=req.k)
return {"results": results}
@app.get("/health")
async def health():
return {"status": "healthy", "model": engine.embed_model.model_name}Running the Project
# Install dependencies
pip install -r requirements.txt
# Start the API
uvicorn api.main:app --reload
# Index documents
curl -X POST http://localhost:8000/index \
-H "Content-Type: application/json" \
-d '{"documents": [
{"id": "1", "text": "Python is a programming language"},
{"id": "2", "text": "Machine learning uses statistical methods"},
{"id": "3", "text": "Neural networks are inspired by the brain"}
]}'
# Search
curl -X POST http://localhost:8000/search \
-H "Content-Type: application/json" \
-d '{"query": "deep learning algorithms", "k": 3}'

Key Concepts Recap
| Concept | What It Is | Why It Matters |
|---|---|---|
| sentence-transformers | Library for computing text embeddings | Produces semantic vectors optimized for similarity |
| FAISS | Facebook's similarity search library | Sub-millisecond search over millions of vectors |
| MTEB | Massive Text Embedding Benchmark | Standardized leaderboard for model selection |
| Cosine Similarity | Angle between two vectors | Standard metric for semantic similarity |
| NDCG | Normalized Discounted Cumulative Gain | Measures ranking quality with graded relevance |
| MRR | Mean Reciprocal Rank | How quickly you find the first relevant result |
| Triplet Loss | (anchor, positive, negative) training | Fine-tunes embeddings for domain-specific similarity |
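To make the ranking metrics concrete, here is a hand-worked single-query example for results judged [relevant, not relevant, relevant], mirroring the Step 6 formulas in pure Python:

```python
import math

relevance = [1, 0, 1]  # relevance of the top-3 results, in rank order

# DCG: each result's relevance discounted by log2(rank + 1)
dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevance))  # 1/1 + 0 + 1/2 = 1.5
ideal = sorted(relevance, reverse=True)                               # [1, 1, 0]
idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))     # 1 + 1/log2(3) ≈ 1.631
ndcg = dcg / idcg                                                     # ≈ 0.920

# Reciprocal rank: first relevant result sits at rank 1
rr = next(1.0 / (i + 1) for i, rel in enumerate(relevance) if rel)    # 1.0

# Average precision: precision at each relevant rank, averaged
precisions = []
hits = 0
for i, rel in enumerate(relevance):
    if rel:
        hits += 1
        precisions.append(hits / (i + 1))                             # [1.0, 2/3]
ap = sum(precisions) / len(precisions)                                # ≈ 0.833
```

NDCG stays near 1 here because the only misplacement is one irrelevant result at rank 2; swapping ranks 1 and 2 would drop both NDCG and AP noticeably, which is exactly the position sensitivity these metrics exist to capture.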
Next Steps
- Image Generation with Diffusers — Generate images with Stable Diffusion
- Fine-Tuning with PEFT — Fine-tune full models with LoRA