Deep Learning · Beginner
Embedding Model from Scratch
Create text embeddings with PyTorch
TL;DR
Build a text embedding model that converts text into dense vectors for semantic search. Learn pooling strategies (mean, CLS, attention), L2 normalization for cosine similarity, and ChromaDB integration for vector storage and retrieval.
What You'll Learn
- Word embeddings and tokenization
- Building embedding layers with nn.Embedding
- Mean pooling for sentence embeddings
- Cosine similarity for semantic search
- Integration with vector databases
Tech Stack
| Component | Technology |
|---|---|
| Framework | PyTorch |
| Tokenizer | HuggingFace tokenizers |
| Vector DB | ChromaDB |
| API | FastAPI |
Architecture
┌──────────────────────────────────────────────────────────────────────────────┐
│ TEXT EMBEDDING ARCHITECTURE │
├──────────────────────────────────────────────────────────────────────────────┤
│ │
│ ENCODING PIPELINE │
│ ┌────────────┐ ┌───────────┐ ┌───────────┐ ┌─────────────────────┐ │
│ │ Input Text │──▶│ Tokenizer │──▶│ Token IDs │──▶│ Embedding Layer │ │
│ └────────────┘ └───────────┘ └───────────┘ └──────────┬──────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Transformer Encoder Layers │ │
│ │ (Self-attention + Feed-forward) │ │
│ └──────────────────────┬──────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Pooling Layer (Mean / CLS / Attention) │ │
│ └──────────────────────┬──────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ L2 Normalize → Embedding Vector [1, output_dim] │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
├──────────────────────────────────────────────────────────────────────────────┤
│ │
│ SEMANTIC SEARCH │
│ ┌─────────┐ ┌────────────────┐ ┌──────────────────┐ ┌──────────┐ │
│ │ Query │───▶│ Encode Query │───▶│ Cosine Similarity │───▶│ Top-K │ │
│ └─────────┘ └────────────────┘ │ with Doc Embeds │ │ Results │ │
│ └──────────────────┘ └──────────┘ │
│ ▲ │
│ ┌───────────┐ │ │
│ │ Documents │─────────▶│ (Pre-computed embeddings in ChromaDB) │
│ └───────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────────────┘

Project Structure
embedding-model/
├── src/
│ ├── __init__.py
│ ├── model.py # Embedding model
│ ├── tokenizer.py # Tokenization
│ ├── pooling.py # Pooling strategies
│ ├── similarity.py # Similarity functions
│ └── index.py # Vector indexing
├── api/
│ └── main.py # FastAPI application
├── tests/
│ └── test_embeddings.py
├── requirements.txt
└── Dockerfile

Implementation
Step 1: Dependencies
torch>=2.0.0
transformers>=4.30.0
tokenizers>=0.13.0
chromadb>=0.4.0
fastapi>=0.100.0
uvicorn>=0.23.0
numpy>=1.24.0

Step 2: Tokenizer Wrapper
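Before the wrapper, a quick motivation for subword tokenization. The toy word-level encoder below (an illustration with a made-up five-entry vocabulary, not part of the project code) maps any unseen word to `<UNK>`, losing its meaning entirely; subword tokenizers such as WordPiece split unknown words into known pieces instead:

```python
# Toy illustration: a word-level vocabulary has no graceful fallback
# for words it has never seen -- they all collapse to <UNK>.
vocab = {"<PAD>": 0, "<UNK>": 1, "the": 2, "cat": 3, "sat": 4}

def word_encode(text: str) -> list[int]:
    """Map each lowercase word to its ID, falling back to <UNK> (1)."""
    return [vocab.get(w, 1) for w in text.lower().split()]

print(word_encode("The cat sat"))          # [2, 3, 4]
print(word_encode("The caterpillar sat"))  # [2, 1, 4] -- "caterpillar" is lost
```

This is the failure mode the `EmbeddingTokenizer` below sidesteps by delegating to HuggingFace's subword tokenizers.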
"""Tokenizer for embedding model."""
from typing import List, Dict, Optional
from transformers import AutoTokenizer
import torch
class EmbeddingTokenizer:
"""
Tokenizer wrapper for embedding models.
Uses HuggingFace tokenizers for subword tokenization,
which handles out-of-vocabulary words gracefully.
"""
def __init__(
self,
model_name: str = "bert-base-uncased",
max_length: int = 128
):
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
self.max_length = max_length
self.vocab_size = self.tokenizer.vocab_size
self.pad_token_id = self.tokenizer.pad_token_id
def encode(
self,
texts: List[str],
return_tensors: bool = True
) -> Dict[str, torch.Tensor]:
"""
Encode texts to token IDs with attention masks.
Args:
texts: List of input texts
return_tensors: Whether to return PyTorch tensors
Returns:
Dictionary with input_ids and attention_mask
"""
encoded = self.tokenizer(
texts,
padding=True,
truncation=True,
max_length=self.max_length,
return_tensors="pt" if return_tensors else None
)
return {
"input_ids": encoded["input_ids"],
"attention_mask": encoded["attention_mask"]
}
def decode(self, token_ids: torch.Tensor) -> List[str]:
"""Decode token IDs back to text."""
return self.tokenizer.batch_decode(
token_ids,
skip_special_tokens=True
)
class SimpleTokenizer:
"""
Simple word-level tokenizer for learning purposes.
Demonstrates tokenization fundamentals without
external dependencies.
"""
def __init__(self, vocab_size: int = 10000):
self.vocab_size = vocab_size
self.word2idx: Dict[str, int] = {"<PAD>": 0, "<UNK>": 1}
self.idx2word: Dict[int, str] = {0: "<PAD>", 1: "<UNK>"}
def fit(self, texts: List[str]) -> None:
"""Build vocabulary from texts."""
from collections import Counter
# Count word frequencies
word_counts = Counter()
for text in texts:
words = text.lower().split()
word_counts.update(words)
# Add most common words to vocabulary
for idx, (word, _) in enumerate(
word_counts.most_common(self.vocab_size - 2),
start=2
):
self.word2idx[word] = idx
self.idx2word[idx] = word
def encode(
self,
texts: List[str],
max_length: int = 128
) -> Dict[str, torch.Tensor]:
"""Encode texts to padded sequences."""
batch_ids = []
batch_masks = []
for text in texts:
words = text.lower().split()
ids = [
self.word2idx.get(w, 1) # 1 = <UNK>
for w in words[:max_length]
]
# Pad or truncate
mask = [1] * len(ids)
padding_length = max_length - len(ids)
if padding_length > 0:
ids.extend([0] * padding_length)
mask.extend([0] * padding_length)
batch_ids.append(ids)
batch_masks.append(mask)
return {
"input_ids": torch.tensor(batch_ids, dtype=torch.long),
"attention_mask": torch.tensor(batch_masks, dtype=torch.long)
        }

Step 3: Pooling Strategies
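Before the pooling modules, a tiny numeric sketch (values invented for illustration) of why the attention mask matters: a naive mean averages over PAD rows and skews the result, while the masked mean uses only the real tokens:

```python
import torch

# One sequence of two real tokens plus one PAD row (all zeros).
token_embeddings = torch.tensor([[[1.0, 3.0],
                                  [3.0, 5.0],
                                  [0.0, 0.0]]])   # [1, 3, 2]
attention_mask = torch.tensor([[1, 1, 0]])        # [1, 3]

mask = attention_mask.unsqueeze(-1).float()       # [1, 3, 1]
masked_mean = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
naive_mean = token_embeddings.mean(dim=1)

print(masked_mean)  # tensor([[2., 4.]]) -- average of the two real tokens
print(naive_mean)   # dragged toward zero by the PAD row
```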
"""Pooling strategies for converting token embeddings to sentence embeddings."""
import torch
import torch.nn as nn
from typing import Optional
class MeanPooling(nn.Module):
"""
Mean pooling over token embeddings.
Takes the average of all token embeddings, weighted by
attention mask to ignore padding tokens.
"""
def forward(
self,
token_embeddings: torch.Tensor,
attention_mask: torch.Tensor
) -> torch.Tensor:
"""
Args:
token_embeddings: [batch_size, seq_len, hidden_dim]
attention_mask: [batch_size, seq_len]
Returns:
Pooled embeddings [batch_size, hidden_dim]
"""
# Expand mask to match embedding dimensions
mask_expanded = attention_mask.unsqueeze(-1).expand(
token_embeddings.size()
).float()
# Sum embeddings (only non-padding tokens)
sum_embeddings = torch.sum(
token_embeddings * mask_expanded, dim=1
)
# Count non-padding tokens
sum_mask = mask_expanded.sum(dim=1)
sum_mask = torch.clamp(sum_mask, min=1e-9) # Avoid division by zero
# Compute mean
return sum_embeddings / sum_mask
class CLSPooling(nn.Module):
"""
CLS token pooling.
Uses only the [CLS] token embedding as the sentence representation.
Common in BERT-style models.
"""
def forward(
self,
token_embeddings: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None
) -> torch.Tensor:
"""
Args:
token_embeddings: [batch_size, seq_len, hidden_dim]
Returns:
CLS embedding [batch_size, hidden_dim]
"""
return token_embeddings[:, 0, :]
class MaxPooling(nn.Module):
"""
Max pooling over token embeddings.
Takes the maximum value for each dimension across all tokens.
"""
def forward(
self,
token_embeddings: torch.Tensor,
attention_mask: torch.Tensor
) -> torch.Tensor:
"""
Args:
token_embeddings: [batch_size, seq_len, hidden_dim]
attention_mask: [batch_size, seq_len]
Returns:
Pooled embeddings [batch_size, hidden_dim]
"""
# Set padding tokens to large negative value
mask_expanded = attention_mask.unsqueeze(-1).expand(
token_embeddings.size()
)
token_embeddings = token_embeddings.masked_fill(
mask_expanded == 0, -1e9
)
# Max pool
return torch.max(token_embeddings, dim=1)[0]
class AttentionPooling(nn.Module):
"""
Attention-based pooling.
Learns to weight different tokens based on their importance.
"""
def __init__(self, hidden_dim: int):
super().__init__()
self.attention = nn.Sequential(
nn.Linear(hidden_dim, hidden_dim),
nn.Tanh(),
nn.Linear(hidden_dim, 1)
)
def forward(
self,
token_embeddings: torch.Tensor,
attention_mask: torch.Tensor
) -> torch.Tensor:
"""
Args:
token_embeddings: [batch_size, seq_len, hidden_dim]
attention_mask: [batch_size, seq_len]
Returns:
Pooled embeddings [batch_size, hidden_dim]
"""
# Compute attention weights
weights = self.attention(token_embeddings).squeeze(-1)
# Mask padding tokens
weights = weights.masked_fill(attention_mask == 0, -1e9)
weights = torch.softmax(weights, dim=1)
# Weighted sum
return torch.sum(
token_embeddings * weights.unsqueeze(-1), dim=1
        )

Understanding Pooling Strategies:
┌─────────────────────────────────────────────────────────────────────────────┐
│ POOLING: FROM TOKEN EMBEDDINGS TO SENTENCE EMBEDDING │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Input: Token embeddings [batch, seq_len, hidden_dim] │
│ Output: Single embedding [batch, hidden_dim] │
│ │
│ MEAN POOLING (Most Common): │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Tokens: [The] [cat] [sat] [PAD] [PAD] │ │
│ │ Mask: [ 1 ] [ 1 ] [ 1 ] [ 0 ] [ 0 ] │ │
│ │ │ │
│ │ Result = (emb[The] + emb[cat] + emb[sat]) / 3 │ │
│ │ ▲ Only non-PAD tokens contribute │ │
│ │ │ │
│ │ ✓ Works for any sequence length │ │
│ │ ✓ Uses information from all tokens │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ CLS POOLING (BERT-style): │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Tokens: [CLS] [The] [cat] [sat] [SEP] │ │
│ │ │ │ │
│ │ └──► Use ONLY this token's embedding │ │
│ │ │ │
│ │ ✓ Simple, fast │ │
│ │ ✗ Requires model trained with CLS pooling objective │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ ATTENTION POOLING (Learned): │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Tokens: [The] [cat] [sat] [on] [mat] │ │
│ │ Weights: 0.1 0.4 0.3 0.1 0.1 (learned, sum to 1) │ │
│ │ ▲ │ │
│ │ └── Model learns "cat" is most important │ │
│ │ │ │
│ │ Result = 0.1*emb[The] + 0.4*emb[cat] + 0.3*emb[sat] + ... │ │
│ │ │ │
│ │ ✓ Adaptive - focuses on important tokens │ │
│ │ ✗ Requires training the attention layer │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘

When to Use Each:
| Strategy | Best For | sentence-transformers Default |
|---|---|---|
| Mean | General text, any model | Yes (most models) |
| CLS | BERT-trained models | No |
| Attention | When you can train pooling | No |
| Max | Rare - when key info is sparse | No |
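The strategies can be compared on one toy batch. The sketch below re-implements them inline with plain tensor ops (values are made up; the logic mirrors the `MeanPooling`, `CLSPooling`, and `MaxPooling` modules above):

```python
import torch

# One toy sequence; pretend the last row is a PAD token.
emb = torch.tensor([[[1.0, 0.0],
                     [0.0, 2.0],
                     [9.0, 9.0]]])   # [1, 3, 2]
mask = torch.tensor([[1, 1, 0]])     # [1, 3]

m = mask.unsqueeze(-1).float()                                 # [1, 3, 1]
mean_pooled = (emb * m).sum(dim=1) / m.sum(dim=1)              # [[0.5, 1.0]]
cls_pooled = emb[:, 0, :]                                      # [[1.0, 0.0]]
max_pooled = emb.masked_fill(m == 0, -1e9).max(dim=1).values   # [[1.0, 2.0]]

# All three collapse [1, 3, 2] -> [1, 2], but produce different vectors.
print(mean_pooled, cls_pooled, max_pooled)
```

Note that max pooling only ignores the PAD row (with its large values of 9.0) because it is masked out first; without masking it would dominate the result.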
Step 4: Embedding Model
"""Custom embedding model."""
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Dict, Optional, List
from .pooling import MeanPooling, CLSPooling, AttentionPooling
class TextEmbeddingModel(nn.Module):
"""
Text embedding model that converts text to dense vectors.
Architecture:
- Embedding layer for token representations
- Transformer encoder for contextual understanding
- Pooling layer for sentence representation
- Optional projection head for dimension reduction
Args:
vocab_size: Size of vocabulary
embedding_dim: Dimension of token embeddings
hidden_dim: Hidden dimension of transformer
num_layers: Number of transformer layers
num_heads: Number of attention heads
output_dim: Final embedding dimension
pooling: Pooling strategy ('mean', 'cls', 'attention')
dropout: Dropout probability
"""
def __init__(
self,
vocab_size: int,
embedding_dim: int = 256,
hidden_dim: int = 512,
num_layers: int = 4,
num_heads: int = 8,
output_dim: int = 384,
pooling: str = "mean",
dropout: float = 0.1,
max_length: int = 128
):
super().__init__()
self.embedding_dim = embedding_dim
self.output_dim = output_dim
# Token embedding
self.token_embedding = nn.Embedding(
vocab_size,
embedding_dim,
padding_idx=0
)
# Positional embedding
self.position_embedding = nn.Embedding(max_length, embedding_dim)
# Transformer encoder
encoder_layer = nn.TransformerEncoderLayer(
d_model=embedding_dim,
nhead=num_heads,
dim_feedforward=hidden_dim,
dropout=dropout,
batch_first=True
)
self.transformer = nn.TransformerEncoder(
encoder_layer,
num_layers=num_layers
)
# Pooling layer
if pooling == "mean":
self.pooling = MeanPooling()
elif pooling == "cls":
self.pooling = CLSPooling()
elif pooling == "attention":
self.pooling = AttentionPooling(embedding_dim)
else:
raise ValueError(f"Unknown pooling: {pooling}")
# Projection head
self.projection = nn.Sequential(
nn.Linear(embedding_dim, hidden_dim),
nn.GELU(),
nn.Linear(hidden_dim, output_dim)
)
# Layer normalization
self.layer_norm = nn.LayerNorm(output_dim)
self._init_weights()
def _init_weights(self):
"""Initialize weights."""
for module in self.modules():
if isinstance(module, nn.Linear):
nn.init.xavier_uniform_(module.weight)
if module.bias is not None:
nn.init.zeros_(module.bias)
            elif isinstance(module, nn.Embedding):
                nn.init.normal_(module.weight, std=0.02)
                if module.padding_idx is not None:
                    # nn.init.normal_ overwrites the zero row that
                    # padding_idx created at construction; restore it
                    with torch.no_grad():
                        module.weight[module.padding_idx].fill_(0)
def forward(
self,
input_ids: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None
) -> torch.Tensor:
"""
Compute embeddings for input texts.
Args:
input_ids: Token IDs [batch_size, seq_len]
attention_mask: Attention mask [batch_size, seq_len]
Returns:
Normalized embeddings [batch_size, output_dim]
"""
batch_size, seq_len = input_ids.shape
if attention_mask is None:
attention_mask = (input_ids != 0).long()
# Get token embeddings
token_emb = self.token_embedding(input_ids)
# Add positional embeddings
positions = torch.arange(seq_len, device=input_ids.device)
positions = positions.unsqueeze(0).expand(batch_size, -1)
pos_emb = self.position_embedding(positions)
embeddings = token_emb + pos_emb
        # Build the key padding mask for the transformer:
        # shape [batch, seq], True at PAD positions so self-attention ignores them
        src_key_padding_mask = (attention_mask == 0)
# Apply transformer
hidden_states = self.transformer(
embeddings,
src_key_padding_mask=src_key_padding_mask
)
# Pool to sentence embedding
pooled = self.pooling(hidden_states, attention_mask)
# Project and normalize
projected = self.projection(pooled)
normalized = self.layer_norm(projected)
# L2 normalize for cosine similarity
normalized = F.normalize(normalized, p=2, dim=-1)
return normalized
@torch.no_grad()
def encode(
self,
texts: List[str],
tokenizer,
batch_size: int = 32
) -> torch.Tensor:
"""
Encode texts to embeddings.
Args:
texts: List of input texts
tokenizer: Tokenizer instance
batch_size: Batch size for encoding
Returns:
Embeddings [num_texts, output_dim]
"""
self.eval()
all_embeddings = []
for i in range(0, len(texts), batch_size):
batch_texts = texts[i:i + batch_size]
encoded = tokenizer.encode(batch_texts)
input_ids = encoded["input_ids"].to(
next(self.parameters()).device
)
attention_mask = encoded["attention_mask"].to(
next(self.parameters()).device
)
embeddings = self.forward(input_ids, attention_mask)
all_embeddings.append(embeddings.cpu())
        return torch.cat(all_embeddings, dim=0)

Step 5: Similarity Functions
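Since the model L2-normalizes its outputs, cosine similarity reduces to a plain dot product. A quick check with hand-picked 2-D vectors (chosen for illustration):

```python
import torch
import torch.nn.functional as F

# On unit vectors, cosine similarity IS the dot product.
a = F.normalize(torch.tensor([3.0, 4.0]), p=2, dim=-1)   # [0.6, 0.8]
b = F.normalize(torch.tensor([6.0, 8.0]), p=2, dim=-1)   # same direction as a
c = F.normalize(torch.tensor([4.0, -3.0]), p=2, dim=-1)  # orthogonal to a

print(torch.dot(a, b))  # ~1.0  (identical direction)
print(torch.dot(a, c))  # ~0.0  (unrelated)
```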
"""Similarity computation functions."""
import torch
import numpy as np
from typing import List, Tuple, Optional
def cosine_similarity(
query: torch.Tensor,
documents: torch.Tensor
) -> torch.Tensor:
"""
Compute cosine similarity between query and documents.
Args:
query: Query embedding [1, dim] or [dim]
documents: Document embeddings [num_docs, dim]
Returns:
Similarity scores [num_docs]
"""
if query.dim() == 1:
query = query.unsqueeze(0)
# Normalize (should already be normalized, but ensure)
query = torch.nn.functional.normalize(query, p=2, dim=-1)
documents = torch.nn.functional.normalize(documents, p=2, dim=-1)
# Compute similarity
similarity = torch.mm(query, documents.t()).squeeze(0)
return similarity
def euclidean_distance(
query: torch.Tensor,
documents: torch.Tensor
) -> torch.Tensor:
"""
Compute Euclidean distance between query and documents.
Args:
query: Query embedding [1, dim] or [dim]
documents: Document embeddings [num_docs, dim]
Returns:
Distances [num_docs] (lower is more similar)
"""
if query.dim() == 1:
query = query.unsqueeze(0)
# Compute pairwise distances
distances = torch.cdist(query, documents, p=2).squeeze(0)
return distances
def dot_product(
query: torch.Tensor,
documents: torch.Tensor
) -> torch.Tensor:
"""
Compute dot product similarity.
Args:
query: Query embedding [1, dim] or [dim]
documents: Document embeddings [num_docs, dim]
Returns:
Similarity scores [num_docs]
"""
if query.dim() == 1:
query = query.unsqueeze(0)
return torch.mm(query, documents.t()).squeeze(0)
class SemanticSearch:
"""
Semantic search using embedding similarity.
"""
def __init__(
self,
model,
tokenizer,
similarity_fn: str = "cosine"
):
self.model = model
self.tokenizer = tokenizer
self.similarity_fn = {
"cosine": cosine_similarity,
"euclidean": euclidean_distance,
"dot": dot_product
}[similarity_fn]
self.documents: List[str] = []
self.embeddings: Optional[torch.Tensor] = None
def index(self, documents: List[str]) -> None:
"""Index documents for search."""
self.documents = documents
self.embeddings = self.model.encode(documents, self.tokenizer)
print(f"Indexed {len(documents)} documents")
def search(
self,
query: str,
top_k: int = 5
) -> List[Tuple[str, float]]:
"""
Search for most similar documents.
Args:
query: Search query
top_k: Number of results to return
Returns:
List of (document, score) tuples
"""
if self.embeddings is None:
raise ValueError("No documents indexed")
# Encode query
query_embedding = self.model.encode([query], self.tokenizer)
# Compute similarities
scores = self.similarity_fn(query_embedding[0], self.embeddings)
# Get top-k
if self.similarity_fn == euclidean_distance:
# Lower is better for distance
top_indices = torch.argsort(scores)[:top_k]
else:
# Higher is better for similarity
top_indices = torch.argsort(scores, descending=True)[:top_k]
results = [
(self.documents[idx], scores[idx].item())
for idx in top_indices
]
        return results

Step 6: Vector Index with ChromaDB
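One detail worth verifying before the code: with `hnsw:space` set to `cosine`, ChromaDB reports a *distance* d = 1 − cos_sim, which is why the `search` method converts scores back with `1 - distance`. A toy check with made-up vectors:

```python
import torch
import torch.nn.functional as F

q = F.normalize(torch.tensor([1.0, 1.0]), p=2, dim=-1)
d = F.normalize(torch.tensor([1.0, 0.0]), p=2, dim=-1)

cos_sim = torch.dot(q, d)   # ~0.7071
cos_dist = 1 - cos_sim      # what ChromaDB's cosine space reports
recovered = 1 - cos_dist    # back to a similarity score
print(cos_sim, cos_dist, recovered)
```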
"""Vector indexing with ChromaDB."""
import chromadb
from chromadb.config import Settings
from typing import List, Dict, Optional
import torch
class VectorIndex:
"""
Vector index using ChromaDB for persistent storage.
"""
def __init__(
self,
model,
tokenizer,
collection_name: str = "embeddings",
persist_directory: Optional[str] = None
):
self.model = model
self.tokenizer = tokenizer
# Initialize ChromaDB
if persist_directory:
self.client = chromadb.PersistentClient(
path=persist_directory
)
else:
self.client = chromadb.Client()
# Get or create collection
self.collection = self.client.get_or_create_collection(
name=collection_name,
metadata={"hnsw:space": "cosine"}
)
def add(
self,
documents: List[str],
ids: Optional[List[str]] = None,
metadata: Optional[List[Dict]] = None
) -> None:
"""
Add documents to the index.
Args:
documents: List of document texts
ids: Optional document IDs
metadata: Optional metadata for each document
"""
# Generate embeddings
embeddings = self.model.encode(documents, self.tokenizer)
embeddings_list = embeddings.numpy().tolist()
# Generate IDs if not provided
if ids is None:
existing_count = self.collection.count()
ids = [f"doc_{existing_count + i}" for i in range(len(documents))]
# Add to collection
self.collection.add(
documents=documents,
embeddings=embeddings_list,
ids=ids,
metadatas=metadata
)
print(f"Added {len(documents)} documents to index")
def search(
self,
query: str,
top_k: int = 5,
filter: Optional[Dict] = None
) -> List[Dict]:
"""
Search for similar documents.
Args:
query: Search query
top_k: Number of results
filter: Optional metadata filter
Returns:
List of results with document, score, and metadata
"""
# Encode query
query_embedding = self.model.encode([query], self.tokenizer)
query_list = query_embedding.numpy().tolist()
# Search
results = self.collection.query(
query_embeddings=query_list,
n_results=top_k,
where=filter
)
# Format results
formatted = []
for i in range(len(results["documents"][0])):
formatted.append({
"document": results["documents"][0][i],
"id": results["ids"][0][i],
"score": 1 - results["distances"][0][i], # Convert distance to similarity
"metadata": results["metadatas"][0][i] if results["metadatas"] else None
})
return formatted
def delete(self, ids: List[str]) -> None:
"""Delete documents by ID."""
self.collection.delete(ids=ids)
def count(self) -> int:
"""Get number of documents in index."""
        return self.collection.count()

Step 7: FastAPI Application
"""FastAPI application for embedding service."""
import torch
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field
from typing import List, Optional, Dict
from src.model import TextEmbeddingModel
from src.tokenizer import EmbeddingTokenizer
from src.index import VectorIndex
class EmbedRequest(BaseModel):
texts: List[str] = Field(..., min_length=1, max_length=100)
class EmbedResponse(BaseModel):
embeddings: List[List[float]]
dimension: int
class SearchRequest(BaseModel):
query: str = Field(..., min_length=1)
top_k: int = Field(default=5, ge=1, le=100)
filter: Optional[Dict] = None
class SearchResult(BaseModel):
document: str
score: float
id: str
metadata: Optional[Dict] = None
class IndexRequest(BaseModel):
documents: List[str] = Field(..., min_length=1)
ids: Optional[List[str]] = None
metadata: Optional[List[Dict]] = None
# Global instances
model: TextEmbeddingModel = None
tokenizer: EmbeddingTokenizer = None
index: VectorIndex = None
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
app = FastAPI(
title="Embedding Service API",
description="Generate and search text embeddings"
)
@app.on_event("startup")
async def startup():
global model, tokenizer, index
# Initialize tokenizer
tokenizer = EmbeddingTokenizer()
# Initialize model
model = TextEmbeddingModel(
vocab_size=tokenizer.vocab_size,
embedding_dim=256,
output_dim=384
)
model.to(device)
model.eval()
# Initialize index
index = VectorIndex(
model=model,
tokenizer=tokenizer,
collection_name="documents",
persist_directory="./chroma_db"
)
print(f"Service started on {device}")
@app.get("/health")
async def health():
return {
"status": "healthy",
"device": str(device),
"documents_indexed": index.count() if index else 0
}
@app.post("/embed", response_model=EmbedResponse)
async def embed(request: EmbedRequest):
"""Generate embeddings for texts."""
embeddings = model.encode(request.texts, tokenizer)
return EmbedResponse(
embeddings=embeddings.tolist(),
dimension=model.output_dim
)
@app.post("/index")
async def index_documents(request: IndexRequest):
"""Add documents to the search index."""
index.add(
documents=request.documents,
ids=request.ids,
metadata=request.metadata
)
return {"indexed": len(request.documents), "total": index.count()}
@app.post("/search", response_model=List[SearchResult])
async def search(request: SearchRequest):
"""Search for similar documents."""
results = index.search(
query=request.query,
top_k=request.top_k,
filter=request.filter
)
return [SearchResult(**r) for r in results]
if __name__ == "__main__":
import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

Running the Project
# Install dependencies
pip install -r requirements.txt
# Run the API
uvicorn api.main:app --reload
# Generate embeddings
curl -X POST http://localhost:8000/embed \
-H "Content-Type: application/json" \
-d '{"texts": ["Hello world", "How are you?"]}'
# Index documents
curl -X POST http://localhost:8000/index \
-H "Content-Type: application/json" \
-d '{"documents": ["Python is great", "Machine learning is fun"]}'
# Search
curl -X POST http://localhost:8000/search \
-H "Content-Type: application/json" \
  -d '{"query": "programming languages", "top_k": 3}'

Key Concepts
Embeddings
Dense vector representations that capture semantic meaning:
# Similar texts have similar embeddings
embed("I love dogs") ≈ embed("I adore puppies")
embed("I love dogs") ≠ embed("The stock market crashed")

Mean Pooling
Averaging token embeddings (weighted by attention mask):
# Ignore padding tokens
mask_expanded = attention_mask.unsqueeze(-1)
sum_embeddings = (token_embeddings * mask_expanded).sum(dim=1)
mean_embedding = sum_embeddings / mask_expanded.sum(dim=1)

Cosine Similarity
Measures angle between vectors (independent of magnitude):
similarity = (A · B) / (||A|| × ||B||)
# Range: -1 (opposite) to 1 (identical)

L2 Normalization
Normalizing embeddings to unit length for cosine similarity:
normalized = embedding / embedding.norm(p=2, dim=-1, keepdim=True)

Key Concepts Recap
| Concept | What It Is | Why It Matters |
|---|---|---|
| Mean Pooling | Average token embeddings weighted by attention mask | Best general-purpose pooling, handles variable lengths |
| CLS Pooling | Use only the [CLS] token embedding | Common in BERT-style models, single-vector representation |
| Attention Pooling | Learned weights for each token position | Adaptive focus on important tokens |
| L2 Normalization | Scale embeddings to unit length | Required for cosine similarity, makes magnitudes comparable |
| Cosine Similarity | Angle between vectors (dot product of normalized) | Standard metric for semantic similarity, range [-1, 1] |
| Positional Embedding | Adds position information to tokens | Transformers need explicit position encoding |
| ChromaDB | Vector database with HNSW indexing | Fast approximate nearest neighbor search at scale |
| Subword Tokenization | Split words into smaller pieces (BPE/WordPiece) | Handles out-of-vocabulary words gracefully |
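One recap item that is easy to verify numerically: the large-negative masking used by both max and attention pooling drives PAD weights to zero after softmax while keeping the weights a valid distribution (toy scores invented for illustration):

```python
import torch

# Raw attention scores for two real tokens and one PAD position.
scores = torch.tensor([[2.0, 1.0, 0.5]])
mask = torch.tensor([[1, 1, 0]])

# Filling PAD positions with -1e9 makes exp(score) vanish in softmax.
weights = torch.softmax(scores.masked_fill(mask == 0, -1e9), dim=1)
print(weights)        # PAD position gets ~0 weight
print(weights.sum())  # still sums to 1 -- a valid distribution
```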
Next Steps
- Inference API - Serve models efficiently
- Custom Reranker - Train a reranker for RAG