Deep Learning · Beginner
Embedding Model from Scratch
Create text embeddings with PyTorch
TL;DR
Build a text embedding model that converts text into dense vectors for semantic search. Learn pooling strategies (mean, CLS, attention), L2 normalization for cosine similarity, and ChromaDB integration for vector storage and retrieval.
What You'll Learn
- Word embeddings and tokenization
- Building embedding layers with nn.Embedding
- Mean pooling for sentence embeddings
- Cosine similarity for semantic search
- Integration with vector databases
Tech Stack
| Component | Technology |
|---|---|
| Framework | PyTorch |
| Tokenizer | HuggingFace tokenizers |
| Vector DB | ChromaDB |
| API | FastAPI |
Architecture
┌──────────────────────────────────────────────────────────────────────────────┐
│ TEXT EMBEDDING ARCHITECTURE │
├──────────────────────────────────────────────────────────────────────────────┤
│ │
│ ENCODING PIPELINE │
│ ┌────────────┐ ┌───────────┐ ┌───────────┐ ┌─────────────────────┐ │
│ │ Input Text │──▶│ Tokenizer │──▶│ Token IDs │──▶│ Embedding Layer │ │
│ └────────────┘ └───────────┘ └───────────┘ └──────────┬──────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Transformer Encoder Layers │ │
│ │ (Self-attention + Feed-forward) │ │
│ └──────────────────────┬──────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Pooling Layer (Mean / CLS / Attention) │ │
│ └──────────────────────┬──────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ L2 Normalize → Embedding Vector [1, output_dim] │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
├──────────────────────────────────────────────────────────────────────────────┤
│ │
│ SEMANTIC SEARCH │
│ ┌─────────┐ ┌────────────────┐ ┌──────────────────┐ ┌──────────┐ │
│ │ Query │───▶│ Encode Query │───▶│ Cosine Similarity │───▶│ Top-K │ │
│ └─────────┘ └────────────────┘ │ with Doc Embeds │ │ Results │ │
│ └──────────────────┘ └──────────┘ │
│ ▲ │
│ ┌───────────┐ │ │
│ │ Documents │─────────▶│ (Pre-computed embeddings in ChromaDB) │
│ └───────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────────────┘

Project Structure
embedding-model/
├── src/
│ ├── __init__.py
│ ├── model.py # Embedding model
│ ├── tokenizer.py # Tokenization
│ ├── pooling.py # Pooling strategies
│ ├── similarity.py # Similarity functions
│ └── index.py # Vector indexing
├── api/
│ └── main.py # FastAPI application
├── tests/
│ └── test_embeddings.py
├── requirements.txt
└── Dockerfile

Implementation
Step 1: Dependencies
torch>=2.0.0
transformers>=4.30.0
tokenizers>=0.13.0
chromadb>=0.4.0
fastapi>=0.100.0
uvicorn>=0.23.0
numpy>=1.24.0

Step 2: Tokenizer Wrapper
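Before the wrapper, a quick motivation for subword tokenization. The toy word-level encoder below (an illustration with a made-up five-entry vocabulary, not part of the project code) maps any unseen word to `<UNK>`, losing its meaning entirely; subword tokenizers such as WordPiece split unknown words into known pieces instead:

```python
# Toy illustration: a word-level vocabulary has no graceful fallback
# for words it has never seen -- they all collapse to <UNK>.
vocab = {"<PAD>": 0, "<UNK>": 1, "the": 2, "cat": 3, "sat": 4}

def word_encode(text: str) -> list[int]:
    """Map each lowercase word to its ID, falling back to <UNK> (1)."""
    return [vocab.get(w, 1) for w in text.lower().split()]

print(word_encode("The cat sat"))          # [2, 3, 4]
print(word_encode("The caterpillar sat"))  # [2, 1, 4] -- "caterpillar" is lost
```

This is the failure mode the `EmbeddingTokenizer` below sidesteps by delegating to HuggingFace's subword tokenizers.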
"""Tokenizer for embedding model."""
from typing import List, Dict, Optional
from transformers import AutoTokenizer
import torch
class EmbeddingTokenizer:
"""
Tokenizer wrapper for embedding models.
Uses HuggingFace tokenizers for subword tokenization,
which handles out-of-vocabulary words gracefully.
"""
def __init__(
self,
model_name: str = "bert-base-uncased",
max_length: int = 128
):
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
self.max_length = max_length
self.vocab_size = self.tokenizer.vocab_size
self.pad_token_id = self.tokenizer.pad_token_id
def encode(
self,
texts: List[str],
return_tensors: bool = True
) -> Dict[str, torch.Tensor]:
"""
Encode texts to token IDs with attention masks.
Args:
texts: List of input texts
return_tensors: Whether to return PyTorch tensors
Returns:
Dictionary with input_ids and attention_mask
"""
encoded = self.tokenizer(
texts,
padding=True,
truncation=True,
max_length=self.max_length,
return_tensors="pt" if return_tensors else None
)
return {
"input_ids": encoded["input_ids"],
"attention_mask": encoded["attention_mask"]
}
def decode(self, token_ids: torch.Tensor) -> List[str]:
"""Decode token IDs back to text."""
return self.tokenizer.batch_decode(
token_ids,
skip_special_tokens=True
)
class SimpleTokenizer:
"""
Simple word-level tokenizer for learning purposes.
Demonstrates tokenization fundamentals without
external dependencies.
"""
def __init__(self, vocab_size: int = 10000):
self.vocab_size = vocab_size
self.word2idx: Dict[str, int] = {"<PAD>": 0, "<UNK>": 1}
self.idx2word: Dict[int, str] = {0: "<PAD>", 1: "<UNK>"}
def fit(self, texts: List[str]) -> None:
"""Build vocabulary from texts."""
from collections import Counter
# Count word frequencies
word_counts = Counter()
for text in texts:
words = text.lower().split()
word_counts.update(words)
# Add most common words to vocabulary
for idx, (word, _) in enumerate(
word_counts.most_common(self.vocab_size - 2),
start=2
):
self.word2idx[word] = idx
self.idx2word[idx] = word
def encode(
self,
texts: List[str],
max_length: int = 128
) -> Dict[str, torch.Tensor]:
"""Encode texts to padded sequences."""
batch_ids = []
batch_masks = []
for text in texts:
words = text.lower().split()
ids = [
self.word2idx.get(w, 1) # 1 = <UNK>
for w in words[:max_length]
]
# Pad or truncate
mask = [1] * len(ids)
padding_length = max_length - len(ids)
if padding_length > 0:
ids.extend([0] * padding_length)
mask.extend([0] * padding_length)
batch_ids.append(ids)
batch_masks.append(mask)
return {
"input_ids": torch.tensor(batch_ids, dtype=torch.long),
"attention_mask": torch.tensor(batch_masks, dtype=torch.long)
        }

Step 3: Pooling Strategies
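Before the pooling modules, a tiny numeric sketch (values invented for illustration) of why the attention mask matters: a naive mean averages over PAD rows and skews the result, while the masked mean uses only the real tokens:

```python
import torch

# One sequence of two real tokens plus one PAD row (all zeros).
token_embeddings = torch.tensor([[[1.0, 3.0],
                                  [3.0, 5.0],
                                  [0.0, 0.0]]])   # [1, 3, 2]
attention_mask = torch.tensor([[1, 1, 0]])        # [1, 3]

mask = attention_mask.unsqueeze(-1).float()       # [1, 3, 1]
masked_mean = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
naive_mean = token_embeddings.mean(dim=1)

print(masked_mean)  # tensor([[2., 4.]]) -- average of the two real tokens
print(naive_mean)   # dragged toward zero by the PAD row
```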
"""Pooling strategies for converting token embeddings to sentence embeddings."""
import torch
import torch.nn as nn
from typing import Optional
class MeanPooling(nn.Module):
"""
Mean pooling over token embeddings.
Takes the average of all token embeddings, weighted by
attention mask to ignore padding tokens.
"""
def forward(
self,
token_embeddings: torch.Tensor,
attention_mask: torch.Tensor
) -> torch.Tensor:
"""
Args:
token_embeddings: [batch_size, seq_len, hidden_dim]
attention_mask: [batch_size, seq_len]
Returns:
Pooled embeddings [batch_size, hidden_dim]
"""
# Expand mask to match embedding dimensions
mask_expanded = attention_mask.unsqueeze(-1).expand(
token_embeddings.size()
).float()
# Sum embeddings (only non-padding tokens)
sum_embeddings = torch.sum(
token_embeddings * mask_expanded, dim=1
)
# Count non-padding tokens
sum_mask = mask_expanded.sum(dim=1)
sum_mask = torch.clamp(sum_mask, min=1e-9) # Avoid division by zero
# Compute mean
return sum_embeddings / sum_mask
class CLSPooling(nn.Module):
"""
CLS token pooling.
Uses only the [CLS] token embedding as the sentence representation.
Common in BERT-style models.
"""
def forward(
self,
token_embeddings: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None
) -> torch.Tensor:
"""
Args:
token_embeddings: [batch_size, seq_len, hidden_dim]
Returns:
CLS embedding [batch_size, hidden_dim]
"""
return token_embeddings[:, 0, :]
class MaxPooling(nn.Module):
"""
Max pooling over token embeddings.
Takes the maximum value for each dimension across all tokens.
"""
def forward(
self,
token_embeddings: torch.Tensor,
attention_mask: torch.Tensor
) -> torch.Tensor:
"""
Args:
token_embeddings: [batch_size, seq_len, hidden_dim]
attention_mask: [batch_size, seq_len]
Returns:
Pooled embeddings [batch_size, hidden_dim]
"""
# Set padding tokens to large negative value
mask_expanded = attention_mask.unsqueeze(-1).expand(
token_embeddings.size()
)
token_embeddings = token_embeddings.masked_fill(
mask_expanded == 0, -1e9
)
# Max pool
return torch.max(token_embeddings, dim=1)[0]
class AttentionPooling(nn.Module):
"""
Attention-based pooling.
Learns to weight different tokens based on their importance.
"""
def __init__(self, hidden_dim: int):
super().__init__()
self.attention = nn.Sequential(
nn.Linear(hidden_dim, hidden_dim),
nn.Tanh(),
nn.Linear(hidden_dim, 1)
)
def forward(
self,
token_embeddings: torch.Tensor,
attention_mask: torch.Tensor
) -> torch.Tensor:
"""
Args:
token_embeddings: [batch_size, seq_len, hidden_dim]
attention_mask: [batch_size, seq_len]
Returns:
Pooled embeddings [batch_size, hidden_dim]
"""
# Compute attention weights
weights = self.attention(token_embeddings).squeeze(-1)
# Mask padding tokens
weights = weights.masked_fill(attention_mask == 0, -1e9)
weights = torch.softmax(weights, dim=1)
# Weighted sum
return torch.sum(
token_embeddings * weights.unsqueeze(-1), dim=1
        )

Understanding Pooling Strategies:
┌─────────────────────────────────────────────────────────────────────────────┐
│ POOLING: FROM TOKEN EMBEDDINGS TO SENTENCE EMBEDDING │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Input: Token embeddings [batch, seq_len, hidden_dim] │
│ Output: Single embedding [batch, hidden_dim] │
│ │
│ MEAN POOLING (Most Common): │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Tokens: [The] [cat] [sat] [PAD] [PAD] │ │
│ │ Mask: [ 1 ] [ 1 ] [ 1 ] [ 0 ] [ 0 ] │ │
│ │ │ │
│ │ Result = (emb[The] + emb[cat] + emb[sat]) / 3 │ │
│ │ ▲ Only non-PAD tokens contribute │ │
│ │ │ │
│ │ ✓ Works for any sequence length │ │
│ │ ✓ Uses information from all tokens │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ CLS POOLING (BERT-style): │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Tokens: [CLS] [The] [cat] [sat] [SEP] │ │
│ │ │ │ │
│ │ └──► Use ONLY this token's embedding │ │
│ │ │ │
│ │ ✓ Simple, fast │ │
│ │ ✗ Requires model trained with CLS pooling objective │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ ATTENTION POOLING (Learned): │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Tokens: [The] [cat] [sat] [on] [mat] │ │
│ │ Weights: 0.1 0.4 0.3 0.1 0.1 (learned, sum to 1) │ │
│ │ ▲ │ │
│ │ └── Model learns "cat" is most important │ │
│ │ │ │
│ │ Result = 0.1*emb[The] + 0.4*emb[cat] + 0.3*emb[sat] + ... │ │
│ │ │ │
│ │ ✓ Adaptive - focuses on important tokens │ │
│ │ ✗ Requires training the attention layer │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘

When to Use Each:
| Strategy | Best For | sentence-transformers Default |
|---|---|---|
| Mean | General text, any model | Yes (most models) |
| CLS | BERT-trained models | No |
| Attention | When you can train pooling | No |
| Max | Rare - when key info is sparse | No |
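The strategies can be compared on one toy batch. The sketch below re-implements them inline with plain tensor ops (values are made up; the logic mirrors the `MeanPooling`, `CLSPooling`, and `MaxPooling` modules above):

```python
import torch

# One toy sequence; pretend the last row is a PAD token.
emb = torch.tensor([[[1.0, 0.0],
                     [0.0, 2.0],
                     [9.0, 9.0]]])   # [1, 3, 2]
mask = torch.tensor([[1, 1, 0]])     # [1, 3]

m = mask.unsqueeze(-1).float()                                 # [1, 3, 1]
mean_pooled = (emb * m).sum(dim=1) / m.sum(dim=1)              # [[0.5, 1.0]]
cls_pooled = emb[:, 0, :]                                      # [[1.0, 0.0]]
max_pooled = emb.masked_fill(m == 0, -1e9).max(dim=1).values   # [[1.0, 2.0]]

# All three collapse [1, 3, 2] -> [1, 2], but produce different vectors.
print(mean_pooled, cls_pooled, max_pooled)
```

Note that max pooling only ignores the PAD row (with its large values of 9.0) because it is masked out first; without masking it would dominate the result.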
Step 4: Embedding Model
"""Custom embedding model."""
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Dict, Optional, List
from .pooling import MeanPooling, CLSPooling, AttentionPooling
class TextEmbeddingModel(nn.Module):
"""
Text embedding model that converts text to dense vectors.
Architecture:
- Embedding layer for token representations
- Transformer encoder for contextual understanding
- Pooling layer for sentence representation
- Optional projection head for dimension reduction
Args:
vocab_size: Size of vocabulary
embedding_dim: Dimension of token embeddings
hidden_dim: Hidden dimension of transformer
num_layers: Number of transformer layers
num_heads: Number of attention heads
output_dim: Final embedding dimension
pooling: Pooling strategy ('mean', 'cls', 'attention')
dropout: Dropout probability
"""
def __init__(
self,
vocab_size: int,
embedding_dim: int = 256,
hidden_dim: int = 512,
num_layers: int = 4,
num_heads: int = 8,
output_dim: int = 384,
pooling: str = "mean",
dropout: float = 0.1,
max_length: int = 128
):
super().__init__()
self.embedding_dim = embedding_dim
self.output_dim = output_dim
# Token embedding
self.token_embedding = nn.Embedding(
vocab_size,
embedding_dim,
padding_idx=0
)
# Positional embedding
self.position_embedding = nn.Embedding(max_length, embedding_dim)
# Transformer encoder
encoder_layer = nn.TransformerEncoderLayer(
d_model=embedding_dim,
nhead=num_heads,
dim_feedforward=hidden_dim,
dropout=dropout,
batch_first=True
)
self.transformer = nn.TransformerEncoder(
encoder_layer,
num_layers=num_layers
)
# Pooling layer
if pooling == "mean":
self.pooling = MeanPooling()
elif pooling == "cls":
self.pooling = CLSPooling()
elif pooling == "attention":
self.pooling = AttentionPooling(embedding_dim)
else:
raise ValueError(f"Unknown pooling: {pooling}")
# Projection head
self.projection = nn.Sequential(
nn.Linear(embedding_dim, hidden_dim),
nn.GELU(),
nn.Linear(hidden_dim, output_dim)
)
# Layer normalization
self.layer_norm = nn.LayerNorm(output_dim)
self._init_weights()
def _init_weights(self):
"""Initialize weights."""
for module in self.modules():
if isinstance(module, nn.Linear):
nn.init.xavier_uniform_(module.weight)
if module.bias is not None:
nn.init.zeros_(module.bias)
            elif isinstance(module, nn.Embedding):
                nn.init.normal_(module.weight, std=0.02)
                if module.padding_idx is not None:
                    # nn.init.normal_ overwrites the zero row that
                    # padding_idx created at construction; restore it
                    with torch.no_grad():
                        module.weight[module.padding_idx].fill_(0)
def forward(
self,
input_ids: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None
) -> torch.Tensor:
"""
Compute embeddings for input texts.
Args:
input_ids: Token IDs [batch_size, seq_len]
attention_mask: Attention mask [batch_size, seq_len]
Returns:
Normalized embeddings [batch_size, output_dim]
"""
batch_size, seq_len = input_ids.shape
if attention_mask is None:
attention_mask = (input_ids != 0).long()
# Get token embeddings
token_emb = self.token_embedding(input_ids)
# Add positional embeddings
positions = torch.arange(seq_len, device=input_ids.device)
positions = positions.unsqueeze(0).expand(batch_size, -1)
pos_emb = self.position_embedding(positions)
embeddings = token_emb + pos_emb
        # Build the key padding mask for the transformer:
        # shape [batch, seq], True at PAD positions so self-attention ignores them
        src_key_padding_mask = (attention_mask == 0)
# Apply transformer
hidden_states = self.transformer(
embeddings,
src_key_padding_mask=src_key_padding_mask
)
# Pool to sentence embedding
pooled = self.pooling(hidden_states, attention_mask)
# Project and normalize
projected = self.projection(pooled)
normalized = self.layer_norm(projected)
# L2 normalize for cosine similarity
normalized = F.normalize(normalized, p=2, dim=-1)
return normalized
@torch.no_grad()
def encode(
self,
texts: List[str],
tokenizer,
batch_size: int = 32
) -> torch.Tensor:
"""
Encode texts to embeddings.
Args:
texts: List of input texts
tokenizer: Tokenizer instance
batch_size: Batch size for encoding
Returns:
Embeddings [num_texts, output_dim]
"""
self.eval()
all_embeddings = []
for i in range(0, len(texts), batch_size):
batch_texts = texts[i:i + batch_size]
encoded = tokenizer.encode(batch_texts)
input_ids = encoded["input_ids"].to(
next(self.parameters()).device
)
attention_mask = encoded["attention_mask"].to(
next(self.parameters()).device
)
embeddings = self.forward(input_ids, attention_mask)
all_embeddings.append(embeddings.cpu())
        return torch.cat(all_embeddings, dim=0)

Step 5: Similarity Functions
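Since the model L2-normalizes its outputs, cosine similarity reduces to a plain dot product. A quick check with hand-picked 2-D vectors (chosen for illustration):

```python
import torch
import torch.nn.functional as F

# On unit vectors, cosine similarity IS the dot product.
a = F.normalize(torch.tensor([3.0, 4.0]), p=2, dim=-1)   # [0.6, 0.8]
b = F.normalize(torch.tensor([6.0, 8.0]), p=2, dim=-1)   # same direction as a
c = F.normalize(torch.tensor([4.0, -3.0]), p=2, dim=-1)  # orthogonal to a

print(torch.dot(a, b))  # ~1.0  (identical direction)
print(torch.dot(a, c))  # ~0.0  (unrelated)
```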
"""Similarity computation functions."""
import torch
import numpy as np
from typing import List, Tuple, Optional
def cosine_similarity(
query: torch.Tensor,
documents: torch.Tensor
) -> torch.Tensor:
"""
Compute cosine similarity between query and documents.
Args:
query: Query embedding [1, dim] or [dim]
documents: Document embeddings [num_docs, dim]
Returns:
Similarity scores [num_docs]
"""
if query.dim() == 1:
query = query.unsqueeze(0)
# Normalize (should already be normalized, but ensure)
query = torch.nn.functional.normalize(query, p=2, dim=-1)
documents = torch.nn.functional.normalize(documents, p=2, dim=-1)
# Compute similarity
similarity = torch.mm(query, documents.t()).squeeze(0)
return similarity
def euclidean_distance(
query: torch.Tensor,
documents: torch.Tensor
) -> torch.Tensor:
"""
Compute Euclidean distance between query and documents.
Args:
query: Query embedding [1, dim] or [dim]
documents: Document embeddings [num_docs, dim]
Returns:
Distances [num_docs] (lower is more similar)
"""
if query.dim() == 1:
query = query.unsqueeze(0)
# Compute pairwise distances
distances = torch.cdist(query, documents, p=2).squeeze(0)
return distances
def dot_product(
query: torch.Tensor,
documents: torch.Tensor
) -> torch.Tensor:
"""
Compute dot product similarity.
Args:
query: Query embedding [1, dim] or [dim]
documents: Document embeddings [num_docs, dim]
Returns:
Similarity scores [num_docs]
"""
if query.dim() == 1:
query = query.unsqueeze(0)
return torch.mm(query, documents.t()).squeeze(0)
class SemanticSearch:
"""
Semantic search using embedding similarity.
"""
def __init__(
self,
model,
tokenizer,
similarity_fn: str = "cosine"
):
self.model = model
self.tokenizer = tokenizer
self.similarity_fn = {
"cosine": cosine_similarity,
"euclidean": euclidean_distance,
"dot": dot_product
}[similarity_fn]
self.documents: List[str] = []
self.embeddings: Optional[torch.Tensor] = None
def index(self, documents: List[str]) -> None:
"""Index documents for search."""
self.documents = documents
self.embeddings = self.model.encode(documents, self.tokenizer)
print(f"Indexed {len(documents)} documents")
def search(
self,
query: str,
top_k: int = 5
) -> List[Tuple[str, float]]:
"""
Search for most similar documents.
Args:
query: Search query
top_k: Number of results to return
Returns:
List of (document, score) tuples
"""
if self.embeddings is None:
raise ValueError("No documents indexed")
# Encode query
query_embedding = self.model.encode([query], self.tokenizer)
# Compute similarities
scores = self.similarity_fn(query_embedding[0], self.embeddings)
# Get top-k
if self.similarity_fn == euclidean_distance:
# Lower is better for distance
top_indices = torch.argsort(scores)[:top_k]
else:
# Higher is better for similarity
top_indices = torch.argsort(scores, descending=True)[:top_k]
results = [
(self.documents[idx], scores[idx].item())
for idx in top_indices
]
        return results

Step 6: Vector Index with ChromaDB
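One detail worth verifying before the code: with `hnsw:space` set to `cosine`, ChromaDB reports a *distance* d = 1 − cos_sim, which is why the `search` method converts scores back with `1 - distance`. A toy check with made-up vectors:

```python
import torch
import torch.nn.functional as F

q = F.normalize(torch.tensor([1.0, 1.0]), p=2, dim=-1)
d = F.normalize(torch.tensor([1.0, 0.0]), p=2, dim=-1)

cos_sim = torch.dot(q, d)   # ~0.7071
cos_dist = 1 - cos_sim      # what ChromaDB's cosine space reports
recovered = 1 - cos_dist    # back to a similarity score
print(cos_sim, cos_dist, recovered)
```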
"""Vector indexing with ChromaDB."""
import chromadb
from chromadb.config import Settings
from typing import List, Dict, Optional
import torch
class VectorIndex:
"""
Vector index using ChromaDB for persistent storage.
"""
def __init__(
self,
model,
tokenizer,
collection_name: str = "embeddings",
persist_directory: Optional[str] = None
):
self.model = model
self.tokenizer = tokenizer
# Initialize ChromaDB
if persist_directory:
self.client = chromadb.PersistentClient(
path=persist_directory
)
else:
self.client = chromadb.Client()
# Get or create collection
self.collection = self.client.get_or_create_collection(
name=collection_name,
metadata={"hnsw:space": "cosine"}
)
def add(
self,
documents: List[str],
ids: Optional[List[str]] = None,
metadata: Optional[List[Dict]] = None
) -> None:
"""
Add documents to the index.
Args:
documents: List of document texts
ids: Optional document IDs
metadata: Optional metadata for each document
"""
# Generate embeddings
embeddings = self.model.encode(documents, self.tokenizer)
embeddings_list = embeddings.numpy().tolist()
# Generate IDs if not provided
if ids is None:
existing_count = self.collection.count()
ids = [f"doc_{existing_count + i}" for i in range(len(documents))]
# Add to collection
self.collection.add(
documents=documents,
embeddings=embeddings_list,
ids=ids,
metadatas=metadata
)
print(f"Added {len(documents)} documents to index")
def search(
self,
query: str,
top_k: int = 5,
filter: Optional[Dict] = None
) -> List[Dict]:
"""
Search for similar documents.
Args:
query: Search query
top_k: Number of results
filter: Optional metadata filter
Returns:
List of results with document, score, and metadata
"""
# Encode query
query_embedding = self.model.encode([query], self.tokenizer)
query_list = query_embedding.numpy().tolist()
# Search
results = self.collection.query(
query_embeddings=query_list,
n_results=top_k,
where=filter
)
# Format results
formatted = []
for i in range(len(results["documents"][0])):
formatted.append({
"document": results["documents"][0][i],
"id": results["ids"][0][i],
"score": 1 - results["distances"][0][i], # Convert distance to similarity
"metadata": results["metadatas"][0][i] if results["metadatas"] else None
})
return formatted
def delete(self, ids: List[str]) -> None:
"""Delete documents by ID."""
self.collection.delete(ids=ids)
def count(self) -> int:
"""Get number of documents in index."""
        return self.collection.count()

Step 7: FastAPI Application
"""FastAPI application for embedding service."""
import torch
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field
from typing import List, Optional, Dict
from src.model import TextEmbeddingModel
from src.tokenizer import EmbeddingTokenizer
from src.index import VectorIndex
class EmbedRequest(BaseModel):
texts: List[str] = Field(..., min_length=1, max_length=100)
class EmbedResponse(BaseModel):
embeddings: List[List[float]]
dimension: int
class SearchRequest(BaseModel):
query: str = Field(..., min_length=1)
top_k: int = Field(default=5, ge=1, le=100)
filter: Optional[Dict] = None
class SearchResult(BaseModel):
document: str
score: float
id: str
metadata: Optional[Dict] = None
class IndexRequest(BaseModel):
documents: List[str] = Field(..., min_length=1)
ids: Optional[List[str]] = None
metadata: Optional[List[Dict]] = None
# Global instances
model: TextEmbeddingModel = None
tokenizer: EmbeddingTokenizer = None
index: VectorIndex = None
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
app = FastAPI(
title="Embedding Service API",
description="Generate and search text embeddings"
)
@app.on_event("startup")
async def startup():
global model, tokenizer, index
# Initialize tokenizer
tokenizer = EmbeddingTokenizer()
# Initialize model
model = TextEmbeddingModel(
vocab_size=tokenizer.vocab_size,
embedding_dim=256,
output_dim=384
)
model.to(device)
model.eval()
# Initialize index
index = VectorIndex(
model=model,
tokenizer=tokenizer,
collection_name="documents",
persist_directory="./chroma_db"
)
print(f"Service started on {device}")
@app.get("/health")
async def health():
return {
"status": "healthy",
"device": str(device),
"documents_indexed": index.count() if index else 0
}
@app.post("/embed", response_model=EmbedResponse)
async def embed(request: EmbedRequest):
"""Generate embeddings for texts."""
embeddings = model.encode(request.texts, tokenizer)
return EmbedResponse(
embeddings=embeddings.tolist(),
dimension=model.output_dim
)
@app.post("/index")
async def index_documents(request: IndexRequest):
"""Add documents to the search index."""
index.add(
documents=request.documents,
ids=request.ids,
metadata=request.metadata
)
return {"indexed": len(request.documents), "total": index.count()}
@app.post("/search", response_model=List[SearchResult])
async def search(request: SearchRequest):
"""Search for similar documents."""
results = index.search(
query=request.query,
top_k=request.top_k,
filter=request.filter
)
return [SearchResult(**r) for r in results]
if __name__ == "__main__":
import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

Running the Project
# Install dependencies
pip install -r requirements.txt
# Run the API
uvicorn api.main:app --reload
# Generate embeddings
curl -X POST http://localhost:8000/embed \
-H "Content-Type: application/json" \
-d '{"texts": ["Hello world", "How are you?"]}'
# Index documents
curl -X POST http://localhost:8000/index \
-H "Content-Type: application/json" \
-d '{"documents": ["Python is great", "Machine learning is fun"]}'
# Search
curl -X POST http://localhost:8000/search \
-H "Content-Type: application/json" \
  -d '{"query": "programming languages", "top_k": 3}'

Key Concepts
Embeddings
Dense vector representations that capture semantic meaning:
# Similar texts have similar embeddings
embed("I love dogs") ≈ embed("I adore puppies")
embed("I love dogs") ≠ embed("The stock market crashed")

Mean Pooling
Averaging token embeddings (weighted by attention mask):
# Ignore padding tokens
mask_expanded = attention_mask.unsqueeze(-1)
sum_embeddings = (token_embeddings * mask_expanded).sum(dim=1)
mean_embedding = sum_embeddings / mask_expanded.sum(dim=1)

Cosine Similarity
Measures angle between vectors (independent of magnitude):
similarity = (A · B) / (||A|| × ||B||)
# Range: -1 (opposite) to 1 (identical)

L2 Normalization
Normalizing embeddings to unit length for cosine similarity:
normalized = embedding / embedding.norm(p=2, dim=-1, keepdim=True)

Key Concepts Recap
| Concept | What It Is | Why It Matters |
|---|---|---|
| Mean Pooling | Average token embeddings weighted by attention mask | Best general-purpose pooling, handles variable lengths |
| CLS Pooling | Use only the [CLS] token embedding | Common in BERT-style models, single-vector representation |
| Attention Pooling | Learned weights for each token position | Adaptive focus on important tokens |
| L2 Normalization | Scale embeddings to unit length | Required for cosine similarity, makes magnitudes comparable |
| Cosine Similarity | Angle between vectors (dot product of normalized) | Standard metric for semantic similarity, range [-1, 1] |
| Positional Embedding | Adds position information to tokens | Transformers need explicit position encoding |
| ChromaDB | Vector database with HNSW indexing | Fast approximate nearest neighbor search at scale |
| Subword Tokenization | Split words into smaller pieces (BPE/WordPiece) | Handles out-of-vocabulary words gracefully |
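One recap item that is easy to verify numerically: the large-negative masking used by both max and attention pooling drives PAD weights to zero after softmax while keeping the weights a valid distribution (toy scores invented for illustration):

```python
import torch

# Raw attention scores for two real tokens and one PAD position.
scores = torch.tensor([[2.0, 1.0, 0.5]])
mask = torch.tensor([[1, 1, 0]])

# Filling PAD positions with -1e9 makes exp(score) vanish in softmax.
weights = torch.softmax(scores.masked_fill(mask == 0, -1e9), dim=1)
print(weights)        # PAD position gets ~0 weight
print(weights.sum())  # still sums to 1 -- a valid distribution
```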
Next Steps
- Inference API - Serve models efficiently
- Custom Reranker - Train a reranker for RAG