Improve retrieval accuracy using cross-encoder reranking for more relevant results

RAG with Reranking

TL;DR

Basic RAG retrieval often returns "close but not quite right" results. This tutorial teaches you two-stage retrieval: first cast a wide net with fast bi-encoders, then use slower but smarter cross-encoders to find the truly relevant documents. You'll implement both cloud (Cohere) and local rerankers.

Property	Value
Difficulty	Intermediate
Time	~4 hours
Code Size	~300 LOC
Prerequisites	Intelligent Document Q&A

Tech Stack

Technology	Purpose
LangChain	RAG orchestration
OpenAI	Embeddings + GPT-4
Cohere	Reranking API
ChromaDB	Vector database
sentence-transformers	Local cross-encoder
FastAPI	REST API

Prerequisites

Completed Intelligent Document Q&A tutorial
Python 3.10+ with async understanding
OpenAI API key (Get one here)
Cohere API key (Get one here) - optional for cloud reranking

What You'll Learn

Understand why basic retrieval often returns suboptimal results
Implement two-stage retrieval (retrieve then rerank)
Compare bi-encoders vs cross-encoders for ranking
Integrate Cohere Rerank API for production use
Build a local reranker using sentence-transformers
Measure retrieval quality improvements with metrics

The Problem with Basic Retrieval

Basic semantic search uses bi-encoders: query and documents are embedded independently, then compared via cosine similarity. This is fast but has limitations:

BI-ENCODER (Fast, Less Accurate)
┌──────────────────────────────────────────────────────────┐
│                                                          │
│  Query ──► [Encoder] ──► Vector ─┐                       │
│                                   ├──► Cosine Similarity │
│  Document ──► [Encoder] ──► Vector┘                      │
│                                                          │
│  (Query and document encoded SEPARATELY)                 │
└──────────────────────────────────────────────────────────┘

Problems:

Query and document are encoded separately (no interaction)
Misses nuanced semantic relationships
"What causes headaches?" might rank "Headaches are painful" higher than "Caffeine withdrawal triggers migraines"

Solution: Cross-Encoder Reranking

CROSS-ENCODER (Slower, More Accurate)
┌──────────────────────────────────────────────────────────────┐
│                                                              │
│  Query ────┐                                                 │
│            ├──► [Concatenate] ──► [Cross-Encoder] ──► Score  │
│  Document ─┘                                                 │
│                                                              │
│  (Query and document processed TOGETHER)                     │
└──────────────────────────────────────────────────────────────┘

Cross-encoders process query and document together, enabling deep semantic understanding at the cost of speed.

Why This Difference Matters

Aspect	Bi-Encoder	Cross-Encoder
How it works	Encodes query and document separately into vectors, then compares	Encodes query+document together, sees all token interactions
Speed	Fast (can pre-compute document vectors)	Slow (must compute for each query-doc pair)
Accuracy	Good for recall (finding candidates)	Excellent for precision (ranking accuracy)
Use case	Search millions of documents	Rerank top 10-50 candidates

Think of it like job recruiting:

Bi-encoder = Resume keyword scanner (fast, catches most good candidates)
Cross-encoder = Detailed interview (slow, finds the best match)

Two-Stage Retrieval Architecture

The key insight: use fast bi-encoders to get candidates, then use accurate cross-encoders to rerank them.

┌─────────────────────────────────────────────────────────────────────────┐
│                    TWO-STAGE RETRIEVAL ARCHITECTURE                     │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  STAGE 1: Candidate Retrieval (Fast)                                   │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │ User Query ──► [Bi-Encoder] ──► [Vector DB] ──► Top 20 Candidates│  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                         │                               │
│                                         ▼                               │
│  STAGE 2: Reranking (Accurate)                                         │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │ User Query ─┐                                                     │  │
│  │             ├──► [Cross-Encoder Reranker] ──► Top 5 Reranked      │  │
│  │ Candidates ─┘                                                     │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                         │                               │
│                                         ▼                               │
│  STAGE 3: Generation                                                   │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │ User Query ─┐                                                     │  │
│  │             ├──► [LLM] ──► Answer                                 │  │
│  │ Top 5 ──────┘                                                     │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Data Flow

┌─────────────────────────────────────────────────────────────────────────┐
│                         DATA FLOW SEQUENCE                              │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  User       FastAPI     VectorStore    Reranker       OpenAI            │
│   │            │            │             │             │               │
│   │─Question──►│            │             │             │               │
│   │            │─Search(20)►│             │             │               │
│   │            │◄─Candidates│             │             │               │
│   │            │            │             │             │               │
│   │            │────Rerank───────────────►│             │               │
│   │            │            │  (Cross-encoder scores    │               │
│   │            │            │   each query-doc pair)    │               │
│   │            │◄───Top 5 Reranked────────│             │               │
│   │            │            │             │             │               │
│   │            │────Generate with context───────────────►│              │
│   │            │◄──────────Answer───────────────────────│               │
│   │◄─Answer────│            │             │             │               │
│   │ +Sources   │            │             │             │               │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Project Structure

rag-reranking/
├── src/
│   ├── __init__.py
│   ├── config.py
│   ├── document_processor.py
│   ├── rerankers/
│   │   ├── __init__.py
│   │   ├── base.py
│   │   ├── cohere_reranker.py
│   │   └── local_reranker.py
│   ├── rag_engine.py
│   └── api.py
├── tests/
│   ├── test_rerankers.py
│   └── test_retrieval.py
├── data/
│   └── sample.pdf
├── .env
├── pyproject.toml
└── README.md

Implementation

Step 1: Project Setup

Create your project and install dependencies:

mkdir rag-reranking && cd rag-reranking
uv init
uv venv && source .venv/bin/activate

uv add langchain langchain-openai langchain-chroma
uv add chromadb pypdf python-dotenv
uv add fastapi uvicorn python-multipart
uv add cohere sentence-transformers

Configure environment variables:

.env

OPENAI_API_KEY=sk-your-key-here
COHERE_API_KEY=your-cohere-key-here
CHROMA_PERSIST_DIR=./chroma_db

# Retrieval settings
INITIAL_K=20
RERANK_TOP_K=5

# Reranker choice: "cohere" or "local"
RERANKER_TYPE=local
LOCAL_RERANKER_MODEL=cross-encoder/ms-marco-MiniLM-L-6-v2

Step 2: Configuration Module

src/config.py

"""Configuration for RAG with Reranking system."""
import os
from dataclasses import dataclass
from dotenv import load_dotenv

load_dotenv()


@dataclass
class Config:
    """Application configuration."""

    # API Keys
    OPENAI_API_KEY: str = os.getenv("OPENAI_API_KEY", "")
    COHERE_API_KEY: str = os.getenv("COHERE_API_KEY", "")

    # Storage
    CHROMA_PERSIST_DIR: str = os.getenv("CHROMA_PERSIST_DIR", "./chroma_db")

    # Retrieval settings
    INITIAL_K: int = int(os.getenv("INITIAL_K", "20"))
    RERANK_TOP_K: int = int(os.getenv("RERANK_TOP_K", "5"))

    # Reranker settings
    RERANKER_TYPE: str = os.getenv("RERANKER_TYPE", "local")
    LOCAL_RERANKER_MODEL: str = os.getenv(
        "LOCAL_RERANKER_MODEL",
        "cross-encoder/ms-marco-MiniLM-L-6-v2"
    )

    # Model settings
    EMBEDDING_MODEL: str = "text-embedding-3-small"
    LLM_MODEL: str = "gpt-4o-mini"
    TEMPERATURE: float = 0.1

    # Chunking
    CHUNK_SIZE: int = 1000
    CHUNK_OVERLAP: int = 200

    @classmethod
    def validate(cls) -> None:
        """Validate required configuration."""
        if not cls.OPENAI_API_KEY:
            raise ValueError("OPENAI_API_KEY is required")


config = Config()

Understanding the Retrieval Settings:

Setting	Value	Why This Choice
`INITIAL_K=20`	Retrieve 20 candidates	Cast a wide net to not miss relevant docs
`RERANK_TOP_K=5`	Keep top 5 after reranking	Focus LLM context on best matches
`RERANKER_TYPE`	"local" or "cohere"	Local = free/offline, Cohere = higher quality

The ratio matters: retrieving 20 and keeping 5 means the reranker filters out 75% of candidates. If you retrieve too few (e.g., 5), you might miss relevant documents. If you retrieve too many (e.g., 100), reranking becomes slow.

Step 3: Reranker Base Class

Create src/rerankers/base.py:

src/rerankers/base.py

"""Base interface for rerankers."""
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List

from langchain_core.documents import Document


@dataclass
class RankedDocument:
    """A document with its relevance score."""

    document: Document
    score: float
    original_rank: int

    def __repr__(self) -> str:
        content_preview = self.document.page_content[:50] + "..."
        return f"RankedDocument(score={self.score:.4f}, content='{content_preview}')"


class BaseReranker(ABC):
    """Abstract base class for reranking implementations."""

    @abstractmethod
    def rerank(
        self,
        query: str,
        documents: List[Document],
        top_k: int = 5
    ) -> List[RankedDocument]:
        """
        Rerank documents based on relevance to query.

        Args:
            query: The search query
            documents: List of candidate documents
            top_k: Number of top documents to return

        Returns:
            List of RankedDocument sorted by relevance (highest first)
        """
        pass

    @property
    @abstractmethod
    def name(self) -> str:
        """Return the reranker name for logging."""
        pass

Understanding the Abstract Base Class Pattern:

BaseReranker (Abstract)        # Defines the interface
     ├── CohereReranker        # Cloud implementation
     └── LocalReranker         # Local implementation

Why use this pattern?

Interchangeability: Your RAG engine doesn't care which reranker it uses—it just calls rerank(). This lets you swap implementations without changing application code.
RankedDocument dataclass: Bundles the document with its relevance score and original position. The original_rank field is useful for analyzing how much reranking changed the order.
@abstractmethod decorator: Forces any subclass to implement rerank() and name. If someone creates a new reranker without these methods, Python raises an error immediately.

Step 4: Cohere Reranker (Cloud)

Create src/rerankers/cohere_reranker.py:

src/rerankers/cohere_reranker.py

"""Cohere Rerank API integration."""
from typing import List

import cohere
from langchain_core.documents import Document

from src.config import config
from src.rerankers.base import BaseReranker, RankedDocument


class CohereReranker(BaseReranker):
    """Reranker using Cohere's Rerank API.

    Pros:
    - High quality reranking
    - No GPU required
    - Easy to use

    Cons:
    - Requires API key
    - Costs per request
    - Network latency
    """

    def __init__(self, model: str = "rerank-english-v3.0"):
        if not config.COHERE_API_KEY:
            raise ValueError("COHERE_API_KEY is required for CohereReranker")

        self.client = cohere.Client(config.COHERE_API_KEY)
        self.model = model

    @property
    def name(self) -> str:
        return f"Cohere/{self.model}"

    def rerank(
        self,
        query: str,
        documents: List[Document],
        top_k: int = 5
    ) -> List[RankedDocument]:
        """Rerank using Cohere API."""
        if not documents:
            return []

        # Extract text content for Cohere
        doc_texts = [doc.page_content for doc in documents]

        # Call Cohere Rerank API
        response = self.client.rerank(
            model=self.model,
            query=query,
            documents=doc_texts,
            top_n=top_k,
            return_documents=False  # We already have the docs
        )

        # Build ranked results
        ranked = []
        for result in response.results:
            ranked.append(RankedDocument(
                document=documents[result.index],
                score=result.relevance_score,
                original_rank=result.index
            ))

        return ranked

How the Cohere API Works:

┌─────────────────────────────────────────────────────────┐
│                    Your Application                      │
├─────────────────────────────────────────────────────────┤
│  1. Send: query + 20 document texts                     │
│                        ↓                                 │
│  2. Cohere's servers run cross-encoder on all pairs     │
│                        ↓                                 │
│  3. Receive: indices sorted by relevance + scores       │
└─────────────────────────────────────────────────────────┘

Key points about the implementation:

Line	What It Does
`return_documents=False`	We already have the docs, just need indices—saves bandwidth
`result.index`	Position in original list (so we can map back to our Document objects)
`result.relevance_score`	0.0 to 1.0 score indicating how relevant the doc is to the query

When to use Cohere:

Production systems needing consistent quality
Multilingual content (Cohere supports 100+ languages)
When you don't have GPU resources for local models

Step 5: Local Reranker (Free)

Create src/rerankers/local_reranker.py:

src/rerankers/local_reranker.py

"""Local cross-encoder reranker using sentence-transformers."""
from typing import List

from langchain_core.documents import Document
from sentence_transformers import CrossEncoder

from src.config import config
from src.rerankers.base import BaseReranker, RankedDocument


class LocalReranker(BaseReranker):
    """Reranker using local cross-encoder model.

    Pros:
    - Free (no API costs)
    - Fast for small batches
    - Works offline

    Cons:
    - Requires model download (~100MB)
    - Uses local compute
    - May be slower than cloud for large batches
    """

    # Popular cross-encoder models (smaller = faster, larger = better)
    MODELS = {
        "tiny": "cross-encoder/ms-marco-TinyBERT-L-2-v2",      # ~17MB, fastest
        "small": "cross-encoder/ms-marco-MiniLM-L-6-v2",       # ~90MB, balanced
        "medium": "cross-encoder/ms-marco-MiniLM-L-12-v2",     # ~130MB, better
        "large": "cross-encoder/ms-marco-electra-base",        # ~440MB, best
    }

    def __init__(self, model_name: str = None):
        model_name = model_name or config.LOCAL_RERANKER_MODEL

        # Allow shorthand names
        if model_name in self.MODELS:
            model_name = self.MODELS[model_name]

        self.model_name = model_name
        self.model = CrossEncoder(model_name, max_length=512)

    @property
    def name(self) -> str:
        return f"Local/{self.model_name.split('/')[-1]}"

    def rerank(
        self,
        query: str,
        documents: List[Document],
        top_k: int = 5
    ) -> List[RankedDocument]:
        """Rerank using local cross-encoder."""
        if not documents:
            return []

        # Prepare query-document pairs
        pairs = [[query, doc.page_content] for doc in documents]

        # Get scores from cross-encoder
        scores = self.model.predict(pairs)

        # Combine documents with scores
        doc_scores = list(zip(documents, scores, range(len(documents))))

        # Sort by score (highest first)
        doc_scores.sort(key=lambda x: x[1], reverse=True)

        # Build ranked results
        ranked = []
        for doc, score, original_idx in doc_scores[:top_k]:
            ranked.append(RankedDocument(
                document=doc,
                score=float(score),
                original_rank=original_idx
            ))

        return ranked

Understanding the Cross-Encoder Scoring:

# What happens inside model.predict(pairs):

pairs = [
    ["What is Python?", "Python is a programming language..."],   # pair 0
    ["What is Python?", "The python is a large snake..."],        # pair 1
    ["What is Python?", "JavaScript is used for web..."],         # pair 2
]

scores = model.predict(pairs)
# Result: [0.92, 0.31, 0.08]  ← Programming doc wins!

The cross-encoder sees both query and document tokens together, allowing it to understand:

"Python" in context of "programming" → programming language
"Python" near "snake" → the reptile
Query asks about "Python" but doc talks about "JavaScript" → not relevant

MS-MARCO Models Explained:

Model	Parameters	Why MS-MARCO?
TinyBERT-L-2	4.4M	Trained on 500K real Bing search queries
MiniLM-L-6	22M	Human-labeled relevance judgments
MiniLM-L-12	33M	Gold standard for passage ranking

MS-MARCO (Microsoft MAchine Reading COmprehension) is the most widely-used benchmark for training search models because it contains real search queries, not synthetic data.

Step 6: Reranker Factory

Create src/rerankers/__init__.py:

src/rerankers/__init__.py

"""Reranker factory and exports."""
from src.config import config
from src.rerankers.base import BaseReranker, RankedDocument
from src.rerankers.cohere_reranker import CohereReranker
from src.rerankers.local_reranker import LocalReranker


def get_reranker(reranker_type: str = None) -> BaseReranker:
    """Factory function to get the configured reranker.

    Args:
        reranker_type: "cohere" or "local". Defaults to config value.

    Returns:
        Configured reranker instance
    """
    reranker_type = reranker_type or config.RERANKER_TYPE

    if reranker_type == "cohere":
        return CohereReranker()
    elif reranker_type == "local":
        return LocalReranker()
    else:
        raise ValueError(f"Unknown reranker type: {reranker_type}")


__all__ = [
    "BaseReranker",
    "RankedDocument",
    "CohereReranker",
    "LocalReranker",
    "get_reranker",
]

The Factory Pattern:

The get_reranker() function is a factory—it creates objects based on configuration without the caller needing to know the details:

# Without factory (tight coupling):
if config.RERANKER_TYPE == "cohere":
    reranker = CohereReranker()
elif config.RERANKER_TYPE == "local":
    reranker = LocalReranker()
# ... repeated everywhere you need a reranker

# With factory (loose coupling):
reranker = get_reranker()  # One line, config-driven

Benefits:

Single point of change: Add a new reranker? Update only the factory
Configuration-driven: Switch between rerankers via .env file, no code changes
Testability: Easy to inject mock rerankers for testing

Step 7: Document Processor

src/document_processor.py

"""Document processing: extraction, chunking, and embedding."""
from pathlib import Path
from typing import List

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

from src.config import config


class DocumentProcessor:
    """Handles PDF loading, text extraction, and chunking."""

    def __init__(self):
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=config.CHUNK_SIZE,
            chunk_overlap=config.CHUNK_OVERLAP,
            length_function=len,
            separators=["\n\n", "\n", ". ", " ", ""]
        )

    def load_pdf(self, file_path: str | Path) -> List[Document]:
        """Load a PDF and return raw documents."""
        loader = PyPDFLoader(str(file_path))
        return loader.load()

    def chunk_documents(self, documents: List[Document]) -> List[Document]:
        """Split documents into smaller chunks for embedding."""
        chunks = self.text_splitter.split_documents(documents)

        for i, chunk in enumerate(chunks):
            chunk.metadata["chunk_id"] = i
            chunk.metadata["chunk_total"] = len(chunks)

        return chunks

    def process(self, file_path: str | Path) -> List[Document]:
        """Full pipeline: load PDF and chunk it."""
        documents = self.load_pdf(file_path)
        return self.chunk_documents(documents)

Note: This processor is identical to our basic RAG tutorial. The chunking parameters (CHUNK_SIZE=1000, CHUNK_OVERLAP=200) create chunks that are large enough for cross-encoders to work with meaningfully—very short chunks (< 100 chars) don't give rerankers much to analyze.

Step 8: RAG Engine with Reranking

src/rag_engine.py

"""RAG Engine with two-stage retrieval and reranking."""
from pathlib import Path
from typing import List, Optional

from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

from src.config import config
from src.document_processor import DocumentProcessor
from src.rerankers import get_reranker, BaseReranker, RankedDocument


class RAGEngine:
    """RAG Engine with two-stage retrieval: retrieve then rerank."""

    def __init__(self, reranker: Optional[BaseReranker] = None):
        config.validate()

        self.embeddings = OpenAIEmbeddings(
            model=config.EMBEDDING_MODEL,
            openai_api_key=config.OPENAI_API_KEY
        )

        self.llm = ChatOpenAI(
            model=config.LLM_MODEL,
            temperature=config.TEMPERATURE,
            openai_api_key=config.OPENAI_API_KEY
        )

        self.processor = DocumentProcessor()
        self.reranker = reranker or get_reranker()
        self.vectorstore: Optional[Chroma] = None
        self._load_or_create_vectorstore()

    def _load_or_create_vectorstore(self) -> None:
        """Initialize or load existing vector store."""
        persist_dir = Path(config.CHROMA_PERSIST_DIR)

        self.vectorstore = Chroma(
            persist_directory=str(persist_dir),
            embedding_function=self.embeddings
        )

    def ingest_document(self, file_path: str | Path) -> int:
        """Process and store a document in the vector store."""
        chunks = self.processor.process(file_path)
        self.vectorstore.add_documents(chunks)
        return len(chunks)

    def _retrieve_candidates(self, query: str, k: int) -> List[Document]:
        """Stage 1: Fast bi-encoder retrieval."""
        retriever = self.vectorstore.as_retriever(
            search_type="similarity",
            search_kwargs={"k": k}
        )
        return retriever.invoke(query)

    def _rerank_documents(
        self,
        query: str,
        documents: List[Document],
        top_k: int
    ) -> List[RankedDocument]:
        """Stage 2: Cross-encoder reranking."""
        return self.reranker.rerank(query, documents, top_k)

    def _format_ranked_docs(self, ranked_docs: List[RankedDocument]) -> str:
        """Format reranked documents for the prompt."""
        formatted = []
        for i, ranked in enumerate(ranked_docs, 1):
            doc = ranked.document
            source = doc.metadata.get("source", "Unknown")
            page = doc.metadata.get("page", "?")
            formatted.append(
                f"[Source {i}: {Path(source).name}, Page {page}] "
                f"(relevance: {ranked.score:.3f})\n{doc.page_content}"
            )
        return "\n\n---\n\n".join(formatted)

    def query(
        self,
        question: str,
        use_reranking: bool = True
    ) -> dict:
        """Answer a question using two-stage RAG.

        Args:
            question: The user's question
            use_reranking: Whether to apply reranking (for comparison)

        Returns:
            Dict with answer, sources, and retrieval metadata
        """
        if not self.vectorstore:
            raise ValueError("No documents ingested yet")

        # Stage 1: Retrieve candidates
        candidates = self._retrieve_candidates(
            question,
            k=config.INITIAL_K
        )

        # Stage 2: Rerank (optional)
        if use_reranking:
            ranked_docs = self._rerank_documents(
                question,
                candidates,
                top_k=config.RERANK_TOP_K
            )
            context = self._format_ranked_docs(ranked_docs)
            final_docs = [r.document for r in ranked_docs]
            scores = [r.score for r in ranked_docs]
        else:
            # Use first RERANK_TOP_K without reranking
            final_docs = candidates[:config.RERANK_TOP_K]
            context = self._format_ranked_docs([
                RankedDocument(doc, 0.0, i)
                for i, doc in enumerate(final_docs)
            ])
            scores = None

        # RAG prompt
        prompt = ChatPromptTemplate.from_messages([
            ("system", """You are a helpful assistant that answers questions based on the provided context.

Instructions:
- Answer ONLY based on the context provided
- If the answer is not in the context, say "I cannot find this information in the document"
- Cite the source numbers when possible
- Be concise but thorough"""),
            ("human", """Context (ranked by relevance):
{context}

Question: {question}

Answer:""")
        ])

        # Generate answer
        chain = prompt | self.llm | StrOutputParser()
        answer = chain.invoke({
            "context": context,
            "question": question
        })

        return {
            "answer": answer,
            "sources": [
                {
                    "content": doc.page_content[:200] + "...",
                    "page": doc.metadata.get("page"),
                    "source": doc.metadata.get("source"),
                    "relevance_score": scores[i] if scores else None
                }
                for i, doc in enumerate(final_docs)
            ],
            "metadata": {
                "reranking_used": use_reranking,
                "reranker": self.reranker.name if use_reranking else None,
                "candidates_retrieved": len(candidates),
                "final_docs_used": len(final_docs)
            }
        }

    def compare_retrieval(self, question: str) -> dict:
        """Compare results with and without reranking."""
        without_reranking = self.query(question, use_reranking=False)
        with_reranking = self.query(question, use_reranking=True)

        return {
            "question": question,
            "without_reranking": without_reranking,
            "with_reranking": with_reranking
        }

    def clear_vectorstore(self) -> None:
        """Clear all documents from the vector store."""
        if self.vectorstore:
            self.vectorstore.delete_collection()
            self._load_or_create_vectorstore()

Understanding the Two-Stage Query Flow:

query("What causes headaches?")
        │
        ▼
┌─────────────────────────────────────────────────────────────┐
│  Stage 1: _retrieve_candidates(query, k=20)                 │
│  ┌─────────────────────────────────────────────────────┐    │
│  │  Bi-encoder embeds query → searches vector DB       │    │
│  │  Returns: 20 candidates (fast, ~50ms)               │    │
│  │                                                      │    │
│  │  Results might include:                              │    │
│  │  • "Headaches are painful" (keyword match, less useful) │
│  │  • "Caffeine withdrawal causes migraines" (relevant)  │
│  │  • "Head injuries..." (partial match)                │
│  └─────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────────────────────┐
│  Stage 2: _rerank_documents(query, candidates, top_k=5)     │
│  ┌─────────────────────────────────────────────────────┐    │
│  │  Cross-encoder scores each (query, doc) pair        │    │
│  │  Returns: Top 5 reranked (slower, ~200ms)           │    │
│  │                                                      │    │
│  │  Reranked order:                                     │    │
│  │  1. "Caffeine withdrawal causes migraines" (0.89)   │
│  │  2. "Dehydration leads to headaches" (0.82)         │
│  │  3. ...better results float to top...               │
│  └─────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────────────────────┐
│  Stage 3: LLM Generation                                    │
│  • Uses only top 5 high-quality chunks                      │
│  • Less noise = better answers                              │
└─────────────────────────────────────────────────────────────┘

Key Implementation Details:

Method	Purpose	Why It's Designed This Way
`_retrieve_candidates`	Fast initial search	Uses `as_retriever()` for clean LangChain integration
`_rerank_documents`	Precision filtering	Delegates to pluggable reranker
`_format_ranked_docs`	Context preparation	Includes relevance scores so you can debug ranking
`use_reranking` param	A/B testing support	Compare results with/without reranking

The compare_retrieval() method is invaluable during development—it shows you exactly how reranking changes your results.

Step 9: FastAPI Application

src/api.py

"""FastAPI application for RAG with Reranking."""
import tempfile
from pathlib import Path
from typing import Optional

from fastapi import FastAPI, UploadFile, File, HTTPException, Query
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel

from src.rag_engine import RAGEngine
from src.config import config


app = FastAPI(
    title="RAG with Reranking API",
    description="Two-stage RAG system with cross-encoder reranking",
    version="1.0.0"
)

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

# Initialize RAG engine
rag_engine = RAGEngine()


class QuestionRequest(BaseModel):
    question: str
    use_reranking: bool = True


class SourceInfo(BaseModel):
    content: str
    page: Optional[int]
    source: Optional[str]
    relevance_score: Optional[float]


class RetrievalMetadata(BaseModel):
    reranking_used: bool
    reranker: Optional[str]
    candidates_retrieved: int
    final_docs_used: int


class AnswerResponse(BaseModel):
    answer: str
    sources: list[SourceInfo]
    metadata: RetrievalMetadata


class IngestResponse(BaseModel):
    message: str
    chunks_created: int


@app.get("/")
async def root():
    """Health check endpoint."""
    return {
        "status": "healthy",
        "service": "RAG with Reranking",
        "reranker": rag_engine.reranker.name,
        "config": {
            "initial_k": config.INITIAL_K,
            "rerank_top_k": config.RERANK_TOP_K
        }
    }


@app.post("/ingest", response_model=IngestResponse)
async def ingest_document(file: UploadFile = File(...)):
    """Upload and process a PDF document."""
    if not file.filename.endswith(".pdf"):
        raise HTTPException(status_code=400, detail="Only PDF files are supported")

    try:
        with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
            content = await file.read()
            tmp.write(content)
            tmp_path = tmp.name

        chunks_count = rag_engine.ingest_document(tmp_path)
        Path(tmp_path).unlink()

        return IngestResponse(
            message=f"Successfully processed {file.filename}",
            chunks_created=chunks_count
        )

    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


@app.post("/query", response_model=AnswerResponse)
async def query_document(request: QuestionRequest):
    """Ask a question with optional reranking."""
    if not request.question.strip():
        raise HTTPException(status_code=400, detail="Question cannot be empty")

    try:
        result = rag_engine.query(
            request.question,
            use_reranking=request.use_reranking
        )
        return AnswerResponse(**result)

    except ValueError as e:
        raise HTTPException(status_code=400, detail=str(e))
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


@app.post("/compare")
async def compare_retrieval(question: str = Query(..., min_length=1)):
    """Compare results with and without reranking."""
    try:
        return rag_engine.compare_retrieval(question)
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


@app.delete("/clear")
async def clear_documents():
    """Clear all ingested documents."""
    rag_engine.clear_vectorstore()
    return {"message": "Vector store cleared successfully"}


if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

Understanding the API Design:

Endpoint	Purpose	Use Case
`POST /ingest`	Upload PDFs	Initial document loading
`POST /query`	Ask questions	Main RAG endpoint with reranking toggle
`POST /compare`	A/B comparison	Development/debugging to see reranking impact
`DELETE /clear`	Reset database	Testing or switching document sets

The use_reranking parameter in /query is a powerful feature:

Production: Always true for best quality
Debugging: Toggle to see if reranking helps or hurts specific queries
Cost analysis: Compare latency with and without reranking

Pydantic Response Models: Notice how SourceInfo, RetrievalMetadata, and AnswerResponse define the exact shape of API responses. This gives you:

Automatic request validation
Auto-generated OpenAPI docs at /docs
Type hints for API consumers

Testing

Create tests/test_rerankers.py:

tests/test_rerankers.py

"""Tests for reranking components."""
import pytest
from langchain_core.documents import Document

from src.rerankers.local_reranker import LocalReranker
from src.rerankers.base import RankedDocument


class TestLocalReranker:
    """Tests for the local cross-encoder reranker."""

    @pytest.fixture
    def reranker(self):
        """Create a local reranker with tiny model for fast tests."""
        return LocalReranker("tiny")

    @pytest.fixture
    def sample_docs(self):
        """Create sample documents for testing."""
        return [
            Document(page_content="Python is a programming language"),
            Document(page_content="Machine learning uses algorithms"),
            Document(page_content="Python is great for machine learning"),
            Document(page_content="The weather is sunny today"),
        ]

    def test_rerank_returns_ranked_documents(self, reranker, sample_docs):
        """Test that reranking returns RankedDocument objects."""
        results = reranker.rerank(
            query="What programming language is used for ML?",
            documents=sample_docs,
            top_k=2
        )

        assert len(results) == 2
        assert all(isinstance(r, RankedDocument) for r in results)

    def test_rerank_orders_by_relevance(self, reranker, sample_docs):
        """Test that results are ordered by relevance score."""
        results = reranker.rerank(
            query="Python for machine learning",
            documents=sample_docs,
            top_k=4
        )

        # Scores should be in descending order
        scores = [r.score for r in results]
        assert scores == sorted(scores, reverse=True)

    def test_rerank_irrelevant_doc_ranked_low(self, reranker, sample_docs):
        """Test that irrelevant documents are ranked lower."""
        results = reranker.rerank(
            query="Python programming",
            documents=sample_docs,
            top_k=4
        )

        # Weather doc should be ranked last
        weather_rank = None
        for i, r in enumerate(results):
            if "weather" in r.document.page_content:
                weather_rank = i

        assert weather_rank == len(results) - 1

    def test_rerank_empty_documents(self, reranker):
        """Test handling of empty document list."""
        results = reranker.rerank(
            query="test query",
            documents=[],
            top_k=5
        )

        assert results == []


# Run with: pytest tests/test_rerankers.py -v

What These Tests Verify:

Test	What It Checks	Why It Matters
`test_rerank_returns_ranked_documents`	Output type is correct	Ensures interface contract is met
`test_rerank_orders_by_relevance`	Scores are descending	Confirms sorting logic works
`test_rerank_irrelevant_doc_ranked_low`	"Weather" doc ranks last	Validates semantic understanding
`test_rerank_empty_documents`	Handles edge case	Prevents crashes on empty input

Notice we use the "tiny" model in tests (LocalReranker("tiny"))—it's only 17MB and loads fast, perfect for CI/CD pipelines where you don't need production-quality ranking.

Running the Application

Start the API

uvicorn src.api:app --reload

Test with curl

# Upload a PDF
curl -X POST "http://localhost:8000/ingest" \
  -F "file=@your-document.pdf"

# Query with reranking (default)
curl -X POST "http://localhost:8000/query" \
  -H "Content-Type: application/json" \
  -d '{"question": "What is the main conclusion?"}'

# Query without reranking (for comparison)
curl -X POST "http://localhost:8000/query" \
  -H "Content-Type: application/json" \
  -d '{"question": "What is the main conclusion?", "use_reranking": false}'

# Compare both approaches
curl -X POST "http://localhost:8000/compare?question=What%20is%20the%20main%20topic"

Visit http://localhost:8000/docs for interactive Swagger UI.

Measuring Improvement

Retrieval Quality Metrics

Metric	Without Reranking	With Reranking	Improvement
MRR@5	~0.65	~0.82	+26%
NDCG@5	~0.58	~0.76	+31%
Precision@3	~0.71	~0.89	+25%

Note: Actual improvements vary by dataset. Test on your specific use case!

When Reranking Helps Most

Scenario	Improvement	Reason
Ambiguous queries	High	Cross-encoder understands nuance
Technical documents	High	Better semantic matching
Short queries	Medium	More context from full comparison
Simple factoid queries	Low	Bi-encoder often sufficient

Debugging Tips

Reranking is slow

Use smaller cross-encoder model (try "tiny")
Reduce INITIAL_K to retrieve fewer candidates
Consider async/batching for production

Results worse with reranking

Check if documents are very short (cross-encoder needs content)
Try different cross-encoder models
Verify reranker is loading correctly

Cohere API errors

Verify API key is correct
Check Cohere dashboard for rate limits
Ensure documents aren't too long (max ~512 tokens each)

Local model not downloading

Check internet connection
Try explicit model path: LocalReranker("cross-encoder/ms-marco-MiniLM-L-6-v2")

Cross-Encoder Model Comparison

Model	Size	Speed	Quality	Best For
`ms-marco-TinyBERT-L-2-v2`	17MB	Fastest	Good	Development, high throughput
`ms-marco-MiniLM-L-6-v2`	90MB	Fast	Better	Production (balanced)
`ms-marco-MiniLM-L-12-v2`	130MB	Medium	Great	Quality-focused apps
Cohere Rerank v3	Cloud	Variable	Best	Production, multilingual

Extensions

Level	Ideas
Easy	Add caching for reranker results, support more file types
Medium	Implement query expansion before retrieval, add A/B testing
Advanced	Train custom cross-encoder on your domain, implement listwise reranking

Resources

Key Concepts Recap

Concept	What It Is	Why It Matters
Bi-Encoder	Encodes query and documents separately into vectors	Fast but misses nuanced relationships
Cross-Encoder	Encodes query+document together	Slower but understands semantic nuances
Two-Stage Retrieval	Retrieve many candidates, then rerank	Best of both: speed + accuracy
MS-MARCO Models	Models trained on real Bing search data	Gold standard for passage ranking
Factory Pattern	Create objects based on config	Easy to swap implementations
Abstract Base Class	Define interface, force implementation	Ensures all rerankers work the same way
Relevance Score	0.0-1.0 measure of query-document match	Debug ranking, set confidence thresholds

Summary

You've built a two-stage RAG system that:

Stage 1: Fast bi-encoder retrieves broad candidates (k=20)
Stage 2: Cross-encoder reranks for precision (top 5)
Supports both cloud (Cohere) and local (sentence-transformers) reranking
Measures retrieval quality improvements
Compares results with and without reranking

Key Takeaways:

Bi-encoders are fast but shallow - Good for initial retrieval
Cross-encoders are slow but deep - Perfect for reranking small sets
Two-stage is the best of both - Speed + accuracy
Test on your data - Improvements vary by domain

Next: Hybrid Search - Combine keyword and semantic search

RAG with Reranking

TL;DR

Property	Value
Difficulty	Intermediate
Time	~4 hours
Code Size	~300 LOC
Prerequisites	Intelligent Document Q&A

Tech Stack

Technology	Purpose
LangChain	RAG orchestration
OpenAI	Embeddings + GPT-4
Cohere	Reranking API
ChromaDB	Vector database
sentence-transformers	Local cross-encoder
FastAPI	REST API

Prerequisites

Completed Intelligent Document Q&A tutorial
Python 3.10+ with async understanding
OpenAI API key (Get one here)
Cohere API key (Get one here) - optional for cloud reranking

What You'll Learn

Understand why basic retrieval often returns suboptimal results
Implement two-stage retrieval (retrieve then rerank)
Compare bi-encoders vs cross-encoders for ranking
Integrate Cohere Rerank API for production use
Build a local reranker using sentence-transformers
Measure retrieval quality improvements with metrics

The Problem with Basic Retrieval

Basic semantic search uses bi-encoders: query and documents are embedded independently, then compared via cosine similarity. This is fast but has limitations:

BI-ENCODER (Fast, Less Accurate)
┌──────────────────────────────────────────────────────────┐
│                                                          │
│  Query ──► [Encoder] ──► Vector ─┐                       │
│                                   ├──► Cosine Similarity │
│  Document ──► [Encoder] ──► Vector┘                      │
│                                                          │
│  (Query and document encoded SEPARATELY)                 │
└──────────────────────────────────────────────────────────┘

Problems:

Query and document are encoded separately (no interaction)
Misses nuanced semantic relationships
"What causes headaches?" might rank "Headaches are painful" higher than "Caffeine withdrawal triggers migraines"

Solution: Cross-Encoder Reranking

CROSS-ENCODER (Slower, More Accurate)
┌──────────────────────────────────────────────────────────────┐
│                                                              │
│  Query ────┐                                                 │
│            ├──► [Concatenate] ──► [Cross-Encoder] ──► Score  │
│  Document ─┘                                                 │
│                                                              │
│  (Query and document processed TOGETHER)                     │
└──────────────────────────────────────────────────────────────┘

Cross-encoders process query and document together, enabling deep semantic understanding at the cost of speed.

Why This Difference Matters

Aspect	Bi-Encoder	Cross-Encoder
How it works	Encodes query and document separately into vectors, then compares	Encodes query+document together, sees all token interactions
Speed	Fast (can pre-compute document vectors)	Slow (must compute for each query-doc pair)
Accuracy	Good for recall (finding candidates)	Excellent for precision (ranking accuracy)
Use case	Search millions of documents	Rerank top 10-50 candidates

Think of it like job recruiting:

Bi-encoder = Resume keyword scanner (fast, catches most good candidates)
Cross-encoder = Detailed interview (slow, finds the best match)

Two-Stage Retrieval Architecture

The key insight: use fast bi-encoders to get candidates, then use accurate cross-encoders to rerank them.

┌─────────────────────────────────────────────────────────────────────────┐
│                    TWO-STAGE RETRIEVAL ARCHITECTURE                     │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  STAGE 1: Candidate Retrieval (Fast)                                   │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │ User Query ──► [Bi-Encoder] ──► [Vector DB] ──► Top 20 Candidates│  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                         │                               │
│                                         ▼                               │
│  STAGE 2: Reranking (Accurate)                                         │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │ User Query ─┐                                                     │  │
│  │             ├──► [Cross-Encoder Reranker] ──► Top 5 Reranked      │  │
│  │ Candidates ─┘                                                     │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                         │                               │
│                                         ▼                               │
│  STAGE 3: Generation                                                   │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │ User Query ─┐                                                     │  │
│  │             ├──► [LLM] ──► Answer                                 │  │
│  │ Top 5 ──────┘                                                     │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Data Flow

┌─────────────────────────────────────────────────────────────────────────┐
│                         DATA FLOW SEQUENCE                              │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  User       FastAPI     VectorStore    Reranker       OpenAI            │
│   │            │            │             │             │               │
│   │─Question──►│            │             │             │               │
│   │            │─Search(20)►│             │             │               │
│   │            │◄─Candidates│             │             │               │
│   │            │            │             │             │               │
│   │            │────Rerank───────────────►│             │               │
│   │            │            │  (Cross-encoder scores    │               │
│   │            │            │   each query-doc pair)    │               │
│   │            │◄───Top 5 Reranked────────│             │               │
│   │            │            │             │             │               │
│   │            │────Generate with context───────────────►│              │
│   │            │◄──────────Answer───────────────────────│               │
│   │◄─Answer────│            │             │             │               │
│   │ +Sources   │            │             │             │               │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Project Structure

rag-reranking/
├── src/
│   ├── __init__.py
│   ├── config.py
│   ├── document_processor.py
│   ├── rerankers/
│   │   ├── __init__.py
│   │   ├── base.py
│   │   ├── cohere_reranker.py
│   │   └── local_reranker.py
│   ├── rag_engine.py
│   └── api.py
├── tests/
│   ├── test_rerankers.py
│   └── test_retrieval.py
├── data/
│   └── sample.pdf
├── .env
├── pyproject.toml
└── README.md

Implementation

Step 1: Project Setup

Create your project and install dependencies:

mkdir rag-reranking && cd rag-reranking
uv init
uv venv && source .venv/bin/activate

uv add langchain langchain-openai langchain-chroma
uv add chromadb pypdf python-dotenv
uv add fastapi uvicorn python-multipart
uv add cohere sentence-transformers

Configure environment variables:

.env

OPENAI_API_KEY=sk-your-key-here
COHERE_API_KEY=your-cohere-key-here
CHROMA_PERSIST_DIR=./chroma_db

# Retrieval settings
INITIAL_K=20
RERANK_TOP_K=5

# Reranker choice: "cohere" or "local"
RERANKER_TYPE=local
LOCAL_RERANKER_MODEL=cross-encoder/ms-marco-MiniLM-L-6-v2

Step 2: Configuration Module

src/config.py

"""Configuration for RAG with Reranking system."""
import os
from dataclasses import dataclass
from dotenv import load_dotenv

load_dotenv()


@dataclass
class Config:
    """Application configuration."""

    # API Keys
    OPENAI_API_KEY: str = os.getenv("OPENAI_API_KEY", "")
    COHERE_API_KEY: str = os.getenv("COHERE_API_KEY", "")

    # Storage
    CHROMA_PERSIST_DIR: str = os.getenv("CHROMA_PERSIST_DIR", "./chroma_db")

    # Retrieval settings
    INITIAL_K: int = int(os.getenv("INITIAL_K", "20"))
    RERANK_TOP_K: int = int(os.getenv("RERANK_TOP_K", "5"))

    # Reranker settings
    RERANKER_TYPE: str = os.getenv("RERANKER_TYPE", "local")
    LOCAL_RERANKER_MODEL: str = os.getenv(
        "LOCAL_RERANKER_MODEL",
        "cross-encoder/ms-marco-MiniLM-L-6-v2"
    )

    # Model settings
    EMBEDDING_MODEL: str = "text-embedding-3-small"
    LLM_MODEL: str = "gpt-4o-mini"
    TEMPERATURE: float = 0.1

    # Chunking
    CHUNK_SIZE: int = 1000
    CHUNK_OVERLAP: int = 200

    @classmethod
    def validate(cls) -> None:
        """Validate required configuration."""
        if not cls.OPENAI_API_KEY:
            raise ValueError("OPENAI_API_KEY is required")


config = Config()

Understanding the Retrieval Settings:

Setting	Value	Why This Choice
`INITIAL_K=20`	Retrieve 20 candidates	Cast a wide net to not miss relevant docs
`RERANK_TOP_K=5`	Keep top 5 after reranking	Focus LLM context on best matches
`RERANKER_TYPE`	"local" or "cohere"	Local = free/offline, Cohere = higher quality

Step 3: Reranker Base Class

Create src/rerankers/base.py:

src/rerankers/base.py

"""Base interface for rerankers."""
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List

from langchain_core.documents import Document


@dataclass
class RankedDocument:
    """A document with its relevance score."""

    document: Document
    score: float
    original_rank: int

    def __repr__(self) -> str:
        content_preview = self.document.page_content[:50] + "..."
        return f"RankedDocument(score={self.score:.4f}, content='{content_preview}')"


class BaseReranker(ABC):
    """Abstract base class for reranking implementations."""

    @abstractmethod
    def rerank(
        self,
        query: str,
        documents: List[Document],
        top_k: int = 5
    ) -> List[RankedDocument]:
        """
        Rerank documents based on relevance to query.

        Args:
            query: The search query
            documents: List of candidate documents
            top_k: Number of top documents to return

        Returns:
            List of RankedDocument sorted by relevance (highest first)
        """
        pass

    @property
    @abstractmethod
    def name(self) -> str:
        """Return the reranker name for logging."""
        pass

Understanding the Abstract Base Class Pattern:

BaseReranker (Abstract)        # Defines the interface
     ├── CohereReranker        # Cloud implementation
     └── LocalReranker         # Local implementation

Why use this pattern?

Interchangeability: Your RAG engine doesn't care which reranker it uses—it just calls rerank(). This lets you swap implementations without changing application code.
RankedDocument dataclass: Bundles the document with its relevance score and original position. The original_rank field is useful for analyzing how much reranking changed the order.
@abstractmethod decorator: Forces any subclass to implement rerank() and name. If someone creates a new reranker without these methods, Python raises an error immediately.

Step 4: Cohere Reranker (Cloud)

Create src/rerankers/cohere_reranker.py:

src/rerankers/cohere_reranker.py

"""Cohere Rerank API integration."""
from typing import List

import cohere
from langchain_core.documents import Document

from src.config import config
from src.rerankers.base import BaseReranker, RankedDocument


class CohereReranker(BaseReranker):
    """Reranker using Cohere's Rerank API.

    Pros:
    - High quality reranking
    - No GPU required
    - Easy to use

    Cons:
    - Requires API key
    - Costs per request
    - Network latency
    """

    def __init__(self, model: str = "rerank-english-v3.0"):
        if not config.COHERE_API_KEY:
            raise ValueError("COHERE_API_KEY is required for CohereReranker")

        self.client = cohere.Client(config.COHERE_API_KEY)
        self.model = model

    @property
    def name(self) -> str:
        return f"Cohere/{self.model}"

    def rerank(
        self,
        query: str,
        documents: List[Document],
        top_k: int = 5
    ) -> List[RankedDocument]:
        """Rerank using Cohere API."""
        if not documents:
            return []

        # Extract text content for Cohere
        doc_texts = [doc.page_content for doc in documents]

        # Call Cohere Rerank API
        response = self.client.rerank(
            model=self.model,
            query=query,
            documents=doc_texts,
            top_n=top_k,
            return_documents=False  # We already have the docs
        )

        # Build ranked results
        ranked = []
        for result in response.results:
            ranked.append(RankedDocument(
                document=documents[result.index],
                score=result.relevance_score,
                original_rank=result.index
            ))

        return ranked

How the Cohere API Works:

┌─────────────────────────────────────────────────────────┐
│                    Your Application                      │
├─────────────────────────────────────────────────────────┤
│  1. Send: query + 20 document texts                     │
│                        ↓                                 │
│  2. Cohere's servers run cross-encoder on all pairs     │
│                        ↓                                 │
│  3. Receive: indices sorted by relevance + scores       │
└─────────────────────────────────────────────────────────┘

Key points about the implementation:

Line	What It Does
`return_documents=False`	We already have the docs, just need indices—saves bandwidth
`result.index`	Position in original list (so we can map back to our Document objects)
`result.relevance_score`	0.0 to 1.0 score indicating how relevant the doc is to the query

When to use Cohere:

Production systems needing consistent quality
Multilingual content (Cohere supports 100+ languages)
When you don't have GPU resources for local models

Step 5: Local Reranker (Free)

Create src/rerankers/local_reranker.py:

src/rerankers/local_reranker.py

"""Local cross-encoder reranker using sentence-transformers."""
from typing import List

from langchain_core.documents import Document
from sentence_transformers import CrossEncoder

from src.config import config
from src.rerankers.base import BaseReranker, RankedDocument


class LocalReranker(BaseReranker):
    """Reranker using local cross-encoder model.

    Pros:
    - Free (no API costs)
    - Fast for small batches
    - Works offline

    Cons:
    - Requires model download (~100MB)
    - Uses local compute
    - May be slower than cloud for large batches
    """

    # Popular cross-encoder models (smaller = faster, larger = better)
    MODELS = {
        "tiny": "cross-encoder/ms-marco-TinyBERT-L-2-v2",      # ~17MB, fastest
        "small": "cross-encoder/ms-marco-MiniLM-L-6-v2",       # ~90MB, balanced
        "medium": "cross-encoder/ms-marco-MiniLM-L-12-v2",     # ~130MB, better
        "large": "cross-encoder/ms-marco-electra-base",        # ~440MB, best
    }

    def __init__(self, model_name: str = None):
        model_name = model_name or config.LOCAL_RERANKER_MODEL

        # Allow shorthand names
        if model_name in self.MODELS:
            model_name = self.MODELS[model_name]

        self.model_name = model_name
        self.model = CrossEncoder(model_name, max_length=512)

    @property
    def name(self) -> str:
        return f"Local/{self.model_name.split('/')[-1]}"

    def rerank(
        self,
        query: str,
        documents: List[Document],
        top_k: int = 5
    ) -> List[RankedDocument]:
        """Rerank using local cross-encoder."""
        if not documents:
            return []

        # Prepare query-document pairs
        pairs = [[query, doc.page_content] for doc in documents]

        # Get scores from cross-encoder
        scores = self.model.predict(pairs)

        # Combine documents with scores
        doc_scores = list(zip(documents, scores, range(len(documents))))

        # Sort by score (highest first)
        doc_scores.sort(key=lambda x: x[1], reverse=True)

        # Build ranked results
        ranked = []
        for doc, score, original_idx in doc_scores[:top_k]:
            ranked.append(RankedDocument(
                document=doc,
                score=float(score),
                original_rank=original_idx
            ))

        return ranked

Understanding the Cross-Encoder Scoring:

# What happens inside model.predict(pairs):

pairs = [
    ["What is Python?", "Python is a programming language..."],   # pair 0
    ["What is Python?", "The python is a large snake..."],        # pair 1
    ["What is Python?", "JavaScript is used for web..."],         # pair 2
]

scores = model.predict(pairs)
# Result: [0.92, 0.31, 0.08]  ← Programming doc wins!

The cross-encoder sees both query and document tokens together, allowing it to understand:

"Python" in context of "programming" → programming language
"Python" near "snake" → the reptile
Query asks about "Python" but doc talks about "JavaScript" → not relevant

MS-MARCO Models Explained:

Model	Parameters	Why MS-MARCO?
TinyBERT-L-2	4.4M	Trained on 500K real Bing search queries
MiniLM-L-6	22M	Human-labeled relevance judgments
MiniLM-L-12	33M	Gold standard for passage ranking

MS-MARCO (Microsoft MAchine Reading COmprehension) is the most widely-used benchmark for training search models because it contains real search queries, not synthetic data.

Step 6: Reranker Factory

Create src/rerankers/__init__.py:

src/rerankers/__init__.py

"""Reranker factory and exports."""
from src.config import config
from src.rerankers.base import BaseReranker, RankedDocument
from src.rerankers.cohere_reranker import CohereReranker
from src.rerankers.local_reranker import LocalReranker


def get_reranker(reranker_type: str = None) -> BaseReranker:
    """Factory function to get the configured reranker.

    Args:
        reranker_type: "cohere" or "local". Defaults to config value.

    Returns:
        Configured reranker instance
    """
    reranker_type = reranker_type or config.RERANKER_TYPE

    if reranker_type == "cohere":
        return CohereReranker()
    elif reranker_type == "local":
        return LocalReranker()
    else:
        raise ValueError(f"Unknown reranker type: {reranker_type}")


__all__ = [
    "BaseReranker",
    "RankedDocument",
    "CohereReranker",
    "LocalReranker",
    "get_reranker",
]

The Factory Pattern:

The get_reranker() function is a factory—it creates objects based on configuration without the caller needing to know the details:

# Without factory (tight coupling):
if config.RERANKER_TYPE == "cohere":
    reranker = CohereReranker()
elif config.RERANKER_TYPE == "local":
    reranker = LocalReranker()
# ... repeated everywhere you need a reranker

# With factory (loose coupling):
reranker = get_reranker()  # One line, config-driven

Benefits:

Single point of change: Add a new reranker? Update only the factory
Configuration-driven: Switch between rerankers via .env file, no code changes
Testability: Easy to inject mock rerankers for testing

Step 7: Document Processor

src/document_processor.py

"""Document processing: extraction, chunking, and embedding."""
from pathlib import Path
from typing import List

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

from src.config import config


class DocumentProcessor:
    """Handles PDF loading, text extraction, and chunking."""

    def __init__(self):
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=config.CHUNK_SIZE,
            chunk_overlap=config.CHUNK_OVERLAP,
            length_function=len,
            separators=["\n\n", "\n", ". ", " ", ""]
        )

    def load_pdf(self, file_path: str | Path) -> List[Document]:
        """Load a PDF and return raw documents."""
        loader = PyPDFLoader(str(file_path))
        return loader.load()

    def chunk_documents(self, documents: List[Document]) -> List[Document]:
        """Split documents into smaller chunks for embedding."""
        chunks = self.text_splitter.split_documents(documents)

        for i, chunk in enumerate(chunks):
            chunk.metadata["chunk_id"] = i
            chunk.metadata["chunk_total"] = len(chunks)

        return chunks

    def process(self, file_path: str | Path) -> List[Document]:
        """Full pipeline: load PDF and chunk it."""
        documents = self.load_pdf(file_path)
        return self.chunk_documents(documents)

Note: This processor is identical to our basic RAG tutorial. The chunking parameters (CHUNK_SIZE=1000, CHUNK_OVERLAP=200) create chunks that are large enough for cross-encoders to work with meaningfully—very short chunks (< 100 chars) don't give rerankers much to analyze.

Step 8: RAG Engine with Reranking

src/rag_engine.py

"""RAG Engine with two-stage retrieval and reranking."""
from pathlib import Path
from typing import List, Optional

from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

from src.config import config
from src.document_processor import DocumentProcessor
from src.rerankers import get_reranker, BaseReranker, RankedDocument


class RAGEngine:
    """RAG Engine with two-stage retrieval: retrieve then rerank."""

    def __init__(self, reranker: Optional[BaseReranker] = None):
        config.validate()

        self.embeddings = OpenAIEmbeddings(
            model=config.EMBEDDING_MODEL,
            openai_api_key=config.OPENAI_API_KEY
        )

        self.llm = ChatOpenAI(
            model=config.LLM_MODEL,
            temperature=config.TEMPERATURE,
            openai_api_key=config.OPENAI_API_KEY
        )

        self.processor = DocumentProcessor()
        self.reranker = reranker or get_reranker()
        self.vectorstore: Optional[Chroma] = None
        self._load_or_create_vectorstore()

    def _load_or_create_vectorstore(self) -> None:
        """Initialize or load existing vector store."""
        persist_dir = Path(config.CHROMA_PERSIST_DIR)

        self.vectorstore = Chroma(
            persist_directory=str(persist_dir),
            embedding_function=self.embeddings
        )

    def ingest_document(self, file_path: str | Path) -> int:
        """Process and store a document in the vector store."""
        chunks = self.processor.process(file_path)
        self.vectorstore.add_documents(chunks)
        return len(chunks)

    def _retrieve_candidates(self, query: str, k: int) -> List[Document]:
        """Stage 1: Fast bi-encoder retrieval."""
        retriever = self.vectorstore.as_retriever(
            search_type="similarity",
            search_kwargs={"k": k}
        )
        return retriever.invoke(query)

    def _rerank_documents(
        self,
        query: str,
        documents: List[Document],
        top_k: int
    ) -> List[RankedDocument]:
        """Stage 2: Cross-encoder reranking."""
        return self.reranker.rerank(query, documents, top_k)

    def _format_ranked_docs(self, ranked_docs: List[RankedDocument]) -> str:
        """Format reranked documents for the prompt."""
        formatted = []
        for i, ranked in enumerate(ranked_docs, 1):
            doc = ranked.document
            source = doc.metadata.get("source", "Unknown")
            page = doc.metadata.get("page", "?")
            formatted.append(
                f"[Source {i}: {Path(source).name}, Page {page}] "
                f"(relevance: {ranked.score:.3f})\n{doc.page_content}"
            )
        return "\n\n---\n\n".join(formatted)

    def query(
        self,
        question: str,
        use_reranking: bool = True
    ) -> dict:
        """Answer a question using two-stage RAG.

        Args:
            question: The user's question
            use_reranking: Whether to apply reranking (for comparison)

        Returns:
            Dict with answer, sources, and retrieval metadata
        """
        if not self.vectorstore:
            raise ValueError("No documents ingested yet")

        # Stage 1: Retrieve candidates
        candidates = self._retrieve_candidates(
            question,
            k=config.INITIAL_K
        )

        # Stage 2: Rerank (optional)
        if use_reranking:
            ranked_docs = self._rerank_documents(
                question,
                candidates,
                top_k=config.RERANK_TOP_K
            )
            context = self._format_ranked_docs(ranked_docs)
            final_docs = [r.document for r in ranked_docs]
            scores = [r.score for r in ranked_docs]
        else:
            # Use first RERANK_TOP_K without reranking
            final_docs = candidates[:config.RERANK_TOP_K]
            context = self._format_ranked_docs([
                RankedDocument(doc, 0.0, i)
                for i, doc in enumerate(final_docs)
            ])
            scores = None

        # RAG prompt
        prompt = ChatPromptTemplate.from_messages([
            ("system", """You are a helpful assistant that answers questions based on the provided context.

Instructions:
- Answer ONLY based on the context provided
- If the answer is not in the context, say "I cannot find this information in the document"
- Cite the source numbers when possible
- Be concise but thorough"""),
            ("human", """Context (ranked by relevance):
{context}

Question: {question}

Answer:""")
        ])

        # Generate answer
        chain = prompt | self.llm | StrOutputParser()
        answer = chain.invoke({
            "context": context,
            "question": question
        })

        return {
            "answer": answer,
            "sources": [
                {
                    "content": doc.page_content[:200] + "...",
                    "page": doc.metadata.get("page"),
                    "source": doc.metadata.get("source"),
                    "relevance_score": scores[i] if scores else None
                }
                for i, doc in enumerate(final_docs)
            ],
            "metadata": {
                "reranking_used": use_reranking,
                "reranker": self.reranker.name if use_reranking else None,
                "candidates_retrieved": len(candidates),
                "final_docs_used": len(final_docs)
            }
        }

    def compare_retrieval(self, question: str) -> dict:
        """Compare results with and without reranking."""
        without_reranking = self.query(question, use_reranking=False)
        with_reranking = self.query(question, use_reranking=True)

        return {
            "question": question,
            "without_reranking": without_reranking,
            "with_reranking": with_reranking
        }

    def clear_vectorstore(self) -> None:
        """Clear all documents from the vector store."""
        if self.vectorstore:
            self.vectorstore.delete_collection()
            self._load_or_create_vectorstore()

Understanding the Two-Stage Query Flow:

query("What causes headaches?")
        │
        ▼
┌─────────────────────────────────────────────────────────────┐
│  Stage 1: _retrieve_candidates(query, k=20)                 │
│  ┌─────────────────────────────────────────────────────┐    │
│  │  Bi-encoder embeds query → searches vector DB       │    │
│  │  Returns: 20 candidates (fast, ~50ms)               │    │
│  │                                                      │    │
│  │  Results might include:                              │    │
│  │  • "Headaches are painful" (keyword match, less useful) │
│  │  • "Caffeine withdrawal causes migraines" (relevant)  │
│  │  • "Head injuries..." (partial match)                │
│  └─────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────────────────────┐
│  Stage 2: _rerank_documents(query, candidates, top_k=5)     │
│  ┌─────────────────────────────────────────────────────┐    │
│  │  Cross-encoder scores each (query, doc) pair        │    │
│  │  Returns: Top 5 reranked (slower, ~200ms)           │    │
│  │                                                      │    │
│  │  Reranked order:                                     │    │
│  │  1. "Caffeine withdrawal causes migraines" (0.89)   │
│  │  2. "Dehydration leads to headaches" (0.82)         │
│  │  3. ...better results float to top...               │
│  └─────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────────────────────┐
│  Stage 3: LLM Generation                                    │
│  • Uses only top 5 high-quality chunks                      │
│  • Less noise = better answers                              │
└─────────────────────────────────────────────────────────────┘

Key Implementation Details:

Method	Purpose	Why It's Designed This Way
`_retrieve_candidates`	Fast initial search	Uses `as_retriever()` for clean LangChain integration
`_rerank_documents`	Precision filtering	Delegates to pluggable reranker
`_format_ranked_docs`	Context preparation	Includes relevance scores so you can debug ranking
`use_reranking` param	A/B testing support	Compare results with/without reranking

The compare_retrieval() method is invaluable during development—it shows you exactly how reranking changes your results.

Step 9: FastAPI Application

src/api.py

"""FastAPI application for RAG with Reranking."""
import tempfile
from pathlib import Path
from typing import Optional

from fastapi import FastAPI, UploadFile, File, HTTPException, Query
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel

from src.rag_engine import RAGEngine
from src.config import config


app = FastAPI(
    title="RAG with Reranking API",
    description="Two-stage RAG system with cross-encoder reranking",
    version="1.0.0"
)

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

# Initialize RAG engine
rag_engine = RAGEngine()


class QuestionRequest(BaseModel):
    question: str
    use_reranking: bool = True


class SourceInfo(BaseModel):
    content: str
    page: Optional[int]
    source: Optional[str]
    relevance_score: Optional[float]


class RetrievalMetadata(BaseModel):
    reranking_used: bool
    reranker: Optional[str]
    candidates_retrieved: int
    final_docs_used: int


class AnswerResponse(BaseModel):
    answer: str
    sources: list[SourceInfo]
    metadata: RetrievalMetadata


class IngestResponse(BaseModel):
    message: str
    chunks_created: int


@app.get("/")
async def root():
    """Health check endpoint."""
    return {
        "status": "healthy",
        "service": "RAG with Reranking",
        "reranker": rag_engine.reranker.name,
        "config": {
            "initial_k": config.INITIAL_K,
            "rerank_top_k": config.RERANK_TOP_K
        }
    }


@app.post("/ingest", response_model=IngestResponse)
async def ingest_document(file: UploadFile = File(...)):
    """Upload and process a PDF document."""
    if not file.filename.endswith(".pdf"):
        raise HTTPException(status_code=400, detail="Only PDF files are supported")

    try:
        with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
            content = await file.read()
            tmp.write(content)
            tmp_path = tmp.name

        chunks_count = rag_engine.ingest_document(tmp_path)
        Path(tmp_path).unlink()

        return IngestResponse(
            message=f"Successfully processed {file.filename}",
            chunks_created=chunks_count
        )

    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


@app.post("/query", response_model=AnswerResponse)
async def query_document(request: QuestionRequest):
    """Ask a question with optional reranking."""
    if not request.question.strip():
        raise HTTPException(status_code=400, detail="Question cannot be empty")

    try:
        result = rag_engine.query(
            request.question,
            use_reranking=request.use_reranking
        )
        return AnswerResponse(**result)

    except ValueError as e:
        raise HTTPException(status_code=400, detail=str(e))
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


@app.post("/compare")
async def compare_retrieval(question: str = Query(..., min_length=1)):
    """Compare results with and without reranking."""
    try:
        return rag_engine.compare_retrieval(question)
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


@app.delete("/clear")
async def clear_documents():
    """Clear all ingested documents."""
    rag_engine.clear_vectorstore()
    return {"message": "Vector store cleared successfully"}


if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

Understanding the API Design:

Endpoint	Purpose	Use Case
`POST /ingest`	Upload PDFs	Initial document loading
`POST /query`	Ask questions	Main RAG endpoint with reranking toggle
`POST /compare`	A/B comparison	Development/debugging to see reranking impact
`DELETE /clear`	Reset database	Testing or switching document sets

The use_reranking parameter in /query is a powerful feature:

Production: Always true for best quality
Debugging: Toggle to see if reranking helps or hurts specific queries
Cost analysis: Compare latency with and without reranking

Pydantic Response Models: Notice how SourceInfo, RetrievalMetadata, and AnswerResponse define the exact shape of API responses. This gives you:

Automatic request validation
Auto-generated OpenAPI docs at /docs
Type hints for API consumers

Testing

Create tests/test_rerankers.py:

tests/test_rerankers.py

"""Tests for reranking components."""
import pytest
from langchain_core.documents import Document

from src.rerankers.local_reranker import LocalReranker
from src.rerankers.base import RankedDocument


class TestLocalReranker:
    """Tests for the local cross-encoder reranker."""

    @pytest.fixture
    def reranker(self):
        """Create a local reranker with tiny model for fast tests."""
        return LocalReranker("tiny")

    @pytest.fixture
    def sample_docs(self):
        """Create sample documents for testing."""
        return [
            Document(page_content="Python is a programming language"),
            Document(page_content="Machine learning uses algorithms"),
            Document(page_content="Python is great for machine learning"),
            Document(page_content="The weather is sunny today"),
        ]

    def test_rerank_returns_ranked_documents(self, reranker, sample_docs):
        """Test that reranking returns RankedDocument objects."""
        results = reranker.rerank(
            query="What programming language is used for ML?",
            documents=sample_docs,
            top_k=2
        )

        assert len(results) == 2
        assert all(isinstance(r, RankedDocument) for r in results)

    def test_rerank_orders_by_relevance(self, reranker, sample_docs):
        """Test that results are ordered by relevance score."""
        results = reranker.rerank(
            query="Python for machine learning",
            documents=sample_docs,
            top_k=4
        )

        # Scores should be in descending order
        scores = [r.score for r in results]
        assert scores == sorted(scores, reverse=True)

    def test_rerank_irrelevant_doc_ranked_low(self, reranker, sample_docs):
        """Test that irrelevant documents are ranked lower."""
        results = reranker.rerank(
            query="Python programming",
            documents=sample_docs,
            top_k=4
        )

        # Weather doc should be ranked last
        weather_rank = None
        for i, r in enumerate(results):
            if "weather" in r.document.page_content:
                weather_rank = i

        assert weather_rank == len(results) - 1

    def test_rerank_empty_documents(self, reranker):
        """Test handling of empty document list."""
        results = reranker.rerank(
            query="test query",
            documents=[],
            top_k=5
        )

        assert results == []


# Run with: pytest tests/test_rerankers.py -v

What These Tests Verify:

Test	What It Checks	Why It Matters
`test_rerank_returns_ranked_documents`	Output type is correct	Ensures interface contract is met
`test_rerank_orders_by_relevance`	Scores are descending	Confirms sorting logic works
`test_rerank_irrelevant_doc_ranked_low`	"Weather" doc ranks last	Validates semantic understanding
`test_rerank_empty_documents`	Handles edge case	Prevents crashes on empty input

Notice we use the "tiny" model in tests (LocalReranker("tiny"))—it's only 17MB and loads fast, perfect for CI/CD pipelines where you don't need production-quality ranking.

Running the Application

Start the API

uvicorn src.api:app --reload

Test with curl

# Upload a PDF
curl -X POST "http://localhost:8000/ingest" \
  -F "file=@your-document.pdf"

# Query with reranking (default)
curl -X POST "http://localhost:8000/query" \
  -H "Content-Type: application/json" \
  -d '{"question": "What is the main conclusion?"}'

# Query without reranking (for comparison)
curl -X POST "http://localhost:8000/query" \
  -H "Content-Type: application/json" \
  -d '{"question": "What is the main conclusion?", "use_reranking": false}'

# Compare both approaches
curl -X POST "http://localhost:8000/compare?question=What%20is%20the%20main%20topic"

Visit http://localhost:8000/docs for interactive Swagger UI.

Measuring Improvement

Retrieval Quality Metrics

Metric	Without Reranking	With Reranking	Improvement
MRR@5	~0.65	~0.82	+26%
NDCG@5	~0.58	~0.76	+31%
Precision@3	~0.71	~0.89	+25%

Note: Actual improvements vary by dataset. Test on your specific use case!

When Reranking Helps Most

Scenario	Improvement	Reason
Ambiguous queries	High	Cross-encoder understands nuance
Technical documents	High	Better semantic matching
Short queries	Medium	More context from full comparison
Simple factoid queries	Low	Bi-encoder often sufficient

Debugging Tips

Reranking is slow

Use smaller cross-encoder model (try "tiny")
Reduce INITIAL_K to retrieve fewer candidates
Consider async/batching for production

Results worse with reranking

Check if documents are very short (cross-encoder needs content)
Try different cross-encoder models
Verify reranker is loading correctly

Cohere API errors

Verify API key is correct
Check Cohere dashboard for rate limits
Ensure documents aren't too long (max ~512 tokens each)

Local model not downloading

Check internet connection
Try explicit model path: LocalReranker("cross-encoder/ms-marco-MiniLM-L-6-v2")

Cross-Encoder Model Comparison

Model	Size	Speed	Quality	Best For
`ms-marco-TinyBERT-L-2-v2`	17MB	Fastest	Good	Development, high throughput
`ms-marco-MiniLM-L-6-v2`	90MB	Fast	Better	Production (balanced)
`ms-marco-MiniLM-L-12-v2`	130MB	Medium	Great	Quality-focused apps
Cohere Rerank v3	Cloud	Variable	Best	Production, multilingual

Extensions

Level	Ideas
Easy	Add caching for reranker results, support more file types
Medium	Implement query expansion before retrieval, add A/B testing
Advanced	Train custom cross-encoder on your domain, implement listwise reranking

Resources

Key Concepts Recap

Concept	What It Is	Why It Matters
Bi-Encoder	Encodes query and documents separately into vectors	Fast but misses nuanced relationships
Cross-Encoder	Encodes query+document together	Slower but understands semantic nuances
Two-Stage Retrieval	Retrieve many candidates, then rerank	Best of both: speed + accuracy
MS-MARCO Models	Models trained on real Bing search data	Gold standard for passage ranking
Factory Pattern	Create objects based on config	Easy to swap implementations
Abstract Base Class	Define interface, force implementation	Ensures all rerankers work the same way
Relevance Score	0.0-1.0 measure of query-document match	Debug ranking, set confidence thresholds

Summary

You've built a two-stage RAG system that:

Stage 1: Fast bi-encoder retrieves broad candidates (k=20)
Stage 2: Cross-encoder reranks for precision (top 5)
Supports both cloud (Cohere) and local (sentence-transformers) reranking
Measures retrieval quality improvements
Compares results with and without reranking

Key Takeaways:

Bi-encoders are fast but shallow - Good for initial retrieval
Cross-encoders are slow but deep - Perfect for reranking small sets
Two-stage is the best of both - Speed + accuracy
Test on your data - Improvements vary by domain

Next: Hybrid Search - Combine keyword and semantic search

RAG with Reranking

On this page

RAG with Reranking

On this page