Intelligent Document Q&A System
Build a complete RAG system that answers questions from any PDF document
TL;DR
Build a RAG (Retrieval-Augmented Generation) system that lets users ask questions about PDF documents. Uses LangChain for orchestration, ChromaDB for vector storage, and OpenAI for embeddings and generation. Upload a PDF, ask questions, get answers with source citations.
| Property | Value |
|---|---|
| Difficulty | Beginner |
| Time | ~2 hours |
| Code Size | ~150 LOC |
Tech Stack
| Technology | Purpose |
|---|---|
| LangChain | RAG orchestration |
| OpenAI | Embeddings + GPT-4o-mini |
| ChromaDB | Vector database |
| PyPDF | PDF extraction |
| FastAPI | REST API |
Prerequisites
- Basic Python knowledge
- Understanding of APIs
- OpenAI API key (create one at platform.openai.com)
What You'll Learn
- Understand the RAG architecture pattern
- Extract and chunk text from PDF documents
- Create embeddings and store them in a vector database
- Implement semantic search retrieval
- Build a complete question-answering pipeline
- Create a REST API for your RAG system
Why RAG?
LLMs have three fundamental limitations that RAG solves:
| Limitation | What Happens Without RAG | How RAG Fixes It |
|---|---|---|
| No private data | LLM cannot answer questions about your company docs, PDFs, or databases | RAG retrieves from your documents at query time |
| Outdated knowledge | LLM's training data has a cutoff date -- it does not know recent events | RAG pulls from up-to-date document stores |
| Hallucination | LLM confidently fabricates plausible-sounding answers | RAG grounds answers in actual source documents with citations |
RAG is the most widely deployed pattern for adding external knowledge to LLMs because it requires no model training, works with any LLM, and can be updated instantly by adding new documents.
Understanding RAG: The Core Concept
Before diving into code, let's understand what we're building and why.
What is RAG?
RAG (Retrieval-Augmented Generation) solves a fundamental problem: LLMs only know what they were trained on. They can't answer questions about your private documents, recent events, or company data.
RAG fixes this by:
- Storing your documents in a searchable database
- Finding relevant pieces when a question is asked
- Feeding those pieces to the LLM as context
- Generating an answer based on your actual data
Why Not Just Paste Documents into ChatGPT?
You might wonder: "Why build all this? Can't I just paste my document into ChatGPT?"
| Approach | Trade-off |
|---|---|
| Copy-paste | Context window limits (even a 128K-token window tops out at a few hundred pages) |
| Copy-paste | Expensive - you pay for every token, every question |
| Copy-paste | No persistence - start over each conversation |
| RAG | Handles unlimited documents |
| RAG | Only retrieves what's relevant (cheaper) |
| RAG | Documents stored permanently |
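To make the cost point concrete, here is a rough back-of-the-envelope comparison. The token counts and the gpt-4o-mini price are illustrative assumptions, not measurements from this project, so check current OpenAI pricing before relying on them:

# Rough, illustrative numbers only -- verify against current OpenAI pricing.
PRICE_PER_INPUT_TOKEN = 0.15 / 1_000_000   # assumed gpt-4o-mini input price, USD per token

paste_everything = 50_000   # assume a ~100-page PDF is on the order of 50,000 input tokens
rag_retrieval = 4 * 250     # 4 retrieved chunks of ~1,000 characters (~250 tokens each)

print(f"Paste the whole document: ${paste_everything * PRICE_PER_INPUT_TOKEN:.4f} per question")
print(f"RAG with 4 chunks:        ${rag_retrieval * PRICE_PER_INPUT_TOKEN:.6f} per question")
# Roughly a 50x difference in input-token cost, and the gap grows with document size.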
System Architecture
Here's the big picture of what we're building:
(Diagram: RAG Architecture, showing a Processing Pipeline that runs once per document and a Query Pipeline that runs for each question.)
The Two Phases
Phase 1: Ingestion (happens once per document)
- Extract text from PDF
- Split into smaller chunks (why? explained below)
- Convert chunks to vectors (numbers that capture meaning)
- Store vectors in a database
Phase 2: Query (happens each time user asks)
- Convert the question to a vector
- Find chunks with similar vectors (semantic search)
- Send question + relevant chunks to LLM
- Return the answer with sources
Data Flow
These sequence diagrams show exactly what happens when you upload a document and then ask a question:
(Sequence diagrams: Document Ingestion Flow and Question Answering Flow.)
Project Structure
intelligent-doc-qa/
├── src/
│ ├── __init__.py
│ ├── config.py # Settings and environment variables
│ ├── document_processor.py # PDF loading and chunking
│ ├── rag_engine.py # Core RAG logic
│ └── api.py # REST API endpoints
├── tests/
│ └── test_rag.py
├── data/
│ └── sample.pdf
├── .env # Your API keys (never commit this!)
├── pyproject.toml
└── README.md
Implementation
Step 1: Project Setup
Create your project directory and set up the environment:
mkdir intelligent-doc-qa && cd intelligent-doc-qa
uv init
uv venv && source .venv/bin/activate
Install dependencies:
uv add langchain langchain-openai langchain-chroma
uv add chromadb pypdf python-dotenv
uv add fastapi uvicorn python-multipart
Create your .env file:
OPENAI_API_KEY=sk-your-key-here
CHROMA_PERSIST_DIR=./chroma_db
CHUNK_SIZE=1000
CHUNK_OVERLAP=200
Step 2: Configuration Module
Create src/config.py to manage environment variables:
"""Configuration management for the RAG system."""
import os
from dotenv import load_dotenv
load_dotenv()
class Config:
"""Application configuration from environment variables."""
OPENAI_API_KEY: str = os.getenv("OPENAI_API_KEY", "")
CHROMA_PERSIST_DIR: str = os.getenv("CHROMA_PERSIST_DIR", "./chroma_db")
CHUNK_SIZE: int = int(os.getenv("CHUNK_SIZE", "1000"))
CHUNK_OVERLAP: int = int(os.getenv("CHUNK_OVERLAP", "200"))
# Model settings
EMBEDDING_MODEL: str = "text-embedding-3-small"
LLM_MODEL: str = "gpt-4o-mini"
TEMPERATURE: float = 0.1
# Retrieval settings
TOP_K: int = 4
@classmethod
def validate(cls) -> None:
"""Validate required configuration."""
if not cls.OPENAI_API_KEY:
raise ValueError("OPENAI_API_KEY is required")
config = Config()
What's Happening Here?
This module centralizes all configuration in one place. Here's why each setting matters:
| Setting | Value | Why This Value? |
|---|---|---|
| `CHUNK_SIZE` | 1000 | Characters per chunk. Too small = fragments lose meaning. Too large = irrelevant content mixed in. 1000 is a good starting point. |
| `CHUNK_OVERLAP` | 200 | Characters shared between chunks. Prevents cutting sentences in half. 20% overlap is standard. |
| `EMBEDDING_MODEL` | text-embedding-3-small | OpenAI's small, low-cost embedding model. Good quality at $0.02/1M tokens. |
| `LLM_MODEL` | gpt-4o-mini | Fast, cheap, capable. Use gpt-4o for complex reasoning. |
| `TEMPERATURE` | 0.1 | Low = more deterministic answers. For Q&A, we want consistent, factual responses. |
| `TOP_K` | 4 | Number of chunks to retrieve. 4 gives enough context without overwhelming the LLM. |
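A quick sanity check on how these defaults combine, using the common rule of thumb of roughly four characters per English token (an approximation, not an exact count):

# How much context do the defaults send to the LLM per question?
CHUNK_SIZE = 1000      # characters per chunk
TOP_K = 4              # chunks retrieved per question

context_chars = CHUNK_SIZE * TOP_K
context_tokens = context_chars // 4    # rough rule of thumb: ~4 characters per token

print(context_chars, "characters ~", context_tokens, "tokens of retrieved context")
# About 4,000 characters (~1,000 tokens): enough to answer most questions without drowning the model.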
Why use a Config class?
Centralizing configuration makes it easy to:
- Change settings without editing multiple files
- Validate required values at startup
- Switch between development/production settings
Step 3: Document Processor
Create src/document_processor.py to handle PDF processing:
"""Document processing: extraction, chunking, and embedding."""
from pathlib import Path
from typing import List
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
from src.config import config
class DocumentProcessor:
"""Handles PDF loading, text extraction, and chunking."""
def __init__(self):
self.text_splitter = RecursiveCharacterTextSplitter(
chunk_size=config.CHUNK_SIZE,
chunk_overlap=config.CHUNK_OVERLAP,
length_function=len,
separators=["\n\n", "\n", ". ", " ", ""]
)
def load_pdf(self, file_path: str | Path) -> List[Document]:
"""Load a PDF and return raw documents."""
loader = PyPDFLoader(str(file_path))
return loader.load()
def chunk_documents(self, documents: List[Document]) -> List[Document]:
"""Split documents into smaller chunks for embedding."""
chunks = self.text_splitter.split_documents(documents)
# Add chunk metadata
for i, chunk in enumerate(chunks):
chunk.metadata["chunk_id"] = i
chunk.metadata["chunk_total"] = len(chunks)
return chunks
def process(self, file_path: str | Path) -> List[Document]:
"""Full pipeline: load PDF and chunk it."""
documents = self.load_pdf(file_path)
return self.chunk_documents(documents)
Understanding Chunking: The Heart of RAG
Chunking is one of the most important decisions in a RAG system. Let's break down what's happening:
Why do we chunk documents?
Imagine you have a 100-page PDF. When a user asks "What was the revenue in Q3?", you don't want to send all 100 pages to the LLM - that's expensive and the relevant info gets lost in noise.
Instead, we:
- Split the document into small pieces (chunks)
- Find only the chunks relevant to the question
- Send just those chunks to the LLM
How RecursiveCharacterTextSplitter works:
separators=["\n\n", "\n", ". ", " ", ""]
This tells the splitter to try splitting in this order:
- First, try to split on double newlines (paragraphs) - keeps paragraphs together
- If chunk is still too big, split on single newlines
- Then on sentences (". ")
- Then on words (" ")
- Finally, on characters (last resort)
This "recursive" approach preserves meaning better than naive splitting.
Visual example: splitting a 2,500-character document with chunk_size=1000 and overlap=200 yields roughly three overlapping chunks.
The overlap ensures that if important information spans a chunk boundary, it appears in both chunks.
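If you want to see the boundaries for yourself, this short standalone sketch runs the same splitter settings over a synthetic 2,500-character string (the text is made up purely for illustration):

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Synthetic stand-in for a real document: ~2,500 characters of repeated sentences.
text = ("Quarterly revenue grew steadily across all regions. " * 50)[:2500]

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""],
)

chunks = splitter.split_text(text)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {len(chunk)} chars, ends with {chunk[-30:]!r}")
# Expect roughly three chunks; the end of one chunk reappears at the start of the next
# because of the 200-character overlap.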
Step 4: RAG Engine
Create src/rag_engine.py - the core of the system:
"""RAG Engine: Vector store management and question answering."""
from pathlib import Path
from typing import List, Optional
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from src.config import config
from src.document_processor import DocumentProcessor
class RAGEngine:
"""Manages vector storage and RAG-based question answering."""
def __init__(self):
config.validate()
self.embeddings = OpenAIEmbeddings(
model=config.EMBEDDING_MODEL,
openai_api_key=config.OPENAI_API_KEY
)
self.llm = ChatOpenAI(
model=config.LLM_MODEL,
temperature=config.TEMPERATURE,
openai_api_key=config.OPENAI_API_KEY
)
self.processor = DocumentProcessor()
self.vectorstore: Optional[Chroma] = None
self._load_or_create_vectorstore()
def _load_or_create_vectorstore(self) -> None:
"""Initialize or load existing vector store."""
persist_dir = Path(config.CHROMA_PERSIST_DIR)
self.vectorstore = Chroma(
persist_directory=str(persist_dir),
embedding_function=self.embeddings
)
def ingest_document(self, file_path: str | Path) -> int:
"""Process and store a document in the vector store."""
chunks = self.processor.process(file_path)
self.vectorstore.add_documents(chunks)
return len(chunks)
def _format_docs(self, docs: List[Document]) -> str:
"""Format retrieved documents for the prompt."""
formatted = []
for i, doc in enumerate(docs, 1):
source = doc.metadata.get("source", "Unknown")
page = doc.metadata.get("page", "?")
formatted.append(
f"[Source {i}: {Path(source).name}, Page {page}]\n{doc.page_content}"
)
return "\n\n---\n\n".join(formatted)
def query(self, question: str) -> dict:
"""Answer a question using RAG."""
if not self.vectorstore:
raise ValueError("No documents ingested yet")
# Retrieve relevant chunks
retriever = self.vectorstore.as_retriever(
search_type="similarity",
search_kwargs={"k": config.TOP_K}
)
# RAG prompt template
prompt = ChatPromptTemplate.from_messages([
("system", """You are a helpful assistant that answers questions based on the provided context.
Instructions:
- Answer ONLY based on the context provided
- If the answer is not in the context, say "I cannot find this information in the document"
- Cite the source numbers when possible
- Be concise but thorough"""),
("human", """Context:
{context}
Question: {question}
Answer:""")
])
# Build the RAG chain
chain = (
{"context": retriever | self._format_docs, "question": RunnablePassthrough()}
| prompt
| self.llm
| StrOutputParser()
)
# Execute and get results
answer = chain.invoke(question)
source_docs = retriever.invoke(question)
return {
"answer": answer,
"sources": [
{
"content": doc.page_content[:200] + "...",
"page": doc.metadata.get("page"),
"source": doc.metadata.get("source")
}
for doc in source_docs
]
}
def clear_vectorstore(self) -> None:
"""Clear all documents from the vector store."""
if self.vectorstore:
self.vectorstore.delete_collection()
self._load_or_create_vectorstore()
Deep Dive: How the RAG Engine Works
This is the heart of the system. Let's understand each component:
1. Embeddings: Converting Text to Numbers
self.embeddings = OpenAIEmbeddings(
model=config.EMBEDDING_MODEL,
openai_api_key=config.OPENAI_API_KEY
)
What are embeddings?
Embeddings convert text into vectors (lists of numbers) that capture semantic meaning. Similar texts have similar vectors.
(Diagram: How Embeddings Capture Meaning. "The cat sat on the mat" and "A feline rested on a rug" map to nearby vectors, while "Stock prices fell today" lands far away from both.)
This is how semantic search works - we find chunks whose vectors are close to the question's vector.
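You can verify this intuition in a few lines. The sketch below embeds the three example sentences and compares them with cosine similarity; it requires OPENAI_API_KEY in the environment, and the exact scores will vary:

from math import sqrt
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors: close to 1.0 means very similar meaning."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

cat = embeddings.embed_query("The cat sat on the mat")
feline = embeddings.embed_query("A feline rested on a rug")
stocks = embeddings.embed_query("Stock prices fell today")

print("cat vs feline:", cosine(cat, feline))   # high: same meaning, different words
print("cat vs stocks:", cosine(cat, stocks))   # noticeably lower: unrelated meaning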
2. Vector Store: ChromaDB
self.vectorstore = Chroma(
persist_directory=str(persist_dir),
embedding_function=self.embeddings
)
ChromaDB stores our embeddings and enables fast similarity search. When you call add_documents(), it does the following (a short round-trip sketch follows the list):
- Converts each chunk to a vector using the embedding function
- Stores the vector + original text + metadata
- Builds an index for fast searching
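Here is a minimal round-trip against an ephemeral in-memory Chroma store, separate from the project's persisted one. The document texts are invented purely for illustration, and the snippet needs OPENAI_API_KEY:

from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

# No persist_directory: this store lives only for the duration of the script.
store = Chroma(embedding_function=OpenAIEmbeddings(model="text-embedding-3-small"))

store.add_documents([
    Document(page_content="Q3 revenue was 4.2 million, up 12% year over year.", metadata={"page": 7}),
    Document(page_content="The company was founded in 2015.", metadata={"page": 1}),
])

hits = store.similarity_search("What was the revenue in Q3?", k=1)
print(hits[0].page_content)   # the revenue chunk should rank first
print(hits[0].metadata)       # metadata travels with each chunk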
3. The Retriever
retriever = self.vectorstore.as_retriever(
search_type="similarity",
search_kwargs={"k": config.TOP_K}
)
The retriever wraps the vector store and provides a simple interface:
- Input: A question (string)
- Output: Top K most similar chunks
How similarity search works: the question is embedded into a vector, that vector is compared against every stored chunk vector, and the K closest chunks are returned. (Diagram: Similarity Search Process.)
4. The RAG Chain (LangChain Expression Language)
chain = (
{"context": retriever | self._format_docs, "question": RunnablePassthrough()}
| prompt
| self.llm
| StrOutputParser()
)
This is LangChain's way of building pipelines. Let's break it down:
(Diagram: RAG Chain Execution.)
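In plain steps, the chain does roughly the following. This sketch reuses the retriever and prompt built inside query() plus the engine's llm and _format_docs; it is an unrolled view of the pipeline, not code you need to add:

question = "What was the revenue in Q3?"     # example input

docs = retriever.invoke(question)            # 1. RunnablePassthrough forwards the question; the
                                             #    retriever fetches the TOP_K most similar chunks
context = self._format_docs(docs)            # 2. chunks are labelled with source/page and joined
messages = prompt.invoke(                    # 3. the template is filled with context + question
    {"context": context, "question": question}
)
response = self.llm.invoke(messages)         # 4. the chat model generates the grounded answer
answer = response.content                    # 5. StrOutputParser simply extracts this text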
5. The Prompt Template
prompt = ChatPromptTemplate.from_messages([
("system", """You are a helpful assistant that answers questions based on the provided context.
Instructions:
- Answer ONLY based on the context provided
- If the answer is not in the context, say "I cannot find this information in the document"
- Cite the source numbers when possible
- Be concise but thorough"""),
("human", """Context:
{context}
Question: {question}
Answer:""")
])
This prompt is critical for RAG quality. Notice:
- We explicitly tell the LLM to only use the context (prevents hallucination)
- We tell it to admit when information isn't available
- We ask for citations (source numbers)
Prompt Engineering Tip
The instruction "If the answer is not in the context, say I cannot find this information" is crucial. Without it, LLMs often hallucinate answers that sound plausible but aren't in your documents.
Step 5: FastAPI Application
Create src/api.py for the REST API:
"""FastAPI application for the RAG system."""
import tempfile
from pathlib import Path
from fastapi import FastAPI, UploadFile, File, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel
from src.rag_engine import RAGEngine
app = FastAPI(
title="Intelligent Document Q&A API",
description="RAG-powered document question answering system",
version="1.0.0"
)
app.add_middleware(
CORSMiddleware,
allow_origins=["*"],
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
# Initialize RAG engine
rag_engine = RAGEngine()
class QuestionRequest(BaseModel):
question: str
class AnswerResponse(BaseModel):
answer: str
sources: list
class IngestResponse(BaseModel):
message: str
chunks_created: int
@app.get("/")
async def root():
"""Health check endpoint."""
return {"status": "healthy", "service": "Intelligent Document Q&A"}
@app.post("/ingest", response_model=IngestResponse)
async def ingest_document(file: UploadFile = File(...)):
"""Upload and process a PDF document."""
if not file.filename.endswith(".pdf"):
raise HTTPException(status_code=400, detail="Only PDF files are supported")
try:
with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
content = await file.read()
tmp.write(content)
tmp_path = tmp.name
chunks_count = rag_engine.ingest_document(tmp_path)
Path(tmp_path).unlink()
return IngestResponse(
message=f"Successfully processed {file.filename}",
chunks_created=chunks_count
)
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))
@app.post("/query", response_model=AnswerResponse)
async def query_document(request: QuestionRequest):
"""Ask a question about the ingested documents."""
if not request.question.strip():
raise HTTPException(status_code=400, detail="Question cannot be empty")
try:
result = rag_engine.query(request.question)
return AnswerResponse(**result)
except ValueError as e:
raise HTTPException(status_code=400, detail=str(e))
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))
@app.delete("/clear")
async def clear_documents():
"""Clear all ingested documents."""
rag_engine.clear_vectorstore()
return {"message": "Vector store cleared successfully"}
if __name__ == "__main__":
import uvicorn
uvicorn.run(app, host="0.0.0.0", port=8000)
Understanding the API
The API provides three main endpoints:
| Endpoint | Method | Purpose |
|---|---|---|
| `/ingest` | POST | Upload a PDF document |
| `/query` | POST | Ask a question |
| `/clear` | DELETE | Reset the system |
File Upload Flow (/ingest):
# 1. Receive the uploaded file
file: UploadFile = File(...)
# 2. Save to a temporary file (required because PyPDFLoader needs a file path)
with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
content = await file.read()
tmp.write(content)
tmp_path = tmp.name
# 3. Process the document (chunk + embed + store)
chunks_count = rag_engine.ingest_document(tmp_path)
# 4. Clean up the temporary file
Path(tmp_path).unlink()
Why use Pydantic models?
class QuestionRequest(BaseModel):
question: str
Pydantic provides:
- Automatic validation (ensures question is a string)
- Automatic documentation (shows up in Swagger UI)
- Type hints for IDE support
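A quick illustration of what that validation buys you (standalone sketch; FastAPI turns the same ValidationError into an automatic 422 response for you):

from pydantic import BaseModel, ValidationError

class QuestionRequest(BaseModel):
    question: str

QuestionRequest(question="What is the main topic?")   # valid: parses cleanly

try:
    QuestionRequest()                                  # missing required field
except ValidationError as exc:
    print(exc)                                         # "Field required" -- FastAPI would return 422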
Step 6: Run and Test
Start the server:
python -m uvicorn src.api:app --reload
Test with curl:
# Upload a PDF
curl -X POST "http://localhost:8000/ingest" \
-H "accept: application/json" \
-F "file=@your-document.pdf"
# Ask a question
curl -X POST "http://localhost:8000/query" \
-H "Content-Type: application/json" \
-d '{"question": "What is the main topic of this document?"}'
Or visit http://localhost:8000/docs for the interactive Swagger UI.
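If you prefer scripting the API from Python, here is an equivalent client sketch. It assumes the server above is running on localhost:8000 and uses the requests library, which is not in the dependency list (add it with uv add requests); the PDF path is a placeholder:

import requests

BASE_URL = "http://localhost:8000"

# Upload a PDF (replace the path with a real file)
with open("your-document.pdf", "rb") as f:
    resp = requests.post(
        f"{BASE_URL}/ingest",
        files={"file": ("your-document.pdf", f, "application/pdf")},
    )
print(resp.json())   # {"message": "...", "chunks_created": ...}

# Ask a question
resp = requests.post(
    f"{BASE_URL}/query",
    json={"question": "What is the main topic of this document?"},
)
result = resp.json()
print(result["answer"])
for source in result["sources"]:
    print(source["page"], source["content"])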
Testing
Create tests/test_rag.py:
"""Tests for the RAG system."""
import pytest
from src.document_processor import DocumentProcessor
from langchain_core.documents import Document
class TestDocumentProcessor:
"""Tests for document processing."""
def test_chunking_preserves_content(self):
"""Verify chunking doesn't lose content."""
processor = DocumentProcessor()
test_doc = Document(
page_content="A" * 2500,
metadata={"source": "test.pdf", "page": 0}
)
chunks = processor.chunk_documents([test_doc])
# Should create multiple chunks
assert len(chunks) > 1
# Total content should be preserved
total_content = sum(len(c.page_content) for c in chunks)
assert total_content >= 2500
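    def test_chunks_carry_metadata(self):
        """Added example (not from the original tutorial): check the chunk metadata."""
        # chunk_documents() tags every chunk with a sequential chunk_id and the total count,
        # so we can assert both fields here.
        processor = DocumentProcessor()
        test_doc = Document(
            page_content="B" * 2500,
            metadata={"source": "test.pdf", "page": 0}
        )
        chunks = processor.chunk_documents([test_doc])
        assert [c.metadata["chunk_id"] for c in chunks] == list(range(len(chunks)))
        assert all(c.metadata["chunk_total"] == len(chunks) for c in chunks)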
# Run with: pytest tests/test_rag.py -v
Debugging Tips
No documents ingested yet
- Ensure you've uploaded a PDF before querying
- Check that the vector store path is writable
OpenAI API key invalid
- Verify your .env file has the correct key
- Check for extra whitespace in the key
PDF processing failed
- Ensure the PDF isn't password-protected
- Some scanned PDFs need OCR preprocessing
Key Concepts Recap
| Concept | What It Does | Why It Matters |
|---|---|---|
| Chunking | Splits documents into smaller pieces | Enables precise retrieval; too large chunks dilute relevance |
| Embeddings | Converts text to vectors | Enables semantic search (meaning-based, not keyword-based) |
| Vector Store | Stores and searches embeddings | Fast similarity search across thousands of chunks |
| Retriever | Finds relevant chunks | Bridges user question to stored knowledge |
| Prompt Template | Structures the LLM request | Controls answer quality and prevents hallucination |
Extensions
| Level | Ideas |
|---|---|
| Easy | Add conversation history, support multiple file formats, add streaming |
| Medium | Add reranking with Cohere, implement hybrid search, citation highlighting |
| Advanced | RAG evaluation pipeline, async processing queue, multi-tenancy support |
Summary
You've built a complete RAG system that:
- Processes PDF documents into searchable chunks
- Creates and stores vector embeddings
- Performs semantic similarity search
- Generates contextual answers with sources
- Exposes a REST API for integration
Next: Multi-Document RAG