Intelligent Document Q&A System
Build a complete RAG system that answers questions from any PDF document
TL;DR
Build a RAG (Retrieval-Augmented Generation) system that lets users ask questions about PDF documents. Uses LangChain for orchestration, ChromaDB for vector storage, and OpenAI for embeddings and generation. Upload a PDF, ask questions, get answers with source citations.
| Property | Value |
|---|---|
| Difficulty | Beginner |
| Time | ~2 hours |
| Code Size | ~150 LOC |
Tech Stack
| Technology | Purpose |
|---|---|
| LangChain | RAG orchestration |
| OpenAI | Embeddings + GPT-4 |
| ChromaDB | Vector database |
| PyPDF | PDF extraction |
| FastAPI | REST API |
Prerequisites
- Basic Python knowledge
- Understanding of APIs
- An OpenAI API key
What You'll Learn
- Understand the RAG architecture pattern
- Extract and chunk text from PDF documents
- Create embeddings and store them in a vector database
- Implement semantic search retrieval
- Build a complete question-answering pipeline
- Create a REST API for your RAG system
Understanding RAG: The Core Concept
Before diving into code, let's understand what we're building and why.
What is RAG?
RAG (Retrieval-Augmented Generation) solves a fundamental problem: LLMs like GPT-4 only know what they were trained on. They can't answer questions about your private documents, recent events, or company data.
RAG fixes this by:
- Storing your documents in a searchable database
- Finding relevant pieces when a question is asked
- Feeding those pieces to the LLM as context
- Generating an answer based on your actual data
Why Not Just Paste Documents into ChatGPT?
You might wonder: "Why build all this? Can't I just paste my document into ChatGPT?"
| | Copy-paste into ChatGPT | RAG |
|---|---|---|
| Document size | Limited by the context window (~128K tokens) | Effectively unlimited |
| Cost | You pay for every token, every question | Only the relevant chunks are sent |
| Persistence | Start over each conversation | Documents stored permanently |
System Architecture
Here's the big picture of what we're building:
┌─────────────────────────────────────────────────────────────────────────┐
│ RAG ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ INPUT LAYER │
│ ┌────────────────┐ ┌────────────────┐ │
│ │ PDF Document │ │ User Question │ │
│ └───────┬────────┘ └───────┬────────┘ │
│ │ │ │
│ ▼ │ │
│ ════════════════════ │ ════════════════════ │
│ PROCESSING PIPELINE │ QUERY PIPELINE │
│ ════════════════════ │ ════════════════════ │
│ │ │ │ │
│ ▼ │ ▼ │
│ ┌────────────────┐ │ ┌────────────────┐ │
│ │Text Extraction │ │ │Query Embedding │ │
│ └───────┬────────┘ │ └───────┬────────┘ │
│ ▼ │ │ │
│ ┌────────────────┐ │ ▼ │
│ │ Chunking │ │ ┌────────────────┐ │
│ └───────┬────────┘ │ │Semantic Search │◄─────┐ │
│ ▼ │ └───────┬────────┘ │ │
│ ┌────────────────┐ │ │ │ │
│ │ Embedding │ │ ▼ │ │
│ └───────┬────────┘ │ ┌────────────────┐ │ │
│ ▼ │ │Context Retrieval│ │ │
│ ┌────────────────┐ │ └───────┬────────┘ │ │
│ │ Vector Storage │───────────────┼───────────────┼───────────────┘ │
│ └────────────────┘ │ ▼ │
│ │ ┌────────────────┐ │
│ └──────►│ LLM Generation │ │
│ └───────┬────────┘ │
│ ▼ │
│ ┌────────────────┐ │
│ │Answer + Sources│ │
│ └────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
The Two Phases
Phase 1: Ingestion (happens once per document)
- Extract text from PDF
- Split into smaller chunks (why? explained below)
- Convert chunks to vectors (numbers that capture meaning)
- Store vectors in a database
Phase 2: Query (happens each time user asks)
- Convert the question to a vector
- Find chunks with similar vectors (semantic search)
- Send question + relevant chunks to LLM
- Return the answer with sources
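The two phases can be sketched end to end in a few lines of plain Python. The keyword-overlap "embedding" below is a stand-in invented for this sketch so it runs without a model; only the shape of the pipeline mirrors the real system:

```python
# Toy end-to-end RAG flow. "Embeddings" are faked with keyword overlap
# so the example runs without any model -- the shape of the pipeline is
# the point, not the scoring.
import re

def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

# Phase 1: ingestion -- chunk the document and index each chunk
chunks = ["Q3 revenue was $50M, up 20% from Q2.",
          "The marketing team expanded into 3 new regions."]
index = [(chunk, tokens(chunk)) for chunk in chunks]

# Phase 2: query -- score chunks against the question, keep the best,
# and (in the real system) send it to the LLM as context
question = "What was the revenue in Q3?"
q = tokens(question)
best_chunk, _ = max(index, key=lambda item: len(q & item[1]))
prompt = f"Context:\n{best_chunk}\n\nQuestion: {question}\nAnswer:"
print(best_chunk)
```

The real system swaps the keyword score for vector similarity and hands the assembled prompt to an LLM, but the two-phase structure is exactly this.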
Data Flow
This sequence diagram shows exactly what happens when you upload a document and ask a question:
┌─────────────────────────────────────────────────────────────────────────┐
│ DATA FLOW SEQUENCE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ User FastAPI LangChain ChromaDB OpenAI │
│ │ │ │ │ │ │
│ │ ═══════════════════ DOCUMENT INGESTION ══════════════════ │
│ │ │ │ │ │ │
│ │─Upload PDF►│ │ │ │ │
│ │ │──Process───►│ │ │ │
│ │ │ │──Embed─────────────────►│ │
│ │ │ │◄────Vectors─────────────│ │
│ │ │ │──Store─────►│ │ │
│ │◄───Ready───│ │ │ │ │
│ │ │ │ │ │ │
│ │ ═══════════════════ QUESTION ANSWERING ══════════════════ │
│ │ │ │ │ │ │
│ │─Ask Quest─►│ │ │ │ │
│ │ │─────────Search────────────►│ │ │
│ │ │◄────Relevant Chunks───────│ │ │
│ │ │─────────Generate Answer──────────────►│ │
│ │◄─Answer────│ │ │ │ │
│ │ │ │ │ │ │
└─────────────────────────────────────────────────────────────────────────┘
Project Structure
intelligent-doc-qa/
├── src/
│ ├── __init__.py
│ ├── config.py # Settings and environment variables
│ ├── document_processor.py # PDF loading and chunking
│ ├── rag_engine.py # Core RAG logic
│ └── api.py # REST API endpoints
├── tests/
│ └── test_rag.py
├── data/
│ └── sample.pdf
├── .env # Your API keys (never commit this!)
├── pyproject.toml
└── README.md
Implementation
Step 1: Project Setup
Create your project directory and set up the environment:
mkdir intelligent-doc-qa && cd intelligent-doc-qa
uv init
uv venv && source .venv/bin/activate
Install dependencies:
uv add langchain langchain-openai langchain-chroma langchain-community
uv add chromadb pypdf python-dotenv
uv add fastapi uvicorn python-multipart
Create your .env file:
OPENAI_API_KEY=sk-your-key-here
CHROMA_PERSIST_DIR=./chroma_db
CHUNK_SIZE=1000
CHUNK_OVERLAP=200
Step 2: Configuration Module
Create src/config.py to manage environment variables:
"""Configuration management for the RAG system."""
import os
from dotenv import load_dotenv
load_dotenv()
class Config:
"""Application configuration from environment variables."""
OPENAI_API_KEY: str = os.getenv("OPENAI_API_KEY", "")
CHROMA_PERSIST_DIR: str = os.getenv("CHROMA_PERSIST_DIR", "./chroma_db")
CHUNK_SIZE: int = int(os.getenv("CHUNK_SIZE", "1000"))
CHUNK_OVERLAP: int = int(os.getenv("CHUNK_OVERLAP", "200"))
# Model settings
EMBEDDING_MODEL: str = "text-embedding-3-small"
LLM_MODEL: str = "gpt-4o-mini"
TEMPERATURE: float = 0.1
# Retrieval settings
TOP_K: int = 4
@classmethod
def validate(cls) -> None:
"""Validate required configuration."""
if not cls.OPENAI_API_KEY:
raise ValueError("OPENAI_API_KEY is required")
config = Config()
What's Happening Here?
This module centralizes all configuration in one place. Here's why each setting matters:
| Setting | Value | Why This Value? |
|---|---|---|
| CHUNK_SIZE | 1000 | Characters per chunk. Too small = fragments lose meaning. Too large = irrelevant content mixed in. 1000 is a good starting point. |
| CHUNK_OVERLAP | 200 | Characters shared between adjacent chunks. Prevents cutting sentences in half. 20% overlap is standard. |
| EMBEDDING_MODEL | text-embedding-3-small | OpenAI's cheapest current embedding model. Good quality at $0.02/1M tokens. |
| LLM_MODEL | gpt-4o-mini | Fast, cheap, capable. Use gpt-4o for complex reasoning. |
| TEMPERATURE | 0.1 | Low = more deterministic answers. For Q&A, we want consistent, factual responses. |
| TOP_K | 4 | Number of chunks to retrieve. 4 gives enough context without overwhelming the LLM. |
Why use a Config class?
Centralizing configuration makes it easy to:
- Change settings without editing multiple files
- Validate required values at startup
- Switch between development/production settings
Step 3: Document Processor
Create src/document_processor.py to handle PDF processing:
"""Document processing: extraction, chunking, and embedding."""
from pathlib import Path
from typing import List
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
from src.config import config
class DocumentProcessor:
"""Handles PDF loading, text extraction, and chunking."""
def __init__(self):
self.text_splitter = RecursiveCharacterTextSplitter(
chunk_size=config.CHUNK_SIZE,
chunk_overlap=config.CHUNK_OVERLAP,
length_function=len,
separators=["\n\n", "\n", ". ", " ", ""]
)
def load_pdf(self, file_path: str | Path) -> List[Document]:
"""Load a PDF and return raw documents."""
loader = PyPDFLoader(str(file_path))
return loader.load()
def chunk_documents(self, documents: List[Document]) -> List[Document]:
"""Split documents into smaller chunks for embedding."""
chunks = self.text_splitter.split_documents(documents)
# Add chunk metadata
for i, chunk in enumerate(chunks):
chunk.metadata["chunk_id"] = i
chunk.metadata["chunk_total"] = len(chunks)
return chunks
def process(self, file_path: str | Path) -> List[Document]:
"""Full pipeline: load PDF and chunk it."""
documents = self.load_pdf(file_path)
return self.chunk_documents(documents)
Understanding Chunking: The Heart of RAG
Chunking is one of the most important decisions in a RAG system. Let's break down what's happening:
Why do we chunk documents?
Imagine you have a 100-page PDF. When a user asks "What was the revenue in Q3?", you don't want to send all 100 pages to the LLM - that's expensive and the relevant info gets lost in noise.
Instead, we:
- Split the document into small pieces (chunks)
- Find only the chunks relevant to the question
- Send just those chunks to the LLM
How RecursiveCharacterTextSplitter works:
separators=["\n\n", "\n", ". ", " ", ""]
This tells the splitter to try splitting in this order:
- First, try to split on double newlines (paragraphs) - keeps paragraphs together
- If chunk is still too big, split on single newlines
- Then on sentences (". ")
- Then on words (" ")
- Finally, on characters (last resort)
This "recursive" approach preserves meaning better than naive splitting.
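To make the fallback order concrete, here is a stripped-down reimplementation of the recursive idea. This is not LangChain's actual code; the real splitter also merges adjacent small pieces and applies overlap:

```python
# Simplified recursive splitter (illustration only: LangChain's real
# implementation also merges small pieces back together and adds overlap).
def recursive_split(text: str, chunk_size: int,
                    separators=("\n\n", "\n", ". ", " ", "")) -> list[str]:
    if len(text) <= chunk_size:
        return [text]
    sep, *rest = separators
    if sep == "":
        # Last resort: hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    pieces = text.split(sep)
    if len(pieces) == 1:
        # Separator not found; fall through to the next, finer one
        return recursive_split(text, chunk_size, rest)
    out: list[str] = []
    for piece in pieces:
        if piece:
            out.extend(recursive_split(piece, chunk_size, rest))
    return out

doc = "Intro paragraph.\n\n" + "A body sentence about revenue. " * 50
chunks = recursive_split(doc, 200)
print(len(chunks), "chunks, longest:", max(len(c) for c in chunks))
```

Notice how the intro paragraph survives as one chunk (it fits), while the long run of sentences falls through to the ". " separator and splits cleanly at sentence boundaries.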
Visual example of chunking:
Original Document (2500 characters):
┌─────────────────────────────────────────────────────┐
│ Introduction paragraph about the company... │
│ │
│ Q3 Revenue was $50M, up 20% from Q2... │
│ │
│ The marketing team expanded into 3 new regions... │
│ │
│ Future outlook remains positive with... │
└─────────────────────────────────────────────────────┘
After chunking (chunk_size=1000, overlap=200):
┌────────────────────┐
│ Chunk 1 │
│ Introduction... │──┐
│ Q3 Revenue was... │ │ 200 char overlap
└────────────────────┘ │
┌───────────────────┘
▼
┌────────────────────┐
│ Chunk 2 │
│ ...Revenue was... │──┐
│ Marketing team... │ │ 200 char overlap
└────────────────────┘ │
┌───────────────────┘
▼
┌────────────────────┐
│ Chunk 3 │
│ ...3 new regions │
│ Future outlook... │
└────────────────────┘
The overlap ensures that if important information spans a chunk boundary, it appears in both chunks.
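The overlap mechanism itself is easy to see with a toy character window (chunk_size=20, overlap=5). This is a simplification of what the real splitter does, but the boundary behavior is the same:

```python
# Sliding character window: each chunk starts `overlap` characters
# before the previous one ended, so text that straddles a boundary
# appears whole in at least one chunk.
def window_chunks(text: str, size: int, overlap: int) -> list[str]:
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

chunks = window_chunks("Revenue was $50M in Q3, up 20% from Q2 this year.", 20, 5)
for c in chunks:
    print(repr(c))

# The last 5 characters of each chunk reappear as the first 5 of the next
assert chunks[0][-5:] == chunks[1][:5]
```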
Step 4: RAG Engine
Create src/rag_engine.py - the core of the system:
"""RAG Engine: Vector store management and question answering."""
from pathlib import Path
from typing import List, Optional
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from src.config import config
from src.document_processor import DocumentProcessor
class RAGEngine:
"""Manages vector storage and RAG-based question answering."""
def __init__(self):
config.validate()
self.embeddings = OpenAIEmbeddings(
model=config.EMBEDDING_MODEL,
openai_api_key=config.OPENAI_API_KEY
)
self.llm = ChatOpenAI(
model=config.LLM_MODEL,
temperature=config.TEMPERATURE,
openai_api_key=config.OPENAI_API_KEY
)
self.processor = DocumentProcessor()
self.vectorstore: Optional[Chroma] = None
self._load_or_create_vectorstore()
def _load_or_create_vectorstore(self) -> None:
"""Initialize or load existing vector store."""
persist_dir = Path(config.CHROMA_PERSIST_DIR)
self.vectorstore = Chroma(
persist_directory=str(persist_dir),
embedding_function=self.embeddings
)
def ingest_document(self, file_path: str | Path) -> int:
"""Process and store a document in the vector store."""
chunks = self.processor.process(file_path)
self.vectorstore.add_documents(chunks)
return len(chunks)
def _format_docs(self, docs: List[Document]) -> str:
"""Format retrieved documents for the prompt."""
formatted = []
for i, doc in enumerate(docs, 1):
source = doc.metadata.get("source", "Unknown")
page = doc.metadata.get("page", "?")
formatted.append(
f"[Source {i}: {Path(source).name}, Page {page}]\n{doc.page_content}"
)
return "\n\n---\n\n".join(formatted)
def query(self, question: str) -> dict:
"""Answer a question using RAG."""
if not self.vectorstore:
raise ValueError("No documents ingested yet")
# Retrieve relevant chunks
retriever = self.vectorstore.as_retriever(
search_type="similarity",
search_kwargs={"k": config.TOP_K}
)
# RAG prompt template
prompt = ChatPromptTemplate.from_messages([
("system", """You are a helpful assistant that answers questions based on the provided context.
Instructions:
- Answer ONLY based on the context provided
- If the answer is not in the context, say "I cannot find this information in the document"
- Cite the source numbers when possible
- Be concise but thorough"""),
("human", """Context:
{context}
Question: {question}
Answer:""")
])
# Build the RAG chain
chain = (
{"context": retriever | self._format_docs, "question": RunnablePassthrough()}
| prompt
| self.llm
| StrOutputParser()
)
# Execute the chain, then fetch the source documents separately
# (note: this runs retrieval twice; fine for a tutorial, cache in production)
answer = chain.invoke(question)
source_docs = retriever.invoke(question)
return {
"answer": answer,
"sources": [
{
"content": doc.page_content[:200] + "...",
"page": doc.metadata.get("page"),
"source": doc.metadata.get("source")
}
for doc in source_docs
]
}
def clear_vectorstore(self) -> None:
"""Clear all documents from the vector store."""
if self.vectorstore:
self.vectorstore.delete_collection()
self._load_or_create_vectorstore()
Deep Dive: How the RAG Engine Works
This is the heart of the system. Let's understand each component:
1. Embeddings: Converting Text to Numbers
self.embeddings = OpenAIEmbeddings(
model=config.EMBEDDING_MODEL,
openai_api_key=config.OPENAI_API_KEY
)
What are embeddings?
Embeddings convert text into vectors (lists of numbers) that capture semantic meaning. Similar texts have similar vectors.
"The cat sat on the mat" → [0.2, -0.5, 0.8, 0.1, ...] (1536 numbers)
"A feline rested on a rug" → [0.19, -0.48, 0.79, 0.12, ...] (very similar!)
"Stock prices fell today" → [-0.7, 0.3, -0.2, 0.9, ...] (very different)This is how semantic search works - we find chunks whose vectors are close to the question's vector.
2. Vector Store: ChromaDB
self.vectorstore = Chroma(
persist_directory=str(persist_dir),
embedding_function=self.embeddings
)
ChromaDB stores our embeddings and enables fast similarity search. When you call add_documents(), it:
- Converts each chunk to a vector using the embedding function
- Stores the vector + original text + metadata
- Builds an index for fast searching
3. The Retriever
retriever = self.vectorstore.as_retriever(
search_type="similarity",
search_kwargs={"k": config.TOP_K}
)
The retriever wraps the vector store and provides a simple interface:
- Input: A question (string)
- Output: Top K most similar chunks
How similarity search works:
Question: "What was Q3 revenue?"
↓
Embed question: [0.3, -0.2, 0.7, ...]
↓
Compare to all stored chunks using cosine similarity
↓
Return top 4 most similar chunks
4. The RAG Chain (LangChain Expression Language)
chain = (
{"context": retriever | self._format_docs, "question": RunnablePassthrough()}
| prompt
| self.llm
| StrOutputParser()
)
This is LangChain's way of building pipelines. Let's break it down:
Step 1: {"context": retriever | self._format_docs, "question": RunnablePassthrough()}
┌─────────────────────────────────────────────────────────────┐
│ Creates a dictionary with two keys: │
│ - "context": question → retriever → format as string │
│ - "question": question passes through unchanged │
└─────────────────────────────────────────────────────────────┘
↓
Step 2: | prompt
┌─────────────────────────────────────────────────────────────┐
│ Fills in the prompt template with context and question │
└─────────────────────────────────────────────────────────────┘
↓
Step 3: | self.llm
┌─────────────────────────────────────────────────────────────┐
│ Sends the filled prompt to GPT-4o-mini │
└─────────────────────────────────────────────────────────────┘
↓
Step 4: | StrOutputParser()
┌─────────────────────────────────────────────────────────────┐
│ Extracts just the text string from the LLM response │
└─────────────────────────────────────────────────────────────┘
5. The Prompt Template
prompt = ChatPromptTemplate.from_messages([
("system", """You are a helpful assistant that answers questions based on the provided context.
Instructions:
- Answer ONLY based on the context provided
- If the answer is not in the context, say "I cannot find this information in the document"
- Cite the source numbers when possible
- Be concise but thorough"""),
("human", """Context:
{context}
Question: {question}
Answer:""")
])
This prompt is critical for RAG quality. Notice:
- We explicitly tell the LLM to only use the context (prevents hallucination)
- We tell it to admit when information isn't available
- We ask for citations (source numbers)
Prompt Engineering Tip
The instruction "If the answer is not in the context, say I cannot find this information" is crucial. Without it, LLMs often hallucinate answers that sound plausible but aren't in your documents.
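If the pipe syntax in the chain above still feels opaque, here is the same flow written as ordinary Python function calls. Every stage below is a stub invented for illustration; none of it is LangChain code:

```python
# Plain-Python analogue of the LCEL chain: each stage is a function,
# and the "chain" is just calling them in sequence. All stages are
# stubs standing in for the real retriever / prompt / LLM.
def retrieve(question: str) -> str:
    return "[Source 1] Q3 revenue was $50M."   # stub retriever

def fill_prompt(inputs: dict) -> str:
    return f"Context:\n{inputs['context']}\n\nQuestion: {inputs['question']}\nAnswer:"

def llm(prompt_text: str) -> str:
    return "Q3 revenue was $50M [Source 1]."   # stub LLM

def chain(question: str) -> str:
    step1 = {"context": retrieve(question), "question": question}  # the dict stage
    step2 = fill_prompt(step1)    # | prompt
    step3 = llm(step2)            # | self.llm
    return step3                  # | StrOutputParser(): already a plain string

print(chain("What was Q3 revenue?"))
```

LCEL's `|` operator composes these stages declaratively, which also buys you streaming, batching, and async support for free, but conceptually it is nothing more than this sequence of calls.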
Step 5: FastAPI Application
Create src/api.py for the REST API:
"""FastAPI application for the RAG system."""
import tempfile
from pathlib import Path
from fastapi import FastAPI, UploadFile, File, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel
from src.rag_engine import RAGEngine
app = FastAPI(
title="Intelligent Document Q&A API",
description="RAG-powered document question answering system",
version="1.0.0"
)
app.add_middleware(
CORSMiddleware,
allow_origins=["*"],
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
# Initialize RAG engine
rag_engine = RAGEngine()
class QuestionRequest(BaseModel):
question: str
class AnswerResponse(BaseModel):
answer: str
sources: list
class IngestResponse(BaseModel):
message: str
chunks_created: int
@app.get("/")
async def root():
"""Health check endpoint."""
return {"status": "healthy", "service": "Intelligent Document Q&A"}
@app.post("/ingest", response_model=IngestResponse)
async def ingest_document(file: UploadFile = File(...)):
"""Upload and process a PDF document."""
if not file.filename or not file.filename.lower().endswith(".pdf"):
raise HTTPException(status_code=400, detail="Only PDF files are supported")
try:
with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
content = await file.read()
tmp.write(content)
tmp_path = tmp.name
chunks_count = rag_engine.ingest_document(tmp_path)
Path(tmp_path).unlink()
return IngestResponse(
message=f"Successfully processed {file.filename}",
chunks_created=chunks_count
)
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))
@app.post("/query", response_model=AnswerResponse)
async def query_document(request: QuestionRequest):
"""Ask a question about the ingested documents."""
if not request.question.strip():
raise HTTPException(status_code=400, detail="Question cannot be empty")
try:
result = rag_engine.query(request.question)
return AnswerResponse(**result)
except ValueError as e:
raise HTTPException(status_code=400, detail=str(e))
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))
@app.delete("/clear")
async def clear_documents():
"""Clear all ingested documents."""
rag_engine.clear_vectorstore()
return {"message": "Vector store cleared successfully"}
if __name__ == "__main__":
import uvicorn
uvicorn.run(app, host="0.0.0.0", port=8000)
Understanding the API
The API provides three main endpoints:
| Endpoint | Method | Purpose |
|---|---|---|
| /ingest | POST | Upload and process a PDF document |
| /query | POST | Ask a question |
| /clear | DELETE | Reset the system |
File Upload Flow (/ingest):
# 1. Receive the uploaded file
file: UploadFile = File(...)
# 2. Save to a temporary file (required because PyPDFLoader needs a file path)
with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
content = await file.read()
tmp.write(content)
tmp_path = tmp.name
# 3. Process the document (chunk + embed + store)
chunks_count = rag_engine.ingest_document(tmp_path)
# 4. Clean up the temporary file
Path(tmp_path).unlink()
Why use Pydantic models?
class QuestionRequest(BaseModel):
question: str
Pydantic provides:
- Automatic validation (ensures `question` is a string)
- Automatic documentation (shows up in Swagger UI)
- Type hints for IDE support
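A quick standalone demonstration of that validation (assumes pydantic is installed; FastAPI pulls it in automatically):

```python
from pydantic import BaseModel, ValidationError

class QuestionRequest(BaseModel):
    question: str

# A valid payload parses into a typed object
req = QuestionRequest(question="What was Q3 revenue?")
print(req.question)

# A payload missing the required field is rejected --
# FastAPI converts this into an automatic HTTP 422 response
try:
    QuestionRequest()
    print("accepted")
except ValidationError:
    print("rejected")
```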
Step 6: Run and Test
Start the server:
python -m uvicorn src.api:app --reload
Test with curl:
# Upload a PDF
curl -X POST "http://localhost:8000/ingest" \
-H "accept: application/json" \
-F "file=@your-document.pdf"
# Ask a question
curl -X POST "http://localhost:8000/query" \
-H "Content-Type: application/json" \
-d '{"question": "What is the main topic of this document?"}'
Or visit http://localhost:8000/docs for the interactive Swagger UI.
Testing
Create tests/test_rag.py:
"""Tests for the RAG system."""
import pytest
from src.document_processor import DocumentProcessor
from langchain_core.documents import Document
class TestDocumentProcessor:
"""Tests for document processing."""
def test_chunking_preserves_content(self):
"""Verify chunking doesn't lose content."""
processor = DocumentProcessor()
test_doc = Document(
page_content="A" * 2500,
metadata={"source": "test.pdf", "page": 0}
)
chunks = processor.chunk_documents([test_doc])
# Should create multiple chunks
assert len(chunks) > 1
# Total content should be preserved
total_content = sum(len(c.page_content) for c in chunks)
assert total_content >= 2500
# Run with: pytest tests/test_rag.py -v
Debugging Tips
No documents ingested yet
- Ensure you've uploaded a PDF before querying
- Check that the vector store path is writable
OpenAI API key invalid
- Verify your `.env` file has the correct key
- Check for extra whitespace in the key
PDF processing failed
- Ensure the PDF isn't password-protected
- Some scanned PDFs need OCR preprocessing
Key Concepts Recap
| Concept | What It Does | Why It Matters |
|---|---|---|
| Chunking | Splits documents into smaller pieces | Enables precise retrieval; too large chunks dilute relevance |
| Embeddings | Converts text to vectors | Enables semantic search (meaning-based, not keyword-based) |
| Vector Store | Stores and searches embeddings | Fast similarity search across thousands of chunks |
| Retriever | Finds relevant chunks | Bridges user question to stored knowledge |
| Prompt Template | Structures the LLM request | Controls answer quality and prevents hallucination |
Extensions
| Level | Ideas |
|---|---|
| Easy | Add conversation history, support multiple file formats, add streaming |
| Medium | Add reranking with Cohere, implement hybrid search, citation highlighting |
| Advanced | RAG evaluation pipeline, async processing queue, multi-tenancy support |
Summary
You've built a complete RAG system that:
- Processes PDF documents into searchable chunks
- Creates and stores vector embeddings
- Performs semantic similarity search
- Generates contextual answers with sources
- Exposes a REST API for integration
Next: Multi-Document RAG