Intelligent Document Q&A System
Build a complete RAG system that answers questions from any PDF document
TL;DR
Build a RAG (Retrieval-Augmented Generation) system that lets users ask questions about PDF documents. Uses LangChain for orchestration, ChromaDB for vector storage, and OpenAI for embeddings and generation. Upload a PDF, ask questions, get answers with source citations.
| Property | Value |
|---|---|
| Difficulty | Beginner |
| Time | ~2 hours |
| Code Size | ~150 LOC |
Tech Stack
| Technology | Purpose |
|---|---|
| LangChain | RAG orchestration |
| OpenAI | Embeddings + GPT-4o-mini |
| ChromaDB | Vector database |
| PyPDF | PDF extraction |
| FastAPI | REST API |
Prerequisites
- Basic Python knowledge
- Understanding of APIs
- OpenAI API key (create one at platform.openai.com)
What You'll Learn
- Understand the RAG architecture pattern
- Extract and chunk text from PDF documents
- Create embeddings and store them in a vector database
- Implement semantic search retrieval
- Build a complete question-answering pipeline
- Create a REST API for your RAG system
Why RAG?
LLMs have three fundamental limitations that RAG solves:
| Limitation | What Happens Without RAG | How RAG Fixes It |
|---|---|---|
| No private data | LLM cannot answer questions about your company docs, PDFs, or databases | RAG retrieves from your documents at query time |
| Outdated knowledge | LLM's training data has a cutoff date -- it does not know recent events | RAG pulls from up-to-date document stores |
| Hallucination | LLM confidently fabricates plausible-sounding answers | RAG grounds answers in actual source documents with citations |
RAG is the most widely deployed pattern for adding external knowledge to LLMs because it requires no model training, works with any LLM, and can be updated instantly by adding new documents.
Understanding RAG: The Core Concept
Before diving into code, let's understand what we're building and why.
What is RAG?
RAG (Retrieval-Augmented Generation) solves a fundamental problem: LLMs only know what they were trained on. They can't answer questions about your private documents, recent events, or company data.
RAG fixes this by:
- Storing your documents in a searchable database
- Finding relevant pieces when a question is asked
- Feeding those pieces to the LLM as context
- Generating an answer based on your actual data
Why Not Just Paste Documents into ChatGPT?
You might wonder: "Why build all this? Can't I just paste my document into ChatGPT?"
| Approach | Trade-off |
|---|---|
| Copy-paste | Context window limits (even a 128K-token window tops out at a few hundred pages) |
| Copy-paste | Expensive - you pay for every token, every question |
| Copy-paste | No persistence - start over each conversation |
| RAG | Handles unlimited documents |
| RAG | Only retrieves what's relevant (cheaper) |
| RAG | Documents stored permanently |
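To make the cost point concrete, here is a rough back-of-the-envelope comparison. The token counts and the gpt-4o-mini price are illustrative assumptions, not measurements from this project, so check current OpenAI pricing before relying on them:

# Rough, illustrative numbers only -- verify against current OpenAI pricing.
PRICE_PER_INPUT_TOKEN = 0.15 / 1_000_000   # assumed gpt-4o-mini input price, USD per token

paste_everything = 50_000   # assume a ~100-page PDF is on the order of 50,000 input tokens
rag_retrieval = 4 * 250     # 4 retrieved chunks of ~1,000 characters (~250 tokens each)

print(f"Paste the whole document: ${paste_everything * PRICE_PER_INPUT_TOKEN:.4f} per question")
print(f"RAG with 4 chunks:        ${rag_retrieval * PRICE_PER_INPUT_TOKEN:.6f} per question")
# Roughly a 50x difference in input-token cost, and the gap grows with document size.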
System Architecture
Here's the big picture of what we're building:
(Diagram: RAG Architecture, showing a Processing Pipeline that runs once per document and a Query Pipeline that runs for each question.)
The Two Phases
Phase 1: Ingestion (happens once per document)
- Extract text from PDF
- Split into smaller chunks (why? explained below)
- Convert chunks to vectors (numbers that capture meaning)
- Store vectors in a database
Phase 2: Query (happens each time user asks)
- Convert the question to a vector
- Find chunks with similar vectors (semantic search)
- Send question + relevant chunks to LLM
- Return the answer with sources
Data Flow
These sequence diagrams show exactly what happens when you upload a document and then ask a question:
(Sequence diagrams: Document Ingestion Flow and Question Answering Flow.)
Project Structure
intelligent-doc-qa/
├── src/
│ ├── __init__.py
│ ├── config.py # Settings and environment variables
│ ├── document_processor.py # PDF loading and chunking
│ ├── rag_engine.py # Core RAG logic
│ └── api.py # REST API endpoints
├── tests/
│ └── test_rag.py
├── data/
│ └── sample.pdf
├── .env # Your API keys (never commit this!)
├── pyproject.toml
└── README.md
Implementation
Step 1: Project Setup
Create your project directory and set up the environment:
mkdir intelligent-doc-qa && cd intelligent-doc-qa
uv init
uv venv && source .venv/bin/activate
Install dependencies:
uv add langchain langchain-openai langchain-chroma
uv add chromadb pypdf python-dotenv
uv add fastapi uvicorn python-multipart
Create your .env file:
OPENAI_API_KEY=sk-your-key-here
CHROMA_PERSIST_DIR=./chroma_db
CHUNK_SIZE=1000
CHUNK_OVERLAP=200
Step 2: Configuration Module
Create src/config.py to manage environment variables:
"""Configuration management for the RAG system."""
import os
from dotenv import load_dotenv
load_dotenv()
class Config:
"""Application configuration from environment variables."""
OPENAI_API_KEY: str = os.getenv("OPENAI_API_KEY", "")
CHROMA_PERSIST_DIR: str = os.getenv("CHROMA_PERSIST_DIR", "./chroma_db")
CHUNK_SIZE: int = int(os.getenv("CHUNK_SIZE", "1000"))
CHUNK_OVERLAP: int = int(os.getenv("CHUNK_OVERLAP", "200"))
# Model settings
EMBEDDING_MODEL: str = "text-embedding-3-small"
LLM_MODEL: str = "gpt-4o-mini"
TEMPERATURE: float = 0.1
# Retrieval settings
TOP_K: int = 4
@classmethod
def validate(cls) -> None:
"""Validate required configuration."""
if not cls.OPENAI_API_KEY:
raise ValueError("OPENAI_API_KEY is required")
config = Config()
What's Happening Here?
This module centralizes all configuration in one place. Here's why each setting matters:
| Setting | Value | Why This Value? |
|---|---|---|
| `CHUNK_SIZE` | 1000 | Characters per chunk. Too small = fragments lose meaning. Too large = irrelevant content mixed in. 1000 is a good starting point. |
| `CHUNK_OVERLAP` | 200 | Characters shared between chunks. Prevents cutting sentences in half. 20% overlap is standard. |
| `EMBEDDING_MODEL` | text-embedding-3-small | OpenAI's small, low-cost embedding model. Good quality at $0.02/1M tokens. |
| `LLM_MODEL` | gpt-4o-mini | Fast, cheap, capable. Use gpt-4o for complex reasoning. |
| `TEMPERATURE` | 0.1 | Low = more deterministic answers. For Q&A, we want consistent, factual responses. |
| `TOP_K` | 4 | Number of chunks to retrieve. 4 gives enough context without overwhelming the LLM. |
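A quick sanity check on how these defaults combine, using the common rule of thumb of roughly four characters per English token (an approximation, not an exact count):

# How much context do the defaults send to the LLM per question?
CHUNK_SIZE = 1000      # characters per chunk
TOP_K = 4              # chunks retrieved per question

context_chars = CHUNK_SIZE * TOP_K
context_tokens = context_chars // 4    # rough rule of thumb: ~4 characters per token

print(context_chars, "characters ~", context_tokens, "tokens of retrieved context")
# About 4,000 characters (~1,000 tokens): enough to answer most questions without drowning the model.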
Why use a Config class?
Centralizing configuration makes it easy to:
- Change settings without editing multiple files
- Validate required values at startup
- Switch between development/production settings
Step 3: Document Processor
Create src/document_processor.py to handle PDF processing:
"""Document processing: extraction, chunking, and embedding."""
from pathlib import Path
from typing import List
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
from src.config import config
class DocumentProcessor:
"""Handles PDF loading, text extraction, and chunking."""
def __init__(self):
self.text_splitter = RecursiveCharacterTextSplitter(
chunk_size=config.CHUNK_SIZE,
chunk_overlap=config.CHUNK_OVERLAP,
length_function=len,
separators=["\n\n", "\n", ". ", " ", ""]
)
def load_pdf(self, file_path: str | Path) -> List[Document]:
"""Load a PDF and return raw documents."""
loader = PyPDFLoader(str(file_path))
return loader.load()
def chunk_documents(self, documents: List[Document]) -> List[Document]:
"""Split documents into smaller chunks for embedding."""
chunks = self.text_splitter.split_documents(documents)
# Add chunk metadata
for i, chunk in enumerate(chunks):
chunk.metadata["chunk_id"] = i
chunk.metadata["chunk_total"] = len(chunks)
return chunks
def process(self, file_path: str | Path) -> List[Document]:
"""Full pipeline: load PDF and chunk it."""
documents = self.load_pdf(file_path)
return self.chunk_documents(documents)
Understanding Chunking: The Heart of RAG
Chunking is one of the most important decisions in a RAG system. Let's break down what's happening:
Why do we chunk documents?
Imagine you have a 100-page PDF. When a user asks "What was the revenue in Q3?", you don't want to send all 100 pages to the LLM - that's expensive and the relevant info gets lost in noise.
Instead, we:
- Split the document into small pieces (chunks)
- Find only the chunks relevant to the question
- Send just those chunks to the LLM
How RecursiveCharacterTextSplitter works:
separators=["\n\n", "\n", ". ", " ", ""]
This tells the splitter to try splitting in this order:
- First, try to split on double newlines (paragraphs) - keeps paragraphs together
- If chunk is still too big, split on single newlines
- Then on sentences (". ")
- Then on words (" ")
- Finally, on characters (last resort)
This "recursive" approach preserves meaning better than naive splitting.
Visual example: splitting a 2,500-character document with chunk_size=1000 and overlap=200 yields roughly three overlapping chunks.
The overlap ensures that if important information spans a chunk boundary, it appears in both chunks.
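If you want to see the boundaries for yourself, this short standalone sketch runs the same splitter settings over a synthetic 2,500-character string (the text is made up purely for illustration):

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Synthetic stand-in for a real document: ~2,500 characters of repeated sentences.
text = ("Quarterly revenue grew steadily across all regions. " * 50)[:2500]

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""],
)

chunks = splitter.split_text(text)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {len(chunk)} chars, ends with {chunk[-30:]!r}")
# Expect roughly three chunks; the end of one chunk reappears at the start of the next
# because of the 200-character overlap.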
Step 4: RAG Engine
Create src/rag_engine.py - the core of the system:
"""RAG Engine: Vector store management and question answering."""
from pathlib import Path
from typing import List, Optional
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from src.config import config
from src.document_processor import DocumentProcessor
class RAGEngine:
"""Manages vector storage and RAG-based question answering."""
def __init__(self):
config.validate()
self.embeddings = OpenAIEmbeddings(
model=config.EMBEDDING_MODEL,
openai_api_key=config.OPENAI_API_KEY
)
self.llm = ChatOpenAI(
model=config.LLM_MODEL,
temperature=config.TEMPERATURE,
openai_api_key=config.OPENAI_API_KEY
)
self.processor = DocumentProcessor()
self.vectorstore: Optional[Chroma] = None
self._load_or_create_vectorstore()
def _load_or_create_vectorstore(self) -> None:
"""Initialize or load existing vector store."""
persist_dir = Path(config.CHROMA_PERSIST_DIR)
self.vectorstore = Chroma(
persist_directory=str(persist_dir),
embedding_function=self.embeddings
)
def ingest_document(self, file_path: str | Path) -> int:
"""Process and store a document in the vector store."""
chunks = self.processor.process(file_path)
self.vectorstore.add_documents(chunks)
return len(chunks)
def _format_docs(self, docs: List[Document]) -> str:
"""Format retrieved documents for the prompt."""
formatted = []
for i, doc in enumerate(docs, 1):
source = doc.metadata.get("source", "Unknown")
page = doc.metadata.get("page", "?")
formatted.append(
f"[Source {i}: {Path(source).name}, Page {page}]\n{doc.page_content}"
)
return "\n\n---\n\n".join(formatted)
def query(self, question: str) -> dict:
"""Answer a question using RAG."""
if not self.vectorstore:
raise ValueError("No documents ingested yet")
# Retrieve relevant chunks
retriever = self.vectorstore.as_retriever(
search_type="similarity",
search_kwargs={"k": config.TOP_K}
)
# RAG prompt template
prompt = ChatPromptTemplate.from_messages([
("system", """You are a helpful assistant that answers questions based on the provided context.
Instructions:
- Answer ONLY based on the context provided
- If the answer is not in the context, say "I cannot find this information in the document"
- Cite the source numbers when possible
- Be concise but thorough"""),
("human", """Context:
{context}
Question: {question}
Answer:""")
])
# Build the RAG chain
chain = (
{"context": retriever | self._format_docs, "question": RunnablePassthrough()}
| prompt
| self.llm
| StrOutputParser()
)
# Execute and get results
answer = chain.invoke(question)
source_docs = retriever.invoke(question)
return {
"answer": answer,
"sources": [
{
"content": doc.page_content[:200] + "...",
"page": doc.metadata.get("page"),
"source": doc.metadata.get("source")
}
for doc in source_docs
]
}
def clear_vectorstore(self) -> None:
"""Clear all documents from the vector store."""
if self.vectorstore:
self.vectorstore.delete_collection()
self._load_or_create_vectorstore()
Deep Dive: How the RAG Engine Works
This is the heart of the system. Let's understand each component:
1. Embeddings: Converting Text to Numbers
self.embeddings = OpenAIEmbeddings(
model=config.EMBEDDING_MODEL,
openai_api_key=config.OPENAI_API_KEY
)
What are embeddings?
Embeddings convert text into vectors (lists of numbers) that capture semantic meaning. Similar texts have similar vectors.
(Diagram: How Embeddings Capture Meaning. "The cat sat on the mat" and "A feline rested on a rug" map to nearby vectors, while "Stock prices fell today" lands far away from both.)
This is how semantic search works - we find chunks whose vectors are close to the question's vector.
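You can verify this intuition in a few lines. The sketch below embeds the three example sentences and compares them with cosine similarity; it requires OPENAI_API_KEY in the environment, and the exact scores will vary:

from math import sqrt
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors: close to 1.0 means very similar meaning."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

cat = embeddings.embed_query("The cat sat on the mat")
feline = embeddings.embed_query("A feline rested on a rug")
stocks = embeddings.embed_query("Stock prices fell today")

print("cat vs feline:", cosine(cat, feline))   # high: same meaning, different words
print("cat vs stocks:", cosine(cat, stocks))   # noticeably lower: unrelated meaning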
2. Vector Store: ChromaDB
self.vectorstore = Chroma(
persist_directory=str(persist_dir),
embedding_function=self.embeddings
)
ChromaDB stores our embeddings and enables fast similarity search. When you call add_documents(), it does the following (a short round-trip sketch follows the list):
- Converts each chunk to a vector using the embedding function
- Stores the vector + original text + metadata
- Builds an index for fast searching
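Here is a minimal round-trip against an ephemeral in-memory Chroma store, separate from the project's persisted one. The document texts are invented purely for illustration, and the snippet needs OPENAI_API_KEY:

from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

# No persist_directory: this store lives only for the duration of the script.
store = Chroma(embedding_function=OpenAIEmbeddings(model="text-embedding-3-small"))

store.add_documents([
    Document(page_content="Q3 revenue was 4.2 million, up 12% year over year.", metadata={"page": 7}),
    Document(page_content="The company was founded in 2015.", metadata={"page": 1}),
])

hits = store.similarity_search("What was the revenue in Q3?", k=1)
print(hits[0].page_content)   # the revenue chunk should rank first
print(hits[0].metadata)       # metadata travels with each chunk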
3. The Retriever
retriever = self.vectorstore.as_retriever(
search_type="similarity",
search_kwargs={"k": config.TOP_K}
)
The retriever wraps the vector store and provides a simple interface:
- Input: A question (string)
- Output: Top K most similar chunks
How similarity search works: the question is embedded into a vector, that vector is compared against every stored chunk vector, and the K closest chunks are returned. (Diagram: Similarity Search Process.)
4. The RAG Chain (LangChain Expression Language)
chain = (
{"context": retriever | self._format_docs, "question": RunnablePassthrough()}
| prompt
| self.llm
| StrOutputParser()
)
This is LangChain's way of building pipelines. Let's break it down:
(Diagram: RAG Chain Execution.)
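In plain steps, the chain does roughly the following. This sketch reuses the retriever and prompt built inside query() plus the engine's llm and _format_docs; it is an unrolled view of the pipeline, not code you need to add:

question = "What was the revenue in Q3?"     # example input

docs = retriever.invoke(question)            # 1. RunnablePassthrough forwards the question; the
                                             #    retriever fetches the TOP_K most similar chunks
context = self._format_docs(docs)            # 2. chunks are labelled with source/page and joined
messages = prompt.invoke(                    # 3. the template is filled with context + question
    {"context": context, "question": question}
)
response = self.llm.invoke(messages)         # 4. the chat model generates the grounded answer
answer = response.content                    # 5. StrOutputParser simply extracts this text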
5. The Prompt Template
prompt = ChatPromptTemplate.from_messages([
("system", """You are a helpful assistant that answers questions based on the provided context.
Instructions:
- Answer ONLY based on the context provided
- If the answer is not in the context, say "I cannot find this information in the document"
- Cite the source numbers when possible
- Be concise but thorough"""),
("human", """Context:
{context}
Question: {question}
Answer:""")
])
This prompt is critical for RAG quality. Notice:
- We explicitly tell the LLM to only use the context (prevents hallucination)
- We tell it to admit when information isn't available
- We ask for citations (source numbers)
Prompt Engineering Tip
The instruction "If the answer is not in the context, say I cannot find this information" is crucial. Without it, LLMs often hallucinate answers that sound plausible but aren't in your documents.
Step 5: FastAPI Application
Create src/api.py for the REST API:
"""FastAPI application for the RAG system."""
import tempfile
from pathlib import Path
from fastapi import FastAPI, UploadFile, File, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel
from src.rag_engine import RAGEngine
app = FastAPI(
title="Intelligent Document Q&A API",
description="RAG-powered document question answering system",
version="1.0.0"
)
app.add_middleware(
CORSMiddleware,
allow_origins=["*"],
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
# Initialize RAG engine
rag_engine = RAGEngine()
class QuestionRequest(BaseModel):
question: str
class AnswerResponse(BaseModel):
answer: str
sources: list
class IngestResponse(BaseModel):
message: str
chunks_created: int
@app.get("/")
async def root():
"""Health check endpoint."""
return {"status": "healthy", "service": "Intelligent Document Q&A"}
@app.post("/ingest", response_model=IngestResponse)
async def ingest_document(file: UploadFile = File(...)):
"""Upload and process a PDF document."""
if not file.filename.endswith(".pdf"):
raise HTTPException(status_code=400, detail="Only PDF files are supported")
try:
with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
content = await file.read()
tmp.write(content)
tmp_path = tmp.name
chunks_count = rag_engine.ingest_document(tmp_path)
Path(tmp_path).unlink()
return IngestResponse(
message=f"Successfully processed {file.filename}",
chunks_created=chunks_count
)
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))
@app.post("/query", response_model=AnswerResponse)
async def query_document(request: QuestionRequest):
"""Ask a question about the ingested documents."""
if not request.question.strip():
raise HTTPException(status_code=400, detail="Question cannot be empty")
try:
result = rag_engine.query(request.question)
return AnswerResponse(**result)
except ValueError as e:
raise HTTPException(status_code=400, detail=str(e))
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))
@app.delete("/clear")
async def clear_documents():
"""Clear all ingested documents."""
rag_engine.clear_vectorstore()
return {"message": "Vector store cleared successfully"}
if __name__ == "__main__":
import uvicorn
uvicorn.run(app, host="0.0.0.0", port=8000)
Understanding the API
The API provides three main endpoints:
| Endpoint | Method | Purpose |
|---|---|---|
| `/ingest` | POST | Upload a PDF document |
| `/query` | POST | Ask a question |
| `/clear` | DELETE | Reset the system |
File Upload Flow (/ingest):
# 1. Receive the uploaded file
file: UploadFile = File(...)
# 2. Save to a temporary file (required because PyPDFLoader needs a file path)
with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
content = await file.read()
tmp.write(content)
tmp_path = tmp.name
# 3. Process the document (chunk + embed + store)
chunks_count = rag_engine.ingest_document(tmp_path)
# 4. Clean up the temporary file
Path(tmp_path).unlink()
Why use Pydantic models?
class QuestionRequest(BaseModel):
question: str
Pydantic provides:
- Automatic validation (ensures question is a string)
- Automatic documentation (shows up in Swagger UI)
- Type hints for IDE support
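A quick illustration of what that validation buys you (standalone sketch; FastAPI turns the same ValidationError into an automatic 422 response for you):

from pydantic import BaseModel, ValidationError

class QuestionRequest(BaseModel):
    question: str

QuestionRequest(question="What is the main topic?")   # valid: parses cleanly

try:
    QuestionRequest()                                  # missing required field
except ValidationError as exc:
    print(exc)                                         # "Field required" -- FastAPI would return 422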
Step 6: Run and Test
Start the server:
python -m uvicorn src.api:app --reload
Test with curl:
# Upload a PDF
curl -X POST "http://localhost:8000/ingest" \
-H "accept: application/json" \
-F "file=@your-document.pdf"
# Ask a question
curl -X POST "http://localhost:8000/query" \
-H "Content-Type: application/json" \
-d '{"question": "What is the main topic of this document?"}'
Or visit http://localhost:8000/docs for the interactive Swagger UI.
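If you prefer scripting the API from Python, here is an equivalent client sketch. It assumes the server above is running on localhost:8000 and uses the requests library, which is not in the dependency list (add it with uv add requests); the PDF path is a placeholder:

import requests

BASE_URL = "http://localhost:8000"

# Upload a PDF (replace the path with a real file)
with open("your-document.pdf", "rb") as f:
    resp = requests.post(
        f"{BASE_URL}/ingest",
        files={"file": ("your-document.pdf", f, "application/pdf")},
    )
print(resp.json())   # {"message": "...", "chunks_created": ...}

# Ask a question
resp = requests.post(
    f"{BASE_URL}/query",
    json={"question": "What is the main topic of this document?"},
)
result = resp.json()
print(result["answer"])
for source in result["sources"]:
    print(source["page"], source["content"])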
Testing
Create tests/test_rag.py:
"""Tests for the RAG system."""
import pytest
from src.document_processor import DocumentProcessor
from langchain_core.documents import Document
class TestDocumentProcessor:
"""Tests for document processing."""
def test_chunking_preserves_content(self):
"""Verify chunking doesn't lose content."""
processor = DocumentProcessor()
test_doc = Document(
page_content="A" * 2500,
metadata={"source": "test.pdf", "page": 0}
)
chunks = processor.chunk_documents([test_doc])
# Should create multiple chunks
assert len(chunks) > 1
# Total content should be preserved
total_content = sum(len(c.page_content) for c in chunks)
assert total_content >= 2500
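    def test_chunks_carry_metadata(self):
        """Added example (not from the original tutorial): check the chunk metadata."""
        # chunk_documents() tags every chunk with a sequential chunk_id and the total count,
        # so we can assert both fields here.
        processor = DocumentProcessor()
        test_doc = Document(
            page_content="B" * 2500,
            metadata={"source": "test.pdf", "page": 0}
        )
        chunks = processor.chunk_documents([test_doc])
        assert [c.metadata["chunk_id"] for c in chunks] == list(range(len(chunks)))
        assert all(c.metadata["chunk_total"] == len(chunks) for c in chunks)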
# Run with: pytest tests/test_rag.py -v
Debugging Tips
No documents ingested yet
- Ensure you've uploaded a PDF before querying
- Check that the vector store path is writable
OpenAI API key invalid
- Verify your .env file has the correct key
- Check for extra whitespace in the key
PDF processing failed
- Ensure the PDF isn't password-protected
- Some scanned PDFs need OCR preprocessing
Key Concepts Recap
| Concept | What It Does | Why It Matters |
|---|---|---|
| Chunking | Splits documents into smaller pieces | Enables precise retrieval; too large chunks dilute relevance |
| Embeddings | Converts text to vectors | Enables semantic search (meaning-based, not keyword-based) |
| Vector Store | Stores and searches embeddings | Fast similarity search across thousands of chunks |
| Retriever | Finds relevant chunks | Bridges user question to stored knowledge |
| Prompt Template | Structures the LLM request | Controls answer quality and prevents hallucination |
Extensions
| Level | Ideas |
|---|---|
| Easy | Add conversation history, support multiple file formats, add streaming |
| Medium | Add reranking with Cohere, implement hybrid search, citation highlighting |
| Advanced | RAG evaluation pipeline, async processing queue, multi-tenancy support |
Summary
You've built a complete RAG system that:
- Processes PDF documents into searchable chunks
- Creates and stores vector embeddings
- Performs semantic similarity search
- Generates contextual answers with sources
- Exposes a REST API for integration
Next: Multi-Document RAG