Multi-Modal Application
Build applications that understand and generate text, images, and audio
| Property | Value |
|---|---|
| Difficulty | Advanced |
| Time | ~3 days |
| Code Size | ~500 LOC |
| Prerequisites | Chatbot, Structured Extraction |
TL;DR
Use GPT-4 Vision to analyze images alongside text. Build multi-modal RAG by storing image descriptions (generated via vision API) in ChromaDB alongside text documents. Query both modalities and combine context for richer answers.
Why Vision + Text Unlocks New Use Cases
Text-only LLMs cannot process product images, analyze charts, read diagrams, or understand screenshots. Adding vision capabilities opens entirely new application categories: visual Q&A over product catalogs, document understanding for scanned PDFs, and multi-modal RAG that combines text and images for richer answers.
The key insight: you do not need to embed raw images. You embed LLM-generated descriptions of images into the same vector space as text. This lets you search across modalities with a single query.
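The single-vector-space idea can be seen in miniature. In the toy sketch below, a bag-of-words overlap score stands in for real embedding similarity, and the indexed documents are invented; the point is that one query ranks a plain text chunk and an image description on the same scale:

```python
# Toy illustration: image descriptions and text chunks share one index,
# so a single query searches both modalities. Bag-of-words overlap
# stands in for real embedding similarity.

def _words(s: str) -> set[str]:
    """Lowercased words with trailing punctuation stripped."""
    return {w.strip(".,:").lower() for w in s.split()}

def overlap_score(query: str, doc: str) -> float:
    """Fraction of query words that also appear in the document."""
    q, d = _words(query), _words(doc)
    return len(q & d) / len(q) if q else 0.0

# One index holds plain text AND an LLM-generated image description.
index = [
    ("text", "Return policy: items may be returned within 30 days."),
    ("image", "A red canvas sneaker with white laces on a wooden table."),
]

def search(query: str) -> tuple[str, str, float]:
    """Return the best-matching entry: (modality, document, score)."""
    return max(
        ((kind, doc, overlap_score(query, doc)) for kind, doc in index),
        key=lambda t: t[2],
    )
```

A visual query like "red sneaker with laces" lands on the image description, while "return policy" lands on the text chunk, even though both live in the same index.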
Text-Only vs Multi-Modal Applications
Text-only LLM
Can process text documents, answer text questions, generate text. Cannot understand product photos, charts, diagrams, or screenshots. Misses half the information in most business documents.
Multi-modal LLM (vision + text)
Recommended. Processes images alongside text. Answers "What does this chart show?" or "Compare these two products from their photos." Enables document understanding for PDFs with images and tables.
What You'll Learn
- Using vision-capable LLMs
- Image understanding and description
- Multi-modal RAG systems
- Building visual Q&A applications
Tech Stack
| Component | Technology | Why |
|---|---|---|
| LLM | GPT-4o / Claude 3.5 Sonnet | Native vision capabilities with strong reasoning |
| Image Processing | Pillow | Resize and format images before API calls |
| Vector Store | ChromaDB | Store both text and image description embeddings |
| API | FastAPI | File upload support with python-multipart |
Multi-Modal Architecture
- Text Encoder: embeddings for text input
- Image Encoder: CLIP/vision for image input
- Audio Encoder: Whisper for audio input
Project Structure
```
multi-modal-app/
├── src/
│   ├── __init__.py
│   ├── vision.py       # Image understanding
│   ├── embeddings.py   # Multi-modal embeddings
│   ├── rag.py          # Multi-modal RAG
│   └── api.py          # FastAPI application
├── tests/
└── requirements.txt
```
Implementation
Step 1: Setup
```
openai>=1.0.0
anthropic>=0.18.0
pillow>=10.0.0
chromadb>=0.4.0
open-clip-torch>=2.20.0
fastapi>=0.100.0
uvicorn>=0.23.0
python-multipart>=0.0.6
```
Step 2: Vision Module
```python
"""
Image understanding using vision-capable LLMs.
"""
from dataclasses import dataclass
from typing import Optional, Union
from pathlib import Path
import base64

from openai import OpenAI


@dataclass
class ImageAnalysis:
    """Result of image analysis."""
    description: str
    objects: list[str]
    text_found: list[str]
    colors: list[str]
    mood: str


class VisionAnalyzer:
    """
    Analyze images using GPT-4 Vision.
    """

    def __init__(self, model: str = "gpt-4o"):
        self.client = OpenAI()
        self.model = model

    def analyze(
        self,
        image: Union[str, Path, bytes],
        question: str = "Describe this image in detail"
    ) -> str:
        """
        Analyze an image and answer questions about it.

        Args:
            image: Image path, URL, or bytes
            question: Question about the image

        Returns:
            Analysis or answer
        """
        image_content = self._prepare_image(image)
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": question},
                        image_content
                    ]
                }
            ],
            max_tokens=1000
        )
        return response.choices[0].message.content

    def extract_info(self, image: Union[str, Path, bytes]) -> ImageAnalysis:
        """Extract structured information from an image."""
        prompt = """Analyze this image and extract:
1. A detailed description
2. List of main objects/elements
3. Any text visible in the image
4. Dominant colors
5. Overall mood/atmosphere

Format as:
DESCRIPTION: [description]
OBJECTS: [comma-separated list]
TEXT: [comma-separated list or "none"]
COLORS: [comma-separated list]
MOOD: [single word or short phrase]"""
        response = self.analyze(image, prompt)
        return self._parse_analysis(response)

    def compare_images(
        self,
        image1: Union[str, Path, bytes],
        image2: Union[str, Path, bytes]
    ) -> str:
        """Compare two images."""
        img1_content = self._prepare_image(image1)
        img2_content = self._prepare_image(image2)
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": "Compare these two images. What are the similarities and differences?"},
                        img1_content,
                        img2_content
                    ]
                }
            ],
            max_tokens=1000
        )
        return response.choices[0].message.content

    def _prepare_image(self, image: Union[str, Path, bytes]) -> dict:
        """Prepare image for the API."""
        if isinstance(image, bytes):
            b64 = base64.standard_b64encode(image).decode()
            return {
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{b64}"}
            }
        image_str = str(image)
        if image_str.startswith(("http://", "https://")):
            return {
                "type": "image_url",
                "image_url": {"url": image_str}
            }
        # Local file: read bytes, base64-encode, build a data URI
        path = Path(image_str)
        with open(path, "rb") as f:
            b64 = base64.standard_b64encode(f.read()).decode()
        ext = path.suffix.lower()
        mime = {"jpg": "jpeg", "jpeg": "jpeg", "png": "png", "gif": "gif", "webp": "webp"}
        mime_type = mime.get(ext.lstrip("."), "jpeg")
        return {
            "type": "image_url",
            "image_url": {"url": f"data:image/{mime_type};base64,{b64}"}
        }

    def _parse_analysis(self, response: str) -> ImageAnalysis:
        """Parse structured analysis response."""
        lines = response.strip().split("\n")
        result = {}
        for line in lines:
            if ":" in line:
                key, value = line.split(":", 1)
                result[key.strip().upper()] = value.strip()
        return ImageAnalysis(
            description=result.get("DESCRIPTION", ""),
            objects=[o.strip() for o in result.get("OBJECTS", "").split(",") if o.strip()],
            text_found=[t.strip() for t in result.get("TEXT", "").split(",") if t.strip() and t.lower() != "none"],
            colors=[c.strip() for c in result.get("COLORS", "").split(",") if c.strip()],
            mood=result.get("MOOD", "")
        )
```
Understanding Vision API Input Formats:
Image Input Options for GPT-4V
URL
Recommended. Easiest for public images. Pass the URL directly: {"type": "image_url", "image_url": {"url": "https://..."}}.
Base64
Required for local files. Steps: read the file bytes, base64-encode them, and wrap the result in a data URI. Format: "data:image/jpeg;base64,{b64_string}".
Multiple Images
Compare or combine context. Send an array of image_url objects alongside a text prompt in the content field.
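Building that content array is plain dict assembly. A minimal sketch (the helper name and example URLs are invented):

```python
def vision_content(prompt: str, *image_urls: str) -> list[dict]:
    """One text part followed by one image_url part per image,
    in the shape the vision chat API expects."""
    parts: list[dict] = [{"type": "text", "text": prompt}]
    parts += [{"type": "image_url", "image_url": {"url": u}} for u in image_urls]
    return parts

content = vision_content(
    "Compare these two products.",
    "https://example.com/a.jpg",
    "https://example.com/b.jpg",
)
```

The resulting list slots straight into messages=[{"role": "user", "content": content}].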
MIME Type Mapping:
| Extension | MIME Type | Notes |
|---|---|---|
| .jpg, .jpeg | image/jpeg | Most common, smaller files |
| .png | image/png | Supports transparency |
| .gif | image/gif | Animated images (first frame used) |
| .webp | image/webp | Modern format, good compression |
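The MIME table and the base64 steps combine into one small helper. This is a sketch, not part of the module above: to_data_uri is a hypothetical name, it fails loudly on unknown extensions where _prepare_image silently assumes JPEG, and the demo writes a throwaway temp file standing in for a real image:

```python
import base64
import tempfile
from pathlib import Path

# Extension-to-MIME mapping from the table above.
MIME_TYPES = {
    ".jpg": "image/jpeg", ".jpeg": "image/jpeg", ".png": "image/png",
    ".gif": "image/gif", ".webp": "image/webp",
}

def to_data_uri(path: str) -> str:
    """Read bytes, base64-encode, and wrap in a data URI.
    Rejects unknown extensions instead of silently assuming JPEG."""
    p = Path(path)
    mime = MIME_TYPES.get(p.suffix.lower())
    if mime is None:
        raise ValueError(f"Unsupported image type: {p.suffix or '(no extension)'}")
    b64 = base64.standard_b64encode(p.read_bytes()).decode()
    return f"data:{mime};base64,{b64}"

# Demo with a throwaway file standing in for a real photo.
with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as f:
    f.write(b"fake image bytes")
    demo_path = f.name

uri = to_data_uri(demo_path)
```

Failing loudly versus falling back to JPEG is an application choice; a user-facing upload endpoint usually wants the explicit error.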
Step 3: Multi-Modal RAG
```python
"""
Multi-modal RAG with image and text.
"""
from dataclasses import dataclass
from typing import Union, Optional
from pathlib import Path

import chromadb
from chromadb.utils import embedding_functions

from .vision import VisionAnalyzer


@dataclass
class MultiModalDocument:
    """Document with text and optional image."""
    id: str
    text: str
    image_path: Optional[str] = None
    image_description: Optional[str] = None
    metadata: Optional[dict] = None


class MultiModalRAG:
    """
    RAG system that handles both text and images.
    """

    def __init__(self, persist_dir: str = "./multimodal_index"):
        self.client = chromadb.PersistentClient(path=persist_dir)
        self.embedding_fn = embedding_functions.OpenAIEmbeddingFunction(
            model_name="text-embedding-3-small"
        )
        # Separate collections for text and images
        self.text_collection = self.client.get_or_create_collection(
            name="text_docs",
            embedding_function=self.embedding_fn
        )
        self.image_collection = self.client.get_or_create_collection(
            name="image_docs",
            embedding_function=self.embedding_fn
        )
        self.vision = VisionAnalyzer()

    def add_document(self, doc: MultiModalDocument) -> None:
        """Add a document to the index."""
        # Add text to text collection
        self.text_collection.upsert(
            ids=[doc.id],
            documents=[doc.text],
            metadatas=[doc.metadata or {}]
        )
        # If there is an image, analyze it and add it to the image collection
        if doc.image_path:
            description = doc.image_description
            if not description:
                description = self.vision.analyze(doc.image_path)
            self.image_collection.upsert(
                ids=[doc.id],
                documents=[description],
                metadatas=[{
                    "image_path": doc.image_path,
                    "type": "image",
                    **(doc.metadata or {})
                }]
            )

    def query(
        self,
        question: str,
        n_results: int = 5,
        include_images: bool = True
    ) -> dict:
        """
        Query the multi-modal knowledge base.

        Args:
            question: User question
            n_results: Number of results per modality
            include_images: Whether to search images

        Returns:
            Combined results from text and images
        """
        results = {"text": [], "images": []}
        # Search text
        text_results = self.text_collection.query(
            query_texts=[question],
            n_results=n_results,
            include=["documents", "metadatas", "distances"]
        )
        for i, doc in enumerate(text_results["documents"][0]):
            results["text"].append({
                "content": doc,
                "metadata": text_results["metadatas"][0][i],
                "relevance": 1 - text_results["distances"][0][i]
            })
        # Search images
        if include_images:
            image_results = self.image_collection.query(
                query_texts=[question],
                n_results=n_results,
                include=["documents", "metadatas", "distances"]
            )
            for i, doc in enumerate(image_results["documents"][0]):
                results["images"].append({
                    "description": doc,
                    "image_path": image_results["metadatas"][0][i].get("image_path"),
                    "relevance": 1 - image_results["distances"][0][i]
                })
        return results

    def answer(
        self,
        question: str,
        include_images: bool = True
    ) -> str:
        """
        Answer a question using multi-modal context.
        """
        from openai import OpenAI
        client = OpenAI()
        # Get relevant context
        context = self.query(question, include_images=include_images)
        # Build prompt with context
        prompt = f"Question: {question}\n\nRelevant information:\n"
        for text in context["text"][:3]:
            prompt += f"\nText: {text['content']}\n"
        for img in context["images"][:2]:
            prompt += f"\nImage description: {img['description']}\n"
        prompt += "\nAnswer the question based on the above information."
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content
```
Understanding Multi-Modal RAG Architecture:
Multi-Modal RAG Flow
1. Indexing Phase: embed text documents and LLM-generated image descriptions into separate collections
2. Query Phase: search both collections with the same question
3. Answer Generation: combine the retrieved text and image context into a single prompt
Why Separate Collections?
| Reason | Explanation |
|---|---|
| Different metadata | Images have paths, documents have sources |
| Different relevance | Text matches differently than image descriptions |
| Selective retrieval | Some queries only need text OR images |
| Easier debugging | Inspect modalities independently |
Key Insight: We don't embed raw images - we embed their descriptions. This means image search is actually semantic search over LLM-generated descriptions.
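The cross-modal step of answer(), merging top text chunks and image descriptions into one prompt, is pure string assembly and can be isolated as a function. A sketch, with invented retrieval results shaped like the output of MultiModalRAG.query():

```python
def build_prompt(question: str, context: dict) -> str:
    """Merge top text chunks and image descriptions into a single prompt,
    mirroring the context assembly inside answer() above."""
    prompt = f"Question: {question}\n\nRelevant information:\n"
    for item in context["text"][:3]:          # top 3 text chunks
        prompt += f"\nText: {item['content']}\n"
    for item in context["images"][:2]:        # top 2 image descriptions
        prompt += f"\nImage description: {item['description']}\n"
    return prompt + "\nAnswer the question based on the above information."

# Invented retrieval results for illustration.
ctx = {
    "text": [{"content": "The Model X speaker ships in matte black."}],
    "images": [{"description": "A matte black speaker with a fabric grille."}],
}
prompt = build_prompt("What does the product look like?", ctx)
```

Keeping prompt assembly separate from retrieval and generation also makes it easy to unit-test the context window budget.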
Step 4: FastAPI Application
```python
"""FastAPI application for multi-modal AI."""
import os
import shutil
import tempfile
from pathlib import Path

from fastapi import FastAPI, UploadFile, File, Form
from pydantic import BaseModel

from .vision import VisionAnalyzer
from .rag import MultiModalRAG

app = FastAPI(
    title="Multi-Modal AI API",
    description="AI that understands text and images"
)

vision = VisionAnalyzer()
rag = MultiModalRAG()


class AnalyzeResponse(BaseModel):
    description: str
    objects: list[str]
    text_found: list[str]
    colors: list[str]
    mood: str


@app.post("/analyze", response_model=AnalyzeResponse)
async def analyze_image(
    file: UploadFile = File(...),
    question: str = Form(default="Describe this image in detail")
):
    """Analyze an uploaded image."""
    # Save to a temp file, keeping the original extension for MIME detection
    suffix = Path(file.filename or "upload.jpg").suffix
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        shutil.copyfileobj(file.file, tmp)
        tmp_path = tmp.name
    try:
        if question == "Describe this image in detail":
            result = vision.extract_info(tmp_path)
            return AnalyzeResponse(
                description=result.description,
                objects=result.objects,
                text_found=result.text_found,
                colors=result.colors,
                mood=result.mood
            )
        else:
            response = vision.analyze(tmp_path, question)
            return AnalyzeResponse(
                description=response,
                objects=[],
                text_found=[],
                colors=[],
                mood=""
            )
    finally:
        os.unlink(tmp_path)


class QuestionRequest(BaseModel):
    question: str
    include_images: bool = True


@app.post("/ask")
async def ask_question(request: QuestionRequest):
    """Ask a question using multi-modal RAG."""
    answer = rag.answer(
        request.question,
        include_images=request.include_images
    )
    return {"answer": answer}
```
Example Usage
```bash
# Analyze an image
curl -X POST http://localhost:8000/analyze \
  -F "file=@photo.jpg" \
  -F "question=What objects are in this image?"

# Ask a multi-modal question
curl -X POST http://localhost:8000/ask \
  -H "Content-Type: application/json" \
  -d '{"question": "What does the product look like?"}'
```
Key Concepts Recap
| Concept | What It Is | Why It Matters |
|---|---|---|
| Vision LLMs | Models that accept images + text (GPT-4V, Claude 3) | Understand visual content, not just text |
| Base64 Encoding | Convert images to text for API transmission | Send local images to vision APIs |
| Image Analysis | Extract objects, colors, text, mood from images | Structured understanding vs raw description |
| Multi-Modal RAG | Index both text docs and image descriptions | Search across modalities for richer context |
| Separate Collections | Different ChromaDB collections for text vs images | Different retrieval strategies per modality |
| Image Description Index | Store LLM-generated descriptions, not raw images | Text embeddings work on descriptions |
| Cross-Modal Context | Combine text and image results in prompts | Complete picture for better answers |
Next Steps
- Fine-Tuning LLMs - Customize models
- LLM Evaluation - Test and evaluate