Structured Extraction
Extract structured data from unstructured text using LLMs
Structured Extraction
Transform unstructured text into structured, validated data
TL;DR
Define a Pydantic schema (name, email, phone...), pass it to an LLM with Instructor library, and get back validated JSON. Instructor handles retries if the LLM output doesn't match your schema. Schema-first = reliable extraction.
What You'll Learn
- Using Pydantic for schema definition
- Function calling for structured output
- Instructor library for reliable extraction
- Handling extraction errors and validation
Tech Stack
| Component | Technology |
|---|---|
| LLM | OpenAI GPT-4 |
| Schema | Pydantic |
| Extraction | Instructor |
| API | FastAPI |
Extraction Pipeline
┌─────────────────────────────────────────────────────────────────────────────┐
│ STRUCTURED EXTRACTION PIPELINE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ ┌──────────────┐ ┌─────────────────┐ │
│ │ Unstructured │ │ Pydantic │ │ LLM + Instructor│ │
│ │ Text │───►│ Schema │───►│ Extraction │ │
│ │ │ │ │ │ │ │
│ │ "Hi, I'm John │ │ class Contact│ │ response_model= │ │
│ │ at john@..." │ │ name: str │ │ Contact │ │
│ │ │ │ email: str │ │ │ │
│ └─────────────────┘ └──────────────┘ └────────┬─────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────┐ │
│ │ Pydantic Validation │ │
│ │ • Type checking (str, int, float) │ │
│ │ • Field constraints (ge=1, le=5) │ │
│ │ • Custom validators (@field_validator) │ │
│ │ • Auto-retry on validation failure │ │
│ └────────────────────┬────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────┐ │
│ │ Structured Data (validated JSON) │ │
│ │ {"name": "John", "email": "john@..."} │ │
│ └─────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘Project Structure
structured-extraction/
├── src/
│ ├── __init__.py
│ ├── schemas.py # Pydantic schemas
│ ├── extractor.py # Extraction logic
│ └── api.py # FastAPI application
├── tests/
│ └── test_extraction.py
└── requirements.txtImplementation
Step 1: Setup
openai>=1.0.0
instructor>=0.4.0
pydantic>=2.0.0
fastapi>=0.100.0
uvicorn>=0.23.0Step 2: Define Schemas
"""
Pydantic schemas for structured extraction.
"""
from pydantic import BaseModel, Field, field_validator
from typing import Optional
from datetime import date
from enum import Enum
class ContactInfo(BaseModel):
"""Extracted contact information."""
name: str = Field(description="Full name of the person")
email: Optional[str] = Field(None, description="Email address")
phone: Optional[str] = Field(None, description="Phone number")
company: Optional[str] = Field(None, description="Company or organization")
title: Optional[str] = Field(None, description="Job title")
@field_validator("email")
@classmethod
def validate_email(cls, v):
if v and "@" not in v:
raise ValueError("Invalid email format")
return v
class Sentiment(str, Enum):
POSITIVE = "positive"
NEGATIVE = "negative"
NEUTRAL = "neutral"
class ReviewAnalysis(BaseModel):
"""Analyzed product review."""
sentiment: Sentiment = Field(description="Overall sentiment")
rating: int = Field(ge=1, le=5, description="Rating from 1-5")
pros: list[str] = Field(default_factory=list, description="Positive points")
cons: list[str] = Field(default_factory=list, description="Negative points")
summary: str = Field(description="Brief summary of the review")
class InvoiceItem(BaseModel):
"""Single item on an invoice."""
description: str
quantity: int = Field(ge=1)
unit_price: float = Field(ge=0)
total: float = Field(ge=0)
class Invoice(BaseModel):
"""Extracted invoice data."""
invoice_number: str = Field(description="Invoice ID or number")
vendor_name: str = Field(description="Name of the vendor")
customer_name: str = Field(description="Name of the customer")
date: Optional[str] = Field(None, description="Invoice date")
due_date: Optional[str] = Field(None, description="Payment due date")
items: list[InvoiceItem] = Field(default_factory=list)
subtotal: Optional[float] = None
tax: Optional[float] = None
total: float = Field(description="Total amount")
currency: str = Field(default="USD")
class EventInfo(BaseModel):
"""Extracted event information."""
title: str = Field(description="Event name or title")
date: Optional[str] = Field(None, description="Event date")
time: Optional[str] = Field(None, description="Event time")
location: Optional[str] = Field(None, description="Event location or venue")
description: Optional[str] = Field(None, description="Event description")
organizer: Optional[str] = Field(None, description="Event organizer")Understanding Pydantic Schema Design:
┌─────────────────────────────────────────────────────────────────────────────┐
│ WHY PYDANTIC SCHEMAS FOR LLM OUTPUT │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Without Schema: With Pydantic Schema: │
│ ┌─────────────────────────────┐ ┌─────────────────────────────────┐ │
│ │ LLM output: "The name is │ │ class ContactInfo(BaseModel): │ │
│ │ John, email john@..." │ │ name: str │ │
│ │ │ │ email: Optional[str] │ │
│ │ Problems: │ │ │ │
│ │ • Format varies each time │ │ Benefits: │ │
│ │ • Hard to parse reliably │ │ • Consistent JSON output │ │
│ │ • No validation │ │ • Type checking built-in │ │
│ │ • Can't use in code easily │ │ • IDE autocomplete works │ │
│ └─────────────────────────────┘ └─────────────────────────────────┘ │
│ │
│ Field Options and When to Use: │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Field(...) │ Purpose │ Example │ │
│ │ ─────────────────────┼──────────────────────┼───────────────────── │ │
│ │ description="..." │ Tells LLM what to │ "Full name of person" │ │
│ │ │ extract │ │ │
│ │ ─────────────────────┼──────────────────────┼───────────────────── │ │
│ │ ge=1, le=5 │ Number range │ Rating 1-5 stars │ │
│ │ ─────────────────────┼──────────────────────┼───────────────────── │ │
│ │ default=None │ Optional field │ Phone may be missing │ │
│ │ ─────────────────────┼──────────────────────┼───────────────────── │ │
│ │ default_factory=list │ Empty list default │ Pros/cons lists │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ Custom Validators (@field_validator): │
│ • Run AFTER LLM extraction │
│ • Can fix or reject bad data │
│ • Triggers retry if validation fails │
│ │
└─────────────────────────────────────────────────────────────────────────────┘Schema Design Patterns:
| Pattern | Example | When to Use |
|---|---|---|
Optional[str] | Email might not be in text | Field may be missing |
list[str] | Multiple pros/cons | Multiple values expected |
Enum | Sentiment (positive/negative/neutral) | Fixed set of valid values |
Nested BaseModel | InvoiceItem inside Invoice | Complex nested structures |
Step 3: Extractor with Instructor
"""
Structured extraction using Instructor library.
"""
from typing import Type, TypeVar
from pydantic import BaseModel
import instructor
from openai import OpenAI
from .schemas import ContactInfo, ReviewAnalysis, Invoice, EventInfo
T = TypeVar("T", bound=BaseModel)
class StructuredExtractor:
"""
Extract structured data from text using LLMs.
Uses the Instructor library for reliable extraction
with automatic retries and validation.
"""
def __init__(self, model: str = "gpt-4-turbo-preview"):
# Patch OpenAI client with Instructor
self.client = instructor.from_openai(OpenAI())
self.model = model
def extract(
self,
text: str,
schema: Type[T],
instructions: str = ""
) -> T:
"""
Extract structured data matching the schema.
Args:
text: Unstructured text to extract from
schema: Pydantic model defining the structure
instructions: Additional extraction instructions
Returns:
Validated Pydantic model instance
"""
system_prompt = f"""Extract information from the text into the specified structure.
{instructions}
Rules:
- Only extract information that is explicitly stated
- Use null/None for fields not found in the text
- Be precise and accurate"""
return self.client.chat.completions.create(
model=self.model,
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": text}
],
response_model=schema,
max_retries=2 # Retry on validation failure
)
def extract_contact(self, text: str) -> ContactInfo:
"""Extract contact information."""
return self.extract(
text,
ContactInfo,
"Extract all contact details including name, email, phone, company, and job title."
)
def analyze_review(self, text: str) -> ReviewAnalysis:
"""Analyze a product review."""
return self.extract(
text,
ReviewAnalysis,
"Analyze the sentiment, identify pros and cons, and provide a rating."
)
def extract_invoice(self, text: str) -> Invoice:
"""Extract invoice data."""
return self.extract(
text,
Invoice,
"Extract all invoice details including line items, totals, and dates."
)
def extract_event(self, text: str) -> EventInfo:
"""Extract event information."""
return self.extract(
text,
EventInfo,
"Extract event details including title, date, time, location, and organizer."
)
class BatchExtractor:
"""Extract multiple items from text."""
def __init__(self, model: str = "gpt-4-turbo-preview"):
self.client = instructor.from_openai(OpenAI())
self.model = model
def extract_all(
self,
text: str,
schema: Type[T],
instructions: str = ""
) -> list[T]:
"""Extract all matching items from text."""
# Create a wrapper model for list extraction
class ItemList(BaseModel):
items: list[schema]
result = self.client.chat.completions.create(
model=self.model,
messages=[
{
"role": "system",
"content": f"Extract ALL items matching the schema. {instructions}"
},
{"role": "user", "content": text}
],
response_model=ItemList
)
return result.itemsUnderstanding How Instructor Works:
┌─────────────────────────────────────────────────────────────────────────────┐
│ INSTRUCTOR LIBRARY INTERNALS │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ What instructor.from_openai(OpenAI()) Does: │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ 1. Patches the OpenAI client to accept response_model parameter │ │
│ │ │ │
│ │ 2. Converts Pydantic schema to JSON Schema: │ │
│ │ class Contact: │ │
│ │ name: str ──► {"type": "object", │ │
│ │ email: str "properties": { │ │
│ │ "name": {"type": "string"}, │ │
│ │ "email": {"type": "string"} │ │
│ │ }} │ │
│ │ │ │
│ │ 3. Uses function calling (tool_choice) to force structured output │ │
│ │ │ │
│ │ 4. Parses response and validates against Pydantic model │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ Retry Flow (max_retries=2): │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ LLM Response ──► Parse JSON ──► Pydantic Validation │ │
│ │ │ │ │ │
│ │ │ Fail? ─┼──► Retry with error msg │ │
│ │ │ │ in prompt │ │
│ │ │ │ │ │ │
│ │ │ Pass? ─┼──► Return validated │ │
│ │ │ │ object │ │
│ │ │ │ │ │
│ │ └──► If JSON parse fails: retry with "invalid JSON" msg │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘Why Use Instructor vs Raw Function Calling:
| Approach | Pros | Cons |
|---|---|---|
| Raw function calling | No extra dependency | Manual validation, no retries |
| Instructor | Auto-validation, retries, clean API | Extra dependency |
| JSON mode | Simple | No schema enforcement |
Batch Extraction Pattern:
The BatchExtractor wraps your schema in a list wrapper (class ItemList: items: list[T]) because LLMs handle lists more reliably when explicitly asked for a list container rather than raw array output.
Step 4: FastAPI Application
"""FastAPI application for structured extraction."""
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import Any
from .extractor import StructuredExtractor
from .schemas import ContactInfo, ReviewAnalysis, Invoice, EventInfo
app = FastAPI(
title="Structured Extraction API",
description="Extract structured data from unstructured text"
)
extractor = StructuredExtractor()
class ExtractionRequest(BaseModel):
text: str
@app.post("/extract/contact", response_model=ContactInfo)
async def extract_contact(request: ExtractionRequest):
"""Extract contact information from text."""
try:
return extractor.extract_contact(request.text)
except Exception as e:
raise HTTPException(500, str(e))
@app.post("/extract/review", response_model=ReviewAnalysis)
async def analyze_review(request: ExtractionRequest):
"""Analyze a product review."""
try:
return extractor.analyze_review(request.text)
except Exception as e:
raise HTTPException(500, str(e))
@app.post("/extract/invoice", response_model=Invoice)
async def extract_invoice(request: ExtractionRequest):
"""Extract invoice data from text."""
try:
return extractor.extract_invoice(request.text)
except Exception as e:
raise HTTPException(500, str(e))
@app.post("/extract/event", response_model=EventInfo)
async def extract_event(request: ExtractionRequest):
"""Extract event information from text."""
try:
return extractor.extract_event(request.text)
except Exception as e:
raise HTTPException(500, str(e))Understanding the API Architecture:
┌─────────────────────────────────────────────────────────────────────────────┐
│ WHY SEPARATE ENDPOINTS PER EXTRACTION TYPE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Option A: Single Generic Endpoint Option B: Type-Specific Endpoints│
│ ┌─────────────────────────────────┐ ┌─────────────────────────────┐ │
│ │ POST /extract │ │ POST /extract/contact │ │
│ │ Body: {text, type: "contact"} │ │ POST /extract/review │ │
│ │ │ │ POST /extract/invoice │ │
│ │ Problems: │ │ │ │
│ │ • Response type varies │ │ Benefits: │ │
│ │ • Hard to document │ │ • Clear response_model │ │
│ │ • No type hints in clients │ │ • Auto-generated OpenAPI │ │
│ └─────────────────────────────────┘ │ • Type-safe client SDKs │ │
│ └─────────────────────────────┘ │
│ │
│ API Flow: │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ Client ──► POST /extract/contact ──► FastAPI validates request │ │
│ │ │ │ │ │
│ │ │ ▼ │ │
│ │ │ extractor.extract_contact() │ │
│ │ │ │ │ │
│ │ │ ▼ │ │
│ │ │ LLM + Instructor │ │
│ │ │ │ │ │
│ │ │ ▼ │ │
│ │ │ ContactInfo (validated) │ │
│ │ │ │ │ │
│ │ ◄─────────────────────────────── │ │
│ │ │ │
│ │ Response: {"name": "John", "email": "john@...", ...} │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘Error Handling Strategy:
| Error Type | Cause | Response |
|---|---|---|
| Validation error | LLM output doesn't match schema | 500 after max_retries |
| Missing text | Empty request body | FastAPI auto-returns 422 |
| LLM error | API failure, rate limit | 500 with error message |
Example Usage
# Extract contact info
curl -X POST http://localhost:8000/extract/contact \
-H "Content-Type: application/json" \
-d '{"text": "Hi, I am John Smith, CTO at TechCorp. You can reach me at john@techcorp.com or call 555-0123."}'
# Analyze review
curl -X POST http://localhost:8000/extract/review \
-H "Content-Type: application/json" \
-d '{"text": "Great product! Fast delivery and excellent quality. The only downside is the high price. Would recommend to others."}'Key Concepts Recap
| Concept | What It Is | Why It Matters |
|---|---|---|
| Pydantic Schema | Python class defining data structure | Ensures output matches expected format |
| Instructor | Library that patches OpenAI client | Handles retries, validation, structured output |
| response_model | Pass schema to LLM call | LLM knows what structure to return |
| Field Validators | Custom validation rules | Catch invalid data (bad emails, etc.) |
| max_retries | Auto-retry on validation failure | Handles LLM inconsistency gracefully |
| Batch Extraction | Extract multiple items from text | Find all contacts, all events, etc. |
Next Steps
- Content Generation - Generate content with templates
- Code Assistant - Build a coding AI