Structured Extraction
Extract structured data from unstructured text using LLMs
| Property | Value |
|---|---|
| Difficulty | Intermediate |
| Time | ~4 hours |
| Code Size | ~300 LOC |
| Prerequisites | Chatbot, Python 3.10+ |
TL;DR
Define a Pydantic schema (name, email, phone...), pass it to an LLM through the Instructor library, and get back a validated Pydantic object that serializes cleanly to JSON. Instructor retries when the LLM output doesn't match your schema. Schema-first = reliable extraction.
Why Structured Output Beats Regex and Manual Parsing
Traditional extraction relies on regex patterns and hand-written parsers. They break on every new format, miss edge cases, and require constant maintenance. An email regex that handles 90% of formats still fails on the other 10%.
LLMs understand context, but raw LLM output is unpredictable -- sometimes JSON, sometimes prose, sometimes malformed. Schema-first extraction with Instructor gives you the best of both worlds: LLM intelligence with Pydantic validation guarantees.
Extraction Approaches Compared
Regex / rule-based
Brittle patterns that break on format changes. "John Smith, CTO" works but "CTO John Smith" fails. Every new pattern needs a new rule.
Raw LLM output
LLM understands context but output format varies. Sometimes returns JSON, sometimes prose. No validation, no retries on malformed output.
LLM + Pydantic schema + Instructor
Recommended. Define the structure once as a Pydantic model. Instructor forces the LLM to output matching JSON, validates it, and retries automatically on failure. Reliable and adaptable.
What You'll Learn
- Using Pydantic for schema definition
- Function calling for structured output
- Instructor library for reliable extraction
- Handling extraction errors and validation
Tech Stack
| Component | Technology | Why |
|---|---|---|
| LLM | OpenAI GPT-4o-mini | Fast and cheap for extraction tasks |
| Schema | Pydantic v2 | Type-safe validation with Field descriptions |
| Extraction | Instructor | Auto-retries, schema enforcement, clean API |
| API | FastAPI | Auto-generated OpenAPI docs from Pydantic models |
Extraction Pipeline
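Structured Extraction Pipeline: unstructured text is sent to the LLM together with a Pydantic schema (via Instructor); the response is validated, retried on failure, and returned as a typed object.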
Project Structure
structured-extraction/
├── src/
│ ├── __init__.py
│ ├── schemas.py # Pydantic schemas
│ ├── extractor.py # Extraction logic
│ └── api.py # FastAPI application
├── tests/
│ └── test_extraction.py
└── requirements.txt
Implementation
Step 1: Setup
openai>=1.0.0
instructor>=1.0.0
pydantic>=2.0.0
fastapi>=0.100.0
uvicorn>=0.23.0
Step 2: Define Schemas
"""
Pydantic schemas for structured extraction.
"""
from pydantic import BaseModel, Field, field_validator
from typing import Optional
from datetime import date
from enum import Enum


class ContactInfo(BaseModel):
    """Extracted contact information."""
    name: str = Field(description="Full name of the person")
    email: Optional[str] = Field(None, description="Email address")
    phone: Optional[str] = Field(None, description="Phone number")
    company: Optional[str] = Field(None, description="Company or organization")
    title: Optional[str] = Field(None, description="Job title")

    @field_validator("email")
    @classmethod
    def validate_email(cls, v):
        # Minimal sanity check; a failure here triggers an Instructor retry.
        if v and "@" not in v:
            raise ValueError("Invalid email format")
        return v


class Sentiment(str, Enum):
    POSITIVE = "positive"
    NEGATIVE = "negative"
    NEUTRAL = "neutral"


class ReviewAnalysis(BaseModel):
    """Analyzed product review."""
    sentiment: Sentiment = Field(description="Overall sentiment")
    rating: int = Field(ge=1, le=5, description="Rating from 1-5")
    pros: list[str] = Field(default_factory=list, description="Positive points")
    cons: list[str] = Field(default_factory=list, description="Negative points")
    summary: str = Field(description="Brief summary of the review")


class InvoiceItem(BaseModel):
    """Single item on an invoice."""
    description: str
    quantity: int = Field(ge=1)
    unit_price: float = Field(ge=0)
    total: float = Field(ge=0)


class Invoice(BaseModel):
    """Extracted invoice data."""
    invoice_number: str = Field(description="Invoice ID or number")
    vendor_name: str = Field(description="Name of the vendor")
    customer_name: str = Field(description="Name of the customer")
    date: Optional[str] = Field(None, description="Invoice date")
    due_date: Optional[str] = Field(None, description="Payment due date")
    items: list[InvoiceItem] = Field(default_factory=list)
    subtotal: Optional[float] = None
    tax: Optional[float] = None
    total: float = Field(description="Total amount")
    currency: str = Field(default="USD")


class EventInfo(BaseModel):
    """Extracted event information."""
    title: str = Field(description="Event name or title")
    date: Optional[str] = Field(None, description="Event date")
    time: Optional[str] = Field(None, description="Event time")
    location: Optional[str] = Field(None, description="Event location or venue")
    description: Optional[str] = Field(None, description="Event description")
    organizer: Optional[str] = Field(None, description="Event organizer")
Understanding Pydantic Schema Design:
Why Pydantic Schemas for LLM Output
Without Schema
LLM output: "The name is John, email john@...". The format varies each time, is hard to parse reliably, has no validation, and is awkward to use in code.
With Pydantic Schema
Recommended. class ContactInfo(BaseModel) with name: str and email: Optional[str]. The output is consistent JSON, type checking is built in, and IDE autocomplete works.
Field Options and When to Use:
| Field Option | Purpose | Example |
|---|---|---|
| description="..." | Tells LLM what to extract | "Full name of person" |
| ge=1, le=5 | Number range | Rating 1-5 stars |
| default=None | Optional field | Phone may be missing |
| default_factory=list | Empty list default | Pros/cons lists |
Custom validators (@field_validator) run after the LLM output has been parsed into the model. They can normalize or reject bad data, and a rejection (a raised ValueError) triggers a retry as long as retries remain.
Schema Design Patterns:
| Pattern | Example | When to Use |
|---|---|---|
| Optional[str] | Email might not be in text | Field may be missing |
| list[str] | Multiple pros/cons | Multiple values expected |
| Enum | Sentiment (positive/negative/neutral) | Fixed set of valid values |
| Nested BaseModel | InvoiceItem inside Invoice | Complex nested structures |
Step 3: Extractor with Instructor
"""
Structured extraction using Instructor library.
"""
from typing import Type, TypeVar
from pydantic import BaseModel
import instructor
from openai import OpenAI
from .schemas import ContactInfo, ReviewAnalysis, Invoice, EventInfo
T = TypeVar("T", bound=BaseModel)


class StructuredExtractor:
    """
    Extract structured data from text using LLMs.
    Uses the Instructor library for reliable extraction
    with automatic retries and validation.
    """

    def __init__(self, model: str = "gpt-4o-mini"):
        # Patch OpenAI client with Instructor
        self.client = instructor.from_openai(OpenAI())
        self.model = model

    def extract(
        self,
        text: str,
        schema: Type[T],
        instructions: str = ""
    ) -> T:
        """
        Extract structured data matching the schema.
        Args:
            text: Unstructured text to extract from
            schema: Pydantic model defining the structure
            instructions: Additional extraction instructions
        Returns:
            Validated Pydantic model instance
        """
        system_prompt = f"""Extract information from the text into the specified structure.
{instructions}
Rules:
- Only extract information that is explicitly stated
- Use null/None for fields not found in the text
- Be precise and accurate"""
        return self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": text}
            ],
            response_model=schema,
            max_retries=2  # Retry on validation failure
        )

    def extract_contact(self, text: str) -> ContactInfo:
        """Extract contact information."""
        return self.extract(
            text,
            ContactInfo,
            "Extract all contact details including name, email, phone, company, and job title."
        )

    def analyze_review(self, text: str) -> ReviewAnalysis:
        """Analyze a product review."""
        return self.extract(
            text,
            ReviewAnalysis,
            "Analyze the sentiment, identify pros and cons, and provide a rating."
        )

    def extract_invoice(self, text: str) -> Invoice:
        """Extract invoice data."""
        return self.extract(
            text,
            Invoice,
            "Extract all invoice details including line items, totals, and dates."
        )

    def extract_event(self, text: str) -> EventInfo:
        """Extract event information."""
        return self.extract(
            text,
            EventInfo,
            "Extract event details including title, date, time, location, and organizer."
        )


class BatchExtractor:
    """Extract multiple items from text."""

    def __init__(self, model: str = "gpt-4o-mini"):
        self.client = instructor.from_openai(OpenAI())
        self.model = model

    def extract_all(
        self,
        text: str,
        schema: Type[T],
        instructions: str = ""
    ) -> list[T]:
        """Extract all matching items from text."""
        # Create a wrapper model for list extraction
        class ItemList(BaseModel):
            items: list[schema]

        result = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {
                    "role": "system",
                    "content": f"Extract ALL items matching the schema. {instructions}"
                },
                {"role": "user", "content": text}
            ],
            response_model=ItemList
        )
        return result.items
Understanding How Instructor Works:
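What instructor.from_openai(OpenAI()) Does
It wraps the OpenAI client so that chat.completions.create accepts response_model and max_retries, and returns a validated Pydantic instance instead of a raw completion.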
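Retry Flow (max_retries=2)
- LLM Response: the LLM generates JSON output
- Parse JSON: the raw text is parsed as JSON
- Pydantic Validation: the parsed JSON is validated against the schema; on failure, the validation error is sent back to the LLM and the request is retried until the retry limit is reached
- Return Object: the validated Pydantic model instance is returned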
Why Use Instructor vs Raw Function Calling:
| Approach | Pros | Cons |
|---|---|---|
| Raw function calling | No extra dependency | Manual validation, no retries |
| Instructor | Auto-validation, retries, clean API | Extra dependency |
| JSON mode | Simple | No schema enforcement |
Batch Extraction Pattern:
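The BatchExtractor wraps your schema in a container model (class ItemList(BaseModel): items: list[T]) because LLMs handle lists more reliably when explicitly asked for a list container rather than raw array output. Note that extract_all does not pass max_retries, so it relies on Instructor's default retry behavior.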
Step 4: FastAPI Application
"""FastAPI application for structured extraction."""
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import Any
from .extractor import StructuredExtractor
from .schemas import ContactInfo, ReviewAnalysis, Invoice, EventInfo

app = FastAPI(
    title="Structured Extraction API",
    description="Extract structured data from unstructured text"
)

extractor = StructuredExtractor()


class ExtractionRequest(BaseModel):
    text: str


# Handlers are plain `def`: FastAPI runs sync handlers in a threadpool,
# so the blocking LLM call does not stall the event loop.
@app.post("/extract/contact", response_model=ContactInfo)
def extract_contact(request: ExtractionRequest):
    """Extract contact information from text."""
    try:
        return extractor.extract_contact(request.text)
    except Exception as e:
        raise HTTPException(500, str(e))


@app.post("/extract/review", response_model=ReviewAnalysis)
def analyze_review(request: ExtractionRequest):
    """Analyze a product review."""
    try:
        return extractor.analyze_review(request.text)
    except Exception as e:
        raise HTTPException(500, str(e))


@app.post("/extract/invoice", response_model=Invoice)
def extract_invoice(request: ExtractionRequest):
    """Extract invoice data from text."""
    try:
        return extractor.extract_invoice(request.text)
    except Exception as e:
        raise HTTPException(500, str(e))


@app.post("/extract/event", response_model=EventInfo)
def extract_event(request: ExtractionRequest):
    """Extract event information from text."""
    try:
        return extractor.extract_event(request.text)
    except Exception as e:
        raise HTTPException(500, str(e))
Understanding the API Architecture:
Why Separate Endpoints per Extraction Type
Single Generic Endpoint
POST /extract with a body containing text and type. The response type varies, which is hard to document and gives clients no type hints.
Type-Specific Endpoints
Recommended. POST /extract/contact, /extract/review, /extract/invoice. Each has a clear response_model, auto-generated OpenAPI docs, and type-safe client SDKs.
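API Flow: the client POSTs text to a type-specific endpoint, FastAPI validates the request body, the extractor calls the LLM through Instructor, and the validated model is returned as JSON.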
Error Handling Strategy:
| Error Type | Cause | Response |
|---|---|---|
| Validation error | LLM output doesn't match schema | 500 after max_retries |
| Missing text | Empty request body | FastAPI auto-returns 422 |
| LLM error | API failure, rate limit | 500 with error message |
Example Usage
# Extract contact info
curl -X POST http://localhost:8000/extract/contact \
-H "Content-Type: application/json" \
-d '{"text": "Hi, I am John Smith, CTO at TechCorp. You can reach me at john@techcorp.com or call 555-0123."}'
# Analyze review
curl -X POST http://localhost:8000/extract/review \
-H "Content-Type: application/json" \
-d '{"text": "Great product! Fast delivery and excellent quality. The only downside is the high price. Would recommend to others."}'
Key Concepts Recap
| Concept | What It Is | Why It Matters |
|---|---|---|
| Pydantic Schema | Python class defining data structure | Ensures output matches expected format |
| Instructor | Library that patches OpenAI client | Handles retries, validation, structured output |
| response_model | Pass schema to LLM call | LLM knows what structure to return |
| Field Validators | Custom validation rules | Catch invalid data (bad emails, etc.) |
| max_retries | Auto-retry on validation failure | Handles LLM inconsistency gracefully |
| Batch Extraction | Extract multiple items from text | Find all contacts, all events, etc. |
Next Steps
- Content Generation - Generate content with templates
- Code Assistant - Build a coding AI