Structured Extraction

Transform unstructured text into structured, validated data

TL;DR

Define a Pydantic schema (name, email, phone, ...), pass it to an LLM through the Instructor library, and get back a validated object that serializes to JSON. Instructor retries when the LLM output doesn't match your schema. Schema-first = reliable extraction.

What You'll Learn

Using Pydantic for schema definition
Function calling for structured output
Instructor library for reliable extraction
Handling extraction errors and validation

Tech Stack

Component	Technology
LLM	OpenAI GPT-4
Schema	Pydantic
Extraction	Instructor
API	FastAPI

Extraction Pipeline

┌─────────────────────────────────────────────────────────────────────────────┐
│                    STRUCTURED EXTRACTION PIPELINE                           │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────────────┐    ┌──────────────┐    ┌─────────────────┐             │
│  │  Unstructured   │    │   Pydantic   │    │  LLM + Instructor│            │
│  │     Text        │───►│   Schema     │───►│   Extraction     │            │
│  │                 │    │              │    │                  │             │
│  │ "Hi, I'm John   │    │ class Contact│    │ response_model=  │            │
│  │  at john@..."   │    │   name: str  │    │   Contact        │            │
│  │                 │    │   email: str │    │                  │             │
│  └─────────────────┘    └──────────────┘    └────────┬─────────┘            │
│                                                       │                     │
│                                                       ▼                     │
│                         ┌─────────────────────────────────────────────┐     │
│                         │         Pydantic Validation                 │     │
│                         │  • Type checking (str, int, float)          │     │
│                         │  • Field constraints (ge=1, le=5)           │     │
│                         │  • Custom validators (@field_validator)     │     │
│                         │  • Auto-retry on validation failure         │     │
│                         └────────────────────┬────────────────────────┘     │
│                                              │                              │
│                                              ▼                              │
│                         ┌─────────────────────────────────────────────┐     │
│                         │   Structured Data (validated JSON)          │     │
│                         │   {"name": "John", "email": "john@..."}     │     │
│                         └─────────────────────────────────────────────┘     │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Project Structure

structured-extraction/
├── src/
│   ├── __init__.py
│   ├── schemas.py         # Pydantic schemas
│   ├── extractor.py       # Extraction logic
│   └── api.py             # FastAPI application
├── tests/
│   └── test_extraction.py
└── requirements.txt

Implementation

Step 1: Setup

requirements.txt

openai>=1.0.0
instructor>=1.0.0
pydantic>=2.0.0
fastapi>=0.100.0
uvicorn>=0.23.0

Step 2: Define Schemas

src/schemas.py

"""
Pydantic schemas for structured extraction.
"""

from pydantic import BaseModel, Field, field_validator
from typing import Optional
from enum import Enum


class ContactInfo(BaseModel):
    """Extracted contact information."""
    name: str = Field(description="Full name of the person")
    email: Optional[str] = Field(None, description="Email address")
    phone: Optional[str] = Field(None, description="Phone number")
    company: Optional[str] = Field(None, description="Company or organization")
    title: Optional[str] = Field(None, description="Job title")
    
    @field_validator("email")
    @classmethod
    def validate_email(cls, v):
        if v and "@" not in v:
            raise ValueError("Invalid email format")
        return v


class Sentiment(str, Enum):
    POSITIVE = "positive"
    NEGATIVE = "negative"
    NEUTRAL = "neutral"


class ReviewAnalysis(BaseModel):
    """Analyzed product review."""
    sentiment: Sentiment = Field(description="Overall sentiment")
    rating: int = Field(ge=1, le=5, description="Rating from 1-5")
    pros: list[str] = Field(default_factory=list, description="Positive points")
    cons: list[str] = Field(default_factory=list, description="Negative points")
    summary: str = Field(description="Brief summary of the review")


class InvoiceItem(BaseModel):
    """Single item on an invoice."""
    description: str
    quantity: int = Field(ge=1)
    unit_price: float = Field(ge=0)
    total: float = Field(ge=0)


class Invoice(BaseModel):
    """Extracted invoice data."""
    invoice_number: str = Field(description="Invoice ID or number")
    vendor_name: str = Field(description="Name of the vendor")
    customer_name: str = Field(description="Name of the customer")
    date: Optional[str] = Field(None, description="Invoice date")
    due_date: Optional[str] = Field(None, description="Payment due date")
    items: list[InvoiceItem] = Field(default_factory=list)
    subtotal: Optional[float] = None
    tax: Optional[float] = None
    total: float = Field(description="Total amount")
    currency: str = Field(default="USD")


class EventInfo(BaseModel):
    """Extracted event information."""
    title: str = Field(description="Event name or title")
    date: Optional[str] = Field(None, description="Event date")
    time: Optional[str] = Field(None, description="Event time")
    location: Optional[str] = Field(None, description="Event location or venue")
    description: Optional[str] = Field(None, description="Event description")
    organizer: Optional[str] = Field(None, description="Event organizer")

Understanding Pydantic Schema Design:

┌─────────────────────────────────────────────────────────────────────────────┐
│ WHY PYDANTIC SCHEMAS FOR LLM OUTPUT                                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  Without Schema:                       With Pydantic Schema:                │
│  ┌─────────────────────────────┐       ┌─────────────────────────────────┐ │
│  │ LLM output: "The name is    │       │ class ContactInfo(BaseModel):   │ │
│  │ John, email john@..."       │       │     name: str                   │ │
│  │                             │       │     email: Optional[str]        │ │
│  │ Problems:                   │       │                                 │ │
│  │ • Format varies each time   │       │ Benefits:                       │ │
│  │ • Hard to parse reliably    │       │ • Consistent JSON output        │ │
│  │ • No validation             │       │ • Type checking built-in        │ │
│  │ • Can't use in code easily  │       │ • IDE autocomplete works        │ │
│  └─────────────────────────────┘       └─────────────────────────────────┘ │
│                                                                             │
│  Field Options and When to Use:                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │ Field(...)           │ Purpose              │ Example               │   │
│  │ ─────────────────────┼──────────────────────┼─────────────────────  │   │
│  │ description="..."    │ Tells LLM what to    │ "Full name of person" │   │
│  │                      │ extract              │                       │   │
│  │ ─────────────────────┼──────────────────────┼─────────────────────  │   │
│  │ ge=1, le=5           │ Number range         │ Rating 1-5 stars      │   │
│  │ ─────────────────────┼──────────────────────┼─────────────────────  │   │
│  │ default=None         │ Optional field       │ Phone may be missing  │   │
│  │ ─────────────────────┼──────────────────────┼─────────────────────  │   │
│  │ default_factory=list │ Empty list default   │ Pros/cons lists       │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│  Custom Validators (@field_validator):                                      │
│  • Run AFTER LLM extraction                                                 │
│  • Can fix or reject bad data                                               │
│  • Triggers retry if validation fails                                       │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
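
The "fix or reject" behaviour of custom validators can be seen without calling an LLM at all. The sketch below is illustrative only: `Contact` and `Review` are cut-down stand-ins for the schemas above, and the strip/lowercase normalization is an assumption added for the example, not part of the original `validate_email`.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError, field_validator


class Contact(BaseModel):
    name: str = Field(description="Full name of the person")
    email: Optional[str] = Field(None, description="Email address")

    @field_validator("email")
    @classmethod
    def normalize_email(cls, v: Optional[str]) -> Optional[str]:
        if v is None:
            return v
        v = v.strip().lower()  # "fix": normalize harmless formatting noise
        if "@" not in v:
            raise ValueError("Invalid email format")  # "reject": fail validation
        return v


class Review(BaseModel):
    rating: int = Field(ge=1, le=5, description="Rating from 1-5")
    pros: list[str] = Field(default_factory=list)


# The validator repairs the value instead of failing.
fixed = Contact(name="John", email="  John@TechCorp.com ")
print(fixed.email)  # john@techcorp.com

# The validator rejects a value it cannot repair.
try:
    Contact(name="John", email="not-an-email")
except ValidationError as exc:
    print(exc.errors()[0]["msg"])  # message includes "Invalid email format"

# Field constraints reject out-of-range numbers before your code sees them.
try:
    Review(rating=7)
except ValidationError as exc:
    print(exc.errors()[0]["type"])  # less_than_equal
```

When the same models are passed as `response_model`, it is these `ValidationError`s that drive the retry.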

Schema Design Patterns:

Pattern	Example	When to Use
`Optional[str]`	Email might not be in text	Field may be missing
`list[str]`	Multiple pros/cons	Multiple values expected
`Enum`	Sentiment (positive/negative/neutral)	Fixed set of valid values
Nested `BaseModel`	InvoiceItem inside Invoice	Complex nested structures
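
To make the four patterns concrete, here is a minimal sketch that feeds a plain dict (the shape an LLM tool call would produce) through a model using all of them. `Order` and `LineItem` are hypothetical names invented for this example; only `Sentiment` mirrors the schema above.

```python
from enum import Enum
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class Sentiment(str, Enum):
    POSITIVE = "positive"
    NEGATIVE = "negative"
    NEUTRAL = "neutral"


class LineItem(BaseModel):  # hypothetical nested model
    description: str
    quantity: int = Field(ge=1)


class Order(BaseModel):  # hypothetical model combining the four patterns
    customer: str
    note: Optional[str] = None                              # Optional: may be missing
    tags: list[str] = Field(default_factory=list)           # list: zero or more values
    mood: Sentiment                                         # Enum: fixed set of values
    items: list[LineItem] = Field(default_factory=list)     # nested BaseModel


raw = {
    "customer": "Ada",
    "mood": "positive",
    "items": [{"description": "Widget", "quantity": 2}],
}
order = Order.model_validate(raw)

print(order.note)               # None
print(order.tags)               # []
print(order.mood is Sentiment.POSITIVE)  # True
print(order.items[0].quantity)  # 2

# A value outside the Enum is rejected rather than silently accepted.
try:
    Order.model_validate({"customer": "Ada", "mood": "ecstatic"})
except ValidationError as exc:
    print(exc.errors()[0]["loc"])  # ('mood',)
```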

Step 3: Extractor with Instructor

src/extractor.py

"""
Structured extraction using Instructor library.
"""

from typing import Type, TypeVar
from pydantic import BaseModel
import instructor
from openai import OpenAI

from .schemas import ContactInfo, ReviewAnalysis, Invoice, EventInfo


T = TypeVar("T", bound=BaseModel)


class StructuredExtractor:
    """
    Extract structured data from text using LLMs.
    
    Uses the Instructor library for reliable extraction
    with automatic retries and validation.
    """
    
    def __init__(self, model: str = "gpt-4-turbo-preview"):
        # Patch OpenAI client with Instructor
        self.client = instructor.from_openai(OpenAI())
        self.model = model
    
    def extract(
        self,
        text: str,
        schema: Type[T],
        instructions: str = ""
    ) -> T:
        """
        Extract structured data matching the schema.
        
        Args:
            text: Unstructured text to extract from
            schema: Pydantic model defining the structure
            instructions: Additional extraction instructions
            
        Returns:
            Validated Pydantic model instance
        """
        system_prompt = f"""Extract information from the text into the specified structure.
{instructions}

Rules:
- Only extract information that is explicitly stated
- Use null/None for fields not found in the text
- Be precise and accurate"""
        
        return self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": text}
            ],
            response_model=schema,
            max_retries=2  # Retry on validation failure
        )
    
    def extract_contact(self, text: str) -> ContactInfo:
        """Extract contact information."""
        return self.extract(
            text,
            ContactInfo,
            "Extract all contact details including name, email, phone, company, and job title."
        )
    
    def analyze_review(self, text: str) -> ReviewAnalysis:
        """Analyze a product review."""
        return self.extract(
            text,
            ReviewAnalysis,
            "Analyze the sentiment, identify pros and cons, and provide a rating."
        )
    
    def extract_invoice(self, text: str) -> Invoice:
        """Extract invoice data."""
        return self.extract(
            text,
            Invoice,
            "Extract all invoice details including line items, totals, and dates."
        )
    
    def extract_event(self, text: str) -> EventInfo:
        """Extract event information."""
        return self.extract(
            text,
            EventInfo,
            "Extract event details including title, date, time, location, and organizer."
        )


class BatchExtractor:
    """Extract multiple items from text."""
    
    def __init__(self, model: str = "gpt-4-turbo-preview"):
        self.client = instructor.from_openai(OpenAI())
        self.model = model
    
    def extract_all(
        self,
        text: str,
        schema: Type[T],
        instructions: str = ""
    ) -> list[T]:
        """Extract all matching items from text."""
        
        # Create a wrapper model for list extraction
        class ItemList(BaseModel):
            items: list[schema]
        
        result = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {
                    "role": "system",
                    "content": f"Extract ALL items matching the schema. {instructions}"
                },
                {"role": "user", "content": text}
            ],
            response_model=ItemList
        )
        
        return result.items
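
The project structure lists tests/test_extraction.py, but its contents are not shown. One possible approach, sketched here under assumptions, is to test the extraction call against a fake client so no API key or network is needed. `FakeCompletions`, `make_fake_client`, and the module-level `extract` function are hypothetical helpers written for this example; `extract` is a simplified stand-in for `StructuredExtractor.extract`, not the class itself.

```python
from types import SimpleNamespace
from typing import Optional, Type, TypeVar

from pydantic import BaseModel

T = TypeVar("T", bound=BaseModel)


class ContactInfo(BaseModel):
    name: str
    email: Optional[str] = None


class FakeCompletions:
    """Stands in for client.chat.completions: records calls, returns canned data."""

    def __init__(self, canned: dict):
        self.canned = canned
        self.calls: list[dict] = []

    def create(self, **kwargs):
        self.calls.append(kwargs)
        # Validate the canned payload against whatever schema was requested.
        return kwargs["response_model"].model_validate(self.canned)


def make_fake_client(canned: dict) -> SimpleNamespace:
    # Mirrors the attribute path the extractor uses: client.chat.completions.create
    return SimpleNamespace(chat=SimpleNamespace(completions=FakeCompletions(canned)))


def extract(client, text: str, schema: Type[T], model: str = "test-model") -> T:
    """Simplified stand-in for StructuredExtractor.extract."""
    return client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": text}],
        response_model=schema,
        max_retries=2,
    )


def test_extract_contact_returns_validated_model():
    client = make_fake_client({"name": "John Smith", "email": "john@techcorp.com"})
    result = extract(client, "Hi, I am John Smith, CTO at TechCorp.", ContactInfo)

    assert isinstance(result, ContactInfo)
    assert result.email == "john@techcorp.com"
    # The schema must be forwarded as response_model.
    assert client.chat.completions.calls[0]["response_model"] is ContactInfo


test_extract_contact_returns_validated_model()
```

To use the same idea with the real class, the client would need to be injectable (for example via a constructor argument), which the `StructuredExtractor` above does not currently offer.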

Understanding How Instructor Works:

┌─────────────────────────────────────────────────────────────────────────────┐
│ INSTRUCTOR LIBRARY INTERNALS                                                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  What instructor.from_openai(OpenAI()) Does:                               │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │                                                                     │   │
│  │  1. Patches the OpenAI client to accept response_model parameter   │   │
│  │                                                                     │   │
│  │  2. Converts Pydantic schema to JSON Schema:                       │   │
│  │     class Contact:                                                 │   │
│  │         name: str           ──►  {"type": "object",                │   │
│  │         email: str                 "properties": {                 │   │
│  │                                      "name": {"type": "string"},   │   │
│  │                                      "email": {"type": "string"}   │   │
│  │                                    }}                              │   │
│  │                                                                     │   │
│  │  3. Uses function calling (tool_choice) to force structured output │   │
│  │                                                                     │   │
│  │  4. Parses response and validates against Pydantic model          │   │
│  │                                                                     │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│  Retry Flow (max_retries=2):                                               │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │                                                                     │   │
│  │  LLM Response ──► Parse JSON ──► Pydantic Validation               │   │
│  │        │                                  │                         │   │
│  │        │                           Fail? ─┼──► Retry with error msg │   │
│  │        │                                  │    in prompt            │   │
│  │        │                                  │         │               │   │
│  │        │                           Pass? ─┼──► Return validated     │   │
│  │        │                                  │    object               │   │
│  │        │                                  │                         │   │
│  │        └──► If JSON parse fails: retry with "invalid JSON" msg     │   │
│  │                                                                     │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
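
Steps 2 and 3 of the diagram can be inspected directly with Pydantic. The sketch below prints the JSON Schema for a small model and shows roughly how such a schema fits into an OpenAI tool definition. The `tool` and `tool_choice` dicts follow the Chat Completions tool format, but the exact payload Instructor builds is an assumption here and may differ in names and details.

```python
import json

from pydantic import BaseModel, Field


class Contact(BaseModel):
    """Extracted contact information."""

    name: str = Field(description="Full name of the person")
    email: str = Field(description="Email address")


# Step 2: Pydantic model -> JSON Schema (field descriptions are carried along).
schema = Contact.model_json_schema()
print(json.dumps(schema, indent=2))

# Step 3 (approximation): the schema becomes the parameters of a tool,
# and tool_choice forces the model to call that tool.
tool = {
    "type": "function",
    "function": {
        "name": "Contact",
        "description": "Extracted contact information.",
        "parameters": schema,
    },
}
tool_choice = {"type": "function", "function": {"name": "Contact"}}
```

This is why the `description` on each `Field` matters: it is part of what the LLM sees.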

Why Use Instructor vs Raw Function Calling:

Approach	Pros	Cons
Raw function calling	No extra dependency	Manual validation, no retries
Instructor	Auto-validation, retries, clean API	Extra dependency
JSON mode	Simple	No schema enforcement
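
To show what "manual validation, no retries" means in practice, here is a minimal sketch of the loop you would write yourself with raw function calling. It is not Instructor's implementation: `extract_with_retries` and `make_stub` are hypothetical helpers, the LLM is replaced by a stub returning canned strings, and `max_retries` here counts extra attempts after the first, which may not match Instructor's own counting.

```python
from typing import Callable, Optional, Type, TypeVar

from pydantic import BaseModel, Field, ValidationError

T = TypeVar("T", bound=BaseModel)


class Review(BaseModel):
    rating: int = Field(ge=1, le=5)
    summary: str


def extract_with_retries(
    call_llm: Callable[[list[dict]], str],
    messages: list[dict],
    schema: Type[T],
    max_retries: int = 2,
) -> T:
    """Parse and validate LLM output; on failure, feed the error back and retry."""
    messages = list(messages)  # do not mutate the caller's list
    last_error: Optional[ValidationError] = None
    for _ in range(max_retries + 1):
        raw = call_llm(messages)
        try:
            # Covers both malformed JSON and schema violations.
            return schema.model_validate_json(raw)
        except ValidationError as exc:
            last_error = exc
            messages.append({"role": "assistant", "content": raw})
            messages.append({
                "role": "user",
                "content": f"Your output failed validation:\n{exc}\nReturn corrected JSON only.",
            })
    assert last_error is not None
    raise last_error


def make_stub(responses: list[str]) -> Callable[[list[dict]], str]:
    """Fake LLM that returns the given strings in order."""
    it = iter(responses)
    return lambda messages: next(it)


stub = make_stub([
    '{"rating": 9, "summary": "Great"}',  # violates le=5, triggers a retry
    '{"rating": 5, "summary": "Great"}',  # valid on the second attempt
])
review = extract_with_retries(stub, [{"role": "user", "content": "Great product!"}], Review)
print(review.rating)  # 5
```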

Batch Extraction Pattern: The BatchExtractor wraps your schema in a container model (class ItemList with items: list[T]). Function-calling arguments are a JSON object at the top level, so a bare array cannot be returned directly; the wrapper gives the list a named field that Pydantic can then validate item by item.
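
The wrapper can also be built with Pydantic's `create_model`, which gives each wrapper a distinct name and avoids defining a class inside the method. This is a sketch of an alternative, not what `BatchExtractor` does; `make_list_wrapper` is a hypothetical helper, and the JSON payload stands in for what the LLM would return.

```python
from typing import List, Optional, Type

from pydantic import BaseModel, create_model


class Contact(BaseModel):
    name: str
    email: Optional[str] = None


def make_list_wrapper(schema: Type[BaseModel]) -> Type[BaseModel]:
    """Build a model like `class ContactList(BaseModel): items: list[Contact]`."""
    return create_model(f"{schema.__name__}List", items=(List[schema], ...))


ContactList = make_list_wrapper(Contact)

# Stand-in for the tool-call arguments an LLM would return.
payload = (
    '{"items": ['
    '{"name": "John Smith", "email": "john@techcorp.com"}, '
    '{"name": "Ana Lee"}'
    "]}"
)
result = ContactList.model_validate_json(payload)

print(ContactList.__name__)            # ContactList
print([c.name for c in result.items])  # ['John Smith', 'Ana Lee']
print(ContactList.model_json_schema()["type"])  # object
```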

Step 4: FastAPI Application

src/api.py

"""FastAPI application for structured extraction."""

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

from .extractor import StructuredExtractor
from .schemas import ContactInfo, ReviewAnalysis, Invoice, EventInfo


app = FastAPI(
    title="Structured Extraction API",
    description="Extract structured data from unstructured text"
)

extractor = StructuredExtractor()


class ExtractionRequest(BaseModel):
    text: str


# Endpoints use plain `def` (not `async def`): the extractor makes a blocking
# OpenAI call, and FastAPI runs sync endpoints in a threadpool instead of
# blocking the event loop.
@app.post("/extract/contact", response_model=ContactInfo)
def extract_contact(request: ExtractionRequest):
    """Extract contact information from text."""
    try:
        return extractor.extract_contact(request.text)
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


@app.post("/extract/review", response_model=ReviewAnalysis)
def analyze_review(request: ExtractionRequest):
    """Analyze a product review."""
    try:
        return extractor.analyze_review(request.text)
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


@app.post("/extract/invoice", response_model=Invoice)
def extract_invoice(request: ExtractionRequest):
    """Extract invoice data from text."""
    try:
        return extractor.extract_invoice(request.text)
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


@app.post("/extract/event", response_model=EventInfo)
def extract_event(request: ExtractionRequest):
    """Extract event information from text."""
    try:
        return extractor.extract_event(request.text)
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

Understanding the API Architecture:

┌─────────────────────────────────────────────────────────────────────────────┐
│ WHY SEPARATE ENDPOINTS PER EXTRACTION TYPE                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  Option A: Single Generic Endpoint        Option B: Type-Specific Endpoints│
│  ┌─────────────────────────────────┐      ┌─────────────────────────────┐  │
│  │ POST /extract                   │      │ POST /extract/contact       │  │
│  │ Body: {text, type: "contact"}   │      │ POST /extract/review        │  │
│  │                                 │      │ POST /extract/invoice       │  │
│  │ Problems:                       │      │                             │  │
│  │ • Response type varies          │      │ Benefits:                   │  │
│  │ • Hard to document              │      │ • Clear response_model      │  │
│  │ • No type hints in clients      │      │ • Auto-generated OpenAPI    │  │
│  └─────────────────────────────────┘      │ • Type-safe client SDKs     │  │
│                                            └─────────────────────────────┘  │
│                                                                             │
│  API Flow:                                                                  │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │                                                                     │   │
│  │  Client ──► POST /extract/contact ──► FastAPI validates request    │   │
│  │                │                              │                     │   │
│  │                │                              ▼                     │   │
│  │                │                      extractor.extract_contact()   │   │
│  │                │                              │                     │   │
│  │                │                              ▼                     │   │
│  │                │                      LLM + Instructor              │   │
│  │                │                              │                     │   │
│  │                │                              ▼                     │   │
│  │                │                      ContactInfo (validated)       │   │
│  │                │                              │                     │   │
│  │                ◄───────────────────────────────                     │   │
│  │                                                                     │   │
│  │  Response: {"name": "John", "email": "john@...", ...}             │   │
│  │                                                                     │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Error Handling Strategy:

Error Type	Cause	Response
Validation error	LLM output doesn't match schema	500 after max_retries
Missing text	Empty request body	FastAPI auto-returns 422
LLM error	API failure, rate limit	500 with error message
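
The 422 row comes from Pydantic validating the request body before your endpoint runs. The sketch below reproduces that check without starting a server. Note that an empty string still passes `text: str`; `StrictExtractionRequest` with `min_length=1` is a hypothetical tightening added for this example, not part of the API above, and `is_valid` is a helper invented here.

```python
from typing import Type

from pydantic import BaseModel, Field, ValidationError


class ExtractionRequest(BaseModel):
    text: str


class StrictExtractionRequest(BaseModel):
    # Hypothetical variant: also reject empty strings.
    text: str = Field(min_length=1)


def is_valid(model: Type[BaseModel], payload: dict) -> bool:
    """Return True if the payload would pass request validation."""
    try:
        model.model_validate(payload)
        return True
    except ValidationError:
        return False


print(is_valid(ExtractionRequest, {}))                  # False (FastAPI would return 422)
print(is_valid(ExtractionRequest, {"text": ""}))        # True  (empty text reaches the LLM)
print(is_valid(StrictExtractionRequest, {"text": ""}))  # False
print(is_valid(StrictExtractionRequest, {"text": "Hi, I am John"}))  # True
```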

Example Usage

# Extract contact info
curl -X POST http://localhost:8000/extract/contact \
  -H "Content-Type: application/json" \
  -d '{"text": "Hi, I am John Smith, CTO at TechCorp. You can reach me at john@techcorp.com or call 555-0123."}'

# Analyze review
curl -X POST http://localhost:8000/extract/review \
  -H "Content-Type: application/json" \
  -d '{"text": "Great product! Fast delivery and excellent quality. The only downside is the high price. Would recommend to others."}'
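
The same requests can be sent from Python using only the standard library. This is a minimal sketch assuming the API is running at http://localhost:8000 as in the curl commands; `build_request` and `post_json` are hypothetical helper names, and only the request construction is executed here.

```python
import json
import urllib.request


def build_request(
    path: str, text: str, base_url: str = "http://localhost:8000"
) -> urllib.request.Request:
    """Build the same POST request the curl examples send."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}{path}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(path: str, text: str, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(build_request(path, text), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


req = build_request(
    "/extract/contact",
    "Hi, I am John Smith, CTO at TechCorp. You can reach me at john@techcorp.com or call 555-0123.",
)
print(req.get_method(), req.full_url)  # POST http://localhost:8000/extract/contact
```

With the server running, `post_json("/extract/contact", "...")` would return the extracted fields as a dict.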

Key Concepts Recap

Concept	What It Is	Why It Matters
Pydantic Schema	Python class defining data structure	Ensures output matches expected format
Instructor	Library that patches OpenAI client	Handles retries, validation, structured output
response_model	Pass schema to LLM call	LLM knows what structure to return
Field Validators	Custom validation rules	Catch invalid data (bad emails, etc.)
max_retries	Auto-retry on validation failure	Handles LLM inconsistency gracefully
Batch Extraction	Extract multiple items from text	Find all contacts, all events, etc.

Next Steps

Content Generation - Generate content with templates
Code Assistant - Build a coding AI

        return self.extract(
            text,
            Invoice,
            "Extract all invoice details including line items, totals, and dates."
        )
    
    def extract_event(self, text: str) -> EventInfo:
        """Extract event information."""
        return self.extract(
            text,
            EventInfo,
            "Extract event details including title, date, time, location, and organizer."
        )


class BatchExtractor:
    """Extract multiple items from text."""
    
    def __init__(self, model: str = "gpt-4-turbo-preview"):
        self.client = instructor.from_openai(OpenAI())
        self.model = model
    
    def extract_all(
        self,
        text: str,
        schema: Type[T],
        instructions: str = ""
    ) -> list[T]:
        """Extract all matching items from text."""
        
        # Create a wrapper model for list extraction
        class ItemList(BaseModel):
            items: list[schema]
        
        result = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {
                    "role": "system",
                    "content": f"Extract ALL items matching the schema. {instructions}"
                },
                {"role": "user", "content": text}
            ],
            response_model=ItemList,
            max_retries=2  # Same retry budget as StructuredExtractor.extract
        )
        
        return result.items
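
The wrapper class inside `extract_all` can also be built with Pydantic's `create_model`, which gives the wrapper a schema-specific name and keeps static type checkers quieter. The sketch below is one possible variant, assuming Pydantic v2; `list_wrapper` and `Contact` are hypothetical names, and no LLM call is made.

```python
from typing import Optional

from pydantic import BaseModel, create_model


class Contact(BaseModel):
    name: str
    email: Optional[str] = None


def list_wrapper(schema: type[BaseModel]) -> type[BaseModel]:
    """Build `<Schema>List` with one required field: items: list[schema]."""
    return create_model(f"{schema.__name__}List", items=(list[schema], ...))


ContactList = list_wrapper(Contact)

# Simulated payload standing in for what the LLM would return.
parsed = ContactList.model_validate(
    {"items": [{"name": "John"}, {"name": "Ana", "email": "ana@example.com"}]}
)
print(ContactList.__name__, [contact.name for contact in parsed.items])
```

The returned class can be passed as `response_model` in the same way as `ItemList`.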

Understanding How Instructor Works:

┌─────────────────────────────────────────────────────────────────────────────┐
│ INSTRUCTOR LIBRARY INTERNALS                                                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  What instructor.from_openai(OpenAI()) Does:                               │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │                                                                     │   │
│  │  1. Patches the OpenAI client to accept response_model parameter   │   │
│  │                                                                     │   │
│  │  2. Converts Pydantic schema to JSON Schema:                       │   │
│  │     class Contact:                                                 │   │
│  │         name: str           ──►  {"type": "object",                │   │
│  │         email: str                 "properties": {                 │   │
│  │                                      "name": {"type": "string"},   │   │
│  │                                      "email": {"type": "string"}   │   │
│  │                                    }}                              │   │
│  │                                                                     │   │
│  │  3. Uses function calling (tool_choice) to force structured output │   │
│  │                                                                     │   │
│  │  4. Parses response and validates against Pydantic model          │   │
│  │                                                                     │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│  Retry Flow (max_retries=2):                                               │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │                                                                     │   │
│  │  LLM Response ──► Parse JSON ──► Pydantic Validation               │   │
│  │        │                                  │                         │   │
│  │        │                           Fail? ─┼──► Retry with error msg │   │
│  │        │                                  │    in prompt            │   │
│  │        │                                  │         │               │   │
│  │        │                           Pass? ─┼──► Return validated     │   │
│  │        │                                  │    object               │   │
│  │        │                                  │                         │   │
│  │        └──► If JSON parse fails: retry with "invalid JSON" msg     │   │
│  │                                                                     │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
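
The schema conversion and retry flow in the diagram can be imitated without any network call. This is a simplified stand-in, not Instructor's actual implementation: `extract_with_retries`, `fake_llm`, and `max_attempts` are hypothetical, and the sketch makes no claim about how Instructor counts `max_retries`.

```python
from typing import Callable, Optional

from pydantic import BaseModel, Field, ValidationError


class Contact(BaseModel):
    name: str
    rating: int = Field(ge=1, le=5)


# Step 2 of the diagram: Pydantic model -> JSON Schema.
schema = Contact.model_json_schema()
print(schema["properties"])


def extract_with_retries(
    ask_llm: Callable[[list[dict]], str],
    text: str,
    max_attempts: int = 3,
) -> Contact:
    """Validate the reply; on failure, feed the error back and ask again."""
    messages = [{"role": "user", "content": text}]
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        raw = ask_llm(messages)
        try:
            # Covers both invalid JSON and schema violations.
            return Contact.model_validate_json(raw)
        except ValidationError as exc:
            last_error = exc
            messages.append({"role": "assistant", "content": raw})
            messages.append(
                {"role": "user", "content": f"Fix these errors and retry:\n{exc}"}
            )
    raise RuntimeError("attempts exhausted") from last_error


# Fake LLM: the first reply violates le=5, the second is valid.
replies = iter(['{"name": "John", "rating": 9}', '{"name": "John", "rating": 5}'])
prompt_sizes = []


def fake_llm(messages: list[dict]) -> str:
    prompt_sizes.append(len(messages))
    return next(replies)


contact = extract_with_retries(fake_llm, "John rated it 9 out of 5")
print(contact, prompt_sizes)
```

The second call sees a longer message list because the failed reply and the validation error were appended, which is the "retry with error msg in prompt" step.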

Why Use Instructor vs Raw Function Calling:

Approach	Pros	Cons
Raw function calling	No extra dependency	Manual validation, no retries
Instructor	Auto-validation, retries, clean API	Extra dependency
JSON mode	Simple	No schema enforcement
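
To see what "manual validation" means for raw function calling, the sketch below builds the tool definition by hand and validates a simulated tool-call result. It assumes Pydantic v2 and the Chat Completions `tools` payload shape; the `arguments` string is made up and no request is sent.

```python
from typing import Optional

from pydantic import BaseModel


class Contact(BaseModel):
    """Contact details mentioned in the text."""

    name: str
    email: Optional[str] = None


# With raw function calling you write this payload yourself.
tool = {
    "type": "function",
    "function": {
        "name": "Contact",
        "description": Contact.__doc__,
        "parameters": Contact.model_json_schema(),
    },
}
# Forces the model to call the function instead of replying in prose.
tool_choice = {"type": "function", "function": {"name": "Contact"}}

# Simulated value of the tool call's arguments field (hypothetical data).
arguments = '{"name": "John Smith", "email": "john@techcorp.com"}'

# Parsing and validation are also your job; a failure here would need
# hand-written retry logic.
contact = Contact.model_validate_json(arguments)
print(contact)
```

Instructor's `response_model` parameter replaces the payload construction, the parsing step, and the retry logic that would otherwise surround this code.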

Step 4: FastAPI Application

src/api.py

"""FastAPI application for structured extraction."""

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

from .extractor import StructuredExtractor
from .schemas import ContactInfo, ReviewAnalysis, Invoice, EventInfo


app = FastAPI(
    title="Structured Extraction API",
    description="Extract structured data from unstructured text"
)

extractor = StructuredExtractor()


class ExtractionRequest(BaseModel):
    text: str


@app.post("/extract/contact", response_model=ContactInfo)
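# Plain `def`: FastAPI runs sync handlers in a threadpool, so the blocking LLM call does not stall the event loop.
def extract_contact(request: ExtractionRequest):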
    """Extract contact information from text."""
    try:
        return extractor.extract_contact(request.text)
    except Exception as e:
        raise HTTPException(500, str(e))


@app.post("/extract/review", response_model=ReviewAnalysis)
def analyze_review(request: ExtractionRequest):
    """Analyze a product review."""
    try:
        return extractor.analyze_review(request.text)
    except Exception as e:
        raise HTTPException(500, str(e))


@app.post("/extract/invoice", response_model=Invoice)
def extract_invoice(request: ExtractionRequest):
    """Extract invoice data from text."""
    try:
        return extractor.extract_invoice(request.text)
    except Exception as e:
        raise HTTPException(500, str(e))


@app.post("/extract/event", response_model=EventInfo)
def extract_event(request: ExtractionRequest):
    """Extract event information from text."""
    try:
        return extractor.extract_event(request.text)
    except Exception as e:
        raise HTTPException(500, str(e))

Understanding the API Architecture:

┌─────────────────────────────────────────────────────────────────────────────┐
│ WHY SEPARATE ENDPOINTS PER EXTRACTION TYPE                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  Option A: Single Generic Endpoint        Option B: Type-Specific Endpoints│
│  ┌─────────────────────────────────┐      ┌─────────────────────────────┐  │
│  │ POST /extract                   │      │ POST /extract/contact       │  │
│  │ Body: {text, type: "contact"}   │      │ POST /extract/review        │  │
│  │                                 │      │ POST /extract/invoice       │  │
│  │ Problems:                       │      │                             │  │
│  │ • Response type varies          │      │ Benefits:                   │  │
│  │ • Hard to document              │      │ • Clear response_model      │  │
│  │ • No type hints in clients      │      │ • Auto-generated OpenAPI    │  │
│  └─────────────────────────────────┘      │ • Type-safe client SDKs     │  │
│                                            └─────────────────────────────┘  │
│                                                                             │
│  API Flow:                                                                  │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │                                                                     │   │
│  │  Client ──► POST /extract/contact ──► FastAPI validates request    │   │
│  │                │                              │                     │   │
│  │                │                              ▼                     │   │
│  │                │                      extractor.extract_contact()   │   │
│  │                │                              │                     │   │
│  │                │                              ▼                     │   │
│  │                │                      LLM + Instructor              │   │
│  │                │                              │                     │   │
│  │                │                              ▼                     │   │
│  │                │                      ContactInfo (validated)       │   │
│  │                │                              │                     │   │
│  │                ◄───────────────────────────────                     │   │
│  │                                                                     │   │
│  │  Response: {"name": "John", "email": "john@...", ...}             │   │
│  │                                                                     │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Error Handling Strategy:

Error Type	Cause	Response
Validation error	LLM output still doesn't match the schema after all retries	500 with error message
Missing text	Request body is empty or lacks the `text` field	FastAPI returns 422 automatically
LLM error	API failure, rate limit	500 with error message
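
The endpoints above return 500 for every failure. If clients need to tell the cases apart, one option is a small helper that maps exceptions to status codes. The sketch below is only one possible mapping: `classify_error` and `cause_chain` are hypothetical helpers, the 502/504 choices are assumptions, and the built-in `TimeoutError` stands in for whatever timeout exception the LLM client raises.

```python
from pydantic import BaseModel, Field, ValidationError


def cause_chain(exc: BaseException) -> list[BaseException]:
    """Return exc followed by its chained causes (bounded to avoid cycles)."""
    chain: list[BaseException] = []
    current = exc
    while current is not None and len(chain) < 10:
        chain.append(current)
        current = current.__cause__ or current.__context__
    return chain


def classify_error(exc: Exception) -> tuple[int, str]:
    """Map an extraction failure to (status_code, detail)."""
    for err in cause_chain(exc):
        if isinstance(err, ValidationError):
            count = err.error_count()
            return 502, f"LLM output failed schema validation ({count} error(s))"
        if isinstance(err, TimeoutError):
            return 504, "LLM request timed out"
    return 500, "Extraction failed"


class Rating(BaseModel):
    stars: int = Field(ge=1, le=5)


# Simulate a library wrapping the validation error after giving up.
try:
    try:
        Rating(stars=9)
    except ValidationError as inner:
        raise RuntimeError("retries exhausted") from inner
except RuntimeError as outer:
    print(classify_error(outer))
```

Inside an endpoint, the result could feed `HTTPException(status, detail)` in place of the blanket 500. Walking the cause chain matters because a library may wrap the original validation error in its own exception type.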

Example Usage

# Extract contact info
curl -X POST http://localhost:8000/extract/contact \
  -H "Content-Type: application/json" \
  -d '{"text": "Hi, I am John Smith, CTO at TechCorp. You can reach me at john@techcorp.com or call 555-0123."}'

# Analyze review
curl -X POST http://localhost:8000/extract/review \
  -H "Content-Type: application/json" \
  -d '{"text": "Great product! Fast delivery and excellent quality. The only downside is the high price. Would recommend to others."}'
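
The same requests can be sent from Python. The sketch below uses only the standard library and assumes the API is running at `http://localhost:8000` as in the curl examples; `build_request` and `extract` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed local dev server


def build_request(endpoint: str, text: str) -> urllib.request.Request:
    """Prepare a POST to /extract/<endpoint> with a JSON body."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/extract/{endpoint}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract(endpoint: str, text: str) -> dict:
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(build_request(endpoint, text), timeout=60) as resp:
        return json.load(resp)


request = build_request("contact", "Hi, I am John Smith, CTO at TechCorp.")
print(request.get_method(), request.full_url)
```

With the server running, `extract("contact", "...")` returns the parsed response as a dictionary.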

Key Concepts Recap

Concept	What It Is	Why It Matters
Pydantic Schema	Python class defining data structure	Ensures output matches expected format
Instructor	Library that patches OpenAI client	Handles retries, validation, structured output
response_model	Parameter that passes the schema to the LLM call	Tells the LLM what structure to return
Field Validators	Custom validation rules	Catch invalid data (bad emails, etc.)
max_retries	Auto-retry on validation failure	Handles LLM inconsistency gracefully
Batch Extraction	Extract multiple items from text	Find all contacts, all events, etc.

Next Steps

Content Generation - Generate content with templates
Code Assistant - Build a coding AI
