Concept 01
The Extraction Problem — LLMs Return Prose, Your App Needs Data
LLMs are trained to produce natural, human-readable text. When you ask "Extract the name and salary from this job posting," you might get: "The candidate's name would be filled in by the applicant, and the salary range mentioned is $80,000 to $100,000 per year." That's great prose and useless data.
Your application needs: {"salary_min": 80000, "salary_max": 100000, "currency": "USD"}. The extraction challenge is consistently getting the model to return machine-readable, validated, typed data instead of prose.
This is one of the most common LLM engineering tasks: processing unstructured text (emails, documents, forms, user input) and converting it to structured records for your database or business logic.
- HR systems: Parse uploaded resumes into candidate records
- Finance: Extract line items from invoice PDFs
- E-commerce: Parse product attributes from descriptions
- CRM: Extract contact info from emails and chat logs
- Healthcare: Extract medications and dosages from clinical notes
- Legal: Extract parties, dates, and clauses from contracts
Concept 02
3 Methods Compared — From Fragile to Bulletproof
| Method | Reliability | Type Safety | Works With | When to Use |
|---|---|---|---|---|
| Prompt + Parse | Low (60-80%) | None | All providers | Prototyping only |
| JSON Mode | High (95%+) | Partial | OpenAI, Gemini | When Pydantic parse isn't available |
| Structured Outputs (Pydantic) | Very High (99%+) | Full | OpenAI gpt-4o+ | Production data extraction |
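To see why the first row is fragile, here is a minimal, illustrative sketch of prompt-and-parse against a typical chatty reply (the reply string is invented for demonstration):

```python
import json
import re

# A typical "Prompt + Parse" reply: the model wraps the JSON in chatty prose.
reply = ('Sure! Here is the extracted data: '
         '{"salary_min": 80000, "salary_max": 100000, "currency": "USD"} '
         'Let me know if you need anything else.')

# The usual workaround: regex out the first {...} span and hope it parses.
match = re.search(r"\{.*\}", reply, re.DOTALL)
data = json.loads(match.group(0)) if match else None
print(data["salary_min"])  # 80000 -- works here, but breaks on nested braces,
                           # trailing commas, single quotes, or no JSON at all
```

This is the 60-80% reliability in the table: every failure mode in the comment above shows up in real model output sooner or later.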
Concept 03
Defining Your Extraction Schema with Pydantic
Pydantic models are the backbone of structured extraction. They define what you expect, enforce types, and let you add validation logic. The OpenAI SDK can generate a JSON schema from your Pydantic model and constrain the LLM to match it exactly.
from pydantic import BaseModel, Field, field_validator
from typing import Optional, List, Literal

# Example: job posting extraction schema
class SalaryRange(BaseModel):
    min_amount: Optional[float] = Field(None, description="Minimum salary amount")
    max_amount: Optional[float] = Field(None, description="Maximum salary amount")
    currency: str = Field(default="USD", description="3-letter currency code")
    period: Literal["annual", "monthly", "hourly"] = Field(
        default="annual",
        description="Payment period"
    )

class JobPosting(BaseModel):
    """Schema for extracting structured data from a job posting."""
    title: str = Field(description="Job title / role name")
    company: Optional[str] = Field(None, description="Company name")
    location: Optional[str] = Field(None, description="Job location or 'Remote'")
    employment_type: Optional[Literal["full-time", "part-time", "contract", "internship"]] = None
    salary: Optional[SalaryRange] = Field(None, description="Compensation details if mentioned")
    required_skills: List[str] = Field(
        default_factory=list,
        description="List of required technical skills and tools"
    )
    years_experience: Optional[int] = Field(
        None,
        description="Minimum years of experience required. null if not specified."
    )
    remote_ok: bool = Field(
        default=False,
        description="True if remote work is allowed"
    )
    application_deadline: Optional[str] = Field(
        None,
        description="Application deadline in YYYY-MM-DD format, or null"
    )

    @field_validator("required_skills")
    @classmethod
    def deduplicate_skills(cls, v):
        """Remove duplicate skills (case-insensitive)."""
        seen = set()
        result = []
        for skill in v:
            key = skill.lower().strip()
            if key not in seen:
                seen.add(key)
                result.append(skill.strip())
        return result

    @field_validator("years_experience")
    @classmethod
    def validate_experience(cls, v):
        if v is not None and (v < 0 or v > 50):
            raise ValueError(f"Invalid years_experience: {v}")
        return v
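The validators run on every model construction, whether the data comes from an LLM or a unit test. A standalone sketch of the same dedupe pattern on a hypothetical minimal model (no API call needed):

```python
from pydantic import BaseModel, Field, field_validator
from typing import List

# Hypothetical minimal model reproducing the dedupe pattern used above.
class SkillList(BaseModel):
    skills: List[str] = Field(default_factory=list)

    @field_validator("skills")
    @classmethod
    def dedupe(cls, v):
        seen, out = set(), []
        for s in v:
            key = s.lower().strip()
            if key not in seen:
                seen.add(key)
                out.append(s.strip())
        return out

print(SkillList(skills=["Python", "python ", "FastAPI"]).skills)
# ['Python', 'FastAPI'] -- duplicate collapsed, whitespace stripped
```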
Concept 04
Retry Logic — When Extraction Fails, Try Again Smarter
Even a constrained model occasionally returns output that fails validation. Instead of retrying blindly, feed the validation error back so the model can correct the specific mistake.
import json
from openai import OpenAI
from pydantic import BaseModel, ValidationError
from typing import Type, TypeVar

client = OpenAI()
T = TypeVar("T", bound=BaseModel)

def extract_with_retry(
    text: str,
    schema: Type[T],
    system_instruction: str = "",
    model: str = "gpt-4o-2024-08-06",
    max_retries: int = 3,
) -> T:
    """
    Extract structured data from text with retry logic.
    On failure, tells the model what went wrong and asks it to fix it.
    """
    last_error = None
    for attempt in range(max_retries):
        try:
            if attempt == 0:
                # First attempt: clean extraction
                user_content = f"Extract the requested information from this text:\n\n{text}"
            else:
                # Subsequent attempts: include the error so the model can fix it
                user_content = f"""Previous attempt failed with this error: {last_error}

Please extract the information again, correcting the error.
Pay special attention to:
- Required fields must be present
- Types must match exactly (integers must be integers, not strings)
- Enum values must be exactly as specified

TEXT TO EXTRACT FROM:
{text}"""

            messages = []
            if system_instruction:
                messages.append({"role": "system", "content": system_instruction})
            messages.append({"role": "user", "content": user_content})

            # Try OpenAI structured outputs first (best reliability)
            try:
                response = client.beta.chat.completions.parse(
                    model=model,
                    messages=messages,
                    response_format=schema,
                    temperature=0,
                )
                return response.choices[0].message.parsed
            except Exception:
                # Fall back to JSON mode + manual Pydantic validation
                response = client.chat.completions.create(
                    model=model,
                    messages=messages,
                    response_format={"type": "json_object"},
                    temperature=0,
                )
                raw_json = response.choices[0].message.content
                data = json.loads(raw_json)
                return schema.model_validate(data)

        except Exception as e:
            # Covers ValidationError, json.JSONDecodeError, and API errors alike
            last_error = str(e)
            print(f"  Attempt {attempt + 1} failed: {last_error[:100]}")
            if attempt == max_retries - 1:
                raise RuntimeError(
                    f"Extraction failed after {max_retries} attempts. "
                    f"Last error: {last_error}"
                ) from e

    raise RuntimeError("Extraction failed")  # Unreachable; satisfies type checkers
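What does the error string fed back on retry actually look like? It is Pydantic's own message, which names the offending field, its location, and the expected type. A minimal illustration with a hypothetical Mini model:

```python
from pydantic import BaseModel, ValidationError

# Hypothetical mini-schema to show what the retry loop feeds back to the model.
class Mini(BaseModel):
    years_experience: int

try:
    Mini(years_experience="five")  # wrong type, as a model sometimes returns
    feedback = None
except ValidationError as e:
    feedback = str(e)

print(feedback)  # names the field and says an integer was expected
```

Because the message pinpoints the field and the expected type, the model usually fixes exactly that field on the next attempt rather than reshuffling the whole output.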
Concept 05
Complete Resume Parser
Nested Pydantic models (WorkExperience and Education inside Resume) let the model return a full hierarchy in a single call.
from pydantic import BaseModel, Field
from typing import Optional, List

class WorkExperience(BaseModel):
    company: str
    title: str
    start_date: Optional[str] = Field(None, description="YYYY-MM format or year only")
    end_date: Optional[str] = Field(None, description="YYYY-MM format, year, or 'present'")
    responsibilities: List[str] = Field(default_factory=list)

class Education(BaseModel):
    institution: str
    degree: str
    field_of_study: Optional[str] = None
    graduation_year: Optional[int] = None

class Resume(BaseModel):
    full_name: str
    email: Optional[str] = None
    phone: Optional[str] = None
    location: Optional[str] = None
    summary: Optional[str] = Field(None, description="Professional summary if present")
    skills: List[str] = Field(default_factory=list, description="Technical and soft skills")
    work_experience: List[WorkExperience] = Field(default_factory=list)
    education: List[Education] = Field(default_factory=list)
    total_years_experience: Optional[int] = Field(
        None,
        description="Estimate of total professional experience in years"
    )
    seniority_level: Optional[str] = Field(
        None,
        description="Estimated level: junior, mid, senior, lead, executive"
    )

RESUME_SYSTEM_PROMPT = """You are an expert resume parser. Extract all information accurately.
For dates: use YYYY-MM format when month is available, YYYY when only year is given.
For skills: extract ALL mentioned technologies, tools, frameworks, and languages.
For total_years_experience: sum up all work experience duration.
For seniority_level: infer from years of experience and titles."""

def parse_resume(resume_text: str) -> Resume:
    """Parse a resume into structured data."""
    return extract_with_retry(
        text=resume_text,
        schema=Resume,
        system_instruction=RESUME_SYSTEM_PROMPT,
    )
# Test with sample resume
sample_resume = """
SARAH CHEN
sarah.chen@email.com | (555) 123-4567 | San Francisco, CA | linkedin.com/in/sarahchen
SUMMARY
Senior software engineer with 7+ years building scalable Python backends and ML pipelines.
Passionate about turning data into business impact.
EXPERIENCE
Senior Software Engineer | Stripe | 2021 - Present
- Led migration of payment processing service to microservices (Python, FastAPI, Kubernetes)
- Reduced API latency by 40% through caching strategy (Redis, PostgreSQL query optimization)
- Mentored team of 4 junior engineers
Software Engineer | Airbnb | 2018 - 2021
- Built recommendation system serving 50M users daily (Python, TensorFlow, Apache Spark)
- Designed and implemented A/B testing framework
EDUCATION
BS Computer Science | Stanford University | 2018
SKILLS
Python, FastAPI, Django, TensorFlow, PyTorch, PostgreSQL, Redis, Kubernetes, Docker,
AWS (EC2, S3, Lambda), Apache Spark, Kafka, Git, REST APIs, Microservices
"""
result = parse_resume(sample_resume)
print(f"Name: {result.full_name}")
print(f"Email: {result.email}")
print(f"Seniority: {result.seniority_level}")
print(f"Skills: {', '.join(result.skills[:5])}...")
print(f"Companies: {[e.company for e in result.work_experience]}")
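The seniority_level field is an inference the model makes, so it is worth cross-checking deterministically. One possible post-processing rule (the thresholds below are illustrative, not from the source):

```python
def seniority_from_years(years):
    """Map total years of experience to a coarse level (illustrative thresholds)."""
    if years is None:
        return None
    if years < 2:
        return "junior"
    if years < 5:
        return "mid"
    if years < 9:
        return "senior"
    return "lead"

print(seniority_from_years(7))   # senior
print(seniority_from_years(None))  # None -- no experience data, no guess
```

If the model's label and the rule disagree by more than one level, route the resume for human review instead of trusting either.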
Concept 06
Invoice Extractor — Financial Data Extraction
from pydantic import BaseModel, Field, field_validator
from typing import Optional, List

class LineItem(BaseModel):
    description: str
    quantity: Optional[float] = None
    unit_price: Optional[float] = None
    amount: float
    tax_rate: Optional[float] = Field(None, description="Tax rate as decimal, e.g. 0.18 for 18%")

class Invoice(BaseModel):
    invoice_number: str
    issue_date: Optional[str] = Field(None, description="YYYY-MM-DD format")
    due_date: Optional[str] = Field(None, description="YYYY-MM-DD format")
    vendor_name: str
    vendor_address: Optional[str] = None
    vendor_gstin: Optional[str] = Field(None, description="GST/VAT number if present")
    client_name: Optional[str] = None
    client_address: Optional[str] = None
    line_items: List[LineItem] = Field(default_factory=list)
    subtotal: Optional[float] = None
    tax_amount: Optional[float] = None
    total_amount: float
    currency: str = Field(default="INR", description="3-letter currency code")
    payment_terms: Optional[str] = None
    notes: Optional[str] = None

    @field_validator("total_amount")
    @classmethod
    def validate_total(cls, v):
        if v < 0:
            raise ValueError("Total amount cannot be negative")
        return round(v, 2)

INVOICE_SYSTEM_PROMPT = """You are an expert invoice data extractor.
Extract ALL fields from the invoice text or image description.
For amounts: always use numeric values (not strings). Remove currency symbols.
For dates: convert to YYYY-MM-DD format.
For GSTIN/VAT: extract exactly as shown in the document.
If a value is not present, set it to null — never guess or infer missing financial data."""

def extract_invoice(invoice_text: str) -> Invoice:
    """Extract structured data from invoice text."""
    return extract_with_retry(
        text=invoice_text,
        schema=Invoice,
        system_instruction=INVOICE_SYSTEM_PROMPT,
    )
# Test
sample_invoice = """
INVOICE
Invoice #: INV-2026-0342
Date: March 15, 2026
Due Date: April 14, 2026
From:
TechSolutions India Pvt Ltd
42 MG Road, Bangalore 560001
GSTIN: 29AABCT1234A1Z5
To:
Prepflix Learning Platform
New Delhi, 110001
SERVICES:
- Backend API Development (40 hours @ ₹3,500/hr) ₹1,40,000
- Database Optimization (8 hours @ ₹4,000/hr) ₹32,000
- DevOps Setup and CI/CD Pipeline ₹25,000
Subtotal: ₹1,97,000
GST (18%): ₹35,460
TOTAL: ₹2,32,460
Payment Terms: Net 30 days
Bank Transfer to HDFC #XXXXXXXX1234
"""
invoice = extract_invoice(sample_invoice)
print(f"Invoice: {invoice.invoice_number}")
print(f"Vendor: {invoice.vendor_name}")
print(f"Total: {invoice.currency} {invoice.total_amount:,.2f}")
print(f"Line Items: {len(invoice.line_items)}")
for item in invoice.line_items:
    print(f"  - {item.description}: {item.amount:,.2f}")
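Note the sample uses Indian digit grouping (₹1,40,000 is 140,000). The system prompt above asks the model to emit clean numbers, but if you ever normalize raw amount strings yourself, a small hypothetical helper might look like:

```python
import re

def parse_amount(raw: str) -> float:
    """Strip currency symbols and group separators: '₹1,40,000' -> 140000.0."""
    cleaned = re.sub(r"[^\d.]", "", raw)
    return float(cleaned)

print(parse_amount("₹1,40,000"))  # 140000.0
print(parse_amount("₹2,32,460"))  # 232460.0
```

This works for Indian and Western grouping alike since it drops every separator; it does not handle locales that use "," as the decimal mark.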
Concept 07
Batch Extraction Pipeline
For hundreds of documents, sequential extraction is too slow. Async calls with bounded concurrency keep throughput high without tripping rate limits.
import asyncio
from openai import AsyncOpenAI
from pydantic import BaseModel
from typing import Optional, Type, TypeVar

async_client = AsyncOpenAI()
T = TypeVar("T", bound=BaseModel)

async def extract_single_async(
    text: str,
    schema: Type[T],
    system_instruction: str = "",
) -> tuple[Optional[T], str]:
    """Async extraction for a single document. Returns (result, status)."""
    try:
        response = await async_client.beta.chat.completions.parse(
            model="gpt-4o-mini",
            messages=[
                *([{"role": "system", "content": system_instruction}] if system_instruction else []),
                {"role": "user", "content": f"Extract from this text:\n\n{text}"}
            ],
            response_format=schema,
            temperature=0,
        )
        return response.choices[0].message.parsed, "success"
    except Exception as e:
        return None, str(e)

async def batch_extract_async(
    texts: list[str],
    schema: Type[T],
    system_instruction: str = "",
    concurrency: int = 5,
) -> list[dict]:
    """
    Extract structured data from multiple documents concurrently.
    Uses a semaphore to limit concurrent API calls (respect rate limits).
    """
    semaphore = asyncio.Semaphore(concurrency)

    async def extract_with_semaphore(idx: int, text: str):
        async with semaphore:
            result, error = await extract_single_async(text, schema, system_instruction)
            return {"index": idx, "result": result, "error": error}

    tasks = [extract_with_semaphore(i, text) for i, text in enumerate(texts)]
    results = await asyncio.gather(*tasks)

    successes = sum(1 for r in results if r["error"] == "success")
    print(f"Batch extraction: {successes}/{len(texts)} succeeded")
    return results

# Run batch extraction
async def main():
    resumes = ["resume text 1...", "resume text 2...", "resume text 3..."]
    results = await batch_extract_async(resumes, Resume, RESUME_SYSTEM_PROMPT)
    for r in results:
        if r["result"]:
            print(f"[{r['index']}] {r['result'].full_name}")
        else:
            print(f"[{r['index']}] FAILED: {r['error']}")

# asyncio.run(main())
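The semaphore pattern is worth verifying in isolation. This self-contained sketch swaps the API call for a fake coroutine and confirms that no more than `concurrency` tasks ever run at once:

```python
import asyncio

async def demo(concurrency: int = 3, n_tasks: int = 10) -> int:
    """Return the peak number of tasks observed inside the semaphore."""
    semaphore = asyncio.Semaphore(concurrency)
    running = 0
    peak = 0

    async def fake_extract(i: int):
        nonlocal running, peak
        async with semaphore:
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0.01)  # stand-in for the API call
            running -= 1
        return i

    await asyncio.gather(*(fake_extract(i) for i in range(n_tasks)))
    return peak

peak = asyncio.run(demo())
print(f"Peak concurrent tasks: {peak}")  # never exceeds the limit of 3
```

Tune `concurrency` to your provider's rate limits; too high and you trade extraction failures for speed.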
Concept 08
Common Extraction Mistakes
Making too many fields required causes the whole extraction to fail whenever any one of them is absent from the source text. Use Optional generously and require only truly essential fields. Let business logic handle missing values after extraction.
A field named "amount" with no description is ambiguous. Always spell it out: "Invoice total amount in the document's original currency, as a numeric value (no currency symbol)." Field descriptions are part of the JSON schema the model sees, so they act as per-field instructions.
Even with Pydantic and structured outputs, validate business rules after extraction. Example: if invoice.total_amount != invoice.subtotal + invoice.tax_amount: flag_for_review(). LLMs can still make arithmetic mistakes in complex invoices.
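That arithmetic check can be a small helper with a tolerance for rounding (illustrative, not part of the schema; math.isclose avoids false alarms from float representation):

```python
import math

def totals_consistent(subtotal, tax_amount, total_amount, tol=0.01):
    """Flag invoices where subtotal + tax doesn't match the stated total."""
    if subtotal is None or tax_amount is None:
        return True  # nothing to cross-check against
    return math.isclose(subtotal + tax_amount, total_amount, abs_tol=tol)

print(totals_consistent(197000, 35460, 232460))  # True  -- matches the sample invoice
print(totals_consistent(197000, 35460, 230000))  # False -- flag for human review
```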
- Use Pydantic + OpenAI structured outputs for production data extraction
- JSON mode is a reliable fallback when structured outputs aren't available
- Prompt-only parsing belongs in prototypes, not production
- Retry with error feedback — tell the model what went wrong
- Write detailed field descriptions in your Pydantic models — the model reads them
- Always validate business rules after extraction (sums, date ranges, required combos)
- Use async batch extraction for processing large document sets efficiently