Concept 01

The Extraction Problem — LLMs Return Prose, Your App Needs Data

LLMs are trained to produce natural, human-readable text. When you ask "Extract the name and salary from this job posting," you might get: "The candidate's name would be filled in by the applicant, and the salary range mentioned is $80,000 to $100,000 per year." That's readable prose, but useless data.

Your application needs: {"salary_min": 80000, "salary_max": 100000, "currency": "USD"}. The extraction challenge is consistently getting the model to return machine-readable, validated, typed data instead of prose.

This is one of the most common LLM engineering tasks: processing unstructured text (emails, documents, forms, user input) and converting it to structured records for your database or business logic.

Where Structured Extraction is Used
  • HR systems: Parse uploaded resumes into candidate records
  • Finance: Extract line items from invoice PDFs
  • E-commerce: Parse product attributes from descriptions
  • CRM: Extract contact info from emails and chat logs
  • Healthcare: Extract medications and dosages from clinical notes
  • Legal: Extract parties, dates, and clauses from contracts

Concept 02

3 Methods Compared — From Fragile to Bulletproof

Method                          Reliability        Type Safety   Works With        When to Use
Prompt + Parse                  Low (60-80%)       None          All providers     Prototyping only
JSON Mode                       High (95%+)        Partial       OpenAI, Gemini    When Pydantic parse isn't available
Structured Outputs (Pydantic)   Very High (99%+)   Full          OpenAI gpt-4o+    Production data extraction
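To see why Prompt + Parse is the fragile end of the spectrum: you have to fish the data out of free text yourself. A minimal sketch of that approach (the `parse_loose_json` helper is hypothetical, not part of any SDK):

```python
import json
import re

def parse_loose_json(reply: str) -> dict:
    """Best-effort: grab the first {...} block out of a prose reply."""
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if not match:
        raise ValueError("No JSON object found in reply")
    return json.loads(match.group(0))

# Works when the model cooperates:
reply = 'Sure! Here is the data:\n{"salary_min": 80000, "salary_max": 100000}'
data = parse_loose_json(reply)  # {'salary_min': 80000, 'salary_max': 100000}
```

It breaks the moment the model emits single quotes, trailing commentary containing braces, markdown code fences, or no JSON at all — which is exactly the 20-40% failure band in the table.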

Concept 03

Defining Your Extraction Schema with Pydantic

Pydantic models are the backbone of structured extraction. They define what you expect, enforce types, and let you add validation logic. The OpenAI SDK can generate a JSON schema from your Pydantic model and constrain the LLM to match it exactly.

from pydantic import BaseModel, Field, field_validator
from typing import Optional, List, Literal

# Example: Job posting extraction schema
class SalaryRange(BaseModel):
    min_amount: Optional[float] = Field(None, description="Minimum salary amount")
    max_amount: Optional[float] = Field(None, description="Maximum salary amount")
    currency: str = Field(default="USD", description="3-letter currency code")
    period: Literal["annual", "monthly", "hourly"] = Field(
        default="annual",
        description="Payment period"
    )

class JobPosting(BaseModel):
    """Schema for extracting structured data from a job posting."""
    title: str = Field(description="Job title / role name")
    company: Optional[str] = Field(None, description="Company name")
    location: Optional[str] = Field(None, description="Job location or 'Remote'")
    employment_type: Optional[Literal["full-time", "part-time", "contract", "internship"]] = None
    salary: Optional[SalaryRange] = Field(None, description="Compensation details if mentioned")
    required_skills: List[str] = Field(
        default_factory=list,
        description="List of required technical skills and tools"
    )
    years_experience: Optional[int] = Field(
        None,
        description="Minimum years of experience required. null if not specified."
    )
    remote_ok: bool = Field(
        default=False,
        description="True if remote work is allowed"
    )
    application_deadline: Optional[str] = Field(
        None,
        description="Application deadline in YYYY-MM-DD format, or null"
    )

    @field_validator("required_skills")
    @classmethod
    def deduplicate_skills(cls, v):
        """Remove duplicate skills (case-insensitive)."""
        seen = set()
        result = []
        for skill in v:
            key = skill.lower().strip()
            if key not in seen:
                seen.add(key)
                result.append(skill.strip())
        return result

    @field_validator("years_experience")
    @classmethod
    def validate_experience(cls, v):
        if v is not None and (v < 0 or v > 50):
            raise ValueError(f"Invalid years_experience: {v}")
        return v
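The same type enforcement fires on plain dicts with no API call, which makes schemas easy to unit-test before you wire them to a model. A standalone sketch with a trimmed-down model (the `Mini` class is illustrative, not from the SDK):

```python
from pydantic import BaseModel, ValidationError
from typing import Literal, Optional

class Mini(BaseModel):
    title: str
    employment_type: Optional[Literal["full-time", "part-time"]] = None
    years_experience: Optional[int] = None

# Valid input passes and comes back typed
ok = Mini.model_validate({"title": "Engineer", "employment_type": "full-time"})

# A non-integer years_experience is rejected with a precise error
try:
    Mini.model_validate({"title": "Engineer", "years_experience": "five"})
except ValidationError as e:
    n_errors = e.error_count()  # 1
```

The error object names the offending field and the expected type — the same message you can feed back to the model on retry.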

Concept 04

Retry Logic — When Extraction Fails, Try Again Smarter

import json
from openai import OpenAI
from pydantic import BaseModel, ValidationError
from typing import Type, TypeVar

client = OpenAI()
T = TypeVar("T", bound=BaseModel)

def extract_with_retry(
    text: str,
    schema: Type[T],
    system_instruction: str = "",
    model: str = "gpt-4o-2024-08-06",
    max_retries: int = 3,
) -> T:
    """
    Extract structured data from text with retry logic.
    On failure, tells the model what went wrong and asks it to fix it.
    """
    last_error = None

    for attempt in range(max_retries):
        try:
            if attempt == 0:
                # First attempt: clean extraction
                user_content = f"Extract the requested information from this text:\n\n{text}"
            else:
                # Subsequent attempts: include the error so the model can fix it
                user_content = f"""Previous attempt failed with this error: {last_error}

Please extract the information again, correcting the error.
Pay special attention to:
- Required fields must be present
- Types must match exactly (integers must be integers, not strings)
- Enum values must be exactly as specified

TEXT TO EXTRACT FROM:
{text}"""

            messages = []
            if system_instruction:
                messages.append({"role": "system", "content": system_instruction})
            messages.append({"role": "user", "content": user_content})

            # Try OpenAI structured outputs first (best reliability)
            try:
                response = client.beta.chat.completions.parse(
                    model=model,
                    messages=messages,
                    response_format=schema,
                    temperature=0,
                )
                return response.choices[0].message.parsed

            except Exception:
                # Fallback: JSON mode + manual Pydantic validation.
                # JSON mode requires the word "JSON" to appear in the messages
                # and does not enforce the schema, so include it in the prompt.
                json_messages = messages[:-1] + [{
                    "role": "user",
                    "content": (
                        f"{user_content}\n\nRespond with a JSON object matching "
                        f"this schema:\n{json.dumps(schema.model_json_schema())}"
                    ),
                }]
                response = client.chat.completions.create(
                    model=model,
                    messages=json_messages,
                    response_format={"type": "json_object"},
                    temperature=0,
                )
                raw_json = response.choices[0].message.content
                return schema.model_validate_json(raw_json)

        except Exception as e:  # covers ValidationError and json.JSONDecodeError
            last_error = str(e)
            print(f"  Attempt {attempt + 1} failed: {last_error[:100]}")
            if attempt == max_retries - 1:
                raise RuntimeError(
                    f"Extraction failed after {max_retries} attempts. "
                    f"Last error: {last_error}"
                ) from e

    raise RuntimeError("Extraction failed")  # Should not reach here
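The value of the error-feedback retry comes from how specific Pydantic's messages are. A quick standalone look at what `last_error` actually contains (the `Probe` model is illustrative):

```python
from pydantic import BaseModel, ValidationError

class Probe(BaseModel):
    years_experience: int

try:
    Probe.model_validate({"years_experience": "about five"})
except ValidationError as e:
    feedback = str(e)

# feedback names the field and the expected type, e.g. a line containing
# "years_experience" followed by "Input should be a valid integer" —
# exactly the kind of detail the retry prompt forwards to the model.
```

Because the message pinpoints the field and the type mismatch, the second attempt usually succeeds where a bare "try again" would not.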

Concept 05

Complete Resume Parser

from pydantic import BaseModel, Field
from typing import Optional, List

class WorkExperience(BaseModel):
    company: str
    title: str
    start_date: Optional[str] = Field(None, description="YYYY-MM format or year only")
    end_date: Optional[str] = Field(None, description="YYYY-MM format, year, or 'present'")
    responsibilities: List[str] = Field(default_factory=list)

class Education(BaseModel):
    institution: str
    degree: str
    field_of_study: Optional[str] = None
    graduation_year: Optional[int] = None

class Resume(BaseModel):
    full_name: str
    email: Optional[str] = None
    phone: Optional[str] = None
    location: Optional[str] = None
    summary: Optional[str] = Field(None, description="Professional summary if present")
    skills: List[str] = Field(default_factory=list, description="Technical and soft skills")
    work_experience: List[WorkExperience] = Field(default_factory=list)
    education: List[Education] = Field(default_factory=list)
    total_years_experience: Optional[int] = Field(
        None,
        description="Estimate of total professional experience in years"
    )
    seniority_level: Optional[str] = Field(
        None,
        description="Estimated level: junior, mid, senior, lead, executive"
    )

RESUME_SYSTEM_PROMPT = """You are an expert resume parser. Extract all information accurately.
For dates: use YYYY-MM format when month is available, YYYY when only year is given.
For skills: extract ALL mentioned technologies, tools, frameworks, and languages.
For total_years_experience: sum up all work experience duration.
For seniority_level: infer from years of experience and titles."""

def parse_resume(resume_text: str) -> Resume:
    """Parse a resume into structured data."""
    return extract_with_retry(
        text=resume_text,
        schema=Resume,
        system_instruction=RESUME_SYSTEM_PROMPT,
    )

# Test with sample resume
sample_resume = """
SARAH CHEN
sarah.chen@email.com | (555) 123-4567 | San Francisco, CA | linkedin.com/in/sarahchen

SUMMARY
Senior software engineer with 7+ years building scalable Python backends and ML pipelines.
Passionate about turning data into business impact.

EXPERIENCE
Senior Software Engineer | Stripe | 2021 - Present
- Led migration of payment processing service to microservices (Python, FastAPI, Kubernetes)
- Reduced API latency by 40% through caching strategy (Redis, PostgreSQL query optimization)
- Mentored team of 4 junior engineers

Software Engineer | Airbnb | 2018 - 2021
- Built recommendation system serving 50M users daily (Python, TensorFlow, Apache Spark)
- Designed and implemented A/B testing framework

EDUCATION
BS Computer Science | Stanford University | 2018

SKILLS
Python, FastAPI, Django, TensorFlow, PyTorch, PostgreSQL, Redis, Kubernetes, Docker,
AWS (EC2, S3, Lambda), Apache Spark, Kafka, Git, REST APIs, Microservices
"""

result = parse_resume(sample_resume)
print(f"Name: {result.full_name}")
print(f"Email: {result.email}")
print(f"Seniority: {result.seniority_level}")
print(f"Skills: {', '.join(result.skills[:5])}...")
print(f"Companies: {[e.company for e in result.work_experience]}")

Concept 06

Invoice Extractor — Financial Data Extraction

from pydantic import BaseModel, Field, field_validator
from typing import Optional, List

class LineItem(BaseModel):
    description: str
    quantity: Optional[float] = None
    unit_price: Optional[float] = None
    amount: float
    tax_rate: Optional[float] = Field(None, description="Tax rate as decimal, e.g. 0.18 for 18%")

class Invoice(BaseModel):
    invoice_number: str
    issue_date: Optional[str] = Field(None, description="YYYY-MM-DD format")
    due_date: Optional[str] = Field(None, description="YYYY-MM-DD format")
    vendor_name: str
    vendor_address: Optional[str] = None
    vendor_gstin: Optional[str] = Field(None, description="GST/VAT number if present")
    client_name: Optional[str] = None
    client_address: Optional[str] = None
    line_items: List[LineItem] = Field(default_factory=list)
    subtotal: Optional[float] = None
    tax_amount: Optional[float] = None
    total_amount: float
    currency: str = Field(default="INR", description="3-letter currency code")
    payment_terms: Optional[str] = None
    notes: Optional[str] = None

    @field_validator("total_amount")
    @classmethod
    def validate_total(cls, v):
        if v < 0:
            raise ValueError("Total amount cannot be negative")
        return round(v, 2)

INVOICE_SYSTEM_PROMPT = """You are an expert invoice data extractor.
Extract ALL fields from the invoice text or image description.
For amounts: always use numeric values (not strings). Remove currency symbols.
For dates: convert to YYYY-MM-DD format.
For GSTIN/VAT: extract exactly as shown in the document.
If a value is not present, set it to null — never guess or infer missing financial data."""

def extract_invoice(invoice_text: str) -> Invoice:
    """Extract structured data from invoice text."""
    return extract_with_retry(
        text=invoice_text,
        schema=Invoice,
        system_instruction=INVOICE_SYSTEM_PROMPT,
    )

# Test
sample_invoice = """
INVOICE

Invoice #: INV-2026-0342
Date: March 15, 2026
Due Date: April 14, 2026

From:
TechSolutions India Pvt Ltd
42 MG Road, Bangalore 560001
GSTIN: 29AABCT1234A1Z5

To:
Prepflix Learning Platform
New Delhi, 110001

SERVICES:
- Backend API Development (40 hours @ ₹3,500/hr)    ₹1,40,000
- Database Optimization (8 hours @ ₹4,000/hr)         ₹32,000
- DevOps Setup and CI/CD Pipeline                      ₹25,000

                                          Subtotal:  ₹1,97,000
                                          GST (18%): ₹35,460
                                          TOTAL:     ₹2,32,460

Payment Terms: Net 30 days
Bank Transfer to HDFC #XXXXXXXX1234
"""

invoice = extract_invoice(sample_invoice)
print(f"Invoice: {invoice.invoice_number}")
print(f"Vendor: {invoice.vendor_name}")
print(f"Total: {invoice.currency} {invoice.total_amount:,.2f}")
print(f"Line Items: {len(invoice.line_items)}")
for item in invoice.line_items:
    print(f"  - {item.description}: {item.amount:,.2f}")

Concept 07

Batch Extraction Pipeline

import asyncio
from openai import AsyncOpenAI
from pydantic import BaseModel
from typing import Optional, Type, TypeVar

async_client = AsyncOpenAI()
T = TypeVar("T", bound=BaseModel)

async def extract_single_async(
    text: str,
    schema: Type[T],
    system_instruction: str = "",
) -> tuple[Optional[T], str]:
    """Async extraction for a single document."""
    try:
        response = await async_client.beta.chat.completions.parse(
            model="gpt-4o-mini",
            messages=[
                *([{"role": "system", "content": system_instruction}] if system_instruction else []),
                {"role": "user", "content": f"Extract from this text:\n\n{text}"}
            ],
            response_format=schema,
            temperature=0,
        )
        return response.choices[0].message.parsed, "success"
    except Exception as e:
        return None, str(e)

async def batch_extract_async(
    texts: list[str],
    schema: Type[T],
    system_instruction: str = "",
    concurrency: int = 5,
) -> list[dict]:
    """
    Extract structured data from multiple documents concurrently.
    Uses semaphore to limit concurrent API calls (respect rate limits).
    """
    semaphore = asyncio.Semaphore(concurrency)

    async def extract_with_semaphore(idx: int, text: str):
        async with semaphore:
            result, error = await extract_single_async(text, schema, system_instruction)
            return {"index": idx, "result": result, "error": error}

    tasks = [extract_with_semaphore(i, text) for i, text in enumerate(texts)]
    results = await asyncio.gather(*tasks)

    successes = sum(1 for r in results if r["error"] == "success")
    print(f"Batch extraction: {successes}/{len(texts)} succeeded")
    return results

# Run batch extraction
async def main():
    resumes = ["resume text 1...", "resume text 2...", "resume text 3..."]
    results = await batch_extract_async(resumes, Resume, RESUME_SYSTEM_PROMPT)
    for r in results:
        if r["result"]:
            print(f"[{r['index']}] {r['result'].full_name}")
        else:
            print(f"[{r['index']}] FAILED: {r['error']}")

# asyncio.run(main())
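The semaphore pattern itself can be verified without any API key by swapping the extraction call for a dummy coroutine that tracks peak concurrency — a standalone sketch (`fake_extract` is a stand-in, not part of the pipeline above):

```python
import asyncio

async def fake_extract(idx: int, sem: asyncio.Semaphore, counter: dict) -> int:
    """Stand-in for an API call; records how many tasks run at once."""
    async with sem:
        counter["now"] += 1
        counter["peak"] = max(counter["peak"], counter["now"])
        await asyncio.sleep(0.01)  # simulated network latency
        counter["now"] -= 1
        return idx

async def demo():
    sem = asyncio.Semaphore(3)
    counter = {"now": 0, "peak": 0}
    # gather preserves input order, matching the "index" bookkeeping above
    results = await asyncio.gather(*(fake_extract(i, sem, counter) for i in range(10)))
    return results, counter["peak"]

results, peak = asyncio.run(demo())
print(peak)  # stays <= 3 by construction
```

Ten tasks are launched at once, but the semaphore guarantees no more than three are inside the `async with` block simultaneously — the same guarantee that keeps the real pipeline under your provider's rate limit.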

Concept 08

Common Extraction Mistakes

Mistake 1: Overly strict required fields

Making too many fields required causes extraction to fail when any field is absent. Use Optional generously and only require truly essential fields. Let business logic handle missing values after extraction.
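The difference is easy to see with two toy schemas (both illustrative, not from the examples above): the strict one rejects an otherwise-good record, the lenient one keeps it and leaves the gap for business logic.

```python
from pydantic import BaseModel, ValidationError
from typing import Optional

class StrictContact(BaseModel):
    name: str
    phone: str  # required: extraction fails whenever the source has no phone

class LenientContact(BaseModel):
    name: str
    phone: Optional[str] = None  # missing value becomes None; record survives

try:
    StrictContact(name="Sarah Chen")
    strict_ok = True
except ValidationError:
    strict_ok = False  # whole record lost over one missing field

lenient = LenientContact(name="Sarah Chen")
print(lenient.phone)  # None
```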

Mistake 2: Ambiguous field descriptions

"amount" with no description is ambiguous. Always specify: "Invoice total amount in the document's original currency, as a numeric value (no currency symbol)." The description is part of the schema the model sees.

Mistake 3: No post-validation logic

Even with Pydantic and structured outputs, validate business rules after extraction. Example: if invoice.total_amount != invoice.subtotal + invoice.tax_amount: flag_for_review(). LLMs can still make arithmetic mistakes in complex invoices.
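A minimal sketch of such a check, using the sample invoice's figures (the `check_invoice_totals` helper and its tolerance are assumptions, not part of the SDK):

```python
def check_invoice_totals(subtotal, tax_amount, total_amount, tolerance=0.01):
    """Return True if the parts sum to the stated total within tolerance."""
    if subtotal is None or tax_amount is None:
        return True  # can't verify; route to human review instead if you prefer
    return abs((subtotal + tax_amount) - total_amount) <= tolerance

print(check_invoice_totals(197000.0, 35460.0, 232460.0))  # True
print(check_invoice_totals(197000.0, 35460.0, 240000.0))  # False
```

The small tolerance absorbs rounding differences between the document and float arithmetic; a failed check should flag the invoice for review, not silently correct it.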

CP-08 Summary
  • Use Pydantic + OpenAI structured outputs for production data extraction
  • JSON mode is a reliable fallback when structured outputs aren't available
  • Prompt-only parsing belongs in prototypes, not production
  • Retry with error feedback — tell the model what went wrong
  • Write detailed field descriptions in your Pydantic models — the model reads them
  • Always validate business rules after extraction (sums, date ranges, required combos)
  • Use async batch extraction for processing large document sets efficiently