Prompt Engineering for Developers That Actually Works

01System Prompt Anatomy (6 Parts) 02Zero-Shot, Few-Shot & Chain of Thought 03Guaranteed JSON Output (3 Methods) 04Prompt Templates 05Prompt Versioning Pattern 06Testing Prompts with pytest 07The 7 Deadly Prompt Mistakes 08Summary & What's Next

What You'll Build

By the end of this checkpoint, you'll have a prompt template system with versioning support and a pytest test suite that validates your prompts automatically. You'll know three different methods to guarantee structured JSON output from any LLM.

Concept 01

System Prompt Anatomy — The 6 Parts Every Great Prompt Has

The system prompt is your most powerful tool. It runs before any user input and shapes every response the model produces. Most developers write vague, one-line system prompts and wonder why their outputs are inconsistent. Here's the full anatomy of a production-quality system prompt:

#	Part	Purpose	Example
1	Role & Identity	Tell the model who it is	"You are a senior software engineer specializing in Python backend development."
2	Context & Domain	What is this system for?	"You are helping developers at a fintech startup debug and improve their code."
3	Task Description	What should the model do?	"When given code, identify bugs, explain why they're bugs, and provide fixed code."
4	Output Format	How should the response be structured?	"Always respond with: 1) Bug description 2) Fixed code block 3) Explanation."
5	Constraints & Rules	What should it never do?	"Never add new features — only fix the bug. Don't rewrite working code."
6	Tone & Style	Voice and communication style	"Be direct and technical. Skip pleasantries. Use code blocks for all code."

# A complete production system prompt for a code review assistant

CODE_REVIEW_SYSTEM_PROMPT = """
You are a senior software engineer with 15 years of Python experience, specializing in
clean architecture, performance optimization, and security best practices.

CONTEXT:
You are integrated into a developer IDE as a code review assistant. Developers paste
code and ask for review. Your audience is intermediate-to-senior Python developers.

YOUR TASK:
Analyze the provided Python code and identify: bugs, security vulnerabilities,
performance issues, and style violations (PEP 8). Suggest specific improvements.

OUTPUT FORMAT:
Respond in exactly this structure:
## Issues Found
- List each issue as: [SEVERITY: HIGH/MEDIUM/LOW] Description

## Fixed Code
```python
[improved code here]
```

## Key Changes
- Bullet list explaining what changed and why

CONSTRAINTS:
- Only suggest changes that genuinely improve the code
- Never rewrite working, idiomatic code just to show off
- If the code is good, say so clearly — don't invent issues
- Never add features that weren't requested

TONE:
Direct and technical. No pleasantries. Assume the developer is smart.
"""

Concept 02

Zero-Shot, Few-Shot, and Chain of Thought — When to Use Each

Zero-shot prompting means giving the model a task with no examples. Works for well-understood tasks where the model has seen plenty of similar training data.

from openai import OpenAI
client = OpenAI()

# Zero-shot: task with no examples
def classify_sentiment_zero_shot(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "Classify the sentiment of user reviews. "
                           "Respond with exactly one word: POSITIVE, NEGATIVE, or NEUTRAL."
            },
            {"role": "user", "content": text}
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

# Works fine for obvious cases
print(classify_sentiment_zero_shot("This product is amazing!"))  # POSITIVE
print(classify_sentiment_zero_shot("Terrible quality, broke after a week"))  # NEGATIVE

Few-shot prompting provides examples in the prompt. Dramatically improves consistency for edge cases and unusual formats. The examples teach the model exactly what output you expect.

def classify_sentiment_few_shot(text: str) -> str:
    """
    Few-shot prompting: examples teach the model the exact format and edge cases.
    Use when zero-shot is inconsistent or when format matters precisely.
    """
    few_shot_messages = [
        {"role": "system", "content": "Classify review sentiment."},
        # Example 1
        {"role": "user", "content": "The shipping was fast but the product broke immediately."},
        {"role": "assistant", "content": "NEGATIVE"},
        # Example 2
        {"role": "user", "content": "Does what it says on the tin. Nothing special."},
        {"role": "assistant", "content": "NEUTRAL"},
        # Example 3
        {"role": "user", "content": "Exceeded all expectations. Will definitely buy again!"},
        {"role": "assistant", "content": "POSITIVE"},
        # Example 4: edge case
        {"role": "user", "content": "Great build quality but overpriced for what you get."},
        {"role": "assistant", "content": "NEUTRAL"},
        # Actual query
        {"role": "user", "content": text},
    ]

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=few_shot_messages,
        temperature=0,
        max_tokens=10,  # We only need one word
    )
    return response.choices[0].message.content.strip()

Chain of Thought (CoT) prompting tells the model to think step by step before answering. Dramatically improves accuracy on reasoning tasks, math, and logic problems. The intuition: by "showing its work," the model is less likely to jump to wrong conclusions.

def solve_with_chain_of_thought(problem: str) -> dict:
    """
    Chain of thought prompting for reasoning tasks.
    Returns both the reasoning chain and the final answer.
    """
    response = client.chat.completions.create(
        model="gpt-4o",  # CoT benefits most from capable models
        messages=[
            {
                "role": "system",
                "content": """You are a careful problem solver. When given a problem:
1. Think through it step by step, showing your reasoning
2. After your reasoning, state your final answer clearly
3. Use this format:
   REASONING: [your step-by-step thinking]
   ANSWER: [the final answer only]"""
            },
            {"role": "user", "content": problem}
        ],
        temperature=0,
    )

    content = response.choices[0].message.content

    # Parse the structured response
    reasoning = ""
    answer = ""
    if "REASONING:" in content and "ANSWER:" in content:
        parts = content.split("ANSWER:")
        reasoning = parts[0].replace("REASONING:", "").strip()
        answer = parts[1].strip()

    return {"reasoning": reasoning, "answer": answer, "full_response": content}

# Test with a tricky problem
result = solve_with_chain_of_thought(
    "A train leaves New York at 2pm traveling at 80mph toward Chicago (790 miles away). "
    "Another train leaves Chicago at 3pm traveling at 100mph toward New York. "
    "At what time do they meet, and how far from New York?"
)
print("Reasoning:", result["reasoning"][:200])
print("Answer:", result["answer"])

Concept 03

Getting Guaranteed JSON Output — 3 Methods Ranked

Reliable JSON output is the foundation of every data extraction feature. Here are the three methods, from least to most reliable:

Method 1: Prompt + Parse (fragile, avoid in production)

import json

def extract_with_prompt_only(text: str) -> dict:
    """
    Weakest method — depends on model following instructions.
    Use only for prototyping.
    """
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "Extract the person's name and age. Respond ONLY with valid JSON: "
                           '{"name": "...", "age": ...}'
            },
            {"role": "user", "content": text}
        ],
        temperature=0,
    )
    # This can still fail — model might add text before/after the JSON
    return json.loads(response.choices[0].message.content)

Method 2: JSON Mode (reliable, OpenAI/Gemini)

def extract_with_json_mode(text: str) -> dict:
    """
    JSON mode guarantees valid JSON output.
    Works with OpenAI gpt-4o, gpt-4o-mini, gpt-4-turbo.
    You still need to specify the schema in the prompt — JSON mode only guarantees valid JSON.
    """
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "Extract person details. Return JSON with keys: "
                           "name (string), age (integer or null), email (string or null)"
            },
            {"role": "user", "content": text}
        ],
        response_format={"type": "json_object"},  # THE KEY FLAG
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)

Method 3: Pydantic + Structured Outputs (most reliable)

from pydantic import BaseModel, Field
from typing import Optional
import json

class PersonExtraction(BaseModel):
    name: str = Field(description="Full name of the person")
    age: Optional[int] = Field(default=None, description="Age in years, null if not mentioned")
    email: Optional[str] = Field(default=None, description="Email address, null if not mentioned")
    occupation: Optional[str] = Field(default=None, description="Job or role")

def extract_person_structured(text: str) -> PersonExtraction:
    """
    Most reliable method using OpenAI's structured outputs (parse).
    The schema is derived directly from the Pydantic model.
    """
    response = client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",  # Must use this model or newer for structured outputs
        messages=[
            {"role": "system", "content": "Extract person information from the text."},
            {"role": "user", "content": text}
        ],
        response_format=PersonExtraction,
    )
    return response.choices[0].message.parsed

# Test it
result = extract_person_structured(
    "Hi, I'm Sarah Chen, 32, working as a senior engineer at Stripe. "
    "You can reach me at sarah@example.com"
)
print(result.name)        # Sarah Chen
print(result.age)         # 32
print(result.email)       # sarah@example.com
print(result.occupation)  # senior engineer

Concept 04

Prompt Templates — Stop Hardcoding, Start Parameterizing

Hardcoding prompts in function bodies makes them impossible to test, version, or reuse. Use template systems instead.

from string import Template

# Simple string Template for basic substitution
SUMMARIZE_TEMPLATE = Template("""
You are an expert technical writer.

Summarize the following $document_type in $max_words words or fewer.
Focus on: $focus_areas
Audience: $audience_description
Tone: $tone

DOCUMENT TO SUMMARIZE:
$document
""")

def summarize(
    document: str,
    document_type: str = "article",
    max_words: int = 150,
    focus_areas: str = "key findings and actionable insights",
    audience_description: str = "technical professionals",
    tone: str = "professional and concise"
) -> str:
    prompt = SUMMARIZE_TEMPLATE.substitute(
        document=document,
        document_type=document_type,
        max_words=max_words,
        focus_areas=focus_areas,
        audience_description=audience_description,
        tone=tone,
    )

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3,
    )
    return response.choices[0].message.content


# Jinja2 for more complex templates with conditionals and loops
from jinja2 import Template

ANALYSIS_TEMPLATE = Template("""
You are a {{ role }}.

Analyze the following text:
{{ text }}

{% if examples %}
Here are examples of the analysis format:
{% for example in examples %}
Input: {{ example.input }}
Output: {{ example.output }}
{% endfor %}
{% endif %}

{% if output_format == "json" %}
Return your analysis as valid JSON.
{% elif output_format == "markdown" %}
Return your analysis formatted in Markdown.
{% else %}
Return your analysis as plain text.
{% endif %}
""")

prompt = ANALYSIS_TEMPLATE.render(
    role="data analyst",
    text="Revenue grew 23% YoY but margins declined 4 percentage points.",
    examples=[
        {"input": "Sales up 10%, costs up 20%", "output": "Negative margin trend despite revenue growth"},
    ],
    output_format="json",
)

Concept 05

Prompt Versioning — Treating Prompts Like Code

Prompts should be versioned, tested, and deployed just like code. When you change a prompt, you need to know: did that change make things better or worse? Here's a simple but effective versioning pattern:

from dataclasses import dataclass
from datetime import datetime

@dataclass
class PromptVersion:
    version: str
    created_at: str
    description: str
    system_prompt: str
    notes: str = ""

class PromptRegistry:
    """Central registry for all versioned prompts."""

    _prompts: dict = {}

    @classmethod
    def register(cls, name: str, prompt: PromptVersion):
        if name not in cls._prompts:
            cls._prompts[name] = []
        cls._prompts[name].append(prompt)

    @classmethod
    def get(cls, name: str, version: str = "latest") -> PromptVersion:
        if name not in cls._prompts:
            raise ValueError(f"No prompt registered with name: {name}")
        versions = cls._prompts[name]
        if version == "latest":
            return versions[-1]
        return next(v for v in versions if v.version == version)

# Register prompts with explicit versioning
PromptRegistry.register("customer_support", PromptVersion(
    version="1.0",
    created_at="2026-01-15",
    description="Initial customer support prompt",
    system_prompt="You are a helpful customer support agent. Be polite and concise.",
))

PromptRegistry.register("customer_support", PromptVersion(
    version="1.1",
    created_at="2026-02-10",
    description="Added escalation instructions after testing showed missed escalations",
    system_prompt="""You are a helpful customer support agent. Be polite and concise.

ESCALATION RULES:
- If the customer mentions a refund > $500, escalate to human agent
- If the customer expresses frustration 3+ times, escalate
- If the issue involves account security, escalate immediately

For escalation, say: "I'm connecting you with a specialist now."
""",
    notes="v1.0 had 23% missed escalation rate in A/B test"
))

# Use in production
prompt = PromptRegistry.get("customer_support")  # Gets v1.1 (latest)

Concept 06

Testing Prompts with pytest — Catching Regressions Before They Ship

import pytest
from openai import OpenAI
import json

client = OpenAI()

def run_prompt(system: str, user: str, temperature: float = 0) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        temperature=temperature,
        max_tokens=500,
    )
    return response.choices[0].message.content

SENTIMENT_SYSTEM = "Classify sentiment. Return exactly: POSITIVE, NEGATIVE, or NEUTRAL."

class TestSentimentClassifier:
    """Test suite for the sentiment classifier prompt."""

    def test_obvious_positive(self):
        result = run_prompt(SENTIMENT_SYSTEM, "This is the best product I've ever used!")
        assert result.strip() == "POSITIVE"

    def test_obvious_negative(self):
        result = run_prompt(SENTIMENT_SYSTEM, "Terrible. Broke after one use. Waste of money.")
        assert result.strip() == "NEGATIVE"

    def test_mixed_sentiment_is_neutral(self):
        result = run_prompt(SENTIMENT_SYSTEM, "Good quality but way too expensive for what it is.")
        assert result.strip() == "NEUTRAL"

    def test_output_format_strict(self):
        """Ensure no extra text, just the label."""
        result = run_prompt(SENTIMENT_SYSTEM, "Great experience overall.")
        assert result.strip() in {"POSITIVE", "NEGATIVE", "NEUTRAL"}, \
            f"Unexpected output: {result}"

    def test_handles_empty_input(self):
        """Graceful handling of edge cases."""
        result = run_prompt(SENTIMENT_SYSTEM, ".")
        # Should return one of the three labels, not crash
        assert result.strip() in {"POSITIVE", "NEGATIVE", "NEUTRAL"}

    @pytest.mark.parametrize("text,expected", [
        ("Absolutely fantastic!", "POSITIVE"),
        ("Completely useless", "NEGATIVE"),
        ("It's okay I guess", "NEUTRAL"),
        ("Would not recommend", "NEGATIVE"),
    ])
    def test_parametrized_cases(self, text, expected):
        result = run_prompt(SENTIMENT_SYSTEM, text)
        assert result.strip() == expected

Concept 07

The 7 Deadly Prompt Mistakes — With Before/After Code

Mistake 1: Vague instructions

# BAD: What does "summarize" mean? How long? What format?
bad_prompt = "Summarize this article."

# GOOD: Explicit, specific, measurable
good_prompt = "Summarize this article in exactly 3 bullet points. "
              "Each bullet: one sentence, max 20 words. Start each with a verb."

Mistake 2: No output format specification

# BAD: Model will choose whatever format it feels like
bad_prompt = "Extract the key information from this resume."

# GOOD: Exact schema specified
good_prompt = """Extract resume information. Return JSON with this exact structure:
{
  "name": string,
  "email": string or null,
  "skills": [list of strings],
  "years_experience": integer or null,
  "education": [{"degree": string, "institution": string}]
}"""

Mistake 3: Using high temperature for structured tasks

# BAD: temperature=1.0 for classification — inconsistent outputs
bad_call = client.chat.completions.create(
    model="gpt-4o-mini", messages=[...], temperature=1.0)

# GOOD: temperature=0 for deterministic structured tasks
good_call = client.chat.completions.create(
    model="gpt-4o-mini", messages=[...], temperature=0)

Mistake 4: Not handling the "length" finish_reason

# BAD: Assumes the response is complete
content = response.choices[0].message.content  # Could be truncated!

# GOOD: Check finish_reason before using the content
if response.choices[0].finish_reason == "length":
    raise ValueError("Response truncated — increase max_tokens")
content = response.choices[0].message.content

Mistake 5: Putting examples in the system prompt instead of few-shot turns

# BAD: Examples buried in a wall of text in system prompt
bad_system = """Classify sentiment.
Here's an example: 'Great!' -> POSITIVE. 'Terrible' -> NEGATIVE.
Now classify the user's text."""

# GOOD: Examples as proper message turns
good_messages = [
    {"role": "system", "content": "Classify sentiment: POSITIVE, NEGATIVE, or NEUTRAL."},
    {"role": "user", "content": "Great!"},
    {"role": "assistant", "content": "POSITIVE"},
    {"role": "user", "content": "Terrible"},
    {"role": "assistant", "content": "NEGATIVE"},
    {"role": "user", "content": "[actual text to classify]"},
]

Mistake 6: No version control for prompts — Change a prompt → behavior changes → you don't know why production broke. Always use the PromptRegistry pattern from section 5.

Mistake 7: Not testing edge cases — Empty input, very long input, input in different languages, adversarial input. Ship prompt tests with your code.

Practice

Interview Questions for This Checkpoint

Expected Interview Questions

Q: What are the components of an effective system prompt?
A: Six parts: (1) Role — who the model is. (2) Context — what it knows. (3) Task — what it should do. (4) Constraints — what it must not do. (5) Output format — exact structure expected. (6) Examples — one or two shots demonstrating the pattern. Clear beats clever.
Q: When do you use few-shot vs zero-shot prompting?
A: Few-shot when the task has a specific output style, domain-specific language, or format the model doesn't naturally produce. Zero-shot for general tasks where examples would add noise or cost. Always test both — few-shot isn't always better.
Q: How do you guarantee JSON output from an LLM?
A: Three methods in order of reliability: (1) Prompt instruction + JSON parse with retry — fragile. (2) JSON mode (OpenAI) — guarantees valid JSON but not your schema. (3) Structured outputs with a Pydantic model schema — guarantees exact schema. Use (3) in production.
Q: How do you version and test prompts?
A: Store prompts in a versioned dict or file (not hardcoded in business logic). Write pytest assertions on output properties — not exact strings. Never change a production prompt without a failing test first. Treat prompts as code.
Q: What are the most common prompt engineering mistakes?
A: Vague instructions ('be helpful'), no output format spec, no negative examples ('do NOT include...'), testing on one input only, and hardcoding prompts with no versioning. The biggest: optimizing for a demo instead of for distribution of real inputs.

Your Turn — Exercises

Write a system prompt for a SQL query generator using the 6-part anatomy (role, context, task, constraints, format, examples). Test it with 5 natural-language inputs.
Compare zero-shot vs 3-shot prompting on a sentiment classification task. Run both on 20 test sentences and measure accuracy. Which performed better?
Write a pytest test that calls your classification prompt and asserts: (a) output is valid JSON, (b) 'sentiment' key exists, (c) value is one of ['positive', 'neutral', 'negative'].

CP-03 Summary

A great system prompt has 6 parts: role, context, task, output format, constraints, tone
Use zero-shot for obvious tasks, few-shot for format-critical tasks, CoT for reasoning
For JSON: use OpenAI structured outputs (Pydantic parse) in production
Parameterize prompts with templates — never hardcode dynamic values
Version all prompts in a registry; include change notes
Write pytest tests for every prompt in production; test edge cases explicitly
Temperature 0 for all structured/extraction tasks

← CP-02: LLM APIs CP-04: Structured Outputs →

Prompt Engineering for Developers: Write Prompts That Actually Work
The only prompt engineering guide written for developers who ship

Table of Contents