Concept 01
System Prompt Anatomy — The 6 Parts Every Great Prompt Has
The system prompt is your most powerful tool. It runs before any user input and shapes every response the model produces. Most developers write vague, one-line system prompts and wonder why their outputs are inconsistent. Here's the full anatomy of a production-quality system prompt:
| # | Part | Purpose | Example |
|---|---|---|---|
| 1 | Role & Identity | Tell the model who it is | "You are a senior software engineer specializing in Python backend development." |
| 2 | Context & Domain | What is this system for? | "You are helping developers at a fintech startup debug and improve their code." |
| 3 | Task Description | What should the model do? | "When given code, identify bugs, explain why they're bugs, and provide fixed code." |
| 4 | Output Format | How should the response be structured? | "Always respond with: 1) Bug description 2) Fixed code block 3) Explanation." |
| 5 | Constraints & Rules | What should it never do? | "Never add new features — only fix the bug. Don't rewrite working code." |
| 6 | Tone & Style | Voice and communication style | "Be direct and technical. Skip pleasantries. Use code blocks for all code." |
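Because the six parts compose mechanically, it can help to assemble them with a small helper instead of hand-editing one long string. A minimal sketch (the `build_system_prompt` helper and its section labels are illustrative, not from any library):

```python
def build_system_prompt(
    role: str,
    context: str,
    task: str,
    output_format: str,
    constraints: list[str],
    tone: str,
) -> str:
    """Assemble the six parts into a single labeled system prompt."""
    sections = [
        role,
        f"CONTEXT:\n{context}",
        f"YOUR TASK:\n{task}",
        f"OUTPUT FORMAT:\n{output_format}",
        "CONSTRAINTS:\n" + "\n".join(f"- {c}" for c in constraints),
        f"TONE:\n{tone}",
    ]
    return "\n\n".join(sections)
```

Keeping the parts as separate arguments also makes it easy to A/B test one part (say, the constraints) while holding the rest fixed.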
# A complete production system prompt for a code review assistant
CODE_REVIEW_SYSTEM_PROMPT = """
You are a senior software engineer with 15 years of Python experience, specializing in
clean architecture, performance optimization, and security best practices.
CONTEXT:
You are integrated into a developer IDE as a code review assistant. Developers paste
code and ask for review. Your audience is intermediate-to-senior Python developers.
YOUR TASK:
Analyze the provided Python code and identify: bugs, security vulnerabilities,
performance issues, and style violations (PEP 8). Suggest specific improvements.
OUTPUT FORMAT:
Respond in exactly this structure:
## Issues Found
- List each issue as: [SEVERITY: HIGH/MEDIUM/LOW] Description
## Fixed Code
```python
[improved code here]
```
## Key Changes
- Bullet list explaining what changed and why
CONSTRAINTS:
- Only suggest changes that genuinely improve the code
- Never rewrite working, idiomatic code just to show off
- If the code is good, say so clearly — don't invent issues
- Never add features that weren't requested
TONE:
Direct and technical. No pleasantries. Assume the developer is smart.
"""
Concept 02
Zero-Shot, Few-Shot, and Chain of Thought — When to Use Each
Zero-shot prompting means giving the model a task with no examples. It works for well-understood tasks where the model has seen plenty of similar training data.
from openai import OpenAI
client = OpenAI()
# Zero-shot: task with no examples
def classify_sentiment_zero_shot(text: str) -> str:
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{
"role": "system",
"content": "Classify the sentiment of user reviews. "
"Respond with exactly one word: POSITIVE, NEGATIVE, or NEUTRAL."
},
{"role": "user", "content": text}
],
temperature=0,
)
return response.choices[0].message.content.strip()
# Works fine for obvious cases
print(classify_sentiment_zero_shot("This product is amazing!")) # POSITIVE
print(classify_sentiment_zero_shot("Terrible quality, broke after a week")) # NEGATIVE
Few-shot prompting provides examples in the prompt, which dramatically improves consistency on edge cases and unusual formats. The examples teach the model exactly what output you expect.
def classify_sentiment_few_shot(text: str) -> str:
"""
Few-shot prompting: examples teach the model the exact format and edge cases.
Use when zero-shot is inconsistent or when format matters precisely.
"""
few_shot_messages = [
{"role": "system", "content": "Classify review sentiment."},
# Example 1
{"role": "user", "content": "The shipping was fast but the product broke immediately."},
{"role": "assistant", "content": "NEGATIVE"},
# Example 2
{"role": "user", "content": "Does what it says on the tin. Nothing special."},
{"role": "assistant", "content": "NEUTRAL"},
# Example 3
{"role": "user", "content": "Exceeded all expectations. Will definitely buy again!"},
{"role": "assistant", "content": "POSITIVE"},
# Example 4: edge case
{"role": "user", "content": "Great build quality but overpriced for what you get."},
{"role": "assistant", "content": "NEUTRAL"},
# Actual query
{"role": "user", "content": text},
]
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=few_shot_messages,
temperature=0,
max_tokens=10, # We only need one word
)
return response.choices[0].message.content.strip()
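The message-building boilerplate repeats for every few-shot classifier, so it is worth factoring out. A sketch, assuming the examples arrive as (input, label) pairs (`build_few_shot_messages` is a hypothetical helper, not an SDK function):

```python
def build_few_shot_messages(
    system: str,
    examples: list[tuple[str, str]],
    query: str,
) -> list[dict]:
    """Turn (input, label) pairs into alternating user/assistant turns."""
    messages = [{"role": "system", "content": system}]
    for example_input, label in examples:
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": query})
    return messages
```

Storing the examples as plain data also lets you version them alongside the prompt and swap in a different set per domain.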
Chain of Thought (CoT) prompting tells the model to think step by step before answering, which dramatically improves accuracy on reasoning tasks, math, and logic problems. The intuition: by "showing its work," the model is less likely to jump to a wrong conclusion.
def solve_with_chain_of_thought(problem: str) -> dict:
"""
Chain of thought prompting for reasoning tasks.
Returns both the reasoning chain and the final answer.
"""
response = client.chat.completions.create(
model="gpt-4o", # CoT benefits most from capable models
messages=[
{
"role": "system",
"content": """You are a careful problem solver. When given a problem:
1. Think through it step by step, showing your reasoning
2. After your reasoning, state your final answer clearly
3. Use this format:
REASONING: [your step-by-step thinking]
ANSWER: [the final answer only]"""
},
{"role": "user", "content": problem}
],
temperature=0,
)
content = response.choices[0].message.content
# Parse the structured response
reasoning = ""
answer = ""
if "REASONING:" in content and "ANSWER:" in content:
parts = content.split("ANSWER:")
reasoning = parts[0].replace("REASONING:", "").strip()
answer = parts[1].strip()
return {"reasoning": reasoning, "answer": answer, "full_response": content}
# Test with a tricky problem
result = solve_with_chain_of_thought(
"A train leaves New York at 2pm traveling at 80mph toward Chicago (790 miles away). "
"Another train leaves Chicago at 3pm traveling at 100mph toward New York. "
"At what time do they meet, and how far from New York?"
)
print("Reasoning:", result["reasoning"][:200])
print("Answer:", result["answer"])
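The parsing step is the fragile part of this pattern, and it deserves to be unit-testable without an API call. A standalone sketch that falls back to the raw text when the markers are missing, so the caller never gets an empty answer:

```python
def parse_cot_response(content: str) -> dict:
    """Split a REASONING:/ANSWER: response; fall back to raw text if markers are absent."""
    reasoning = ""
    answer = content.strip()
    if "REASONING:" in content and "ANSWER:" in content:
        before, _, after = content.partition("ANSWER:")
        reasoning = before.replace("REASONING:", "", 1).strip()
        answer = after.strip()
    return {"reasoning": reasoning, "answer": answer, "full_response": content}
```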
Concept 03
Getting Guaranteed JSON Output — 3 Methods Ranked
Reliable JSON output is the foundation of every data extraction feature. Here are the three methods, from least to most reliable:
Method 1: Prompt + Parse (fragile, avoid in production)
import json
def extract_with_prompt_only(text: str) -> dict:
"""
Weakest method — depends on model following instructions.
Use only for prototyping.
"""
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{
"role": "system",
"content": "Extract the person's name and age. Respond ONLY with valid JSON: "
'{"name": "...", "age": ...}'
},
{"role": "user", "content": text}
],
temperature=0,
)
# This can still fail — model might add text before/after the JSON
return json.loads(response.choices[0].message.content)
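If you must ship Method 1 anyway, a defensive parse recovers the most common failure mode: the model wrapping the JSON in prose or a code fence. A best-effort sketch (`parse_json_loosely` is an illustrative name):

```python
import json
import re

def parse_json_loosely(raw: str) -> dict:
    """Try a direct parse, then fall back to the first {...} span in the text."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        match = re.search(r"\{.*\}", raw, re.DOTALL)
        if match is None:
            raise
        return json.loads(match.group(0))
```

This still fails on genuinely malformed JSON, which is why Methods 2 and 3 below are the production choices.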
Method 2: JSON Mode (reliable, OpenAI/Gemini)
def extract_with_json_mode(text: str) -> dict:
"""
JSON mode guarantees valid JSON output.
Works with OpenAI gpt-4o, gpt-4o-mini, gpt-4-turbo.
You still need to specify the schema in the prompt — JSON mode only guarantees valid JSON.
"""
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{
"role": "system",
"content": "Extract person details. Return JSON with keys: "
"name (string), age (integer or null), email (string or null)"
},
{"role": "user", "content": text}
],
response_format={"type": "json_object"}, # THE KEY FLAG
temperature=0,
)
return json.loads(response.choices[0].message.content)
Method 3: Pydantic + Structured Outputs (most reliable)
from pydantic import BaseModel, Field
from typing import Optional
import json
class PersonExtraction(BaseModel):
name: str = Field(description="Full name of the person")
age: Optional[int] = Field(default=None, description="Age in years, null if not mentioned")
email: Optional[str] = Field(default=None, description="Email address, null if not mentioned")
occupation: Optional[str] = Field(default=None, description="Job or role")
def extract_person_structured(text: str) -> PersonExtraction:
"""
Most reliable method using OpenAI's structured outputs (parse).
The schema is derived directly from the Pydantic model.
"""
response = client.beta.chat.completions.parse(
model="gpt-4o-2024-08-06", # Must use this model or newer for structured outputs
messages=[
{"role": "system", "content": "Extract person information from the text."},
{"role": "user", "content": text}
],
response_format=PersonExtraction,
)
return response.choices[0].message.parsed
# Test it
result = extract_person_structured(
"Hi, I'm Sarah Chen, 32, working as a senior engineer at Stripe. "
"You can reach me at sarah@example.com"
)
print(result.name) # Sarah Chen
print(result.age) # 32
print(result.email) # sarah@example.com
print(result.occupation) # senior engineer
Concept 04
Prompt Templates — Stop Hardcoding, Start Parameterizing
Hardcoding prompts in function bodies makes them impossible to test, version, or reuse. Use template systems instead.
from string import Template
# Simple string Template for basic substitution
SUMMARIZE_TEMPLATE = Template("""
You are an expert technical writer.
Summarize the following $document_type in $max_words words or fewer.
Focus on: $focus_areas
Audience: $audience_description
Tone: $tone
DOCUMENT TO SUMMARIZE:
$document
""")
def summarize(
document: str,
document_type: str = "article",
max_words: int = 150,
focus_areas: str = "key findings and actionable insights",
audience_description: str = "technical professionals",
tone: str = "professional and concise"
) -> str:
prompt = SUMMARIZE_TEMPLATE.substitute(
document=document,
document_type=document_type,
max_words=max_words,
focus_areas=focus_areas,
audience_description=audience_description,
tone=tone,
)
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": prompt}],
temperature=0.3,
)
return response.choices[0].message.content
# Jinja2 for more complex templates with conditionals and loops
from jinja2 import Template as JinjaTemplate  # aliased to avoid shadowing string.Template
ANALYSIS_TEMPLATE = JinjaTemplate("""
You are a {{ role }}.
Analyze the following text:
{{ text }}
{% if examples %}
Here are examples of the analysis format:
{% for example in examples %}
Input: {{ example.input }}
Output: {{ example.output }}
{% endfor %}
{% endif %}
{% if output_format == "json" %}
Return your analysis as valid JSON.
{% elif output_format == "markdown" %}
Return your analysis formatted in Markdown.
{% else %}
Return your analysis as plain text.
{% endif %}
""")
prompt = ANALYSIS_TEMPLATE.render(
role="data analyst",
text="Revenue grew 23% YoY but margins declined 4 percentage points.",
examples=[
{"input": "Sales up 10%, costs up 20%", "output": "Negative margin trend despite revenue growth"},
],
output_format="json",
)
Concept 05
Prompt Versioning — Treating Prompts Like Code
Prompts should be versioned, tested, and deployed just like code. When you change a prompt, you need to know: did that change make things better or worse? Here's a simple but effective versioning pattern:
from dataclasses import dataclass
from datetime import datetime
@dataclass
class PromptVersion:
version: str
created_at: str
description: str
system_prompt: str
notes: str = ""
class PromptRegistry:
"""Central registry for all versioned prompts."""
_prompts: dict = {}
@classmethod
def register(cls, name: str, prompt: PromptVersion):
if name not in cls._prompts:
cls._prompts[name] = []
cls._prompts[name].append(prompt)
@classmethod
def get(cls, name: str, version: str = "latest") -> PromptVersion:
if name not in cls._prompts:
raise ValueError(f"No prompt registered with name: {name}")
versions = cls._prompts[name]
        if version == "latest":
            return versions[-1]
        for v in versions:
            if v.version == version:
                return v
        raise ValueError(f"Version {version!r} not found for prompt: {name}")
# Register prompts with explicit versioning
PromptRegistry.register("customer_support", PromptVersion(
version="1.0",
created_at="2026-01-15",
description="Initial customer support prompt",
system_prompt="You are a helpful customer support agent. Be polite and concise.",
))
PromptRegistry.register("customer_support", PromptVersion(
version="1.1",
created_at="2026-02-10",
description="Added escalation instructions after testing showed missed escalations",
system_prompt="""You are a helpful customer support agent. Be polite and concise.
ESCALATION RULES:
- If the customer mentions a refund > $500, escalate to human agent
- If the customer expresses frustration 3+ times, escalate
- If the issue involves account security, escalate immediately
For escalation, say: "I'm connecting you with a specialist now."
""",
notes="v1.0 had 23% missed escalation rate in A/B test"
))
# Use in production
prompt = PromptRegistry.get("customer_support") # Gets v1.1 (latest)
Concept 06
Testing Prompts with pytest — Catching Regressions Before They Ship
Prompt changes can silently break behavior that used to work. Treat prompts like any other code path: pin expected outputs for known inputs and run the suite before every deploy.
import pytest
from openai import OpenAI
import json
client = OpenAI()
def run_prompt(system: str, user: str, temperature: float = 0) -> str:
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": system},
{"role": "user", "content": user},
],
temperature=temperature,
max_tokens=500,
)
return response.choices[0].message.content
SENTIMENT_SYSTEM = "Classify sentiment. Return exactly: POSITIVE, NEGATIVE, or NEUTRAL."
class TestSentimentClassifier:
"""Test suite for the sentiment classifier prompt."""
def test_obvious_positive(self):
result = run_prompt(SENTIMENT_SYSTEM, "This is the best product I've ever used!")
assert result.strip() == "POSITIVE"
def test_obvious_negative(self):
result = run_prompt(SENTIMENT_SYSTEM, "Terrible. Broke after one use. Waste of money.")
assert result.strip() == "NEGATIVE"
def test_mixed_sentiment_is_neutral(self):
result = run_prompt(SENTIMENT_SYSTEM, "Good quality but way too expensive for what it is.")
assert result.strip() == "NEUTRAL"
def test_output_format_strict(self):
"""Ensure no extra text, just the label."""
result = run_prompt(SENTIMENT_SYSTEM, "Great experience overall.")
assert result.strip() in {"POSITIVE", "NEGATIVE", "NEUTRAL"}, \
f"Unexpected output: {result}"
def test_handles_empty_input(self):
"""Graceful handling of edge cases."""
result = run_prompt(SENTIMENT_SYSTEM, ".")
# Should return one of the three labels, not crash
assert result.strip() in {"POSITIVE", "NEGATIVE", "NEUTRAL"}
@pytest.mark.parametrize("text,expected", [
("Absolutely fantastic!", "POSITIVE"),
("Completely useless", "NEGATIVE"),
("It's okay I guess", "NEUTRAL"),
("Would not recommend", "NEGATIVE"),
])
def test_parametrized_cases(self, text, expected):
result = run_prompt(SENTIMENT_SYSTEM, text)
assert result.strip() == expected
Concept 07
The 7 Deadly Prompt Mistakes — With Before/After Code
Mistake 1: Vague instructions
# BAD: What does "summarize" mean? How long? What format?
bad_prompt = "Summarize this article."
# GOOD: Explicit, specific, measurable
good_prompt = (
    "Summarize this article in exactly 3 bullet points. "
    "Each bullet: one sentence, max 20 words. Start each with a verb."
)
Mistake 2: No output format specification
# BAD: Model will choose whatever format it feels like
bad_prompt = "Extract the key information from this resume."
# GOOD: Exact schema specified
good_prompt = """Extract resume information. Return JSON with this exact structure:
{
"name": string,
"email": string or null,
"skills": [list of strings],
"years_experience": integer or null,
"education": [{"degree": string, "institution": string}]
}"""
Mistake 3: Using high temperature for structured tasks
# BAD: temperature=1.0 for classification — inconsistent outputs
bad_call = client.chat.completions.create(
model="gpt-4o-mini", messages=[...], temperature=1.0)
# GOOD: temperature=0 for deterministic structured tasks
good_call = client.chat.completions.create(
model="gpt-4o-mini", messages=[...], temperature=0)
Mistake 4: Not handling the "length" finish_reason
# BAD: Assumes the response is complete
content = response.choices[0].message.content # Could be truncated!
# GOOD: Check finish_reason before using the content
if response.choices[0].finish_reason == "length":
raise ValueError("Response truncated — increase max_tokens")
content = response.choices[0].message.content
Mistake 5: Putting examples in the system prompt instead of few-shot turns
# BAD: Examples buried in a wall of text in system prompt
bad_system = """Classify sentiment.
Here's an example: 'Great!' -> POSITIVE. 'Terrible' -> NEGATIVE.
Now classify the user's text."""
# GOOD: Examples as proper message turns
good_messages = [
{"role": "system", "content": "Classify sentiment: POSITIVE, NEGATIVE, or NEUTRAL."},
{"role": "user", "content": "Great!"},
{"role": "assistant", "content": "POSITIVE"},
{"role": "user", "content": "Terrible"},
{"role": "assistant", "content": "NEGATIVE"},
{"role": "user", "content": "[actual text to classify]"},
]
Mistake 6: No version control for prompts — Change a prompt → behavior changes → you don't know why production broke. Always use the PromptRegistry pattern from Concept 05.
Mistake 7: Not testing edge cases — Empty input, very long input, input in different languages, adversarial input. Ship prompt tests with your code.
Key Takeaways
- A great system prompt has 6 parts: role, context, task, output format, constraints, tone
- Use zero-shot for obvious tasks, few-shot for format-critical tasks, CoT for reasoning
- For JSON: use OpenAI structured outputs (Pydantic parse) in production
- Parameterize prompts with templates — never hardcode dynamic values
- Version all prompts in a registry; include change notes
- Write pytest tests for every prompt in production; test edge cases explicitly
- Temperature 0 for all structured/extraction tasks