Concept 01

What is an LLM, Really? The f(prompt) → completion Mental Model

Forget everything you've read about attention mechanisms, transformer architectures, and gradient descent. As a developer, you don't need any of that to build powerful AI features. Here's the only mental model you need:

An LLM is a function. You give it text. It gives you text back.

# The simplest possible mental model
def llm(prompt: str) -> str:
    # A trillion-parameter function trained on the internet
    # You don't control what's inside. You only control the input.
    return completion  # pseudocode: stands in for the generated text

# That's it. Everything else is just how you craft the input
# and how you parse the output.

result = llm("Explain recursion in simple terms")
print(result)  # "Recursion is when a function calls itself..."

The reason this mental model matters: it prevents you from overthinking. Every prompt engineering trick, every RAG system, every agent pattern — they're all just different ways of crafting that input string (or sequence of messages) to get better output text. The complexity lives in how you use the function, not inside the function itself.

Analogy That Sticks

Think of an LLM like a massively experienced consultant who has read every book, article, forum post, and code repository on the internet. You can ask them anything. They'll give you a thoughtful answer based on everything they've ever read. But they have limitations: they can only work with what you tell them in the current conversation, they sometimes misremember things (hallucinations), and their knowledge stops at a certain date (the training cutoff).

The technical reality, explained simply: an LLM is trained to predict the next token given all previous tokens. During training on billions of pages of text, it learns statistical patterns — which words follow which, how arguments are structured, what code looks like, how to reason step by step. The result is a system that, when given your prompt, generates tokens one by one, each chosen probabilistically based on all preceding context.

That probabilistic choice is what temperature controls. But we'll get to that.
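To make "chosen probabilistically" concrete, here is a toy sketch of a single generation step. The vocabulary and probabilities are invented for illustration; a real model scores 50,000+ candidate tokens at every step.

```python
import random

# Invented next-token distribution for the context "The cat sat on the"
next_token_probs = {
    "mat": 0.55,
    "floor": 0.20,
    "sofa": 0.15,
    "moon": 0.10,
}

def greedy_pick(probs: dict) -> str:
    """Temperature-0 behavior: always take the highest-probability token."""
    return max(probs, key=probs.get)

def sample_pick(probs: dict) -> str:
    """Temperature-1 behavior: sample in proportion to the probabilities."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return random.choices(tokens, weights=weights, k=1)[0]

print(greedy_pick(next_token_probs))   # always "mat"
print(sample_pick(next_token_probs))   # usually "mat", sometimes the others
```

Run `sample_pick` a few times and you'll see the variance that greedy picking never shows; that difference is exactly what the temperature parameter controls later in this checkpoint.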

Concept 02

Training vs Inference — Why Developers Only Care About One

There are two phases to an LLM's life, and understanding the distinction saves you from a lot of confusion when reading AI papers or job descriptions.

Training is when the model learns. It involves processing enormous amounts of text data, computing gradients, updating billions of parameters, and running on clusters of hundreds of GPU machines for weeks or months. Training GPT-4 reportedly cost over $100 million. You will never do this as an application developer.

Inference is when the model generates a response to your prompt. It's using the already-trained weights to produce an output. This is what happens every time you call the OpenAI API. The model parameters are frozen — they don't change based on what you send. You're just running a forward pass through a fixed neural network.

Aspect         | Training                   | Inference
---------------|----------------------------|---------------------------
Who does it?   | OpenAI, Anthropic, Google  | You (via API calls)
Cost           | $10M–$100M+                | $0.001–$0.06 per 1K tokens
Time           | Weeks to months            | 1–30 seconds
Changes model? | Yes — updates weights      | No — weights are frozen
Your role      | Not your concern           | This is everything you do

There is a third phase called fine-tuning, which is taking a pre-trained model and doing additional training on your specific dataset. Fine-tuning is something developers occasionally do, but it's expensive, requires data curation, and is rarely the right first approach. In 95% of cases, prompt engineering gets you further faster.

Key Takeaway

As a developer, your entire job is inference. You craft inputs, call the API, and handle outputs. The model's knowledge and capabilities were baked in during training — you can't change them at runtime. You can only guide the output through clever prompting and context.

Concept 03

The Token System — Why "Characters" and "Words" Are the Wrong Mental Model

LLMs don't process characters. They don't process words. They process tokens. A token is a chunk of text — roughly 3–4 characters on average for English text, but it varies enormously based on the content.

import tiktoken  # OpenAI's tokenizer library

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4's encoding

# Tokenize some example text
examples = [
    "Hello, world!",
    "machine learning",
    "pneumonoultramicroscopicsilicovolcanoconiosis",
    "def calculate_fibonacci(n: int) -> int:",
    "1234567890",
]

for text in examples:
    tokens = enc.encode(text)
    print(f"'{text}'")
    print(f"  Tokens: {len(tokens)}")
    print(f"  Token IDs: {tokens[:10]}")
    print(f"  Decoded pieces: {[enc.decode([t]) for t in tokens]}")
    print()

Running this gives you intuition for how tokenization works. "Hello" is one token. ", " is one token. "world" is one token. "!" is one token. The phrase "machine learning" might be 2–3 tokens. That 45-letter word about lung disease? Probably 10+ tokens because the tokenizer has never seen it frequently enough to assign it a single token.

This matters because:

  • Pricing is per-token — not per character or word. 1,000 tokens ≈ 750 words ≈ 3,000 characters.
  • Context limits are in tokens — "128k context window" means 128,000 tokens, roughly 100,000 words or 300 pages of text.
  • Code is token-expensive — variable names, brackets, and indentation all add tokens. A 50-line function might be 400–600 tokens.
  • Non-English text costs more — many Asian scripts use 2–4 tokens per character instead of 0.3 tokens per character for English.

Token Rule of Thumb

For quick mental math: 1,000 tokens ≈ 750 English words ≈ 1 page of text. A typical blog post is about 800 words = ~1,066 tokens. A 50-page technical document is about 37,500 words = ~50,000 tokens. A novel (100,000 words) = ~133,000 tokens — which would overflow a 128k context window.

def estimate_tokens(text: str) -> dict:
    """
    Rough token estimation without needing tiktoken.
    Good for quick cost calculations.
    """
    char_count = len(text)
    word_count = len(text.split())

    # English text: ~4 chars per token
    tokens_by_chars = char_count / 4

    # English text: ~0.75 words per token (i.e., ~1.33 tokens per word)
    tokens_by_words = word_count / 0.75

    # Average estimate
    estimated_tokens = (tokens_by_chars + tokens_by_words) / 2

    return {
        "chars": char_count,
        "words": word_count,
        "estimated_tokens": round(estimated_tokens),
        "cost_gpt4o_input": round(estimated_tokens * 0.0000025, 6),    # $2.50/1M tokens
        "cost_gpt4o_mini_input": round(estimated_tokens * 0.00000015, 6),  # $0.15/1M tokens
    }

text = "Write a Python function that reverses a linked list iteratively."
print(estimate_tokens(text))

Concept 04

The Provider Landscape — Choosing Your LLM

In 2026, you have more LLM providers than you can count. But the real decision space is smaller than it looks. Here are the providers that matter for production applications:

Provider       | Best Model        | Strengths                            | Input Price      | Context Window | Best For
---------------|-------------------|--------------------------------------|------------------|----------------|--------------------------------------
OpenAI         | GPT-4o            | Balanced, multimodal, huge ecosystem | $2.50/1M tokens  | 128k           | General-purpose apps, image input
OpenAI         | GPT-4o-mini       | Fast, cheap, surprisingly capable    | $0.15/1M tokens  | 128k           | High-volume, cost-sensitive tasks
Anthropic      | Claude 3.5 Sonnet | Best at reasoning, long docs, coding | $3.00/1M tokens  | 200k           | Complex reasoning, code generation
Anthropic      | Claude 3 Haiku    | Fastest Anthropic model, cheap       | $0.25/1M tokens  | 200k           | Classification, quick extractions
Google         | Gemini 1.5 Pro    | 1M token context, multimodal         | $3.50/1M tokens  | 1M             | Entire codebase analysis, video
Meta (via API) | Llama 3.3 70B     | Open source, self-hostable           | ~$0.59/1M tokens | 128k           | Privacy-sensitive apps, cost control

The decision framework is straightforward: start with GPT-4o-mini for prototyping (cheap enough to experiment freely), switch to GPT-4o or Claude 3.5 Sonnet for quality-critical tasks, and consider Llama or Mistral if you need on-premise deployment for data privacy reasons.

Common Mistake

Don't commit to one provider in your architecture. The market moves fast — the best model in 2024 is often surpassed within 6 months. Build a provider-agnostic layer (covered in Checkpoint 2) that lets you swap models by changing a config value, not rewriting your application.
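One minimal way to sketch that provider-agnostic layer is a config-driven registry: application code asks for a role, not a vendor. The model names and prices below are illustrative assumptions, and this is only a sketch of the idea, not the LLMClient built in Checkpoint 2.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelConfig:
    provider: str              # which SDK/endpoint to route to
    model: str                 # provider-specific model identifier
    input_price_per_1m: float  # USD per 1M input tokens (illustrative)

# Hypothetical registry: roles on the left, vendors on the right
MODEL_REGISTRY = {
    "default": ModelConfig("openai", "gpt-4o-mini", 0.15),
    "quality": ModelConfig("anthropic", "claude-3-5-sonnet", 3.00),
    "long-context": ModelConfig("google", "gemini-1.5-pro", 3.50),
}

def resolve_model(name: str) -> ModelConfig:
    """Call sites ask for 'default' or 'quality', never for a vendor."""
    return MODEL_REGISTRY[name]

cfg = resolve_model("default")
print(cfg.provider, cfg.model)
```

Swapping providers later means editing MODEL_REGISTRY in one place, not hunting down every call site in the application.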

Concept 05

Temperature — The Most Misunderstood Parameter

Temperature is a number between 0 and 2 (depending on the provider) that controls how "creative" or "random" the model's output is. Here's the intuition:

When the model is about to generate the next token, it has a probability distribution over its entire vocabulary (50,000+ possible tokens). Temperature reshapes this distribution before sampling from it.

  • Temperature = 0: Always pick the highest-probability token. Deterministic, repetitive, "safe." Best for code generation, classification, extraction — tasks where there's a correct answer.
  • Temperature = 0.7: A good balance. The most likely token still wins most of the time, but there's enough variance for natural-sounding text. Best for general conversational applications.
  • Temperature = 1.0: Sample according to the raw probability distribution. Good for creative writing, brainstorming, generating diverse options.
  • Temperature = 1.5–2.0: Flatten the distribution significantly. Outputs become unpredictable, often incoherent. Rarely useful in production.
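The reshaping itself is simple arithmetic: divide the model's raw scores (logits) by the temperature before converting them to probabilities. A toy sketch with invented logits shows the sharpening and flattening effect (in practice, temperature 0 is implemented as greedy argmax, since dividing by zero is undefined):

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert raw scores to probabilities, reshaped by temperature."""
    scaled = [x / temperature for x in logits]
    exps = [math.exp(x) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Invented logits for four candidate tokens
logits = [4.0, 2.0, 1.0, 0.5]

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
# Low T sharpens the distribution (the top token dominates even more);
# high T flattens it (unlikely tokens become more likely).
```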

from openai import OpenAI

client = OpenAI()

prompt = "Complete this sentence: The best programming language is"

# Temperature 0: deterministic, picks highest probability
response_cold = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
    max_tokens=50
)

# Temperature 1.0: more varied
response_warm = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    temperature=1.0,
    max_tokens=50
)

print("Cold (temp=0):", response_cold.choices[0].message.content)
print("Warm (temp=1.0):", response_warm.choices[0].message.content)

# Run the warm version multiple times — you'll get different answers each time
# Run the cold version multiple times — you'll get the same answer

There's also a related parameter called top_p (nucleus sampling). Instead of reshaping the whole distribution, it restricts sampling to the smallest set of tokens whose cumulative probability reaches p. top_p=0.9 means: "only sample from the most likely tokens until their cumulative probability reaches 90%." In practice, you adjust either temperature OR top_p — not both simultaneously.
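A toy sketch of the nucleus idea, with invented probabilities:

```python
def nucleus_filter(probs: dict, top_p: float) -> dict:
    """Keep the smallest set of top tokens whose cumulative probability >= top_p."""
    kept = {}
    cumulative = 0.0
    for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the surviving probabilities sum to 1 before sampling
    total = sum(kept.values())
    return {t: p / total for t, p in kept.items()}

probs = {"mat": 0.55, "floor": 0.20, "sofa": 0.15, "moon": 0.10}
print(nucleus_filter(probs, top_p=0.9))  # drops "moon"; sampling happens among the rest
```

The long tail of implausible tokens is cut off entirely, which is why top_p is often described as a quality filter rather than a creativity dial.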

Temperature Cheat Sheet
  • Code generation, extraction, classification: temperature=0
  • Q&A, factual responses, summarization: temperature=0.3
  • Chatbots, customer support: temperature=0.7
  • Creative writing, brainstorming: temperature=0.9–1.0
  • Never use above 1.2 in production — outputs become unreliable

Concept 06

The Context Window — Your LLM's Working Memory

The context window is the total number of tokens the model can "see" at once when generating a response. Think of it as the model's working memory — everything it knows about the current conversation or task must fit within this window.

The context window includes:

  • The system prompt (your instructions to the model)
  • All previous turns of the conversation (user messages + assistant responses)
  • The current user message
  • The response being generated (output tokens)
  • Any documents, code, or data you've injected

Context Window (e.g., 128,000 tokens total)
┌─────────────────────────────────────────────────┐
│ System Prompt (~500 tokens)                     │
│ ─────────────────────────────────────────────── │
│ User Message 1 (~200 tokens)                    │
│ Assistant Response 1 (~400 tokens)              │
│ User Message 2 (~150 tokens)                    │
│ Assistant Response 2 (~300 tokens)              │
│ ...                                             │
│ Current User Message (~100 tokens)              │
│ ─────────────────────────────────────────────── │
│ [Space for output] (~2,000 tokens reserved)     │
└─────────────────────────────────────────────────┘
Total used: varies — must stay under the limit

The critical insight: the model doesn't have memory between API calls. Each call is independent. The "conversation history" you see in ChatGPT only works because the application resends the entire conversation history on every API call. When you build a chatbot, you're responsible for managing this history and making sure it doesn't overflow the context window.

class ConversationManager:
    """
    Manages conversation history within a token budget.
    Simple sliding window implementation.
    """
    def __init__(self, model: str = "gpt-4o-mini", max_tokens: int = 100_000):
        self.model = model
        self.max_tokens = max_tokens
        self.history = []
        self.system_prompt = ""

    def set_system_prompt(self, prompt: str):
        self.system_prompt = prompt

    def estimate_tokens(self, messages: list) -> int:
        """Rough token count for a messages list."""
        total_chars = sum(len(m.get("content", "")) for m in messages)
        return total_chars // 4  # ~4 chars per token

    def add_message(self, role: str, content: str):
        self.history.append({"role": role, "content": content})
        self._trim_to_budget()

    def _trim_to_budget(self):
        """Remove oldest messages if we exceed the token budget."""
        messages = self.get_messages()
        while self.estimate_tokens(messages) > self.max_tokens and len(self.history) > 1:
            # Drop the oldest message and re-check the budget
            self.history.pop(0)
            messages = self.get_messages()

    def get_messages(self) -> list:
        messages = []
        if self.system_prompt:
            messages.append({"role": "system", "content": self.system_prompt})
        messages.extend(self.history)
        return messages

Concept 07

When to Use AI — And When Not To

One of the most valuable skills an AI developer can have is knowing when not to reach for an LLM. LLMs are powerful but they're also expensive, slow, and unpredictable compared to traditional code. Here's the decision framework:

Use an LLM when:

  • The task requires understanding natural language input from users
  • The output needs to be natural language (summaries, emails, explanations)
  • The logic is fuzzy — classification, sentiment, intent detection
  • You're dealing with unstructured data (free-form text, documents)
  • The rules are too complex to hardcode (e.g., "is this email professional?")

Don't use an LLM when:

  • The task is purely computational (sorting, filtering, aggregating data)
  • You need guaranteed determinism (financial calculations, legal logic)
  • A simple regex or keyword match solves the problem
  • Latency is critical and deterministic code is 100x faster
  • The answer is a simple database lookup

The Hammer Problem

When you discover LLMs, everything starts to look like a nail. Resist the urge. "Should I use an LLM to sort this list?" — No. "Should I use an LLM to parse this user's free-form address input and extract city, state, zip?" — Yes, that's exactly what it's good at.
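That routing decision can live directly in code: try the deterministic path first, and fall back to the LLM only for genuinely fuzzy input. A sketch, where parse_with_llm is a hypothetical stand-in for a real API call:

```python
import re

# Strictly formatted "City, ST 12345" input never needs an LLM
US_ADDRESS_RE = re.compile(
    r"^(?P<city>[A-Za-z .'-]+),\s*(?P<state>[A-Z]{2})\s+(?P<zip>\d{5})$"
)

def parse_address(text: str) -> dict:
    match = US_ADDRESS_RE.match(text.strip())
    if match:
        # Deterministic path: fast, free, and exact for this shape
        return match.groupdict()
    # Fuzzy path: free-form input is where the LLM earns its cost
    return parse_with_llm(text)  # hypothetical LLM-backed fallback

def parse_with_llm(text: str) -> dict:
    raise NotImplementedError("Checkpoint 2 covers the actual API call")

print(parse_address("Austin, TX 78701"))
```

The regex handles the common well-formed case for free in microseconds; only the messy remainder ever incurs API latency and cost.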

Concept 08

The Interview Answer: "How Does an LLM Work?"

This question comes up in every AI-adjacent interview. Here's a developer-focused answer that demonstrates real understanding without getting lost in academic details:

Model Interview Answer (Speak This)

"An LLM is a neural network — specifically a transformer — trained on massive amounts of text data using a process called self-supervised learning. The training objective is simple: predict the next token given all previous tokens. Through billions of iterations of this task across trillions of tokens of text, the model learns to encode an enormous amount of world knowledge, reasoning patterns, and language understanding in its billions of parameters.

At inference time — when you make an API call — the trained weights are frozen. The model takes your input prompt, converts it to token IDs, and generates output tokens one at a time, each chosen based on the probability distribution the model assigns to all possible next tokens. Temperature controls how we sample from that distribution.

The key architectural insight is the attention mechanism: for every token it generates, the model can 'attend' to any previous token in the context window, giving it the ability to track long-range dependencies in text. This is what makes transformers so much better than previous RNN-based approaches.

From a developer perspective, I think of it as a very sophisticated function: give it text, get text back. My job is crafting the right inputs through prompt engineering and surrounding architecture to get reliable, useful outputs."

That answer is about 2 minutes long, demonstrates technical depth, and ends with practical developer perspective — which is exactly what interviewers want to hear. It shows you understand the fundamentals without pretending you understand things you don't.

What's Next: Checkpoint 2

Now that you have the mental model, it's time to write actual code. In Checkpoint 2, we'll make our first API calls to OpenAI, Anthropic, and Gemini, handle errors properly, implement streaming, and build the provider-agnostic wrapper that you'll use throughout the course. By the end of CP-02, you'll have an LLMClient class that can switch between any major provider with a single config change.

CP-01 Summary
  • An LLM is f(prompt) → completion — you control the input, the model controls the output
  • Training happens once (expensive, not your job); inference is every API call (your job)
  • Tokens are ~4 chars each; pricing, limits, and costs are all token-based
  • Temperature 0 for determinism, 0.7 for conversation, 1.0 for creativity
  • Context window is the model's working memory — you must manage it in code
  • Use LLMs for fuzzy, linguistic, unstructured tasks; use regular code for everything else