Topic 01
How LLMs Generate Text
An LLM doesn't "know" things the way humans do. It's a function that takes tokens (text chunks) as input and outputs a probability distribution over the next token. Generation is sampling from this distribution, one token at a time.
- Tokenization: "Hello world" → [15496, 995] (token IDs, not words)
- Embedding: Each token ID → a dense vector (e.g., 4096-dimensional)
- Transformer blocks: 32-96 layers of self-attention + feed-forward
- Output projection: 4096-dim → logits over the full vocabulary (~50,000 entries)
- Sampling: Apply temperature, sample next token, append to sequence, repeat
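The loop in the last step can be sketched end to end. This is a minimal illustration, not a real model: `toy_model` stands in for the transformer and returns made-up logits over a 4-token vocabulary.

```python
import math
import random

def sample_next(logits, temperature=1.0):
    """Softmax with temperature over the logits, then sample one token index."""
    if temperature == 0:
        return max(range(len(logits)), key=logits.__getitem__)  # greedy argmax
    scaled = [l / temperature for l in logits]
    m = max(scaled)                                # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return random.choices(range(len(logits)), weights=[e / total for e in exps])[0]

# Toy stand-in for the transformer: fixed logits over a 4-token vocabulary.
def toy_model(tokens):
    return [0.1, 0.5, 2.0, 0.3]

tokens = [0]                           # "prompt": a single start token
for _ in range(5):                     # the autoregressive loop
    tokens.append(sample_next(toy_model(tokens), temperature=0))
print(tokens)  # greedy decoding repeats the top token: [0, 2, 2, 2, 2, 2]
```

A real model would recompute logits from the whole sequence each step; the loop structure is the same.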
Temperature controls randomness: T=0 always picks the highest-probability token (greedy decoding: deterministic, boring). T=1 samples the model's distribution unchanged. T=2 is chaotic and creative. Think of it like spice level — T=0 is plain, T=2 is extra hot.
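The effect is easy to see numerically: dividing logits by T before the softmax sharpens the distribution when T < 1 and flattens it when T > 1. A quick check with made-up logits:

```python
import math

def softmax_with_temperature(logits, t):
    """Softmax after scaling logits by 1/t."""
    exps = [math.exp(l / t) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]                    # made-up logits for a 3-token vocab
cold = softmax_with_temperature(logits, 0.5)  # low T: sharper
hot = softmax_with_temperature(logits, 2.0)   # high T: flatter
print(round(cold[0], 2), round(hot[0], 2))  # 0.84 0.48: low T concentrates mass
```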
LLMs don't retrieve facts from memory — they generate text that "sounds right" based on patterns in training data. There's no fact database being queried. When asked about an obscure topic, the model generates statistically plausible-sounding text, which may be factually wrong. This is why RAG (Retrieval-Augmented Generation) is so important — ground the model in real documents.
Topic 02
Scaling Laws, RLHF & Instruction Tuning
The Chinchilla scaling law
The Chinchilla paper (DeepMind, 2022) found that model size and training tokens should scale together: optimal training uses roughly 20 tokens per parameter. A 70B parameter model should train on ~1.4 trillion tokens.
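The rule of thumb is simple enough to check in one line (20:1 is the paper's headline approximation, not an exact law):

```python
def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    """Approximate compute-optimal training tokens per the Chinchilla heuristic."""
    return n_params * tokens_per_param

print(chinchilla_optimal_tokens(70e9) / 1e12)  # 1.4 (trillion tokens for 70B params)
```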
LLaMA 3 and Mistral were trained far past the Chinchilla-optimal point: LLaMA 3's 8B model saw 15T+ tokens, nearly 100× the ~160B the 20:1 rule suggests. This "overtraining" produces smaller models that perform like much larger ones — great for inference cost.
Why raw pretrained models need human feedback
A model pretrained on internet text learns to be a statistical mirror of the internet — including toxic content, bias, and harmful information. RLHF (Reinforcement Learning from Human Feedback) fine-tunes the model to be helpful, harmless, and honest:
- Collect human preference data: show two responses, ask which is better
- Train a Reward Model to predict human preferences
- Use PPO (a reinforcement learning algorithm) to optimize the LLM to maximize reward
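Step 2 is the easiest to make concrete. A standard objective for the reward model is the Bradley-Terry pairwise loss: minimize -log σ(r_chosen - r_rejected), which pushes the human-preferred response's score above the rejected one's. A minimal sketch, with scalar rewards standing in for the reward model's outputs:

```python
import math

def preference_loss(r_chosen, r_rejected):
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected)."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Loss shrinks as the reward model separates chosen from rejected:
print(round(preference_loss(0.0, 0.0), 3))   # 0.693 (log 2: no separation yet)
print(round(preference_loss(3.0, -1.0), 3))  # 0.018 (well separated)
```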
Topic 03
Prompt Engineering Mastery
| Technique | When to use | Example |
|---|---|---|
| Zero-shot | Simple, well-defined tasks | "Classify as positive/negative: 'This movie was great!'" |
| Few-shot | Model needs format examples | Provide 3-5 input→output pairs before your query |
| Chain-of-Thought | Math, logic, multi-step reasoning | "Let's think step by step..." forces explicit reasoning |
| ReAct | Agent tasks requiring tools | Reason → Take action → Observe result → Repeat |
| Self-consistency | Accuracy-critical reasoning | Sample 5 CoT responses, take majority vote answer |
| Structured output | When you need JSON/specific format | Specify exact JSON schema in prompt + use instructor lib |
Specific beats vague. "Write a Python function that takes a list of dicts with 'name' and 'score' keys, sorts by score descending, and returns the top 3" is 10× better than "sort a list." Give the model your exact constraints, format requirements, and edge cases upfront.
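Of the techniques in the table, self-consistency is the most mechanical, so it's worth sketching. Assume you've already sampled several chain-of-thought completions (at T > 0) and parsed out just the final answers; the vote itself is trivial:

```python
from collections import Counter

def self_consistency(answers):
    """Majority vote over the final answers of several sampled CoT responses
    (vote on answers only, not the reasoning traces)."""
    return Counter(answers).most_common(1)[0][0]

# Hypothetical final answers parsed from 5 sampled responses:
samples = ["42", "42", "17", "42", "17"]
print(self_consistency(samples))  # 42
```

The answer-parsing step is where real implementations spend their effort; the vote is the easy part.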
Topic 04
Fine-tuning: LoRA & QLoRA
Train 0.1% of parameters, get 99% of the result
Full fine-tuning updates all 7-70 billion parameters. With Adam in mixed precision, weights, gradients, and optimizer states together cost roughly 16 bytes per parameter: around 112GB of GPU memory for a 7B model and over 1TB for 70B. LoRA (Low-Rank Adaptation) injects small trainable matrices into the attention layers while freezing the original weights.
The key insight: weight updates during fine-tuning tend to be low-rank — meaning the update matrix ΔW can be decomposed as two small matrices: ΔW = A × B where A is d×r and B is r×d, with r ≪ d (r=8 or 16 typically).
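The arithmetic behind the "0.1% of parameters" claim, using an illustrative hidden size (d=4096 is common for 7B-class models):

```python
def lora_trainable_fraction(d, r):
    """Trainable params of one LoRA pair (A: d x r, B: r x d)
    relative to the frozen d x d weight matrix it adapts."""
    lora = 2 * d * r
    full = d * d
    return lora, full, lora / full

lora, full, frac = lora_trainable_fraction(d=4096, r=8)
print(lora, full, f"{frac:.2%}")  # 65536 16777216 0.39%
```

Summed over every adapted matrix in the model, the trainable share typically lands well under 1% of total parameters.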
70B fine-tuning on 2 consumer GPUs
QLoRA combines 4-bit quantization with LoRA. The frozen base model is quantized to 4-bit (NF4 format), reducing memory 4×. LoRA adapters remain in full precision. This enables fine-tuning a 70B model on two consumer RTX 4090 GPUs (48GB total VRAM).
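The memory win from quantizing the frozen weights is straightforward to estimate. This counts base weights only and ignores activations, LoRA adapters, and NF4's quantization constants:

```python
def base_model_memory_gb(n_params, bits):
    """Memory for the frozen base weights alone, in GB."""
    return n_params * bits / 8 / 1e9

print(base_model_memory_gb(70e9, 16))  # 140.0 (fp16)
print(base_model_memory_gb(70e9, 4))   # 35.0  (4-bit NF4)
```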
Fine-tune only when you need a consistent style or format, or when dealing with highly specialized domain knowledge. For most tasks, good prompt engineering gets you 80-90% of the way there at zero cost. Fine-tuning to "inject knowledge" rarely works — use RAG instead.
Topic 05
Building Production LLM Applications
| Layer | Technology choices | Purpose |
|---|---|---|
| LLM Provider | OpenAI, Anthropic, Groq, Together.ai | The model itself |
| Orchestration | LangChain, LlamaIndex, LangGraph | Chains, agents, RAG pipelines |
| Vector DB | Pinecone, Chroma, Weaviate, pgvector | Semantic search for RAG |
| Evaluation | RAGAS, LangSmith, PromptFoo | Measure quality at scale |
| Observability | LangFuse, Helicone | Latency, cost, error tracking |
| Caching | Redis, GPTCache | Cut cost on repeated queries |
- Implement retry logic with exponential backoff — LLM APIs rate-limit aggressively
- Cache frequent LLM calls (semantic caching — even similar queries hit cache)
- Set per-user token budgets to prevent abuse
- Log every LLM call: prompt, response, latency, cost, user ID
- Have fallback models: if OpenAI is down → try Claude → try Groq
- Evaluate with LLM-as-judge on a fixed test set weekly to catch regressions
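The first bullet is the one most teams get wrong, so here is a minimal sketch. `RateLimitError` is a placeholder for whatever your provider's SDK actually raises (typically on HTTP 429); the injectable `sleep` just keeps the demo instant.

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for the provider SDK's rate-limit exception."""

def call_with_backoff(fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Retry fn() with exponential backoff plus jitter: ~1s, ~2s, ~4s, ..."""
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error to the caller
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)  # jitter
            sleep(delay)

# Demo: a fake LLM call that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitError()
    return "ok"

print(call_with_backoff(flaky, sleep=lambda s: None))  # ok (after 2 retries)
```

The jitter matters: without it, every client that got rate-limited at the same moment retries at the same moment too.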
Topic 06
Hallucinations, Safety & Reliability
LLMs hallucinate with confidence — they'll invent citations, describe events that never happened, and provide wrong medical advice in a reassuring tone. This is a fundamental limitation of the architecture, not a bug that will be "fixed."
Mitigation strategies:
- RAG: Ground every response in retrieved documents. Force the model to cite sources
- Lower temperature: T=0 for factual tasks, higher T for creative tasks
- Constitutional AI / system prompts: Explicit rules about what the model should and shouldn't do
- Human-in-the-loop: For high-stakes decisions (medical, legal, financial), always require human review
- Uncertainty quantification: Ask the model to express confidence ("I'm not sure about this, but..."); self-reported confidence is imperfect, but it flags answers worth double-checking
Cost = (input tokens × input price) + (output tokens × output price). Output tokens are 3-5× more expensive than input. Constrain output length explicitly. Use smaller, faster models (GPT-4o-mini, Haiku) for simple tasks and reserve large models (GPT-4o, Claude Opus) for complex reasoning. A 10× cost difference often gives only 5-10% quality improvement.
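Putting numbers on the formula (the per-million-token prices here are made-up placeholders; check your provider's current price sheet):

```python
# Hypothetical (input, output) prices in $ per 1M tokens.
PRICES = {"small": (0.15, 0.60), "large": (2.50, 10.00)}

def call_cost(model, input_tokens, output_tokens):
    """Cost = input tokens x input price + output tokens x output price."""
    p_in, p_out = PRICES[model]
    return (input_tokens * p_in + output_tokens * p_out) / 1e6

# Same workload, two model tiers:
print(round(call_cost("small", 2000, 500), 6))  # 0.0006
print(round(call_cost("large", 2000, 500), 6))  # 0.01
```

Run at volume, that gap is the difference between a rounding error and a real line item, which is why routing simple tasks to the small tier pays off.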