Topic 01

How LLMs Generate Text

An LLM doesn't "know" things the way humans do. It's a function that takes tokens (text chunks) as input and outputs a probability distribution over the next token. Generation is sampling from this distribution, one token at a time.

  1. Tokenization: "Hello world" → [15496, 995] (token IDs, not words)
  2. Embedding: Each token ID → 4096-dimensional vector
  3. Transformer blocks: 32-96 layers of self-attention + feed-forward
  4. Output projection: 4096-dim → 50,000-dim (vocabulary size) logits
  5. Sampling: Apply temperature, sample next token, append to sequence, repeat
P(token_i) = softmax(logits / T)_i    |    T = temperature

Temperature controls randomness: T=0 always picks the highest probability token (deterministic, boring). T=1 is balanced. T=2 is chaotic and creative. Think of it like spice level — T=0 is plain, T=2 is extra hot.
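The sampling step above can be sketched in plain Python. This is a toy with a 4-token vocabulary (a real model produces a ~50,000-entry logit vector), but the temperature-scaled softmax is the same idea:

```python
import math
import random

def sample_next_token(logits, temperature=1.0):
    """Sample a token index from logits via temperature-scaled softmax.

    temperature=0 is treated as greedy decoding (argmax); higher values
    flatten the distribution and increase randomness.
    """
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(logits)), weights=probs)[0]

# Toy vocabulary of 4 tokens with raw logits
logits = [2.0, 1.0, 0.5, -1.0]
print(sample_next_token(logits, temperature=0))  # greedy: always index 0
```

Generation loops this function: sample a token, append it to the sequence, run the model again.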

Why LLMs hallucinate

LLMs don't retrieve facts from memory — they generate text that "sounds right" based on patterns in training data. There's no fact database being queried. When asked about an obscure topic, the model generates statistically plausible-sounding text, which may be factually wrong. This is why RAG (Retrieval-Augmented Generation) is so important — ground the model in real documents.

Topic 02

Scaling Laws, RLHF & Instruction Tuning

SCALING

The Chinchilla scaling law

The Chinchilla paper (DeepMind, 2022) found that model size and training tokens should scale together: optimal training uses roughly 20 tokens per parameter. A 70B parameter model should train on ~1.4 trillion tokens.
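The 20-tokens-per-parameter rule is simple arithmetic; a one-line helper makes the 70B example concrete:

```python
def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    """Compute-optimal training tokens per the ~20 tokens/parameter rule."""
    return n_params * tokens_per_param

# A 70B-parameter model should train on ~1.4 trillion tokens
print(chinchilla_optimal_tokens(70e9))  # 1.4e12
```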

Llama 3 and Mistral were trained far past this optimum: Llama 3's 8B model saw 15T+ tokens, roughly 100× the Chinchilla-optimal count. This "overtraining" produces smaller models that perform like much larger ones, which pays off in inference cost.

RLHF

Why raw pretrained models need human feedback

A model pretrained on internet text learns to be a statistical mirror of the internet — including toxic content, bias, and harmful information. RLHF (Reinforcement Learning from Human Feedback) fine-tunes the model to be helpful, harmless, and honest:

  1. Collect human preference data: show two responses, ask which is better
  2. Train a Reward Model to predict human preferences
  3. Use PPO (a reinforcement learning algorithm) to optimize the LLM to maximize reward

Topic 03

Prompt Engineering Mastery

Technique         | When to use                         | Example
Zero-shot         | Simple, well-defined tasks          | "Classify as positive/negative: 'This movie was great!'"
Few-shot          | Model needs format examples         | Provide 3-5 input→output pairs before your query
Chain-of-Thought  | Math, logic, multi-step reasoning   | "Let's think step by step..." forces explicit reasoning
ReAct             | Agent tasks requiring tools         | Reason → Take action → Observe result → Repeat
Self-consistency  | Accuracy-critical reasoning         | Sample 5 CoT responses, take majority vote answer
Structured output | When you need JSON/specific format  | Specify exact JSON schema in prompt + use instructor lib
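The self-consistency technique can be sketched with a stand-in sampler; the `sample_fn` callable and the canned answers below are hypothetical, standing in for real LLM calls at nonzero temperature:

```python
from collections import Counter

def self_consistency(sample_fn, n=5):
    """Self-consistency: sample several chain-of-thought answers,
    then return the majority-vote final answer.

    `sample_fn` stands in for one LLM call that returns a final answer.
    """
    answers = [sample_fn() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Hypothetical model output: "42" three times out of five samples
fake_answers = iter(["42", "41", "42", "43", "42"])
print(self_consistency(lambda: next(fake_answers), n=5))  # "42"
```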
The Golden Rule of Prompting

Specific beats vague. "Write a Python function that takes a list of dicts with 'name' and 'score' keys, sorts by score descending, and returns the top 3" is 10× better than "sort a list." Give the model your exact constraints, format requirements, and edge cases upfront.

Topic 04

Fine-tuning: LoRA & QLoRA

LORA

Train 0.1% of parameters, get 99% of the result

Full fine-tuning updates all 7-70 billion parameters — requiring 80-320GB of GPU RAM. LoRA (Low-Rank Adaptation) injects small trainable matrices into the attention layers while freezing the original weights.

The key insight: weight updates during fine-tuning tend to be low-rank — meaning the update matrix ΔW can be decomposed as two small matrices: ΔW = A × B where A is d×r and B is r×d, with r ≪ d (r=8 or 16 typically).

W_new = W_frozen + A × B    |    Trainable params: 2 × d × r (vs d²)
# LoRA fine-tuning with Hugging Face PEFT
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

lora_config = LoraConfig(
    r=16,                                 # rank
    lora_alpha=32,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # which layers to adapt
    lora_dropout=0.1,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 4,194,304 || all params: 7,241,748,480 || trainable%: 0.06%
QLORA

70B fine-tuning on 2 consumer GPUs

QLoRA combines 4-bit quantization with LoRA. The frozen base model is quantized to 4-bit (NF4 format), reducing memory 4×. LoRA adapters remain in full precision. This enables fine-tuning a 70B model on two consumer RTX 4090 GPUs (48GB total VRAM).
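The 4× saving is straightforward arithmetic. A small helper for the base-weight footprint (an approximation that ignores activations, optimizer state, and the LoRA adapters themselves):

```python
def base_model_memory_gb(n_params, bits_per_weight):
    """Approximate VRAM needed for frozen base weights at a given precision."""
    return n_params * bits_per_weight / 8 / 1e9

# 70B parameters: 16-bit vs 4-bit NF4 quantization
print(base_model_memory_gb(70e9, 16))  # 140.0 GB in fp16
print(base_model_memory_gb(70e9, 4))   # 35.0 GB in 4-bit
```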

When NOT to fine-tune

Fine-tune only when you need a consistent style or format, or when dealing with highly specialized domain knowledge. For most tasks, good prompt engineering gets you 80-90% of the way there at zero cost. Fine-tuning to "inject knowledge" rarely works — use RAG instead.

Topic 05

Building Production LLM Applications

Layer         | Technology choices                    | Purpose
LLM Provider  | OpenAI, Anthropic, Groq, Together.ai  | The model itself
Orchestration | LangChain, LlamaIndex, LangGraph      | Chains, agents, RAG pipelines
Vector DB     | Pinecone, Chroma, Weaviate, pgvector  | Semantic search for RAG
Evaluation    | RAGAS, LangSmith, PromptFoo           | Measure quality at scale
Observability | LangFuse, Helicone                    | Latency, cost, error tracking
Caching       | Redis, GPTCache                       | Cut cost on repeated queries
Production checklist
  • Implement retry logic with exponential backoff — LLM APIs rate-limit aggressively
  • Cache frequent LLM calls (semantic caching — even similar queries hit cache)
  • Set per-user token budgets to prevent abuse
  • Log every LLM call: prompt, response, latency, cost, user ID
  • Have fallback models: if OpenAI is down → try Claude → try Groq
  • Evaluate with LLM-as-judge on a fixed test set weekly to catch regressions
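The first checklist item can be sketched as a small wrapper. Here `fn` stands in for any rate-limited LLM API call; catching bare `Exception` is a simplification, and real code should catch the provider's specific rate-limit error:

```python
import random
import time

def call_with_backoff(fn, max_retries=5, base_delay=1.0):
    """Retry a rate-limited call with exponential backoff plus jitter.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts,
    re-raising the last error once retries are exhausted.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)
```

The jitter term spreads out retries so many clients rate-limited at once don't all hammer the API again simultaneously.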

Topic 06

Hallucinations, Safety & Reliability

LLMs hallucinate with confidence — they'll invent citations, describe events that never happened, and provide wrong medical advice in a reassuring tone. This is a fundamental limitation of the architecture, not a bug that will be "fixed."

Mitigation strategies:

  • RAG: Ground every response in retrieved documents. Force the model to cite sources
  • Lower temperature: T=0 for factual tasks, higher T for creative tasks
  • Constitutional AI / system prompts: Explicit rules about what the model should and shouldn't do
  • Human-in-the-loop: For high-stakes decisions (medical, legal, financial), always require human review
  • Uncertainty quantification: Ask the model to express confidence ("I'm not sure about this, but...")
The economics of LLMs

Cost = (input tokens × input price) + (output tokens × output price). Output tokens are 3-5× more expensive than input. Constrain output length explicitly. Use smaller, faster models (GPT-4o-mini, Haiku) for simple tasks and reserve large models (GPT-4o, Claude Opus) for complex reasoning. A 10× cost difference often gives only 5-10% quality improvement.
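The cost formula translates directly to code; the per-million-token prices below are placeholders, not any vendor's actual rates:

```python
def request_cost(input_tokens, output_tokens,
                 input_price_per_m, output_price_per_m):
    """Cost in dollars of one LLM call, with prices quoted per 1M tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1e6

# 2,000 input + 500 output tokens at hypothetical $0.50 / $1.50 per 1M
print(request_cost(2000, 500, 0.50, 1.50))  # 0.00175
```

Summing this per user over a billing period is also the natural basis for the per-user token budgets in the production checklist.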