
Generative AI & LLMs

LLM Architecture · Prompt Engineering · LoRA/QLoRA · Production Apps

Advanced Module · 2 Weeks · 4 Lessons · Prepflix AI Roadmap
LLM Fundamentals

How LLMs Generate Text

  1. Tokenize input → sequence of token IDs
  2. Embed tokens → dense vectors
  3. Pass through N Transformer decoder blocks
  4. Project to vocabulary size → logits
  5. Apply temperature + sample next token
  6. Append to sequence → repeat until EOS
Temperature: p_i = exp(logit_i / T) / Σ_j exp(logit_j / T)

T→0: greedy (deterministic), T=1: standard, T>1: creative/random
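The six-step loop and the temperature formula above can be sketched end to end with a toy model. Here `fake_forward` is a hypothetical stand-in for the real Transformer forward pass (it just returns logits for the current sequence); everything else follows the steps as written:

```python
import math
import random

def softmax_with_temperature(logits, T):
    """p_i = exp(logit_i / T) / sum_j exp(logit_j / T), computed stably."""
    scaled = [l / T for l in logits]
    m = max(scaled)                              # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def generate(fake_forward, eos_id, T=0.7, max_tokens=20, seed=0):
    """Steps 3-6: forward pass -> logits -> temperature sample -> append until EOS."""
    rng = random.Random(seed)
    tokens = [0]                                 # steps 1-2 assumed done: start from a BOS id
    for _ in range(max_tokens):
        probs = softmax_with_temperature(fake_forward(tokens), T)
        next_id = rng.choices(range(len(probs)), weights=probs)[0]
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens
```

As T→0 the scaled logits diverge and the distribution collapses onto the argmax (greedy decoding); as T grows the distribution flattens toward uniform, which is why high temperatures read as "creative".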

Key Parameters

Context Window: max tokens the model can attend to (GPT-4 Turbo: 128k, Gemini 1.5: 1M)
Temperature: 0 = deterministic, 0.7 = balanced, 1+ = creative
Top-p (nucleus): sample from the smallest set of tokens covering p (e.g. 0.9) of the probability mass
Top-k: sample from the k most likely tokens
Max tokens: maximum output length
Stop sequences: strings that end generation early
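Top-k and top-p can be sketched as filters applied to the probability distribution before sampling. This is a minimal illustration of the idea, not any particular library's implementation:

```python
def top_k_filter(probs, k):
    """Keep only the k most likely tokens, renormalize so they sum to 1."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability >= p (nucleus)."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cum = [], 0.0
    for i in order:
        keep.append(i)
        cum += probs[i]
        if cum >= p:
            break
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]
```

Note the difference: top-k keeps a fixed number of candidates regardless of how peaked the distribution is, while top-p adapts, keeping fewer tokens when the model is confident and more when it is uncertain.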
Scaling Laws: the Chinchilla paper found that compute-optimal training uses ~20 tokens per parameter. Llama-3 70B was trained on 15T tokens, far beyond the Chinchilla-optimal ratio; this deliberately trades extra training compute for a smaller, cheaper-to-serve model.
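The tokens-per-parameter arithmetic behind that claim:

```python
def tokens_per_param(train_tokens, params):
    """Ratio of training tokens to model parameters."""
    return train_tokens / params

CHINCHILLA_OPTIMAL = 20                           # ~20 tokens per parameter
llama3_ratio = tokens_per_param(15e12, 70e9)      # 15T tokens / 70B params
print(round(llama3_ratio))                        # ~214 tokens per parameter, >10x Chinchilla
```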
Prompt Engineering Mastery

Core Techniques

Zero-shot: "Classify this review as positive/negative."
Few-shot: provide 3-5 examples of input→output pairs
Chain-of-Thought: "Let's think step by step…" improves reasoning
Self-Consistency: generate N reasoning paths, take the majority vote
ReAct: Reason → Action → Observation loop
Tree of Thoughts: explore multiple reasoning branches
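Self-consistency from the table above is simple enough to sketch directly: sample N independent answers and keep the most common one. `sample_answer` below is a hypothetical stand-in for one LLM call with a nonzero temperature:

```python
import itertools
from collections import Counter

def self_consistency(sample_answer, n=5):
    """Generate n independent reasoning paths; return the most common final answer."""
    answers = [sample_answer() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# demo: a fake sampler whose reasoning paths land on "42" four times out of five
_fake = itertools.cycle(["42", "42", "41", "42", "42"])
majority = self_consistency(lambda: next(_fake), n=5)
```

The majority vote only helps when the paths are actually independent, which is why self-consistency is run at temperature > 0 rather than greedily.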

Prompt Structure Best Practices

System Prompt

Define persona, constraints, output format, tone. "You are an expert Python engineer. Always provide runnable code with type hints."

Structured Output

Request JSON with schema definition. Use function calling / tool use for reliable structured output. Pydantic + instructor library for automatic parsing.
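The text above recommends Pydantic + the instructor library; a dependency-free sketch of the same idea, requesting JSON and then parsing and validating the model's reply against a schema, looks like this (the field names and schema are illustrative assumptions, not a fixed convention):

```python
import json

# expected fields and their types in the model's JSON reply (illustrative)
REQUIRED = {"sentiment": str, "confidence": float}

def parse_reply(reply: str) -> dict:
    """Parse an LLM reply expected to be JSON and check it matches the schema."""
    data = json.loads(reply)
    for field, typ in REQUIRED.items():
        if not isinstance(data.get(field), typ):
            raise ValueError(f"missing or mistyped field: {field}")
    return data

valid = parse_reply('{"sentiment": "positive", "confidence": 0.93}')
try:
    parse_reply('{"sentiment": "positive"}')     # confidence missing -> rejected
    schema_enforced = False
except ValueError:
    schema_enforced = True
```

Pydantic does this validation declaratively, and instructor wires the schema into the API call itself (via function calling / tool use), retrying automatically when the reply fails validation.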

Golden Rule: Specific beats vague. "Write a Python function that takes a list of dicts with 'name' and 'score' keys and returns top 3 by score" beats "sort a list."
Fine-tuning & PEFT
Full Fine-tuning

Update all parameters. Expensive (100s of GPU hours), risk of catastrophic forgetting. Use only when you have large, high-quality domain data.

LoRA

Add trainable low-rank matrices to attention weights. 0.1-1% of parameters. No additional inference latency — merge into base model when done.

Typical config: r=16, α=32; target modules: q, k, v, o projections
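The "0.1-1% of parameters" claim follows from counting the low-rank factors: each d×d projection gains two trainable matrices A (r×d) and B (d×r), contributing ΔW = (α/r)·BA. A quick sanity check, using illustrative 7B-class dimensions (d_model=4096, 32 layers):

```python
def lora_trainable_fraction(d_model, n_layers, r, n_proj=4):
    """Fraction of projection parameters LoRA trains when adapting q, k, v, o."""
    base = n_layers * n_proj * d_model * d_model   # frozen d x d projection weights
    lora = n_layers * n_proj * 2 * r * d_model     # trainable A (r x d) + B (d x r)
    return lora / base

frac = lora_trainable_fraction(4096, 32, 16)
print(f"{frac:.2%}")   # 0.78% of the projection weights, inside the 0.1-1% range
```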
QLoRA

4-bit quantized base model + LoRA. Enables 70B fine-tuning on 1-2 GPUs. Uses NF4 quantization + double quantization for memory efficiency.
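The memory arithmetic behind "70B on 1-2 GPUs" (weights only; KV cache, LoRA adapters, and optimizer state add overhead on top):

```python
def base_model_gb(params_b, bits):
    """Approximate weight memory (GB) for a model at a given precision."""
    return params_b * 1e9 * bits / 8 / 1e9

fp16_gb = base_model_gb(70, 16)   # 140.0 GB -> needs multiple GPUs just for weights
nf4_gb = base_model_gb(70, 4)     #  35.0 GB -> fits an 80GB GPU with room for LoRA
```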

Instruction Fine-tuning Data Format

{
  "instruction": "Summarize the following article in 3 bullet points:",
  "input": "[article text]",
  "output": "• Key point 1\n• Key point 2\n• Key point 3"
}
# Alpaca format — needs ~1,000-50,000 examples depending on task
# Quality > Quantity: 1,000 excellent examples beat 100k mediocre ones
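Before training, each example is rendered into a single prompt string. A common rendering of the Alpaca format looks like the sketch below; the exact template wording varies between codebases, so treat this one as illustrative:

```python
def render_alpaca(example: dict) -> str:
    """Render an Alpaca-format example into one training prompt string."""
    prompt = f"### Instruction:\n{example['instruction']}\n"
    if example.get("input"):                       # the input field is optional
        prompt += f"\n### Input:\n{example['input']}\n"
    prompt += f"\n### Response:\n{example['output']}"
    return prompt

example = {
    "instruction": "Summarize the following article in 3 bullet points:",
    "input": "[article text]",
    "output": "• Key point 1\n• Key point 2\n• Key point 3",
}
rendered = render_alpaca(example)
```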
Building Production LLM Applications

LLM App Stack

LLM Provider: OpenAI, Anthropic, Groq, Together.ai
Orchestration: LangChain, LlamaIndex, LangGraph
Vector DB: Pinecone, Chroma, Weaviate, Qdrant
Evaluation: LangSmith, RAGAS, PromptFoo
Serving: FastAPI, vLLM, TGI (for self-hosted)
Monitoring: LangFuse, Helicone (latency, cost, errors)

Production Checklist

  • Implement retry logic with exponential backoff
  • Cache frequent LLM calls (semantic caching)
  • Set token budget limits per user/request
  • Log all LLM calls for debugging and evaluation
  • Implement fallback models (OpenAI → Claude)
  • Rate limit users to prevent abuse
  • Evaluate with LLM-as-judge on test set weekly
  • A/B test prompts systematically
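The first checklist item, retries with exponential backoff, can be sketched as a small wrapper around any flaky call. The `sleep` parameter is injected so tests (and rate-limit-aware callers) can override it; the jitter term spreads out retries from concurrent clients:

```python
import random
import time

def with_retries(call, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Retry a flaky call with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise                               # out of attempts: surface the error
            sleep(base_delay * 2 ** attempt * (1 + random.random()))
```

In production you would narrow the `except` to retryable errors (timeouts, 429s, 5xx) and let bad requests fail fast.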
Cost = Input tokens × in_price + Output tokens × out_price. Output tokens are 3-5× more expensive. Minimize by constraining output length and using smaller models where possible.
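The cost formula in code, with hypothetical per-million-token prices ($3/M in, $15/M out — a 5× ratio, consistent with the 3-5× range above):

```python
def request_cost(in_tokens, out_tokens, in_price_per_m, out_price_per_m):
    """Cost in dollars, given prices quoted per million tokens."""
    return in_tokens / 1e6 * in_price_per_m + out_tokens / 1e6 * out_price_per_m

# 10k input tokens + 1k output tokens at illustrative prices
cost = request_cost(10_000, 1_000, 3, 15)   # 0.03 input + 0.015 output = $0.045
```

Note that the 1k output tokens cost half as much as the 10k input tokens, which is why capping output length is one of the cheapest optimizations available.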