
Generative AI & LLMs

LLM Architecture · Prompt Engineering · LoRA/QLoRA · Production Apps

Advanced Module · 2 Weeks · 4 Lessons · Prepflix AI Roadmap
LLM Fundamentals

How LLMs Generate Text

  1. Tokenize input → sequence of token IDs
  2. Embed tokens → dense vectors
  3. Pass through N Transformer decoder blocks
  4. Project to vocabulary size → logits
  5. Apply temperature + sample next token
  6. Append to sequence → repeat until EOS
Temperature: p_i = exp(logit_i / T) / Σ_j exp(logit_j / T)

T→0: greedy (deterministic), T=1: standard, T>1: creative/random
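The six-step loop and the temperature formula above can be sketched end to end with a toy model. Here `fake_forward` is a hypothetical stand-in for the real Transformer forward pass (it just returns logits for the current sequence); everything else follows the steps as written:

```python
import math
import random

def softmax_with_temperature(logits, T):
    """p_i = exp(logit_i / T) / sum_j exp(logit_j / T), computed stably."""
    scaled = [l / T for l in logits]
    m = max(scaled)                              # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def generate(fake_forward, eos_id, T=0.7, max_tokens=20, seed=0):
    """Steps 3-6: forward pass -> logits -> temperature sample -> append until EOS."""
    rng = random.Random(seed)
    tokens = [0]                                 # steps 1-2 assumed done: start from a BOS id
    for _ in range(max_tokens):
        probs = softmax_with_temperature(fake_forward(tokens), T)
        next_id = rng.choices(range(len(probs)), weights=probs)[0]
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens
```

As T→0 the scaled logits diverge and the distribution collapses onto the argmax (greedy decoding); as T grows the distribution flattens toward uniform, which is why high temperatures read as "creative".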

Key Parameters

Context Window: max tokens the model can attend to (GPT-4 Turbo: 128k, Gemini 1.5: 1M)
Temperature: 0 = deterministic, 0.7 = balanced, 1+ = creative
Top-p (nucleus): sample from the smallest set of tokens covering p (e.g. 0.9) of the probability mass
Top-k: sample from the k most likely tokens
Max tokens: maximum output length
Stop sequences: strings that end generation early
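Top-k and top-p can be sketched as filters applied to the probability distribution before sampling. This is a minimal illustration of the idea, not any particular library's implementation:

```python
def top_k_filter(probs, k):
    """Keep only the k most likely tokens, renormalize so they sum to 1."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability >= p (nucleus)."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cum = [], 0.0
    for i in order:
        keep.append(i)
        cum += probs[i]
        if cum >= p:
            break
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]
```

Note the difference: top-k keeps a fixed number of candidates regardless of how peaked the distribution is, while top-p adapts, keeping fewer tokens when the model is confident and more when it is uncertain.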
Scaling Laws: the Chinchilla paper found that compute-optimal training uses ~20 tokens per parameter. Llama-3 70B was trained on 15T tokens, far beyond the Chinchilla-optimal ratio; this deliberately trades extra training compute for a smaller, cheaper-to-serve model.
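The tokens-per-parameter arithmetic behind that claim:

```python
def tokens_per_param(train_tokens, params):
    """Ratio of training tokens to model parameters."""
    return train_tokens / params

CHINCHILLA_OPTIMAL = 20                           # ~20 tokens per parameter
llama3_ratio = tokens_per_param(15e12, 70e9)      # 15T tokens / 70B params
print(round(llama3_ratio))                        # ~214 tokens per parameter, >10x Chinchilla
```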
Prompt Engineering Mastery

Core Techniques

Zero-shot: "Classify this review as positive/negative."
Few-shot: provide 3-5 examples of input→output pairs
Chain-of-Thought: "Let's think step by step…" improves reasoning
Self-Consistency: generate N reasoning paths, take the majority vote
ReAct: Reason → Action → Observation loop
Tree of Thoughts: explore multiple reasoning branches
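Self-consistency from the table above is simple enough to sketch directly: sample N independent answers and keep the most common one. `sample_answer` below is a hypothetical stand-in for one LLM call with a nonzero temperature:

```python
import itertools
from collections import Counter

def self_consistency(sample_answer, n=5):
    """Generate n independent reasoning paths; return the most common final answer."""
    answers = [sample_answer() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# demo: a fake sampler whose reasoning paths land on "42" four times out of five
_fake = itertools.cycle(["42", "42", "41", "42", "42"])
majority = self_consistency(lambda: next(_fake), n=5)
```

The majority vote only helps when the paths are actually independent, which is why self-consistency is run at temperature > 0 rather than greedily.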

Prompt Structure Best Practices

System Prompt

Define persona, constraints, output format, tone. "You are an expert Python engineer. Always provide runnable code with type hints."

Structured Output

Request JSON with schema definition. Use function calling / tool use for reliable structured output. Pydantic + instructor library for automatic parsing.
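The text above recommends Pydantic + the instructor library; a dependency-free sketch of the same idea, requesting JSON and then parsing and validating the model's reply against a schema, looks like this (the field names and schema are illustrative assumptions, not a fixed convention):

```python
import json

# expected fields and their types in the model's JSON reply (illustrative)
REQUIRED = {"sentiment": str, "confidence": float}

def parse_reply(reply: str) -> dict:
    """Parse an LLM reply expected to be JSON and check it matches the schema."""
    data = json.loads(reply)
    for field, typ in REQUIRED.items():
        if not isinstance(data.get(field), typ):
            raise ValueError(f"missing or mistyped field: {field}")
    return data

valid = parse_reply('{"sentiment": "positive", "confidence": 0.93}')
try:
    parse_reply('{"sentiment": "positive"}')     # confidence missing -> rejected
    schema_enforced = False
except ValueError:
    schema_enforced = True
```

Pydantic does this validation declaratively, and instructor wires the schema into the API call itself (via function calling / tool use), retrying automatically when the reply fails validation.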

Golden Rule: Specific beats vague. "Write a Python function that takes a list of dicts with 'name' and 'score' keys and returns top 3 by score" beats "sort a list."
Fine-tuning & PEFT
Full Fine-tuning

Update all parameters. Expensive (100s of GPU hours), risk of catastrophic forgetting. Use only when you have large, high-quality domain data.

LoRA

Add trainable low-rank matrices to attention weights. 0.1-1% of parameters. No additional inference latency — merge into base model when done.

Typical config: r=16, α=32; target modules: q, k, v, o projections
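The "0.1-1% of parameters" claim follows from counting the low-rank factors: each d×d projection gains two trainable matrices A (r×d) and B (d×r), contributing ΔW = (α/r)·BA. A quick sanity check, using illustrative 7B-class dimensions (d_model=4096, 32 layers):

```python
def lora_trainable_fraction(d_model, n_layers, r, n_proj=4):
    """Fraction of projection parameters LoRA trains when adapting q, k, v, o."""
    base = n_layers * n_proj * d_model * d_model   # frozen d x d projection weights
    lora = n_layers * n_proj * 2 * r * d_model     # trainable A (r x d) + B (d x r)
    return lora / base

frac = lora_trainable_fraction(4096, 32, 16)
print(f"{frac:.2%}")   # 0.78% of the projection weights, inside the 0.1-1% range
```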
QLoRA

4-bit quantized base model + LoRA. Enables 70B fine-tuning on 1-2 GPUs. Uses NF4 quantization + double quantization for memory efficiency.
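The memory arithmetic behind "70B on 1-2 GPUs" (weights only; KV cache, LoRA adapters, and optimizer state add overhead on top):

```python
def base_model_gb(params_b, bits):
    """Approximate weight memory (GB) for a model at a given precision."""
    return params_b * 1e9 * bits / 8 / 1e9

fp16_gb = base_model_gb(70, 16)   # 140.0 GB -> needs multiple GPUs just for weights
nf4_gb = base_model_gb(70, 4)     #  35.0 GB -> fits an 80GB GPU with room for LoRA
```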

Instruction Fine-tuning Data Format

{
  "instruction": "Summarize the following article in 3 bullet points:",
  "input": "[article text]",
  "output": "• Key point 1\n• Key point 2\n• Key point 3"
}
# Alpaca format — needs ~1,000-50,000 examples depending on task
# Quality > Quantity: 1,000 excellent examples beat 100k mediocre ones
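Before training, each example is rendered into a single prompt string. A common rendering of the Alpaca format looks like the sketch below; the exact template wording varies between codebases, so treat this one as illustrative:

```python
def render_alpaca(example: dict) -> str:
    """Render an Alpaca-format example into one training prompt string."""
    prompt = f"### Instruction:\n{example['instruction']}\n"
    if example.get("input"):                       # the input field is optional
        prompt += f"\n### Input:\n{example['input']}\n"
    prompt += f"\n### Response:\n{example['output']}"
    return prompt

example = {
    "instruction": "Summarize the following article in 3 bullet points:",
    "input": "[article text]",
    "output": "• Key point 1\n• Key point 2\n• Key point 3",
}
rendered = render_alpaca(example)
```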
Building Production LLM Applications

LLM App Stack

LLM Provider: OpenAI, Anthropic, Groq, Together.ai
Orchestration: LangChain, LlamaIndex, LangGraph
Vector DB: Pinecone, Chroma, Weaviate, Qdrant
Evaluation: LangSmith, RAGAS, PromptFoo
Serving: FastAPI, vLLM, TGI (for self-hosted)
Monitoring: LangFuse, Helicone (latency, cost, errors)

Production Checklist

  • Implement retry logic with exponential backoff
  • Cache frequent LLM calls (semantic caching)
  • Set token budget limits per user/request
  • Log all LLM calls for debugging and evaluation
  • Implement fallback models (OpenAI → Claude)
  • Rate limit users to prevent abuse
  • Evaluate with LLM-as-judge on test set weekly
  • A/B test prompts systematically
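The first checklist item, retries with exponential backoff, can be sketched as a small wrapper around any flaky call. The `sleep` parameter is injected so tests (and rate-limit-aware callers) can override it; the jitter term spreads out retries from concurrent clients:

```python
import random
import time

def with_retries(call, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Retry a flaky call with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise                               # out of attempts: surface the error
            sleep(base_delay * 2 ** attempt * (1 + random.random()))
```

In production you would narrow the `except` to retryable errors (timeouts, 429s, 5xx) and let bad requests fail fast.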
Cost = Input tokens × in_price + Output tokens × out_price. Output tokens are 3-5× more expensive. Minimize by constraining output length and using smaller models where possible.
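The cost formula in code, with hypothetical per-million-token prices ($3/M in, $15/M out — a 5× ratio, consistent with the 3-5× range above):

```python
def request_cost(in_tokens, out_tokens, in_price_per_m, out_price_per_m):
    """Cost in dollars, given prices quoted per million tokens."""
    return in_tokens / 1e6 * in_price_per_m + out_tokens / 1e6 * out_price_per_m

# 10k input tokens + 1k output tokens at illustrative prices
cost = request_cost(10_000, 1_000, 3, 15)   # 0.03 input + 0.015 output = $0.045
```

Note that the 1k output tokens cost half as much as the 10k input tokens, which is why capping output length is one of the cheapest optimizations available.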