Topic 01
How LLMs Generate Text
An LLM doesn't "know" things the way humans do. It's a function that takes tokens (text chunks) as input and outputs a probability distribution over the next token. Generation is sampling from this distribution, one token at a time.
- Tokenization: "Hello world" → [15496, 995] (token IDs, not words)
- Embedding: Each token ID → a dense vector (e.g., 4096-dimensional)
- Transformer blocks: 32-96 layers of self-attention + feed-forward
- Output projection: 4096-dim → logits over the full vocabulary (~50,000 entries)
- Sampling: Apply temperature, sample next token, append to sequence, repeat
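The loop in the last step can be sketched end to end. This is a minimal illustration, not a real model: `toy_model` stands in for the transformer and returns made-up logits over a 4-token vocabulary.

```python
import math
import random

def sample_next(logits, temperature=1.0):
    """Softmax with temperature over the logits, then sample one token index."""
    if temperature == 0:
        return max(range(len(logits)), key=logits.__getitem__)  # greedy argmax
    scaled = [l / temperature for l in logits]
    m = max(scaled)                                # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return random.choices(range(len(logits)), weights=[e / total for e in exps])[0]

# Toy stand-in for the transformer: fixed logits over a 4-token vocabulary.
def toy_model(tokens):
    return [0.1, 0.5, 2.0, 0.3]

tokens = [0]                           # "prompt": a single start token
for _ in range(5):                     # the autoregressive loop
    tokens.append(sample_next(toy_model(tokens), temperature=0))
print(tokens)  # greedy decoding repeats the top token: [0, 2, 2, 2, 2, 2]
```

A real model would recompute logits from the whole sequence each step; the loop structure is the same.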
Temperature controls randomness: T=0 always picks the highest-probability token (greedy decoding: deterministic, boring). T=1 samples the model's distribution unchanged. T=2 is chaotic and creative. Think of it like spice level — T=0 is plain, T=2 is extra hot.
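The effect is easy to see numerically: dividing logits by T before the softmax sharpens the distribution when T < 1 and flattens it when T > 1. A quick check with made-up logits:

```python
import math

def softmax_with_temperature(logits, t):
    """Softmax after scaling logits by 1/t."""
    exps = [math.exp(l / t) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]                    # made-up logits for a 3-token vocab
cold = softmax_with_temperature(logits, 0.5)  # low T: sharper
hot = softmax_with_temperature(logits, 2.0)   # high T: flatter
print(round(cold[0], 2), round(hot[0], 2))  # 0.84 0.48: low T concentrates mass
```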
LLMs don't retrieve facts from memory — they generate text that "sounds right" based on patterns in training data. There's no fact database being queried. When asked about an obscure topic, the model generates statistically plausible-sounding text, which may be factually wrong. This is why RAG (Retrieval-Augmented Generation) is so important — ground the model in real documents.
Topic 02
Scaling Laws, RLHF & Instruction Tuning
The Chinchilla scaling law
The Chinchilla paper (DeepMind, 2022) found that model size and training tokens should scale together: optimal training uses roughly 20 tokens per parameter. A 70B parameter model should train on ~1.4 trillion tokens.
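The rule of thumb is simple enough to check in one line (20:1 is the paper's headline approximation, not an exact law):

```python
def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    """Approximate compute-optimal training tokens per the Chinchilla heuristic."""
    return n_params * tokens_per_param

print(chinchilla_optimal_tokens(70e9) / 1e12)  # 1.4 (trillion tokens for 70B params)
```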
LLaMA 3 and Mistral were trained far past the Chinchilla-optimal point: LLaMA 3's 8B model saw 15T+ tokens, nearly 100× the ~160B the 20:1 rule suggests. This "overtraining" produces smaller models that perform like much larger ones — great for inference cost.
Why raw pretrained models need human feedback
A model pretrained on internet text learns to be a statistical mirror of the internet — including toxic content, bias, and harmful information. RLHF (Reinforcement Learning from Human Feedback) fine-tunes the model to be helpful, harmless, and honest:
- Collect human preference data: show two responses, ask which is better
- Train a Reward Model to predict human preferences
- Use PPO (a reinforcement learning algorithm) to optimize the LLM to maximize reward
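Step 2 is the easiest to make concrete. A standard objective for the reward model is the Bradley-Terry pairwise loss: minimize -log σ(r_chosen - r_rejected), which pushes the human-preferred response's score above the rejected one's. A minimal sketch, with scalar rewards standing in for the reward model's outputs:

```python
import math

def preference_loss(r_chosen, r_rejected):
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected)."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Loss shrinks as the reward model separates chosen from rejected:
print(round(preference_loss(0.0, 0.0), 3))   # 0.693 (log 2: no separation yet)
print(round(preference_loss(3.0, -1.0), 3))  # 0.018 (well separated)
```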
Topic 03
Prompt Engineering Mastery
| Technique | When to use | Example |
|---|---|---|
| Zero-shot | Simple, well-defined tasks | "Classify as positive/negative: 'This movie was great!'" |
| Few-shot | Model needs format examples | Provide 3-5 input→output pairs before your query |
| Chain-of-Thought | Math, logic, multi-step reasoning | "Let's think step by step..." forces explicit reasoning |
| ReAct | Agent tasks requiring tools | Reason → Take action → Observe result → Repeat |
| Self-consistency | Accuracy-critical reasoning | Sample 5 CoT responses, take majority vote answer |
| Structured output | When you need JSON/specific format | Specify exact JSON schema in prompt + use instructor lib |
Specific beats vague. "Write a Python function that takes a list of dicts with 'name' and 'score' keys, sorts by score descending, and returns the top 3" is 10× better than "sort a list." Give the model your exact constraints, format requirements, and edge cases upfront.
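Of the techniques in the table, self-consistency is the most mechanical, so it's worth sketching. Assume you've already sampled several chain-of-thought completions (at T > 0) and parsed out just the final answers; the vote itself is trivial:

```python
from collections import Counter

def self_consistency(answers):
    """Majority vote over the final answers of several sampled CoT responses
    (vote on answers only, not the reasoning traces)."""
    return Counter(answers).most_common(1)[0][0]

# Hypothetical final answers parsed from 5 sampled responses:
samples = ["42", "42", "17", "42", "17"]
print(self_consistency(samples))  # 42
```

The answer-parsing step is where real implementations spend their effort; the vote is the easy part.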
Topic 04
Fine-tuning: LoRA & QLoRA
Train 0.1% of parameters, get 99% of the result
Full fine-tuning updates all 7-70 billion parameters. With Adam in mixed precision, weights, gradients, and optimizer states together cost roughly 16 bytes per parameter: around 112GB of GPU memory for a 7B model and over 1TB for 70B. LoRA (Low-Rank Adaptation) injects small trainable matrices into the attention layers while freezing the original weights.
The key insight: weight updates during fine-tuning tend to be low-rank — meaning the update matrix ΔW can be decomposed as two small matrices: ΔW = A × B where A is d×r and B is r×d, with r ≪ d (r=8 or 16 typically).
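The arithmetic behind the "0.1% of parameters" claim, using an illustrative hidden size (d=4096 is common for 7B-class models):

```python
def lora_trainable_fraction(d, r):
    """Trainable params of one LoRA pair (A: d x r, B: r x d)
    relative to the frozen d x d weight matrix it adapts."""
    lora = 2 * d * r
    full = d * d
    return lora, full, lora / full

lora, full, frac = lora_trainable_fraction(d=4096, r=8)
print(lora, full, f"{frac:.2%}")  # 65536 16777216 0.39%
```

Summed over every adapted matrix in the model, the trainable share typically lands well under 1% of total parameters.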
70B fine-tuning on 2 consumer GPUs
QLoRA combines 4-bit quantization with LoRA. The frozen base model is quantized to 4-bit (NF4 format), reducing memory 4×. LoRA adapters remain in full precision. This enables fine-tuning a 70B model on two consumer RTX 4090 GPUs (48GB total VRAM).
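The memory win from quantizing the frozen weights is straightforward to estimate. This counts base weights only and ignores activations, LoRA adapters, and NF4's quantization constants:

```python
def base_model_memory_gb(n_params, bits):
    """Memory for the frozen base weights alone, in GB."""
    return n_params * bits / 8 / 1e9

print(base_model_memory_gb(70e9, 16))  # 140.0 (fp16)
print(base_model_memory_gb(70e9, 4))   # 35.0  (4-bit NF4)
```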
Fine-tune only when you need a consistent style or format, or when dealing with highly specialized domain knowledge. For most tasks, good prompt engineering gets you 80-90% of the way there at zero cost. Fine-tuning to "inject knowledge" rarely works — use RAG instead.
Topic 05
Building Production LLM Applications
| Layer | Technology choices | Purpose |
|---|---|---|
| LLM Provider | OpenAI, Anthropic, Groq, Together.ai | The model itself |
| Orchestration | LangChain, LlamaIndex, LangGraph | Chains, agents, RAG pipelines |
| Vector DB | Pinecone, Chroma, Weaviate, pgvector | Semantic search for RAG |
| Evaluation | RAGAS, LangSmith, PromptFoo | Measure quality at scale |
| Observability | LangFuse, Helicone | Latency, cost, error tracking |
| Caching | Redis, GPTCache | Cut cost on repeated queries |
- Implement retry logic with exponential backoff — LLM APIs rate-limit aggressively
- Cache frequent LLM calls (semantic caching — even similar queries hit cache)
- Set per-user token budgets to prevent abuse
- Log every LLM call: prompt, response, latency, cost, user ID
- Have fallback models: if OpenAI is down → try Claude → try Groq
- Evaluate with LLM-as-judge on a fixed test set weekly to catch regressions
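The first bullet is the one most teams get wrong, so here is a minimal sketch. `RateLimitError` is a placeholder for whatever your provider's SDK actually raises (typically on HTTP 429); the injectable `sleep` just keeps the demo instant.

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for the provider SDK's rate-limit exception."""

def call_with_backoff(fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Retry fn() with exponential backoff plus jitter: ~1s, ~2s, ~4s, ..."""
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error to the caller
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)  # jitter
            sleep(delay)

# Demo: a fake LLM call that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitError()
    return "ok"

print(call_with_backoff(flaky, sleep=lambda s: None))  # ok (after 2 retries)
```

The jitter matters: without it, every client that got rate-limited at the same moment retries at the same moment too.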
Topic 06
Hallucinations, Safety & Reliability
LLMs hallucinate with confidence — they'll invent citations, describe events that never happened, and provide wrong medical advice in a reassuring tone. This is a fundamental limitation of the architecture, not a bug that will be "fixed."
Mitigation strategies:
- RAG: Ground every response in retrieved documents. Force the model to cite sources
- Lower temperature: T=0 for factual tasks, higher T for creative tasks
- Constitutional AI / system prompts: Explicit rules about what the model should and shouldn't do
- Human-in-the-loop: For high-stakes decisions (medical, legal, financial), always require human review
- Uncertainty quantification: Ask the model to express confidence ("I'm not sure about this, but..."); self-reported confidence is imperfect, but it flags answers worth double-checking
Cost = (input tokens × input price) + (output tokens × output price). Output tokens are 3-5× more expensive than input. Constrain output length explicitly. Use smaller, faster models (GPT-4o-mini, Haiku) for simple tasks and reserve large models (GPT-4o, Claude Opus) for complex reasoning. A 10× cost difference often gives only 5-10% quality improvement.
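Putting numbers on the formula (the per-million-token prices here are made-up placeholders; check your provider's current price sheet):

```python
# Hypothetical (input, output) prices in $ per 1M tokens.
PRICES = {"small": (0.15, 0.60), "large": (2.50, 10.00)}

def call_cost(model, input_tokens, output_tokens):
    """Cost = input tokens x input price + output tokens x output price."""
    p_in, p_out = PRICES[model]
    return (input_tokens * p_in + output_tokens * p_out) / 1e6

# Same workload, two model tiers:
print(round(call_cost("small", 2000, 500), 6))  # 0.0006
print(round(call_cost("large", 2000, 500), 6))  # 0.01
```

Run at volume, that gap is the difference between a rounding error and a real line item, which is why routing simple tasks to the small tier pays off.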