Natural Language Processing (NLP) is the field that makes computers understand, interpret, and generate human language. For most of AI history, NLP was a graveyard of clever heuristics and hand-crafted rules. Then word embeddings came along, then RNNs, and then — in 2017 — a paper titled "Attention is All You Need" changed everything. Today, every major language model from ChatGPT to Gemini to Claude is built on the Transformer architecture described in that paper.
In this guide, we'll build your understanding from the ground up: starting with how raw text gets processed, building up through embeddings and sequence models, and finishing with the modern stack of Transformers, fine-tuning, RAG, and AI Agents. Every concept is connected to the next one.
Section 1
The NLP Pipeline: Raw Text to Prediction
Before any model sees text, that text goes through a processing pipeline. Think of it like a factory assembly line — raw material enters one end, and a structured representation comes out the other.
The five stages are: Raw Text (the original string) → Tokenization (split into tokens) → Embedding (convert tokens to vectors) → Model (process vectors through layers) → Output (prediction, generated text, or embeddings). Understanding each stage is crucial because bugs can creep in at any point.
Section 2
Why Text is Hard for Computers
Computers love numbers. Text, unfortunately, is not numbers. Worse, text is deeply ambiguous in ways that are effortless for humans but brutal for machines.
Consider the word "bank":
- "I walked along the river bank" — it's geography
- "I need to go to the bank to deposit money" — it's a financial institution
- "You can bank on me" — it means "rely on"
Same four letters, three completely different meanings. Humans resolve this automatically using context. Teaching a machine to do this was the core problem of NLP for decades.
Other challenges include: morphology ("run", "running", "ran" are the same concept but different strings), syntax (word order changes meaning), coreference ("The trophy didn't fit in the suitcase because it was too big" — what is "it"?), and pragmatics (sarcasm, idioms, cultural context).
The history of NLP is essentially a story of finding better ways to represent meaning. From one-hot encoding (useless, loses all relationships) to word vectors (captures analogy) to contextual embeddings (same word, different meaning based on context). Each generation solved the previous generation's biggest failure.
Section 3
Tokenization: How Text Gets Broken Down
Whitespace Tokenization
The naive approach: split on spaces. "I love NLP" becomes ["I", "love", "NLP"]. Works for simple cases but completely fails with "don't" (is it "don't" or "do" + "n't"?), punctuation, and non-English languages like Chinese which have no spaces at all.
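A minimal sketch makes the failure modes concrete (the helper name `whitespace_tokenize` is ours, not a library function):

```python
# Naive whitespace tokenization: fine for simple text,
# but contractions and punctuation stick to the neighboring word.
def whitespace_tokenize(text):
    return text.split()

print(whitespace_tokenize("I love NLP"))        # ['I', 'love', 'NLP']
print(whitespace_tokenize("Don't stop, now!"))  # ["Don't", 'stop,', 'now!']
```

Note how "stop," and "now!" become single tokens, punctuation included. A real tokenizer has to handle these cases explicitly.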
Byte-Pair Encoding (BPE)
BPE is the algorithm used by GPT-2, GPT-3, GPT-4, and many others. The idea is elegant: start with individual characters, then iteratively merge the most frequent pairs into new tokens. After enough merges, common words become single tokens while rare words get split into subword pieces.
Start: ["l", "o", "w", "e", "r"] appears frequently alongside ["l", "o", "w"]
BPE merges "l"+"o" → "lo", then "lo"+"w" → "low", then "low"+"e" → "lowe", then "lowe"+"r" → "lower"
Result: "lower" is one token. But "lowering" might be ["lower", "ing"]. The model handles rare and new words gracefully by decomposing them.
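The merge loop above can be sketched end-to-end. This toy implementation (all names are ours) learns merges from a tiny word-frequency corpus; with these frequencies, the first merges come out exactly as described:

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE merges from a {word: frequency} dict.
    Each word starts as a tuple of single characters."""
    vocab = {tuple(w): f for w, f in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge everywhere it occurs.
        new_vocab = {}
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges, vocab

merges, vocab = bpe_merges({"lower": 5, "low": 7, "lowest": 2}, num_merges=3)
print(merges)  # [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

After three merges, "low" is already a single token while "lowest" is left as subword pieces, which is exactly the graceful-decomposition behavior described above.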
SentencePiece
Used by models like T5 and many multilingual models. Unlike BPE, which typically operates on pre-tokenized, whitespace-split text, SentencePiece treats the input as a raw stream of characters including spaces (represented as a special character ▁). This makes it language-agnostic — it works for Japanese, Chinese, Arabic, and any other language without modification.
Always use the same tokenizer that was used to train the model. BERT uses WordPiece, GPT-2 uses BPE, T5 uses SentencePiece. Mixing them is like translating a book into French and then trying to read it with a Spanish dictionary — nothing will line up.
Section 4
Word2Vec and GloVe: Words as Points in Space
The Core Idea of Word Embeddings
Instead of representing "king" as an integer ID, what if we represent it as a 300-dimensional vector of real numbers? The key insight is: words that appear in similar contexts should have similar vectors. This is the distributional hypothesis — "You shall know a word by the company it keeps."
Word2Vec achieves this by training a tiny neural network to predict: given the word "king", what words are likely to appear nearby? The hidden layer weights become the embeddings. The actual prediction task is discarded — we only keep the weights. This is called a self-supervised learning task: no human labels needed, just raw text.
After training on billions of words, Word2Vec learns geometric relationships:
king − man + woman ≈ queen
This isn't magic — it's geometry. The vector difference between "king" and "man" captures the concept of royalty. Add that direction to "woman" and you land near "queen". Other analogies: Paris − France + Italy ≈ Rome, walking − walk + swim ≈ swimming.
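To see that the analogy really is geometry, here is a toy example with hand-picked 2-D vectors (real embeddings are learned and have hundreds of dimensions, so treat this purely as an illustration):

```python
import math

# Toy 2-D "embeddings": axis 0 = royalty, axis 1 = gender.
# Hand-picked for illustration, not learned from data.
vecs = {
    "king":   [0.9,  0.7],
    "queen":  [0.9, -0.7],
    "man":    [0.1,  0.7],
    "woman":  [0.1, -0.7],
    "prince": [0.8,  0.7],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# king - man + woman, computed component-wise
target = [k - m + w for k, m, w in zip(vecs["king"], vecs["man"], vecs["woman"])]

# Nearest neighbour (excluding the query words) lands on "queen".
best = max((w for w in vecs if w not in ("king", "man", "woman")),
           key=lambda w: cosine(target, vecs[w]))
print(best)  # queen
```

Subtracting "man" removes the gender direction, adding "woman" restores it with the opposite sign, and the royalty component carries through untouched.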
Word2Vec: Two Training Approaches
CBOW (Continuous Bag of Words): Given context words ["The", "cat", "on", "the", "mat"], predict the center word "sat". Faster to train, works better for frequent words.
Skip-gram: Given the center word "sat", predict context words. Slower, but works better for rare words and produces better embeddings overall. Most implementations use Skip-gram with negative sampling (SGNS).
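Generating Skip-gram training pairs takes only a few lines. This sketch (our own helper, window size 1 for brevity) shows the (center, context) pairs the model would train on:

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs for Skip-gram."""
    pairs = []
    for i, center in enumerate(tokens):
        # Every token within `window` positions of the center is a context word.
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs(["the", "cat", "sat"], window=1))
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
```

Each pair becomes one training example: predict the context word from the center word. Negative sampling then adds random non-context words as negative examples.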
GloVe: Global Vectors
Word2Vec looks at local windows of text — it only sees words within a few positions of each target word. GloVe (Global Vectors for Word Representation, from Stanford) takes a different approach: build a giant co-occurrence matrix counting how often every word pair appears together across the entire corpus, then factorize this matrix into word vectors.
In practice, GloVe and Word2Vec produce similarly useful embeddings. GloVe tends to be slightly better at linear analogies because it explicitly optimizes for the ratio of co-occurrence probabilities.
The Fatal Flaw: Static Embeddings
Both Word2Vec and GloVe give every word exactly one vector, forever. This means "bank" always gets the same vector regardless of context. This is a fundamental limitation that contextual models like BERT were designed to fix.
"I went to the bank to deposit money" and "I sat by the bank of the river" — with Word2Vec, "bank" gets the same vector in both sentences. BERT gives "bank" a completely different vector in each context. This is why contextual embeddings are so much more powerful for understanding tasks.
Section 5
RNNs and LSTMs: Sequential Reading Machines
Recurrent Neural Networks (RNNs)
An RNN reads text sequentially — one word at a time — while maintaining a "hidden state" that acts like short-term memory. Think of it like reading a book: as you read each word, your understanding updates. When you finish a sentence, your mental state captures its meaning.
At each time step t, the new hidden state h_t depends on the previous hidden state h_(t-1) and the current input x_t. This is the recurrence relation that gives RNNs their name. The final hidden state h_T theoretically summarizes the entire sequence.
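The recurrence can be written out directly. A sketch of one vanilla-RNN step in plain Python, with made-up 2-dimensional weights (real implementations use matrix libraries and learned parameters):

```python
import math

def rnn_step(h_prev, x_t, W_h, W_x, b):
    """One vanilla-RNN update: h_t = tanh(W_h @ h_prev + W_x @ x_t + b)."""
    return [
        math.tanh(
            sum(W_h[i][j] * h_prev[j] for j in range(len(h_prev)))
            + sum(W_x[i][j] * x_t[j] for j in range(len(x_t)))
            + b[i]
        )
        for i in range(len(b))
    ]

# Hypothetical fixed weights: 2-dim hidden state, 2-dim inputs.
W_h = [[0.5, 0.0], [0.0, 0.5]]
W_x = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]

h = [0.0, 0.0]
for x in [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]:  # a 3-step "sequence"
    h = rnn_step(h, x, W_h, W_x, b)
print(h)  # the final hidden state summarizes the whole sequence
```

The same `rnn_step` function is reused at every position — that weight sharing is what "recurrent" means.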
The Vanishing Gradient Problem
Here's the critical failure of basic RNNs: during backpropagation, gradients must travel back through every time step. If the gradient gets multiplied at each step by a factor whose magnitude is below 1 (which the tanh activation's derivative tends to produce), it shrinks exponentially. After 50 steps, the gradient is essentially zero — the model has completely forgotten the beginning of the sequence.
Think of it like a game of telephone: you whisper a message through 50 people. By the end, the original message is unrecognizable. RNNs have the same problem — they simply cannot learn long-range dependencies like "The keys that were on the table near the window in the living room of the old house [were/was] rusty" (which "were" or "was" — depends on "keys" 20 words back).
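The exponential shrinkage is easy to see numerically. Assuming a representative per-step gradient factor of 0.9:

```python
# Backprop through T steps multiplies T per-step factors together.
# With |factor| < 1, the product vanishes exponentially in T.
factor = 0.9
for T in (10, 50, 100):
    print(T, factor ** T)
# By T=50 the product is below 0.01, by T=100 below 0.0001:
# the training signal from early tokens is effectively gone.
```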
LSTMs: Long Short-Term Memory
LSTMs (Hochreiter & Schmidhuber, 1997) solve the vanishing gradient problem with a brilliant architectural trick: a separate "cell state" that runs alongside the hidden state, with explicit mechanisms (gates) controlling what information flows through.
- Forget gate: "Should I erase anything from my memory?" — sigmoid(W_f * [h_{t-1}, x_t]). Output near 0 = forget, near 1 = keep. Example: when reading a new paragraph, forget the previous paragraph's topic.
- Input gate: "What new information should I store?" — sigmoid(W_i * [...]) controls which values to update. tanh(W_c * [...]) creates candidate values to add.
- Output gate: "What should I output from my memory right now?" — controls which parts of the cell state become the hidden state h_t passed to the next layer or output.
The cell state flows through with only linear interactions (addition and multiplication by gates), allowing gradients to flow much more easily over long distances. LSTMs can handle sequences of hundreds of tokens where vanilla RNNs fail completely.
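The gate mechanics can be sketched for a 1-dimensional state. Scalar weights stand in for the real weight matrices here, so this is an illustration of the data flow, not a trainable cell:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(c_prev, h_prev, x, p):
    """One scalar LSTM step. p holds illustrative weights for the
    forget, input, candidate, and output computations."""
    z = h_prev + x                   # stand-in for W @ [h_prev, x]
    f = sigmoid(p["f"] * z)          # forget gate: how much of c_prev to keep
    i = sigmoid(p["i"] * z)          # input gate: how much new info to write
    c_tilde = math.tanh(p["c"] * z)  # candidate values to add
    o = sigmoid(p["o"] * z)          # output gate: what to expose
    c = f * c_prev + i * c_tilde     # cell state: only gated add/multiply,
                                     # so gradients flow over long distances
    h = o * math.tanh(c)             # hidden state passed onward
    return c, h

c, h = 0.0, 0.0
for x in [1.0, -0.5, 0.3]:
    c, h = lstm_step(c, h, x, {"f": 1.0, "i": 1.0, "c": 1.0, "o": 1.0})
print(c, h)
```

The key line is the cell-state update: `c_prev` is never squashed through a tanh, only scaled by the forget gate and added to, which is why gradients survive.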
Despite being incredibly clever, LSTMs are still fundamentally sequential — you can't process word 50 until you've processed words 1 through 49. This makes them slow to train and limits their ability to capture very long-range dependencies. The Transformer architecture eliminates this bottleneck entirely.
Section 6
Attention Mechanism: The Game Changer
The Core Intuition
Instead of forcing the model to compress an entire sentence into a single hidden state (the bottleneck problem), attention allows the model to look at ALL positions in the input simultaneously when producing each output token — and weight them by relevance.
Imagine you're translating "The animal didn't cross the street because it was too tired." When you write the French word for "it" (which must agree in gender with what "it" refers to), your brain immediately looks back and connects "it" to "animal". Attention gives models this same ability — a direct connection between any two positions.
Self-Attention: Scaled Dot-Product
In self-attention, every word asks: "How relevant is every other word to me?" Each input token is projected into three vectors:
- Query (Q): "What am I looking for?"
- Key (K): "What do I offer?"
- Value (V): "What is my actual content?"
For each query, we compute dot products with all keys (measuring relevance), divide by √d_k so the dot products don't grow with dimension and push the softmax into regions with vanishingly small gradients, apply softmax to get a probability distribution (attention weights), then take a weighted sum of the values. The result: each token gets a new representation that's a blend of all other tokens, weighted by how relevant they are.
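The whole computation fits in a short function. A plain-Python sketch of scaled dot-product self-attention (no batching, no learned projection matrices):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on plain lists.
    Q, K, V have one row per token; returns one blended row per query."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Relevance of every key to this query, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)  # attention distribution over all tokens
        # Weighted sum of the value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# 3 tokens, d_k = 2; self-attention uses the same matrix for Q, K, and V.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(attention(X, X, X))  # each row is a relevance-weighted blend of all rows
```

In a real Transformer, Q, K, and V are produced by multiplying the input by three learned projection matrices before this function runs.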
Attention Visualization — "The bank near the river"
Multi-Head Attention
Running attention once gives you one "perspective" on the relationships. Multi-head attention runs attention h times in parallel (typically h=8 or h=12), each with different learned projection matrices. This lets different heads capture different types of relationships simultaneously:
- Head 1 might focus on syntactic relationships (subject-verb agreement)
- Head 2 might track coreference (pronoun → antecedent)
- Head 3 might capture semantic similarity
- Head 4 might attend to positional proximity
The outputs of all heads are concatenated and projected back to the model dimension. You get the richness of h different analysis lenses for the price of one forward pass.
Section 7
The Transformer Architecture
The Transformer, introduced in "Attention is All You Need" (Vaswani et al., 2017), replaces recurrence entirely with attention. The result: a model that can process all tokens in parallel (huge training speedup) and captures long-range dependencies directly (no vanishing gradient over distance).
Positional Encoding: Adding Order Back
Attention is inherently order-agnostic — shuffle the input tokens and each token's output vector is unchanged, so "cat sat mat" and "mat sat cat" look the same to the model. That's a problem since word order matters enormously in language. The solution: add positional information to the embeddings before feeding them to the Transformer.
The original paper uses sinusoidal functions at different frequencies — even positions and odd positions get different formulas. The result is a unique position vector for every position that smoothly interpolates and generalizes to sequences longer than those seen during training. Modern models often use learned positional embeddings or Rotary Position Embeddings (RoPE) instead.
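The sinusoidal scheme takes only a few lines to compute. A sketch following the formulas from the paper, PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)):

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal position vector: each sin/cos pair uses a different
    frequency, from fast-varying (low i) to slow-varying (high i)."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

# Every position gets a distinct vector; position 0 is [0, 1, 0, 1, ...].
print(positional_encoding(0, 8))
print(positional_encoding(5, 8))
```

Because the functions are smooth in `pos`, nearby positions get similar vectors, and the formula extends to any position without retraining.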
Encoder vs Decoder
| Component | Encoder | Decoder |
|---|---|---|
| Purpose | Understand input text | Generate output text |
| Attention type | Full bidirectional self-attention | Masked self-attention + cross-attention to encoder |
| Can see | All positions (past + future) | Only past positions (causal mask) |
| Example models | BERT, RoBERTa | GPT-2, GPT-3, GPT-4 |
| Best for | Classification, NER, Q&A extraction | Text generation, summarization |
T5 and BART combine both halves — an encoder reads the input and a decoder generates the output (encoder-decoder models).
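The decoder's causal mask from the comparison above is just a lower-triangular matrix over positions. A minimal sketch:

```python
def causal_mask(n):
    """n x n mask: True where a query position may attend (keys <= query).
    In practice the decoder adds -inf to masked scores before the softmax,
    so those positions get zero attention weight."""
    return [[k <= q for k in range(n)] for q in range(n)]

for row in causal_mask(4):
    print(row)
# [True, False, False, False]
# [True, True, False, False]
# [True, True, True, False]
# [True, True, True, True]
```

Row q is the set of positions token q may look at: only itself and everything before it, never the future.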
A Single Transformer Layer
Each Transformer layer has the same structure:
- Multi-Head Self-Attention — each token attends to all other tokens
- Add & Norm — residual connection (x + attention_output) followed by LayerNorm. Residual connections are critical for training deep networks — gradients flow directly back without degrading.
- Feed-Forward Network — two linear layers with a ReLU/GELU in between. Applied independently to each position. This is where most of the "knowledge" is stored (the FFN layers are much larger than the attention layers).
- Add & Norm again — another residual + LayerNorm
This layer is stacked N times (BERT-base: 12, GPT-3: 96, GPT-4: rumored 120). More layers = more capacity to learn complex patterns.
Section 8
BERT vs GPT: Understanding vs Generation
BERT: Bidirectional Encoder Representations from Transformers
BERT (Google, 2018) uses only the Transformer encoder. During pretraining, it learns from two tasks:
- Masked Language Modeling (MLM): 15% of tokens are selected for prediction — most are replaced with a special [MASK] token, a few with random or unchanged tokens — and BERT must recover the originals from context. Crucially, since it's an encoder, it can look at tokens on BOTH sides of the mask simultaneously — this is why it's "bidirectional".
- Next Sentence Prediction (NSP): Given two sentences, predict if the second follows the first. (Later research found NSP isn't very useful — RoBERTa drops it entirely.)
BERT produces rich contextual embeddings — "bank" in a financial sentence vs "bank" in a river sentence gets completely different vectors. This made BERT state-of-the-art on almost every NLP benchmark when it was released.
GPT: Generative Pretrained Transformer
GPT (OpenAI) uses only the Transformer decoder (without the cross-attention to an encoder — it's a decoder-only model). During pretraining, it performs next token prediction: given tokens 1 through t, predict token t+1. The causal mask ensures it can only attend to previous tokens, making this autoregressive generation possible.
GPT-2 (2019) showed that a sufficiently large language model could generate coherent text so convincingly that OpenAI initially withheld the full model citing misuse concerns. GPT-3 (175B parameters, 2020) showed remarkable few-shot learning — giving it 3 examples is often enough to adapt to a new task without any weight updates. GPT-4 (2023) incorporated vision and significantly improved reasoning.
| Aspect | BERT | GPT-4 | T5 |
|---|---|---|---|
| Architecture | Encoder-only | Decoder-only | Encoder-Decoder |
| Training objective | Masked LM | Next token prediction | Text-to-text (mask spans) |
| Attention direction | Bidirectional | Causal (left-to-right) | Both |
| Best tasks | Classification, NER, QA | Generation, chat, code | Translation, summarization |
| Inference cost | Low (1 forward pass) | High (autoregressive) | Medium |
| Context window | 512 tokens | 128K tokens | Varies |
Section 9
Fine-Tuning with Hugging Face
Hugging Face is the GitHub of AI models. They host over 500,000 models and provide the transformers library that makes fine-tuning state-of-the-art models accessible in less than 20 lines of code. The workflow is: load pretrained model → tokenize your dataset → fine-tune → save and push to Hub.
During pretraining on billions of tokens, BERT learns general language understanding — grammar, semantics, world knowledge. Fine-tuning only adjusts these weights slightly (with a very small learning rate like 2e-5) to adapt to your specific task. You're not teaching the model language from scratch — you're giving it a narrow new skill. This is why you need only thousands of labeled examples, not millions.
Using a learning rate that's too high (like 1e-3) during fine-tuning causes "catastrophic forgetting" — the model rapidly overwrites its pretrained knowledge and performs worse than a linear classifier. Stick to 1e-5 to 5e-5 for fine-tuning transformers. Also: always warm up the learning rate for the first 10% of training steps.
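A warmup schedule can be sketched in a few lines. The helper name `lr_at_step` and the linear-decay shape are our own illustrative choices; real trainers offer several decay curves:

```python
def lr_at_step(step, total_steps, peak_lr=2e-5, warmup_frac=0.10):
    """Linear warmup for the first 10% of steps, then linear decay to 0.
    A sketch of one common fine-tuning schedule; exact shapes vary."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Ramp up from 0 so early noisy gradients don't wreck pretrained weights.
        return peak_lr * step / warmup_steps
    # Decay linearly from the peak down to 0 at the final step.
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)

total = 1000
print(lr_at_step(0, total))     # 0.0   (start of warmup)
print(lr_at_step(100, total))   # 2e-05 (peak, end of warmup)
print(lr_at_step(1000, total))  # 0.0   (end of training)
```

The gentle ramp-up is what protects the pretrained weights from the large, noisy gradients of the first few batches.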
Section 10
RAG: Retrieval-Augmented Generation
Large language models bake knowledge into their weights during pretraining. This creates two problems: (1) they can't know about events after their training cutoff, and (2) they hallucinate confidently — if they don't know something, they make it up rather than saying "I don't know".
RAG (Retrieval-Augmented Generation, Facebook/Meta 2020) solves this elegantly: instead of relying on memorized knowledge, retrieve relevant documents at query time and include them in the context. The model's job changes from "remember and answer" to "read and answer".
The RAG Pipeline in Detail
Offline (Indexing): Chunk your documents into pieces of ~256-512 tokens. Embed each chunk using an embedding model (e.g., text-embedding-3-small from OpenAI, or BAAI/bge-m3 open-source). Store the vectors in a vector database (Pinecone, Weaviate, pgvector, Chroma).
Online (Querying): Embed the user's query using the same embedding model. Run Approximate Nearest Neighbor (ANN) search to retrieve the top-K most similar chunks. Inject those chunks into the LLM prompt as context. The LLM reads the retrieved context and generates a grounded answer.
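The whole retrieve-then-read loop can be miniaturized with a toy bag-of-words "embedding" standing in for a real embedding model. The chunks, helper names, and scoring below are all illustrative:

```python
import math
import re

def embed(text):
    """Toy word-count 'embedding'. A real system would call an
    embedding model like the ones named above."""
    vec = {}
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        vec[word] = vec.get(word, 0) + 1
    return vec

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Offline: a tiny "indexed" document collection.
chunks = [
    "The refund policy allows returns within 30 days",
    "Our office is located in Berlin",
    "Shipping takes 3 to 5 business days",
]

def retrieve(query, k=1):
    """Online: top-k chunks by similarity; in a real RAG system
    these get pasted into the LLM prompt as grounding context."""
    index = [(embed(c), c) for c in chunks]
    scored = sorted(index, key=lambda e: cosine(embed(query), e[0]),
                    reverse=True)
    return [c for _, c in scored[:k]]

print(retrieve("what is the refund policy for returns?"))
```

A production system swaps the word-count vectors for dense embeddings and the sorted list for an ANN index, but the shape of the pipeline is exactly this.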
- Use RAG when: knowledge changes frequently (product docs, news, internal wikis), you need citations/sources, documents are too large to fit in context
- Use Fine-tuning when: you need a specific writing style or output format, you have thousands of task-specific examples, you want to reduce latency (smaller fine-tuned model)
- Use both when: you want the format of fine-tuning + the freshness of RAG
Section 11
AI Agents and the ReAct Loop
A single LLM call can answer questions, but it can't check a live stock price, run Python code, or book a flight. AI Agents solve this by giving the LLM access to tools and having it decide, step by step, which tool to use.
The ReAct Framework: Reason + Act
ReAct (Yao et al., 2022) interleaves reasoning traces and action calls. The agent generates a thought ("I need to check the weather in London"), then an action (call weather_api("London")), receives an observation ({"temp": "12°C", "weather": "cloudy"}), generates another thought, and continues until it has a final answer.
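The loop can be sketched with a scripted stand-in for the LLM and a fake `weather_api` tool (both hypothetical), so only the Thought/Action/Observation control flow is real:

```python
def weather_api(city):
    """Hypothetical tool with canned data."""
    return {"London": "12°C, cloudy"}.get(city, "unknown")

TOOLS = {"weather_api": weather_api}

def decide(question, observations):
    """Stand-in for the LLM: returns (thought, action, answer).
    A real agent would prompt a model to produce these."""
    if not observations:
        return ("I need to check the weather in London",
                ("weather_api", "London"), None)
    return ("I have the data I need", None,
            f"The weather in London is {observations[-1]}.")

def react(question, max_steps=5):
    observations = []
    for _ in range(max_steps):
        thought, action, answer = decide(question, observations)
        print("Thought:", thought)
        if answer is not None:          # the agent decides it's done
            return answer
        name, arg = action
        obs = TOOLS[name](arg)          # Act, then observe the result
        print("Action:", name, "->", obs)
        observations.append(obs)

print(react("What's the weather in London?"))
```

The `max_steps` cap matters in practice: without it, an agent that never reaches a final answer loops (and bills) forever.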
Agent Memory Systems
- In-context memory: The conversation history in the LLM's context window. Simple but limited by context length.
- External memory (vector store): Long-term storage of past interactions, retrieved by semantic similarity. The agent can "remember" conversations from weeks ago.
- Episodic memory: Structured logs of past actions and their outcomes. Helps agents avoid repeating mistakes.
- Tool state: State maintained in external tools (databases, files, APIs).
An LLM with tools is dramatically more capable than an LLM without. A Python REPL tool alone turns a language model into a reliable calculator and data analyst. A web search tool eliminates the knowledge cutoff problem. Function calling (structured tool use) is the key primitive — LangChain, LlamaIndex, and the Anthropic/OpenAI APIs all build on this foundation.
Section 12
Interview Tips for NLP & Transformers
- Explain attention without equations first. Say: "Every token looks at every other token and decides how much to borrow from them." Then add the Q/K/V formulation.
- Know the BERT vs GPT distinction cold. Bidirectional vs causal, MLM vs next-token prediction, understanding vs generation — this comes up in almost every ML interview.
- Know why we need positional encoding — attention is order-agnostic and we must explicitly add position information.
- Be able to sketch the RAG pipeline — retrieval → augmentation → generation, with the offline indexing step.
- For fine-tuning questions: Always mention learning rate (2e-5), overfitting risk with small datasets, and when you'd use LoRA instead of full fine-tuning (discussed in Module 7).
- Vanishing gradient question: Explain both the RNN problem AND how Transformers avoid it (no sequential recurrence, residual connections).
Q: Why does BERT use [MASK] tokens but GPT doesn't?
A: BERT is an encoder trained with masked language modeling — it needs to see the full bidirectional context, so masking random tokens forces it to learn from both left and right context simultaneously. GPT is a decoder trained with causal (autoregressive) language modeling — it predicts the NEXT token given previous tokens, which naturally matches its generation use case. Masking would break this sequential dependency.