Concept 01

What Are Embeddings? The Coordinates in Concept Space Analogy

Imagine you're watching Netflix. Every movie can be described by coordinates in a multi-dimensional space. Two dimensions you might use: "how action-packed is it?" and "how funny is it?" A comedy like The Hangover sits at (low action, high comedy). An action film like Mad Max sits at (high action, low comedy). The Dark Knight sits at (high action, medium comedy).

Now imagine instead of 2 dimensions, you have 1,536 dimensions (the size of OpenAI's text-embedding-3-small output). Each dimension captures some aspect of meaning — seriousness, formality, technical depth, emotional tone, topic domain, and thousands of other subtle qualities. A piece of text becomes a point in this 1,536-dimensional space.

The magic: texts that mean similar things end up near each other in this space. "Python tutorial for beginners" and "how to learn Python" will produce vectors that are very close together, even though they share no keywords. "Corporate merger announcement" and "acquisition press release" will also cluster together.

Intuition Check

An embedding model converts text → a list of ~1,500 numbers. That list of numbers is the embedding (or "vector"). The model was trained so that semantically similar text produces vectors that point in similar directions. It's coordinates on a map where meaning determines geography.
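To make the analogy concrete, here is a toy 2-dimensional sketch. The movie coordinates are invented for illustration, not real embeddings, but the geometry works the same way in 1,536 dimensions:

```python
import numpy as np

# Hypothetical 2D "concept space": (action, comedy), each scored 0-1
movies = {
    "The Hangover":    np.array([0.1, 0.9]),   # low action, high comedy
    "Mad Max":         np.array([0.95, 0.1]),  # high action, low comedy
    "The Dark Knight": np.array([0.9, 0.4]),   # high action, some comedy
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Mad Max and The Dark Knight point in similar directions...
print(cosine(movies["Mad Max"], movies["The Dark Knight"]))  # ≈ 0.95
# ...while The Hangover points somewhere else entirely
print(cosine(movies["Mad Max"], movies["The Hangover"]))     # ≈ 0.21
```

Real embedding models do the same thing, except the axes aren't hand-labeled: the model learns thousands of dimensions of meaning from data.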

Concept 02

OpenAI Embeddings API — Getting Your First Vectors

from openai import OpenAI
from typing import Union
import numpy as np

client = OpenAI()

def get_embedding(
    text: Union[str, list[str]],
    model: str = "text-embedding-3-small",
) -> Union[list[float], list[list[float]]]:
    """
    Get embeddings for one or more texts.
    text-embedding-3-small: 1536 dims, $0.02/1M tokens
    text-embedding-3-large: 3072 dims, $0.13/1M tokens — better quality
    text-embedding-ada-002: 1536 dims, legacy but still widely used
    """
    # Normalize input — API accepts a list
    texts = [text] if isinstance(text, str) else text

    # Clean the texts — newlines hurt embedding quality
    texts = [t.replace("\n", " ").strip() for t in texts]

    response = client.embeddings.create(
        model=model,
        input=texts,
        # dimensions=512,  # Optional: reduce dimensions for storage/speed tradeoff
        encoding_format="float",  # "float" or "base64"
    )

    embeddings = [item.embedding for item in response.data]

    # Print some metadata
    print(f"Model: {response.model}")
    print(f"Embedding dimensions: {len(embeddings[0])}")
    print(f"Tokens used: {response.usage.total_tokens}")

    return embeddings[0] if isinstance(text, str) else embeddings

# Single embedding
embedding = get_embedding("How do I reset my password?")
print(f"Vector shape: {len(embedding)} dimensions")
print(f"First 5 values: {embedding[:5]}")

# Batch embeddings — more efficient, fewer API calls
texts = [
    "How do I reset my password?",
    "I forgot my login credentials",
    "What are the system requirements?",
    "How do I cancel my subscription?",
]
embeddings = get_embedding(texts)
print(f"Got {len(embeddings)} embeddings")

Cost Awareness

Embeddings are cheap but not free. At $0.02/1M tokens for text-embedding-3-small, embedding 1 million support tickets (~500 tokens each) costs about $10. But embedding every user search query in real-time adds up. Cache embeddings for documents; embed queries in real-time.
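The back-of-envelope math above can be captured in a small helper. The prices are hardcoded from the rates quoted in this section; verify current pricing before relying on it:

```python
# Per-1M-token prices (USD) as quoted above -- check current rates before use
PRICE_PER_1M_TOKENS = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
}

def estimate_embedding_cost(
    num_texts: int,
    avg_tokens_per_text: int,
    model: str = "text-embedding-3-small",
) -> float:
    """Rough USD cost to embed a corpus once."""
    total_tokens = num_texts * avg_tokens_per_text
    return total_tokens / 1_000_000 * PRICE_PER_1M_TOKENS[model]

# 1M support tickets at ~500 tokens each with the small model:
print(f"${estimate_embedding_cost(1_000_000, 500):.2f}")  # → $10.00
```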

Concept 03

Cosine Similarity — How We Measure "How Similar Are These?"

Given two vectors, we need a way to measure how similar they are. We use cosine similarity: the cosine of the angle between the two vectors.

  • Cosine similarity = 1.0 — vectors point in the exact same direction. Texts are semantically identical.
  • Cosine similarity = 0.9+ — very similar meaning. "How do I reset my password?" and "password reset instructions" would score here.
  • Cosine similarity = 0.7–0.9 — related but different. "Python tutorial" and "Python documentation."
  • Cosine similarity ≈ 0.0 — unrelated. "How to bake bread" and "quantum physics equations." In practice, scores for unrelated texts hover near zero rather than hitting it exactly.
  • Cosine similarity = -1.0 — vectors point in opposite directions (rarely occurs with real text embeddings). Note that the exact ranges vary by embedding model; treat these bands as illustrative, not universal thresholds.

import numpy as np

def cosine_similarity(vec_a: list[float], vec_b: list[float]) -> float:
    """
    Compute cosine similarity between two embedding vectors.
    Returns a value between -1 and 1. Higher = more similar.
    """
    a = np.array(vec_a)
    b = np.array(vec_b)

    # Dot product divided by product of magnitudes
    dot_product = np.dot(a, b)
    magnitude_a = np.linalg.norm(a)
    magnitude_b = np.linalg.norm(b)

    if magnitude_a == 0 or magnitude_b == 0:
        return 0.0

    return float(dot_product / (magnitude_a * magnitude_b))

# Test it with real embeddings
query = "I can't log into my account"
documents = [
    "How to reset your password",
    "Login troubleshooting guide",
    "How to upgrade your subscription",
    "Our refund policy",
    "Sign in issues and fixes",
]

query_embedding = get_embedding(query)
doc_embeddings = get_embedding(documents)

# Compute similarity scores
scores = []
for doc, doc_emb in zip(documents, doc_embeddings):
    similarity = cosine_similarity(query_embedding, doc_emb)
    scores.append((similarity, doc))

# Sort by similarity (highest first)
scores.sort(reverse=True)

print(f"Query: '{query}'")
print("\nRanked results:")
for score, doc in scores:
    print(f"  {score:.4f}  {doc}")

Concept 04

Semantic Search from Scratch — The Complete Pipeline

import numpy as np
from dataclasses import dataclass
from typing import Optional

@dataclass
class Document:
    id: str
    text: str
    metadata: dict
    embedding: Optional[list[float]] = None

class SemanticSearchEngine:
    """
    Simple semantic search engine using numpy.
    Great for corpora up to ~50,000 documents.
    Use a vector database (ChromaDB, Pinecone) for larger corpora.
    """

    def __init__(self, embedding_model: str = "text-embedding-3-small"):
        self.model = embedding_model
        self.documents: list[Document] = []
        self._embeddings_matrix: Optional[np.ndarray] = None

    def index(self, documents: list[Document], batch_size: int = 100):
        """
        Phase 1: Embed all documents and build the search index.
        This is the expensive one-time operation.
        """
        print(f"Indexing {len(documents)} documents...")
        self.documents = documents

        # Embed in batches (API has a batch size limit)
        all_embeddings = []
        for i in range(0, len(documents), batch_size):
            batch = documents[i:i + batch_size]
            texts = [doc.text for doc in batch]
            batch_embeddings = get_embedding(texts)
            # Input is always a list here, so get_embedding returns a list of vectors
            all_embeddings.extend(batch_embeddings)

            print(f"  Embedded {min(i + batch_size, len(documents))}/{len(documents)}")

        # Build numpy matrix for fast similarity computation
        self._embeddings_matrix = np.array(all_embeddings)

        # Normalize all vectors (makes cosine similarity a simple dot product)
        norms = np.linalg.norm(self._embeddings_matrix, axis=1, keepdims=True)
        self._embeddings_matrix = self._embeddings_matrix / norms

        print("Index built successfully.")

    def search(self, query: str, top_k: int = 5) -> list[dict]:
        """
        Phase 2: Search the index.
        Embed query → compute similarities → return top-k results.
        """
        if self._embeddings_matrix is None:
            raise RuntimeError("Call index() before search()")

        # Embed the query
        query_embedding = np.array(get_embedding(query))
        query_embedding = query_embedding / np.linalg.norm(query_embedding)

        # Compute similarities (dot product because vectors are normalized)
        similarities = self._embeddings_matrix @ query_embedding

        # Get top-k indices
        top_indices = np.argsort(similarities)[::-1][:top_k]

        results = []
        for idx in top_indices:
            doc = self.documents[idx]
            results.append({
                "id": doc.id,
                "text": doc.text,
                "score": float(similarities[idx]),
                "metadata": doc.metadata,
            })

        return results

# Build and use the search engine
support_docs = [
    Document("1", "How to reset your password: Go to settings, click Security", {"category": "auth"}),
    Document("2", "Login issues: If you can't sign in, try clearing your browser cache", {"category": "auth"}),
    Document("3", "Billing and invoices: Download invoices from Account > Billing", {"category": "billing"}),
    Document("4", "How to cancel your subscription at any time", {"category": "billing"}),
    Document("5", "API rate limits: You can make 100 requests per minute", {"category": "api"}),
    Document("6", "Getting started with the API: Generate your API key in Settings", {"category": "api"}),
]

engine = SemanticSearchEngine()
engine.index(support_docs)

# Search
results = engine.search("I forgot my password and can't get in", top_k=3)
for r in results:
    print(f"Score: {r['score']:.4f} | {r['text'][:60]}...")

Concept 05

ChromaDB — A Production Vector Database

import os

import chromadb
from chromadb.utils import embedding_functions

# Initialize ChromaDB
# - In-memory: for testing/development
# - Persistent: for production (saves to disk)

# Development (in-memory)
chroma_client = chromadb.Client()

# Production (persistent)
# chroma_client = chromadb.PersistentClient(path="./chroma_db")

# Use OpenAI embeddings (ChromaDB will handle embedding automatically)
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ.get("OPENAI_API_KEY"),
    model_name="text-embedding-3-small",
)

# Create a collection (like a table in a relational DB)
collection = chroma_client.create_collection(
    name="support_docs",
    embedding_function=openai_ef,
    metadata={"hnsw:space": "cosine"},  # Use cosine distance
)

# Add documents — ChromaDB handles embedding automatically
collection.add(
    documents=[
        "How to reset your password via the settings page",
        "Login troubleshooting: clear cache, check caps lock, try incognito",
        "How to download your invoices from the billing section",
        "Cancel subscription anytime from Account > Subscription",
        "API rate limits are 100 requests per minute per key",
    ],
    metadatas=[
        {"category": "auth", "source": "help_center"},
        {"category": "auth", "source": "help_center"},
        {"category": "billing", "source": "help_center"},
        {"category": "billing", "source": "help_center"},
        {"category": "api", "source": "docs"},
    ],
    ids=["doc1", "doc2", "doc3", "doc4", "doc5"],
)

print(f"Collection has {collection.count()} documents")

# Query — ChromaDB embeds your query and finds nearest neighbors
results = collection.query(
    query_texts=["I can't log in to my account"],
    n_results=3,
    # Optional: filter by metadata
    # where={"category": "auth"},
    include=["documents", "metadatas", "distances"],
)

print("\nSearch results:")
for i, (doc, meta, dist) in enumerate(zip(
    results["documents"][0],
    results["metadatas"][0],
    results["distances"][0],
)):
    similarity = 1 - dist  # Convert distance to similarity
    print(f"{i+1}. Score: {similarity:.4f} | Category: {meta['category']}")
    print(f"   {doc[:80]}...")

# Update a document
collection.update(
    ids=["doc1"],
    documents=["Updated: Reset your password by clicking 'Forgot Password' on login page"],
)

# Delete a document
# collection.delete(ids=["doc5"])

Concept 06

Vector DB vs Brute Force — When Each Approach Makes Sense

| Approach            | Corpus Size | Latency  | Infrastructure     | When to Use                                    |
|---------------------|-------------|----------|--------------------|------------------------------------------------|
| Brute force (numpy) | < 50k docs  | 10–100ms | Just numpy         | Prototypes, small corpora, in-memory use cases |
| ChromaDB (local)    | < 1M docs   | 1–10ms   | File on disk       | Single-server apps, self-hosted, medium scale  |
| Pinecone / Weaviate | 100M+ docs  | < 10ms   | Managed cloud      | High scale, multi-tenant, production SaaS      |
| pgvector (Postgres) | < 5M docs   | 5–50ms   | Postgres extension | Already using Postgres, want unified storage   |

Decision Rule

Start with brute force (numpy). When you hit 50k+ documents or when search latency becomes a problem, migrate to ChromaDB. When you hit 1M+ documents or need multi-tenant isolation, move to a managed vector database like Pinecone. Premature vector database adoption is a common over-engineering mistake.
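The rule of thumb above can be encoded as a rough helper. The thresholds come from the table in this section; treat them as starting points, not hard limits:

```python
def pick_vector_store(num_docs: int, multi_tenant: bool = False) -> str:
    """Rough backend recommendation using the thresholds from the table above."""
    if num_docs >= 1_000_000 or multi_tenant:
        return "managed vector DB (e.g. Pinecone)"
    if num_docs >= 50_000:
        return "ChromaDB (local)"
    return "brute force (numpy)"

print(pick_vector_store(10_000))     # brute force (numpy)
print(pick_vector_store(200_000))    # ChromaDB (local)
print(pick_vector_store(5_000_000))  # managed vector DB (e.g. Pinecone)
```

In a real system, latency requirements and operational constraints matter as much as document count, so measure before migrating.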

Concept 07

Three Core Use Cases: Search, Recommendations, Duplicate Detection

class EmbeddingUseCases:
    """Demonstrates the three core embedding use cases."""

    def __init__(self):
        self._embedding_cache = {}

    def embed_cached(self, text: str) -> list[float]:
        """Cache embeddings to avoid redundant API calls."""
        if text not in self._embedding_cache:
            self._embedding_cache[text] = get_embedding(text)
        return self._embedding_cache[text]

    def semantic_search(self, query: str, corpus: list[str], top_k: int = 3) -> list[tuple]:
        """Use case 1: Find documents relevant to a query."""
        query_emb = np.array(self.embed_cached(query))
        corpus_embs = np.array([self.embed_cached(t) for t in corpus])

        # Normalize
        query_emb /= np.linalg.norm(query_emb)
        corpus_embs /= np.linalg.norm(corpus_embs, axis=1, keepdims=True)

        similarities = corpus_embs @ query_emb
        top_indices = np.argsort(similarities)[::-1][:top_k]
        return [(corpus[i], float(similarities[i])) for i in top_indices]

    def find_similar_items(self, item: str, catalog: list[str], top_k: int = 5) -> list[tuple]:
        """Use case 2: Recommendation — find items similar to a given item."""
        return self.semantic_search(item, catalog, top_k + 1)[1:]  # Drop the top hit (the item itself, assuming it appears in the catalog)

    def find_duplicates(
        self,
        texts: list[str],
        similarity_threshold: float = 0.95
    ) -> list[tuple[str, str, float]]:
        """
        Use case 3: Find near-duplicate texts.
        Returns pairs of (text1, text2, similarity_score).
        """
        embeddings = np.array([self.embed_cached(t) for t in texts])
        norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
        embeddings = embeddings / norms

        # Compute all pairwise similarities
        similarity_matrix = embeddings @ embeddings.T

        duplicates = []
        for i in range(len(texts)):
            for j in range(i + 1, len(texts)):
                similarity = float(similarity_matrix[i, j])
                if similarity >= similarity_threshold:
                    duplicates.append((texts[i], texts[j], similarity))

        return sorted(duplicates, key=lambda x: x[2], reverse=True)

# Demo: duplicate detection
uc = EmbeddingUseCases()
support_tickets = [
    "I can't log in to my account",
    "Unable to sign in — keeps saying wrong password",
    "Login not working for me",
    "How do I upgrade my plan?",
    "I want to switch to a higher tier subscription",
]

dupes = uc.find_duplicates(support_tickets, similarity_threshold=0.85)
print("Near-duplicate tickets:")
for t1, t2, score in dupes:
    print(f"  [{score:.3f}] '{t1[:40]}' ≈ '{t2[:40]}'")

Interview Answer: Explain Embeddings

"An embedding is a dense numerical representation of text — a list of floating-point numbers — generated by a neural network trained to place semantically similar texts near each other in vector space. When I embed 'password reset help' and 'forgot my login,' both phrases become high-dimensional vectors that are very close together as measured by cosine similarity, even though they share no exact words. This is the key advantage over keyword search, which would find no overlap between those two phrases. Embeddings power semantic search, recommendations, duplicate detection, and are a core primitive in RAG systems where we retrieve relevant documents before generating a response."

CP-05 Summary
  • Embeddings convert text to coordinates in high-dimensional concept space
  • OpenAI's text-embedding-3-small gives 1536-dimensional vectors at low cost
  • Cosine similarity measures how similar two vectors are (1.0 = identical meaning)
  • Semantic search pipeline: embed corpus → store → embed query → find nearest
  • ChromaDB handles embedding + storage + retrieval in one library
  • Use brute force (numpy) up to ~50k docs; ChromaDB up to ~1M docs
  • Three use cases: semantic search, recommendations, duplicate detection