
Dense Retrieval from Scratch

Why Keyword Matching Fails

Search for "how to fix a leaking faucet" in a document collection. BM25 faithfully retrieves every document containing the words "leaking," "faucet," and "fix." But the best repair guide in your collection is titled "Plumbing Repair: Stopping Drips from Bathroom Fixtures" — and it shares zero keywords with your query. Traditional search misses it entirely.

This is the vocabulary mismatch problem, and it's been the Achilles' heel of keyword-based search for decades. Users say "fix," documents say "repair." Users say "leaking faucet," documents say "dripping fixture." Synonyms, paraphrases, and conceptual matches all fall through the cracks of exact string matching.

For three decades, the search community's workhorse has been BM25 (Best Matching 25), a refinement of TF-IDF that scores documents by how many query terms they contain, weighted by how rare those terms are across the corpus. It's fast, interpretable, and requires zero training data. Let's build one:

import math
from collections import Counter

class BM25:
    def __init__(self, docs, k1=1.5, b=0.75):
        self.docs = [d.lower().split() for d in docs]
        self.k1, self.b = k1, b
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        self.df = Counter()           # document frequency per term
        for doc in self.docs:
            for term in set(doc):
                self.df[term] += 1
        self.N = len(self.docs)

    def score(self, query, doc_idx):
        doc = self.docs[doc_idx]
        tf = Counter(doc)
        dl = len(doc)
        s = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((self.N - self.df[term] + 0.5) / (self.df[term] + 0.5) + 1)
            numerator = tf[term] * (self.k1 + 1)
            denominator = tf[term] + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            s += idf * numerator / denominator
        return s

    def search(self, query, top_k=3):
        scores = [(i, self.score(query, i)) for i in range(self.N)]
        return sorted(scores, key=lambda x: -x[1])[:top_k]

# The vocabulary mismatch problem in action
corpus = [
    "Plumbing repair: stopping drips from bathroom fixtures",
    "How to fix a leaking kitchen faucet step by step",
    "Best bathroom renovation ideas for small spaces",
    "Emergency pipe burst: shutting off the main water valve",
    "Guide for replacing worn rubber washers in tap valves",
]

bm25 = BM25(corpus)
results = bm25.search("how to fix a leaking faucet")
for idx, score in results:
    print(f"  [{score:.2f}] {corpus[idx]}")
# [7.66] How to fix a leaking kitchen faucet step by step
# [0.00] Plumbing repair: stopping drips from bathroom fixtures  <-- zero!
# [0.00] Best bathroom renovation ideas for small spaces

BM25 nails the document with exact keyword overlap, but the plumbing repair guide — arguably the most relevant result — ties for last with a score of zero because it shares no terms with the query. On the MS MARCO passage retrieval benchmark, BM25 achieves an MRR@10 of about 0.19. Dense retrieval models reach 0.33+, a 74% relative improvement. The secret? They encode text into dense vector representations where semantic similarity corresponds to vector proximity, making "leaking faucet" and "dripping fixture" neighbors in embedding space.

The Bi-Encoder Architecture

Dense retrieval works by encoding queries and documents into fixed-size vectors, then measuring similarity via dot product or cosine distance. The dominant architecture is the bi-encoder: two separate encoders — one for queries (Eq), one for documents (Ed) — that map text to the same vector space.

The scoring function is elegantly simple:

score(q, d) = Eq(q) · Ed(d)

Why bi-encoders instead of cross-encoders? A cross-encoder concatenates query and document as a single input — [query; SEP; document] — and processes them jointly through all transformer layers. This allows rich token-level interactions between query and document, making cross-encoders more accurate. But there's a devastating catch: to score N documents, you need N forward passes. That's fine for reranking 100 candidates, but impossibly slow for searching millions of documents.

Bi-encoders decouple the problem. Document embeddings are computed offline — encode all your documents once and store their vectors. At query time, you compute a single query vector and find nearest neighbors using fast vector search. One forward pass for the query, then a dot product against pre-computed vectors. That's the speed trick that makes dense retrieval practical at scale.
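The indexing pattern can be sketched in a few lines. This is a conceptual demo: `encode` here is a placeholder (a deterministic pseudo-random projection keyed on the text), standing in for a trained encoder tower, and a real system would store the vectors in an ANN index rather than a raw matrix.

```python
import numpy as np
import zlib

def encode(text, dim=64):
    """Placeholder for a trained encoder tower: a deterministic
    pseudo-random unit vector keyed on the text (pattern demo only)."""
    seed = zlib.crc32(text.encode("utf-8"))
    v = np.random.RandomState(seed).randn(dim)
    return v / np.linalg.norm(v)

# Offline, once: encode every document and store the matrix
corpus = ["plumbing repair guide", "python tutorial", "vitamin D overview"]
index = np.stack([encode(d) for d in corpus])     # (N, dim), precomputed

# Query time: one encoder call, then a single matrix-vector product
q_vec = encode("fixing a leaky pipe")
scores = index @ q_vec                            # N dot products, no re-encoding
top = int(np.argmax(scores))
print(corpus[top])
```

The split matters because the expensive step (running the encoder over the corpus) happens once, while the per-query cost is one forward pass plus a matrix product.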

In practice, both encoders are often initialized from the same pretrained model (like BERT) and may even share weights. A common design choice is mean pooling — averaging all token embeddings to produce the fixed-size vector — which the Sentence-BERT paper found outperforms CLS token pooling for sentence-level tasks.

import numpy as np

class BiEncoder:
    """Minimal bi-encoder with random projection (conceptual demo)."""
    def __init__(self, vocab_size=5000, embed_dim=64, hidden_dim=128):
        rng = np.random.RandomState(42)
        self.embed = rng.randn(vocab_size, embed_dim) * 0.02
        self.W1 = rng.randn(embed_dim, hidden_dim) * 0.02
        self.b1 = np.zeros(hidden_dim)
        self.W2 = rng.randn(hidden_dim, embed_dim) * 0.02
        self.b2 = np.zeros(embed_dim)

    def tokenize(self, text):
        # Hash words into the embedding table's rows; note that Python's
        # hash() varies across runs unless PYTHONHASHSEED is fixed
        return [hash(w) % self.embed.shape[0] for w in text.lower().split()]

    def encode(self, text):
        tokens = self.tokenize(text)
        token_vecs = self.embed[tokens]           # (T, embed_dim)
        pooled = token_vecs.mean(axis=0)          # mean pooling
        h = np.maximum(0, pooled @ self.W1 + self.b1)   # ReLU
        out = h @ self.W2 + self.b2
        return out / (np.linalg.norm(out) + 1e-8)       # L2 normalize

    def score(self, query, doc):
        q_vec = self.encode(query)
        d_vec = self.encode(doc)
        return float(q_vec @ d_vec)               # dot product

    def search(self, query, docs, top_k=3):
        q_vec = self.encode(query)
        d_vecs = np.array([self.encode(d) for d in docs])
        scores = d_vecs @ q_vec                   # batch dot product
        top_idx = np.argsort(-scores)[:top_k]
        return [(i, scores[i]) for i in top_idx]

# Without training, the bi-encoder produces near-random rankings.
# The magic happens when we train it with contrastive learning...

This bi-encoder produces garbage rankings because its weights are random. The architecture is just the scaffold — the real magic is in how we train it. That brings us to contrastive learning.

Training with Contrastive Learning

The training objective for dense retrieval is intuitive: given a query q, a relevant document d+, and a set of irrelevant documents {d−}, push the query embedding close to its positive document and far from the negatives.

The loss function that makes this happen is InfoNCE, a contrastive objective derived from noise-contrastive estimation:

L = −log( exp(q · d+ / τ) / Σ_i exp(q · d_i / τ) )

This is a softmax over similarity scores — the positive document's score in the numerator, all documents (positive + negatives) in the denominator. The temperature τ controls the sharpness: smaller τ makes the distribution peakier, forcing the model to make harder distinctions.

The brilliant insight from DPR (Dense Passage Retrieval, Karpukhin et al. 2020) was in-batch negatives. Given a batch of B (query, positive_document) pairs, every other query's positive document serves as a negative for the current query. From B pairs you get B positives and B×(B−1) negatives — quadratic negatives from linear data. This is why batch size matters enormously for dense retrieval training: batch size 128 gives 127 negatives per query for free.

def infonce_loss(q_vecs, d_vecs, temperature=0.05):
    """
    InfoNCE loss with in-batch negatives.
    q_vecs: (B, dim) query embeddings
    d_vecs: (B, dim) positive document embeddings
    Each q_vecs[i] pairs with d_vecs[i] (positive).
    All other d_vecs[j!=i] are negatives for query i.
    """
    # All-pairs similarity matrix: (B, B)
    sim_matrix = q_vecs @ d_vecs.T / temperature

    # Labels: diagonal entries are the positives
    labels = np.arange(len(q_vecs))

    # Numerically stable log-sum-exp (subtract row max to prevent overflow)
    row_max = np.max(sim_matrix, axis=1, keepdims=True)
    log_sum_exp = np.log(np.sum(np.exp(sim_matrix - row_max), axis=1)) + row_max.squeeze()
    positive_scores = sim_matrix[labels, labels]
    loss = -np.mean(positive_scores - log_sum_exp)
    return loss

# Gradient of InfoNCE pushes positive pairs closer,
# negative pairs apart. The similarity matrix encodes
# B^2 comparisons from just B data pairs.

# Example: batch of 4 query-document pairs
rng = np.random.RandomState(7)
q = rng.randn(4, 64)
d = rng.randn(4, 64)
q = q / np.linalg.norm(q, axis=1, keepdims=True)
d = d / np.linalg.norm(d, axis=1, keepdims=True)

print(f"Loss (random embeddings): {infonce_loss(q, d):.3f}")
# Loss (random embeddings): 3.712
# High because tau=0.05 amplifies random similarities.
# Perfect alignment would push loss toward 0.

With random embeddings at τ = 1.0, the loss would start near log(B) ≈ 1.39 — uniform probability across B candidates. With a sharper temperature like τ = 0.05, the loss starts much higher (3.7 here) because the temperature amplifies random similarities, creating spurious confidence. As training progresses, the loss decreases toward zero, meaning the model assigns most probability mass to the correct document for each query.

Temperature is critical. At τ = 1.0, the softmax is soft and the model gets easy gradients. At τ = 0.01, the distribution is extremely peaked, demanding near-perfect discrimination. DPR used τ = 1.0 with dot product (no explicit temperature); Sentence-BERT typically uses τ = 0.05 with cosine similarity. The right temperature depends on how hard your negatives are — which leads us to the next section.
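To make the temperature effect concrete, here's a self-contained sketch, with random unit vectors standing in for untrained embeddings (batch size 8), that evaluates the same InfoNCE loss at several temperatures:

```python
import numpy as np

def infonce(q_vecs, d_vecs, tau):
    """InfoNCE with in-batch negatives, as defined above."""
    sim = q_vecs @ d_vecs.T / tau                  # (B, B) similarity matrix
    sim = sim - sim.max(axis=1, keepdims=True)     # numerical stability
    log_probs = np.diag(sim) - np.log(np.exp(sim).sum(axis=1))
    return -log_probs.mean()

rng = np.random.RandomState(0)
q = rng.randn(8, 64)
d = rng.randn(8, 64)
q /= np.linalg.norm(q, axis=1, keepdims=True)
d /= np.linalg.norm(d, axis=1, keepdims=True)

for tau in (1.0, 0.1, 0.05, 0.01):
    print(f"tau={tau:5.2f}  loss={infonce(q, d, tau):.3f}")
# tau=1.00 starts near log(8) ~ 2.08; smaller tau inflates the loss
```

Random similarities are small, so dividing by a tiny τ stretches them into confident (and mostly wrong) predictions, which is exactly why sharp temperatures produce hard gradients.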

Hard Negative Mining

Here's the problem with in-batch negatives: most of them are too easy. If your query is about plumbing repair, a random batch negative might be about deep learning frameworks — trivially distinguishable. The model scores it low with no effort, generating near-zero gradient. Easy negatives waste training compute.

Hard negatives are documents that are superficially relevant but not actually the answer. BM25 provides them naturally: top-ranked BM25 results that aren't labeled relevant share many keywords with the query without answering it. They force the model to learn semantic distinctions beyond keyword overlap.

DPR's recipe includes one BM25 hard negative per query alongside the in-batch negatives. Later work pushed this further: ANCE (Xiong et al. 2020) mines negatives from the model's own evolving index, refreshing them periodically during training, and RocketQA (Qu et al. 2021) uses a cross-encoder to filter out false negatives (unlabeled documents that are actually relevant) before training on them.

There's a danger: train too aggressively on hard negatives and you get representation collapse — all embeddings cluster into a small region, destroying the structure of the embedding space. The fix is balancing hard and easy negatives, and monitoring embedding diversity during training.
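One cheap collapse diagnostic, an illustrative sketch rather than a method from any particular paper, is the mean off-diagonal cosine similarity over a sample of embeddings: it stays near zero in a healthy high-dimensional space and drifts toward 1.0 as collapse sets in.

```python
import numpy as np

def embedding_diversity(vecs):
    """Mean off-diagonal cosine similarity of a sample of embeddings.
    Near 0.0: embeddings spread across the space. Near 1.0: collapse."""
    v = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sim = v @ v.T
    mask = ~np.eye(len(v), dtype=bool)
    return sim[mask].mean()

rng = np.random.RandomState(0)
healthy = rng.randn(100, 64)                              # well spread
collapsed = rng.randn(1, 64) + 0.01 * rng.randn(100, 64)  # one tight cluster

print(f"healthy:   {embedding_diversity(healthy):.3f}")   # close to 0
print(f"collapsed: {embedding_diversity(collapsed):.3f}") # close to 1
```

Tracking this statistic on a held-out sample every few hundred steps gives an early warning before retrieval metrics degrade.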

def mine_hard_negatives(retriever, queries, pos_docs, corpus, top_k=10):
    """
    Iterative hard negative mining: use the current model
    to find near-miss documents as hard negatives.
    """
    hard_negs = []
    for q, pos in zip(queries, pos_docs):
        results = retriever.search(q, corpus, top_k=top_k)
        # Filter out the actual positive document
        negs = [corpus[idx] for idx, _ in results if corpus[idx] != pos]
        hard_negs.append(negs[:3])  # keep top-3 hard negatives
    return hard_negs

# Training loop with iterative hard negative mining:
# Round 1: Train with random/in-batch negatives
# Round 2: Mine hard negatives with round-1 model, retrain
# Round 3: Mine harder negatives with round-2 model, retrain
# Each round, the negatives get harder and the model improves.
#
# ANCE repeats this loop until convergence, refreshing the
# negative index every few thousand training steps.

# Example mining iteration
queries = ["how to fix a leaking faucet",
           "best programming language for beginners",
           "symptoms of vitamin D deficiency"]
positives = ["plumbing repair: stopping drips from fixtures",
             "python is ideal for newcomers to coding",
             "fatigue and bone pain from low vitamin D levels"]
corpus = positives + [
    "kitchen renovation cost estimates 2024",
    "javascript framework comparison react vue angular",
    "common cold vs flu: how to tell the difference",
    "how to replace a bathroom faucet cartridge",
    "learning python: first steps in programming",
    "vitamin supplements: what you need to know",
]

# After a few rounds, the model learns that
# "replace a faucet cartridge" is NOT the same as "fix a leak"
# even though they share "faucet" -- that's the power of hard negatives

The progression from random negatives to BM25 hard negatives to model-mined hard negatives mirrors a curriculum: the model first learns coarse-grained distinctions (sports vs. medicine), then fine-grained ones (faucet repair vs. faucet replacement). Each mining round forces the model to carve finer boundaries in embedding space.

Approximate Nearest Neighbor Search

Training gives us great embeddings. But now we have 10 million document vectors, each 768-dimensional, and we need to find the top-10 nearest neighbors for a query vector. Brute-force search requires 10 million dot products — feasible once, but too slow at thousands of queries per second.
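For reference, the exact baseline is a single matrix-vector product. The sketch below (with a synthetic stand-in corpus) shows both why brute force is trivially simple and why its cost grows linearly with corpus size:

```python
import numpy as np

def brute_force_search(query_vec, doc_matrix, top_k=10):
    """Exact top-k retrieval: O(N * dim) work per query."""
    scores = doc_matrix @ query_vec               # N dot products
    # argpartition finds the top-k without fully sorting all N scores
    top = np.argpartition(-scores, top_k)[:top_k]
    return top[np.argsort(-scores[top])]

rng = np.random.RandomState(0)
docs = rng.randn(100_000, 64).astype(np.float32)  # synthetic corpus
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query = docs[42] + 0.01 * rng.randn(64).astype(np.float32)

top = brute_force_search(query, docs)
print(top[:3])                                    # doc 42 should rank first
```

ANN methods exist to beat this baseline's latency while giving up a controlled amount of recall.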

Three families of approximate nearest neighbor (ANN) algorithms dominate: inverted-file methods (IVF), which cluster the corpus and probe only the clusters nearest the query; graph-based methods like HNSW, which greedily navigate a proximity graph toward the query; and quantization methods like product quantization (PQ), which compress vectors so approximate distances can be computed from small lookup tables.

Let's build product quantization from scratch. The core idea: instead of computing the full dot product, approximate it by looking up pre-computed distances in codebooks.

class ProductQuantizer:
    def __init__(self, dim=64, n_subvectors=8, n_centroids=256):
        self.m = n_subvectors
        self.k = n_centroids
        self.sub_dim = dim // n_subvectors
        self.codebooks = None       # (m, k, sub_dim)
        self.codes = None           # (N, m) -- encoded corpus

    def train(self, vectors, n_iter=20):
        """Learn codebooks via k-means on each sub-vector slice."""
        rng = np.random.RandomState(42)
        N, D = vectors.shape
        self.codebooks = np.zeros((self.m, self.k, self.sub_dim))
        for i in range(self.m):
            sub = vectors[:, i*self.sub_dim:(i+1)*self.sub_dim]
            # Mini k-means
            centroids = sub[rng.choice(N, self.k, replace=False)]
            for _ in range(n_iter):
                dists = np.linalg.norm(sub[:, None] - centroids[None], axis=2)
                assignments = dists.argmin(axis=1)
                for c in range(self.k):
                    mask = assignments == c
                    if mask.any():
                        centroids[c] = sub[mask].mean(axis=0)
            self.codebooks[i] = centroids

    def encode(self, vectors):
        """Encode vectors as sequences of centroid indices."""
        N = len(vectors)
        self.codes = np.zeros((N, self.m), dtype=np.int32)
        for i in range(self.m):
            sub = vectors[:, i*self.sub_dim:(i+1)*self.sub_dim]
            dists = np.linalg.norm(sub[:, None] - self.codebooks[i][None], axis=2)
            self.codes[:, i] = dists.argmin(axis=1)

    def search(self, query, top_k=5):
        """Asymmetric distance: exact query vs quantized documents."""
        # Precompute distance table: query sub-vector to each centroid
        dist_table = np.zeros((self.m, self.k))
        for i in range(self.m):
            q_sub = query[i*self.sub_dim:(i+1)*self.sub_dim]
            dist_table[i] = np.linalg.norm(q_sub - self.codebooks[i], axis=1)
        # Sum lookup distances for each document
        approx_dists = np.zeros(len(self.codes))
        for i in range(self.m):
            approx_dists += dist_table[i, self.codes[:, i]]
        return np.argsort(approx_dists)[:top_k]

# In production: 768-dim float32 vectors (3072 bytes) compress to
# ~96 bytes with m=96 sub-vectors and 256 centroids -- 32x savings.
# Our demo uses dim=64 and m=8 to keep the code short.
# The distance table trick avoids decompressing documents entirely.

The key insight is asymmetric distance computation: the query vector is kept exact (it's just one vector), while documents are compressed. For each sub-vector slice, we pre-compute distances from the query sub-vector to all centroids, creating a lookup table. Then scoring a document is just m table lookups and a sum — vastly cheaper than a full dot product.

Try It: Embedding Space Explorer

Documents are colored dots clustered by topic. Type a query or pick an example — see how Dense Retrieval (semantic neighbors) compares to BM25 (keyword matches).


ColBERT and Late Interaction

Single-vector bi-encoders compress an entire document into one fixed-size vector. That's a severe bottleneck — a 500-word passage reduced to 768 numbers. Fine-grained information about individual words and phrases gets lost in the compression.

ColBERT (Contextualized Late Interaction over BERT, Khattab & Zaharia 2020) takes a different approach: represent each text as a bag of contextualized token embeddings. A query with 10 tokens produces 10 embeddings; a document with 200 tokens produces 200 embeddings.

The scoring function is MaxSim: for each query token, find its maximum similarity to any document token, then sum across all query tokens:

score(q, d) = Σ_i max_j q_i · d_j

This is a "late interaction" model — queries and documents are encoded independently (like bi-encoders), but interaction happens at the token level during scoring (like cross-encoders). It's a middle ground: richer than single-vector, faster than full cross-attention.

def maxsim_score(q_embeddings, d_embeddings):
    """
    ColBERT MaxSim scoring.
    q_embeddings: (n_q, dim) -- one embedding per query token
    d_embeddings: (n_d, dim) -- one embedding per document token

    For each query token, find the max similarity to any doc token.
    Sum these max similarities across all query tokens.
    """
    # Token-token similarity matrix: (n_q, n_d)
    sim_matrix = q_embeddings @ d_embeddings.T

    # MaxSim: max over document tokens for each query token
    max_sims = sim_matrix.max(axis=1)   # (n_q,)
    return float(max_sims.sum())

# Compare single-vector vs MaxSim
rng = np.random.RandomState(42)

# Simulate encoding "fix leaking faucet" (3 tokens)
q_tokens = rng.randn(3, 64)
q_tokens = q_tokens / np.linalg.norm(q_tokens, axis=1, keepdims=True)

# Document 1: "plumbing repair stopping drips fixtures" (5 tokens)
# -- semantically similar but different words
d1_tokens = q_tokens[0:1] * 0.7 + rng.randn(5, 64) * 0.3
d1_tokens = d1_tokens / np.linalg.norm(d1_tokens, axis=1, keepdims=True)

# Document 2: random unrelated document (5 tokens)
d2_tokens = rng.randn(5, 64)
d2_tokens = d2_tokens / np.linalg.norm(d2_tokens, axis=1, keepdims=True)

print(f"MaxSim(query, related_doc): {maxsim_score(q_tokens, d1_tokens):.3f}")
print(f"MaxSim(query, random_doc):  {maxsim_score(q_tokens, d2_tokens):.3f}")
# MaxSim preserves fine-grained token matches that
# single-vector compression would average away

The trade-off is storage: ColBERT stores per-token embeddings for every document, which is much more expensive than one vector per document. ColBERTv2 (Santhanam et al. 2022) addresses this with residual compression — storing each token embedding as a quantized residual from its nearest centroid, reducing storage 6–10x while maintaining retrieval quality.
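The residual idea can be sketched in a few lines. This is a toy illustration of centroid-plus-quantized-residual storage, not ColBERTv2's actual implementation (which uses k-means centroids learned from the corpus and a more careful per-dimension encoding):

```python
import numpy as np

def compress(token_embs, centroids, n_bits=2):
    """Each token embedding -> (nearest centroid id, quantized residual)."""
    dists = np.linalg.norm(token_embs[:, None] - centroids[None], axis=2)
    ids = dists.argmin(axis=1)
    residuals = token_embs - centroids[ids]
    scale = np.abs(residuals).max() + 1e-8
    levels = 2 ** n_bits
    # Map each residual dimension from [-scale, scale] to {0 .. levels-1}
    codes = np.round((residuals / scale + 1) / 2 * (levels - 1)).astype(np.uint8)
    return ids, codes, scale

def decompress(ids, codes, scale, centroids, n_bits=2):
    levels = 2 ** n_bits
    residuals = (codes.astype(np.float32) / (levels - 1) * 2 - 1) * scale
    return centroids[ids] + residuals

rng = np.random.RandomState(0)
centroids = rng.randn(16, 8)                              # toy codebook
tokens = centroids[rng.randint(0, 16, 20)] + 0.1 * rng.randn(20, 8)

ids, codes, scale = compress(tokens, centroids)
recon = decompress(ids, codes, scale, centroids)
err = np.linalg.norm(tokens - recon) / np.linalg.norm(tokens)
print(f"relative reconstruction error: {err:.3f}")        # small, not zero
```

Because residuals are much smaller than the embeddings themselves, a few bits per dimension suffice, which is where the storage savings come from.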

Try It: Contrastive Training Playground

Watch contrastive learning reshape embedding space. Queries (circles) and positive documents (squares) are connected by green lines. Click Train Step to run a gradient update — positives pull together, negatives push apart.


Evaluation and the Modern Landscape

How do we measure retrieval quality? The standard metrics for ranked retrieval are Recall@k (is a relevant document anywhere in the top k?), MRR@k (mean reciprocal rank: how high does the first relevant document appear?), and nDCG@k (normalized discounted cumulative gain, which rewards placing highly relevant documents near the top).

The benchmark landscape has evolved rapidly: MS MARCO established large-scale supervised passage retrieval, BEIR measures zero-shot transfer across a diverse suite of domains, and MTEB broadens evaluation to dozens of embedding tasks beyond retrieval.

Modern embedding models address the generalization gap with massive multi-source training. E5 (Wang et al. 2022) uses weakly-supervised contrastive pre-training on 1 billion text pairs scraped from the web. BGE trains on diverse sources with instruction-aware retrieval. These models achieve MRR@10 above 0.40 on MS MARCO and strong zero-shot performance on BEIR.

The current best practice is hybrid retrieval: combine BM25 scores with dense retrieval scores (often via simple weighted sum or reciprocal rank fusion). This captures both exact keyword matches and semantic similarity, outperforming either approach alone. Most production RAG systems — from Google Search to enterprise document retrieval — use some form of hybrid retrieval feeding into a cross-encoder reranker.
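Reciprocal rank fusion is a minimal sketch away: it ignores raw scores entirely and combines only ranks, which sidesteps the problem that BM25 and dense scores live on incompatible scales. The constant k = 60 is the value from the original RRF paper.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank).
    Uses only ranks, so BM25 and dense scores need no calibration."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["d1", "d3", "d5", "d2"]    # keyword matches first
dense_ranking = ["d2", "d3", "d4", "d1"]   # semantic matches first
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)                                # d3 first: strong in both lists
```

Documents ranked well by both retrievers float to the top, while documents that only one retriever likes are demoted but not discarded.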

From Keywords to Meaning

We started with a simple problem — BM25 can't match "leaking faucet" to "dripping fixture" — and built our way through the entire dense retrieval stack: bi-encoders that compress text into vectors, contrastive learning that shapes embedding space by pushing positives together and negatives apart, hard negative mining that forces ever-finer distinctions, product quantization that makes billion-scale search feasible, and ColBERT's late interaction that preserves token-level detail.

The field continues to evolve fast. Instruction-tuned embedding models, learned sparse representations (SPLADE), and multi-vector approaches are blurring the line between sparse and dense. But the core insight remains: by learning to map text into a geometric space where meaning corresponds to proximity, we transformed search from string matching into genuine language understanding. Every time you ask a question and get a relevant answer — even when you use completely different words — dense retrieval is probably why.

References & Further Reading