Dense Retrieval from Scratch
Why Keyword Matching Fails
Search for "how to fix a leaking faucet" in a document collection. BM25 faithfully retrieves every document containing the words "leaking," "faucet," and "fix." But the best repair guide in your collection is titled "Plumbing Repair: Stopping Drips from Bathroom Fixtures" — and it shares zero keywords with your query. Traditional search misses it entirely.
This is the vocabulary mismatch problem, and it's been the Achilles' heel of keyword-based search for decades. Users say "fix," documents say "repair." Users say "leaking faucet," documents say "dripping fixture." Synonyms, paraphrases, and conceptual matches all fall through the cracks of exact string matching.
For decades, the search community's workhorse has been BM25 (Best Matching 25), a refinement of TF-IDF that scores documents by how many query terms they contain, weighted by how rare those terms are across the corpus. It's fast, interpretable, and requires zero training data. Let's build one:
import math
from collections import Counter

class BM25:
    def __init__(self, docs, k1=1.5, b=0.75):
        self.docs = [d.lower().split() for d in docs]
        self.k1, self.b = k1, b
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        self.df = Counter()  # document frequency per term
        for doc in self.docs:
            for term in set(doc):
                self.df[term] += 1
        self.N = len(self.docs)

    def score(self, query, doc_idx):
        doc = self.docs[doc_idx]
        tf = Counter(doc)
        dl = len(doc)
        s = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((self.N - self.df[term] + 0.5) / (self.df[term] + 0.5) + 1)
            numerator = tf[term] * (self.k1 + 1)
            denominator = tf[term] + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            s += idf * numerator / denominator
        return s

    def search(self, query, top_k=3):
        scores = [(i, self.score(query, i)) for i in range(self.N)]
        return sorted(scores, key=lambda x: -x[1])[:top_k]

# The vocabulary mismatch problem in action
corpus = [
    "Plumbing repair: stopping drips from bathroom fixtures",
    "How to fix a leaking kitchen faucet step by step",
    "Best bathroom renovation ideas for small spaces",
    "Emergency pipe burst: shutting off the main water valve",
    "Guide for replacing worn rubber washers in tap valves",
]
bm25 = BM25(corpus)
results = bm25.search("how to fix a leaking faucet")
for idx, score in results:
    print(f"  [{score:.2f}] {corpus[idx]}")
# [7.66] How to fix a leaking kitchen faucet step by step
# [0.00] Plumbing repair: stopping drips from bathroom fixtures <-- zero!
# [0.00] Best bathroom renovation ideas for small spaces
BM25 nails the document with exact keyword overlap, but the plumbing repair guide — arguably the most relevant result — ties for last with a score of zero because it shares no terms with the query. On the MS MARCO passage retrieval benchmark, BM25 achieves an MRR@10 of about 0.19. Dense retrieval models reach 0.33+, a 74% relative improvement. The secret? They encode text into dense vector representations where semantic similarity corresponds to vector proximity, making "leaking faucet" and "dripping fixture" neighbors in embedding space.
The Bi-Encoder Architecture
Dense retrieval works by encoding queries and documents into fixed-size vectors, then measuring similarity via dot product or cosine distance. The dominant architecture is the bi-encoder: two separate encoders — one for queries (Eq), one for documents (Ed) — that map text to the same vector space.
The scoring function is elegantly simple:
score(q, d) = Eq(q) · Ed(d)
Why bi-encoders instead of cross-encoders? A cross-encoder concatenates query and document as a single input — [query; SEP; document] — and processes them jointly through all transformer layers. This allows rich token-level interactions between query and document, making cross-encoders more accurate. But there's a devastating catch: to score N documents, you need N forward passes. That's fine for reranking 100 candidates, but impossibly slow for searching millions of documents.
Bi-encoders decouple the problem. Document embeddings are computed offline — encode all your documents once and store their vectors. At query time, you compute a single query vector and find nearest neighbors using fast vector search. One forward pass for the query, then a dot product against pre-computed vectors. That's the speed trick that makes dense retrieval practical at scale.
In practice, both encoders are often initialized from the same pretrained model (like BERT) and may even share weights. A common design choice is mean pooling — averaging all token embeddings to produce the fixed-size vector — which the Sentence-BERT paper found outperforms CLS token pooling for sentence-level tasks.
import numpy as np

class BiEncoder:
    """Minimal bi-encoder with random projection (conceptual demo)."""
    def __init__(self, vocab_size=5000, embed_dim=64, hidden_dim=128):
        rng = np.random.RandomState(42)
        self.vocab_size = vocab_size
        self.embed = rng.randn(vocab_size, embed_dim) * 0.02
        self.W1 = rng.randn(embed_dim, hidden_dim) * 0.02
        self.b1 = np.zeros(hidden_dim)
        self.W2 = rng.randn(hidden_dim, embed_dim) * 0.02
        self.b2 = np.zeros(embed_dim)

    def tokenize(self, text):
        # Hash words into the vocabulary (Python's hash is salted per process
        # unless PYTHONHASHSEED is fixed -- fine for a conceptual demo)
        return [hash(w) % self.vocab_size for w in text.lower().split()]

    def encode(self, text):
        tokens = self.tokenize(text)
        token_vecs = self.embed[tokens]                # (T, embed_dim)
        pooled = token_vecs.mean(axis=0)               # mean pooling
        h = np.maximum(0, pooled @ self.W1 + self.b1)  # ReLU
        out = h @ self.W2 + self.b2
        return out / (np.linalg.norm(out) + 1e-8)      # L2 normalize

    def score(self, query, doc):
        q_vec = self.encode(query)
        d_vec = self.encode(doc)
        return float(q_vec @ d_vec)  # dot product

    def search(self, query, docs, top_k=3):
        q_vec = self.encode(query)
        d_vecs = np.array([self.encode(d) for d in docs])
        scores = d_vecs @ q_vec  # batch dot product
        top_idx = np.argsort(-scores)[:top_k]
        return [(i, scores[i]) for i in top_idx]

# Without training, the bi-encoder produces near-random rankings.
# The magic happens when we train it with contrastive learning...
This bi-encoder produces garbage rankings because its weights are random. The architecture is just the scaffold — the real magic is in how we train it. That brings us to contrastive learning.
Training with Contrastive Learning
The training objective for dense retrieval is intuitive: given a query q, a relevant document d+, and a set of irrelevant documents {d−}, push the query embedding close to its positive document and far from the negatives.
The loss function that makes this happen is InfoNCE, a contrastive loss based on noise-contrastive estimation:
L = −log( exp(q · d⁺ / τ) / Σᵢ exp(q · dᵢ / τ) )
This is a softmax over similarity scores — the positive document's score in the numerator, all documents (positive + negatives) in the denominator. The temperature τ controls the sharpness: smaller τ makes the distribution peakier, forcing the model to make harder distinctions.
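To see what the temperature does, here's a quick standalone check with made-up similarity scores (the numbers are illustrative, not from a trained model):

```python
import numpy as np

# Hypothetical similarity scores: one positive (0.8) and three negatives
sims = np.array([0.8, 0.5, 0.3, 0.1])

def softmax(x):
    e = np.exp(x - x.max())  # stable softmax
    return e / e.sum()

for tau in [1.0, 0.1, 0.05]:
    p = softmax(sims / tau)
    print(f"tau={tau}: P(positive) = {p[0]:.3f}")
# tau=1.0: P(positive) = 0.352
# tau=0.1: P(positive) = 0.946
# tau=0.05: P(positive) = 0.997
```

At τ = 1.0 the model gets partial credit even when the margin over negatives is small; at τ = 0.05 the same raw scores already look nearly solved, so gradient only flows when the model truly confuses a negative with the positive.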
The brilliant insight from DPR (Dense Passage Retrieval, Karpukhin et al. 2020) was in-batch negatives. Given a batch of B (query, positive_document) pairs, every other query's positive document serves as a negative for the current query. From B pairs you get B positives and B×(B−1) negatives — quadratic negatives from linear data. This is why batch size matters enormously for dense retrieval training: batch size 128 gives 127 negatives per query for free.
def infonce_loss(q_vecs, d_vecs, temperature=0.05):
    """
    InfoNCE loss with in-batch negatives.
    q_vecs: (B, dim) query embeddings
    d_vecs: (B, dim) positive document embeddings
    Each q_vecs[i] pairs with d_vecs[i] (positive).
    All other d_vecs[j != i] are negatives for query i.
    """
    # All-pairs similarity matrix: (B, B)
    sim_matrix = q_vecs @ d_vecs.T / temperature
    # Labels: diagonal entries are the positives
    labels = np.arange(len(q_vecs))
    # Numerically stable log-sum-exp (subtract row max to prevent overflow)
    row_max = np.max(sim_matrix, axis=1, keepdims=True)
    log_sum_exp = np.log(np.sum(np.exp(sim_matrix - row_max), axis=1)) + row_max.squeeze()
    positive_scores = sim_matrix[labels, labels]
    loss = -np.mean(positive_scores - log_sum_exp)
    return loss

# Gradient of InfoNCE pushes positive pairs closer,
# negative pairs apart. The similarity matrix encodes
# B^2 comparisons from just B data pairs.
# Example: batch of 4 query-document pairs
rng = np.random.RandomState(7)
q = rng.randn(4, 64)
d = rng.randn(4, 64)
q = q / np.linalg.norm(q, axis=1, keepdims=True)
d = d / np.linalg.norm(d, axis=1, keepdims=True)
print(f"Loss (random embeddings): {infonce_loss(q, d):.3f}")
# Loss (random embeddings): 3.712
# High because tau=0.05 amplifies random similarities.
# Perfect alignment would push loss toward 0.
With random embeddings at τ = 1.0, the loss would start near log(B) ≈ 1.39 — uniform probability across B candidates. With a sharper temperature like τ = 0.05, the loss starts much higher (3.7 here) because the temperature amplifies random similarities, creating spurious confidence. As training progresses, the loss decreases toward zero, meaning the model assigns most probability mass to the correct document for each query.
Temperature is critical. At τ = 1.0, the softmax is soft and the model gets easy gradients. At τ = 0.01, the distribution is extremely peaked, demanding near-perfect discrimination. DPR used τ = 1.0 with dot product (no explicit temperature); Sentence-BERT typically uses τ = 0.05 with cosine similarity. The right temperature depends on how hard your negatives are — which leads us to the next section.
Hard Negative Mining
Here's the problem with in-batch negatives: most of them are too easy. If your query is about plumbing repair, a random batch negative might be about deep learning frameworks — trivially distinguishable. The model scores it low with no effort, generating near-zero gradient. Easy negatives waste training compute.
Hard negatives are documents that are superficially relevant but not actually the answer. BM25 provides natural hard negatives: top BM25 results that aren't labeled as relevant share lots of keywords with the query but aren't the right document. They force the model to learn deeper semantic understanding beyond keyword overlap.
DPR's recipe includes one BM25 hard negative per query alongside the in-batch negatives. Later work pushed this further:
- ANCE (Xiong et al. 2021): Use the retrieval model itself to find hard negatives. Train, retrieve, mine hard negatives from the model's own top results, retrain. Each iteration produces harder negatives as the model improves.
- Knowledge distillation: Use a cross-encoder (slow but accurate) to score thousands of candidates, then train the bi-encoder to match those soft scores. The student learns from the teacher's nuanced relevance judgments.
There's a danger: train too aggressively on hard negatives and you get representation collapse — all embeddings cluster into a small region, destroying the structure of the embedding space. The fix is balancing hard and easy negatives, and monitoring embedding diversity during training.
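One cheap way to monitor embedding diversity is mean pairwise cosine similarity over a sample of embeddings: a healthy space stays near zero, while values creeping toward 1.0 signal collapse. A sketch (the sample data and thresholds here are illustrative):

```python
import numpy as np

def embedding_diversity(vecs):
    """Mean off-diagonal pairwise cosine similarity; near 1.0 signals collapse."""
    v = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sim = v @ v.T
    n = len(v)
    # Exclude the diagonal (self-similarity is always 1)
    return (sim.sum() - n) / (n * (n - 1))

rng = np.random.RandomState(0)
healthy = rng.randn(100, 64)                               # spread-out embeddings
collapsed = rng.randn(1, 64) + 0.01 * rng.randn(100, 64)   # all near one point
print(f"healthy:   {embedding_diversity(healthy):.3f}")    # close to 0
print(f"collapsed: {embedding_diversity(collapsed):.3f}")  # close to 1
```

Logging this statistic every few hundred training steps is enough to catch collapse before it destroys the index.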
def mine_hard_negatives(retriever, queries, pos_docs, corpus, top_k=10):
    """
    Iterative hard negative mining: use the current model
    to find near-miss documents as hard negatives.
    """
    hard_negs = []
    for q, pos in zip(queries, pos_docs):
        results = retriever.search(q, corpus, top_k=top_k)
        # Filter out the actual positive document
        negs = [corpus[idx] for idx, _ in results if corpus[idx] != pos]
        hard_negs.append(negs[:3])  # keep top-3 hard negatives
    return hard_negs
# Training loop with iterative hard negative mining:
# Round 1: Train with random/in-batch negatives
# Round 2: Mine hard negatives with round-1 model, retrain
# Round 3: Mine harder negatives with round-2 model, retrain
# Each round, the negatives get harder and the model improves.
#
# ANCE repeats this loop until convergence, refreshing the
# negative index every few thousand training steps.
# Example mining iteration
queries = ["how to fix a leaking faucet",
           "best programming language for beginners",
           "symptoms of vitamin D deficiency"]
positives = ["plumbing repair: stopping drips from fixtures",
             "python is ideal for newcomers to coding",
             "fatigue and bone pain from low vitamin D levels"]
corpus = positives + [
    "kitchen renovation cost estimates 2024",
    "javascript framework comparison react vue angular",
    "common cold vs flu: how to tell the difference",
    "how to replace a bathroom faucet cartridge",
    "learning python: first steps in programming",
    "vitamin supplements: what you need to know",
]
# After a few rounds, the model learns that
# "replace a faucet cartridge" is NOT the same as "fix a leak"
# even though they share "faucet" -- that's the power of hard negatives
The progression from random negatives to BM25 hard negatives to model-mined hard negatives mirrors a curriculum: the model first learns coarse-grained distinctions (sports vs. medicine), then fine-grained ones (faucet repair vs. faucet replacement). Each mining round forces the model to carve finer boundaries in embedding space.
Approximate Nearest Neighbor Search
Training gives us great embeddings. But now we have 10 million document vectors, each 768-dimensional, and we need to find the top-10 nearest neighbors for a query vector. Brute-force search requires 10 million dot products — feasible once, but too slow at thousands of queries per second.
Three families of approximate nearest neighbor (ANN) algorithms dominate:
- Product Quantization (PQ): Compress each vector by splitting it into sub-vectors and quantizing each to its nearest centroid from a learned codebook. A 768-dim vector becomes a sequence of short codes, reducing memory 10–100x while enabling fast approximate distance computation.
- HNSW (Hierarchical Navigable Small World): Build a multi-layer graph where each vector connects to its approximate nearest neighbors. Search navigates from sparse upper layers (long-range jumps) down to dense bottom layers (fine-grained search). Achieves millisecond search over billions of vectors.
- IVF (Inverted File Index): Partition the vector space into clusters using k-means. At query time, probe only the nearest clusters instead of all vectors. Combines naturally with PQ (IVFPQ) for both speed and compression.
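The IVF idea fits in a few lines. This is a minimal sketch with toy cluster counts and brute-force k-means, not a production index (class and parameter names are ours):

```python
import numpy as np

class SimpleIVF:
    """Minimal IVF: partition with k-means, probe only the nearest clusters."""
    def __init__(self, n_clusters=8, n_probe=2):
        self.n_clusters, self.n_probe = n_clusters, n_probe

    def build(self, vectors, n_iter=10):
        rng = np.random.RandomState(0)
        self.vectors = vectors
        self.centroids = vectors[rng.choice(len(vectors), self.n_clusters, replace=False)]
        for _ in range(n_iter):
            assign = np.linalg.norm(vectors[:, None] - self.centroids[None], axis=2).argmin(axis=1)
            for c in range(self.n_clusters):
                if (assign == c).any():
                    self.centroids[c] = vectors[assign == c].mean(axis=0)
        # Inverted lists: which document ids live in each cluster
        self.lists = [np.where(assign == c)[0] for c in range(self.n_clusters)]

    def search(self, query, top_k=5):
        # Probe only the n_probe clusters whose centroids are nearest the query
        nearest = np.linalg.norm(query - self.centroids, axis=1).argsort()[:self.n_probe]
        cand = np.concatenate([self.lists[c] for c in nearest])
        dists = np.linalg.norm(self.vectors[cand] - query, axis=1)
        return cand[np.argsort(dists)[:top_k]]

rng = np.random.RandomState(1)
docs = rng.randn(1000, 32)
ivf = SimpleIVF()
ivf.build(docs)
hits = ivf.search(docs[42])
print(hits[0])  # docs[42] itself should rank first (distance zero)
```

With n_probe=2 of 8 clusters, each query scans roughly a quarter of the corpus instead of all of it; the trade-off is that a true neighbor sitting in an unprobed cluster is simply missed, which is why n_probe is the main recall/speed knob.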
Let's build product quantization from scratch. The core idea: instead of computing the full dot product, approximate it by looking up pre-computed distances in codebooks.
class ProductQuantizer:
    def __init__(self, dim=64, n_subvectors=8, n_centroids=256):
        self.m = n_subvectors
        self.k = n_centroids
        self.sub_dim = dim // n_subvectors
        self.codebooks = None  # (m, k, sub_dim)
        self.codes = None      # (N, m) -- encoded corpus

    def train(self, vectors, n_iter=20):
        """Learn codebooks via k-means on each sub-vector slice."""
        rng = np.random.RandomState(42)
        N, D = vectors.shape
        self.codebooks = np.zeros((self.m, self.k, self.sub_dim))
        for i in range(self.m):
            sub = vectors[:, i*self.sub_dim:(i+1)*self.sub_dim]
            # Mini k-means
            centroids = sub[rng.choice(N, self.k, replace=False)]
            for _ in range(n_iter):
                dists = np.linalg.norm(sub[:, None] - centroids[None], axis=2)
                assignments = dists.argmin(axis=1)
                for c in range(self.k):
                    mask = assignments == c
                    if mask.any():
                        centroids[c] = sub[mask].mean(axis=0)
            self.codebooks[i] = centroids

    def encode(self, vectors):
        """Encode vectors as sequences of centroid indices."""
        N = len(vectors)
        self.codes = np.zeros((N, self.m), dtype=np.int32)
        for i in range(self.m):
            sub = vectors[:, i*self.sub_dim:(i+1)*self.sub_dim]
            dists = np.linalg.norm(sub[:, None] - self.codebooks[i][None], axis=2)
            self.codes[:, i] = dists.argmin(axis=1)

    def search(self, query, top_k=5):
        """Asymmetric distance: exact query vs quantized documents."""
        # Precompute distance table: SQUARED distance from each query
        # sub-vector to each centroid. Squared sub-distances sum to the
        # squared full distance; summing plain norms would not.
        dist_table = np.zeros((self.m, self.k))
        for i in range(self.m):
            q_sub = query[i*self.sub_dim:(i+1)*self.sub_dim]
            dist_table[i] = ((q_sub - self.codebooks[i]) ** 2).sum(axis=1)
        # Sum lookup distances for each document
        approx_dists = np.zeros(len(self.codes))
        for i in range(self.m):
            approx_dists += dist_table[i, self.codes[:, i]]
        return np.argsort(approx_dists)[:top_k]

# In production: 768-dim float32 vectors (3072 bytes) compress to
# ~96 bytes with m=96 sub-vectors and 256 centroids -- 32x savings.
# Our demo uses dim=64 and m=8 to keep the code short.
# The distance table trick avoids decompressing documents entirely.
The key insight is asymmetric distance computation: the query vector is kept exact (it's just one vector), while documents are compressed. For each sub-vector slice, we pre-compute distances from the query sub-vector to all centroids, creating a lookup table. Then scoring a document is just m table lookups and a sum — vastly cheaper than a full dot product.
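A quick self-contained sanity check of that claim: summing squared sub-vector distances from the lookup table reproduces exactly the distance between the query and the reconstructed (quantized) document. The sizes here are toy values:

```python
import numpy as np

rng = np.random.RandomState(0)
m, k, sub_dim = 4, 16, 8               # 4 sub-vectors, 16 centroids each (toy sizes)
codebooks = rng.randn(m, k, sub_dim)

# A document is stored only as m centroid indices (its PQ code)
doc_code = rng.randint(0, k, size=m)
query = rng.randn(m * sub_dim)

# Asymmetric distance via table lookups: m lookups and a sum
table = np.zeros((m, k))
for i in range(m):
    q_sub = query[i*sub_dim:(i+1)*sub_dim]
    table[i] = ((q_sub - codebooks[i]) ** 2).sum(axis=1)
approx = np.sqrt(sum(table[i, doc_code[i]] for i in range(m)))

# Same distance computed directly against the reconstructed document
recon = np.concatenate([codebooks[i, doc_code[i]] for i in range(m)])
direct = np.linalg.norm(query - recon)
print(np.isclose(approx, direct))  # True -- lookups match the distance to the quantized vector
```

The only approximation error in PQ search comes from quantizing the documents; the table-lookup arithmetic itself is exact.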
ColBERT and Late Interaction
Single-vector bi-encoders compress an entire document into one fixed-size vector. That's a severe bottleneck — a 500-word passage reduced to 768 numbers. Fine-grained information about individual words and phrases gets lost in the compression.
ColBERT (Contextualized Late Interaction over BERT, Khattab & Zaharia 2020) takes a different approach: represent each text as a bag of contextualized token embeddings. A query with 10 tokens produces 10 embeddings; a document with 200 tokens produces 200 embeddings.
The scoring function is MaxSim: for each query token, find its maximum similarity to any document token, then sum across all query tokens:
score(q, d) = Σᵢ maxⱼ qᵢ · dⱼ
This is a "late interaction" model — queries and documents are encoded independently (like bi-encoders), but interaction happens at the token level during scoring (like cross-encoders). It's a middle ground: richer than single-vector, faster than full cross-attention.
def maxsim_score(q_embeddings, d_embeddings):
    """
    ColBERT MaxSim scoring.
    q_embeddings: (n_q, dim) -- one embedding per query token
    d_embeddings: (n_d, dim) -- one embedding per document token
    For each query token, find the max similarity to any doc token.
    Sum these max similarities across all query tokens.
    """
    # Token-token similarity matrix: (n_q, n_d)
    sim_matrix = q_embeddings @ d_embeddings.T
    # MaxSim: max over document tokens for each query token
    max_sims = sim_matrix.max(axis=1)  # (n_q,)
    return float(max_sims.sum())
# Compare single-vector vs MaxSim
rng = np.random.RandomState(42)
# Simulate encoding "fix leaking faucet" (3 tokens)
q_tokens = rng.randn(3, 64)
q_tokens = q_tokens / np.linalg.norm(q_tokens, axis=1, keepdims=True)
# Document 1: "plumbing repair stopping drips fixtures" (5 tokens)
# -- semantically similar but different words
d1_tokens = q_tokens[0:1] * 0.7 + rng.randn(5, 64) * 0.3
d1_tokens = d1_tokens / np.linalg.norm(d1_tokens, axis=1, keepdims=True)
# Document 2: random unrelated document (5 tokens)
d2_tokens = rng.randn(5, 64)
d2_tokens = d2_tokens / np.linalg.norm(d2_tokens, axis=1, keepdims=True)
print(f"MaxSim(query, related_doc): {maxsim_score(q_tokens, d1_tokens):.3f}")
print(f"MaxSim(query, random_doc): {maxsim_score(q_tokens, d2_tokens):.3f}")
# MaxSim preserves fine-grained token matches that
# single-vector compression would average away
The trade-off is storage: ColBERT stores per-token embeddings for every document, which is much more expensive than one vector per document. ColBERTv2 (Santhanam et al. 2022) addresses this with residual compression — storing each token embedding as a quantized residual from its nearest centroid, reducing storage 6–10x while maintaining retrieval quality.
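The residual idea can be sketched in miniature. This is a loose illustration using one sign bit per dimension, not ColBERTv2's exact codec (which quantizes residuals to a small number of bits against a shared codebook):

```python
import numpy as np

rng = np.random.RandomState(0)
centroids = rng.randn(256, 64)   # shared centroid set (toy sizes)
token = rng.randn(64)            # one contextualized token embedding

# Compress: nearest centroid id + residual quantized to 1 bit per dimension
cid = np.linalg.norm(centroids - token, axis=1).argmin()
residual = token - centroids[cid]
scale = np.abs(residual).mean()
bits = (residual > 0).astype(np.uint8)   # 64 bits + 1 centroid id per token

# Decompress and compare reconstruction error with/without the residual bits
approx = centroids[cid] + scale * np.where(bits, 1.0, -1.0)
err_with_residual = np.linalg.norm(token - approx)
err_centroid_only = np.linalg.norm(residual)
print(err_with_residual < err_centroid_only)  # True: residual bits tighten the fit
```

Even one sign bit per dimension provably shrinks the reconstruction error, which is why a coarse residual on top of a shared centroid recovers most of the quality of full-precision token embeddings at a fraction of the storage.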
Evaluation and the Modern Landscape
How do we measure retrieval quality? The standard metrics for ranked retrieval are:
- Recall@k: What fraction of relevant documents appear in the top k results? The fundamental measure of retrieval completeness.
- MRR@k (Mean Reciprocal Rank): The average of 1/rank for the first relevant result. If the answer is at position 3, the reciprocal rank is 1/3.
- NDCG@k (Normalized Discounted Cumulative Gain): Handles graded relevance levels (highly relevant, somewhat relevant, not relevant), rewarding rankings that place high-gain documents early via logarithmic position discounting, normalized against the ideal ranking.
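Each of these metrics is a few lines of code. A minimal sketch (function names and the example ranking are ours):

```python
import numpy as np

def recall_at_k(ranked, relevant, k=10):
    """Fraction of relevant docs that appear in the top-k ranking."""
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def mrr_at_k(ranked, relevant, k=10):
    """Reciprocal rank of the first relevant result (0 if none in top k)."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, gains, k=10):
    """gains maps doc id -> graded relevance (e.g. 0, 1, 2)."""
    dcg = sum(gains.get(d, 0) / np.log2(r + 1) for r, d in enumerate(ranked[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / np.log2(r + 1) for r, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

ranked = ["d3", "d1", "d7", "d2"]
print(recall_at_k(ranked, {"d1", "d9"}, k=3))   # 0.5 -- one of two relevant docs in top 3
print(mrr_at_k(ranked, {"d1", "d9"}, k=10))     # 0.5 -- first relevant doc at rank 2
print(ndcg_at_k(ranked, {"d1": 2, "d9": 1}, k=10))
```

Note that recall measures coverage while MRR and NDCG reward putting relevant documents early, which is why retrieval papers typically report both a recall metric and a ranking metric.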
The benchmark landscape has evolved rapidly:
- MS MARCO (8.8 million passages, web queries): The standard in-domain benchmark. Dense retrievers surpassed BM25 here in 2020.
- Natural Questions (Wikipedia QA): DPR's original evaluation, now a standard dense retrieval benchmark.
- BEIR (18 diverse datasets): The zero-shot generalization benchmark. This is where things get humbling — BM25 remains competitive on many BEIR tasks because dense models trained on MS MARCO don't always transfer to different domains.
Modern embedding models address the generalization gap with massive multi-source training. E5 (Wang et al. 2022) uses weakly-supervised contrastive pre-training on 1 billion text pairs scraped from the web. BGE trains on diverse sources with instruction-aware retrieval. These models achieve MRR@10 above 0.40 on MS MARCO and strong zero-shot performance on BEIR.
The current best practice is hybrid retrieval: combine BM25 scores with dense retrieval scores (often via simple weighted sum or reciprocal rank fusion). This captures both exact keyword matches and semantic similarity, outperforming either approach alone. Most production RAG systems — from Google Search to enterprise document retrieval — use some form of hybrid retrieval feeding into a cross-encoder reranker.
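Reciprocal rank fusion itself is tiny. A sketch using the commonly used smoothing constant k = 60 (the document ids and rankings here are made up):

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: each doc scores the sum of 1/(k + rank) over all lists."""
    scores = defaultdict(float)
    for ranked in rankings:
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["d2", "d5", "d1", "d9"]    # keyword matches
dense_ranking = ["d1", "d2", "d7", "d3"]   # semantic neighbors
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused[:3])  # ['d2', 'd1', 'd5'] -- docs ranked well by both lists rise to the top
```

Because RRF works on ranks rather than raw scores, it needs no score normalization between BM25 and dense retrieval, which is a big part of its practical appeal.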
From Keywords to Meaning
We started with a simple problem — BM25 can't match "leaking faucet" to "dripping fixture" — and built our way through the entire dense retrieval stack: bi-encoders that compress text into vectors, contrastive learning that shapes embedding space by pushing positives together and negatives apart, hard negative mining that forces ever-finer distinctions, product quantization that makes billion-scale search feasible, and ColBERT's late interaction that preserves token-level detail.
The field continues to evolve fast. Instruction-tuned embedding models, learned sparse representations (SPLADE), and multi-vector approaches are blurring the line between sparse and dense. But the core insight remains: by learning to map text into a geometric space where meaning corresponds to proximity, we transformed search from string matching into genuine language understanding. Every time you ask a question and get a relevant answer — even when you use completely different words — dense retrieval is probably why.
References & Further Reading
- Karpukhin et al. — Dense Passage Retrieval for Open-Domain Question Answering (2020) — the DPR paper that popularized bi-encoder retrieval with in-batch negatives
- Xiong et al. — Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval (2021) — ANCE and iterative hard negative mining
- Khattab & Zaharia — ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction (2020) — the MaxSim late interaction model
- Reimers & Gurevych — Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (2019) — established mean pooling and siamese training for sentence embeddings
- Jégou et al. — Product Quantization for Nearest Neighbor Search (2011) — the foundational PQ paper
- Malkov & Yashunin — Efficient and Robust Approximate Nearest Neighbor Using Hierarchical Navigable Small World Graphs (2018) — HNSW, the most widely deployed ANN algorithm
- Wang et al. — Text Embeddings by Weakly-Supervised Contrastive Pre-training (2022) — E5 embedding model, trained on 1B text pairs
- Thakur et al. — BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models (2021) — the standard cross-domain IR benchmark
- Santhanam et al. — ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction (2022) — residual compression for practical ColBERT deployment
- Robertson & Zaragoza — The Probabilistic Relevance Framework: BM25 and Beyond (2009) — the definitive BM25 reference