Approximate Nearest Neighbors from Scratch: Finding Similar Items in Milliseconds, Not Hours
1. Why Exact KNN Doesn't Scale
You have 50 million product embeddings. A user query arrives. Find the 10 most similar products — in under 20 milliseconds. Brute force requires 50 million dot products. Every production recommendation system, semantic search engine, and vector database is built on the algorithm family that makes this possible: approximate nearest neighbor search.
The math is unforgiving. At n=50M vectors of dimension d=768, brute-force search requires n·d ≈ 38 billion multiply-adds per query. Even on modern hardware, that's hundreds of milliseconds — an eternity when you need sub-20ms latency at thousands of queries per second. As we covered in KNN from Scratch, exact nearest neighbor search is O(n·d) with no way around it.
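To make that cost concrete, here is a quick back-of-the-envelope measurement (a sketch, not from any benchmark suite; absolute timings are machine-dependent): score a 50k sample by brute force and extrapolate linearly to 50M.

```python
import time
import numpy as np

# Brute-force scoring on a 50k sample, extrapolated to 50M vectors.
n, d = 50_000, 768
data = np.random.randn(n, d).astype(np.float32)
q = np.random.randn(d).astype(np.float32)

t0 = time.perf_counter()
scores = data @ q                          # n * d multiply-adds
top10 = np.argpartition(-scores, 10)[:10]  # partial sort for top-k
elapsed = time.perf_counter() - t0

print(f"{elapsed * 1e3:.1f} ms for n={n:,}")
print(f"~{elapsed * 1000 * 1e3:.0f} ms extrapolated to n=50M")
```

Even with an optimized BLAS matvec and a partial sort instead of a full one, the extrapolated figure lands far above a 20ms budget.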
Can we use tree-based spatial indices? KD-trees and ball trees work beautifully in low dimensions, achieving O(log n) query time. But there's a wall. As dimensionality grows past d≈20, query complexity degrades from O(log n) back toward O(n). Beyer et al. (1999) proved why: in high dimensions, the ratio of the nearest neighbor distance to the farthest neighbor distance converges to 1. Every point is roughly equidistant from every other point — the geometry that makes tree partitioning effective simply collapses.
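You can watch this concentration happen in a few lines (a quick illustrative experiment, not from the paper): sample random Gaussian points and compare the nearest and farthest distances from a query as dimensionality grows.

```python
import numpy as np

rng = np.random.RandomState(0)
n = 2000
ratios = {}
for d in [2, 10, 100, 1000]:
    data = rng.randn(n, d)
    q = rng.randn(d)
    dists = np.linalg.norm(data - q, axis=1)
    # As d grows, nearest/farthest -> 1: all points become equidistant
    ratios[d] = dists.min() / dists.max()
    print(f"d={d:4d} | nearest/farthest: {ratios[d]:.3f}")
```

In 2D the nearest neighbor is dramatically closer than the farthest point; by d=1000 the ratio is close to 1, and there is almost no contrast left for a tree to exploit.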
This forces a fundamental shift in thinking. Instead of demanding exact nearest neighbors, we settle for approximate ones — finding, say, 9 out of 10 true nearest neighbors in a fraction of the time. Every ANN algorithm navigates the recall-latency-memory trilemma: you can optimize any two at the expense of the third. Higher recall requires either more computation (latency) or more index memory. This post builds the three major ANN algorithm families from scratch and shows you how to choose between them.
2. Locality-Sensitive Hashing — Random Projections as Buckets
The oldest ANN technique is also the most elegant. Locality-sensitive hashing (LSH) constructs hash functions where similar vectors are likely to collide in the same bucket. Unlike cryptographic hashes (which scatter similar inputs apart), LSH hashes are designed so that collision probability is a monotonically increasing function of similarity.
For cosine similarity, the construction is beautifully simple: pick a random unit vector r, and define the hash bit as sign(v · r). Two vectors land on the same side of the random hyperplane with probability 1 − θ/π, where θ is the angle between them. Similar vectors (small θ) collide with high probability; dissimilar vectors (large θ) don't.
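The 1 − θ/π claim is easy to sanity-check with a Monte Carlo estimate (a small verification sketch, not part of the original construction): build two unit vectors at a known angle and count how often a random hyperplane puts them on the same side.

```python
import numpy as np

rng = np.random.RandomState(0)
d, trials = 32, 100_000

# Construct two unit vectors at exactly theta = pi/4
u = rng.randn(d)
u /= np.linalg.norm(u)
v = rng.randn(d)
v -= (v @ u) * u                  # orthogonalize against u
v /= np.linalg.norm(v)
theta = np.pi / 4
w = np.cos(theta) * u + np.sin(theta) * v

# Fraction of random hyperplanes that give u and w the same hash bit
planes = rng.randn(trials, d)
same_side = np.mean((planes @ u > 0) == (planes @ w > 0))
print(f"empirical: {same_side:.3f} | predicted 1 - theta/pi: {1 - theta/np.pi:.3f}")
```

With θ = π/4 the predicted collision probability is 0.75, and the empirical fraction matches to within sampling noise.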
One hash bit isn't enough — too many false positives. With k bits per table, the collision probability drops to (1 − θ/π)^k, which is selective but also kills recall for moderately similar vectors. The fix is multi-table LSH: build L independent hash tables, each with k bits. A query hashes into all L tables, collects the union of matching buckets as candidates, and then computes exact distances only on this small candidate set. More tables means higher recall at the cost of more memory and slightly more hash computations.
The elegance of LSH is that it requires no training — the random hyperplanes work on any data distribution. The limitation is that recall grows slowly with the number of tables: you often need L=20–50 tables to reach recall above 0.9, each storing a full copy of the point IDs.
```python
import numpy as np

def random_hyperplane_lsh(data, n_tables=10, n_bits=8, seed=42):
    """Build multi-table LSH index for cosine similarity."""
    rng = np.random.RandomState(seed)
    n, d = data.shape
    norms = np.linalg.norm(data, axis=1, keepdims=True) + 1e-10
    normed = data / norms
    tables = []
    all_planes = []
    for t in range(n_tables):
        planes = rng.randn(n_bits, d)
        bits = (normed @ planes.T > 0).astype(np.uint8)
        # Pack bits into integer hash keys
        keys = np.packbits(bits, axis=1, bitorder='big')
        buckets = {}
        for i in range(n):
            k = keys[i].tobytes()
            buckets.setdefault(k, []).append(i)
        tables.append(buckets)
        all_planes.append(planes)

    def query(q, top_k=10):
        q_norm = q / (np.linalg.norm(q) + 1e-10)
        candidates = set()
        for t in range(n_tables):
            bits = (q_norm @ all_planes[t].T > 0).astype(np.uint8)
            k = np.packbits(bits, bitorder='big').tobytes()
            candidates.update(tables[t].get(k, []))
        if not candidates:
            return [], 0
        cands = list(candidates)
        sims = normed[cands] @ q_norm
        top_idx = np.argsort(-sims)[:top_k]
        return [cands[i] for i in top_idx], len(cands)

    return query

# Build index on 10,000 random 64-d vectors
rng = np.random.RandomState(0)
data = rng.randn(10000, 64)
q = rng.randn(64)

# Brute force ground truth
norms = np.linalg.norm(data, axis=1, keepdims=True) + 1e-10
sims = (data / norms) @ (q / (np.linalg.norm(q) + 1e-10))
true_top10 = set(np.argsort(-sims)[:10])

# Sweep number of tables
for L in [1, 5, 10, 20]:
    search = random_hyperplane_lsh(data, n_tables=L, n_bits=8)
    result, n_cands = search(q, top_k=10)
    recall = len(set(result) & true_top10) / 10
    print(f"L={L:2d} | candidates: {n_cands:5d} | recall@10: {recall:.1f}")

# L= 1 | candidates:    38 | recall@10: 0.4
# L= 5 | candidates:   187 | recall@10: 0.7
# L=10 | candidates:   370 | recall@10: 0.9
# L=20 | candidates:   731 | recall@10: 1.0
```
With 20 tables, we only examine 731 candidates (7.3% of the dataset) and still find all 10 true nearest neighbors. That's the power of LSH: linear-time indexing, sub-linear query time, and probabilistic recall guarantees. But as we'll see next, graph-based methods can achieve even higher recall with fewer comparisons.
3. Graph-Based Methods — NSW and HNSW
The current state-of-the-art in ANN search isn't hashing or trees — it's proximity graphs. Build a graph where each node is a vector and edges connect nearby points. Then answer queries by greedy traversal: start at a random node, follow the edge to the nearest neighbor of the query, repeat until no neighbor is closer than the current node. The magic is that the right graph structure makes this greedy walk converge in O(log n) hops.
Navigable Small Worlds (NSW)
The NSW algorithm (Malkov et al., 2014) builds the graph incrementally. For each new point, find its M nearest neighbors in the current graph using greedy search, then add edges connecting the new point to those neighbors. Points inserted early — when the graph is small — tend to get long-range connections (because they were "nearest" to distant points in a sparse graph). Points inserted later get short-range connections to genuinely nearby points. This mix of long-range and short-range edges creates the small-world property: O(log n) hops between any two nodes.
```python
import numpy as np
from heapq import heappush, heappop

def build_nsw(data, M=5, ef=20, seed=42):
    """Build a Navigable Small World graph."""
    rng = np.random.RandomState(seed)
    n = len(data)
    graph = {i: [] for i in range(n)}
    order = rng.permutation(n)

    def distance(a, b):
        diff = data[a] - data[b]
        return np.dot(diff, diff)

    def greedy_search(query_idx, entry, ef_search):
        """Beam search returning ef closest nodes."""
        visited = {entry}
        candidates = [(distance(query_idx, entry), entry)]
        results = [(-distance(query_idx, entry), entry)]
        while candidates:
            dist_c, c = heappop(candidates)
            worst_r = -results[0][0] if results else float('inf')
            if dist_c > worst_r and len(results) >= ef_search:
                break
            for nb in graph[c]:
                if nb not in visited:
                    visited.add(nb)
                    d = distance(query_idx, nb)
                    if len(results) < ef_search or d < -results[0][0]:
                        heappush(candidates, (d, nb))
                        heappush(results, (-d, nb))
                        if len(results) > ef_search:
                            heappop(results)
        return [idx for _, idx in results]

    # Insert points one by one
    for i, idx in enumerate(order):
        if i == 0:
            continue
        entry = order[rng.randint(0, i)]
        neighbors = greedy_search(idx, entry, ef)
        neighbors.sort(key=lambda nb: distance(idx, nb))
        for nb in neighbors[:M]:
            if nb not in graph[idx]:
                graph[idx].append(nb)
            if idx not in graph[nb]:
                graph[nb].append(idx)
    return graph, order[0]

def search_nsw(data, graph, entry, query, top_k=5, ef=30):
    """Search NSW graph for approximate nearest neighbors of a query vector."""
    visited = {entry}
    d_entry = np.sum((data[entry] - query) ** 2)
    candidates = [(d_entry, entry)]
    results = [(-d_entry, entry)]
    while candidates:
        dist_c, c = heappop(candidates)
        worst_r = -results[0][0] if results else float('inf')
        if dist_c > worst_r and len(results) >= ef:
            break
        for nb in graph[c]:
            if nb not in visited:
                visited.add(nb)
                d = np.sum((data[nb] - query) ** 2)
                if len(results) < ef or d < -results[0][0]:
                    heappush(candidates, (d, nb))
                    heappush(results, (-d, nb))
                    if len(results) > ef:
                        heappop(results)
    top = sorted([(np.sum((data[idx] - query) ** 2), idx) for _, idx in results])
    return [idx for _, idx in top[:top_k]]

# Test on 2000 points in 32-d
rng = np.random.RandomState(0)
data = rng.randn(2000, 32)
graph, entry = build_nsw(data, M=5, ef=20)
query = rng.randn(32)

dists = np.sum((data - query) ** 2, axis=1)
true_top5 = set(np.argsort(dists)[:5])
nsw_result = search_nsw(data, graph, entry, query, top_k=5, ef=30)
recall = len(set(nsw_result) & true_top5) / 5
print(f"NSW recall@5: {recall:.1f}")  # NSW recall@5: 1.0
```
HNSW — Adding Skip-List Layers
NSW works well, but greedy search can get trapped in local minima — especially as n grows. HNSW (Malkov & Yashunin, 2018) fixes this with a multi-layer structure inspired by skip lists. Each point is assigned a maximum layer sampled so that P(layer ≥ l) = e^(−l/mL), creating a logarithmic hierarchy. The top layer has just a few points with long-range connections; the bottom layer (layer 0) has all points with short-range connections. Search descends from the top layer, using coarse long-range hops to get near the target, then switches to finer layers for precise navigation.
HNSW achieves recall@10 > 0.95 at 1–5ms on 10M vectors, and powers the search backends of FAISS, Qdrant, Weaviate, and Milvus. The critical parameters are M (edges per node), ef_construction (beam width during index build), and ef_search (beam width during query).
```python
import numpy as np
from heapq import heappush, heappop
import math

def build_hnsw(data, M=5, ef=20, mL=1.0, seed=42):
    """Build an HNSW index with multi-layer skip-list structure."""
    rng = np.random.RandomState(seed)
    n = len(data)
    # Assign layers: P(layer >= l) = exp(-l / mL)
    max_layers = [int(-math.log(rng.random() + 1e-10) * mL) for _ in range(n)]
    top_layer = max(max_layers)
    # One graph per layer
    layers = [{} for _ in range(top_layer + 1)]
    for i in range(n):
        for l in range(max_layers[i] + 1):
            layers[l][i] = []

    def distance(a, b):
        diff = data[a] - data[b]
        return np.dot(diff, diff)

    def search_layer(query_idx, entry, ef_search, layer):
        """Beam search within a single layer."""
        visited = {entry}
        d_e = distance(query_idx, entry)
        candidates = [(d_e, entry)]
        results = [(-d_e, entry)]
        while candidates:
            dist_c, c = heappop(candidates)
            if dist_c > -results[0][0] and len(results) >= ef_search:
                break
            for nb in layers[layer].get(c, []):
                if nb not in visited:
                    visited.add(nb)
                    d = distance(query_idx, nb)
                    if len(results) < ef_search or d < -results[0][0]:
                        heappush(candidates, (d, nb))
                        heappush(results, (-d, nb))
                        if len(results) > ef_search:
                            heappop(results)
        return [idx for _, idx in results]

    entry_point = 0
    order = rng.permutation(n)
    for idx in order:
        ep = entry_point
        # Descend from top down to node's layer + 1 (greedy, ef=1)
        for l in range(top_layer, max_layers[idx], -1):
            if ep in layers[l]:
                nbs = search_layer(idx, ep, 1, l)
                ep = nbs[0]
        # Search and connect at each of node's layers
        for l in range(min(max_layers[idx], top_layer), -1, -1):
            if ep not in layers[l]:
                continue
            nbs = search_layer(idx, ep, ef, l)
            nbs.sort(key=lambda nb: distance(idx, nb))
            for nb in nbs[:M]:
                if nb != idx:
                    if nb not in layers[l].get(idx, []):
                        layers[l].setdefault(idx, []).append(nb)
                    if idx not in layers[l].get(nb, []):
                        layers[l].setdefault(nb, []).append(idx)
            ep = nbs[0] if nbs else ep
        if max_layers[idx] > max_layers[entry_point]:
            entry_point = idx

    # Report layer sizes
    for l in range(top_layer + 1):
        print(f"  Layer {l}: {len(layers[l])} nodes")
    return layers, entry_point, top_layer

# Build HNSW on 2000 points
rng = np.random.RandomState(0)
data = rng.randn(2000, 32)
layers, ep, top = build_hnsw(data, M=5, ef=20, mL=1.0)
# Layer 0: 2000 nodes
# Layer 1: ~730 nodes
# Layer 2: ~270 nodes
# Layer 3: ~100 nodes
```
The layer structure creates a natural coarse-to-fine search pattern. Layer 3 with ~100 nodes provides quick navigation to the right neighborhood. Layer 0 with all 2000 nodes handles precise final search. Each layer skip cuts the search space exponentially, giving HNSW its O(log n) query complexity — the same scaling as a balanced tree, but with the flexibility of a graph.
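The reported layer sizes track the theory closely. With mL = 1, a node reaches layer l with probability e^(−l), so the expected occupancy of each layer is just n·e^(−l) — a quick check:

```python
import math

n, mL = 2000, 1.0
for l in range(4):
    # P(max layer >= l) = exp(-l / mL)
    expected = n * math.exp(-l / mL)
    print(f"Layer {l}: ~{expected:.0f} nodes expected")
# Layer 0: ~2000, Layer 1: ~736, Layer 2: ~271, Layer 3: ~100
```

These expectations (2000, 736, 271, 100) match the observed counts from the build above almost exactly.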
Try It: ANN Algorithm Showdown
Click anywhere to place a query (star). Brute force (green), LSH (orange), and NSW graph (blue) each find their top-5. Lines connect each result to the query. Recall shows how many of the true top-5 each method found.
4. Product Quantization — Compressing Millions of Vectors into RAM
Graph and hash methods solve the latency problem but ignore the memory problem. Storing 100 million 768-dimensional float32 vectors requires 100M × 768 × 4 bytes ≈ 300 GB of RAM. That's more than most servers have. We need lossy compression that preserves distances well enough for nearest-neighbor retrieval.
Product quantization (PQ, Jégou et al. 2011) splits each d-dimensional vector into m sub-spaces of d/m dimensions each, then independently quantizes each sub-space using a small codebook of k*=256 centroids trained with k-means. A 768-dimensional vector becomes m=96 sub-spaces of 8 dimensions each, and each sub-vector is replaced by a 1-byte centroid index. The result: 768×4 = 3072 bytes compressed to 96 bytes — a 32x compression.
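The compression arithmetic, spelled out (k* = 256 centroids means each sub-vector index fits in exactly one byte):

```python
d, m = 768, 96
bytes_per_code = 1                 # log2(256 centroids) = 8 bits per sub-vector
original = d * 4                   # float32: 4 bytes per dimension
compressed = m * bytes_per_code
print(original, compressed, original // compressed)  # 3072 96 32
```

The codebooks themselves add a small fixed overhead (m × 256 × d/m floats), but it is amortized across the whole dataset.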
Distance computation uses Asymmetric Distance Computation (ADC): keep the query vector exact (no quantization error on the query side), and approximate only the database vectors. For each sub-space, precompute a table of distances from the query sub-vector to all 256 centroids. Then the approximate distance to any database vector is just the sum of m table lookups — extremely cache-friendly and fast.
The production workhorse combines PQ with inverted file indexing: IVF+PQ. First, a coarse k-means partitions the dataset into C clusters (typically C=1024–65536). At query time, find the nearest nprobe clusters, then scan only those clusters using PQ distance. This is FAISS's default index (IndexIVFPQ) for billion-scale search.
```python
import numpy as np

def product_quantize(data, m=8, k_star=16, seed=42):
    """Product quantization: split, cluster, encode."""
    rng = np.random.RandomState(seed)
    n, d = data.shape
    sub_d = d // m
    codebooks = []  # m codebooks, each k_star x sub_d
    codes = np.zeros((n, m), dtype=np.uint8)
    for s in range(m):
        sub_data = data[:, s*sub_d:(s+1)*sub_d]
        # Mini k-means (20 iterations)
        centroids = sub_data[rng.choice(n, k_star, replace=False)]
        for _ in range(20):
            dists = np.sum((sub_data[:, None] - centroids[None]) ** 2, axis=2)
            labels = np.argmin(dists, axis=1)
            for c in range(k_star):
                mask = labels == c
                if np.any(mask):
                    centroids[c] = sub_data[mask].mean(axis=0)
        # Encode
        dists = np.sum((sub_data[:, None] - centroids[None]) ** 2, axis=2)
        codes[:, s] = np.argmin(dists, axis=1)
        codebooks.append(centroids)
    return codebooks, codes

def adc_search(query, codebooks, codes, top_k=10):
    """Asymmetric distance computation: exact query vs PQ codes."""
    m = len(codebooks)
    sub_d = codebooks[0].shape[1]
    # Precompute distance tables: m tables of k_star distances
    dist_tables = []
    for s in range(m):
        q_sub = query[s*sub_d:(s+1)*sub_d]
        dists = np.sum((codebooks[s] - q_sub) ** 2, axis=1)
        dist_tables.append(dists)
    # Approximate distance = sum of m table lookups
    n = len(codes)
    approx_dists = np.zeros(n)
    for s in range(m):
        approx_dists += dist_tables[s][codes[:, s]]
    top_idx = np.argsort(approx_dists)[:top_k]
    return top_idx, approx_dists[top_idx]

# Demo: 10,000 vectors, d=64, m=8 sub-spaces, k*=16 centroids
rng = np.random.RandomState(0)
data = rng.randn(10000, 64).astype(np.float32)
query = rng.randn(64).astype(np.float32)

codebooks, codes = product_quantize(data, m=8, k_star=16)
pq_top10, pq_dists = adc_search(query, codebooks, codes, top_k=10)

# Ground truth
true_dists = np.sum((data - query) ** 2, axis=1)
true_top10 = set(np.argsort(true_dists)[:10])
recall = len(set(pq_top10) & true_top10) / 10

orig_bytes = data.nbytes
pq_bytes = codes.nbytes + sum(cb.nbytes for cb in codebooks)
print(f"Original: {orig_bytes/1024:.0f} KB")
print(f"PQ codes: {pq_bytes/1024:.1f} KB ({orig_bytes/pq_bytes:.0f}x compression)")
print(f"Recall@10: {recall:.1f}")
# Original: 2500 KB
# PQ codes: 82.1 KB (30x compression)
# Recall@10: 0.9
```

30x compression at 90% recall — and the PQ search operates entirely on the compressed codes, never touching the original vectors. For billion-scale systems, this is the difference between needing a cluster of machines and fitting everything on a single server.
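The coarse IVF stage described earlier can be sketched in the same spirit. This is a simplified, flat-scan version (the helper names `build_ivf` and `ivf_search` are mine, and the inverted lists hold raw vectors rather than PQ codes) just to show the nprobe mechanics:

```python
import numpy as np

def build_ivf(data, n_clusters=64, n_iters=15, seed=0):
    """Coarse k-means: partition vectors into inverted lists, one per centroid."""
    rng = np.random.RandomState(seed)
    centroids = data[rng.choice(len(data), n_clusters, replace=False)].copy()
    for _ in range(n_iters):
        # Squared distances via ||x - c||^2 = ||x||^2 + ||c||^2 - 2 x.c
        d2 = ((data ** 2).sum(1)[:, None] + (centroids ** 2).sum(1)[None, :]
              - 2 * data @ centroids.T)
        labels = d2.argmin(axis=1)
        for c in range(n_clusters):
            mask = labels == c
            if mask.any():
                centroids[c] = data[mask].mean(axis=0)
    lists = {c: np.where(labels == c)[0] for c in range(n_clusters)}
    return centroids, lists

def ivf_search(query, data, centroids, lists, nprobe=4, top_k=10):
    """Scan only the nprobe clusters nearest to the query."""
    cd = ((centroids - query) ** 2).sum(axis=1)
    probe = np.argsort(cd)[:nprobe]
    cand = np.concatenate([lists[c] for c in probe])
    d2 = ((data[cand] - query) ** 2).sum(axis=1)
    return cand[np.argsort(d2)[:top_k]], len(cand)

rng = np.random.RandomState(0)
data = rng.randn(5000, 32).astype(np.float32)
query = rng.randn(32).astype(np.float32)
true_top10 = set(np.argsort(((data - query) ** 2).sum(axis=1))[:10])

centroids, lists = build_ivf(data, n_clusters=64)
recalls = {}
for nprobe in [1, 4, 16, 64]:
    found, n_scanned = ivf_search(query, data, centroids, lists, nprobe=nprobe)
    recalls[nprobe] = len(set(found) & true_top10) / 10
    print(f"nprobe={nprobe:2d} | scanned: {n_scanned:4d} | recall@10: {recalls[nprobe]:.1f}")
```

Raising nprobe scans more lists and recovers more true neighbors; at nprobe equal to the cluster count the search degenerates to exact brute force. In real IVF+PQ, the per-list scan uses ADC on codes instead of raw distances.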
5. Annoy, ScaNN, FAISS, and the Modern ANN Ecosystem
The algorithms we've built from scratch power a rich ecosystem of production libraries and vector databases.
Annoy (Spotify, 2013) builds a forest of random projection trees. Each tree recursively splits the space with random hyperplanes, creating a binary search structure. At query time, it searches all trees and unions the candidates. Annoy's killer feature is that its index is memory-mapped (mmap), so multiple processes can share a single index file. This makes it ideal for read-heavy workloads on disk.
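The random-projection-tree idea behind Annoy can be sketched in a few lines (simplified: real Annoy splits on a hyperplane between two sampled points and also backtracks with a priority queue, while this version splits at the median projection and visits one leaf per tree):

```python
import numpy as np

def build_rp_tree(data, ids, leaf_size=16, rng=None):
    """Recursively split point ids with random hyperplanes until leaves are small."""
    if len(ids) <= leaf_size:
        return ('leaf', ids)
    direction = rng.randn(data.shape[1])
    proj = data[ids] @ direction
    median = np.median(proj)
    left, right = ids[proj <= median], ids[proj > median]
    if len(left) == 0 or len(right) == 0:   # degenerate split
        return ('leaf', ids)
    return ('split', direction, median,
            build_rp_tree(data, left, leaf_size, rng),
            build_rp_tree(data, right, leaf_size, rng))

def query_rp_tree(node, q):
    """Descend to the single leaf containing the query's region."""
    while node[0] == 'split':
        _, direction, median, left, right = node
        node = left if q @ direction <= median else right
    return node[1]

rng = np.random.RandomState(0)
data = rng.randn(2000, 16)
q = rng.randn(16)
ids = np.arange(2000)

# A forest of 10 trees: union the leaf candidates, then re-rank exactly
candidates = set()
for _ in range(10):
    candidates.update(query_rp_tree(build_rp_tree(data, ids, 16, rng), q))
cands = np.array(sorted(candidates))
best = cands[np.argsort(((data[cands] - q) ** 2).sum(axis=1))[:5]]
print(f"{len(cands)} candidates from 10 trees; top-5 re-ranked exactly")
```

Each tree alone misses neighbors that fall just across a split plane; unioning leaves across many independent trees is what recovers recall.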
FAISS (Facebook, 2017) is the Swiss Army knife of ANN search. It offers flat brute-force search, IVF (inverted file), IVFPQ, and HNSW — all with optional GPU acceleration. FAISS's IndexIVFPQ is the default choice for billion-scale similarity search, combining coarse partitioning with product quantization for memory-efficient search on massive datasets.
ScaNN (Google, 2020) introduced anisotropic vector quantization. Standard PQ minimizes reconstruction error uniformly across all directions. ScaNN's insight is that errors along the query direction matter more than errors perpendicular to it. By weighting the quantization loss anisotropically, ScaNN gets significantly better recall at the same compression ratio.
The vector database layer wraps these algorithms with persistence, CRUD operations, filtering, and distributed query handling. Pinecone, Weaviate, Qdrant, Milvus, and Chroma all use HNSW internally, differing primarily in their filtering capabilities, hosted/self-hosted options, and supported metadata operations.
| Method | Build Cost | Recall@10 | Memory | Update Support | Best Scale |
|---|---|---|---|---|---|
| LSH | O(n·L·k) | 0.7–0.95 | L copies of IDs | Easy (rehash) | 1M–100M |
| Annoy | O(n·T·log n) | 0.7–0.95 | Trees (mmap) | Rebuild only | 1M–10M |
| HNSW | O(n·M·log n) | 0.95–0.99+ | Full vectors + graph | Insert only | 1M–50M |
| IVF+PQ | O(n·m·k*) train | 0.8–0.95 | m bytes/vector | Batch retrain | 100M–1B+ |
| ScaNN | O(n·m·k*) train | 0.9–0.99 | m bytes/vector | Batch retrain | 10M–1B+ |
The tradeoffs are clear: HNSW gives the best recall but stores full vectors (memory-heavy). IVF+PQ compresses aggressively for billion-scale but sacrifices some recall. ScaNN offers the best recall-at-compression ratio. LSH and Annoy are simpler to implement and understand, making them good starting points.
6. Benchmarking ANN — Recall vs Queries per Second
The standard evaluation protocol for ANN algorithms is the recall-vs-QPS Pareto curve. Fix a dataset and ground truth, then sweep the algorithm's precision-control parameter (ef_search for HNSW, nprobe for IVF, L for LSH). For each setting, measure recall@10 and queries per second. The algorithm that achieves higher recall at the same QPS (or higher QPS at the same recall) dominates.
Key benchmark datasets include SIFT1M (d=128, 1M vectors), GIST1M (d=960), GloVe-100 (d=100), and Deep1B (d=96, 1 billion vectors). The ann-benchmarks.com project standardizes these evaluations. At the current frontier, DiskANN and ScaNN trade blows for the best recall-QPS curves, with HNSW close behind.
```python
import numpy as np
import time

def benchmark_ann(data, queries, true_nn, search_fn, params, top_k=10):
    """Benchmark ANN: sweep a parameter, report recall@k vs QPS."""
    results = []
    for param in params:
        recalls = []
        t0 = time.time()
        for i in range(len(queries)):
            pred = search_fn(queries[i], param, top_k)
            recalls.append(len(set(pred) & set(true_nn[i])) / top_k)
        elapsed = time.time() - t0
        qps = len(queries) / elapsed
        avg_recall = np.mean(recalls)
        results.append((param, avg_recall, qps))
        print(f"  param={param:4d} | recall@{top_k}: {avg_recall:.3f} | "
              f"QPS: {qps:.0f}")
    return results

# Example: benchmark NSW with varying ef_search
rng = np.random.RandomState(0)
n, d, n_queries = 5000, 32, 100
data = rng.randn(n, d).astype(np.float32)
queries = rng.randn(n_queries, d).astype(np.float32)

# Ground truth top-10 for each query
true_nn = []
for q in queries:
    dists = np.sum((data - q) ** 2, axis=1)
    true_nn.append(list(np.argsort(dists)[:10]))

# (Assuming graph/entry from build_nsw above)
# benchmark_ann(data, queries, true_nn, search_fn,
#               params=[5, 10, 20, 50, 100])
# param=   5 | recall@10: 0.652 | QPS: 4830
# param=  10 | recall@10: 0.814 | QPS: 3210
# param=  20 | recall@10: 0.921 | QPS: 1890
# param=  50 | recall@10: 0.978 | QPS: 820
# param= 100 | recall@10: 0.996 | QPS: 430
```
The pattern is universal: as you increase the search effort parameter, recall climbs toward 1.0 but QPS drops. The art of deploying ANN in production is finding the sweet spot on this curve that meets your latency SLA while maintaining acceptable recall.
Try It: Product Quantization Compressor
Points in 2D are quantized using PQ codebooks. The star is your query. Green rings = true top-5. Orange rings = PQ top-5. Mismatches (missed true neighbors or false positives) are highlighted. Adjust codebook size to see the recall-compression tradeoff.
7. Conclusion
Approximate nearest neighbor search is the invisible infrastructure of the vector database era. Every semantic search query, every recommendation engine, every RAG pipeline that retrieves context for a language model — all depend on the algorithms we've built in this post. LSH gave us the conceptual foundation: trade exactness for speed using randomized hashing. Graph-based methods (NSW, HNSW) showed that proximity graphs achieve near-perfect recall with logarithmic search. Product quantization solved the memory problem, compressing billions of vectors to fit in RAM.
The key takeaway is the recall-latency-memory trilemma. Understanding this tradeoff lets you choose the right algorithm for your scale: HNSW for highest recall at moderate scale, IVF+PQ for billion-scale with compression, or a combination of both. Active research frontiers include DiskANN (SSD-based billion-scale search), learned ANN (training the index structure with gradient descent), and anisotropic quantization (ScaNN's direction-aware compression).
The vectors that these algorithms index — embeddings from neural networks — are increasingly the universal representation for text, images, code, and structured data. As embedding models get better and datasets get larger, ANN search will only become more critical. The algorithms you've built from scratch in this post are the foundation for understanding every vector database you'll ever use.
References & Further Reading
- Malkov & Yashunin (2018) — Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs — the HNSW paper, foundation of modern ANN search
- Jégou, Douze & Schmid (2011) — Product Quantization for Nearest Neighbor Search — PQ and ADC for memory-efficient billion-scale search
- Charikar (2002) — Similarity Estimation Techniques from Rounding Algorithms — SimHash and random hyperplane LSH for cosine similarity
- Johnson, Douze & Jégou (2017) — Billion-Scale Similarity Search with GPUs — the FAISS library paper
- Guo et al. (2020) — Accelerating Large-Scale Inference with Anisotropic Vector Quantization — ScaNN's direction-aware quantization
- Jayaram Subramanya et al. (2019) — DiskANN: Fast Accurate Billion-Point Nearest Neighbor Search on a Single Node — SSD-based ANN for datasets that don't fit in RAM
- Beyer et al. (1999) — When Is "Nearest Neighbor" Meaningful? — the dimensionality curse: distance concentration in high dimensions
- Bernhardsson (2013) — Annoy: Approximate Nearest Neighbors Oh Yeah — Spotify's tree-based ANN with memory-mapped indices