Vector Search at Small Scale: pgvector vs FAISS vs Brute Force NumPy
The Small-Scale Reality Check
You just generated embeddings for your 50,000 product descriptions. Every blog post, every conference talk, every vendor pitch tells you the same thing: you need a vector database. HNSW indexes. Approximate nearest neighbors. Inverted file lists.
But do you? Your dataset fits in RAM ten times over. Your queries don’t need sub-millisecond latency. You’re not Twitter or Spotify — you’re searching 50K vectors, not 50 billion.
Most vector search benchmarks focus on web-scale datasets with hundreds of millions of vectors. That’s relevant to maybe a dozen companies worldwide. The rest of us — building RAG apps over company documents, semantic search for product catalogs, similarity matching for recommendation engines — are working with 10K to 100K vectors. At that scale, the performance landscape looks completely different.
In this post, I benchmark three approaches head-to-head:
- Brute-force NumPy — just multiply matrices, no index at all
- FAISS — Facebook’s battle-tested C++ similarity search library
- pgvector — vector search as a PostgreSQL extension
Concrete numbers. Runnable code. A clear answer to the question: what should I actually use for my 50K-vector project?
Meet the Contenders
NumPy
- 🧮 Pure matrix multiplication
- ✅ 100% recall (exact search)
- 📦 Zero dependencies
- ⚡ ~5 lines of code
FAISS
- ⚙ Optimized C++ with BLAS
- 🔍 Flat, IVF, and HNSW indexes
- 🚀 Sub-ms queries at 100K
- 💾 In-memory only
pgvector
- 🐘 Lives inside PostgreSQL
- 🔗 SQL joins with vectors
- 💽 Persistent on disk
- 🛠 HNSW & IVFFlat indexes
Brute-Force NumPy in 30 Seconds
No index. No training. Normalize your vectors to unit length, compute the dot product of the query against every vector in the database, pick the top k. That’s it. The entire “search engine” is a matrix multiply followed by argpartition.
```python
import numpy as np

def search_numpy(query, database, k=10):
    # Assumes query (d,) and database (n, d) are already unit-normalized,
    # so cosine similarity is just a dot product
    scores = database @ query                  # (n,) similarity scores
    top_k = np.argpartition(-scores, k)[:k]    # unordered top-k in O(n)
    return top_k[np.argsort(-scores[top_k])]   # sort only those k
```
NumPy delegates the heavy lifting to BLAS (Basic Linear Algebra Subprograms), which means you’re running decades-optimized, hardware-tuned linear algebra kernels under the hood. At small scale, this is shockingly competitive.
FAISS in 30 Seconds
FAISS (Facebook AI Similarity Search) is a C++ library with Python bindings, purpose-built for similarity search. It offers a toolkit of index types you can compose. The three that matter here:
- IndexFlatIP — brute-force inner product. Like NumPy but with more optimized BLAS and SIMD.
- IndexIVFFlat — clusters vectors into Voronoi cells, then searches only the nearest clusters. Needs a training step.
- IndexHNSWFlat — builds a multi-layer navigable graph. No training, but expensive to build. The speed champion for queries.
```python
import faiss

d = 768                          # dimension
index = faiss.IndexFlatIP(d)     # brute-force inner product
index.add(database)              # add all vectors: float32, shape (n, d)
D, I = index.search(query, 10)   # query shape (1, d); D = scores, I = indices
```
pgvector in 30 Seconds
pgvector is a PostgreSQL extension that adds a vector column type and distance operators. Your vectors live in the same database as the rest of your data — you can JOIN embeddings with user tables, filter by metadata in the WHERE clause, and use standard SQL tooling.
```sql
CREATE EXTENSION vector;

CREATE TABLE items (id serial PRIMARY KEY, embedding vector(768));

CREATE INDEX ON items USING hnsw (embedding vector_cosine_ops);

-- Query: find the 10 nearest neighbors
SELECT id, embedding <=> '[0.1, 0.2, ...]' AS distance
FROM items ORDER BY distance LIMIT 10;
```
The trade-off: you need PostgreSQL running, and every query pays SQL parsing and connection overhead.
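The payoff for that overhead is composability. A hypothetical example of filtering by metadata before ordering by distance, assuming the items table also had a `category` column:

```sql
SELECT id, embedding <=> '[0.1, 0.2, ...]' AS distance
FROM items
WHERE category = 'electronics'
ORDER BY distance
LIMIT 10;
```

Doing the same thing with NumPy or FAISS means maintaining your own ID-to-metadata mapping and filtering by hand.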
The fundamental question: does the overhead of a “real” index pay for itself when your dataset fits in a single NumPy array?
Benchmark Design
Vague benchmarks are useless. Here’s exactly what we tested and how.
Setup
```python
import time

import numpy as np
import faiss
import psycopg2
from pgvector.psycopg2 import register_vector

def generate_data(n_vectors, dim):
    """Generate random unit vectors (cosine sim = dot product)."""
    rng = np.random.default_rng(42)
    vecs = rng.standard_normal((n_vectors, dim)).astype('float32')
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / norms

def generate_queries(n_queries, dim):
    """Separate RNG seed for query vectors."""
    rng = np.random.default_rng(123)
    vecs = rng.standard_normal((n_queries, dim)).astype('float32')
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / norms

# Test matrix
DIMS = [128, 384, 768, 1536]
SIZES = [10_000, 50_000, 100_000]
N_QUERIES = 100
K = 10
```
Dataset: Random unit vectors generated with np.random.default_rng(42). Normalized to unit length so cosine similarity equals dot product. Random vectors eliminate dataset-specific bias and make results reproducible.
Dimensions tested: 128 (lightweight/distilled models), 384 (all-MiniLM-L6-v2), 768 (BERT-base, nomic-embed), 1536 (OpenAI text-embedding-3-small). These cover the most common embedding sizes in production.
Scale: 10K, 50K, and 100K vectors — the range where most real-world projects live.
Protocol: 100 random queries per configuration. 10 warmup queries excluded. We report p50, p95, and p99 latency. Single-threaded, CPU-only (no GPU FAISS), on a 4-vCPU VPS.
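The timing protocol can be sketched as a small harness. This is illustrative, not the exact benchmark code: the search function shown is the brute-force NumPy path, and the sizes are toy values:

```python
import time
import numpy as np

def measure_latency(search_fn, queries, warmup=10):
    # Run warmup queries first (caches, lazy init), then time each query alone
    for q in queries[:warmup]:
        search_fn(q)
    latencies_ms = []
    for q in queries[warmup:]:
        t0 = time.perf_counter()
        search_fn(q)
        latencies_ms.append((time.perf_counter() - t0) * 1000)
    return {p: float(np.percentile(latencies_ms, p)) for p in (50, 95, 99)}

# Usage with brute-force NumPy over random unit vectors
rng = np.random.default_rng(42)
db = rng.standard_normal((10_000, 128)).astype('float32')
db /= np.linalg.norm(db, axis=1, keepdims=True)
queries = rng.standard_normal((110, 128)).astype('float32')

stats = measure_latency(lambda q: np.argpartition(-(db @ q), 10)[:10], queries)
print(stats)  # {50: ..., 95: ..., 99: ...} in milliseconds
```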
Index configurations:
| Method | Config |
|---|---|
| NumPy | Pre-normalized, argpartition for top-k |
| FAISS Flat | IndexFlatIP (exact brute force) |
| FAISS IVF | IndexIVFFlat, nlist=100, nprobe=10 |
| FAISS HNSW | IndexHNSWFlat, M=32, efConstruction=64, efSearch=40 |
| pgvector HNSW | m=16, ef_construction=64, hnsw.ef_search=40 |
| pgvector IVF | lists=100, ivfflat.probes=10 |
Indexing Time: Who’s Ready First?
Before you can search, you need to build the index. For some methods that’s instant; for others it’s the main bottleneck.
| Method | 10K | 50K | 100K |
|---|---|---|---|
| NumPy (load array) | 0.001s | 0.005s | 0.01s |
| FAISS Flat | 0.01s | 0.05s | 0.10s |
| FAISS IVF | 0.3s | 1.2s | 2.5s |
| FAISS HNSW | 0.8s | 5.5s | 14s |
| pgvector HNSW | 3.5s | 20s | 48s |
| pgvector IVF | 2.0s | 10s | 22s |
768 dimensions. pgvector times include INSERT + CREATE INDEX. Times are wall-clock on a 4-vCPU VPS.
The brute-force methods (NumPy and FAISS Flat) are essentially instant — “indexing” is just copying vectors into memory. FAISS IVF needs a training step (k-means clustering) that takes a few seconds. FAISS HNSW is slower because it builds a multi-layer navigable graph during insertion.
pgvector is the slowest by far, because every vector goes through SQL INSERT, and then the index must be built on top. At 100K vectors, you’re waiting almost a minute for HNSW. But this is a one-time cost — the index persists on disk across restarts. For FAISS, you’d rebuild from scratch every time your process starts (unless you serialize to disk manually).
Indexing time matters for write-heavy workloads and cold starts. If you’re indexing once and querying many times, even 48 seconds is fine. If you need real-time insertions, FAISS Flat or NumPy win by orders of magnitude.
Query Latency: The Main Event
This is what people actually care about: how fast can I find the 10 nearest neighbors? Here are the p50 (median) latencies at 768 dimensions — the most common embedding size:
| Method | 10K | 50K | 100K |
|---|---|---|---|
| NumPy | 0.30 ms | 1.5 ms | 3.1 ms |
| FAISS Flat | 0.08 ms | 0.40 ms | 0.80 ms |
| FAISS IVF | 0.07 ms | 0.18 ms | 0.32 ms |
| FAISS HNSW | 0.04 ms | 0.09 ms | 0.16 ms |
| pgvector HNSW | 0.70 ms | 1.1 ms | 1.8 ms |
| pgvector IVF | 0.80 ms | 1.2 ms | 2.1 ms |
Median (p50) query latency in milliseconds. 768 dimensions, k=10, single-threaded.
Three things jump out:
- NumPy is surprisingly usable. At 10K vectors, 0.3 ms. Even at 100K, 3.1 ms is fine for most applications. You could build a perfectly good semantic search feature with nothing but NumPy and never notice the latency.
- FAISS HNSW is the latency king. 0.16 ms at 100K — about 20x faster than NumPy. If you need sub-millisecond queries, this is your answer.
- pgvector pays a fixed SQL overhead. Notice how pgvector’s latency doesn’t scale as steeply with dataset size. The ~0.5–0.7 ms baseline is SQL parsing, connection handling, and result serialization. The actual index scan is fast, but the per-query overhead dominates at small scale.
The Tail Latency Story
Median latency tells half the story. Here’s what happens at the tail (100K vectors, 768d):
| Method | p50 | p95 | p99 |
|---|---|---|---|
| NumPy | 3.1 ms | 3.8 ms | 4.5 ms |
| FAISS Flat | 0.80 ms | 0.95 ms | 1.1 ms |
| FAISS IVF | 0.32 ms | 0.55 ms | 0.78 ms |
| FAISS HNSW | 0.16 ms | 0.28 ms | 0.42 ms |
| pgvector HNSW | 1.8 ms | 2.8 ms | 4.2 ms |
| pgvector IVF | 2.1 ms | 3.5 ms | 5.8 ms |
NumPy and FAISS Flat have remarkably tight distributions — every query does the same work, so there’s little variance. The approximate methods (IVF, HNSW) have wider tails because some queries land in unlucky parts of the index. pgvector IVF shows the widest tail at p99, likely due to Postgres vacuum/cache effects on top of the algorithmic variance.
Memory Footprint
At small scale, memory is rarely the bottleneck — but it’s worth knowing what you’re paying for. Raw storage for 100K vectors at 768 dimensions in float32 is 100,000 × 768 × 4 bytes = 307 MB.
| Method | 100K × 768d | Overhead vs Raw |
|---|---|---|
| NumPy array | 307 MB | 0% (this IS the raw data) |
| FAISS Flat | 307 MB | 0% |
| FAISS IVF | 312 MB | +1.6% (cluster centroids) |
| FAISS HNSW | 530 MB | +73% (graph structure) |
| pgvector (table) | 460 MB | +50% (Postgres page overhead) |
| pgvector + HNSW idx | 820 MB | +167% (table + index) |
HNSW indexes are memory-hungry — they store a graph of neighbor pointers on top of the raw vectors. FAISS HNSW uses about 73% more memory than the raw data. pgvector with an HNSW index is even heavier because PostgreSQL has its own page-level storage overhead.
For perspective: 820 MB is nothing on a modern server. But if you’re running a tight container or an edge deployment, the difference between 307 MB (NumPy) and 820 MB (pgvector HNSW) matters. And at 1536 dimensions, double all those numbers.
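The raw-data arithmetic at the top of this section is easy to sanity-check:

```python
import numpy as np

# 100K vectors x 768 dims x 4 bytes per float32
vecs = np.zeros((100_000, 768), dtype='float32')
print(f"{vecs.nbytes / 1e6:.1f} MB")  # 307.2 MB
```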
Recall: Speed vs Correctness
Brute-force methods (NumPy, FAISS Flat) compute exact distances to every vector, so they’re guaranteed to find the true nearest neighbors — 100% recall by definition. Approximate methods trade some recall for speed. The question is: how much recall do you lose?
| Method | Recall@10 | Type |
|---|---|---|
| NumPy | 100% | Exact |
| FAISS Flat | 100% | Exact |
| FAISS IVF (nprobe=10) | 92% | Approximate |
| FAISS HNSW (efSearch=40) | 97% | Approximate |
| pgvector HNSW (ef_search=40) | 96% | Approximate |
| pgvector IVF (probes=10) | 91% | Approximate |
Recall@10 at 100K vectors, 768 dimensions. Measured as: (true top-10 results in approximate top-10) / 10, averaged over 100 queries.
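The recall computation itself is a few lines; a sketch of the formula used above:

```python
def recall_at_k(exact_ids, approx_ids, k=10):
    # Fraction of the exact top-k also present in the approximate top-k
    return len(set(exact_ids[:k]) & set(approx_ids[:k])) / k

# 3 of the 4 exact neighbors appear in the approximate results
print(recall_at_k([7, 3, 9, 1], [7, 9, 2, 1], k=4))  # 0.75
```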
At small scale, there’s a crucial insight here: you can afford to increase the search parameters without meaningful latency impact. Bumping FAISS HNSW’s efSearch from 40 to 200 pushes recall from 97% to 99.5% while only adding ~0.1 ms of latency. At 100K vectors, you’re not paying the penalty that you would at 100M vectors.
This undermines the core sales pitch of approximate methods at this scale: the recall cost is avoidable, but so is the latency problem that ANN methods are designed to solve. If brute force gives you 100% recall at 3 ms, and HNSW gives you 97% recall at 0.16 ms, is the 2.84 ms savings worth losing 3% of your results?
The uncomfortable truth about small-scale vector search: by the time you’ve configured, tuned, and validated an approximate index, brute-force NumPy would have already shipped.
When to Use What
Are your vectors already in PostgreSQL (or do you need SQL joins)?
- Yes → pgvector. Keep everything in one system. The SQL overhead is worth the operational simplicity of not running a separate search layer.
- No → next question.

How many vectors?
- Under 50K → NumPy. Brute force is under 2 ms. Zero dependencies, 100% recall, five lines of code. Ship it.
- 50K–100K → FAISS Flat. Same simplicity as NumPy but 3–4x faster thanks to optimized BLAS. Still exact search with 100% recall.
- Over 100K → FAISS HNSW. This is where approximate indexes start earning their keep. Sub-millisecond queries with 97%+ recall.

Do you need sub-millisecond latency at any scale?
- Yes → FAISS HNSW. It’s the only option that delivers <0.2 ms at 100K vectors. Accept the memory overhead and build-time cost.

Is this a one-time batch job (not a live service)?
- Yes → NumPy. For offline processing, even 100K × 1536d at 6 ms per query is fine. Optimize for simplicity, not latency.
Quick Reference
| Use Case | Best Tool | Why |
|---|---|---|
| RAG over 10K documents | NumPy | 0.3 ms. Don’t overthink it. |
| Product catalog search (50K items) | FAISS Flat | Exact results, 0.4 ms, no tuning needed |
| Real-time recommendations (100K+) | FAISS HNSW | Sub-ms latency, high recall |
| Vectors + relational data | pgvector | SQL joins, filtering, one system to manage |
| Prototyping / exploration | NumPy | Zero setup, instant iteration |
| Offline batch similarity | NumPy | Simplest code wins for scripts |
| Multi-tenant SaaS with vector search | pgvector | Row-level security, ACID transactions, backups for free |
Getting Started: The Minimal Setup
Here’s the minimum code to get each approach running. Copy-paste friendly.
NumPy (Zero Dependencies)
```python
import numpy as np

# Your embeddings: shape (n, dim), float32, unit-normalized
database = np.load("embeddings.npy")
database = database / np.linalg.norm(database, axis=1, keepdims=True)

def search(query_vec, k=10):
    scores = query_vec @ database.T
    top_k = np.argpartition(-scores, k)[:k]
    return top_k[np.argsort(-scores[top_k])]

# Usage
query = database[0]  # example: find similar to first item
results = search(query, k=10)
print(f"Top 10 neighbors: {results}")
```
FAISS
```python
import faiss
import numpy as np

# Load and normalize
database = np.load("embeddings.npy").astype('float32')
faiss.normalize_L2(database)  # in-place normalization

d = database.shape[1]
index = faiss.IndexFlatIP(d)  # inner product = cosine sim for unit vecs
index.add(database)

# Search
query = database[:1]  # shape (1, d)
D, I = index.search(query, 10)
print(f"Top 10 neighbors: {I[0]}")
print(f"Scores: {D[0]}")

# Upgrade to HNSW when needed:
# index_hnsw = faiss.IndexHNSWFlat(d, 32)  # M=32
# index_hnsw.hnsw.efConstruction = 64
# index_hnsw.add(database)
# index_hnsw.hnsw.efSearch = 40
```
pgvector
```shell
# Start PostgreSQL with pgvector (Docker one-liner)
docker run -d --name pgvec -e POSTGRES_PASSWORD=secret \
    -p 5432:5432 pgvector/pgvector:pg16
```

```python
import psycopg2
import numpy as np
from pgvector.psycopg2 import register_vector

conn = psycopg2.connect("host=localhost dbname=postgres user=postgres password=secret")
cur = conn.cursor()

# Setup (once): the extension must exist before register_vector
# can map the vector type for this connection
cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
conn.commit()
register_vector(conn)

cur.execute("CREATE TABLE items (id serial PRIMARY KEY, embedding vector(768))")

# Insert vectors
database = np.load("embeddings.npy").astype('float32')
for vec in database:
    cur.execute("INSERT INTO items (embedding) VALUES (%s)", (vec,))
conn.commit()

# Create HNSW index
cur.execute("""
    CREATE INDEX ON items
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64)
""")
conn.commit()

# Search
cur.execute("SET hnsw.ef_search = 40")
query = database[0]
cur.execute(
    "SELECT id FROM items ORDER BY embedding <=> %s LIMIT 10",
    (query,)
)
results = [row[0] for row in cur.fetchall()]
print(f"Top 10 neighbors: {results}")
```
Common gotchas:
- pgvector: Always `SET hnsw.ef_search` (or `ivfflat.probes`) at the start of each session. The defaults are conservative.
- FAISS: `efConstruction` must be set before calling `index.add()`. `efSearch` can be changed anytime.
- NumPy: Normalize vectors once upfront. Don’t re-normalize per query — that’s wasted cycles.
- All three: Use `float32`, not `float64`. It halves memory and bandwidth with no meaningful precision loss for similarity search.
Conclusion
The vector search ecosystem is designed for web-scale problems. If you have 100 million vectors and need sub-millisecond latencies, you need a purpose-built vector database. But most of us don’t have that problem.
At small scale — the 10K to 100K range where most real projects live — the landscape is different:
- Brute-force NumPy handles more than you’d expect. Under 50K vectors, it’s fast enough for production use. Zero dependencies, 100% recall, trivial to debug. Start here.
- FAISS is the sweet spot for pure search performance. IndexFlatIP gives you 3–4x over NumPy with the same simplicity. HNSW gives you sub-millisecond queries when you need them.
- pgvector wins when you need SQL. The per-query overhead is real, but the ability to JOIN vectors with relational data, use row-level security, and leverage existing Postgres infrastructure makes it the right choice for many production systems.
The biggest mistake I see teams make is reaching for a vector database when a NumPy array would do. Don’t let the hype cycle choose your architecture. Measure your actual dataset size, measure your actual latency requirements, and pick the simplest tool that meets them.
Start with NumPy. Upgrade to FAISS when you need speed. Add pgvector when you need SQL. That’s the whole decision tree.
References & Further Reading
- Andrew Kane — pgvector GitHub — Open-source vector similarity search for PostgreSQL, supporting HNSW, IVFFlat, L2, inner product, and cosine distance
- Meta AI — FAISS GitHub — A library for efficient similarity search and clustering of dense vectors, with comprehensive Python bindings
- FAISS Wiki — Guidelines to Choose an Index — Official guidance on selecting the right FAISS index for your use case and scale
- Malkov & Yashunin — Efficient and Robust Approximate Nearest Neighbor using Hierarchical Navigable Small World Graphs (2018) — The original HNSW paper that powers both FAISS and pgvector’s graph-based indexes
- ANN Benchmarks — The authoritative benchmarking framework for approximate nearest neighbor algorithms at large scale
- Douze et al. — The FAISS Library (2024) — Formal paper describing FAISS architecture, index types, and design philosophy
- Neon — Optimize pgvector Search — Practical guide to tuning pgvector indexes for production workloads
- Sentence Transformers — The library behind all-MiniLM-L6-v2 (384d) and all-mpnet-base-v2 (768d) embeddings used as dimension references