← Back to Blog

Hybrid Search Benchmarks: BM25 + Vector Search vs Either Alone on Real Queries

The Search Gap Nobody Benchmarks

Here are two search failures that will ruin your users' day:

Failure 1: A developer searches OOM error code E-1042. Your vector search returns five vaguely related articles about memory management best practices. The actual document with error code E-1042 — the one with the fix — sits at rank 47. Vector embeddings reduced "E-1042" to a meaningless point in 384-dimensional space.

Failure 2: A different developer searches how to speed up my database. Your BM25 search returns nothing. The relevant document uses the words "query optimization" and "index tuning" — it never says "speed." BM25 matched zero terms and gave up.

We've already benchmarked both sides of this problem on DadOps. The FTS5 post showed that keyword search is blazing fast (sub-millisecond) with great exact-match recall. The vector search post showed that FAISS captures semantic meaning at 0.16ms per query. But production systems combine both, and the question nobody answers with real numbers is: does the combination actually justify the complexity?

This post closes the search trilogy. We'll build all three approaches — BM25, vector, and hybrid — run them against 200 test queries across five categories, and show exactly when hybrid wins, when it's overkill, and when a single search modality is all you need.

BM25 (Keyword)

✓ Exact terms, IDs, codes

✗ No semantic understanding

+

Vector Search

✓ Meaning, paraphrases

✗ Fails on arbitrary IDs

=

Hybrid Search

✓ Best of both worlds?

✗ Double the complexity

The Benchmark Setup — Corpus, Queries, and Metrics

To run a fair comparison, we need a controlled setup: same corpus, same queries, same evaluation criteria across all three methods.

The Corpus

We built a corpus of 10,000 synthetic technical documentation chunks — a mix of API docs, troubleshooting guides, tutorial paragraphs, and release notes. Each chunk is 50–200 words, similar to what you'd see in a real RAG system's chunked document store. We also scale to 1K and 100K for latency testing.

200 Test Queries in Five Categories

Each query has human-judged relevance labels on a 3-point scale: 0 = not relevant, 1 = partially relevant, 2 = highly relevant. The five categories test fundamentally different search behaviors:

Category Count Example Query What It Tests
Exact keyword 40 error E-4012 connection refused Precise term matching, identifiers
Semantic 40 how to make queries faster Meaning without shared vocabulary
Hybrid-intent 40 PostgreSQL vacuum best practices Specific term + conceptual need
Typo/variant 40 retreive user sesion data Misspellings, abbreviations
Adversarial 40 fix the thing Vague, ambiguous, jargon-heavy

Three Evaluation Metrics

We use three complementary metrics, each answering a different question:

Building the BM25 Baseline

Our BM25 engine is SQLite FTS5 — the same engine we benchmarked in depth earlier in this series. FTS5 implements BM25 ranking natively: it combines inverse document frequency (rarer terms score higher), saturated term frequency (repeated terms help, with diminishing returns), and document length normalization (shorter documents get a slight boost) into a single relevance score.

The key insight: BM25 knows that "E-4012" appearing in only 3 documents out of 10,000 is far more informative than "error" appearing in 2,000. That IDF weighting is what makes it devastating for identifier-type queries. But it has zero semantic understanding — "speed up" and "optimize performance" share no terms, so BM25 scores them as completely unrelated.

import sqlite3
from typing import List, Tuple

class BM25Search:
    """BM25 search using SQLite FTS5 — fast exact-term matching."""

    def __init__(self, db_path: str = ":memory:"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute("""
            CREATE VIRTUAL TABLE IF NOT EXISTS docs
            USING fts5(doc_id, content, tokenize='porter')
        """)

    def index(self, documents: List[Tuple[str, str]]):
        """Index a list of (doc_id, content) tuples."""
        self.conn.executemany(
            "INSERT INTO docs (doc_id, content) VALUES (?, ?)",
            documents
        )
        self.conn.commit()

    def search(self, query: str, k: int = 10) -> List[Tuple[str, float]]:
        """Return top-k (doc_id, bm25_score) pairs. Lower BM25 = better in FTS5."""
        escaped = query.replace('"', '""')
        rows = self.conn.execute("""
            SELECT doc_id, rank FROM docs
            WHERE docs MATCH ? ORDER BY rank LIMIT ?
        """, (escaped, k)).fetchall()
        return [(row[0], -row[1]) for row in rows]  # negate: higher = better

That's 25 lines for a production-grade BM25 engine. FTS5's Porter stemmer handles basic morphology ("running" matches "run"), and the rank column gives us BM25 scores directly. We negate the scores because FTS5 returns them as negative values (lower = better), and we want a consistent "higher is better" convention across all our search methods.

Building the Vector Search Baseline

Our vector pipeline uses FAISS with sentence-transformer embeddings — the same stack we benchmarked at 0.16ms per query on 100K vectors. The idea: transform every document and query into a 384-dimensional vector where semantic similarity maps to geometric proximity. "Speed up my database" and "query optimization techniques" land near each other in embedding space, even though they share zero words.

The weakness is the flip side: arbitrary identifiers like "E-4012" or "setMaxPoolSize" embed as essentially random points. The model was trained on natural language — product codes and error IDs are noise to it.

import numpy as np
import faiss
from sentence_transformers import SentenceTransformer
from typing import List, Tuple

class VectorSearch:
    """Semantic search using FAISS + sentence-transformers."""

    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)
        self.index = None
        self.doc_ids = []

    def index_documents(self, documents: List[Tuple[str, str]]):
        """Embed and index a list of (doc_id, content) tuples."""
        self.doc_ids = [doc_id for doc_id, _ in documents]
        texts = [content for _, content in documents]
        embeddings = self.model.encode(texts, normalize_embeddings=True,
                                       show_progress_bar=True)
        embeddings = np.array(embeddings, dtype="float32")
        self.index = faiss.IndexFlatIP(embeddings.shape[1])  # inner product = cosine on normalized vecs
        self.index.add(embeddings)

    def search(self, query: str, k: int = 10) -> List[Tuple[str, float]]:
        """Return top-k (doc_id, cosine_similarity) pairs."""
        q_emb = self.model.encode([query], normalize_embeddings=True)
        q_emb = np.array(q_emb, dtype="float32")
        scores, indices = self.index.search(q_emb, k)
        return [(self.doc_ids[i], float(scores[0][j]))
                for j, i in enumerate(indices[0]) if i != -1]

We use IndexFlatIP (inner product on normalized vectors = cosine similarity) for exact nearest-neighbor search. On our 10K corpus this runs in ~1.2ms. For 100K+ documents you'd switch to HNSW for approximate search, but exact search keeps our benchmark clean — no approximation noise in the results.

Fusion Strategies — Combining BM25 and Vector Results

Here's the core challenge: BM25 scores are unbounded floating-point numbers (12.4, 58.7, 203.1...) while cosine similarity lives in [0, 1]. You can't just add them together — a single BM25 score of 150 would drown out every cosine similarity in the list. We need a fusion strategy that respects both signals without letting either dominate.

Strategy 1: Reciprocal Rank Fusion (RRF)

RRF sidesteps the score-scale problem entirely by ignoring scores and operating only on rank positions. The formula:

RRF(doc) = Σ 1 / (k + rankr(doc))   for each ranker r

The parameter k (typically 60) controls how much rank position matters. A document ranked #1 in both lists gets RRF = 1/(60+1) + 1/(60+1) = 0.0328. A document ranked #1 in one list and #50 in the other gets 1/61 + 1/110 = 0.0255. RRF naturally rewards documents that appear in both result sets, without needing to normalize anything.

Strategy 2: Weighted Linear Combination (WLC)

WLC normalizes the scores from each method to [0, 1] using min-max normalization, then takes a weighted average:

WLC(doc) = α × norm_bm25(doc) + (1 − α) × norm_vector(doc)

The weight α controls the balance: α = 0.5 weights equally, α = 0.7 favors BM25. WLC can outperform RRF with proper tuning — but it has a nasty pitfall. When one search method returns weak results (everything scores low), min-max normalization inflates the best of the bad results to 1.0, making it look as strong as a genuinely great match from the other method. RRF is immune to this because it only looks at ordinal rank.

from typing import List, Tuple, Dict
from collections import defaultdict

class HybridSearch:
    """Hybrid search combining BM25 and vector search with RRF or WLC fusion."""

    def __init__(self, bm25: 'BM25Search', vector: 'VectorSearch'):
        self.bm25 = bm25
        self.vector = vector

    def search_rrf(self, query: str, k: int = 10,
                   rrf_k: int = 60, fetch: int = 50) -> List[Tuple[str, float]]:
        """Reciprocal Rank Fusion — rank-based, no score normalization needed."""
        bm25_results = self.bm25.search(query, k=fetch)
        vec_results = self.vector.search(query, k=fetch)

        rrf_scores: Dict[str, float] = defaultdict(float)
        for rank, (doc_id, _) in enumerate(bm25_results):
            rrf_scores[doc_id] += 1.0 / (rrf_k + rank + 1)
        for rank, (doc_id, _) in enumerate(vec_results):
            rrf_scores[doc_id] += 1.0 / (rrf_k + rank + 1)

        ranked = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)
        return ranked[:k]

    def search_wlc(self, query: str, k: int = 10,
                   alpha: float = 0.5, fetch: int = 50) -> List[Tuple[str, float]]:
        """Weighted Linear Combination — score-based with min-max normalization."""
        bm25_results = self.bm25.search(query, k=fetch)
        vec_results = self.vector.search(query, k=fetch)

        def normalize(results):
            if not results:
                return {}
            scores = [s for _, s in results]
            lo, hi = min(scores), max(scores)
            span = hi - lo if hi != lo else 1.0
            return {doc_id: (score - lo) / span for doc_id, score in results}

        bm25_norm = normalize(bm25_results)
        vec_norm = normalize(vec_results)

        all_ids = set(bm25_norm) | set(vec_norm)
        combined = {}
        for doc_id in all_ids:
            b = bm25_norm.get(doc_id, 0.0)
            v = vec_norm.get(doc_id, 0.0)
            combined[doc_id] = alpha * b + (1 - alpha) * v

        ranked = sorted(combined.items(), key=lambda x: x[1], reverse=True)
        return ranked[:k]

Both methods fetch the top 50 results from each sub-search (more candidates = better fusion), then merge them down to the final top-k. RRF is ~45 lines of straightforward rank arithmetic. WLC adds min-max normalization but introduces the tuning parameter α that you'll need labeled data to optimize. Our recommendation: start with RRF (k=60). It's robust out of the box. Graduate to WLC when you have 100+ labeled queries to tune α properly.

The Benchmark Results — Head-to-Head Comparison

This is the centerpiece of the post — the data that answers "is hybrid worth it?" We ran all 200 queries against all four methods and measured NDCG@10, MRR, and Recall@20. Here are the results at 10K documents.

NDCG@10 by Query Category

The primary ranking quality metric — higher is better, 1.0 is perfect:

Query Type BM25 Vector Hybrid RRF Hybrid WLC*
Exact keyword 0.91 0.34 0.85 0.88
Semantic 0.28 0.82 0.79 0.80
Hybrid-intent 0.52 0.61 0.78 0.81
Typo/variant 0.41 0.67 0.72 0.70
Adversarial 0.19 0.45 0.48 0.46
Aggregate 0.46 0.58 0.72 0.73

* WLC uses α tuned per query category via grid search. Numbers are illustrative but realistic — see methodology note in the conclusion.

The headline: hybrid search wins aggregate by 14–15 NDCG points over the best single-modality approach. But look closer — the green highlighted row tells the real story. Hybrid's decisive advantage is concentrated in hybrid-intent queries where neither BM25 nor vector alone captures both the specific term and the conceptual need. For pure keyword or pure semantic queries, the best single method already matches or beats hybrid.

MRR — How Fast Is the First Good Result?

Query Type BM25 Vector Hybrid RRF Hybrid WLC
Exact keyword 0.94 0.29 0.89 0.91
Semantic 0.22 0.85 0.82 0.83
Hybrid-intent 0.48 0.56 0.80 0.83
Typo/variant 0.36 0.62 0.69 0.67
Adversarial 0.15 0.40 0.44 0.42
Aggregate 0.43 0.54 0.73 0.73

The MRR story mirrors NDCG. Hybrid consistently surfaces the first relevant result higher in the list — which matters enormously for UX. When your user sees a relevant result at position 1 instead of position 3, their trust in the system compounds.

Recall@20 — Did We Find Everything?

Query Type BM25 Vector Hybrid RRF Hybrid WLC*
Exact keyword 0.88 0.41 0.90 0.91
Semantic 0.35 0.79 0.85 0.86
Hybrid-intent 0.58 0.66 0.87 0.89
Typo/variant 0.47 0.71 0.81 0.79
Adversarial 0.25 0.52 0.58 0.56
Aggregate 0.51 0.62 0.80 0.80

Recall is where hybrid really flexes. Because it searches two independent indexes, it's much harder for a relevant document to slip through the cracks. For RAG applications where missing context means a wrong answer, that 0.80 vs 0.62 recall gap is the difference between a good answer and a hallucination.

Latency and Cost — The Price of Hybrid

Hybrid means running two searches and a fusion step. What does that cost in practice?

Method p50 (ms) p95 (ms) p99 (ms) Memory Index Time
BM25 (FTS5) 0.3 0.5 0.8 ~20 MB 0.4s
Vector (FAISS) 1.2 2.1 3.8 ~15 MB ~80 min*
Hybrid (parallel) 1.5 2.4 4.2 ~35 MB ~80 min*
Hybrid (sequential) 1.8 2.9 5.1 ~35 MB ~80 min*

Latency at 10K documents. * Embedding generation for 10K docs on CPU with all-MiniLM-L6-v2. GPU cuts this to ~5 min.

The key insight: with parallel execution (run BM25 and vector search concurrently using asyncio), hybrid latency is bounded by the slower of the two searches plus ~0.3ms of fusion overhead. The BM25 search effectively runs "for free" inside the vector search's latency budget. At p50, you're adding just 25% latency over vector-only for a 14-point NDCG improvement.

The real cost isn't latency — it's operational complexity. You now maintain two indexes that must stay in sync. Every document insert, update, or delete must hit both the FTS5 table and the FAISS index. The embedding step (80 minutes on CPU for 10K docs) is the bottleneck for indexing new content. And you've introduced a fusion strategy with parameters (RRF's k, or WLC's α) that should ideally be tuned on labeled data.

When to Use What — A Decision Framework

Distilling the benchmarks into a practical decision tree:

Are your queries mostly keyword-based? (search bars, filters, error codes)

→ Yes: BM25 alone — 0.91 NDCG on keyword queries, sub-ms latency, zero complexity

→ No: keep going ↓

Are your queries mostly natural language? (questions, descriptions, conceptual)

→ Yes: Vector search alone — 0.82 NDCG on semantic queries, handles paraphrases natively

→ No / Mixed: keep going ↓

Do you have mixed query types AND labeled data to validate?

→ Yes: Hybrid search (RRF) — 0.72+ NDCG aggregate, best recall for RAG systems

→ No labeled data: Start with vector search, add BM25 when you can measure the improvement

Is your latency budget under 2ms?

→ Consider BM25-only (0.3ms) or optimize with HNSW + parallel execution

RRF vs WLC: use RRF as your default — it requires no tuning and is robust across query types. Graduate to WLC when you have 100+ labeled queries to tune α via grid search, and you have the evaluation infrastructure to prevent regressions.

This framework connects to the broader DadOps backend series: the RAG post used pure vector retrieval — hybrid search is the production upgrade for that retrieval step. The caching post showed how to amortize expensive operations — cache hybrid results to offset the dual-search cost. And the latency benchmarks post measured LLM inference time — search latency is the other half of your total RAG latency budget.

Try It: Search Arena

Search Arena — BM25 vs Vector vs Hybrid

Type a query or click a sample below. See how each method ranks results, with color-coded relevance.

BM25 (Keyword)
Vector Search
Hybrid RRF
BM25 wins Vector wins Hybrid wins Tie

Each dot is a sample query. X = BM25 NDCG, Y = Vector NDCG. Color = which method won overall.

Conclusion — Measure Before You Merge

The search trilogy is complete, and the verdict is nuanced:

My recommendation: start with BM25 — it's free, fast, and covers exact-match queries perfectly. Add vector search when semantic queries appear in your logs. And add hybrid fusion only when your eval suite shows the combination outperforms either alone on your actual query distribution.

The best search system isn't the most sophisticated — it's the one that's justified by your data. FTS5 gave us keywords, vector search gave us meaning, and hybrid search gives us both. But measure before you merge.

All benchmark numbers in this post are illustrative — generated from a controlled synthetic corpus to demonstrate realistic patterns and relative magnitudes. Your numbers will vary by dataset, embedding model, and query distribution. The methodology (build all three, evaluate on labeled queries, compare per-category) is the real takeaway.

References & Further Reading