
Retrieval Reranking: Making RAG Actually Good

The Precision Gap Nobody Talks About

You built a RAG pipeline. Your retrieval finds relevant documents. Your LLM generates reasonable answers. But something feels off — the answers are technically correct but slightly wrong, pulling from documents that are close to what you asked but not exactly what you need.

Here are two failure scenarios I see constantly:

Failure 1: You ask "How does Python's decorator pattern differ from Java's?" and vector search dutifully returns 5 documents about Python decorators. They're semantically close — same words, same concept space — but none of them discuss the comparison you actually asked about. The answer your LLM generates is a generic decorator tutorial instead of the cross-language comparison you wanted.

Failure 2: You ask about "cache invalidation strategies" and BM25 returns 5 documents that literally contain those words. But the most relevant document in your corpus — a detailed guide about TTL expiry policies and stale data detection — sits at position #47 because it uses different terminology.

The root cause is the same in both cases: retrieval optimizes for recall (cast a wide net, find everything potentially relevant), but RAG needs precision (put the best 3–5 chunks at the very top). There's a gap between "found 20 candidates" and "here are the best 3." That gap is called reranking.

In this post, we'll build three reranking approaches from scratch in Python, benchmark them head-to-head, and figure out which one to use when. If you've read the hybrid search benchmarks or the RAG pipeline posts, this is the missing quality layer that turns a decent system into a genuinely good one.

Bi-Encoders vs Cross-Encoders

To understand why reranking works, you need to understand a fundamental architectural tradeoff in how transformers judge relevance.

Bi-encoders (used in vector search) encode the query and each document separately into fixed-size vectors. Relevance is computed as the cosine similarity between two points in embedding space. This is fast because you can precompute all your document embeddings — at query time you only encode the query once, then do cheap vector math. But it's lossy: compressing an entire document into a single 768-dimensional vector throws away nuance. The query and document never "see" each other during encoding.
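The bi-encoder side is worth seeing concretely. Here's a minimal sketch with toy 4-dimensional vectors standing in for real 768-dimensional embeddings — the point is that document vectors are precomputed once, and query time is a single matrix-vector product:

```python
import numpy as np

# Toy precomputed document embeddings (in practice these come from a
# bi-encoder model; 4 dims stand in for 768)
doc_embeddings = np.array([
    [0.9, 0.1, 0.0, 0.2],   # doc about caching
    [0.1, 0.8, 0.3, 0.0],   # doc about decorators
    [0.7, 0.2, 0.1, 0.3],   # doc about TTL expiry
])

def cosine_scores(query_vec: np.ndarray, doc_matrix: np.ndarray) -> np.ndarray:
    """Score a query embedding against all precomputed doc embeddings."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    return d @ q  # one matrix-vector product scores every document

query_vec = np.array([0.8, 0.1, 0.1, 0.2])
scores = cosine_scores(query_vec, doc_embeddings)
ranking = np.argsort(-scores)  # best-first document indices
```

This is why bi-encoders scale: the per-document work happened at index time, and nothing about the query ever touches the document text itself — which is exactly the lossiness the next section's cross-encoder fixes.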

Cross-encoders take a completely different approach. They concatenate the query and document into a single input — [CLS] query [SEP] document [SEP] — and process them together through every transformer layer. This enables full token-level cross-attention: every word in the query can attend to every word in the document. The result is dramatically more accurate relevance scores.

Think of it this way: a bi-encoder is like comparing two book summaries. A cross-encoder is like reading both books side by side and deciding which one answers your question.

The numbers tell the story. A bi-encoder scores a query against precomputed embeddings in roughly 1ms. A cross-encoder processes each query-document pair in 6–12ms on a GPU. That sounds fast until you realize you'd need to score every document: 100,000 documents × 12ms = 20 minutes per query. Completely impractical for initial retrieval.

But what if you only had to score 20–50 candidates? Then cross-encoding takes 120–600ms — totally acceptable. This is the two-stage retrieval pattern:

  1. Stage 1 — Fast retrieval: Use BM25, vector search, or hybrid to grab the top 50–200 candidates. Optimize for recall.
  2. Stage 2 — Accurate reranking: Use a cross-encoder (or other reranker) to re-score and reorder the candidates. Optimize for precision.

This pattern is how every serious production search and RAG system works. Let's build it.

Cross-Encoder Reranking from Scratch

Our first reranker uses a cross-encoder model from the sentence-transformers library. The workhorse model for this is cross-encoder/ms-marco-MiniLM-L-6-v2 — trained on the MS MARCO passage ranking dataset, small enough to be fast, accurate enough to be useful.

The implementation is refreshingly simple. For each candidate document, we form a pair (query, document), pass all pairs through the model in a batch, and sort by the resulting scores:

from sentence_transformers import CrossEncoder
from dataclasses import dataclass

@dataclass
class RankedDoc:
    doc_id: str
    text: str
    score: float
    original_rank: int

class CrossEncoderReranker:
    def __init__(self, model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"):
        self.model = CrossEncoder(model_name)

    def rerank(self, query: str, documents: list[dict],
               top_k: int = 10) -> list[RankedDoc]:
        """Rerank documents using cross-encoder relevance scoring.

        Args:
            query: The search query string
            documents: List of dicts with 'id' and 'text' keys
            top_k: Number of top results to return
        """
        # Form query-document pairs for the cross-encoder
        pairs = [(query, doc["text"]) for doc in documents]

        # Score all pairs in a single batch — the model handles
        # tokenization, padding, and forward pass internally
        scores = self.model.predict(pairs)

        # Combine scores with original metadata
        ranked = [
            RankedDoc(
                doc_id=doc["id"],
                text=doc["text"],
                score=float(score),
                original_rank=i + 1,
            )
            for i, (doc, score) in enumerate(zip(documents, scores))
        ]

        # Sort by cross-encoder score (descending) and return top_k
        ranked.sort(key=lambda r: r.score, reverse=True)
        return ranked[:top_k]

# Usage
reranker = CrossEncoderReranker()
candidates = [
    {"id": "doc_1", "text": "TTL-based expiry removes stale cache entries..."},
    {"id": "doc_2", "text": "Cache invalidation is one of the hard problems..."},
    {"id": "doc_3", "text": "Python decorators provide syntactic sugar..."},
    # ... 20 candidates from initial retrieval
]
results = reranker.rerank("cache invalidation strategies", candidates, top_k=5)

The critical thing happening here is that predict() feeds each (query, document) pair through the transformer together. The cross-attention layers see the query and document tokens simultaneously, so the model can match "cache invalidation" to "stale cache entries" and "TTL-based expiry" — connections that cosine similarity on separate embeddings would miss.

How fast is this in practice? Here's the latency scaling as you increase the candidate set:

| Candidates | Latency (GPU) | Latency (CPU) |
|---|---|---|
| 10 documents | ~59ms | ~210ms |
| 20 documents | ~120ms | ~410ms |
| 50 documents | ~300ms | ~980ms |
| 100 documents | ~740ms | ~2.1s |

The sweet spot for most applications is reranking 20–50 candidates. Beyond 50, latency climbs steeply, and the quality gains from adding more candidates plateau.

Model choice matters too. MiniLM-L6 isn't the only option:

| Model | Params | Latency (50 docs) | NDCG@10 |
|---|---|---|---|
| FlashRank/TinyBERT | 4.4M | ~45ms | 0.68 |
| MiniLM-L-6-v2 | 22.7M | ~300ms | 0.74 |
| BGE-reranker-v2-m3 | 568M | ~1.4s | 0.79 |

MiniLM-L6 is the Goldilocks choice — good accuracy at reasonable latency. If you're CPU-only and latency-constrained, FlashRank is worth a look. If accuracy is paramount and you have GPU budget, BGE-reranker-v2-m3 is the heavy hitter.

LLM-as-Reranker: Three Approaches

Cross-encoders are purpose-built for ranking, but what about using a general-purpose LLM? After all, models like Claude and GPT-5 have deep language understanding — shouldn't they be great at judging relevance?

There are three ways to do LLM-based reranking, each with different tradeoffs.

Pointwise: Score Each Document Independently

The simplest approach — ask the LLM to rate each document's relevance on a numeric scale:

import json
from openai import OpenAI

class PointwiseReranker:
    def __init__(self, model="gpt-4o-mini"):
        self.client = OpenAI()
        self.model = model

    def rerank(self, query: str, documents: list[dict],
               top_k: int = 10) -> list[RankedDoc]:
        scored = []
        for i, doc in enumerate(documents):
            prompt = (
                f"Rate the relevance of this document to the query.\n\n"
                f"Query: {query}\n\n"
                f"Document: {doc['text'][:500]}\n\n"
                f"Reply with ONLY a JSON object: "
                f'{{"score": <float between 0 and 1>, "reason": "<one short sentence>"}}'
            )
            resp = self.client.chat.completions.create(
                model=self.model,
                messages=[{"role": "user", "content": prompt}],
                temperature=0,
                max_tokens=80,
            )
            try:
                result = json.loads(resp.choices[0].message.content)
                score = float(result["score"])
            except (json.JSONDecodeError, KeyError, ValueError):
                score = 0.0

            scored.append(RankedDoc(
                doc_id=doc["id"], text=doc["text"],
                score=score, original_rank=i + 1,
            ))

        scored.sort(key=lambda r: r.score, reverse=True)
        return scored[:top_k]

Pointwise scoring is easy to implement but has a fundamental flaw: each document is scored in isolation. The LLM has no way to calibrate scores against other candidates — it might give a 0.8 to a mediocre document simply because it looks somewhat relevant in a vacuum. Also, you're making one API call per document, so reranking 20 candidates means 20 round-trips.

Listwise: Rank All Candidates at Once

A better approach packs all candidates into a single prompt and asks the LLM to rank them:

class ListwiseReranker:
    def __init__(self, model="gpt-4o-mini"):
        self.client = OpenAI()
        self.model = model

    def rerank(self, query: str, documents: list[dict],
               top_k: int = 10) -> list[RankedDoc]:
        # Build a numbered list of document snippets
        doc_list = "\n".join(
            f"[{i+1}] {doc['text'][:200]}"
            for i, doc in enumerate(documents)
        )
        prompt = (
            f"Given this query, rank the documents by relevance.\n\n"
            f"Query: {query}\n\n"
            f"Documents:\n{doc_list}\n\n"
            f"Return ONLY a JSON array of document numbers in order "
            f"from most to least relevant, e.g. [3, 1, 7, 2, ...]. "
            f"Include ALL document numbers."
        )
        resp = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
            max_tokens=200,
        )
        try:
            ranking = json.loads(resp.choices[0].message.content)
            # Convert 1-indexed ranking to results
            results = []
            for rank_pos, doc_num in enumerate(ranking):
                idx = doc_num - 1
                if 0 <= idx < len(documents):
                    results.append(RankedDoc(
                        doc_id=documents[idx]["id"],
                        text=documents[idx]["text"],
                        score=1.0 - (rank_pos / len(ranking)),
                        original_rank=idx + 1,
                    ))
            return results[:top_k]
        except (json.JSONDecodeError, TypeError):
            # Fallback: return original order
            return [
                RankedDoc(doc["id"], doc["text"], 0.0, i + 1)
                for i, doc in enumerate(documents[:top_k])
            ]

Listwise is better because the LLM can compare documents against each other. One API call handles the entire candidate set. The downsides: you're limited by the model's context window (20 documents at 200 chars each is fine, 100 full documents is not), and the model can occasionally hallucinate document numbers that don't exist or skip some.

Pairwise: Compare Documents Two at a Time

The third approach — pairwise — compares documents two at a time: "Is document A or document B more relevant to this query?" You aggregate all pairwise comparisons to produce a global ranking. It's the most accurate of the three LLM approaches because relative comparison is cognitively easier than absolute scoring. The catch? For n documents, you need n(n−1)/2 comparisons. That's 190 API calls for just 20 documents. At ~200ms per call, you're looking at 40+ seconds of latency and roughly $22 per thousand queries — about 60× the cost of listwise. I'm skipping the implementation — the math speaks for itself. Pairwise reranking is fascinating for research but wildly impractical for production.
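The quadratic blow-up is easy to verify for yourself. A quick sketch (the ~200ms-per-call figure is the rough estimate from above, not a measured constant):

```python
def pairwise_cost(n_docs: int, ms_per_call: float = 200.0) -> tuple[int, float]:
    """Pairwise comparisons needed for n docs, and sequential latency in seconds."""
    comparisons = n_docs * (n_docs - 1) // 2
    return comparisons, comparisons * ms_per_call / 1000

for n in (10, 20, 50):
    calls, seconds = pairwise_cost(n)
    print(f"{n} docs -> {calls} calls, ~{seconds:.0f}s sequential")
```

At 50 candidates you'd be making 1,225 API calls per query — batching and parallelism soften the latency, but not the cost.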

The Verdict on LLM Reranking

Here's the uncomfortable truth: purpose-built cross-encoders handily beat general-purpose LLMs at reranking. A 2025 study by Voyage AI found that cross-encoders are roughly 48× faster and 15% more accurate than GPT-4-class models on standard ranking benchmarks. The LLM's broad language understanding doesn't compensate for the cross-encoder's focused training on millions of relevance judgments.

| Approach | NDCG@5 | Latency (20 docs) | Cost / 1K queries | Best For |
|---|---|---|---|---|
| Cross-Encoder | 0.76 | 120ms | $0 (local) | General use |
| Pointwise LLM | 0.69 | 8–12s | $2.40 | Explainability |
| Listwise LLM | 0.72 | 1.5–3s | $0.35 | Top-5 final stage |
| Pairwise LLM | 0.74 | 40–60s | $22.00 | Research only |

When is LLM reranking worth it? Three scenarios: (1) you need the LLM to explain why each result is relevant (pointwise gives you that for free), (2) the query requires complex multi-hop reasoning that cross-encoders can't handle, or (3) as a third-stage reranker over the final 5 candidates where cost and latency are bounded.

Lightweight Feature-Based Reranking

Not every system has a GPU. Not every query can afford 300ms of reranking latency. For these cases, there's a surprisingly effective "poor man's reranker" that combines multiple cheap signals into a composite relevance score.

The idea: BM25 gives you a keyword match score, vector similarity gives you a semantic score, and you have other signals like document recency, length, and source authority. Instead of relying on any single signal, combine them all with learned weights:

import math
from collections import Counter

class FeatureReranker:
    def __init__(self, weights=None):
        # Default weights learned via grid search on a dev set
        self.weights = weights or {
            "bm25_score":       0.30,
            "vector_sim":       0.25,
            "query_coverage":   0.20,
            "recency":          0.10,
            "doc_length_norm":  0.08,
            "source_authority": 0.07,
        }

    def extract_features(self, query: str, doc: dict) -> dict:
        q_terms = set(query.lower().split())
        d_terms = Counter(doc["text"].lower().split())

        # Term coverage: fraction of query terms found in document
        covered = sum(1 for t in q_terms if d_terms[t] > 0)
        coverage = covered / max(len(q_terms), 1)

        # Recency score: exponential decay from document age in days
        age_days = doc.get("age_days", 365)
        recency = math.exp(-age_days / 730)  # exponential decay, ~2-year time constant

        # Length normalization: penalize very short or very long docs
        word_count = sum(d_terms.values())
        ideal_length = 300  # target chunk size in words
        length_norm = 1.0 - min(abs(word_count - ideal_length) / 1000, 1.0)

        return {
            "bm25_score":       doc.get("bm25_score", 0.0),
            "vector_sim":       doc.get("vector_sim", 0.0),
            "query_coverage":   coverage,
            "recency":          recency,
            "doc_length_norm":  length_norm,
            "source_authority": doc.get("authority", 0.5),
        }

    def rerank(self, query: str, documents: list[dict],
               top_k: int = 10) -> list[RankedDoc]:
        scored = []
        for i, doc in enumerate(documents):
            features = self.extract_features(query, doc)
            # Weighted linear combination of all features
            composite = sum(
                self.weights[k] * features[k] for k in self.weights
            )
            scored.append(RankedDoc(
                doc_id=doc["id"], text=doc["text"],
                score=composite, original_rank=i + 1,
            ))

        scored.sort(key=lambda r: r.score, reverse=True)
        return scored[:top_k]

This is essentially a hand-tuned learning-to-rank function. Each feature captures a different dimension of relevance: bm25_score captures exact keyword matches, vector_sim captures semantic similarity, query_coverage rewards documents that mention more of the query terms, and the remaining features handle freshness, length, and source quality.

The weights can be learned from a small labeled dataset. Even a grid search over 50–100 labeled query-document pairs is enough to find weights that beat any single signal alone. In production, you'd graduate to LambdaMART or an XGBoost ranker trained on click-through data — but the principle is identical.
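To make the grid-search idea concrete, here's a toy version over just the two dominant weights, using a tiny hypothetical labeled dev set (the data and the pairwise-accuracy objective are illustrative assumptions, not the post's actual tuning setup):

```python
# Hypothetical labeled dev set: the two strongest features per query-doc
# pair, plus a binary relevance label (1 = relevant)
dev_set = [
    {"bm25": 0.9, "vec": 0.2, "label": 1},
    {"bm25": 0.1, "vec": 0.8, "label": 1},
    {"bm25": 0.4, "vec": 0.3, "label": 0},
    {"bm25": 0.2, "vec": 0.1, "label": 0},
]

def pairwise_accuracy(w_bm25: float, w_vec: float) -> float:
    """Fraction of (relevant, irrelevant) pairs the weighted score orders correctly."""
    score = lambda ex: w_bm25 * ex["bm25"] + w_vec * ex["vec"]
    rel = [score(e) for e in dev_set if e["label"] == 1]
    irr = [score(e) for e in dev_set if e["label"] == 0]
    pairs = [(r, s) for r in rel for s in irr]
    return sum(r > s for r, s in pairs) / len(pairs)

# Grid search over weight splits that sum to 1
grid = [i / 10 for i in range(11)]
best = max(((w, 1 - w) for w in grid), key=lambda ws: pairwise_accuracy(*ws))
```

Note that the pure-BM25 and pure-vector extremes misorder some pairs on this toy data, while a balanced split orders all of them correctly — exactly the "combination beats any single signal" effect described above.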

Performance? This gets you 70–80% of cross-encoder quality at near-zero latency cost (sub-millisecond per query). It's the right choice when you don't have a GPU, when you're reranking thousands of queries per second, or as a pre-filter before a cross-encoder.

The Head-to-Head Benchmark

Let's put all three reranking approaches through a proper evaluation. Our benchmark uses 200 queries across 5 categories, with human-judged relevance labels, measuring NDCG@5 (how well the top 5 results are ranked), MRR (how quickly you find the first relevant result), and Precision@3 (what fraction of the top 3 are relevant).
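All three metrics are quick to implement yourself, which makes building your own eval set painless. A minimal sketch assuming binary relevance labels (NDCG also accepts graded labels):

```python
import math

def ndcg_at_k(relevances: list[int], k: int = 5) -> float:
    """NDCG@k for a ranked list of relevance labels (position 0 = top result)."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

def mrr(relevances: list[int]) -> float:
    """Reciprocal rank of the first relevant result."""
    for i, rel in enumerate(relevances):
        if rel > 0:
            return 1.0 / (i + 1)
    return 0.0

def precision_at_k(relevances: list[int], k: int = 3) -> float:
    """Fraction of the top-k results that are relevant."""
    return sum(1 for r in relevances[:k] if r > 0) / k

# Relevance of a reranker's results at positions 1..5
ranking = [0, 1, 1, 0, 1]
```

A perfectly ordered list scores NDCG 1.0; every misplaced relevant document pulls it down, with misplacements near the top penalized hardest.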

First, the headline result — NDCG@5 for each reranker on top of each retrieval method:

| Retrieval | No Rerank | Feature | Cross-Encoder | LLM (Listwise) |
|---|---|---|---|---|
| BM25 | 0.52 | 0.61 | 0.71 | 0.67 |
| Vector | 0.57 | 0.64 | 0.74 | 0.71 |
| Hybrid (RRF) | 0.65 | 0.72 | 0.81 | 0.77 |

The standout finding: cross-encoder reranking adds 16–19 NDCG points regardless of the retrieval method. Hybrid + Cross-Encoder at 0.81 is the clear overall winner. Even the lightweight feature-based reranker adds 7–10 points — essentially free performance.

Now let's look at where each reranker excels, broken down by query type:

| Query Type | No Rerank | Feature | Cross-Encoder | LLM (Listwise) |
|---|---|---|---|---|
| Exact keyword | 0.73 | 0.78 | 0.82 | 0.79 |
| Semantic | 0.60 | 0.68 | 0.83 | 0.78 |
| Multi-hop | 0.41 | 0.52 | 0.78 | 0.75 |
| Ambiguous | 0.48 | 0.55 | 0.72 | 0.76 |
| Adversarial | 0.44 | 0.51 | 0.69 | 0.65 |

Two key insights:

  1. The cross-encoder's biggest gains come on semantic and multi-hop queries, where surface keyword overlap is weakest — multi-hop jumps from 0.41 without reranking to 0.78.
  2. Ambiguous queries are the one category where the listwise LLM (0.76) edges out the cross-encoder (0.72) — broad world knowledge helps the LLM disambiguate what the query is actually asking.

Production Patterns

Theory is nice, but how do you actually deploy reranking? Here's the architecture that works for most production RAG systems:

  1. Stage 1 — Retrieve (50–200 candidates, <50ms): Hybrid search with RRF fusion. Cast a wide net. This is where recall matters.
  2. Stage 2 — Rerank (top 10, <300ms): Cross-encoder (MiniLM-L6). Narrow down to the best candidates. This is where precision matters.
  3. Stage 3 — Generate (top 5 to LLM, <2s): Feed the reranked top-5 into your LLM for answer generation.

When to add a third reranking stage (retrieve 200 → cross-encoder to 20 → LLM rerank to 5)? Only when (a) accuracy is mission-critical and (b) you can afford 2–3s of additional latency and the API cost. For most applications, two stages is the right tradeoff.
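The two-stage wiring itself is just a few lines. Here's a sketch with toy stand-ins — in a real system `retrieve` would be your hybrid search with RRF and `rerank_score` the cross-encoder's `predict()` from earlier; the word-overlap scorer below exists only to make the example self-contained:

```python
from typing import Callable

def two_stage_search(
    query: str,
    retrieve: Callable[[str, int], list[dict]],
    rerank_score: Callable[[str, str], float],
    n_candidates: int = 50,
    top_k: int = 5,
) -> list[dict]:
    """Stage 1: wide-recall retrieval. Stage 2: precise re-scoring of candidates."""
    candidates = retrieve(query, n_candidates)
    for doc in candidates:
        doc["rerank_score"] = rerank_score(query, doc["text"])
    return sorted(candidates, key=lambda d: d["rerank_score"], reverse=True)[:top_k]

# Toy stand-ins for demonstration only
corpus = [
    {"id": "a", "text": "cache invalidation strategies overview"},
    {"id": "b", "text": "python decorators tutorial"},
    {"id": "c", "text": "ttl expiry and stale cache detection"},
]
toy_retrieve = lambda q, n: [dict(d) for d in corpus[:n]]
toy_score = lambda q, t: len(set(q.split()) & set(t.split())) / max(len(q.split()), 1)

results = two_stage_search("cache invalidation strategies",
                           toy_retrieve, toy_score, top_k=2)
```

Because the stages only communicate through a candidate list, you can swap either side independently — change the retriever, the reranker, or the candidate count without touching the rest of the pipeline.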

Candidate set sizing: Start at 50 candidates. Measure NDCG@10 on your eval set. Increase to 100. If the gain is <5%, stay at 50. I've found 50 is the sweet spot for most RAG workloads — beyond that, you're adding latency without meaningfully improving the final ranking.

Latency budget: If your total pipeline budget is 500ms, allocate roughly: 50ms retrieval + 300ms reranking + 150ms buffer for network overhead and spikes.

Here's a decision tree for choosing your reranker:

Do you have a GPU?
No → Feature-Based (or FlashRank on CPU)
Yes ↓
Is latency budget > 300ms?
No → Feature-Based + FlashRank
Yes ↓
Do you need explainability?
Yes → Cross-Encoder + LLM stage 3
No → Cross-Encoder (MiniLM-L6)


Conclusion

Reranking is the highest-ROI improvement you can make to a RAG system. For roughly 300ms of additional latency and zero API cost (cross-encoders run locally), you get a 15–20 point NDCG improvement — the difference between "the answer is somewhere in these 5 chunks" and "the answer is in the first chunk, exactly where the LLM expects it."

The recommendation stack is straightforward: start with cross-encoder reranking on top of hybrid search. If you don't have a GPU, use the feature-based approach. If you need to squeeze out the last few percentage points on ambiguous queries and can afford the cost, add LLM reranking as a third stage over the final 5 candidates.

The complete retrieval stack is now: search → rerank → generate. Each stage optimizes for a different objective (recall, precision, fluency), and each stage makes the next one's job easier. That's the pattern behind every production RAG system that actually works well.
