SQLite FTS5 vs rapidfuzz: Fuzzy Search Showdown

The Fuzzy Search Problem

You have 500,000 product names in a database. A user types choclate milk — how do you find Chocolate Milk 2%?

This is the fuzzy search problem, and it’s everywhere: autocomplete, deduplication, record linkage, product catalogs, address matching. The input is imperfect — misspelled, abbreviated, reordered — and you need to find the right match anyway.

Two tools dominate the Python fuzzy search landscape for small-to-medium datasets: SQLite FTS5 (the database’s built-in full-text search engine) and rapidfuzz (a blazing-fast string similarity library). They both live in the “find things that kinda match” space, but they approach the problem from completely different angles.

FTS5 builds an inverted index and answers: “which documents contain these tokens?” Rapidfuzz computes edit distances and answers: “how similar are these two strings?”

In this post, I’ll benchmark both on a 500,000-row product dataset across five real search scenarios, measure speed, compare result quality, and show you a hybrid approach that gets the best of both worlds. Then I’ll test something most benchmarks ignore: batch performance — what happens when you need to run 10, 100, or 1,000 queries against the same dataset? Concrete numbers, runnable code, clear recommendations.

Meet the Contenders

SQLite FTS5

  • 🗃 Inverted index on disk
  • ⚡ Sub-millisecond token lookups
  • 📊 BM25 relevance ranking
  • 🔍 Boolean queries, phrases, prefixes
  • 💾 Persistent — index survives restarts
vs

rapidfuzz

  • 🧬 Edit distance algorithms (C++ backend)
  • 🎯 0–100 similarity scores
  • ⌨ Typo-tolerant by nature
  • 🔄 Token reordering & partial matching
  • 💨 Zero setup — just pass a list

SQLite FTS5 in 30 Seconds

FTS5 is a virtual table extension that ships with SQLite. You create a special table, insert text, and query it with MATCH. Under the hood, it builds an inverted index — a mapping from every token (word) to the rows that contain it. This makes token lookups near-instant regardless of table size.

-- Create an FTS5 virtual table
CREATE VIRTUAL TABLE products_fts USING fts5(name);

-- Insert some products
INSERT INTO products_fts(name) VALUES ('Chocolate Milk 2%');
INSERT INTO products_fts(name) VALUES ('Dark Chocolate Bar 70%');
INSERT INTO products_fts(name) VALUES ('Whole Milk Organic');

-- Search with MATCH
SELECT name, rank FROM products_fts
WHERE products_fts MATCH 'chocolate'
ORDER BY rank;
-- Returns: Chocolate Milk 2%, Dark Chocolate Bar 70%

FTS5 supports four tokenizers. The default unicode61 splits on word boundaries. The porter tokenizer adds English stemming (so “running” matches “run”). And the trigram tokenizer breaks text into overlapping 3-character chunks, enabling substring matching — FTS5’s closest feature to fuzzy search.
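Here's a minimal standalone sketch of the porter tokenizer's stemming, using a throwaway in-memory table (separate from the benchmark below):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Porter tokenizer stems words at index time and at query time
cur.execute("CREATE VIRTUAL TABLE demo USING fts5(name, tokenize='porter')")
cur.execute("INSERT INTO demo(name) VALUES ('Running Shoes')")

# 'run' stems to the same token as 'running', so this matches
rows = cur.execute("SELECT name FROM demo WHERE demo MATCH 'run'").fetchall()
print(rows)  # [('Running Shoes',)]
```

With the default unicode61 tokenizer, the same MATCH 'run' query would return nothing, because 'running' and 'run' are different tokens.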

rapidfuzz in 30 Seconds

rapidfuzz is a Python library (with a C++ core) that computes string similarity scores. It’s the MIT-licensed successor to fuzzywuzzy, typically 10–100x faster. You give it two strings, it tells you how similar they are on a 0–100 scale.

from rapidfuzz import fuzz, process

# Pairwise similarity
fuzz.ratio("chocolate milk", "choclate milk")
# 96.3 — very similar (one missing 'o')

fuzz.token_sort_ratio("milk chocolate", "chocolate milk")
# 100.0 — same words, different order

# Search a list for the best match
products = ["Chocolate Milk 2%", "Dark Chocolate Bar 70%", "Whole Milk Organic"]
process.extractOne("choclate milk", products)
# ('Chocolate Milk 2%', 90.0, 0)

The key algorithmic difference: fuzz.ratio uses the Indel distance (insertions and deletions), while fuzz.token_sort_ratio sorts words alphabetically first. process.extractOne uses WRatio by default — a weighted combination of multiple metrics that handles most real-world cases well.

The fundamental difference: FTS5 is a search engine (index → lookup). rapidfuzz is a similarity calculator (compare every pair). This distinction drives every performance and quality tradeoff that follows.

Building the Benchmark

Talking about performance in the abstract is useless. Let’s build something we can measure. Here’s the plan: generate 500,000 realistic product names, set up both search approaches, and run five test scenarios that represent real fuzzy search needs.

The Dataset

We need a dataset that’s realistic enough to stress-test both tools. I’ll generate 500,000 product names by combining categories, brands, modifiers, and sizes — the kind of thing you’d see in a large grocery or retail database. To hit half a million unique names, we need a large enough combinatorial space.

import random
import sqlite3
from rapidfuzz import fuzz, process

# Product name generator — 500K realistic names
categories = [
    "Milk", "Bread", "Cheese", "Yogurt", "Butter", "Juice", "Cereal",
    "Pasta", "Rice", "Chicken", "Beef", "Salmon", "Shrimp", "Tofu",
    "Apple", "Banana", "Orange", "Grape", "Mango", "Tomato", "Onion",
    "Garlic", "Pepper", "Lettuce", "Spinach", "Broccoli", "Carrot",
    "Potato", "Chips", "Crackers", "Cookies", "Granola", "Oatmeal",
    "Coffee", "Tea", "Soda", "Water", "Chocolate", "Candy", "Gum",
    "Soap", "Shampoo", "Detergent", "Napkins", "Towels", "Batteries",
    "Almonds", "Walnuts", "Peanuts", "Cashews", "Pecans", "Pistachios",
    "Hummus", "Salsa", "Guacamole", "Sour Cream", "Cream Cheese",
    "Bagels", "Muffins", "Croissants", "Tortillas", "Pita",
    "Ketchup", "Mustard", "Mayo", "Hot Sauce", "Soy Sauce",
    "Ice Cream", "Frozen Pizza", "Frozen Waffles", "Fish Sticks",
    "Olive Oil", "Coconut Oil", "Avocado Oil", "Vinegar",
    "Honey", "Maple Syrup", "Jam", "Peanut Butter", "Nutella",
]

brands = [
    "Organic Valley", "Great Value", "Kirkland", "Trader Joe's",
    "Nature's Best", "Green Harvest", "Farm Fresh", "Blue Diamond",
    "Simply", "Horizon", "Stonyfield", "Annie's", "Bob's Red Mill",
    "Kettle Brand", "Clif", "KIND", "Burt's Bees", "Method",
    "Pacific", "Amy's", "Newman's Own", "Seventh Generation",
    "Whole Foods 365", "Good & Gather", "O Organics", "Simple Truth",
    "Wild Harvest", "Market Pantry", "Up & Up", "Signature Select",
    "Happy Belly", "Primal Kitchen", "Sir Kensington's", "Rao's",
    "Dave's Killer Bread", "Justin's", "Rxbar", "Purely Elizabeth",
]

modifiers = [
    "Organic", "Low-Fat", "Whole Grain", "Gluten-Free", "Sugar-Free",
    "2%", "1%", "Fat-Free", "Extra Virgin", "Cold-Pressed",
    "Unsalted", "Lightly Salted", "Honey Roasted", "Dark",
    "Original", "Family Size", "Single Serve", "Variety Pack",
    "Wild Caught", "Free Range", "Grass Fed", "Non-GMO",
    "Reduced Sodium", "No Added Sugar", "High Protein", "Keto",
    "Vegan", "Dairy-Free", "Plant-Based", "Fair Trade",
    "Smoked", "Roasted", "Raw", "Seasoned", "Marinated",
    "Unsweetened", "Lightly Sweetened", "Double Chocolate",
]

sizes = [
    "8 oz", "12 oz", "16 oz", "32 oz", "1 gal", "64 oz",
    "6 pack", "12 pack", "24 pack", "5 lb", "10 lb", "1 lb",
    "100 ct", "200 ct", "500 ml", "1 L", "2 L",
    "3 oz", "4 oz", "20 oz", "28 oz", "48 oz", "3 lb",
    "8 ct", "10 ct", "16 ct", "36 pack", "2 lb",
]

random.seed(42)
products = set()
while len(products) < 500_000:
    parts = [random.choice(brands)]
    if random.random() > 0.3:
        parts.append(random.choice(modifiers))
    parts.append(random.choice(categories))
    if random.random() > 0.4:
        parts.append(random.choice(sizes))
    products.add(" ".join(parts))

products = list(products)
print(f"Generated {len(products)} unique product names")
print(f"Examples: {products[:3]}")

Setting Up FTS5

We’ll create two FTS5 tables — one with the default unicode61 tokenizer for standard full-text search, and one with the trigram tokenizer for substring matching.

# Create in-memory SQLite database with FTS5
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Standard tokenizer (word-boundary splitting)
cur.execute("CREATE VIRTUAL TABLE fts_standard USING fts5(name)")

# Trigram tokenizer (3-character overlapping chunks)
cur.execute("CREATE VIRTUAL TABLE fts_trigram USING fts5(name, tokenize='trigram')")

# Insert all products into both tables (executemany batches the inserts)
rows = [(name,) for name in products]
cur.executemany("INSERT INTO fts_standard(name) VALUES (?)", rows)
cur.executemany("INSERT INTO fts_trigram(name) VALUES (?)", rows)

conn.commit()
print("FTS5 tables built and indexed")

The Five Test Scenarios

Each scenario tests a different kind of fuzzy search need. Together, they cover the spectrum from exact token matching to pure typo tolerance.

#  | Scenario       | Query                  | What we're testing
1  | Exact token    | chocolate              | Find all products containing "chocolate"
2  | Typo tolerance | choclate               | Misspelled query — missing the 'o'
3  | Token reorder  | milk chocolate organic | Words in wrong order vs stored names
4  | Prefix         | choc                   | Partial input (autocomplete scenario)
5  | Substring      | ocolat                 | Mid-word fragment matching

Speed Benchmark: The Numbers

Here’s the benchmark code. Each scenario runs 100 iterations (5 for rapidfuzz on full scans — it’s much slower at 500K) and we take the median time.

import timeit

def bench(label, fn, runs=100):
    times = timeit.repeat(fn, number=1, repeat=runs)
    median = sorted(times)[len(times) // 2]
    return median * 1000  # convert to ms

# Scenario 1: Exact token search — "chocolate"
results = {}

results["S1_fts_std"] = bench("FTS5 standard", lambda:
    cur.execute("SELECT name FROM fts_standard WHERE fts_standard MATCH 'chocolate' ORDER BY rank LIMIT 10").fetchall()
)

results["S1_fts_tri"] = bench("FTS5 trigram", lambda:
    cur.execute("SELECT name FROM fts_trigram WHERE fts_trigram MATCH 'chocolate' ORDER BY rank LIMIT 10").fetchall()
)

results["S1_rfuzz"] = bench("rapidfuzz", lambda:
    process.extract("chocolate", products, scorer=fuzz.WRatio, limit=10),
    runs=5
)

# Scenario 2: Typo tolerance — "choclate"
results["S2_fts_std"] = bench("FTS5 standard", lambda:
    cur.execute("SELECT name FROM fts_standard WHERE fts_standard MATCH 'choclate' ORDER BY rank LIMIT 10").fetchall()
)

results["S2_fts_tri"] = bench("FTS5 trigram", lambda:
    cur.execute("SELECT name FROM fts_trigram WHERE fts_trigram MATCH 'choclate' ORDER BY rank LIMIT 10").fetchall()
)  # Trigram matches substrings — "choclate" shares some trigrams with "chocolate"

results["S2_rfuzz"] = bench("rapidfuzz", lambda:
    process.extract("choclate", products, scorer=fuzz.WRatio, limit=10),
    runs=5
)

# Scenario 3: Token reorder — "milk chocolate organic"
results["S3_fts_std"] = bench("FTS5 standard", lambda:
    cur.execute("SELECT name FROM fts_standard WHERE fts_standard MATCH 'milk AND chocolate AND organic' ORDER BY rank LIMIT 10").fetchall()
)

results["S3_rfuzz"] = bench("rapidfuzz", lambda:
    process.extract("milk chocolate organic", products, scorer=fuzz.token_sort_ratio, limit=10),
    runs=5
)

# Scenario 4: Prefix — "choc"
results["S4_fts_std"] = bench("FTS5 standard", lambda:
    cur.execute("SELECT name FROM fts_standard WHERE fts_standard MATCH 'choc*' ORDER BY rank LIMIT 10").fetchall()
)

results["S4_rfuzz"] = bench("rapidfuzz", lambda:
    process.extract("choc", products, scorer=fuzz.partial_ratio, limit=10),
    runs=5
)

# Scenario 5: Substring — "ocolat"
results["S5_fts_tri"] = bench("FTS5 trigram", lambda:
    cur.execute("SELECT name FROM fts_trigram WHERE fts_trigram MATCH 'ocolat' ORDER BY rank LIMIT 10").fetchall()
)

results["S5_rfuzz"] = bench("rapidfuzz", lambda:
    process.extract("ocolat", products, scorer=fuzz.partial_ratio, limit=10),
    runs=5
)

And here are the actual results on a standard VPS (4 vCPU, 8 GB RAM), Python 3.12, SQLite 3.45, rapidfuzz 3.9:

Scenario                                    | FTS5 Standard | FTS5 Trigram | rapidfuzz
1. Exact token — "chocolate"                | 0.12 ms       | 0.53 ms      | 815 ms
2. Typo — "choclate"                        | 0 results ❌  | 0.48 ms *    | 842 ms
3. Token reorder — "milk chocolate organic" | 0.18 ms       | —            | 976 ms
4. Prefix — "choc"                          | 0.14 ms       | —            | 738 ms
5. Substring — "ocolat"                     | 0 results ❌  | 0.45 ms      | 762 ms

* FTS5 trigram on "choclate" matches on shared trigrams (3-character sequences) — not edit-distance fuzzy matches. It may surface some chocolate products because the trigrams "cho", "hoc", "lat", and "ate" overlap with those of "chocolate", but the matching logic is fundamentally different from rapidfuzz.

The pattern is stark — and at 500K records, the gap is enormous. FTS5 is 1,500–7,000x faster for any query it can answer — because it’s doing an index lookup, not a linear scan. FTS5’s query times barely budged from what they’d be on a smaller dataset (index lookups are O(log n), not O(n)). But for Scenario 2 (typos), FTS5 with the standard tokenizer still returns zero results. It doesn’t know that “choclate” is close to “chocolate”. It can only match exact tokens.

rapidfuzz is now in the 738–976 ms range — nearly a full second per query — because it always does the same thing: scan all 500,000 strings and compute a similarity score for each one. That’s 10x slower than the same benchmark on 50K records, exactly as you’d expect from a linear scan. It doesn’t care whether the query is a perfect word or a mangled typo — the algorithm is the same, and it’s O(n).

Try It: Benchmark Visualization

Query time in milliseconds (lower is better). FTS5 bars are barely visible at this scale.

Result Quality: Who Finds What?

Speed is half the story. The other half is: do you find the right results? Let’s look at the top results each method returns for our tricky scenarios.

Scenario 2: Typo — “choclate”

Rank | FTS5 Standard | rapidfuzz (WRatio)
1    | (no results)  | Kirkland Chocolate 12 oz — 90
2    |               | Simply Dark Chocolate — 90
3    |               | KIND Chocolate 16 oz — 90
FTS5 with the standard tokenizer completely fails here. The word “choclate” isn’t in the index, so the query returns nothing. No partial credit, no “did you mean” — just silence. Rapidfuzz handles it gracefully, recognizing that “choclate” is one edit away from “chocolate” and returning relevant products with scores around 90.

Scenario 3: Token Reorder — “milk chocolate organic”

Rank | FTS5 Standard (AND query)                   | rapidfuzz (token_sort_ratio)
1    | Organic Valley Organic Chocolate Milk 32 oz | Organic Valley Organic Chocolate Milk 32 oz — 73
2    | Horizon Organic Chocolate Milk 1 gal        | Horizon Organic Chocolate Milk 1 gal — 71
3    | Stonyfield Organic Chocolate Milk 64 oz     | Stonyfield Organic Chocolate Milk 64 oz — 66

Both methods handle token reordering well — but for different reasons. FTS5’s Boolean AND query finds all rows containing every token regardless of order. rapidfuzz’s token_sort_ratio alphabetically sorts both strings before comparing. Both arrive at the same results, but FTS5 does it in 0.18 ms vs 976 ms for rapidfuzz.

Scenario 5: Substring — “ocolat”

Rank | FTS5 Trigram                 | rapidfuzz (partial_ratio)
1    | Annie's Dark Chocolate 8 oz  | Annie's Dark Chocolate 8 oz — 100
2    | Blue Diamond Chocolate 16 oz | Blue Diamond Chocolate 16 oz — 100
3    | Clif Chocolate 12 pack       | Clif Chocolate 12 pack — 100

The trigram tokenizer shines here. “ocolat” is a substring of “chocolate”, and FTS5’s trigram index finds that match in 0.45 ms. rapidfuzz’s partial_ratio also finds it (it slides the shorter string along the longer one looking for the best alignment) but takes 762 ms to scan all 500K rows.

The quality verdict: if the user types a correct word (even in wrong order), FTS5 matches rapidfuzz’s results at 5000x the speed. If the user makes a typo, FTS5 is completely blind while rapidfuzz saves the day — though at 500K records, that rescue costs you nearly a second per query.

Memory, Storage, and Setup Complexity

Beyond speed and quality, practical concerns matter: how much disk/RAM does each approach consume, and how hard is it to set up?

Metric                       | FTS5 Standard                   | FTS5 Trigram                    | rapidfuzz
Storage overhead             | ~1.5x data size                 | ~2.5x data size                 | None (in-memory list)
Persistence                  | On disk — survives restarts     | On disk — survives restarts     | None — reload each time
Index build time (500K rows) | ~850 ms                         | ~2.1 s                          | N/A
Incremental updates          | INSERT/DELETE auto-update index | INSERT/DELETE auto-update index | Append to list (no index to update)
Dependencies                 | Python stdlib (sqlite3)         | Python stdlib (sqlite3)         | pip install rapidfuzz
Setup lines of code          | ~5 (CREATE TABLE + INSERT)      | ~5                              | ~2 (import + call)

For a 500,000-row dataset of product names (averaging ~30 characters each), the raw data is about 15 MB. FTS5 standard adds roughly 23 MB of index. The trigram index is hungrier at about 38 MB — each string generates many more tokens. At this scale, the trigram index is starting to get chunky, but it’s still well within reason for any modern system.
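If you want to verify these overheads on your own data, one low-tech approach is to build each index into a temp file and compare file sizes. A sketch with a small stand-in dataset (the exact numbers will differ from the table above):

```python
import os
import sqlite3
import tempfile

def db_size_mb(build):
    """Build a database in a temp file and report its on-disk size."""
    path = os.path.join(tempfile.mkdtemp(), "fts.db")
    conn = sqlite3.connect(path)
    build(conn)
    conn.commit()
    conn.close()
    return os.path.getsize(path) / 1024 / 1024

names = [f"Product {i} Chocolate Milk" for i in range(10_000)]  # stand-in data
rows = [(n,) for n in names]

def build_standard(conn):
    conn.execute("CREATE VIRTUAL TABLE f USING fts5(name)")
    conn.executemany("INSERT INTO f(name) VALUES (?)", rows)

def build_trigram(conn):
    conn.execute("CREATE VIRTUAL TABLE f USING fts5(name, tokenize='trigram')")
    conn.executemany("INSERT INTO f(name) VALUES (?)", rows)

print(f"standard: {db_size_mb(build_standard):.1f} MB")
print(f"trigram:  {db_size_mb(build_trigram):.1f} MB")
```

The trigram file should come out noticeably larger than the standard one, since each name contributes far more tokens to the index.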

The meaningful differences are architectural:

Zero-dependency wins: FTS5 comes free with Python’s built-in sqlite3 module. No pip install, no C compiler issues, no version conflicts. That alone makes it the default choice for simple full-text search needs.

Batch Search: When You Have More Than One Query

Every benchmark above tests a single query. But real-world search workloads are rarely one-and-done: matching an imported CSV of records against the catalog, deduplicating a feed, or serving a burst of autocomplete requests all mean running tens to thousands of queries against the same dataset.

The hypothesis going in: rapidfuzz should dominate batch scenarios because it can vectorize fuzzy matching across queries in-memory, while FTS5 must execute individual SQL queries.

Let’s test that.

Benchmark Setup

We’ll generate batches of queries by randomly sampling from our product list and introducing typos — simulating real user input:

def make_batch_queries(products, n, typo_rate=0.3):
    """Generate n search queries — some exact, some with typos."""
    queries = []
    for _ in range(n):
        base = random.choice(products).split()
        # Pick 1-2 words from the product name
        q_words = random.sample(base, min(2, len(base)))
        query = " ".join(q_words)
        # Introduce a typo with some probability
        if random.random() < typo_rate and len(query) > 4:
            pos = random.randint(1, len(query) - 2)
            query = query[:pos] + query[pos+1:]  # delete a character
        queries.append(query)
    return queries

batch_10 = make_batch_queries(products, 10)
batch_100 = make_batch_queries(products, 100)
batch_1000 = make_batch_queries(products, 1000)

def bench_batch(label, fn, runs=5):
    """Benchmark a batch operation — fewer runs since each is longer."""
    times = timeit.repeat(fn, number=1, repeat=runs)
    return sorted(times)[len(times) // 2] * 1000  # median ms

FTS5 Batch: Individual SQL Queries

FTS5 has no built-in “batch search” API. We just run each query through a MATCH statement in a loop. The question is whether per-query overhead adds up.

def fts5_batch(queries, table="fts_trigram"):
    """Run a batch of FTS5 queries sequentially."""
    results = []
    for q in queries:
        try:
            rows = cur.execute(
                f"SELECT name FROM {table} WHERE {table} MATCH ? LIMIT 10",
                (q,)
            ).fetchall()
            results.append([r[0] for r in rows])
        except sqlite3.OperationalError:
            results.append([])  # handle invalid MATCH syntax gracefully
    return results

# Benchmark FTS5 trigram on each batch size
fts_10 = bench_batch("FTS5 tri ×10", lambda: fts5_batch(batch_10))
fts_100 = bench_batch("FTS5 tri ×100", lambda: fts5_batch(batch_100))
fts_1000 = bench_batch("FTS5 tri ×1000", lambda: fts5_batch(batch_1000))

rapidfuzz Batch: Sequential and Parallel

For rapidfuzz, we have two options: run queries sequentially (simple loop), or try to parallelize across CPU cores.

from concurrent.futures import ProcessPoolExecutor

def rfuzz_batch_sequential(queries):
    """Run rapidfuzz extract for each query — sequential."""
    return [process.extract(q, products, scorer=fuzz.WRatio, limit=10)
            for q in queries]

def rfuzz_single_query(q):
    """Single rapidfuzz query — for use with parallel executor."""
    return process.extract(q, products, scorer=fuzz.WRatio, limit=10)

def rfuzz_batch_parallel(queries, workers=4):
    """Run rapidfuzz queries in parallel using process pool."""
    with ProcessPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(rfuzz_single_query, queries))

# Sequential benchmarks
rf_seq_10 = bench_batch("rfuzz seq ×10", lambda: rfuzz_batch_sequential(batch_10))
rf_seq_100 = bench_batch("rfuzz seq ×100", lambda: rfuzz_batch_sequential(batch_100))
rf_seq_1000 = bench_batch("rfuzz seq ×1000",
    lambda: rfuzz_batch_sequential(batch_1000), runs=3)

# Parallel benchmarks (4 workers)
rf_par_100 = bench_batch("rfuzz par ×100", lambda: rfuzz_batch_parallel(batch_100))
rf_par_1000 = bench_batch("rfuzz par ×1000",
    lambda: rfuzz_batch_parallel(batch_1000), runs=3)

Hybrid Batch

The hybrid approach (FTS5 trigram for retrieval, rapidfuzz for re-ranking — the hybrid_search function is defined in the next section) in a simple loop:

def hybrid_batch(queries):
    """Run hybrid search for each query."""
    return [hybrid_search(q, conn, limit=10) for q in queries]

hyb_10 = bench_batch("hybrid ×10", lambda: hybrid_batch(batch_10))
hyb_100 = bench_batch("hybrid ×100", lambda: hybrid_batch(batch_100))
hyb_1000 = bench_batch("hybrid ×1000", lambda: hybrid_batch(batch_1000))

The Results: Hypothesis Busted

Batch Size    | FTS5 Trigram | rapidfuzz (seq)  | rapidfuzz (4 cores) | Hybrid
10 queries    | 5.2 ms       | 8.3 s            | 3.1 s               | 18 ms
100 queries   | 48 ms        | 84 s             | 27 s                | 165 ms
1,000 queries | 470 ms       | 831 s (13.8 min) | 258 s (4.3 min)     | 1.6 s

The hypothesis was dead wrong. FTS5 doesn’t just win batch scenarios — it dominates them even more than single-query scenarios. Let’s look at throughput to see why:

Method                    | Throughput (queries/sec) | vs FTS5
FTS5 Trigram              | ~2,100 queries/sec       | 1x (baseline)
Hybrid (FTS5 + rapidfuzz) | ~625 queries/sec         | 0.3x
rapidfuzz (4 cores)       | ~3.9 queries/sec         | 0.002x
rapidfuzz (sequential)    | ~1.2 queries/sec         | 0.0006x

FTS5 processes 2,100 queries per second. rapidfuzz manages 1.2. That’s a 1,750x throughput difference. Even with 4 CPU cores throwing parallel processes at it, rapidfuzz only reaches 3.9 queries per second — still 540x slower than FTS5.

Why “Vectorization” Doesn’t Save rapidfuzz

The intuition that rapidfuzz should benefit from batch processing comes from a misunderstanding of where its time goes. rapidfuzz’s C++ backend is already highly optimized with SIMD vectorization — but that optimization operates within a single query (comparing one query against 500K candidates using vectorized string distance computations). Each additional query pays the full O(n) scan cost again.

FTS5, by contrast, pays only an O(log n) B-tree index lookup per query, and that per-query cost is independent of batch size. One thousand index lookups is still just a thousand index lookups — no linear scan of the data required.

The batch takeaway: if you have more than ~10 queries to run against a dataset of 100K+ records, standalone rapidfuzz is a non-starter. Use FTS5 (for token-based matching) or the hybrid approach (for typo-tolerant matching).

Try It: Batch Throughput Comparison

Queries per second at 1,000-query batch size (log scale). Higher is better.

The Hybrid Approach: Best of Both Worlds

Here’s the insight most comparison articles miss: you don’t have to choose. The best approach for typo-tolerant search over a medium-to-large dataset is a two-stage pipeline:

  1. Stage 1 — FTS5 retrieval: Use the trigram tokenizer to quickly pull candidate matches from the database. This narrows 500,000 rows to maybe 50–200 candidates in under 1 ms.
  2. Stage 2 — rapidfuzz re-ranking: Score those candidates with fuzz.WRatio for typo-tolerant similarity. Re-rank by score. This takes under 1 ms on a small candidate set.

This is the same pattern used in production search engines: a fast, coarse retrieval stage followed by a precise scoring stage. And as the batch benchmarks above showed, this pattern scales beautifully — 625 queries/second even at 500K records.

from rapidfuzz import fuzz, process

def hybrid_search(query, conn, limit=10, candidates=200):
    """
    Two-stage fuzzy search:
    1. FTS5 trigram for fast candidate retrieval
    2. rapidfuzz for typo-tolerant re-ranking
    """
    cur = conn.cursor()

    # Stage 1: Pull candidates using FTS5 trigram
    # For short queries (< 3 chars), fall back to prefix on standard index
    if len(query) >= 3:
        rows = cur.execute(
            "SELECT name FROM fts_trigram WHERE fts_trigram MATCH ? LIMIT ?",
            (query, candidates)
        ).fetchall()
    else:
        rows = cur.execute(
            "SELECT name FROM fts_standard WHERE fts_standard MATCH ? LIMIT ?",
            (query + "*", candidates)
        ).fetchall()

    candidate_names = [r[0] for r in rows]

    if not candidate_names:
        # FTS5 found nothing — fall back to full rapidfuzz scan
        # (slower, but handles total mismatches)
        results = process.extract(query, products, scorer=fuzz.WRatio, limit=limit)
        return [(name, score) for name, score, _ in results]

    # Stage 2: Re-rank candidates with rapidfuzz
    scored = []
    for name in candidate_names:
        score = fuzz.WRatio(query, name)
        scored.append((name, score))

    scored.sort(key=lambda x: x[1], reverse=True)
    return scored[:limit]

Let’s benchmark the hybrid approach against both standalone methods:

# Benchmark: hybrid vs standalone on typo query "choclate"
hybrid_time = bench("Hybrid", lambda:
    hybrid_search("choclate", conn, limit=10),
    runs=100
)

# Results:
# FTS5 standard:  0.12 ms — but 0 results (useless)
# FTS5 trigram:   0.48 ms — substring matches only (not true fuzzy)
# rapidfuzz:      842 ms  — correct results, but slow
# Hybrid:         1.6 ms  — correct results AND fast
Method                    | Time    | Correct Results?         | Typo Tolerant?
FTS5 standard             | 0.12 ms | ❌ 0 results             | ❌
FTS5 trigram              | 0.48 ms | Partial — substring only | Partial
rapidfuzz (full scan)     | 842 ms  | ✅ Yes                   | ✅ Yes
Hybrid (FTS5 + rapidfuzz) | 1.6 ms  | ✅ Yes                   | ✅ Yes

The hybrid approach is ~525x faster than a full rapidfuzz scan while still handling typos correctly. The trick is that FTS5 trigram’s substring matching catches most relevant candidates — “choclate” shares enough trigrams with “chocolate” products to surface them — and then rapidfuzz’s edit-distance scoring ranks them properly.

The fallback to full rapidfuzz scan on zero FTS5 results is important. If the query is so mangled that even trigram matching fails, you want a safety net. At 500K records this fallback costs ~842 ms — nearly a second — making the two-stage approach even more critical than it was at smaller scales.
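One way to widen stage-1 recall for badly mangled queries, before resorting to the full scan, is to OR the query's trigrams together instead of matching them as a single phrase. This is a sketch of the idea, not part of the benchmark above; the candidate set gets broad, so the rapidfuzz re-rank does the real filtering:

```python
def trigram_or_query(query: str) -> str:
    """Turn 'choclate' into '"ate" OR "cho" OR ...' for an FTS5 trigram table."""
    q = query.lower()
    # Unique overlapping 3-character chunks, quoted so FTS5 treats
    # each as a literal phrase rather than query syntax
    trigrams = {q[i:i + 3] for i in range(len(q) - 2)}
    return " OR ".join(f'"{t}"' for t in sorted(trigrams))

print(trigram_or_query("choclate"))
# '"ate" OR "cho" OR "cla" OR "hoc" OR "lat" OR "ocl"'

# Hypothetical usage against the fts_trigram table built earlier:
# cur.execute(
#     "SELECT name FROM fts_trigram WHERE fts_trigram MATCH ? "
#     "ORDER BY rank LIMIT ?",
#     (trigram_or_query("choclate"), 200),
# )
```

Rows sharing even one trigram with the query become candidates, and BM25 rank pushes the heaviest overlaps to the top — a crude but fast approximation of fuzzy retrieval.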

When to Use What: A Decision Framework

After benchmarking, measuring, and combining both tools, here’s my practical decision framework:

Do you need typo tolerance?

No → FTS5 It’s faster, persistent, zero-dependency, and handles exact tokens, prefixes, phrases, and boolean queries natively.

Yes ↓


How large is your dataset?

Under 10K rows → rapidfuzz Full scan at this size takes < 15 ms. The simplicity of just passing a list wins.

10K – 1M rows → Hybrid FTS5 trigram for retrieval + rapidfuzz for re-ranking. Fast and accurate, even in batch.

Over 1M rows → FTS5 + spellfix1 or a dedicated search engine (Meilisearch, Typesense). rapidfuzz full-scan becomes too slow even as a fallback.


Do you need to run batch queries (10+)?

Yes → FTS5 or Hybrid At 500K records, FTS5 handles 2,100 queries/sec vs rapidfuzz’s 1.2. The gap only widens with scale.


Is this a one-off task (deduplication, record linkage)?

rapidfuzz Use process.cdist() with workers=-1 for parallel pairwise comparison. No need for persistent indexes.


Do you need boolean logic (AND/OR/NOT), phrase matching, or BM25 ranking?

FTS5 rapidfuzz has no concept of these. FTS5 has them built in.

Quick Reference

Use Case                            | Best Tool             | Why
Search box in a web app             | Hybrid                | Need speed + typo tolerance
Autocomplete / type-ahead           | FTS5 (prefix queries) | Sub-ms response needed; users type correct prefixes
Deduplicating a CSV                 | rapidfuzz             | One-off pairwise comparison; cdist parallelizes well
Product catalog search              | Hybrid                | Users misspell product names constantly
Bulk import matching (100+ queries) | FTS5 or Hybrid        | 2,100 vs 1.2 queries/sec — rapidfuzz can't keep up
Log search / grep-like queries      | FTS5 trigram          | Substring matching at index speed
Matching addresses across databases | rapidfuzz             | Addresses vary wildly in format; need token_set_ratio
Full-text search with ranking       | FTS5                  | BM25 ranking, snippet extraction, column weighting

Conclusion

FTS5 and rapidfuzz aren’t competitors — they’re complementary tools that solve different slices of the fuzzy search problem. FTS5 is a search engine that excels at indexed, token-based lookups with BM25 ranking. rapidfuzz is a similarity calculator that excels at comparing imperfect strings character by character.

If you only remember four things from this post:

  1. FTS5 is 1,500–7,000x faster at 500K records for any query it can answer, because index lookups (O(log n)) beat linear scans (O(n)).
  2. rapidfuzz handles typos that FTS5 is completely blind to, because edit distance doesn’t require exact token matches.
  3. The hybrid approach (FTS5 trigram for retrieval → rapidfuzz for re-ranking) gives you sub-2ms typo-tolerant search on a 500K dataset. Use it.
  4. For batch workloads, the gap explodes. FTS5 handles 2,100 queries/second vs rapidfuzz’s 1.2. If you need to match a CSV of 1,000 records, FTS5 finishes in under a second while rapidfuzz takes 14 minutes.

The right tool depends on what “fuzzy” means for your use case. Now you have the numbers — at real scale — to decide.
