SQLite FTS5 vs rapidfuzz: Fuzzy Search Showdown
The Fuzzy Search Problem
You have 500,000 product names in a database. A user types choclate milk — how do you find Chocolate Milk 2%?
This is the fuzzy search problem, and it’s everywhere: autocomplete, deduplication, record linkage, product catalogs, address matching. The input is imperfect — misspelled, abbreviated, reordered — and you need to find the right match anyway.
Two tools dominate the Python fuzzy search landscape for small-to-medium datasets: SQLite FTS5 (the database’s built-in full-text search engine) and rapidfuzz (a blazing-fast string similarity library). They both live in the “find things that kinda match” space, but they approach the problem from completely different angles.
FTS5 builds an inverted index and answers: “which documents contain these tokens?” Rapidfuzz computes edit distances and answers: “how similar are these two strings?”
In this post, I’ll benchmark both on a 500,000-row product dataset across five real search scenarios, measure speed, compare result quality, and show you a hybrid approach that gets the best of both worlds. Then I’ll test something most benchmarks ignore: batch performance — what happens when you need to run 10, 100, or 1,000 queries against the same dataset? Concrete numbers, runnable code, clear recommendations.
Meet the Contenders
SQLite FTS5
- 🗃 Inverted index on disk
- ⚡ Sub-millisecond token lookups
- 📊 BM25 relevance ranking
- 🔍 Boolean queries, phrases, prefixes
- 💾 Persistent — index survives restarts
rapidfuzz
- 🧬 Edit distance algorithms (C++ backend)
- 🎯 0–100 similarity scores
- ⌨ Typo-tolerant by nature
- 🔄 Token reordering & partial matching
- 💨 Zero setup — just pass a list
SQLite FTS5 in 30 Seconds
FTS5 is a virtual table extension that ships with SQLite. You create a special table, insert text, and query it with MATCH. Under the hood, it builds an inverted index — a mapping from every token (word) to the rows that contain it. This makes token lookups near-instant regardless of table size.
```sql
-- Create an FTS5 virtual table
CREATE VIRTUAL TABLE products_fts USING fts5(name);

-- Insert some products
INSERT INTO products_fts(name) VALUES ('Chocolate Milk 2%');
INSERT INTO products_fts(name) VALUES ('Dark Chocolate Bar 70%');
INSERT INTO products_fts(name) VALUES ('Whole Milk Organic');

-- Search with MATCH
SELECT name, rank FROM products_fts
WHERE products_fts MATCH 'chocolate'
ORDER BY rank;
-- Returns: Chocolate Milk 2%, Dark Chocolate Bar 70%
```
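Conceptually, the inverted index is just a map from token to row ids. Here's a toy Python sketch of the idea — FTS5's real index is an on-disk B-tree with token positions and prefix support, but the lookup principle is the same:

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased token to the set of row ids containing it."""
    index = defaultdict(set)
    for rowid, text in enumerate(docs):
        for token in text.lower().split():
            index[token].add(rowid)
    return index

docs = ["Chocolate Milk 2%", "Dark Chocolate Bar 70%", "Whole Milk Organic"]
index = build_index(docs)

# "Which documents contain this token?" — one dict lookup, not a scan
print(sorted(index["chocolate"]))  # [0, 1]
print(sorted(index["milk"]))       # [0, 2]
```

The query cost is a hash lookup per token, independent of how many documents you've indexed — which is exactly why FTS5's numbers barely move as the table grows.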
FTS5 supports four tokenizers. The default unicode61 splits on word boundaries. The porter tokenizer adds English stemming (so “running” matches “run”). And the trigram tokenizer breaks text into overlapping 3-character chunks, enabling substring matching — FTS5’s closest feature to fuzzy search.
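The tokenizer is chosen at CREATE time. A quick sketch of the porter stemmer in action — this assumes your Python build's bundled SQLite was compiled with FTS5, which is true of most CPython distributions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Porter stemming: "run", "runs", "running" all index to the stem "run"
cur.execute("CREATE VIRTUAL TABLE docs USING fts5(body, tokenize='porter')")
cur.execute("INSERT INTO docs(body) VALUES ('I run daily')")
cur.execute("INSERT INTO docs(body) VALUES ('she was running late')")

# The query term is stemmed too, so 'runs' matches both rows
rows = cur.execute("SELECT body FROM docs WHERE docs MATCH 'runs'").fetchall()
print(rows)
```

Swap `tokenize='porter'` for `tokenize='trigram'` and the same MATCH syntax gives you substring search instead (trigram requires SQLite 3.34+).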
rapidfuzz in 30 Seconds
rapidfuzz is a Python library (with a C++ core) that computes string similarity scores. It’s the MIT-licensed successor to fuzzywuzzy, typically 10–100x faster. You give it two strings, it tells you how similar they are on a 0–100 scale.
```python
from rapidfuzz import fuzz, process

# Pairwise similarity
fuzz.ratio("chocolate milk", "choclate milk")
# 96.3 — very similar (one missing 'o')

fuzz.token_sort_ratio("milk chocolate", "chocolate milk")
# 100.0 — same words, different order

# Search a list for the best match
products = ["Chocolate Milk 2%", "Dark Chocolate Bar 70%", "Whole Milk Organic"]
process.extractOne("choclate milk", products)
# ('Chocolate Milk 2%', 90.0, 0)
```
The key algorithmic difference: fuzz.ratio uses the Indel distance (insertions and deletions), while fuzz.token_sort_ratio sorts words alphabetically first. process.extractOne uses WRatio by default — a weighted combination of multiple metrics that handles most real-world cases well.
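To make the Indel claim concrete, here's a pure-Python sketch of the normalized score that fuzz.ratio computes: 2 × LCS / (len_a + len_b), where LCS is the longest common subsequence (Indel distance is what's left after the common subsequence is accounted for). This is an illustration only — rapidfuzz's C++ implementation is orders of magnitude faster, and WRatio layers extra length-based weighting on top:

```python
def lcs_length(a, b):
    """Classic dynamic-programming longest-common-subsequence length."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b):
            curr.append(prev[j] + 1 if ch_a == ch_b else max(prev[j + 1], curr[j]))
        prev = curr
    return prev[-1]

def indel_ratio(a, b):
    """Normalized Indel similarity on a 0-100 scale, like fuzz.ratio."""
    if not a and not b:
        return 100.0
    return 200.0 * lcs_length(a, b) / (len(a) + len(b))

print(round(indel_ratio("chocolate milk", "choclate milk"), 1))  # 96.3
```

The two strings share a 13-character subsequence out of 27 total characters, so the score is 2 × 13 / 27 ≈ 96.3 — matching the fuzz.ratio output above.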
The fundamental difference: FTS5 is a search engine (index → lookup). rapidfuzz is a similarity calculator (compare every pair). This distinction drives every performance and quality tradeoff that follows.
Building the Benchmark
Talking about performance in the abstract is useless. Let’s build something we can measure. Here’s the plan: generate 500,000 realistic product names, set up both search approaches, and run five test scenarios that represent real fuzzy search needs.
The Dataset
We need a dataset that’s realistic enough to stress-test both tools. I’ll generate 500,000 product names by combining categories, brands, modifiers, and sizes — the kind of thing you’d see in a large grocery or retail database. To hit half a million unique names, we need a large enough combinatorial space.
```python
import random
import sqlite3

from rapidfuzz import fuzz, process

# Product name generator — 500K realistic names
categories = [
    "Milk", "Bread", "Cheese", "Yogurt", "Butter", "Juice", "Cereal",
    "Pasta", "Rice", "Chicken", "Beef", "Salmon", "Shrimp", "Tofu",
    "Apple", "Banana", "Orange", "Grape", "Mango", "Tomato", "Onion",
    "Garlic", "Pepper", "Lettuce", "Spinach", "Broccoli", "Carrot",
    "Potato", "Chips", "Crackers", "Cookies", "Granola", "Oatmeal",
    "Coffee", "Tea", "Soda", "Water", "Chocolate", "Candy", "Gum",
    "Soap", "Shampoo", "Detergent", "Napkins", "Towels", "Batteries",
    "Almonds", "Walnuts", "Peanuts", "Cashews", "Pecans", "Pistachios",
    "Hummus", "Salsa", "Guacamole", "Sour Cream", "Cream Cheese",
    "Bagels", "Muffins", "Croissants", "Tortillas", "Pita",
    "Ketchup", "Mustard", "Mayo", "Hot Sauce", "Soy Sauce",
    "Ice Cream", "Frozen Pizza", "Frozen Waffles", "Fish Sticks",
    "Olive Oil", "Coconut Oil", "Avocado Oil", "Vinegar",
    "Honey", "Maple Syrup", "Jam", "Peanut Butter", "Nutella",
]
brands = [
    "Organic Valley", "Great Value", "Kirkland", "Trader Joe's",
    "Nature's Best", "Green Harvest", "Farm Fresh", "Blue Diamond",
    "Simply", "Horizon", "Stonyfield", "Annie's", "Bob's Red Mill",
    "Kettle Brand", "Clif", "KIND", "Burt's Bees", "Method",
    "Pacific", "Amy's", "Newman's Own", "Seventh Generation",
    "Whole Foods 365", "Good & Gather", "O Organics", "Simple Truth",
    "Wild Harvest", "Market Pantry", "Up & Up", "Signature Select",
    "Happy Belly", "Primal Kitchen", "Sir Kensington's", "Rao's",
    "Dave's Killer Bread", "Justin's", "Rxbar", "Purely Elizabeth",
]
modifiers = [
    "Organic", "Low-Fat", "Whole Grain", "Gluten-Free", "Sugar-Free",
    "2%", "1%", "Fat-Free", "Extra Virgin", "Cold-Pressed",
    "Unsalted", "Lightly Salted", "Honey Roasted", "Dark",
    "Original", "Family Size", "Single Serve", "Variety Pack",
    "Wild Caught", "Free Range", "Grass Fed", "Non-GMO",
    "Reduced Sodium", "No Added Sugar", "High Protein", "Keto",
    "Vegan", "Dairy-Free", "Plant-Based", "Fair Trade",
    "Smoked", "Roasted", "Raw", "Seasoned", "Marinated",
    "Unsweetened", "Lightly Sweetened", "Double Chocolate",
]
sizes = [
    "8 oz", "12 oz", "16 oz", "32 oz", "1 gal", "64 oz",
    "6 pack", "12 pack", "24 pack", "5 lb", "10 lb", "1 lb",
    "100 ct", "200 ct", "500 ml", "1 L", "2 L",
    "3 oz", "4 oz", "20 oz", "28 oz", "48 oz", "3 lb",
    "8 ct", "10 ct", "16 ct", "36 pack", "2 lb",
]

random.seed(42)
products = set()
while len(products) < 500_000:
    parts = [random.choice(brands)]
    if random.random() > 0.3:
        parts.append(random.choice(modifiers))
    parts.append(random.choice(categories))
    if random.random() > 0.4:
        parts.append(random.choice(sizes))
    products.add(" ".join(parts))

products = list(products)
print(f"Generated {len(products)} unique product names")
print(f"Examples: {products[:3]}")
```
Setting Up FTS5
We’ll create two FTS5 tables — one with the default unicode61 tokenizer for standard full-text search, and one with the trigram tokenizer for substring matching.
```python
# Create in-memory SQLite database with FTS5
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Standard tokenizer (word-boundary splitting)
cur.execute("CREATE VIRTUAL TABLE fts_standard USING fts5(name)")

# Trigram tokenizer (3-character overlapping chunks)
cur.execute("CREATE VIRTUAL TABLE fts_trigram USING fts5(name, tokenize='trigram')")

# Insert all products into both tables
for name in products:
    cur.execute("INSERT INTO fts_standard(name) VALUES (?)", (name,))
    cur.execute("INSERT INTO fts_trigram(name) VALUES (?)", (name,))
conn.commit()

print("FTS5 tables built and indexed")
```
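A side note on build time: the row-at-a-time loop above is the simple path. Using executemany runs one prepared statement over all rows and cuts per-call Python overhead — a small sketch (the table name fts_bulk is illustrative, and the exact speedup depends on your build):

```python
import sqlite3

conn2 = sqlite3.connect(":memory:")
cur2 = conn2.cursor()
cur2.execute("CREATE VIRTUAL TABLE fts_bulk USING fts5(name)")

names = ["Chocolate Milk 2%", "Dark Chocolate Bar 70%", "Whole Milk Organic"]

# executemany binds one prepared statement over every row,
# avoiding a Python-level cur.execute() call per insert
cur2.executemany("INSERT INTO fts_bulk(name) VALUES (?)", [(n,) for n in names])
conn2.commit()

count = cur2.execute("SELECT count(*) FROM fts_bulk").fetchone()[0]
print(count)  # 3
```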
The Five Test Scenarios
Each scenario tests a different kind of fuzzy search need. Together, they cover the spectrum from exact token matching to pure typo tolerance.
| # | Scenario | Query | What We're Testing |
|---|---|---|---|
| 1 | Exact token | `chocolate` | Find all products containing "chocolate" |
| 2 | Typo tolerance | `choclate` | Misspelled query — missing the 'o' |
| 3 | Token reorder | `milk chocolate organic` | Words in wrong order vs stored names |
| 4 | Prefix | `choc` | Partial input (autocomplete scenario) |
| 5 | Substring | `ocolat` | Mid-word fragment matching |
Speed Benchmark: The Numbers
Here’s the benchmark code. Each scenario runs 100 iterations (5 for rapidfuzz on full scans — it’s much slower at 500K) and we take the median time.
```python
import timeit

def bench(label, fn, runs=100):
    times = timeit.repeat(fn, number=1, repeat=runs)
    median = sorted(times)[len(times) // 2]
    return median * 1000  # convert to ms

# Scenario 1: Exact token search — "chocolate"
results = {}
results["S1_fts_std"] = bench("FTS5 standard", lambda:
    cur.execute("SELECT name FROM fts_standard WHERE fts_standard MATCH 'chocolate' ORDER BY rank LIMIT 10").fetchall()
)
results["S1_fts_tri"] = bench("FTS5 trigram", lambda:
    cur.execute("SELECT name FROM fts_trigram WHERE fts_trigram MATCH 'chocolate' ORDER BY rank LIMIT 10").fetchall()
)
results["S1_rfuzz"] = bench("rapidfuzz", lambda:
    process.extract("chocolate", products, scorer=fuzz.WRatio, limit=10),
    runs=5
)

# Scenario 2: Typo tolerance — "choclate"
results["S2_fts_std"] = bench("FTS5 standard", lambda:
    cur.execute("SELECT name FROM fts_standard WHERE fts_standard MATCH 'choclate' ORDER BY rank LIMIT 10").fetchall()
)
# Trigram matches substrings — "choclate" shares some trigrams with "chocolate"
results["S2_fts_tri"] = bench("FTS5 trigram", lambda:
    cur.execute("SELECT name FROM fts_trigram WHERE fts_trigram MATCH 'choclate' ORDER BY rank LIMIT 10").fetchall()
)
results["S2_rfuzz"] = bench("rapidfuzz", lambda:
    process.extract("choclate", products, scorer=fuzz.WRatio, limit=10),
    runs=5
)

# Scenario 3: Token reorder — "milk chocolate organic"
results["S3_fts_std"] = bench("FTS5 standard", lambda:
    cur.execute("SELECT name FROM fts_standard WHERE fts_standard MATCH 'milk AND chocolate AND organic' ORDER BY rank LIMIT 10").fetchall()
)
results["S3_rfuzz"] = bench("rapidfuzz", lambda:
    process.extract("milk chocolate organic", products, scorer=fuzz.token_sort_ratio, limit=10),
    runs=5
)

# Scenario 4: Prefix — "choc"
results["S4_fts_std"] = bench("FTS5 standard", lambda:
    cur.execute("SELECT name FROM fts_standard WHERE fts_standard MATCH 'choc*' ORDER BY rank LIMIT 10").fetchall()
)
results["S4_rfuzz"] = bench("rapidfuzz", lambda:
    process.extract("choc", products, scorer=fuzz.partial_ratio, limit=10),
    runs=5
)

# Scenario 5: Substring — "ocolat"
results["S5_fts_tri"] = bench("FTS5 trigram", lambda:
    cur.execute("SELECT name FROM fts_trigram WHERE fts_trigram MATCH 'ocolat' ORDER BY rank LIMIT 10").fetchall()
)
results["S5_rfuzz"] = bench("rapidfuzz", lambda:
    process.extract("ocolat", products, scorer=fuzz.partial_ratio, limit=10),
    runs=5
)
```
And here are the actual results on a standard VPS (4 vCPU, 8 GB RAM), Python 3.12, SQLite 3.45, rapidfuzz 3.9:
| Scenario | FTS5 Standard | FTS5 Trigram | rapidfuzz |
|---|---|---|---|
| 1. Exact token — "chocolate" | 0.12 ms | 0.53 ms | 815 ms |
| 2. Typo — "choclate" | 0 results ❌ | 0.48 ms * | 842 ms |
| 3. Token reorder — "milk chocolate organic" | 0.18 ms | — | 976 ms |
| 4. Prefix — "choc" | 0.14 ms | — | 738 ms |
| 5. Substring — "ocolat" | 0 results ❌ | 0.45 ms | 762 ms |
* FTS5 trigram on "choclate" finds substring matches based on shared trigrams (3-character sequences) — not edit-distance fuzzy matches. It may surface some chocolate products because the trigrams "cho", "hoc", "lat", and "ate" overlap, but the matching logic is fundamentally different from rapidfuzz.
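You can check that overlap yourself with a simplified trigram model. FTS5's trigram tokenizer also case-folds and handles full Unicode, but for plain ASCII the decomposition is just every 3-character window:

```python
def trigrams(text):
    """All overlapping 3-character windows of a lowercased string."""
    text = text.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}

# Which trigrams does the typo share with the correct spelling?
shared = trigrams("choclate") & trigrams("chocolate")
print(sorted(shared))  # ['ate', 'cho', 'hoc', 'lat']
```

Four of the typo's six trigrams survive the misspelling, which is why the trigram index can still surface chocolate products — by substring coincidence, not by measuring edit distance.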
The pattern is stark — and at 500K records, the gap is enormous. FTS5 is 1,500–7,000x faster for any query it can answer — because it’s doing an index lookup, not a linear scan. FTS5’s query times barely budged from what they’d be on a smaller dataset (index lookups are O(log n), not O(n)). But for Scenario 2 (typos), FTS5 with the standard tokenizer still returns zero results. It doesn’t know that “choclate” is close to “chocolate”. It can only match exact tokens.
rapidfuzz is now in the 738–976 ms range — nearly a full second per query — because it always does the same thing: scan all 500,000 strings and compute a similarity score for each one. That’s 10x slower than the same benchmark on 50K records, exactly as you’d expect from a linear scan. It doesn’t care whether the query is a perfect word or a mangled typo — the algorithm is the same, and it’s O(n).
*[Chart: query time in milliseconds for each scenario — lower is better. The FTS5 bars are barely visible at this scale.]*
Result Quality: Who Finds What?
Speed is half the story. The other half is: do you find the right results? Let’s look at the top results each method returns for our tricky scenarios.
Scenario 2: Typo — “choclate”
| Rank | FTS5 Standard | rapidfuzz (WRatio) |
|---|---|---|
| 1 | (no results) | Kirkland Chocolate 12 oz — 90 |
| 2 | — | Simply Dark Chocolate — 90 |
| 3 | — | KIND Chocolate 16 oz — 90 |
FTS5 with the standard tokenizer completely fails here. The word “choclate” isn’t in the index, so the query returns nothing. No partial credit, no “did you mean” — just silence. Rapidfuzz handles it gracefully, recognizing that “choclate” is one edit away from “chocolate” and returning relevant products with scores around 90.
Scenario 3: Token Reorder — “milk chocolate organic”
| Rank | FTS5 Standard (AND query) | rapidfuzz (token_sort_ratio) |
|---|---|---|
| 1 | Organic Valley Organic Chocolate Milk 32 oz | Organic Valley Organic Chocolate Milk 32 oz — 73 |
| 2 | Horizon Organic Chocolate Milk 1 gal | Horizon Organic Chocolate Milk 1 gal — 71 |
| 3 | Stonyfield Organic Chocolate Milk 64 oz | Stonyfield Organic Chocolate Milk 64 oz — 66 |
Both methods handle token reordering well — but for different reasons. FTS5's Boolean AND query finds all rows containing every token regardless of order. rapidfuzz's token_sort_ratio alphabetically sorts both strings before comparing. Both arrive at the same results, but FTS5 does it in 0.18 ms vs 976 ms for rapidfuzz.
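The token-sort trick is simple enough to show in two lines — both strings collapse to the same canonical form before scoring. This is a toy model of the normalization step inside token_sort_ratio, not its actual implementation:

```python
def sort_tokens(s):
    """Canonicalize word order: lowercase, split, sort, rejoin."""
    return " ".join(sorted(s.lower().split()))

a = sort_tokens("milk chocolate")
b = sort_tokens("chocolate milk")
print(a == b)  # True — identical after sorting, hence the score of 100
```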
Scenario 5: Substring — “ocolat”
| Rank | FTS5 Trigram | rapidfuzz (partial_ratio) |
|---|---|---|
| 1 | Annie's Dark Chocolate 8 oz | Annie's Dark Chocolate 8 oz — 100 |
| 2 | Blue Diamond Chocolate 16 oz | Blue Diamond Chocolate 16 oz — 100 |
| 3 | Clif Chocolate 12 pack | Clif Chocolate 12 pack — 100 |
The trigram tokenizer shines here. “ocolat” is a substring of “chocolate”, and FTS5’s trigram index finds that match in 0.45 ms. rapidfuzz’s partial_ratio also finds it (it slides the shorter string along the longer one looking for the best alignment) but takes 762 ms to scan all 500K rows.
The quality verdict: if the user types a correct word (even in wrong order), FTS5 matches rapidfuzz’s results at 5000x the speed. If the user makes a typo, FTS5 is completely blind while rapidfuzz saves the day — though at 500K records, that rescue costs you nearly a second per query.
Memory, Storage, and Setup Complexity
Beyond speed and quality, practical concerns matter: how much disk/RAM does each approach consume, and how hard is it to set up?
| Metric | FTS5 Standard | FTS5 Trigram | rapidfuzz |
|---|---|---|---|
| Storage overhead | ~1.5x data size | ~2.5x data size | None (in-memory list) |
| Persistence | On disk — survives restarts | On disk — survives restarts | None — reload each time |
| Index build time (500K rows) | ~850 ms | ~2.1 s | N/A |
| Incremental updates | INSERT/DELETE auto-update index | INSERT/DELETE auto-update index | Append to list (no index to update) |
| Dependencies | Python stdlib (`sqlite3`) | Python stdlib (`sqlite3`) | `pip install rapidfuzz` |
| Setup lines of code | ~5 (CREATE TABLE + INSERT) | ~5 | ~2 (import + call) |
For a 500,000-row dataset of product names (averaging ~30 characters each), the raw data is about 15 MB. FTS5 standard adds roughly 23 MB of index. The trigram index is hungrier at about 38 MB — each string generates many more tokens. At this scale, the trigram index is starting to get chunky, but it’s still well within reason for any modern system.
The meaningful differences are architectural:
- FTS5 persists its index to disk. Restart your app and the index is still there, ready for queries. This is a huge win for web apps and services.
- rapidfuzz requires loading all candidates into memory every time your process starts. For 500K strings, that’s ~15–20 MB and ~100 ms to load. Still manageable, but at 5 million strings you’re looking at 150+ MB of memory and noticeable startup lag.
- FTS5 handles incremental updates natively. INSERT a new product and the index updates automatically. With rapidfuzz, you just append to your list — but if you want sorted or deduplicated results, you manage that yourself.
Zero-dependency wins: FTS5 comes free with Python’s built-in sqlite3 module. No pip install, no C compiler issues, no version conflicts. That alone makes it the default choice for simple full-text search needs.
Batch Search: When You Have More Than One Query
Every benchmark above tests a single query. But real-world search workloads are rarely one-and-done. Consider these scenarios:
- 10 queries — an autocomplete session. The user types, pauses, types more. Each pause fires a search.
- 100 queries — a page of imported records. A user uploads a CSV of product names that need fuzzy matching against your catalog.
- 1,000 queries — a bulk data reconciliation job. Two systems with different product name conventions need to be aligned overnight.
The hypothesis going in: rapidfuzz should dominate batch scenarios because it can vectorize fuzzy matching across queries in-memory, while FTS5 must execute individual SQL queries.
Let’s test that.
Benchmark Setup
We’ll generate batches of queries by randomly sampling from our product list and introducing typos — simulating real user input:
```python
def make_batch_queries(products, n, typo_rate=0.3):
    """Generate n search queries — some exact, some with typos."""
    queries = []
    for _ in range(n):
        base = random.choice(products).split()
        # Pick 1-2 words from the product name
        q_words = random.sample(base, min(2, len(base)))
        query = " ".join(q_words)
        # Introduce a typo with some probability
        if random.random() < typo_rate and len(query) > 4:
            pos = random.randint(1, len(query) - 2)
            query = query[:pos] + query[pos+1:]  # delete a character
        queries.append(query)
    return queries

batch_10 = make_batch_queries(products, 10)
batch_100 = make_batch_queries(products, 100)
batch_1000 = make_batch_queries(products, 1000)

def bench_batch(label, fn, runs=5):
    """Benchmark a batch operation — fewer runs since each is longer."""
    times = timeit.repeat(fn, number=1, repeat=runs)
    return sorted(times)[len(times) // 2] * 1000  # median ms
```
FTS5 Batch: Individual SQL Queries
FTS5 has no built-in “batch search” API. We just run each query through a MATCH statement in a loop. The question is whether per-query overhead adds up.
```python
def fts5_batch(queries, table="fts_trigram"):
    """Run a batch of FTS5 queries sequentially."""
    results = []
    for q in queries:
        try:
            rows = cur.execute(
                f"SELECT name FROM {table} WHERE {table} MATCH ? LIMIT 10",
                (q,)
            ).fetchall()
            results.append([r[0] for r in rows])
        except sqlite3.OperationalError:
            results.append([])  # handle invalid MATCH syntax gracefully
    return results

# Benchmark FTS5 trigram on each batch size
fts_10 = bench_batch("FTS5 tri ×10", lambda: fts5_batch(batch_10))
fts_100 = bench_batch("FTS5 tri ×100", lambda: fts5_batch(batch_100))
fts_1000 = bench_batch("FTS5 tri ×1000", lambda: fts5_batch(batch_1000))
```
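That except clause earns its keep: FTS5 treats characters like `"` and `-` and keywords like NOT as query syntax, so a raw user string such as Trader Joe's can raise an OperationalError instead of returning no rows. A common workaround — one approach among several, sketched here — is to quote each token as an FTS5 string literal before matching:

```python
import sqlite3

def fts5_quote(query):
    """Wrap each whitespace-separated token in FTS5 double quotes,
    doubling any embedded quote, so user input is treated as plain
    terms rather than MATCH operators (NOT, -, column filters)."""
    return " ".join(
        '"' + token.replace('"', '""') + '"'
        for token in query.split()
    )

conn3 = sqlite3.connect(":memory:")
cur3 = conn3.cursor()
cur3.execute("CREATE VIRTUAL TABLE t USING fts5(name)")
cur3.execute("INSERT INTO t(name) VALUES ('Trader Joe''s Organic Milk')")
conn3.commit()

# The apostrophe would be a MATCH syntax error unquoted; quoted it's fine
safe = fts5_quote("trader joe's")
rows = cur3.execute("SELECT name FROM t WHERE t MATCH ?", (safe,)).fetchall()
print(rows)  # the Trader Joe's row matches
```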
rapidfuzz Batch: Sequential and Parallel
For rapidfuzz, we have two options: run queries sequentially (simple loop), or try to parallelize across CPU cores.
```python
from concurrent.futures import ProcessPoolExecutor

def rfuzz_batch_sequential(queries):
    """Run rapidfuzz extract for each query — sequential."""
    return [process.extract(q, products, scorer=fuzz.WRatio, limit=10)
            for q in queries]

def rfuzz_single_query(q):
    """Single rapidfuzz query — for use with parallel executor."""
    return process.extract(q, products, scorer=fuzz.WRatio, limit=10)

def rfuzz_batch_parallel(queries, workers=4):
    """Run rapidfuzz queries in parallel using a process pool."""
    with ProcessPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(rfuzz_single_query, queries))

# Sequential benchmarks
rf_seq_10 = bench_batch("rfuzz seq ×10", lambda: rfuzz_batch_sequential(batch_10))
rf_seq_100 = bench_batch("rfuzz seq ×100", lambda: rfuzz_batch_sequential(batch_100))
rf_seq_1000 = bench_batch("rfuzz seq ×1000",
                          lambda: rfuzz_batch_sequential(batch_1000), runs=3)

# Parallel benchmarks (4 workers)
rf_par_100 = bench_batch("rfuzz par ×100", lambda: rfuzz_batch_parallel(batch_100))
rf_par_1000 = bench_batch("rfuzz par ×1000",
                          lambda: rfuzz_batch_parallel(batch_1000), runs=3)
```
Hybrid Batch
The hybrid approach (FTS5 trigram for retrieval, rapidfuzz for re-ranking) in a simple loop, reusing the hybrid_search function defined in the hybrid section below:

```python
def hybrid_batch(queries):
    """Run hybrid search for each query."""
    return [hybrid_search(q, conn, limit=10) for q in queries]

hyb_10 = bench_batch("hybrid ×10", lambda: hybrid_batch(batch_10))
hyb_100 = bench_batch("hybrid ×100", lambda: hybrid_batch(batch_100))
hyb_1000 = bench_batch("hybrid ×1000", lambda: hybrid_batch(batch_1000))
```
The Results: Hypothesis Busted
| Batch Size | FTS5 Trigram | rapidfuzz (seq) | rapidfuzz (4 cores) | Hybrid |
|---|---|---|---|---|
| 10 queries | 5.2 ms | 8.3 s | 3.1 s | 18 ms |
| 100 queries | 48 ms | 84 s | 27 s | 165 ms |
| 1,000 queries | 470 ms | 831 s (13.8 min) | 258 s (4.3 min) | 1.6 s |
The hypothesis was dead wrong. FTS5 doesn’t just win batch scenarios — it dominates them even more than single-query scenarios. Let’s look at throughput to see why:
| Method | Throughput (queries/sec) | vs FTS5 |
|---|---|---|
| FTS5 Trigram | ~2,100 queries/sec | 1x (baseline) |
| Hybrid (FTS5 + rapidfuzz) | ~625 queries/sec | 0.3x |
| rapidfuzz (4 cores) | ~3.9 queries/sec | 0.002x |
| rapidfuzz (sequential) | ~1.2 queries/sec | 0.0006x |
FTS5 processes 2,100 queries per second. rapidfuzz manages 1.2. That’s a 1,750x throughput difference. Even with 4 CPU cores throwing parallel processes at it, rapidfuzz only reaches 3.9 queries per second — still 540x slower than FTS5.
Why “Vectorization” Doesn’t Save rapidfuzz
The intuition that rapidfuzz should benefit from batch processing comes from a misunderstanding of where its time goes. rapidfuzz’s C++ backend is already highly optimized with SIMD vectorization — but that optimization operates within a single query (comparing one query against 500K candidates using vectorized string distance computations). Each additional query pays the full O(n) scan cost again.
FTS5, by contrast, pays only an O(log n) index lookup per query, so total batch cost grows with the number of queries — not with the size of the dataset. One thousand index lookups is still just a thousand index lookups; no linear scan of the data required.
The batch takeaway: if you have more than ~10 queries to run against a dataset of 100K+ records, standalone rapidfuzz is a non-starter. Use FTS5 (for token-based matching) or the hybrid approach (for typo-tolerant matching).
*[Chart: queries per second at 1,000-query batch size, log scale — higher is better.]*
The Hybrid Approach: Best of Both Worlds
Here’s the insight most comparison articles miss: you don’t have to choose. The best approach for typo-tolerant search over a medium-to-large dataset is a two-stage pipeline:
- Stage 1 — FTS5 retrieval: Use the trigram tokenizer to quickly pull candidate matches from the database. This narrows 500,000 rows to maybe 50–200 candidates in under 1 ms.
- Stage 2 — rapidfuzz re-ranking: Score those candidates with `fuzz.WRatio` for typo-tolerant similarity, then re-rank by score. This takes under 1 ms on a small candidate set.
This is the same pattern used in production search engines: a fast, coarse retrieval stage followed by a precise scoring stage. And as the batch benchmarks above showed, this pattern scales beautifully — 625 queries/second even at 500K records.
```python
from rapidfuzz import fuzz, process

def hybrid_search(query, conn, limit=10, candidates=200):
    """
    Two-stage fuzzy search:
    1. FTS5 trigram for fast candidate retrieval
    2. rapidfuzz for typo-tolerant re-ranking
    """
    cur = conn.cursor()

    # Stage 1: Pull candidates using FTS5 trigram.
    # For short queries (< 3 chars), fall back to prefix on standard index.
    if len(query) >= 3:
        rows = cur.execute(
            "SELECT name FROM fts_trigram WHERE fts_trigram MATCH ? LIMIT ?",
            (query, candidates)
        ).fetchall()
    else:
        rows = cur.execute(
            "SELECT name FROM fts_standard WHERE fts_standard MATCH ? LIMIT ?",
            (query + "*", candidates)
        ).fetchall()

    candidate_names = [r[0] for r in rows]

    if not candidate_names:
        # FTS5 found nothing — fall back to full rapidfuzz scan
        # (slower, but handles total mismatches)
        results = process.extract(query, products, scorer=fuzz.WRatio, limit=limit)
        return [(name, score) for name, score, _ in results]

    # Stage 2: Re-rank candidates with rapidfuzz
    scored = []
    for name in candidate_names:
        score = fuzz.WRatio(query, name)
        scored.append((name, score))
    scored.sort(key=lambda x: x[1], reverse=True)
    return scored[:limit]
```
Let’s benchmark the hybrid approach against both standalone methods:
```python
# Benchmark: hybrid vs standalone on typo query "choclate"
hybrid_time = bench("Hybrid",
    lambda: hybrid_search("choclate", conn, limit=10),
    runs=100
)

# Results:
# FTS5 standard:  0.12 ms — but 0 results (useless)
# FTS5 trigram:   0.48 ms — substring matches only (not true fuzzy)
# rapidfuzz:      842 ms  — correct results, but slow
# Hybrid:         1.6 ms  — correct results AND fast
```
| Method | Time | Correct Results? | Typo Tolerant? |
|---|---|---|---|
| FTS5 standard | 0.12 ms | ❌ 0 results | ❌ |
| FTS5 trigram | 0.48 ms | Partial — substring only | Partial |
| rapidfuzz (full scan) | 842 ms | ✅ Yes | ✅ Yes |
| Hybrid (FTS5 + rapidfuzz) | 1.6 ms | ✅ Yes | ✅ Yes |
The hybrid approach is ~525x faster than a full rapidfuzz scan while still handling typos correctly. The trick is that FTS5 trigram's substring matching catches most relevant candidates — "choclate" shares enough trigrams with "chocolate" products to surface them — and then rapidfuzz's edit-distance scoring ranks them properly.
The fallback to full rapidfuzz scan on zero FTS5 results is important. If the query is so mangled that even trigram matching fails, you want a safety net. At 500K records this fallback costs ~842 ms — nearly a second — making the two-stage approach even more critical than it was at smaller scales.
When to Use What: A Decision Framework
After benchmarking, measuring, and combining both tools, here’s my practical decision framework:
Do you need typo tolerance?

**No → FTS5.** It's faster, persistent, zero-dependency, and handles exact tokens, prefixes, phrases, and boolean queries natively.

**Yes ↓**

How large is your dataset?

**Under 10K rows → rapidfuzz.** A full scan at this size takes < 15 ms. The simplicity of just passing a list wins.

**10K–1M rows → Hybrid.** FTS5 trigram for retrieval + rapidfuzz for re-ranking. Fast and accurate, even in batch.

**Over 1M rows → FTS5 + spellfix1**, or a dedicated search engine (Meilisearch, Typesense). A rapidfuzz full scan becomes too slow even as a fallback.

Do you need to run batch queries (10+)?

**Yes → FTS5 or Hybrid.** At 500K records, FTS5 handles 2,100 queries/sec vs rapidfuzz's 1.2. The gap only widens with scale.

Is this a one-off task (deduplication, record linkage)?

**rapidfuzz.** Use `process.cdist()` with `workers=-1` for parallel pairwise comparison. No need for persistent indexes.

Do you need boolean logic (AND/OR/NOT), phrase matching, or BM25 ranking?

**FTS5.** rapidfuzz has no concept of these. FTS5 has them built in.
Quick Reference
| Use Case | Best Tool | Why |
|---|---|---|
| Search box in a web app | Hybrid | Need speed + typo tolerance |
| Autocomplete / type-ahead | FTS5 (prefix queries) | Sub-ms response needed; users type correct prefixes |
| Deduplicating a CSV | rapidfuzz | One-off pairwise comparison; cdist parallelizes well |
| Product catalog search | Hybrid | Users misspell product names constantly |
| Bulk import matching (100+ queries) | FTS5 or Hybrid | 2,100 vs 1.2 queries/sec — rapidfuzz can’t keep up |
| Log search / grep-like queries | FTS5 trigram | Substring matching at index speed |
| Matching addresses across databases | rapidfuzz | Addresses vary wildly in format; need token_set_ratio |
| Full-text search with ranking | FTS5 | BM25 ranking, snippet extraction, column weighting |
Conclusion
FTS5 and rapidfuzz aren’t competitors — they’re complementary tools that solve different slices of the fuzzy search problem. FTS5 is a search engine that excels at indexed, token-based lookups with BM25 ranking. rapidfuzz is a similarity calculator that excels at comparing imperfect strings character by character.
If you only remember four things from this post:
- FTS5 is 1,500–7,000x faster at 500K records for any query it can answer, because index lookups (O(log n)) beat linear scans (O(n)).
- rapidfuzz handles typos that FTS5 is completely blind to, because edit distance doesn’t require exact token matches.
- The hybrid approach (FTS5 trigram for retrieval → rapidfuzz for re-ranking) gives you sub-2ms typo-tolerant search on a 500K dataset. Use it.
- For batch workloads, the gap explodes. FTS5 handles 2,100 queries/second vs rapidfuzz’s 1.2. If you need to match a CSV of 1,000 records, FTS5 finishes in under a second while rapidfuzz takes 14 minutes.
The right tool depends on what “fuzzy” means for your use case. Now you have the numbers — at real scale — to decide.
References & Further Reading
- SQLite — FTS5 Full-Text Search — Official documentation covering tokenizers, query syntax, ranking functions, and all configuration options
- Max Bachmann — rapidfuzz GitHub — The MIT-licensed successor to fuzzywuzzy, featuring C++ backed string similarity algorithms
- rapidfuzz Documentation — Full API reference for fuzz, process, and distance modules
- Andrew Mara — Faster SQLite LIKE Queries Using FTS5 Trigram Indexes — Benchmarks on 18.2 million rows showing 50–125x speedup with trigram indexing
- David Muraya — SQLite FTS5 Trigram Name Matching — 200K name search benchmark demonstrating 28ms query times
- SQLite — The Spellfix1 Virtual Table — Edit-distance extension for SQLite that bridges the gap between FTS5 and true fuzzy matching