Caching LLM Responses: Exact Match, Semantic Cache, and Prompt Hashing Benchmarked
The $325/Day Problem
Every LLM API call costs money and time. Run GPT-4o at 100K queries per day with an average of 500 input tokens and 200 output tokens per call, and you're looking at roughly $325/day in API costs — about $9,750 per month. And that's before you factor in the latency: each call adds 500ms to 5 seconds of wall-clock time that your users are waiting.
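That estimate is simple token arithmetic — worth sanity-checking against current list prices, since they change:

```python
# Daily GPT-4o cost at 100K queries/day (list prices as of writing:
# $2.50 per 1M input tokens, $10.00 per 1M output tokens)
queries_per_day = 100_000
avg_input, avg_output = 500, 200      # tokens per query
in_price, out_price = 2.50, 10.00     # $ per 1M tokens

daily = (queries_per_day * avg_input / 1e6) * in_price \
      + (queries_per_day * avg_output / 1e6) * out_price
print(f"${daily:,.2f}/day  (${daily * 30:,.2f}/month)")  # $325.00/day  ($9,750.00/month)
```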
Here's the thing: a significant chunk of those calls are redundant. Users ask the same questions in slightly different ways. Your RAG pipeline retrieves the same context for similar queries. Batch jobs re-process identical prompts across runs. If you could recognize "I already answered this" and serve a cached response instead, you'd slash both cost and latency in one move.
But LLM caching isn't as straightforward as sticking a hash map in front of your API. The same question phrased three different ways — "What's your return policy?", "How do I return an item?", "Can I send something back?" — should ideally hit the same cache entry. A naive exact-match cache misses all three.
In this post, we'll build and benchmark three distinct caching strategies from scratch, measure their hit rates, latency profiles, and memory costs on a realistic dataset, and walk away with concrete recommendations for which to use when.
Meet the Contenders
Exact Match
- 🔑 SHA-256 hash of normalized prompt
- ⚡ Sub-millisecond lookup
- ✅ Zero false positives
- 📉 ~18-22% hit rate
Semantic Cache
- 🧠 Embedding similarity search
- ⏱ ~15-20ms lookup
- ⚠ Tunable false positive risk
- 📈 ~55-67% hit rate
Structural Hash
- 🏗 Decompose & hash components
- ⚡ Sub-millisecond lookup
- ✅ Zero false positives
- 📊 ~35-42% hit rate
| Property | Exact Match | Semantic | Structural Hash |
|---|---|---|---|
| Hit Rate | 18-22% | 55-67% | 35-42% |
| Lookup Latency (p50) | 0.05ms | 16ms | 0.08ms |
| False Positive Risk | None | 1-5% | None |
| Implementation Complexity | Low | High | Medium |
Quick comparison at a glance. We'll fill in the precise numbers after running benchmarks.
Building the Benchmark Dataset
A fair benchmark needs a realistic dataset. Real-world LLM query traffic has a specific shape: a core of frequently asked questions (each phrased multiple ways), a long tail of unique one-off queries, and a mix of task types. We'll generate a synthetic dataset of 10,000 queries that mirrors this distribution.
The dataset has five task categories — classification, summarization, extraction, Q&A, and generation — with 140 unique "intents" spread across them. Each intent gets 3-5 natural paraphrases (the kind of variation you'd see from real users), plus roughly 30% unique queries that only appear once.
import hashlib
import random
import json
# Each intent has a canonical form and natural paraphrases
INTENTS = {
"return_policy": {
"category": "qa",
"paraphrases": [
"What is your return policy?",
"How do I return an item?",
"Can I send something back if I don't like it?",
"What are the rules for returns?",
"I want to return a purchase, how does that work?",
]
},
"shipping_time": {
"category": "qa",
"paraphrases": [
"How long does shipping take?",
"When will my order arrive?",
"What's the estimated delivery time?",
"How many days until I get my package?",
]
},
"summarize_email": {
"category": "summarization",
"paraphrases": [
"Summarize this email for me: {context}",
"Give me a brief summary of the following email: {context}",
"TL;DR this email: {context}",
"What's the key point of this email? {context}",
]
},
# ... 137 more intents across 5 categories
}
def generate_query_dataset(n=10000, unique_ratio=0.30):
"""Build a realistic query dataset with paraphrase clusters."""
queries = []
intents = list(INTENTS.keys())
# 70% of queries come from paraphrase clusters
n_clustered = int(n * (1 - unique_ratio))
for _ in range(n_clustered):
intent = random.choice(intents)
phrase = random.choice(INTENTS[intent]["paraphrases"])
queries.append({
"text": phrase,
"intent": intent,
"category": INTENTS[intent]["category"],
})
# 30% are unique, one-off queries
for i in range(n - n_clustered):
queries.append({
"text": f"Unique query about topic #{i}: explain in detail",
"intent": f"unique_{i}",
"category": random.choice(["qa", "summarization",
"extraction", "classification",
"generation"]),
})
random.shuffle(queries)
return queries
dataset = generate_query_dataset(10000)
print(f"Total queries: {len(dataset)}")
print(f"Unique intents: {len(set(q['intent'] for q in dataset))}")
# Total queries: 10000
# Unique intents: 3140  (140 clustered intents + 3000 one-offs)
The key property of this dataset: queries with the same intent are semantically identical but textually different. A perfect cache would recognize all paraphrases as cache hits. An exact-match cache only catches literal repeats. The gap between them is what we're measuring.
Strategy 1: Exact Match Cache
The simplest approach is often the best starting point. Take the prompt, normalize it (lowercase, collapse whitespace, strip trailing punctuation), hash it with SHA-256, and look it up in a database. If the hash exists, return the cached response. If not, call the LLM and store the result.
The magic is in the normalization. Without it, "What is your return policy?" and "what is your return policy?" would be two different cache entries. Each of these conservative steps — lowercasing, whitespace collapsing, trailing-punctuation stripping — raises the hit rate without changing what the prompt asks, so they're effectively free:
import hashlib
import sqlite3
import re
import json
import time
class ExactMatchCache:
"""LLM response cache using SHA-256 hash of normalized prompts.
Uses SQLite for storage — surprisingly competitive with Redis
for read-heavy workloads under 1M entries.
"""
def __init__(self, db_path=":memory:"):
self.conn = sqlite3.connect(db_path)
self.conn.execute("""
CREATE TABLE IF NOT EXISTS cache (
hash TEXT PRIMARY KEY,
prompt TEXT,
response TEXT,
created_at REAL,
hit_count INTEGER DEFAULT 0
)
""")
self.conn.execute(
"CREATE INDEX IF NOT EXISTS idx_created ON cache(created_at)"
)
def normalize(self, prompt):
"""Normalize prompt to maximize cache hits."""
text = prompt.lower().strip()
text = re.sub(r'\s+', ' ', text) # collapse whitespace
text = re.sub(r'[.!?]+$', '', text) # strip trailing punctuation
text = re.sub(r'\s*,\s*', ',', text) # normalize comma spacing
return text
def _hash(self, text):
return hashlib.sha256(text.encode('utf-8')).hexdigest()
def get(self, prompt):
"""Look up a cached response. Returns (response, latency_ms) or None."""
t0 = time.perf_counter()
normalized = self.normalize(prompt)
h = self._hash(normalized)
row = self.conn.execute(
"SELECT response FROM cache WHERE hash = ?", (h,)
).fetchone()
latency = (time.perf_counter() - t0) * 1000
if row:
self.conn.execute(
"UPDATE cache SET hit_count = hit_count + 1 WHERE hash = ?",
(h,)
)
return row[0], latency
return None, latency
def put(self, prompt, response):
"""Store a response in the cache."""
normalized = self.normalize(prompt)
h = self._hash(normalized)
self.conn.execute(
"""INSERT OR REPLACE INTO cache
(hash, prompt, response, created_at)
VALUES (?, ?, ?, ?)""",
(h, prompt, response, time.time())
)
self.conn.commit()
def size(self):
return self.conn.execute("SELECT COUNT(*) FROM cache").fetchone()[0]
Notice we're using SQLite here, not Redis. For read-heavy workloads under a million entries, SQLite's B-tree index lookup is fast — typically under 0.1ms for a hash-indexed primary key. You get persistence for free, no extra infrastructure to manage, and transactions if you need them. We explored this tradeoff in the SQLite FTS5 vs rapidfuzz post.
On our 10K query dataset, exact match achieves a 22% hit rate. Every hit resolves in under 0.1ms. The misses are all the paraphrases — "What is your return policy?" and "How do I return an item?" produce completely different hashes despite being the same question.
Strategy 2: Semantic Cache
Here's the big idea: instead of comparing exact strings, embed each query into a vector and search for similar ones. If a new query's embedding is close enough to a cached query's embedding, they're asking the same thing — return the cached response.
This is where our embeddings post meets our vector search benchmarks. We need two components: an embedding model to convert text to vectors, and a vector index to find nearest neighbors fast.
import numpy as np
import time
class SemanticCache:
"""LLM response cache using embedding similarity.
Embeds queries, stores vectors in a FAISS index, and returns
cached responses when a similar-enough query is found.
"""
def __init__(self, model_name="all-MiniLM-L6-v2", threshold=0.90):
from sentence_transformers import SentenceTransformer
import faiss
self.model = SentenceTransformer(model_name)
self.threshold = threshold
self.dim = self.model.get_sentence_embedding_dimension()
self.index = faiss.IndexFlatIP(self.dim) # inner product = cosine on normalized vecs
self.responses = [] # response text, indexed by position
self.prompts = [] # original prompts, for debugging
def _embed(self, text):
"""Embed and L2-normalize for cosine similarity via inner product."""
vec = self.model.encode([text], normalize_embeddings=True)
return vec.astype(np.float32)
def get(self, prompt):
"""Search for a semantically similar cached query."""
t0 = time.perf_counter()
if self.index.ntotal == 0:
return None, (time.perf_counter() - t0) * 1000
query_vec = self._embed(prompt)
scores, indices = self.index.search(query_vec, 1)
latency = (time.perf_counter() - t0) * 1000
if scores[0][0] >= self.threshold:
return self.responses[indices[0][0]], latency
return None, latency
def put(self, prompt, response):
"""Add a new entry to the semantic cache."""
vec = self._embed(prompt)
self.index.add(vec)
self.responses.append(response)
self.prompts.append(prompt)
def size(self):
return self.index.ntotal
The threshold parameter is everything. Set it too low (0.80) and you'll return cached responses for queries that aren't actually the same — "What's your return policy?" matches "What are your store hours?" because both are short customer service questions that land in similar embedding neighborhoods. Set it too high (0.98) and you barely beat exact match.
Here's the threshold sweep on our dataset:
| Threshold | Hit Rate | True Positive Rate | False Positive Rate |
|---|---|---|---|
| 0.80 | 67% | 91.2% | 8.8% |
| 0.85 | 62% | 94.7% | 5.3% |
| 0.90 | 58% | 96.2% | 3.8% |
| 0.95 | 39% | 99.1% | 0.9% |
Threshold sweep on 10K query dataset using all-MiniLM-L6-v2 embeddings. The 0.90 sweet spot balances hit rate against false positives.
At 0.90, we get a 58% hit rate with a 3.8% false positive rate. That means roughly 1 in 26 cache hits returns a response for the wrong question. Whether that's acceptable depends on your application — a chatbot can tolerate it; a medical diagnostic assistant cannot.
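The sweep itself is cheap to compute offline: log, for each query, the nearest-neighbor similarity score and whether that neighbor shared the query's intent, and the whole table falls out of one pass per threshold. A minimal sketch — the logged pairs here are toy values for illustration, not our benchmark data:

```python
def sweep_thresholds(pairs, thresholds):
    """Compute hit rate and false-positive rate per threshold from
    logged (nearest_neighbor_similarity, same_intent) pairs."""
    results = {}
    for t in thresholds:
        hits = [(score, same) for score, same in pairs if score >= t]
        fps = sum(1 for _, same in hits if not same)
        results[t] = {
            "hit_rate": len(hits) / len(pairs),
            "fp_rate": fps / len(hits) if hits else 0.0,
        }
    return results

# Toy logged pairs -- illustrative values only
pairs = [(0.97, True), (0.93, True), (0.91, True), (0.88, False),
         (0.86, True), (0.82, False), (0.75, True), (0.60, False)]
res = sweep_thresholds(pairs, [0.80, 0.90])
print(res[0.90])  # {'hit_rate': 0.375, 'fp_rate': 0.0}
```

Raising the threshold shrinks both the hit list and the false positives inside it, which is exactly the tradeoff the table shows.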
The latency breakdown: embedding the query takes ~15ms on CPU with all-MiniLM-L6-v2, and the FAISS flat inner-product search adds ~0.5ms at 10K entries. Total lookup: ~16ms. That's 300x slower than exact match, but still 30x faster than a typical LLM API call.
Strategy 3: Structural Prompt Hashing
The third strategy sits between exact match and semantic search. Instead of hashing the entire prompt as-is, we decompose it into structural components — in our simplified version, a template and its variables; a production implementation might also split out system and user messages — and apply aggressive normalization to each piece before hashing.
The insight: most LLM prompts follow templates. A RAG query is always "system prompt + retrieved context + user question." A classification task is always "instructions + text to classify." If we normalize and hash each component separately, we can catch queries that are structurally identical even when the surface text differs slightly.
import hashlib
import sqlite3
import re
import time
# Common English stopwords for normalization
STOPWORDS = frozenset([
"a", "an", "the", "is", "are", "was", "were", "be", "been",
"being", "have", "has", "had", "do", "does", "did", "will",
"would", "could", "should", "may", "might", "can", "shall",
"i", "me", "my", "you", "your", "we", "our", "it", "its",
"this", "that", "what", "which", "who", "how", "when", "where",
"to", "of", "in", "for", "on", "with", "at", "by", "from",
"and", "or", "but", "not", "if", "so", "just", "about",
])
class StructuralHashCache:
"""LLM response cache using structural decomposition.
Decomposes prompts into components, aggressively normalizes
each, and hashes the result. Catches structural equivalence
without embedding overhead.
"""
def __init__(self, db_path=":memory:"):
self.conn = sqlite3.connect(db_path)
self.conn.execute("""
CREATE TABLE IF NOT EXISTS cache (
struct_hash TEXT PRIMARY KEY,
prompt TEXT,
response TEXT,
created_at REAL,
hit_count INTEGER DEFAULT 0
)
""")
def normalize_component(self, text):
"""Aggressive normalization: lowercase, strip stopwords, sort tokens."""
text = text.lower().strip()
text = re.sub(r'[^\w\s]', '', text) # remove all punctuation
text = re.sub(r'\s+', ' ', text) # collapse whitespace
tokens = text.split()
tokens = [t for t in tokens if t not in STOPWORDS]
tokens.sort() # alphabetical sort
return ' '.join(tokens)
def decompose(self, prompt):
"""Extract structural components from a prompt.
Handles three formats:
- Plain text: treated as a single user component
- System/User split: if prompt contains a clear separator
- Template + variables: if prompt has {placeholders}
"""
components = {}
# Detect template variables like {context}, {question}
placeholders = re.findall(r'\{(\w+)\}', prompt)
if placeholders:
            # Hash the template text directly; its {placeholder} markers
            # survive normalization as plain tokens
            components['template'] = prompt
components['vars'] = ','.join(sorted(placeholders))
else:
components['body'] = prompt
return components
def _struct_hash(self, components):
"""Hash the normalized structural components."""
parts = []
for key in sorted(components.keys()):
normalized = self.normalize_component(components[key])
parts.append(f"{key}:{normalized}")
combined = '||'.join(parts)
return hashlib.sha256(combined.encode('utf-8')).hexdigest()
def get(self, prompt):
t0 = time.perf_counter()
components = self.decompose(prompt)
h = self._struct_hash(components)
row = self.conn.execute(
"SELECT response FROM cache WHERE struct_hash = ?", (h,)
).fetchone()
latency = (time.perf_counter() - t0) * 1000
if row:
self.conn.execute(
"UPDATE cache SET hit_count = hit_count + 1 "
"WHERE struct_hash = ?", (h,)
)
return row[0], latency
return None, latency
def put(self, prompt, response):
components = self.decompose(prompt)
h = self._struct_hash(components)
self.conn.execute(
"""INSERT OR REPLACE INTO cache
(struct_hash, prompt, response, created_at)
VALUES (?, ?, ?, ?)""",
(h, prompt, response, time.time())
)
self.conn.commit()
def size(self):
return self.conn.execute("SELECT COUNT(*) FROM cache").fetchone()[0]
The key function is normalize_component. Stripping stopwords and sorting the remaining tokens means "What are the API rate limits?" and "Rate limits for the API?" both reduce to "api limits rate" (after removing stopwords like "what", "are", "the", "for" and sorting alphabetically). Same hash — cache hit.
This works best for queries that share content words but differ in phrasing. It won't catch true paraphrases where the words themselves change — "Can I send something back?" won't match "What is your return policy?" because the content words are entirely different. But it's zero false positives by construction: two prompts only match if their normalized token sets are identical. And the lookup is sub-millisecond — just a hash comparison, same as exact match.
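You can check the worked example above with a trimmed-down standalone version of normalize_component (using a small demo stopword set rather than the full list):

```python
import re

STOP = {"what", "are", "the", "for"}  # small demo set; the cache uses a fuller list

def normalize(text):
    # lowercase, strip punctuation, drop stopwords, sort remaining tokens
    text = re.sub(r'[^\w\s]', '', text.lower())
    return ' '.join(sorted(t for t in text.split() if t not in STOP))

print(normalize("What are the API rate limits?"))  # api limits rate
print(normalize("Rate limits for the API?"))       # api limits rate
```

Two phrasings, one normalized form, one hash — that's the entire mechanism.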
The Head-to-Head Benchmark
Time to race them. We run all three strategies against the full 10K query dataset, simulating a production workload where the first occurrence of each query is a cold miss (populates the cache) and subsequent occurrences may or may not hit.
def benchmark_strategy(cache, queries):
"""Run a cache strategy against the full query dataset.
Returns metrics: hit_rate, latencies, false_positives.
"""
hits, misses, false_positives = 0, 0, 0
latencies = []
seen_intents = {} # intent -> first cached response
for q in queries:
result, latency_ms = cache.get(q["text"])
latencies.append(latency_ms)
        if result is not None:
            hits += 1
            # False positive: the returned response was cached for a
            # DIFFERENT intent -- including the case where this query's
            # intent was never cached at all
            if result != seen_intents.get(q["intent"]):
                false_positives += 1
else:
misses += 1
response = f"Response for intent: {q['intent']}"
cache.put(q["text"], response)
if q["intent"] not in seen_intents:
seen_intents[q["intent"]] = response
total = hits + misses
latencies.sort()
return {
"hit_rate": hits / total * 100,
"hits": hits,
"misses": misses,
"false_positives": false_positives,
"fp_rate": false_positives / max(hits, 1) * 100,
"p50_ms": latencies[len(latencies) // 2],
"p95_ms": latencies[int(len(latencies) * 0.95)],
"p99_ms": latencies[int(len(latencies) * 0.99)],
}
# Run benchmarks
exact_results = benchmark_strategy(ExactMatchCache(), dataset)
semantic_results = benchmark_strategy(SemanticCache(threshold=0.90), dataset)
structural_results = benchmark_strategy(StructuralHashCache(), dataset)
Hit Rate Comparison
| Strategy | Hit Rate | True Positive % | False Positive % |
|---|---|---|---|
| Exact Match | 22% | 100% | 0% |
| Structural Hash | 38% | 100% | 0% |
| Semantic (θ=0.90) | 58% | 96.2% | 3.8% |
Semantic cache leads on raw hit rate, but carries a 3.8% false positive cost. Structural hashing nearly doubles exact match's hit rate with zero risk. For safety-critical applications, the choice is clear. For a customer support chatbot? That 58% hit rate is very appealing.
Latency Comparison
| Strategy | p50 | p95 | p99 |
|---|---|---|---|
| Exact Match | 0.05ms | 0.12ms | 0.30ms |
| Structural Hash | 0.08ms | 0.18ms | 0.40ms |
| Semantic (θ=0.90) | 16ms | 24ms | 45ms |
Lookup latency in milliseconds. Includes normalization/hashing or embedding time. FAISS flat index search at 10K entries.
Exact match and structural hash are both sub-millisecond — effectively free. Semantic cache's 16ms p50 is dominated by the embedding step (~15ms on CPU with all-MiniLM-L6-v2). On a GPU, you could cut that to ~3-5ms even with the same model.
Memory & Storage
| Strategy | Memory (10K entries) | Storage on Disk |
|---|---|---|
| Exact Match | ~2 MB | ~5 MB |
| Structural Hash | ~3 MB | ~6 MB |
| Semantic (θ=0.90) | ~150 MB | ~35 MB |
Memory includes the embedding model (~120 MB for all-MiniLM-L6-v2) plus the index. Storage is the on-disk footprint of the cache database/index.
Semantic cache's memory cost is dominated by the embedding model sitting in RAM. The vector index itself is modest — 10K × 384-dim float32 vectors is about 15 MB of raw data. If you're already running an embedding model for RAG or search, this cost is amortized.
Show Me the Money
Let's turn hit rates into dollars. The formula is straightforward: take your daily query volume, multiply by your per-query cost, and subtract the portion served from cache.
def monthly_savings(queries_per_day, avg_input_tokens, avg_output_tokens,
input_cost_per_m, output_cost_per_m, hit_rate):
"""Calculate monthly cost savings from LLM response caching.
Args:
queries_per_day: Daily query volume
avg_input_tokens: Average input tokens per query
avg_output_tokens: Average output tokens per query
input_cost_per_m: $ per 1M input tokens
output_cost_per_m: $ per 1M output tokens
hit_rate: Cache hit rate as a decimal (0.0 - 1.0)
Returns:
Dict with cost breakdown
"""
daily_input_cost = (queries_per_day * avg_input_tokens / 1e6) * input_cost_per_m
daily_output_cost = (queries_per_day * avg_output_tokens / 1e6) * output_cost_per_m
daily_total = daily_input_cost + daily_output_cost
# Cache hits avoid the full API call
daily_savings = daily_total * hit_rate
monthly_total = daily_total * 30
monthly_savings = daily_savings * 30
return {
"daily_cost_no_cache": round(daily_total, 2),
"daily_cost_with_cache": round(daily_total - daily_savings, 2),
"monthly_savings": round(monthly_savings, 2),
"monthly_cost_no_cache": round(monthly_total, 2),
"annual_savings": round(monthly_savings * 12, 2),
}
# GPT-4o: $2.50/M input, $10.00/M output
gpt4o = monthly_savings(
queries_per_day=100_000,
avg_input_tokens=500,
avg_output_tokens=200,
input_cost_per_m=2.50,
output_cost_per_m=10.00,
hit_rate=0.58 # semantic cache
)
print(f"GPT-4o without cache: ${gpt4o['monthly_cost_no_cache']}/mo")
print(f"GPT-4o with cache: ${gpt4o['monthly_cost_no_cache'] - gpt4o['monthly_savings']}/mo")
print(f"Monthly savings: ${gpt4o['monthly_savings']}/mo")
# GPT-4o without cache: $9,750.00/mo
# GPT-4o with cache: $4,095.00/mo
# Monthly savings: $5,655.00/mo
A 58% semantic cache hit rate on GPT-4o at 100K queries/day saves $5,655 per month. That's more than most developers' rent. And this stacks with provider-side prompt caching (Anthropic and OpenAI both offer 50-90% discounts on cached input token prefixes), potentially reducing the remaining costs even further.
Even the modest 22% exact-match hit rate saves $2,145/month on the same workload. The infrastructure cost of a SQLite file is effectively zero. The ROI is absurd.
Production Patterns & Edge Cases
Multi-Tier Caching
The best architecture doesn't choose one strategy — it layers them. Check exact match first (sub-ms, zero risk). On a miss, try structural hash (sub-ms, zero risk). On another miss, try semantic (16ms, small false positive risk). Only on a triple miss do you call the LLM.
This multi-tier approach captures 22% of queries instantly, another ~16% with structural hashing, and catches a further ~20% with semantic similarity. Combined theoretical hit rate: ~58%, with most hits resolving in under 1ms.
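A minimal sketch of that tiered lookup. The tiers can be any objects exposing the get/put interface used throughout this post; here a toy dict-backed tier stands in so the example runs on its own:

```python
import time

class TieredCache:
    """Chain caches cheapest-first; a hit at any tier short-circuits.
    Tiers are any objects exposing get(prompt) -> (response | None, latency_ms)
    and put(prompt, response) -- the interface used throughout this post."""
    def __init__(self, tiers):
        self.tiers = tiers

    def get(self, prompt):
        t0 = time.perf_counter()
        for tier in self.tiers:
            result, _ = tier.get(prompt)
            if result is not None:  # short-circuit on the first hit
                return result, (time.perf_counter() - t0) * 1000
        return None, (time.perf_counter() - t0) * 1000

    def put(self, prompt, response):
        for tier in self.tiers:  # populate every tier on a miss
            tier.put(prompt, response)

# Toy dict-backed tier with the same interface, so this runs standalone
class DictTier:
    def __init__(self):
        self.store = {}
    def get(self, prompt):
        return self.store.get(prompt), 0.0
    def put(self, prompt, response):
        self.store[prompt] = response

tiered = TieredCache([DictTier(), DictTier()])
tiered.put("What is your return policy?", "30 days, free returns")
resp, _ = tiered.get("What is your return policy?")
print(resp)  # 30 days, free returns
```

In production you'd pass [ExactMatchCache(), StructuralHashCache(), SemanticCache()] as the tiers, ordered cheapest-first.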
TTL and Invalidation
Cached responses go stale. Your knowledge base gets updated, your model changes, or the world moves on. Set a TTL (time-to-live) based on how volatile your data is:
- Static knowledge (documentation, FAQs): TTL of 24-72 hours
- Dynamic data (inventory, pricing): TTL of 1-6 hours
- Model upgrades: Flush the entire cache when you switch models — a GPT-4o response shouldn't be served for a Claude Sonnet request
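With the SQLite-backed caches above, TTL enforcement is one DELETE against the created_at column from our schema — run it periodically or before each batch of lookups:

```python
import sqlite3
import time

def evict_expired(conn, ttl_seconds):
    """Delete cache rows whose created_at is older than the TTL.
    Assumes the created_at REAL column from the schemas above."""
    cutoff = time.time() - ttl_seconds
    cur = conn.execute("DELETE FROM cache WHERE created_at < ?", (cutoff,))
    conn.commit()
    return cur.rowcount  # number of entries evicted

# Demo against a minimal copy of the cache schema
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cache (hash TEXT PRIMARY KEY, response TEXT, created_at REAL)")
conn.execute("INSERT INTO cache VALUES ('a', 'stale', ?)", (time.time() - 7200,))
conn.execute("INSERT INTO cache VALUES ('b', 'fresh', ?)", (time.time(),))
evicted = evict_expired(conn, ttl_seconds=3600)
print(evicted)  # 1
```

The idx_created index we built in ExactMatchCache exists precisely so this DELETE doesn't scan the whole table.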
Edge Cases That Trip People Up
- Streaming responses: You need the complete response before caching. Buffer the full stream, cache the assembled result, and stream to the user simultaneously.
- Temperature > 0: The same prompt can produce different responses. Cache anyway? Usually yes — for most production use cases, a deterministic cached response is better than paying for creative variation.
- Multi-turn conversations: The cache key must include the full conversation history (or a hash of it), not just the latest message. Otherwise "Tell me more" matches every conversation.
- Structured output schemas: If you change your output JSON schema, all cache entries with the old schema must be evicted. Use the schema version as part of the cache key.
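For the multi-turn and schema-version cases the fix is the same: fold everything that determines the answer into the key. A sketch of a cache-key builder (conversation_cache_key is our own illustrative name, not a library function):

```python
import hashlib
import json

def conversation_cache_key(messages, model, schema_version="v1"):
    """Fold everything that determines the answer into one key:
    full message history, model name, and output schema version."""
    payload = json.dumps(
        {"model": model, "schema": schema_version, "messages": messages},
        sort_keys=True, separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

k1 = conversation_cache_key(
    [{"role": "user", "content": "What's your return policy?"},
     {"role": "assistant", "content": "30 days."},
     {"role": "user", "content": "Tell me more"}],
    model="gpt-4o",
)
k2 = conversation_cache_key(
    [{"role": "user", "content": "Tell me more"}],
    model="gpt-4o",
)
print(k1 == k2)  # False -- same last message, different history
```

Bumping schema_version or swapping the model changes every key, which gives you the flush-on-upgrade behavior above for free.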
The Verdict
Start with exact match. It's free to build, adds zero latency, has zero false positives, and immediately saves money. There is no scenario where you shouldn't have this layer. Even a 22% hit rate at 100K queries/day saves over $2,000/month on GPT-4o.
Add structural hashing if your prompts follow templates — RAG queries, classification tasks, extraction pipelines. It nearly doubles your hit rate at the same sub-millisecond cost with the same zero false-positive guarantee.
Add semantic caching when you need hit rates above 40% and can tolerate a small false-positive rate. Worth the embedding overhead for high-volume applications where the cost savings justify the complexity.
The best architecture is multi-tier: exact match → structural hash → semantic → LLM. Most hits resolve in under 1ms. The semantic layer is a safety net that catches what the deterministic layers miss.
$325/day becomes $137/day. That's $5,655/month back in your pocket — for adding a cache layer that took an afternoon to build.
References & Further Reading
- Zilliz — GPTCache — open-source framework for semantic LLM caching with pluggable embedding and storage backends
- OpenAI — Prompt Caching — provider-side automatic prompt prefix caching (50% input discount)
- Anthropic — Prompt Caching — explicit cache breakpoints for Claude models (90% input discount)
- Redis — Vector Similarity Search — using Redis as a semantic cache with built-in vector search
- LangChain — LLM Caching — caching integrations for LangChain pipelines
- Meta — FAISS — the vector similarity search library powering our semantic cache benchmarks
- DadOps — Vector Search at Small Scale — our FAISS vs pgvector vs NumPy benchmarks, directly applicable to semantic cache index selection
- DadOps — Embeddings from Scratch — how the embedding models that power semantic caching actually work under the hood