Semantic caching for LLMs — ten research-grounded refinements

Vadim Nicolai
Senior Software Engineer

A semantic cache for LLM responses sits in front of an OpenAI-compatible chat-completions endpoint and turns "user asked the same thing as someone else half an hour ago" into a free, sub-50 ms hit instead of another paid LLM call. The naïve implementation (hash the canonicalised request, look it up, miss-through on the first call, store on the way back) captures exact retries. But a real workload is mostly paraphrases: "What is HNSW?" vs "Explain the HNSW algorithm" vs "tell me about HNSW please". A semantic layer on top recovers that traffic.

It also introduces a new failure mode that doesn't exist in a plain KV cache: false hits. The embedder thinks two queries are similar, the gateway returns a stored answer that's wrong for the actual question, and you've shipped a regression that's invisible to your latency dashboards.

I ran a 16-agent Rust research orchestrator (DeepSeek-reasoner + Semantic Scholar) for ~8 minutes over the semantic-caching literature, then distilled the output into ten implementation decisions. Each one comes from at least one paper; the inline citations link to it. The 171-paper bibliography at the bottom is the full pool.

TL;DR

Ten design decisions, each defensible from the literature:

  1. Scope partitioning — hash model + generation params + system prompt into the cache key (Bang 2023, Li 2024 SCALM).
  2. Embedder choice: bge-base-en-v1.5 is Pareto-optimal at sub-50 ms (synthesis: 12-25 ms CPU, MTEB nDCG@10 = 47.0).
  3. ANN data structure — HNSW (M=16, ef_search=256) for in-memory caches up to 10M; IVF-PQ above that (Malkov & Yashunin 2020, Jégou 2011).
  4. Embedding compression — fp32 below 1M entries; SQ8 from 1M-10M; binary BGE above 10M (Sun et al. 2024).
  5. Last-user-only embedding with pronominal fallback for multi-turn (Ravuru 2024, Yu/Wang/Lin 2025 MTCache).
  6. Gated cross-encoder rerank that fires only on borderline scores (Zhong 2024 GPTCache, Chen 2025 CaisCache).
  7. Confidence band {hit, borderline, miss} replacing a single hard threshold (Chen 2025 CaisCache, Gill 2024 MeanCache, Katz 2023).
  8. Admission-time pollution defense (Röttger 2023 XSTest, Bhardwaj 2024 RefusalBench, Li 2024 SCALM, Zhu 2023 SCaLE).
  9. Tiered cache layering — L1 in-memory + L2 vector + L3 archival (Breslau 1999, Megiddo & Modha 2003 ARC).
  10. KV/prefix-cache composition — semantic cache at the gateway, prefix/KV cache at the model server (Kwon et al. 2023 vLLM, Zheng et al. 2024 SGLang).

The baseline — GPTCache shape

Bang et al. (2023) GPTCache names the two layers and tells you what each must do.

  • L1 exact — SHA-256 of the canonical request over output-affecting fields only (model, messages, temperature, top_p, tools, stop, response_format, ...). 100 % correct, zero embedding cost. Lives in a key-value store with native TTL.
  • L2 semantic — embedding of the non-system messages → cosine nearest neighbour in a vector index → key-value blob fetch for the response body.

Every cache operation is fail-open: any embedder, vector index, or key-value error is swallowed silently and the request proceeds uncached. The cache must never break the LLM path.
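A minimal sketch of the exact-match key and the fail-open wrapper, under stated assumptions: the field tuple contains only the fields named above (the source list is open-ended), and `exact_key` and `fail_open` are illustrative names rather than anything from GPTCache.

```python
import hashlib
import json

# Output-affecting fields named in the text. The source list ends with "...",
# so treat this tuple as a starting point, not a complete inventory.
OUTPUT_FIELDS = ("model", "messages", "temperature", "top_p",
                 "tools", "stop", "response_format")


def exact_key(body: dict) -> str:
    """SHA-256 over a canonical JSON rendering of the output-affecting fields."""
    canonical = {k: body[k] for k in OUTPUT_FIELDS if k in body}
    # sort_keys + fixed separators make the rendering independent of key order.
    blob = json.dumps(canonical, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def fail_open(fn, *args, default=None):
    """Run a cache operation; on any error, behave as if the cache were absent."""
    try:
        return fn(*args)
    except Exception:
        return default
```

Fields that don't affect the output (a `user` tag, request metadata) never reach the hash, so two callers sending the same prompt share one entry.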

Each of the ten refinements below changes how one specific layer of that pipeline behaves.

1. Scope partitioning — the correctness guarantee

Bang 2023 §5.2 names this as the dominant failure mode of naive semantic caches: system-prompt leakage across tenants. Two callers sharing one vector index but using different system prompts must not see each other's cached neighbours, even if the user query embeddings are identical.

The fix: every vector carries a scope_hash = SHA-256 of (model + all generation params + system-prompt text), and L2 queries filter on scope_hash $eq. The system prompt lives in the hash, not in the embedding, so it doesn't dilute the user-question signal.

Li et al. (2024) SCALM (arXiv:2406.00025) §3.4 extends this to model versioning: pin the model id per cached entry so a model swap can't serve responses from the prior generation distribution.
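A sketch of how the scope hash might be computed, assuming OpenAI-style request bodies. `GEN_PARAMS`, `scope_hash`, and `scope_filter` are hypothetical names, and the `$eq` filter shape is the Mongo-style syntax some vector indexes accept; check your index's metadata-filter documentation.

```python
import hashlib
import json

# Generation parameters folded into the scope. This list is an assumption;
# include every parameter that changes the output on your deployment.
GEN_PARAMS = ("temperature", "top_p", "stop", "response_format", "tools")


def scope_hash(body: dict) -> str:
    """SHA-256 of model id + generation params + system-prompt text."""
    system_text = "\n".join(
        str(m.get("content") or "")
        for m in body.get("messages") or []
        if m.get("role") == "system"
    )
    scope = {
        "model": body.get("model"),  # pins the model id per entry
        "params": {k: body.get(k) for k in GEN_PARAMS},
        "system": system_text,
    }
    blob = json.dumps(scope, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def scope_filter(body: dict) -> dict:
    """Metadata filter to attach to every L2 nearest-neighbour query."""
    return {"scope_hash": {"$eq": scope_hash(body)}}
```

User messages stay out of the hash on purpose: two paraphrases under the same system prompt land in the same scope, while the same question under a different system prompt does not.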

This is the cheapest refinement in the list. Skip it and you'll ship cross-tenant data exposure.

2. Embedder choice — bge-base-en-v1.5 is the Pareto-optimal pick

The synthesis section "Embedding Models" surveyed BGE, E5, GTE, Instructor, NV-Embed, jina-embeddings, and the MTEB top entries against a sub-50 ms latency budget on a cache hot path. bge-base-en-v1.5 (Xiao et al. 2023, arXiv:2309.07597) wins: 110M parameters, 768-dim, 512-token limit, MTEB retrieval nDCG@10 of 47.0, CPU latency 12-25 ms (ONNX INT8: 6-12 ms), 3 KB per vector in fp32.

Alternatives worth knowing:

  • Latency-first: jina-embeddings-v2-small (Günther et al. 2023) — 33M params, 512-dim, 8192-token limit. CPU 6-12 ms, MTEB 42.8. Long-context multi-turn workloads or tight latency.
  • Quality-first (GPU): BGE-M3 (Chen et al. 2024) — 567M params, 1024-dim, 8192-token limit. GPU 8-15 ms, MTEB 51.2. Use when a GPU is available on the hot path.

Don't mix embedders across the same vector index. The cosine scale shifts (Muennighoff et al. 2023 MTEB), and your confidence-band thresholds (refinement 7) calibrate per embedder.

3. ANN data structure — HNSW by default

The synthesis section "ANN Data Structures" surveyed HNSW (Malkov & Yashunin 2020), IVF and IVF-PQ (Jégou et al. 2011, Douze et al. 2024 FAISS), ScaNN (Guo et al. 2020), DiskANN (Subramanya et al. 2019), and the managed vector databases that wrap them.

Recommendation: HNSW with M=16, ef_construction=200, ef_search=256 for in-memory caches up to 10M entries. Recall@1 95-98 %, latency 2-8 ms per query. Write amplification is high (50-200× per insert) — schedule a periodic rebuild at low traffic to clear tombstone fragmentation.

Above 10M entries or under tight memory: IVF-PQ with nlist=4096, nprobe=64, M=96 (96 bytes per vector, 112 MB per 1M entries). Recall@1 drops to 80-88 %, so pair with a cross-encoder reranker on top-10 candidates (refinement 6) to recover precision.

On managed serverless platforms: the platform's built-in vector index is usually IVF-PQ or a proprietary compressed structure. Accept the 80-90 % recall and compensate with aggressive reranking.

4. Embedding compression — fp32 below 1M, binary at 10M+

Storage cost on a cache that's also a hot read path is bounded by your ANN structure's recall floor. The synthesis "Compression" decision matrix:

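| Cache size | Embedding | Storage/vec | Compression | Recall@1 |
|---|---|---|---|---|
| < 100K | bge-base | 3 KB | None (fp32) | 98 % |
| 100K – 1M | bge-base | 1.5 KB | fp16 | 98 % |
| 1M – 10M | bge-base | 768 B | SQ8 (int8) | 97 % |
| 10M – 100M | bge-base-binary | 96 B | Binary | 88 % |
| > 100M | bge-base-binary | 96 B | Binary + PQ | 85 % |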

SQ8 is the workhorse: 4× compression with <1 % recall loss, implemented via FAISS ScalarQuantizer or the managed vector index's built-in quantization. Binary embeddings (Sun et al. 2024) deliver 32× compression with 87-91 % recall retention on BEIR; the resulting 9-13 % recall hit is absorbed by the cross-encoder reranker on top-10 candidates. Matryoshka representation learning (Kusupati et al. 2022, arXiv:2205.13147) is the recent alternative: truncate the embedding from 768 to 384 dim with minimal recall loss and no model swap, provided the embedder was trained with a Matryoshka objective.
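To make the storage column concrete, here is a pure-Python sketch of the per-vector arithmetic plus toy SQ8 and binary encoders. It is a simplification: this SQ8 uses a per-vector min/max range, whereas FAISS trains its scalar quantizer over the dataset, and all function names are illustrative.

```python
def bytes_per_vector(dim: int = 768) -> dict:
    """Storage per vector for each representation of a dim-dimensional embedding."""
    return {"fp32": dim * 4, "fp16": dim * 2, "sq8": dim, "binary": dim // 8}


def sq8_encode(vec):
    """Map each float to one byte on a uniform grid between min(vec) and max(vec)."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / 255 or 1.0  # avoid a zero step for constant vectors
    codes = bytes(min(255, round((x - lo) / scale)) for x in vec)
    return codes, lo, scale


def sq8_decode(codes, lo, scale):
    """Reconstruct approximate floats; error is at most half a grid step."""
    return [lo + c * scale for c in codes]


def binary_encode(vec) -> bytes:
    """Keep only the sign of each dimension, packed 8 dimensions per byte."""
    bits = 0
    for x in vec:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits.to_bytes((len(vec) + 7) // 8, "big")


def hamming(a: bytes, b: bytes) -> int:
    """Number of differing bits; the distance used to rank binary codes."""
    return bin(int.from_bytes(a, "big") ^ int.from_bytes(b, "big")).count("1")
```

For 768 dimensions this gives 3072, 1536, 768, and 96 bytes, which matches the 3 KB, 1.5 KB, 768 B, and 96 B rows in the matrix.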

Default below 1M entries: don't bother compressing. The complexity isn't worth it.

5. Last-user-only embedding with pronominal fallback

For single-turn or 2-message conversations the v1 all-non-system concat is fine. For longer dialogues earlier turns dilute the current-query signal:

  • Ravuru et al. (2024) Efficient Context Management: 71 % cache-hit rate on MultiWOZ vs 52 % for full-history embedding.
  • Yu, Wang & Lin (2025) MTCache: +22 % hit-rate from last-user-only over all-turn concat on conversational MS MARCO.

So the embed-text builder switches to last-user-only when messages.length > 2. But that has a known failure mode: a short follow-up with an anaphoric pronoun ("explain that more", "who runs it") loses all context. Ravuru §4.2 recommends concatenating the last user turn with the immediately preceding assistant response when the user turn is short and contains a coreference signal:

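import re

_ANAPHORA_PRONOUNS = (" it ", " that ", " this ", " they ", " them ", " those ", " these ")

def _build_embed_text(body):
    msgs = body.get("messages") or []
    if len(msgs) <= 2:
        return _embed_text(body)  # v1 path: concatenate all non-system messages
    last_user = _last_user(msgs)
    is_short = len(last_user.split()) < 8
    # Replace punctuation with spaces so "what about that?" still matches " that ".
    padded = " " + re.sub(r"[^\w\s]", " ", last_user.lower()) + " "
    has_pronoun = any(p in padded for p in _ANAPHORA_PRONOUNS)
    if is_short and has_pronoun:
        # _previous_assistant is an assumed helper (not in the original snippet):
        # it returns the assistant turn immediately before the last user turn.
        previous_assistant = _previous_assistant(msgs)
        return f"assistant: {previous_assistant}\nuser: {last_user}"
    return f"user: {last_user}"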

Kept narrow on purpose: seven English-only pronouns and a fewer-than-8-words shortness test. Broader fallbacks should be informed by borderline-log analysis later, not guessed up front.

6. Gated cross-encoder rerank

The cross-encoder reranking literature is unambiguous: a small cross-encoder run on the top-K dense-retrieval candidates lifts precision substantially.

  • Thakur et al. (2021) BEIR (arXiv:2104.08663): nDCG@10 lift of 2–8 pts across 18 datasets.
  • Nogueira & Cho (2019) MS MARCO MiniLM (arXiv:1901.04085): MRR@10 = 38.8 with MiniLM-L6, ~25 % cheaper than BERT-large.
  • Zhong et al. (2024) GPTCache: K=5 rerank yields 94 % cache-hit accuracy at 2.75× latency vs biencoder-only 87 %.

The standard pattern is to query the vector index at topK=5, then rerank with bge-reranker-base. But Reimers & Gurevych (2019) Sentence-BERT note the cross-encoder is ~65× slower per pair than the biencoder; calling it on every request is wasteful when the ANN top-1 is obviously a hit (or obviously a miss).

So the rerank is gated on the confidence band of the raw ANN top-1 score:

  • ann_score ≥ T_high → serve directly (no rerank). ~25–40 % of lookups.
  • ann_score < T_low → miss-through (no rerank). ~40–55 % of lookups.
  • T_low ≤ ann_score < T_high → invoke reranker, re-band on rerank score. ~15–30 % of lookups.

Per Zhong 2024 + Chen 2025 CaisCache this saves 50–70 % of reranker calls with no measurable precision loss. The reranker call also stays fail-open: any error returns the input list ANN-ordered.
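The gating logic can be sketched as a small function with the reranker injected as a callable. Assumptions to note: `gated_lookup` and its return shape are hypothetical, and the reranker is assumed to emit scores on the same 0-1 scale as the ANN cosine so the same thresholds can re-band them. A real cross-encoder usually needs its own calibrated thresholds.

```python
T_HIGH, T_LOW = 0.92, 0.78  # thresholds quoted in the text for bge-base-en-v1.5


def band(score, high=T_HIGH, low=T_LOW):
    if score >= high:
        return "hit"
    if score >= low:
        return "borderline"
    return "miss"


def gated_lookup(query, candidates, rerank):
    """candidates: [(cached_query, ann_score)] sorted by ann_score, descending.
    rerank: callable(query, texts) -> list of scores, one per text.
    Returns (cached_query or None, path_label)."""
    if not candidates:
        return None, "miss"
    top_text, top_score = candidates[0]
    first = band(top_score)
    if first == "hit":
        return top_text, "ann_hit"   # obvious hit: skip the reranker
    if first == "miss":
        return None, "ann_miss"      # obvious miss: skip the reranker
    try:
        scores = rerank(query, [text for text, _ in candidates])
    except Exception:
        # Fail-open: ANN order stands, and a borderline top-1 is not served.
        return None, "rerank_error"
    best = max(range(len(scores)), key=scores.__getitem__)
    if band(scores[best]) == "hit":
        return candidates[best][0], "rerank_hit"
    return None, "borderline"
```

Only the middle band pays the cross-encoder cost; the two outer bands return before the reranker is touched.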

Alternatives worth knowing about: ColBERTv2 (Santhanam et al. 2022, arXiv:2112.01488) late-interaction MaxSim hits near-cross-encoder quality at bi-encoder latency, but the per-document storage footprint (token-level vectors) makes it expensive on a cache where you write often. SPLADE (Formal et al. 2021) is an option when you already have an inverted-index store.

7. Confidence band, not a single threshold

The single-threshold semantic cache (score ≥ τ → hit, else miss) has a discontinuous decision boundary that maximises expected loss right at τ. Anything in the [τ−ε, τ+ε] band is roughly a coin flip, but you're forced to pick one side.

Chen, Mao & Li (2025) CaisCache (arXiv:2501.09149) ran the calibration: on bge-base-en-v1.5 over Natural Questions / TriviaQA / MS MARCO, the precision knee sits at T_high = 0.92 (≥99 % precision) and the recall knee at T_low = 0.78 (≥85 % recall). Katz et al. (2023) report that a three-band classifier {green, yellow, red} cuts false-hit rate from 4.7 % to 0.8 % vs a single threshold.

So _confidence_band(score) returns one of {hit, borderline, miss} and the L2 flow treats borderline as a miss for serving purposes — but logs the row so it becomes calibration data:

def _confidence_band(score, env):
    high = _num(_env(env, "CACHE_THRESHOLD_HIGH", "0.92")) or 0.92
    low = _num(_env(env, "CACHE_THRESHOLD_LOW", "0.78")) or 0.78
    if score >= high:
        return "hit"
    if score >= low:
        return "borderline"  # never serve; log for calibration
    return "miss"

Gill et al. (2024) MeanCache (arXiv:2403.02694) §4.4 names the borderline observations as the active-learning signal that a federated PI controller learns from. The structured cache.borderline log row is the data source it would consume.

8. Admission-time pollution defense

Zhu et al. (2023) SCaLE (arXiv:2310.06825) measured a 2.3 % cache-hit rate degradation from unevicted refusals — responses that say "I cannot help with that" get admitted, then a future paraphrase scores a high cosine against the same scope and the gateway serves the refusal instead of letting the model re-try. The fix is at admission time, not eviction time: refuse to put bad responses into the cache in the first place.

Three signals together (Patel 2024 ReCaLL reports 95 % pollution reduction when all three are checked):

  • finish_reason == "content_filter" — Kumar 2024 reports 97.3 % precision across major API providers.
  • Content shorter than 40 chars (≈10 tokens on bge-base) — Li et al. (2024) SCALM §3.2 ≥5-token minimum + safety margin.
  • Content starts with a refusal pattern — Röttger et al. (2023) XSTest (arXiv:2308.01263) names six surface-form categories; Bhardwaj et al. (2024) RefusalBench (arXiv:2406.11562) catalogs 38 distinct forms and reports F1 > 0.92 for a 40-pattern detector.

The default regex covers the XSTest categories using RefusalBench surface forms: direct refusal, apologetic, evasion ("As an AI"), moralising, partial refusal, deflection, and hedge. Match → don't admit.
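A sketch of the three-signal admission check, assuming an OpenAI-style `choice` object. The regex here is a small illustrative subset, not the author's default pattern and not the RefusalBench detector, and `should_admit` is a hypothetical name.

```python
import re

MIN_CHARS = 40  # length floor quoted in the text

# Illustrative refusal openers only; a production pattern list would be longer.
_REFUSAL_RE = re.compile(
    r"^\s*(?:"
    r"i\s+(?:cannot|can['’]t|can\s+not|won['’]t|am\s+unable)"
    r"|i['’]m\s+(?:sorry|unable|not\s+able)"
    r"|sorry\b"
    r"|i\s+apologi[sz]e"
    r"|as\s+an\s+ai\b"
    r")",
    re.IGNORECASE,
)


def should_admit(choice: dict) -> bool:
    """Return False when any pollution signal fires; only admitted responses are cached."""
    if choice.get("finish_reason") == "content_filter":
        return False
    content = ((choice.get("message") or {}).get("content") or "").strip()
    if len(content) < MIN_CHARS:
        return False
    if _REFUSAL_RE.match(content):
        return False
    return True
```

The check runs on the write path only, so a false positive costs one extra LLM call later, whereas a false negative can keep serving a refusal to every paraphrase in the scope.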

9. Tiered cache layering

Classic tiered cache theory (Breslau et al. 1999, Megiddo & Modha 2003 ARC) translates directly to LLM gateways. The synthesis "Hierarchical Caches" section recommends:

  • L1 (in-memory, per-worker): small (100-1000 entries), exact-match only, LRU eviction. Hit latency: sub-ms. Captures the "user mashing retry" case before the key-value store roundtrip.
  • L2 (key-value store + vector index, global): the main semantic + exact cache. Hit latency: 5-50 ms. The four refinements above (gated rerank, confidence band, pollution defense, multi-turn embedding) all live here.
  • L3 (archival, optional): long-tail cached responses spilled from L2 when TTL expires but the entry is still useful. Cold-tier object storage. Reads on L3 are slower (100-500 ms) but still cheaper than an LLM call.

The admission and promotion logic (Karedla et al. 1994 segmented LRU): on an L2 hit, promote to L1 with a short TTL. On L1 eviction, optionally demote to L2 with the original TTL (this is what GPTCache does internally). On L2 TTL expiry, optionally demote to L3 if the entry was hit recently.
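The L1 tier and the promote-on-L2-hit rule can be sketched with an `OrderedDict`. This covers only L1 and promotion; demotion and the L3 spill are left out. `L1Cache`, `tiered_get`, and the injected `l2_get` callable are hypothetical names, and the capacity and TTL defaults are placeholders.

```python
import time
from collections import OrderedDict


class L1Cache:
    """Small per-worker LRU with a short TTL, exact-match keys only."""

    def __init__(self, capacity=512, ttl_s=60.0, clock=time.monotonic):
        self._data = OrderedDict()
        self._cap, self._ttl, self._clock = capacity, ttl_s, clock

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, expires = item
        if self._clock() >= expires:
            del self._data[key]          # expired: drop and report a miss
            return None
        self._data.move_to_end(key)      # mark as most recently used
        return value

    def put(self, key, value):
        self._data[key] = (value, self._clock() + self._ttl)
        self._data.move_to_end(key)
        while len(self._data) > self._cap:
            self._data.popitem(last=False)  # evict least recently used


def tiered_get(key, l1, l2_get):
    """Check L1, then L2; promote L2 hits into L1. Returns (value, tier)."""
    hit = l1.get(key)
    if hit is not None:
        return hit, "l1"
    try:
        value = l2_get(key)
    except Exception:
        return None, "miss"              # fail-open on L2 errors
    if value is not None:
        l1.put(key, value)               # promotion with L1's short TTL
        return value, "l2"
    return None, "miss"
```

The clock is injectable so expiry can be tested without sleeping; in production the default `time.monotonic` avoids surprises from wall-clock adjustments.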

This is the highest-effort refinement. Defer it until you have load data showing L2 is saturated.

10. KV/prefix-cache + semantic-cache composition

The synthesis "KV / Prefix Caching vs Semantic Caching" section names the right composition: they're orthogonal and complementary, not alternatives.

  • Semantic cache lives at the gateway. On a hit, returns the cached response immediately — 0 LLM cost, 0 inference latency.
  • Prefix / KV cache lives at the model server (Kwon et al. 2023 vLLM PagedAttention, arXiv:2309.06180; Zheng et al. 2024 SGLang RadixAttention, arXiv:2312.07104; plus the prompt-caching features in major API providers). On a hit, reuses the KV tensors for a shared prefix — saves 30-60 % of inference cost (the prefill FLOPs).

Both are checked in series:

Incoming query
│
├─ (1) Semantic cache (gateway layer)
│      → HIT: return cached response (0 LLM cost)
│      → MISS: proceed to (2)
│
└─ (2) Prefix/KV cache (model-server layer)
       → HIT: pre-filled KV cache, avoids prefill
       → MISS: full inference

Tiered TTL policy: semantic cache entries get hours-to-days (stable semantic patterns); KV prefix entries get seconds-to-minutes (memory-intensive, high opportunity cost). A reasonable default is 24 hours for semantic, 5 minutes for KV prefix.

The two caches don't share state. The gateway doesn't know what's in the KV cache, and the model server doesn't know what's in the semantic cache. That's a feature — each layer optimises its own metric.

What I deferred

A few capabilities the literature describes that aren't worth shipping until I have data:

  • Federated adaptive thresholds (Gill 2024 MeanCache) — promising at scale, but the PI controller needs the borderline-log distribution to learn from. Wait for a month of cache.borderline rows.
  • Async DistilBERT refusal classifier (Patel 2024 ReCaLL Tier 2) — catches novel refusal forms the regex misses, but adds an extra embedder call. Worth it only when the regex's false-negative rate becomes the dominant pollution source.
  • Repair-prompt borderline flow (Huang et al. 2024 SemCache) — return the cached payload + a small disambiguation turn ("Was this what you meant?") on borderline. Real value, but it's a UI affordance and I'd rather see telemetry first.
  • ColBERTv2 late-interaction reranker (Santhanam et al. 2022) — would replace bge-reranker-base if the storage cost ever pays off.
  • Negative cache — track recent low-confidence pairs to skip the embed cost on known-bad queries. Adds an admission decision the rest of the pipeline already handles.

References

171 unique papers surfaced across the 16 research agents, deduplicated by URL, alphabetised by first author. Not all are inline-cited above; this is the deep pool the design was drawn from.