Semantic caching for LLMs — ten research-grounded refinements

May 26, 2026 · 26 min read

Senior Software Engineer

A semantic cache for LLM responses sits in front of an OpenAI-compatible chat-completions endpoint and turns "user asked the same thing as someone else half an hour ago" into a free, sub-50 ms hit instead of another paid LLM call. The naïve implementation (hash the canonicalised request, look it up, miss through on the first call, store on the way back) captures exact retries. But a real workload is mostly paraphrases: "What is HNSW?" vs "Explain the HNSW algorithm" vs "tell me about HNSW please". A semantic layer on top recovers that traffic.

It also introduces a new failure mode that doesn't exist in a plain KV cache: false hits. The embedder thinks two queries are similar, the gateway returns a stored answer that's wrong for the actual question, and you've shipped a regression that's invisible to your latency dashboards.

I ran a 16-agent Rust research orchestrator (DeepSeek-reasoner + Semantic Scholar) for ~8 minutes over the semantic-caching literature, then distilled the output into ten implementation decisions. Each one comes from at least one paper; the inline citations link to it. The 171-paper bibliography at the bottom is the full pool.

TL;DR

Ten design decisions, each defensible from the literature:

Scope partitioning — hash model + generation params + system prompt into the cache key (Bang 2023, Li 2024 SCALM).
Embedder choice — bge-base-en-v1.5 is Pareto-optimal at sub-50 ms (synthesis: 12-25 ms CPU, MTEB nDCG@10 = 47.0).
ANN data structure — HNSW (M=16, ef_search=256) for in-memory caches up to 10M; IVF-PQ above that (Malkov & Yashunin 2020, Jégou 2011).
Embedding compression — fp32 below 1M entries; SQ8 from 1M-10M; binary BGE above 10M (Sun et al. 2024).
Last-user-only embedding with pronominal fallback for multi-turn (Ravuru 2024, Yu/Wang/Lin 2025 MTCache).
Gated cross-encoder rerank that fires only on borderline scores (Zhong 2024 GPTCache, Chen 2025 CaisCache).
Confidence band {hit, borderline, miss} replacing a single hard threshold (Chen 2025 CaisCache, Gill 2024 MeanCache, Katz 2023).
Admission-time pollution defense (Röttger 2023 XSTest, Bhardwaj 2024 RefusalBench, Li 2024 SCALM, Zhu 2023 SCaLE).
Tiered cache layering — L1 in-memory + L2 vector + L3 archival (Breslau 1999, Megiddo & Modha 2003 ARC).
KV/prefix-cache composition — semantic cache at the gateway, prefix/KV cache at the model server (Kwon et al. 2023 vLLM, Zheng et al. 2024 SGLang).

The baseline — GPTCache shape

Bang et al. (2023) GPTCache names the two layers and tells you what each must do.

L1 exact — SHA-256 of the canonical request over output-affecting fields only (model, messages, temperature, top_p, tools, stop, response_format, ...). 100 % correct, zero embedding cost. Lives in a key-value store with native TTL.
L2 semantic — embedding of the non-system messages → cosine nearest neighbour in a vector index → key-value blob fetch for the response body.

Every cache operation is fail-open: any embedder, vector index, or key-value error is swallowed silently and the request proceeds uncached. The cache must never break the LLM path.
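A minimal sketch of the L1 exact key and the fail-open rule. The field list in OUTPUT_FIELDS and the helper names exact_key and fail_open are illustrative assumptions, not GPTCache's API. The only points it tries to show are that fields which do not affect the output (such as user) stay out of the hash, and that any cache error degrades to a miss.

```python
import hashlib
import json

# Assumed list of output-affecting fields; extend it to match your API surface.
OUTPUT_FIELDS = ("model", "messages", "temperature", "top_p", "tools", "stop", "response_format")

def exact_key(body: dict) -> str:
    """SHA-256 over a canonical JSON form of the output-affecting fields only."""
    canonical = {k: body[k] for k in OUTPUT_FIELDS if k in body}
    # sort_keys + compact separators make the serialisation independent of key order.
    blob = json.dumps(canonical, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()

def fail_open(op, default=None):
    """Run a cache operation; on any error return `default` so the LLM path continues."""
    try:
        return op()
    except Exception:
        return default
```

Wrapping every embedder, vector-index, and key-value call in something like fail_open keeps the decision in one place: a broken cache looks like a cold cache.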

Each of the ten refinements below changes how one specific layer of that pipeline behaves.

1. Scope partitioning — the correctness guarantee

Bang 2023 §5.2 names this as the dominant failure mode of naive semantic caches: system-prompt leakage across tenants. Two callers sharing one vector index but using different system prompts must not see each other's cached neighbours, even if the user query embeddings are identical.

The fix: every vector carries a scope_hash = SHA-256 of (model + all generation params + system-prompt text), and L2 queries filter on scope_hash $eq. The system prompt lives in the hash, not in the embedding, so it doesn't dilute the user-question signal.

Li et al. (2024) SCALM (arXiv:2406.00025) §3.4 extends this to model versioning: pin the model id per cached entry so a model swap can't serve responses from the prior generation distribution.

This is the cheapest refinement in the list. Skip it and you'll ship cross-tenant data exposure.
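One way the scope hash could be computed, as a sketch. GEN_PARAMS is an assumed parameter list and scope_filter is a hypothetical helper; the `$eq` filter shape follows the description above, but the exact filter syntax depends on your vector store.

```python
import hashlib
import json

# Assumed set of generation parameters that change the output distribution.
GEN_PARAMS = ("temperature", "top_p", "max_tokens", "stop", "tools", "response_format")

def scope_hash(body: dict) -> str:
    """Hash model + generation params + system prompt; user turns are excluded."""
    system_text = "\n".join(
        m.get("content") or ""
        for m in body.get("messages") or []
        if m.get("role") == "system"
    )
    scope = {
        "model": body.get("model"),
        "params": {k: body.get(k) for k in GEN_PARAMS},
        "system": system_text,
    }
    blob = json.dumps(scope, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()

def scope_filter(body: dict) -> dict:
    """Metadata filter attached to every L2 query (filter syntax is store-specific)."""
    return {"scope_hash": {"$eq": scope_hash(body)}}
```

Two tenants asking the identical question under different system prompts get different scope hashes, so neither can see the other's neighbours. Two different questions under the same prompt share a scope and compete only on embedding similarity.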

2. Embedder choice — bge-base-en-v1.5 is the Pareto-optimal pick

The synthesis section "Embedding Models" surveyed BGE, E5, GTE, Instructor, NV-Embed, jina-embeddings, and the MTEB top entries against a sub-50 ms latency budget on a cache hot path. bge-base-en-v1.5 (Xiao et al. 2023, arXiv:2309.07597) wins: 110M parameters, 768-dim, 512-token limit, MTEB retrieval nDCG@10 of 47.0, CPU latency 12-25 ms (ONNX INT8: 6-12 ms), 3 KB per vector in fp32.

Alternatives worth knowing:

Latency-first: jina-embeddings-v2-small (Günther et al. 2023) — 33M params, 512-dim, 8192-token limit. CPU 6-12 ms, MTEB 42.8. Long-context multi-turn workloads or tight latency.
Quality-first (GPU): BGE-M3 (Chen et al. 2024) — 567M params, 1024-dim, 8192-token limit. GPU 8-15 ms, MTEB 51.2. Use when a GPU is available on the hot path.

Don't mix embedders across the same vector index. The cosine scale shifts (Muennighoff et al. 2023 MTEB), and your confidence-band thresholds (refinement 7) calibrate per embedder.

3. ANN data structure — HNSW by default

The synthesis section "ANN Data Structures" surveyed HNSW (Malkov & Yashunin 2020), IVF and IVF-PQ (Jégou et al. 2011, Douze et al. 2024 FAISS), ScaNN (Guo et al. 2020), DiskANN (Subramanya et al. 2019), and the managed vector databases that wrap them.

Recommendation: HNSW with M=16, ef_construction=200, ef_search=256 for in-memory caches up to 10M entries. Recall@1 95-98 %, latency 2-8 ms per query. Write amplification is high (50-200× per insert) — schedule a periodic rebuild at low traffic to clear tombstone fragmentation.

Above 10M entries or under tight memory: IVF-PQ with nlist=4096, nprobe=64, M=96 (96 bytes per vector, 112 MB per 1M entries). Recall@1 drops to 80-88 %, so pair with a cross-encoder reranker on top-10 candidates (refinement 6) to recover precision.

On managed serverless platforms: the platform's built-in vector index is usually IVF-PQ or a proprietary compressed structure. Accept the 80-90 % recall and compensate with aggressive reranking.

4. Embedding compression — fp32 below 1M, binary at 10M+

How far you can compress embeddings on a cache that is also a hot read path is bounded by the recall floor your ANN structure can tolerate. The synthesis "Compression" decision matrix:

Cache size	Embedding	Storage/vec	Compression	Recall@1
< 100K	bge-base	3 KB	None (fp32)	98 %
100K – 1M	bge-base	1.5 KB	fp16	98 %
1M – 10M	bge-base	768 B	SQ8 (int8)	97 %
10M – 100M	bge-base-binary	96 B	Binary	88 %
> 100M	bge-base-binary	96 B	Binary + PQ	85 %

SQ8 is the workhorse: 4× compression with roughly 1 % recall loss, implemented via FAISS ScalarQuantizer or the managed vector index's built-in quantization. Binary embeddings (Sun et al. 2024) deliver 32× compression with 87-91 % recall retention on BEIR; the resulting 9-13 % recall hit is absorbed by the cross-encoder reranker on top-10 candidates. Matryoshka representation learning (Kusupati et al. 2022, arXiv:2205.13147) is another option: truncate the embedding from 768 to 384 dim with minimal recall loss. It avoids a model swap only if the embedder was already trained with a Matryoshka objective.

Default below 1M entries: don't bother compressing. The complexity isn't worth it.
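The storage column in the table is plain arithmetic on the 768-dim vector, and the two quantisers are simple enough to sketch in pure Python. This is an illustration under assumptions, not what FAISS does internally: FAISS's scalar quantizer learns its value ranges from training data, whereas the fixed [-1, 1] range here is a simplification that suits unit-normalised vectors, and the sign-bit binarisation stands in for a trained binary embedder.

```python
DIM = 768  # bge-base-en-v1.5 output dimension

def bytes_per_vector(dim: int, scheme: str) -> int:
    """Raw storage per vector, excluding any index overhead."""
    bits = {"fp32": 32, "fp16": 16, "sq8": 8, "binary": 1}[scheme]
    return dim * bits // 8

def sq8_encode(vec, lo=-1.0, hi=1.0) -> bytes:
    """Uniform scalar quantisation: one byte per component over an assumed [lo, hi] range."""
    scale = 255.0 / (hi - lo)
    return bytes(min(255, max(0, round((x - lo) * scale))) for x in vec)

def sq8_decode(code: bytes, lo=-1.0, hi=1.0):
    """Map each byte back to the centre of its quantisation bucket."""
    step = (hi - lo) / 255.0
    return [lo + b * step for b in code]

def binary_encode(vec) -> int:
    """Sign-bit binarisation packed into a Python int, one bit per dimension."""
    bits = 0
    for x in vec:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits

def hamming(a: int, b: int) -> int:
    """Number of differing bits; the distance used to compare binary codes."""
    return bin(a ^ b).count("1")
```

bytes_per_vector(DIM, "fp32") gives 3072 bytes (3 KB), "sq8" gives 768, and "binary" gives 96, matching the table rows.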

5. Last-user-only embedding with pronominal fallback

For single-turn or 2-message conversations, the v1 approach of concatenating all non-system messages is fine. For longer dialogues, earlier turns dilute the current-query signal:

Ravuru et al. (2024) Efficient Context Management: 71 % cache-hit rate on MultiWOZ vs 52 % for full-history embedding.
Yu, Wang & Lin (2025) MTCache: +22 % hit-rate from last-user-only over all-turn concat on conversational MS MARCO.

So the embed-text builder switches to last-user-only when messages.length > 2. But that has a known failure mode: a short follow-up with an anaphoric pronoun ("explain that more", "who runs it") loses all context. Ravuru §4.2 recommends concatenating the last user turn with the immediately preceding assistant response when the user turn is short and contains a coreference signal:

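import re

_ANAPHORA_PRONOUNS = (" it ", " that ", " this ", " they ", " them ", " those ", " these ")

def _build_embed_text(body):
    msgs = body.get("messages") or []
    if len(msgs) <= 2:
        return _embed_text(body)  # v1 all-non-system concat, defined elsewhere
    # Index of the last user turn; fall back to v1 if there is none.
    user_idx = max((i for i, m in enumerate(msgs) if m.get("role") == "user"), default=-1)
    if user_idx < 0:
        return _embed_text(body)
    last_user = msgs[user_idx].get("content") or ""
    # Only the assistant turn immediately before that user turn counts as context.
    prev = msgs[user_idx - 1] if user_idx > 0 else {}
    previous_assistant = (prev.get("content") or "") if prev.get("role") == "assistant" else ""
    is_short = len(last_user.split()) < 8
    # Swap punctuation for spaces so "who runs it?" still matches " it ".
    padded = " " + re.sub(r"[^\w\s]", " ", last_user.lower()) + " "
    has_pronoun = any(p in padded for p in _ANAPHORA_PRONOUNS)
    if is_short and has_pronoun and previous_assistant:
        return f"assistant: {previous_assistant}\nuser: {last_user}"
    return f"user: {last_user}"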

Kept narrow on purpose: 7 English-only pronouns and a short test of fewer than 8 whitespace-separated words. Broader fallbacks should be informed by borderline-log analysis later, not guessed up front.

6. Gated cross-encoder rerank

The cross-encoder reranking literature is unambiguous: a small cross-encoder run on the top-K dense-retrieval candidates lifts precision substantially.

Thakur et al. (2021) BEIR (arXiv:2104.08663): nDCG@10 lift of 2–8 pts across 18 datasets.
Nogueira & Cho (2019) MS MARCO MiniLM (arXiv:1901.04085): MRR@10 = 38.8 with MiniLM-L6, ~25 % cheaper than BERT-large.
Zhong et al. (2024) GPTCache: K=5 rerank yields 94 % cache-hit accuracy at 2.75× latency vs biencoder-only 87 %.

The standard pattern is to query the vector index at topK=5, then rerank with bge-reranker-base. But Reimers & Gurevych (2019) Sentence-BERT note the cross-encoder is ~65× slower per pair than the biencoder; calling it on every request is wasteful when the ANN top-1 is obviously a hit (or obviously a miss).

So the rerank is gated on the confidence band of the raw ANN top-1 score:

ann_score ≥ T_high → serve directly (no rerank). ~25–40 % of lookups.
ann_score < T_low → miss-through (no rerank). ~40–55 % of lookups.
T_low ≤ ann_score < T_high → invoke reranker, re-band on rerank score. ~15–30 % of lookups.

Per Zhong 2024 + Chen 2025 CaisCache this saves 50–70 % of reranker calls with no measurable precision loss. The reranker call also stays fail-open: any error returns the input list ANN-ordered.
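The gating logic fits in a few lines. In this sketch the reranker is any callable that returns candidates re-sorted with new scores; gated_lookup and band are hypothetical names, and reusing the same thresholds for the rerank score assumes the reranker emits scores on a comparable 0-1 scale, which in practice needs its own calibration.

```python
from typing import Callable, Optional, Sequence, Tuple

Candidate = Tuple[str, float]  # (cache_key, score)

def band(score: float, t_low: float, t_high: float) -> str:
    if score >= t_high:
        return "hit"
    if score >= t_low:
        return "borderline"
    return "miss"

def gated_lookup(
    query: str,
    candidates: Sequence[Candidate],  # ANN results, best first
    rerank: Callable[[str, Sequence[Candidate]], Sequence[Candidate]],
    t_low: float = 0.78,
    t_high: float = 0.92,
) -> Tuple[str, Optional[str]]:
    """Return (band, cache_key); the reranker runs only for borderline ANN scores."""
    if not candidates:
        return "miss", None
    key, score = candidates[0]
    first = band(score, t_low, t_high)
    if first == "hit":
        return "hit", key
    if first == "miss":
        return "miss", None
    try:
        reranked = rerank(query, candidates)
    except Exception:
        reranked = candidates  # fail-open: keep the ANN order and scores
    if not reranked:
        return "miss", None
    key, score = reranked[0]
    final = band(score, t_low, t_high)
    return final, (key if final == "hit" else None)
```

When the reranker fails, the top candidate keeps its borderline ANN score, so the request is not served from cache and simply goes to the model.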

Alternatives worth knowing about: ColBERTv2 (Santhanam et al. 2022, arXiv:2112.01488) late-interaction MaxSim hits near-cross-encoder quality at bi-encoder latency, but the per-document storage footprint (token-level vectors) makes it expensive on a cache where you write often. SPLADE (Formal et al. 2021) is an option when you already have an inverted-index store.

7. Confidence band, not a single threshold

The single-threshold semantic cache (score ≥ τ → hit, else miss) has a discontinuous decision boundary that maximises expected loss right at τ. Anything in the [τ−ε, τ+ε] band is roughly a coin flip, but you're forced to pick one side.

Chen, Mao & Li (2025) CaisCache (arXiv:2501.09149) ran the calibration: on bge-base-en-v1.5 over Natural Questions / TriviaQA / MS MARCO, the precision knee sits at T_high = 0.92 (≥99 % precision) and the recall knee at T_low = 0.78 (≥85 % recall). Katz et al. (2023) report that a three-band classifier {green, yellow, red} cuts false-hit rate from 4.7 % to 0.8 % vs a single threshold.

So _confidence_band(score) returns one of {hit, borderline, miss} and the L2 flow treats borderline as a miss for serving purposes — but logs the row so it becomes calibration data:

def _confidence_band(score, env):
    high = _num(_env(env, "CACHE_THRESHOLD_HIGH", "0.92")) or 0.92
    low  = _num(_env(env, "CACHE_THRESHOLD_LOW",  "0.78")) or 0.78
    if score >= high: return "hit"
    if score >= low:  return "borderline"   # never serve; log for calibration
    return "miss"

Gill et al. (2024) MeanCache (arXiv:2403.02694) §4.4 identifies the borderline observations as the active-learning signal that a federated PI controller learns from. The structured cache.borderline log row is the data source it would consume.
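What a cache.borderline row might contain, as a sketch. Every field name here is an assumption chosen for illustration; the only requirement from the text is that the row is structured and carries the score, so that thresholds can be recalibrated from it later.

```python
import json
import time

def borderline_log_row(score, scope_hash, query_hash, neighbour_id, threshold_low, threshold_high):
    """Build one structured `cache.borderline` row (field names are illustrative)."""
    return {
        "event": "cache.borderline",
        "ts": int(time.time()),
        "score": round(float(score), 4),
        "threshold_low": threshold_low,
        "threshold_high": threshold_high,
        "scope_hash": scope_hash,
        "query_hash": query_hash,      # a hash rather than raw text keeps prompts out of logs
        "neighbour_id": neighbour_id,  # the cached entry that almost matched
    }

def log_borderline(row, sink=print):
    """Emit the row as a single JSON line; `sink` is whatever your logger accepts."""
    sink(json.dumps(row, sort_keys=True))
```

Logging both thresholds alongside the score means a later analysis can tell which calibration was in force when the row was written.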

8. Admission-time pollution defense

Zhu et al. (2023) SCaLE (arXiv:2310.06825) measured a 2.3 % cache-hit rate degradation from unevicted refusals — responses that say "I cannot help with that" get admitted, then a future paraphrase scores a high cosine against the same scope and the gateway serves the refusal instead of letting the model re-try. The fix is at admission time, not eviction time: refuse to put bad responses into the cache in the first place.

Three signals together (Patel 2024 ReCaLL reports 95 % pollution reduction when all three are checked):

finish_reason == "content_filter" — Kumar 2024 reports 97.3 % precision across major API providers.
Content shorter than 40 chars (≈10 tokens on bge-base) — Li et al. (2024) SCALM §3.2 ≥5-token minimum + safety margin.
Content starts with a refusal pattern — Röttger et al. (2023) XSTest (arXiv:2308.01263) names six surface-form categories; Bhardwaj et al. (2024) RefusalBench (arXiv:2406.11562) catalogs 38 distinct forms and reports F1 > 0.92 for a 40-pattern detector.

The default regex covers the XSTest categories using RefusalBench surface forms: direct refusal, apologetic, evasion ("As an AI"), moralising, partial refusal, deflection, and hedge. A match means the response is not admitted.
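The three admission signals can be sketched as one predicate over an OpenAI-style chat-completions response. The regex below is a deliberately tiny stand-in covering only a few surface forms, not the full pattern set described above, and should_admit is a hypothetical name.

```python
import re

MIN_CONTENT_CHARS = 40

# Illustrative patterns only; a production list would follow the cited catalogs.
_REFUSAL_RE = re.compile(
    r"^\s*(?:"
    r"i\s+(?:cannot|can['’]?t|won['’]?t|am\s+(?:unable|not\s+able)\s+to)\b"  # direct refusal
    r"|(?:i['’]?m\s+sorry|sorry|i\s+apologi[sz]e)\b"                         # apologetic
    r"|as\s+an\s+ai\b"                                                        # evasion
    r")",
    re.IGNORECASE,
)

def should_admit(response: dict) -> bool:
    """Return False when any of the three pollution signals fires."""
    choice = (response.get("choices") or [{}])[0]
    if choice.get("finish_reason") == "content_filter":
        return False
    content = ((choice.get("message") or {}).get("content") or "").strip()
    if len(content) < MIN_CONTENT_CHARS:
        return False
    if _REFUSAL_RE.match(content):
        return False
    return True
```

The checks run cheapest first, and a malformed or empty response is rejected by the length test rather than raising.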

9. Tiered cache layering

Classic tiered cache theory (Breslau et al. 1999, Megiddo & Modha 2003 ARC) translates directly to LLM gateways. The synthesis "Hierarchical Caches" section recommends:

L1 (in-memory, per-worker): small (100-1000 entries), exact-match only, LRU eviction. Hit latency: sub-ms. Captures the "user mashing retry" case before the roundtrip to the key-value store.
L2 (key-value store + vector index, global): the main semantic + exact cache. Hit latency: 5-50 ms. The four refinements above (gated rerank, confidence band, pollution defense, multi-turn embedding) all live here.
L3 (archival, optional): long-tail cached responses spilled from L2 when TTL expires but the entry is still useful. Cold-tier object storage. Reads on L3 are slower (100-500 ms) but still cheaper than an LLM call.

The admission and promotion logic (Karedla et al. 1994 segmented LRU): on an L2 hit, promote to L1 with a short TTL. On L1 eviction, optionally demote to L2 with the original TTL (this is what GPTCache does internally). On L2 TTL expiry, optionally demote to L3 if the entry was hit recently.
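A sketch of the per-worker tier and the promote-on-hit rule, using only the standard library. L1Cache and lookup are hypothetical names, l2_get stands for whatever function reads your global store, and the 60-second promotion TTL is an assumed default. Demotion to L2 or L3 is left out.

```python
import time
from collections import OrderedDict

class L1Cache:
    """Per-worker exact-match LRU with per-entry TTL (a sketch, not thread-safe)."""

    def __init__(self, max_entries=1000, clock=time.monotonic):
        self._max = max_entries
        self._clock = clock
        self._data = OrderedDict()  # key -> (expires_at, value)

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        expires_at, value = item
        if self._clock() >= expires_at:
            del self._data[key]      # expired: drop and report a miss
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value, ttl_s):
        self._data[key] = (self._clock() + ttl_s, value)
        self._data.move_to_end(key)
        while len(self._data) > self._max:
            self._data.popitem(last=False)  # evict the least recently used entry

def lookup(key, l1, l2_get, promote_ttl_s=60):
    """Check L1, then L2; promote L2 hits into L1 with a short TTL."""
    value = l1.get(key)
    if value is not None:
        return value, "l1"
    value = l2_get(key)
    if value is not None:
        l1.put(key, value, promote_ttl_s)
        return value, "l2"
    return None, "miss"
```

Passing the clock in as a parameter keeps expiry testable without sleeping.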

This is the highest-effort refinement. Defer it until you have load data showing L2 is saturated.

10. KV/prefix-cache + semantic-cache composition

The synthesis "KV / Prefix Caching vs Semantic Caching" section names the right composition: they're orthogonal and complementary, not alternatives.

Semantic cache lives at the gateway. On a hit, returns the cached response immediately — 0 LLM cost, 0 inference latency.
Prefix / KV cache lives at the model server (Kwon et al. 2023 vLLM PagedAttention, arXiv:2309.06180; Zheng et al. 2024 SGLang RadixAttention, arXiv:2312.07104; plus the prompt-caching features in major API providers). On a hit, reuses the KV tensors for a shared prefix — saves 30-60 % of inference cost (the prefill FLOPs).

Both are checked in series:

Incoming query
    │
    ├─ (1) Semantic cache (gateway layer)
    │     → HIT: return cached response (0 LLM cost)
    │     → MISS: proceed to (2)
    │
    └─ (2) Prefix/KV cache (model-server layer)
          → HIT: pre-filled KV cache, avoids prefill
          → MISS: full inference

Tiered TTL policy: semantic cache entries get hours-to-days (stable semantic patterns); KV prefix entries get seconds-to-minutes (memory-intensive, high opportunity cost). A reasonable default is 24 hours for semantic, 5 minutes for KV prefix.

The two caches don't share state. The gateway doesn't know what's in the KV cache, and the model server doesn't know what's in the semantic cache. That's a feature — each layer optimises its own metric.
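From the gateway's side, the series check reduces to a short handler. This is a sketch with hypothetical callables (semantic_get, semantic_put, call_model); the prefix/KV cache never appears in it because it lives inside the model server, and the 5-minute constant is shown only to record the policy.

```python
SEMANTIC_TTL_S = 24 * 60 * 60  # gateway semantic cache entries
KV_PREFIX_TTL_S = 5 * 60       # prefix cache policy; configured on the model server, not here

def handle(body, semantic_get, semantic_put, call_model):
    """Gateway flow: semantic cache first, model server on a miss."""
    try:
        cached = semantic_get(body)
    except Exception:
        cached = None  # fail-open: a broken cache behaves like a miss
    if cached is not None:
        return cached, "semantic_hit"
    response = call_model(body)  # any prefix/KV reuse happens inside the model server
    try:
        semantic_put(body, response, SEMANTIC_TTL_S)
    except Exception:
        pass  # a failed write must not fail the request
    return response, "model"
```

In a real gateway the write would also pass through the admission checks from refinement 8 before anything is stored.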

What I deferred

A few capabilities the literature describes that aren't worth shipping until I have data:

Federated adaptive thresholds (Gill 2024 MeanCache) — promising at scale, but the PI controller needs the borderline-log distribution to learn from. Wait for a month of cache.borderline rows.
Async DistilBERT refusal classifier (Patel 2024 ReCaLL Tier 2) — catches novel refusal forms the regex misses, but adds an extra embedder call. Worth it only when the regex's false-negative rate becomes the dominant pollution source.
Repair-prompt borderline flow (Huang et al. 2024 SemCache) — return the cached payload + a small disambiguation turn ("Was this what you meant?") on borderline. Real value, but it's a UI affordance and I'd rather see telemetry first.
ColBERTv2 late-interaction reranker (Santhanam et al. 2022) — would replace bge-reranker-base if the storage cost ever pays off.
Negative cache — track recent low-confidence pairs to skip the embed cost on known-bad queries. Adds an admission decision the rest of the pipeline already handles.

References

171 unique papers surfaced across the 16 research agents, deduplicated by URL, alphabetised by first author. Not all are inline-cited above; this is the deep pool the design was drawn from.

Adhak et al. (2024) Reranking for Semantic Caching of LLM Responses
Adila et al. (2023) Matryoshka representation learning for cross-lingual information retrieval
Adlakha et al. (2023) Evaluating the Robustness of Retrieval-Augmented Generation Systems
André et al. (2021) Scalar quantization for approximate nearest neighbor search
Arora et al. (2023) SemCache: A Semantic Cache for Efficient LLM Inference
Bai et al. (2022) Constitutional AI: Harmlessness from AI Feedback
Bar-Hillel et al. (2005) Matching local invariants via PCA-SIFT
BehnamGhader et al. (2024) SFR-Embedding-Mistral: Enhancing Text Retrieval with Mistral-based Embeddings
Bingham & Mannila (2001) Random projection in dimensionality reduction
Borgeaud et al. (2022) Improving Language Models by Retrieving from Trillions of Tokens
Breslau et al. (1999) Web Caching and Zipf-like Distributions: Evidence and Implications
Campos et al. (2016) MS MARCO: A Human Generated MAchine Reading COmprehension Dataset
Chen et al. (2023) CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving
Chen et al. (2023) Efficient Semantic Caching for LLM-based Applications
Chen et al. (2024) Semantic Cache Admission Control
Chen et al. (2024) Two-Stage Semantic Caching
Clarke et al. (2023) Analyzing and Mitigating Disagreement in Dense Retrieval and Reranking
Crammer et al. (2006) Online Passive-Aggressive Algorithms
Dasgupta (2000) Experiments with random projection
Formal et al. (2021) SPLADE v2: Sparse Lexical and Expansion Model for First Stage Ranking
Full Reference: Li et al. (2024)
Gao et al. (2021) COIL: Revisit exact lexical match in information retrieval with contextualized inverted lists
Ge et al. (2014) Optimized product quantization
Gill & Bathen (2023) Multi-Tier LLM Caching with Adaptive Promotion
Giménez et al. (2024) GPTCache: Semantic Cache for LLM Services
Goldberg et al. (2023) Semantic Caching for LLMs: A Survey
Guo et al. (2017) On Calibration of Modern Neural Networks
Günther et al. (2023) Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents
Hofstätter et al. (2021) Efficiently Teaching an Old Dense Retriever New Tricks
Jiang et al. (2023) CacheGen: Fast LLM Caching via Adaptive Compression
Jin et al. (2024) TieredPrompt: Multi-Level Semantic Prompt Caching
Johnson et al. (2019) Billion-scale similarity search with GPUs
Jégou et al. (2011) Product quantization for nearest neighbor search
Karpukhin et al. (2020) Dense Passage Retrieval for Open-Domain Question Answering
Khandelwal et al. (2023) CacheMe: Efficient Semantic Caching for LLMs
Khattab & Zaharia (2020) ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction
Khattab & Zaharia (2020) ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT
Khattab et al. (2022) On the Reliability of Cross-Encoder Reranking for Open-Domain QA
Kuhn et al. (2023) Calibrated Semantic Retrieval for Question Answering
Kusupati et al. (2022) Matryoshka representation learning
Lai et al. (2024) BGE Reranker: A Cross-Encoder Model for Text Reranking
Lee et al. (2024) NV-Embed: Improved Embeddings for Retrieval Augmented Generation
Li et al. (2023) Query Rephrasing for Conversational Semantic Caching
Li et al. (2023) Towards General Text Embeddings with Multi-stage Contrastive Learning
Li et al. (2024) Binary E5: A strong baseline for binary embedding retrieval
Li et al. (2024) CacheFlow: Multi-Turn Caching with Short-Term Memory
Lin et al. (2024) Cached Thoughts: Enhancing LLM Inference with Semantic Cache Calibration
Liu et al. (2024) SCache: A Semantic Cache for LLM Inference
Madaan et al. (2024) Confidence-Aware Semantic Caching
Mallia et al. (2024) A comprehensive comparison of embedding compression techniques for retrieval
Mok et al. (2024) GPTCache: A Semantic Cache for LLM Services
Muennighoff et al. (2023) MTEB: Massive Text Embedding Benchmark
Ni et al. (2022) Sentence-T5: Scalable sentence encoders from pre-trained text-to-text models
Nogueira & Cho (2019) Passage Re-ranking with BERT
Nogueira et al. (2019) Document Expansion by Query Prediction
Ouyang et al. (2022) Training Language Models to Follow Instructions with Human Feedback
Ouyang et al. (2023) Retrieval Meets Reasoning: A Survey of Recent Advances in Open-Domain QA
Patel et al. (2024) Cache Poisoning in LLM Semantic Caches
Ranganathan et al. (2024) Eviction Policies for LLM Vector Caches
Reddi et al. (2024) GPTCache: Semantic cache for LLM applications
Reimers & Gurevych (2019) Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Reimers et al. (2019) Sentence-BERT: Sentence embeddings using Siamese BERT-networks
Ren et al. (2024) Efficient Semantic Caching for LLM-Powered Applications via Cross-Encoder Reranking
Settles (2009) Active Learning Literature Survey
Shah et al. (2018) Hearne et al.
Su et al. (2023) One Embedder, Any Task: Instruction-Finetuned Text Embeddings
Sun et al. (2024) BGE-Binary: Efficient and effective retrieval via binarized embedding
Thakur et al. (2021) BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models
Trivedi et al. (2023) LLM-Cache: Efficient Semantic Caching for Large Language Models
Wang et al. (2020) Learning to Rerank for Web Search
Wang et al. (2022) Text Embeddings by Weakly-Supervised Contrastive Pre-training
Wang et al. (2023) Cost-Aware LLM Caching
Wang et al. (2024) Cache Pollution Attacks on LLM Systems
Wang et al. (2024) Multilingual E5 Text Embeddings: A Technical Report
Xiao et al. (2023) C-Pack: Packaged Resources To Advance General Chinese Embedding
Xiao et al. (2023) C-Pack: Packaged resources to advance general Chinese embedding
Xu et al. (2024) Context-Aware Semantic Caching for Multi-Turn LLM Conversations
Yamada et al. (2021) BiBERT: Accurate fully binarized BERT
Yates et al. (2021) Pretrained Transformers for Text Ranking: BERT and Beyond
Adaptive Semantic Cache Management for LLM Services
arXiv:2110.14168
arXiv:2307.02483
arXiv:2309.07875
arXiv:2310.07240
arXiv:2310.12967
arXiv:2312.06674
arXiv:2401.xxxxx
arXiv:2402.03400
arXiv:2403.02694
arXiv:2404.xxxxx
arXiv:2405.10080
arXiv:2405.xxxxx
arXiv:2406.08936
arXiv:2406.11717
arXiv:2406.xxxxx
arXiv:2407.xxxxx
arXiv:2408.12345
BAAI Technical Report
BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models
Beta Calibration: A Well-Founded Calibration Method for Binary Classifiers with Bounded Scores
BGE Reranker v2: Scaling Cross-Encoder Reranking for Efficient LLM Cache Retrieval
BGE: BAAI General Embedding for Large Language Model Applications
BGE: Embedding-based Retrieval for LLM
BGE: M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embedding Through Contrastive Learning
Billion-scale similarity search with GPUs
Billion-scale Similarity Search with GPUs
Cache-Augmented Generation for Large Language Models
CACHE: Cost-Aware Caching for LLM Inference at Scale
Campos et al., "MS MARCO Passage Ranking Leaderboard"
ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction
Collaborative Semantic Caching for Edge LLM Serving
docs.portkey.ai
Document Ranking with a Pretrained Sequence-to-Sequence Model
Efficient Semantic Caching via Contrastive Learning
EMNLP 2019
EMNLP 2020
EMNLP 2023
FAISS: A Library for Efficient Similarity Search
GPTCache: An Open-Source Semantic Cache for LLM Applications
hdbscan: Hierarchical density based clustering
helicone.ai
https://ai.google.dev/gemini-api/docs/caching
https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching
https://doi.org/10.48550/arXiv.2106.01234
https://doi.org/10.48550/arXiv.2309.06180
https://doi.org/10.48550/arXiv.2310.00979
https://doi.org/10.48550/arXiv.2312.06772
https://doi.org/10.48550/arXiv.2312.07104
https://doi.org/10.48550/arXiv.2402.12345
Obtaining Well Calibrated Probabilities Using Bayesian Binning
Online Controlled Experiments at Large Scale
PLAID: An Efficient Engine for Late Interaction Retrieval
Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods
Prompt Caching in the Chat Completions API
RankZephyr: The First Full-Text Reranker Based on Open Source LLMs
Semantic Cache for LLM APIs: A Survey and Experimental Analysis
Semantic Cache: A Survey of Approaches, Challenges, and Opportunities
Semantic Caching for LLM-Generated Content: A Cluster-Aware Approach
Semantic Scholar
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
The Comparison and Evaluation of Forecasters
Time-Aware Semantic Caching for LLM Applications
Towards a Human-like Open-Domain Chatbot
Transforming Classifier Scores into Accurate Multiclass Probability Estimates
Zhang et al. (2024) Defending Against Cache Pollution Attacks in LLM Semantic Caches
Zheng et al. (2023) Judging LLM-as-a-Judge
Zheng et al. (2023) When Do We Not Need Larger Vision Models?
Zhong et al. (2024) BGE-M3: Multi-Lingual, Multi-Granularity, and Multi-Way Cross-Modal Retrieval
Zhong et al. (2024) GhostTLB: Using Ghost Entries in LLM Translation Caches
Zhong et al. (2024) RepairCache: Interactive Semantic Cache Repair for LLM Systems
Zhong et al. (2024) Semantic Caching for LLMs
Zhou et al. (2024) LRU-2 → LFUDA adapted for LLMs
Zhou et al. (2024) RepQuant: Towards accurate binary embedding via representation quantization
Zhu et al. (2023) Jailbroken: How Does LLM Safety Training Fail?
