Semantic Caching vs Prompt Caching¶
~2 minute read
Prerequisites: vLLM & Paged Attention | LLMOps Cost Optimization
Source: Redis Blog "Prompt Caching vs Semantic Caching" (Dec 2025), Zylos Research LLM Caching Strategies 2025, AWS LLM Caching Guide (Feb 2026)
Concept¶
Prompt Caching:

- Cache processed tokens (the KV cache) from identical or shared prompt prefixes
- Reuse that KV cache across subsequent requests instead of recomputing it
- Best for: long documents, multi-turn conversations with a fixed context

Semantic Caching:

- Cache responses keyed by query meaning (embeddings), not exact text
- Match similar queries via vector similarity rather than exact string matches (see the similarity sketch below)
- Best for: chatbots, RAG pipelines, FAQ systems
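To make "match by meaning" concrete, here is a minimal sketch of the underlying similarity check. It assumes the `sentence-transformers` package and the `all-MiniLM-L6-v2` model, both illustrative choices rather than anything prescribed by the cited sources:

```python
from sentence_transformers import SentenceTransformer, util

# Any sentence-embedding model works here; all-MiniLM-L6-v2 is just a small, common choice.
model = SentenceTransformer("all-MiniLM-L6-v2")

queries = ["How do I reset my password?", "I forgot my password"]
embeddings = model.encode(queries, normalize_embeddings=True)

# Cosine similarity between the two embeddings: paraphrases of the same intent score
# far higher than unrelated queries, which is what a semantic cache keys on.
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"similarity = {similarity:.2f}")
```

The absolute score depends on the embedding model, which is why the similarity threshold used later in this page has to be tuned per embedder.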
Comparison Table¶
| Feature | Prompt Caching | Semantic Caching |
|---|---|---|
| What is cached | Processed tokens (KV cache) | Query-response pairs (keyed by embeddings) |
| Matching | Exact/prefix match | Similarity search |
| TTL | 5-60 min (provider) | Configurable |
| Cost reduction | Reduces token computation | Up to 90% API call reduction |
| Latency | Avoids re-processing | ~15x speedup |
| Use case | Long context, documents | Chatbots, FAQs |
Multi-Tier Architecture (Best Practice 2026)¶
```text
User Request
    |
[L1] Exact Match (Redis) -- <10 ms
    | miss
[L2] Semantic Cache (vector store) -- 50-150 ms
    | miss
[L3] Provider Prompt Cache (Anthropic/OpenAI)
    | miss
[L4] Full LLM Inference
    |
Cache response in L1, L2, L3
```
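The L1 tier can be as simple as a Redis lookup keyed by a hash of the normalized prompt. A minimal sketch, assuming `redis-py` and a locally reachable Redis instance (both assumptions of this example, not part of the cited guides):

```python
import hashlib
import json

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)


def exact_cache_key(prompt: str) -> str:
    """Key on a hash of the normalized prompt text."""
    normalized = " ".join(prompt.lower().split())
    return "llm:exact:" + hashlib.sha256(normalized.encode()).hexdigest()


def l1_get(prompt):
    """Return the cached response for an exact (normalized) prompt match, or None."""
    hit = r.get(exact_cache_key(prompt))
    return json.loads(hit) if hit else None


def l1_set(prompt, response, ttl_seconds=3600):
    """Store the response with a TTL so stale answers expire automatically."""
    r.setex(exact_cache_key(prompt), ttl_seconds, json.dumps(response))
```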
Semantic Cache Implementation¶
```python
import time

import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


class SemanticCache:
    """Embedding-based semantic caching for LLM responses."""

    def __init__(self, embedding_model, similarity_threshold=0.95, ttl=3600):
        self.embedder = embedding_model      # any object with an .embed(text) -> vector method
        self.threshold = similarity_threshold
        self.ttl = ttl                       # seconds before a cached entry expires
        self.cache = {}                      # in production: Redis, Pinecone, etc.

    def get(self, query):
        """Check the semantic cache for a sufficiently similar query."""
        query_embedding = self.embedder.embed(query)
        now = time.time()
        for cached_query, cached_data in list(self.cache.items()):
            # Lazily evict entries whose TTL has expired.
            if now - cached_data["timestamp"] > self.ttl:
                del self.cache[cached_query]
                continue
            similarity = cosine_similarity(query_embedding, cached_data["embedding"])
            if similarity >= self.threshold:
                return {
                    "hit": True,
                    "response": cached_data["response"],
                    "similarity": similarity,
                    "original_query": cached_query,
                }
        return {"hit": False}

    def set(self, query, response):
        """Store a query-response pair together with its embedding."""
        self.cache[query] = {
            "embedding": self.embedder.embed(query),
            "response": response,
            "timestamp": time.time(),
        }

    def get_or_generate(self, query, llm_generate_fn):
        """Return a cached response if available, otherwise generate and cache one."""
        cached = self.get(query)
        if cached["hit"]:
            return cached["response"]
        response = llm_generate_fn(query)
        self.set(query, response)
        return response
```
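A minimal usage sketch for the class above; the `MiniLMEmbedder` adapter and the `call_llm` stub are illustrative assumptions, not part of any specific library:

```python
from sentence_transformers import SentenceTransformer


class MiniLMEmbedder:
    """Thin adapter exposing the .embed(text) interface that SemanticCache expects."""

    def __init__(self, model_name="all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)

    def embed(self, text):
        return self.model.encode(text, normalize_embeddings=True)


def call_llm(query):
    # Stand-in for a real LLM API call.
    return f"(model answer for: {query})"


cache = SemanticCache(MiniLMEmbedder(), similarity_threshold=0.9, ttl=24 * 3600)
cache.get_or_generate("How do I reset my password?", call_llm)  # miss -> calls the LLM
cache.get_or_generate("I forgot my password", call_llm)         # hit only if similarity clears 0.9
```

Whether the second call hits depends on the embedder and the threshold, which is exactly the tuning knob the interview answers below discuss.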
Double Caching Pattern¶
```python
class DoubleCacheLLM:
    """Combine provider prompt caching with application-level semantic caching."""

    def __init__(self, llm_client, semantic_cache):
        self.llm = llm_client                # any client exposing .generate(system=..., messages=...)
        self.semantic_cache = semantic_cache

    def generate(self, system_prompt, user_query):
        # Semantic tier (L2): return a cached answer for a similar-enough query.
        cached = self.semantic_cache.get(user_query)
        if cached["hit"]:
            return cached["response"]

        # Provider tier (L3): mark the long, reused system prompt as cacheable so the
        # provider can reuse its KV cache. The block below follows the Anthropic
        # Messages API convention of a cache_control marker on a system content block.
        response = self.llm.generate(
            system=[
                {
                    "type": "text",
                    "text": system_prompt,
                    "cache_control": {"type": "ephemeral"},
                }
            ],
            messages=[{"role": "user", "content": user_query}],
        )

        # Store the fresh answer in the semantic cache for future paraphrases.
        self.semantic_cache.set(user_query, response)
        return response
```
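Wiring the two layers together might look like the sketch below. The `AnthropicWrapper` adapter and the model id are illustrative assumptions (it reuses the hypothetical `MiniLMEmbedder` from the previous sketch), and any client exposing `.generate(system=..., messages=...)` would do:

```python
import anthropic


class AnthropicWrapper:
    """Hypothetical adapter exposing .generate(system=..., messages=...) for DoubleCacheLLM."""

    def __init__(self, model="claude-sonnet-4-20250514"):
        self.client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
        self.model = model

    def generate(self, system, messages):
        msg = self.client.messages.create(
            model=self.model, max_tokens=1024, system=system, messages=messages
        )
        return msg.content[0].text


cache = SemanticCache(MiniLMEmbedder(), similarity_threshold=0.9)
llm = DoubleCacheLLM(AnthropicWrapper(), cache)
llm.generate(system_prompt="You are a support assistant for ExampleCorp...",
             user_query="How do I reset my password?")
```

A semantic hit here skips the API call entirely, while the provider prompt cache only discounts re-processing of the marked system prompt on the requests that do go through.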
Interview Questions¶
Q: What is the difference between prompt caching and semantic caching?
A: Prompt caching reuses processed tokens (the KV cache) for identical or shared prompt prefixes and works at the level of model internals. Semantic caching matches queries by meaning via embeddings and returns a cached response, working at the API level. In short: prompt caching is a model-level optimization, semantic caching is an API-level optimization.
Q: When should you use semantic caching?
A: Use it when (1) users ask similar questions in different words, (2) queries hit an FAQ or knowledge base, (3) chatbots see recurring intents. Example: "How do I reset password?" vs "I forgot my password" -- the semantic cache returns the same answer. Avoid it when answers must be deterministic or depend on real-time data.
Q: What TTL should you set for a semantic cache?
A: It depends on the domain: FAQs and stable facts -- 24h or more; news and current events -- 5-15 min; technical docs -- 1-7 days. Also important: an LRU eviction policy, a similarity threshold of 0.9-0.95, and monitoring of the hit rate. Redis LangCache supports configurable TTLs plus similarity-based invalidation.
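One way to express that guidance is a small per-domain policy table the cache layer consults; the domain names and numbers below simply restate the answer above and are not any particular product's schema:

```python
# Per-domain cache policy: TTL in seconds plus the similarity threshold for accepting a hit.
CACHE_POLICIES = {
    "faq":       {"ttl": 24 * 3600,     "similarity_threshold": 0.95},  # stable facts: 24h+
    "news":      {"ttl": 10 * 60,       "similarity_threshold": 0.95},  # current events: 5-15 min
    "tech_docs": {"ttl": 3 * 24 * 3600, "similarity_threshold": 0.90},  # docs: 1-7 days
}


def cache_for(domain, embedder):
    """Build a SemanticCache configured for the given domain's policy."""
    policy = CACHE_POLICIES[domain]
    return SemanticCache(embedder,
                         similarity_threshold=policy["similarity_threshold"],
                         ttl=policy["ttl"])
```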
Q: How do you combine prompt and semantic caching?
A: Use a multi-tier setup: L1 exact match (Redis), L2 semantic cache (vector store), L3 provider prompt cache, L4 full inference. The prompt cache handles long-context reuse, while the semantic cache handles query similarity. Combined, savings can exceed 80% versus a naive, cache-free implementation.
See Also¶
- vLLM & Paged Attention -- KV cache mechanics
- LLMOps Cost Optimization -- cost optimization strategies