
Semantic Caching vs Prompt Caching

~2 min read

Prerequisites: vLLM & Paged Attention | LLMOps Cost Optimization

Source: Redis Blog "Prompt Caching vs Semantic Caching" (Dec 2025), Zylos Research LLM Caching Strategies 2025, AWS LLM Caching Guide (Feb 2026)


Concept

Prompt Caching:
- Caches processed tokens (the KV cache) for identical prompt prefixes
- Reuses the KV cache from previous requests instead of recomputing it
- Best for: long documents, multi-turn conversations with a fixed context
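
For provider-level prompt caching, a minimal sketch assuming the official anthropic Python SDK: the long, fixed part of the prompt is marked with cache_control so the provider can reuse its KV cache on later requests. The model name and long_document_text are placeholders, and cached prefixes must also meet the provider's minimum token length.

import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

# The long, fixed part of the prompt is marked for caching;
# only the short user question changes between requests.
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",          # placeholder model name
    max_tokens=512,
    system=[{
        "type": "text",
        "text": long_document_text,              # hypothetical variable holding the document
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{"role": "user", "content": "Summarize section 3."}],
)
print(response.content[0].text)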

Semantic Caching:
- Caches responses keyed by query meaning (embeddings)
- Matches similar queries, not only exact ones
- Best for: chatbots, RAG pipelines, FAQ systems

Comparison Table

Feature         | Prompt Caching              | Semantic Caching
What is cached  | Processed tokens (KV cache) | Query-response pairs (embeddings)
Matching        | Exact / prefix match        | Similarity search
TTL             | 5-60 min (provider-set)     | Configurable
Cost reduction  | Less token computation      | Up to 90% fewer API calls
Latency         | Avoids prompt re-processing | ~15x speedup on hits
Use case        | Long context, documents     | Chatbots, FAQs

Multi-Tier Architecture (Best Practice 2026)

User Request
    |
[L1] Exact Match (Redis) -- <10ms
    | miss
[L2] Semantic Cache (Vector) -- 50-150ms
    | miss
[L3] Provider Prompt Cache (Anthropic/OpenAI)
    | miss
[L4] Full LLM Inference
    |
Cache the response in L1 and L2 (L3 is populated by the provider)
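
The tier walk can be expressed as a small dispatcher. A hedged sketch, assuming hypothetical exact_cache, semantic_cache, and call_llm_with_prompt_cache objects -- it shows only the control flow, not the Redis or vector-store wiring:

def answer(query: str) -> str:
    """Walk the cache tiers from cheapest to most expensive, falling through on miss."""
    # L1: exact match (e.g. a Redis GET on a normalized query key)
    hit = exact_cache.get(query)
    if hit is not None:
        return hit

    # L2: semantic match (embedding similarity search in a vector store)
    sem = semantic_cache.get(query)
    if sem["hit"]:
        return sem["response"]

    # L3/L4: the provider prompt cache and full inference happen inside the LLM call
    response = call_llm_with_prompt_cache(query)

    # Populate the cheaper tiers so repeated or similar queries hit earlier next time
    exact_cache.set(query, response)
    semantic_cache.set(query, response)
    return response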

Semantic Cache Implementation

import time

import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


class SemanticCache:
    """Embedding-based semantic caching for LLM responses"""

    def __init__(self, embedding_model, similarity_threshold=0.95, ttl=3600):
        self.embedder = embedding_model          # must expose .embed(text) -> vector
        self.threshold = similarity_threshold
        self.ttl = ttl                           # seconds before a cached entry expires
        self.cache = {}  # In production: Redis, Pinecone, etc.

    def get(self, query):
        """Check the semantic cache for a sufficiently similar query"""
        query_embedding = self.embedder.embed(query)
        now = time.time()

        for cached_query, cached_data in list(self.cache.items()):
            # Evict expired entries instead of returning stale responses
            if now - cached_data["timestamp"] > self.ttl:
                del self.cache[cached_query]
                continue

            similarity = cosine_similarity(query_embedding, cached_data["embedding"])

            if similarity >= self.threshold:
                return {
                    "hit": True,
                    "response": cached_data["response"],
                    "similarity": similarity,
                    "original_query": cached_query
                }

        return {"hit": False}

    def set(self, query, response):
        """Store a query-response pair together with the query embedding"""
        embedding = self.embedder.embed(query)

        self.cache[query] = {
            "embedding": embedding,
            "response": response,
            "timestamp": time.time()
        }

    def get_or_generate(self, query, llm_generate_fn):
        """Return a cached response, or generate, cache, and return a new one"""
        cached = self.get(query)
        if cached["hit"]:
            return cached["response"]

        response = llm_generate_fn(query)
        self.set(query, response)
        return response
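
Quick usage sketch. DummyEmbedder and fake_llm below are stand-ins for a real embedding model (e.g. sentence-transformers) and an actual LLM call; with a character-level toy embedder only near-identical queries match, whereas a real embedder also catches paraphrases.

class DummyEmbedder:
    """Toy embedder for illustration only; replace with a real embedding model."""
    def embed(self, text):
        # Crude bag-of-characters vector, just enough to make the example run
        vec = np.zeros(26)
        for ch in text.lower():
            if ch.isalpha():
                vec[ord(ch) - ord("a")] += 1
        norm = np.linalg.norm(vec)
        return vec / norm if norm else vec


def fake_llm(query):
    return f"Answer to: {query}"


cache = SemanticCache(DummyEmbedder(), similarity_threshold=0.9, ttl=3600)
print(cache.get_or_generate("How do I reset my password?", fake_llm))   # miss -> calls fake_llm
print(cache.get_or_generate("How do I reset my password??", fake_llm))  # near-identical -> cache hit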

Double Caching Pattern

class DoubleCacheLLM:
    """Combine provider prompt caching with application-level semantic caching"""

    def __init__(self, llm_client, semantic_cache):
        self.llm = llm_client                    # assumed to wrap an Anthropic-style Messages API
        self.semantic_cache = semantic_cache

    def generate(self, system_prompt, user_query):
        # Semantic tier: return a cached response for a similar query
        cached = self.semantic_cache.get(user_query)
        if cached["hit"]:
            return cached["response"]

        # Provider tier: mark the fixed system prompt so its KV cache can be reused
        # (Anthropic-style cache_control block; other providers cache prefixes automatically)
        response = self.llm.generate(
            system=[{
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"},
            }],
            messages=[{"role": "user", "content": user_query}]
        )

        # Populate the semantic cache so similar future queries skip the LLM call
        self.semantic_cache.set(user_query, response)
        return response
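
Wiring sketch. AnthropicClientWrapper is a hypothetical thin adapter that keeps DoubleCacheLLM provider-agnostic by mapping the generic generate(system=..., messages=...) call onto the Anthropic Messages API; the model name is a placeholder and DummyEmbedder is reused from the example above.

import anthropic


class AnthropicClientWrapper:
    """Hypothetical adapter exposing generate(system=..., messages=...)."""
    def __init__(self, model="claude-3-5-sonnet-20241022", max_tokens=512):
        self.client = anthropic.Anthropic()
        self.model = model
        self.max_tokens = max_tokens

    def generate(self, system, messages):
        resp = self.client.messages.create(
            model=self.model,
            max_tokens=self.max_tokens,
            system=system,
            messages=messages,
        )
        return resp.content[0].text


llm = DoubleCacheLLM(AnthropicClientWrapper(), SemanticCache(DummyEmbedder()))
print(llm.generate("You are a support assistant.", "How do I reset my password?"))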

Interview Questions

Q: What is the difference between prompt caching and semantic caching?

A: Prompt caching reuses processed tokens (the KV cache) for identical prompt prefixes and works at the level of model internals. Semantic caching matches queries by meaning via embeddings and returns a cached response, working at the API level. Prompt caching = model-level optimization, semantic caching = API-level optimization.

Q: When should you use semantic caching?

A: When: (1) users ask similar questions in different wording, (2) FAQ/knowledge-base queries, (3) chatbots with recurring intents. Example: "How do I reset password?" vs "I forgot my password" -- a semantic cache returns the same answer for both. Do not use it when answers must be deterministic or depend on real-time data.

Q: What TTL should you set for a semantic cache?

A: It depends on the domain: FAQ/facts -- 24h+, news/current events -- 5-15 min, technical docs -- 1-7 days. Also important: an LRU eviction policy, a similarity threshold of 0.9-0.95, and monitoring the hit rate (see the sketch below). Redis LangCache supports configurable TTL plus similarity-based invalidation.
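
A minimal sketch of TTL expiry combined with LRU eviction, using an in-memory OrderedDict as a stand-in for what Redis or LangCache would handle natively (class and parameter names are illustrative):

from collections import OrderedDict
import time


class TTLLRUCache:
    """Exact-match cache with TTL expiry and LRU eviction (illustrative only)."""

    def __init__(self, max_entries=10_000, ttl=24 * 3600):
        self.max_entries = max_entries
        self.ttl = ttl                           # e.g. 24h for FAQs, minutes for news
        self.store = OrderedDict()

    def get(self, key):
        item = self.store.get(key)
        if item is None:
            return None
        value, ts = item
        if time.time() - ts > self.ttl:          # expired -> drop instead of serving stale
            del self.store[key]
            return None
        self.store.move_to_end(key)              # mark as most recently used
        return value

    def set(self, key, value):
        self.store[key] = (value, time.time())
        self.store.move_to_end(key)
        if len(self.store) > self.max_entries:
            self.store.popitem(last=False)       # evict least recently used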

Q: How do you combine prompt caching and semantic caching?

A: Multi-tier: L1 exact match (Redis), L2 semantic (vector store), L3 provider prompt cache, L4 full inference. The prompt cache handles long-context reuse, the semantic cache handles query similarity. Combined savings exceed 80% versus a naive implementation.


See Also