
KV Cache Optimization


KV cache memory formula, PagedAttention (vLLM), prefix caching, RadixAttention (SGLang), KVFlow (NeurIPS 2025), Mooncake multi-tier, LMCache distributed, continuous batching, copy-on-write, block manager, cache timing attacks, production benchmarks (2025-2026)


Key Concepts

KV Cache -- storing the key-value vectors of previous tokens to avoid recomputing attention over them.

Memory Formula

KV Cache Memory -- the key formula for system design

\[\text{KV Cache} = 2 \times L \times H \times d_{head} \times T \times \text{bytes}\]

where \(L\) = layers, \(H\) = attention heads, \(d_{head}\) = head dim, \(T\) = sequence length. Example: LLaMA-70B serving 100 users x 4K context = ~1 TB of KV cache. Without PagedAttention that will not fit on any practical GPU cluster.

Model     | Seq Len       | KV Cache
LLaMA-7B  | 4K            | 2 GB
LLaMA-7B  | 32K           | 16 GB
LLaMA-70B | 4K            | ~10 GB per request
LLaMA-70B | 4K, 100 users | ~1 TB total
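These rows follow directly from the formula; a minimal sanity-check sketch (layer/head counts are the published LLaMA configs, and LLaMA-70B is treated as full multi-head attention, as the table assumes):

def kv_cache_bytes(layers, heads, head_dim, seq_len, bytes_per_el=2, requests=1):
    # 2 accounts for K and V; FP16 = 2 bytes per element
    return 2 * layers * heads * head_dim * seq_len * bytes_per_el * requests

GB = 1024 ** 3
print(kv_cache_bytes(32, 32, 128, 4096) / GB)                # LLaMA-7B, 4K   -> 2.0
print(kv_cache_bytes(32, 32, 128, 32768) / GB)               # LLaMA-7B, 32K  -> 16.0
print(kv_cache_bytes(80, 64, 128, 4096) / GB)                # LLaMA-70B, 4K  -> 10.0
print(kv_cache_bytes(80, 64, 128, 4096, requests=100) / GB)  # 100 users      -> 1000.0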

Optimization Landscape

Technique             | Memory Gain | Throughput Gain | Complexity
PagedAttention        | 5x          | 4-6x            | Medium
Prefix Caching        | 2-4x        | 2-3x            | Low
RadixAttention        | ~5x         | 3-4x            | Medium
KVFlow                | ~5x         | 2.19x vs SGLang | High
KV Quantization       | 2-8x        | 1.5x            | Low
Multi-Query Attention | 8x          | 2x              | High

1. Why a KV Cache

Without a KV cache, every new token requires recomputing attention over the ENTIRE sequence:

\[\text{Attention}(Q_t, K_{1:t}, V_{1:t}) = \text{softmax}\left(\frac{Q_t K_{1:t}^T}{\sqrt{d_k}}\right) V_{1:t}\]

Cost per generated token without cache: \(O(t^2)\); with cache: \(O(t)\).

# Toy single-head sketch: random projections stand in for a real model
import torch
import torch.nn.functional as F

d_k, seq_len = 64, 8
W_q, W_k, W_v = (torch.randn(d_k, d_k) for _ in range(3))

# Pre-fill phase: compute K, V for all input tokens at once
x = torch.randn(seq_len, d_k)   # input token embeddings
K_cache = x @ W_k               # (seq_len, d_k)
V_cache = x @ W_v               # (seq_len, d_v)

# Decode phase: compute K, V only for the new token, then append
for _ in range(4):
    x_new = torch.randn(1, d_k)                  # embedding of the new token
    K_cache = torch.cat([K_cache, x_new @ W_k])  # (t+1, d_k)
    V_cache = torch.cat([V_cache, x_new @ W_v])
    attn = F.softmax((x_new @ W_q) @ K_cache.T / d_k ** 0.5, dim=-1)
    output = attn @ V_cache                      # (1, d_k)

KV cache scales linearly with batch size

The formula above is for ONE request; when serving N concurrent requests, memory is multiplied by N. LLaMA-70B at 4K context: 1 request = ~10 GB, 100 requests = ~1 TB. This is exactly why PagedAttention and MQA/GQA are critical in production: without them, even 8xA100 (640 GB) can serve only a handful of concurrent requests.


2. The Memory Fragmentation Problem

Traditional Allocation

Request 1: [====____________________] (4K used, 12K alloc)
Request 2: [==______________________] (2K used, 12K alloc)
Request 3: [======__________________] (6K used, 12K alloc)

Problem: Can't compact or share unused memory
Result: 60-80% memory waste

Problem                | Description
Internal fragmentation | Request needs 500 tokens, system reserves 4096 -- 87% waste
External fragmentation | Enough total free memory, but no contiguous block
Over-reservation       | Each request reserves its maximum length even for short answers

3. PagedAttention (vLLM)

Paper: Kwon et al., SOSP 2023

OS Virtual Memory Analogy

OS Concept     | PagedAttention Equivalent
Virtual memory | Logical KV cache positions
Physical pages | GPU memory blocks
Page table     | Block table
Page fault     | On-demand allocation
Memory sharing | Prefix sharing (Copy-on-Write)

Block Structure

Block = block_size x head_dim x num_heads x 2 (K+V) x bytes (per layer)

LLaMA-7B: 16 tokens x 128 dim x 32 heads x 2 x 2 bytes = 256 KB/block

Block Table (per sequence):
  Logical Block 0 -> Physical Block 42
  Logical Block 1 -> Physical Block 17  (non-contiguous!)
  Logical Block 2 -> Physical Block 89
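The block arithmetic in code (a small sketch using the LLaMA-7B numbers above; vLLM keeps a separate cache per layer, so the 256 KB figure is per layer):

import math

block_size, head_dim, num_heads, dtype_bytes = 16, 128, 32, 2  # LLaMA-7B, FP16
block_bytes = block_size * head_dim * num_heads * 2 * dtype_bytes  # 2 = K + V
print(block_bytes // 1024, "KB per block (per layer)")  # 256 KB

seq_len = 1000
num_blocks = math.ceil(seq_len / block_size)   # ceil(T / B)
waste = num_blocks * block_size - seq_len
print(num_blocks, "blocks;", waste, "wasted token slots")  # 63 blocks; 8 slots
# Internal fragmentation is now bounded by block_size - 1 tokens per sequence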

PagedAttention Algorithm

import torch
import torch.nn.functional as F

def paged_attention(
    query: torch.Tensor,       # [num_tokens, num_heads, head_dim]
    key_cache: torch.Tensor,   # [num_blocks, num_heads, block_size, head_dim]
    value_cache: torch.Tensor,
    block_tables: torch.Tensor,  # [num_seqs, max_blocks_per_seq]
    context_lens: torch.Tensor,  # [num_seqs]
    block_size: int = 16,
) -> torch.Tensor:
    # Reference implementation for clarity (no causal mask, no fused kernel)
    scale = 1.0 / (query.shape[-1] ** 0.5)
    output = torch.zeros_like(query)
    token_idx = 0

    for seq_idx, seq_len in enumerate(context_lens.tolist()):
        block_table = block_tables[seq_idx]
        num_blocks = (seq_len + block_size - 1) // block_size

        # Gather keys/values from non-contiguous physical blocks
        keys = [key_cache[block_table[i]] for i in range(num_blocks)]
        values = [value_cache[block_table[i]] for i in range(num_blocks)]

        seq_keys = torch.cat(keys, dim=1)[:, :seq_len, :]    # [heads, seq, dim]
        seq_values = torch.cat(values, dim=1)[:, :seq_len, :]
        seq_query = query[token_idx:token_idx + seq_len]

        # Standard attention on the gathered KV
        attn = F.softmax(
            torch.einsum('thd,hsd->ths', seq_query, seq_keys) * scale, dim=-1
        )
        output[token_idx:token_idx + seq_len] = torch.einsum(
            'ths,hsd->thd', attn, seq_values
        )
        token_idx += seq_len
    return output
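A toy driver for the reference implementation above (shapes and block ids are made up; production vLLM executes this as a fused CUDA kernel over the block tables):

import torch

num_blocks, num_heads, block_size, head_dim = 8, 4, 16, 64
key_cache = torch.randn(num_blocks, num_heads, block_size, head_dim)
value_cache = torch.randn(num_blocks, num_heads, block_size, head_dim)

# Two sequences (20 and 10 tokens) scattered over non-contiguous blocks
block_tables = torch.tensor([[2, 5], [7, 0]])
context_lens = torch.tensor([20, 10])
query = torch.randn(30, num_heads, head_dim)  # 20 + 10 query tokens

out = paged_attention(query, key_cache, value_cache, block_tables, context_lens)
print(out.shape)  # torch.Size([30, 4, 64])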

Block Manager

import torch

class BlockManager:
    def __init__(self, num_blocks, block_size, num_heads, head_dim):
        self.block_size = block_size
        self.key_cache = torch.zeros(num_blocks, num_heads, block_size, head_dim)
        self.value_cache = torch.zeros(num_blocks, num_heads, block_size, head_dim)
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> [physical block ids]

    def allocate(self, seq_id):
        if not self.free_blocks:
            raise MemoryError("out of KV cache blocks")  # vLLM preempts/swaps here
        block_id = self.free_blocks.pop()
        self.block_tables.setdefault(seq_id, []).append(block_id)
        return block_id

    def free(self, seq_id):
        for block_id in self.block_tables.pop(seq_id, []):
            self.free_blocks.append(block_id)

    def write(self, seq_id, token_idx, keys, values):
        # keys/values: [num_heads, head_dim] for a single token
        block_idx = token_idx // self.block_size
        offset = token_idx % self.block_size
        while len(self.block_tables.setdefault(seq_id, [])) <= block_idx:
            self.allocate(seq_id)  # grow on demand, one block at a time
        physical = self.block_tables[seq_id][block_idx]
        self.key_cache[physical, :, offset] = keys
        self.value_cache[physical, :, offset] = values
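A short usage sketch (hypothetical sizes; one write per decoded token, as during serving):

import torch

mgr = BlockManager(num_blocks=64, block_size=16, num_heads=32, head_dim=128)
for t in range(20):  # 20 tokens: the second block is allocated on demand at t=16
    mgr.write("seq-0", t, torch.randn(32, 128), torch.randn(32, 128))

print(len(mgr.block_tables["seq-0"]))  # 2 blocks for 20 tokens (16 + 4)
mgr.free("seq-0")
print(len(mgr.free_blocks))            # 64 -- all blocks returned to the pool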

Performance

System                   | Memory Utilization | Throughput
HuggingFace Transformers | 20-40%             | 1x
Orca                     | 40-60%             | 1.5x
vLLM (PagedAttention)    | 90%+               | 4-6x

Model     | Batch | HF Transformers | vLLM       | Speedup
LLaMA-7B  | 32    | 1.2 req/s       | 5.8 req/s  | 4.8x
LLaMA-7B  | 128   | OOM             | 14.2 req/s | --
LLaMA-13B | 32    | 0.6 req/s       | 3.1 req/s  | 5.2x
LLaMA-70B | 8     | OOM             | 0.8 req/s  | --

4. Prefix Caching

Concept

If several requests share a common prefix (system prompt, few-shot examples, conversation history), the KV cache for that prefix can be reused.

\[\text{Speedup} = \frac{T_{\text{prefix}} + T_{\text{query}}}{T_{\text{query}}}\]
Without: 100 users x 1000-token system prompt = 100,000 prefill tokens
With:    1000 tokens prefilled once + 100 users x 0 redundant = 99% savings

vLLM: Block Hashing

import hashlib

class PrefixCache:
    def __init__(self):
        self.cache = {}  # prefix hash -> physical block ids

    def compute_hash(self, tokens):
        return hashlib.md5(str(tokens).encode()).hexdigest()

    def get_cached_blocks(self, tokens):
        # Longest-prefix match: try the full sequence first, then shrink
        for prefix_len in range(len(tokens), 0, -1):
            h = self.compute_hash(tokens[:prefix_len])
            if h in self.cache:
                return self.cache[h], prefix_len
        return [], 0
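Usage sketch (token ids and block ids are invented; production vLLM hashes block-aligned prefixes rather than probing every possible length):

cache = PrefixCache()

system = [101, 2023, 2003, 1037, 2291]              # token ids of a shared system prompt
cache.cache[cache.compute_hash(system)] = [42, 17]  # physical blocks holding its KV

request = system + [7099, 2023]                     # same prefix + user-specific suffix
blocks, prefix_len = cache.get_cached_blocks(request)
print(blocks, prefix_len)  # [42, 17] 5 -> only the 2 suffix tokens need prefill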

Copy-on-Write (CoW)

Block Table 1: [0,1,2,3] -> [unique blocks]
Block Table 2: [0,1,2,3] -> [SHARED] -> [unique]
Block Table 3: [0,1,2,3] -> [SHARED] -> [unique]

Blocks 0-3 are physically shared.
When a request writes to a shared block:
  1. Copy the block
  2. Update the copy (not the original)
Memory savings: 2-4x for shared prefixes
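A minimal sketch of the copy-on-write rule itself (reference-counted physical blocks; illustrative, not vLLM's actual block allocator):

class CowBlockPool:
    def __init__(self):
        self.ref_count = {}  # physical block id -> number of sequences sharing it
        self.next_id = 0

    def alloc(self):
        block_id, self.next_id = self.next_id, self.next_id + 1
        self.ref_count[block_id] = 1
        return block_id

    def fork(self, block_id):
        # A new sequence starts sharing an existing prefix block: no copy yet
        self.ref_count[block_id] += 1
        return block_id

    def write(self, block_id):
        if self.ref_count[block_id] == 1:
            return block_id  # exclusively owned: write in place
        # Shared: copy the block, write to the copy, leave the original intact
        self.ref_count[block_id] -= 1
        return self.alloc()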

5. RadixAttention (SGLang)

A radix tree for automatic prefix sharing with token-level granularity.

                [root]
               /      \
          "You are"   "The task"
          /    \           \
    "a helpful" "concise"  "is to"
       /                      \
  "assistant"              "analyze"

Features

  • Token-level granularity (vs block-level in vLLM)
  • Automatic prefix detection (no manual specification)
  • LRU eviction policy
  • O(k) lookup, k = prefix length

Code

class RadixTreeNode:
    def __init__(self):
        self.children = {}    # token id -> child node
        self.kv_cache = None  # KV blocks for the prefix ending at this node
        self.ref_count = 0

class RadixAttention:
    # Simplified per-token trie; real SGLang compresses runs into radix-tree edges
    def __init__(self):
        self.root = RadixTreeNode()

    def insert(self, tokens, kv_cache):
        node = self.root
        for token in tokens:
            if token not in node.children:
                node.children[token] = RadixTreeNode()
            node = node.children[token]
        node.kv_cache = kv_cache
        node.ref_count += 1

    def lookup(self, tokens):
        # Walk as deep as possible, remembering the last node with cached KV
        node = self.root
        last_match, matched_len = None, 0
        for i, token in enumerate(tokens):
            if token in node.children:
                node = node.children[token]
                if node.kv_cache is not None:
                    last_match, matched_len = node, i + 1
            else:
                break
        return last_match, matched_len
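Shared-prefix reuse in action (a sketch; the kv_cache payload is a stand-in for real block ids):

tree = RadixAttention()

prefix = [1, 2, 3]                          # e.g. a tokenized system prompt
tree.insert(prefix, kv_cache="KV(prefix)")  # cache KV for the shared prefix

request = [1, 2, 3, 20, 21]                 # same prefix, new continuation
node, matched = tree.lookup(request)
print(matched)  # 3 -> tokens 0..2 reuse node.kv_cache; only 2 tokens are prefilled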

SGLang vs vLLM

Feature     | vLLM                             | SGLang
Granularity | Block-level (16 tokens)          | Token-level
Structure   | Hash table                       | Radix tree
Sharing     | Explicit prefix                  | Automatic
Best for    | Simple prefixes, high throughput | Complex branching, agents

Performance

Scenario              | No Cache | RadixAttention | Speedup
Multi-turn chat       | 100ms    | 35ms           | 2.9x
System prompt reuse   | 200ms    | 50ms           | 4x
RAG with same context | 500ms    | 150ms          | 3.3x

6. KVFlow (NeurIPS 2025)

Workflow-aware KV caching for multi-agent systems.

       [User Request]
             |
       [Agent A] <-- shared prefix (system prompt)
            / \
    [Agent B] [Agent C] <-- different continuations
         \     /
       [Agent D] <-- aggregation

Innovations

Innovation          | Description
Workflow prediction | Anticipate agent execution paths
Smart preemption    | Evict unlikely-to-be-used caches
Prefix sharing      | Share common prefixes across agents
Priority scheduling | Prioritize high-probability paths

Performance vs SGLang

Scenario                      | SGLang | KVFlow | Speedup
Single workflow, large prompt | 1.0x   | 1.83x  | +83%
Multi-workflow                | 1.0x   | 2.19x  | +119%
Agentic RAG                   | 1.0x   | 1.65x  | +65%

7. Mooncake: Multi-Tier KV Storage

Tier 0: GPU Memory (hot prefixes)     | Latency: <1ms
Tier 1: CPU Memory (warm prefixes)    | Latency: 5-10ms
Tier 2: SSD/Disk (cold prefixes)      | Latency: 50-100ms

LRU + frequency-based eviction across tiers
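A minimal sketch of the tiering idea (LRU per tier with spill-down and promote-on-hit; the policy details are illustrative, not Mooncake's actual implementation):

from collections import OrderedDict

class TieredKVStore:
    def __init__(self, gpu_slots, cpu_slots):
        self.tiers = [OrderedDict(), OrderedDict(), OrderedDict()]  # GPU, CPU, disk
        self.capacity = [gpu_slots, cpu_slots, float("inf")]

    def get(self, prefix_hash):
        for level, tier in enumerate(self.tiers):
            if prefix_hash in tier:
                kv = tier.pop(prefix_hash)
                self._put(0, prefix_hash, kv)  # promote hot entries back to GPU
                return kv, level               # level tells you the latency paid
        return None, -1

    def put(self, prefix_hash, kv):
        self._put(0, prefix_hash, kv)

    def _put(self, level, prefix_hash, kv):
        tier = self.tiers[level]
        tier[prefix_hash] = kv
        tier.move_to_end(prefix_hash)          # most recently used
        if len(tier) > self.capacity[level]:
            victim, victim_kv = tier.popitem(last=False)  # LRU spills one tier down
            self._put(level + 1, victim, victim_kv)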

8. LMCache: Distributed KV Cache

Paper: Liu et al., arXiv:2510.09665 (Oct 2025)

vLLM Engine / SGLang Engine / Other Engines
                    |
             LMCache Connector
              /     |      \
        GPU Memory  CPU Memory  Disk/Network

Workload            | Speedup
Multi-turn QA       | 15x
Document analysis   | 12x
Agent conversations | 8x

9. Continuous Batching

Iteration-level scheduling instead of static batching.

class ContinuousBatcher:
    def __init__(self, model, max_batch_size):
        self.model = model
        self.max_batch_size = max_batch_size
        self.running = []
        self.waiting = []

    def step(self):
        # Retire finished requests and admit waiting ones -- every iteration
        self.running = [r for r in self.running if not r.finished]
        while len(self.running) < self.max_batch_size and self.waiting:
            self.running.append(self.waiting.pop(0))
        if self.running:
            tokens = [r.next_token for r in self.running]
            outputs = self.model.forward_batch(tokens)
            for r, out in zip(self.running, outputs):
                r.update(out)

  • No waiting for batch fill
  • Variable batch size adapts to load
  • Lower first-token latency
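A dummy driver to make the scheduling behavior of the batcher above visible (the request and model stubs are placeholders, not a real engine):

class DummyRequest:
    def __init__(self, rid, length):
        self.rid, self.remaining = rid, length
        self.next_token, self.finished = 0, False

    def update(self, token):
        self.next_token = token
        self.remaining -= 1
        self.finished = self.remaining == 0

class DummyModel:
    def forward_batch(self, tokens):
        return [t + 1 for t in tokens]  # pretend one decode step

batcher = ContinuousBatcher(DummyModel(), max_batch_size=2)
batcher.waiting = [DummyRequest(i, n) for i, n in enumerate([1, 3, 2])]

for step in range(4):
    batcher.step()
    print(step, [r.rid for r in batcher.running])
# Request 0 finishes after step 0; request 2 is admitted at step 1
# without waiting for request 1 -- iteration-level scheduling in action.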

10. Security: Cache Timing Attacks

User A: [System Prompt] + [Query A]
User B: [System Prompt] + [Query B]

If User B's request is faster than expected,
attacker learns User A sent same system prompt.

Mitigations: constant-time cache lookup, per-user cache isolation, cache hit masking, differential privacy.


11. Best Practices

Prompt Design for Cache Hits

# BAD: User-specific data in system prompt
system_prompt = f"You are helpful. User: {username}"  # Changes every request!

# GOOD: Stable prefix, variable suffix
system_prompt = "You are a helpful assistant."  # Cacheable!
user_context = f"User: {username}"  # Variable part after

Cache-Aware API Usage

# Anthropic: explicit cache control
response = client.messages.create(
    model="claude-sonnet-4.5",
    system=[{
        "type": "text",
        "text": LONG_SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"}
    }],
    messages=messages,
)

12. Production Benchmarks

Stripe Case Study (Dec 2025)

  • 50M API calls, migrated from HF Transformers to vLLM
  • Same workload served on ⅓ of the GPU fleet
  • Cost reduction: 73%

Throughput (tokens/sec)

Model        | Traditional | vLLM  | SGLang
Llama-7B     | 500         | 2,000 | 2,500
Llama-70B    | 50          | 300   | 400
Mixtral-8x7B | 100         | 800   | 1,000

Memory Efficiency

System                  | KV Cache Waste | Effective Batch Size
HF Transformers         | 60-80%         | 8
vLLM (PagedAttention)   | <4%            | 32
SGLang (RadixAttention) | <1%            | 64

13. Decision Framework

Scenario                 | Recommended
Simple chatbot           | vLLM + PagedAttention
Multi-turn conversations | vLLM + Prefix Caching
Agent workflows          | SGLang + RadixAttention
Multi-agent systems      | KVFlow
Very long contexts       | LMCache + CPU offload
Large-scale multi-tenant | Mooncake + isolation
Maximum throughput       | SGLang + Continuous batching
Simple deployment        | vLLM (OpenAI-compatible API)

Framework Comparison

Feature     | vLLM  | SGLang     | KVFlow   | Mooncake
Granularity | Block | Token      | Workflow | Multi-tier
Structure   | Hash  | Radix tree | Graph    | Tiered
Agent-aware | No    | No         | Yes      | Partial
Distributed | Yes   | Yes        | Yes      | Yes

For Interviews

Q: "What is a KV cache and why is it needed?"

The KV cache stores the key-value vectors of previous tokens so that generating a new token does not require recomputing attention over the entire sequence. Without the cache: O(t^2) per token; with it: O(t). Size: 2 x layers x heads x dim x seq_len x bytes. For Llama-70B at 4K tokens -- ~10 GB per request. The problem: with static allocation, 60-80% of memory is wasted to fragmentation.

Q: "Explain PagedAttention and how it solves the memory problem."

PagedAttention (vLLM, SOSP 2023) is an analogy to OS virtual memory. GPU memory is divided into fixed-size blocks (16 tokens). Each sequence has a block table (logical -> physical block). Blocks are allocated on demand and may be non-contiguous. Result: <4% waste (vs 60-80%), 4-6x throughput, 2-4x larger batch sizes. Stripe: 73% cost reduction on 50M API calls. Prefix caching works via Copy-on-Write: shared prefix blocks are shared across requests and copied only on write.

Q: "Compare vLLM and SGLang for prefix caching."

vLLM: block-level hashing (16 tokens/block), hash table, explicit prefix matching. SGLang: RadixAttention -- token-level granularity, radix tree, automatic prefix detection. SGLang is better for complex branching and agent workflows (2.9-4x speedup). vLLM is better for simple prefixes and high throughput (lower memory overhead). For multi-agent systems: KVFlow (NeurIPS 2025) -- workflow-aware caching, 2.19x speedup over SGLang. For large scale: Mooncake multi-tier (GPU/CPU/SSD).


Key Numbers

Fact                                   | Value
Traditional allocation waste           | 60-80%
PagedAttention waste                   | <4%
RadixAttention waste                   | <1%
PagedAttention throughput gain         | 4-6x
Prefix caching speedup (system prompt) | 4x
KVFlow vs SGLang (multi-workflow)      | 2.19x
LMCache multi-turn QA speedup          | 15x
Stripe cost reduction (vLLM)           | 73%
KV cache, Llama-70B, 4K tokens         | ~10 GB
Block size (typical)                   | 16 tokens
Cache hit rate (system prompt reuse)   | 80-95%
Cache hit rate (agent workflows)       | 40-75%

Formulas

KV Cache Memory

\[\text{Memory} = 2 \times L \times H \times d_{head} \times T \times \text{bytes}\]

Memory Waste (Traditional)

\[\text{Waste} = \frac{T_{max} - T_{actual}}{T_{max}}\]

Block Allocation

\[\text{Blocks} = \lceil T / B \rceil\]

Prefix Caching Speedup

\[\text{Speedup} = \frac{T_{\text{prefix}} + T_{\text{query}}}{T_{\text{query}}}\]

Cache Hit Rate

\[\text{Hit Rate} = \frac{\text{Cached Prefix Tokens}}{\text{Total Prefix Tokens}}\]

Self-Check

  1. Compute the KV cache size for Mistral-7B (32 layers, 8 KV heads (GQA), dim=128, FP16) at sequence length 8192. Compare with LLaMA-7B (32 layers, 32 heads, dim=128).
  2. A system serves 50 users with a shared system prompt (1000 tokens) and unique queries (500 tokens each). Compute the memory savings from prefix caching.
  3. Explain why RadixAttention (token-level) suits agent workflows better than vLLM's block-level caching (hint: think about branching).

Sources

  1. Kwon et al. -- "Efficient Memory Management for LLM Serving with PagedAttention" (SOSP 2023)
  2. Liu et al. -- "LMCache: Distributed KV Cache" (arXiv:2510.09665, Oct 2025)
  3. "KVFlow: Efficient Prefix Caching for LLM-Based Multi-Agent Workflows" (NeurIPS 2025)
  4. vLLM Documentation -- docs.vllm.ai
  5. SGLang Documentation -- github.com/sgl-project/sglang
  6. Mooncake -- "Serving Platform for Kimi LLM"
  7. Red Hat -- "How PagedAttention Resolves Memory Waste" (July 2025)
  8. Introl -- "vLLM Production Deployment" (Feb 2026)
  9. Subhash Dasyam -- "Long-Context Inference Security: KV-Cache Privacy"
  10. vLLM Blog -- "Inside vLLM: Anatomy of a High-Throughput LLM Inference System" (2025)
