KV Cache Optimization¶
~7 min read
KV cache memory formula, PagedAttention (vLLM), prefix caching, RadixAttention (SGLang), KVFlow (NeurIPS 2025), Mooncake multi-tier, LMCache distributed, continuous batching, copy-on-write, block manager, cache timing attacks, production benchmarks (2025-2026)
Key Concepts¶
KV Cache -- storing the key-value vectors of previous tokens so attention over them does not have to be recomputed.
Memory Formula¶
KV Cache Memory -- the key formula for system design
\(\text{KV Cache Memory} = 2 \times L \times H \times d_{head} \times T \times \text{bytes per element}\)
where \(L\) = layers, \(H\) = attention heads, \(d_{head}\) = head dim, \(T\) = sequence length, and the factor of 2 accounts for storing both K and V. Example: LLaMA-70B, 100 users x 4K context = ~1 TB of KV cache. Without PagedAttention that does not fit even on an 8xA100 node (640 GB).
| Model | Seq Len | KV Cache per Request |
|---|---|---|
| LLaMA-7B | 4K | 2 GB |
| LLaMA-7B | 32K | 16 GB |
| LLaMA-70B | 4K | ~10 GB |
| LLaMA-70B | 4K, 100 users | ~1 TB |
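A quick sanity check of these numbers (a minimal sketch; the LLaMA-70B row assumes full multi-head attention with 80 layers and 64 heads, i.e. no GQA):
def kv_cache_bytes(layers, heads, head_dim, seq_len, bytes_per_elem=2):
    """KV cache for one request: 2 (K and V) x layers x heads x head_dim x seq_len x bytes."""
    return 2 * layers * heads * head_dim * seq_len * bytes_per_elem

GB = 1024 ** 3
print(kv_cache_bytes(32, 32, 128, 4096) / GB)               # LLaMA-7B, 4K: ~2.0 GB
print(kv_cache_bytes(32, 32, 128, 32_768) / GB)             # LLaMA-7B, 32K: ~16.0 GB
print(kv_cache_bytes(80, 64, 128, 4096) / GB)               # LLaMA-70B (MHA assumption), 4K: ~10.0 GB
print(100 * kv_cache_bytes(80, 64, 128, 4096) / 1024**4)    # 100 users: ~1.0 TB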
Optimization Landscape¶
| Technique | Memory Gain | Throughput Gain | Complexity |
|---|---|---|---|
| PagedAttention | 5x | 4-6x | Medium |
| Prefix Caching | 2-4x | 2-3x | Low |
| RadixAttention | ~5x | 3-4x | Medium |
| KVFlow | ~5x | 2.19x vs SGLang | High |
| KV Quantization | 2-8x | 1.5x | Low |
| Multi-Query Attention | 8x | 2x | High |
1. Why KV Cache¶
Without a KV cache, every new token requires recomputing attention over the ENTIRE sequence:
Cost per generated token without cache: \(O(t^2)\). With cache: \(O(t)\).
# Illustrative pseudocode: `model`, `generate`, and `attention` stand in for real implementations
# Prefill phase: compute K, V for all input tokens at once
K_cache = model.compute_K(input_tokens)  # (seq_len, d_k)
V_cache = model.compute_V(input_tokens)  # (seq_len, d_v)
# Decode phase: compute K, V only for the new token and append to the cache
for new_token in generate():
    K_new = model.compute_K(new_token)   # (1, d_k)
    V_new = model.compute_V(new_token)   # (1, d_v)
    K_cache = torch.cat([K_cache, K_new], dim=0)
    V_cache = torch.cat([V_cache, V_new], dim=0)
    output = attention(model.compute_Q(new_token), K_cache, V_cache)
KV cache scales linearly with batch size
The KV cache size above is per ONE request; when serving N concurrent requests the memory is multiplied by N. LLaMA-70B at 4K context: 1 request = ~10 GB, 100 requests = ~1 TB. This is exactly why PagedAttention and MQA/GQA are critical in production: without them even 8xA100 (640 GB) can serve only a handful of concurrent requests.
2. The Memory Fragmentation Problem¶
Traditional Allocation¶
Request 1: [====____________________] (4K used, 12K alloc)
Request 2: [==______________________] (2K used, 12K alloc)
Request 3: [======__________________] (6K used, 12K alloc)
Problem: Can't compact or share unused memory
Result: 60-80% memory waste
| Problem | Description |
|---|---|
| Internal fragmentation | Request needs 500 tokens, system reserves 4096 -- 87% waste |
| External fragmentation | Enough total free memory, but no contiguous block |
| Over-reservation | Each request reserves max length even for short answers |
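The internal-fragmentation figure from the table, checked directly with the numbers given there:
reserved, used = 4096, 500
print(f"internal fragmentation: {1 - used / reserved:.1%}")  # 87.8% of the reserved slot is never used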
3. PagedAttention (vLLM)¶
Paper: Kwon et al., SOSP 2023
OS Virtual Memory Analogy¶
| OS Concept | PagedAttention Equivalent |
|---|---|
| Virtual memory | Logical KV cache positions |
| Physical pages | GPU memory blocks |
| Page table | Block table |
| Page fault | On-demand allocation |
| Memory sharing | Prefix sharing (Copy-on-Write) |
Block Structure¶
Block = block_size x head_dim x num_heads x 2 (K+V) x bytes
LLaMA-7B: 16 tokens x 128 dim x 32 heads x 2 x 2 bytes = 256 KB/block
Block Table (per sequence):
Logical Block 0 -> Physical Block 42
Logical Block 1 -> Physical Block 17 (non-contiguous!)
Logical Block 2 -> Physical Block 89
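The same arithmetic in code, using the LLaMA-7B-like dimensions from the line above, plus how many blocks a sequence needs (the ceiling division matches the one in the kernel sketch below):
block_size, head_dim, num_heads, bytes_per_elem = 16, 128, 32, 2
block_bytes = block_size * head_dim * num_heads * 2 * bytes_per_elem  # x2 for K and V
print(block_bytes // 1024, "KB per block")                            # 256 KB

seq_len = 4096
num_blocks = (seq_len + block_size - 1) // block_size                 # ceiling division
print(num_blocks, "blocks for a 4K sequence")                         # 256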
PagedAttention Algorithm¶
import torch
import torch.nn.functional as F

def paged_attention(
    query: torch.Tensor,        # [num_tokens, num_heads, head_dim]
    key_cache: torch.Tensor,    # [num_blocks, num_heads, block_size, head_dim]
    value_cache: torch.Tensor,  # [num_blocks, num_heads, block_size, head_dim]
    block_tables: torch.Tensor, # [num_seqs, max_blocks_per_seq]
    context_lens: torch.Tensor, # [num_seqs]
    block_size: int = 16,
) -> torch.Tensor:
    # Reference implementation: gather blocks, then dense attention (causal mask omitted for brevity)
    scale = 1.0 / (query.shape[-1] ** 0.5)
    output = torch.zeros_like(query)
    token_idx = 0
    for seq_idx in range(len(context_lens)):
        seq_len = int(context_lens[seq_idx])
        block_table = block_tables[seq_idx]
        num_blocks = (seq_len + block_size - 1) // block_size
        # Gather keys/values from non-contiguous physical blocks via the block table
        keys = [key_cache[block_table[i]] for i in range(num_blocks)]
        values = [value_cache[block_table[i]] for i in range(num_blocks)]
        seq_keys = torch.cat(keys, dim=1)[:, :seq_len, :]      # [heads, seq_len, dim]
        seq_values = torch.cat(values, dim=1)[:, :seq_len, :]
        seq_query = query[token_idx:token_idx + seq_len]        # [seq_len, heads, dim]
        # Standard attention on the gathered KV
        attn = F.softmax(
            torch.einsum('thd,hsd->ths', seq_query, seq_keys) * scale, dim=-1
        )
        output[token_idx:token_idx + seq_len] = torch.einsum(
            'ths,hsd->thd', attn, seq_values
        )
        token_idx += seq_len
    return output
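A minimal usage example for the reference function above (shapes and the block-table indices are arbitrary illustrations):
import torch

num_blocks, num_heads, block_size, head_dim = 8, 4, 16, 32
key_cache = torch.randn(num_blocks, num_heads, block_size, head_dim)
value_cache = torch.randn(num_blocks, num_heads, block_size, head_dim)
# One sequence of 20 tokens -> 2 blocks; physical ids 5 and 2 are deliberately non-contiguous
block_tables = torch.tensor([[5, 2, 0, 0]])
context_lens = torch.tensor([20])
query = torch.randn(20, num_heads, head_dim)

out = paged_attention(query, key_cache, value_cache, block_tables, context_lens, block_size)
print(out.shape)  # torch.Size([20, 4, 32])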
Block Manager¶
class BlockManager:
def __init__(self, num_blocks, block_size, num_heads, head_dim):
self.block_size = block_size
self.key_cache = torch.zeros(num_blocks, num_heads, block_size, head_dim)
self.value_cache = torch.zeros(num_blocks, num_heads, block_size, head_dim)
self.free_blocks = list(range(num_blocks))
self.block_tables = {}
def allocate(self, seq_id):
block_id = self.free_blocks.pop()
self.block_tables.setdefault(seq_id, []).append(block_id)
return block_id
def free(self, seq_id):
for block_id in self.block_tables.pop(seq_id, []):
self.free_blocks.append(block_id)
def write(self, seq_id, token_idx, keys, values):
block_idx = token_idx // self.block_size
offset = token_idx % self.block_size
while len(self.block_tables[seq_id]) <= block_idx:
self.allocate(seq_id)
physical = self.block_tables[seq_id][block_idx]
self.key_cache[physical, :, offset] = keys
self.value_cache[physical, :, offset] = values
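A small usage sketch of the manager above (dimensions and the "req-1" id are hypothetical):
import torch

manager = BlockManager(num_blocks=64, block_size=16, num_heads=32, head_dim=128)
keys = torch.randn(32, 128)    # K for one token, all heads
values = torch.randn(32, 128)

for pos in range(20):          # 20 tokens -> blocks are allocated on demand as positions grow
    manager.write(seq_id="req-1", token_idx=pos, keys=keys, values=values)

print(manager.block_tables["req-1"])   # e.g. [63, 62] -- physical ids, not necessarily contiguous
manager.free("req-1")                  # blocks return to the free pool when the request finishes
print(len(manager.free_blocks))        # 64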
Performance¶
| System | Memory Utilization | Throughput |
|---|---|---|
| HuggingFace Transformers | 20-40% | 1x |
| Orca | 40-60% | 1.5x |
| vLLM (PagedAttention) | 90%+ | 4-6x |
| Model | Batch | HF Transformers | vLLM | Speedup |
|---|---|---|---|---|
| LLaMA-7B | 32 | 1.2 req/s | 5.8 req/s | 4.8x |
| LLaMA-7B | 128 | OOM | 14.2 req/s | -- |
| LLaMA-13B | 32 | 0.6 req/s | 3.1 req/s | 5.2x |
| LLaMA-70B | 8 | OOM | 0.8 req/s | -- |
4. Prefix Caching¶
Concept¶
If several requests share a common prefix (system prompt, few-shot examples, conversation history), their KV cache can be reused.
Without: 100 users x 1000-token system prompt = 100,000 prefill tokens
With: 1 shared prefill + 100 users x 0 redundant tokens = ~99% savings
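The same savings calculation as a worked example (using only the shared system prompt, as in the lines above):
users, system_prompt_tokens = 100, 1000
without_cache = users * system_prompt_tokens   # 100,000 prefill tokens for the shared prompt
with_cache = system_prompt_tokens              # the shared prefix is prefilled once and reused
print(f"prefill tokens saved: {1 - with_cache / without_cache:.0%}")  # 99%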
vLLM: Block Hashing¶
import hashlib

class PrefixCache:
    # Simplified sketch: real vLLM hashes per fixed-size block (16 tokens), not arbitrary prefixes
    def __init__(self):
        self.cache = {}  # prefix hash -> list of physical block ids

    def compute_hash(self, tokens):
        return hashlib.md5(str(tokens).encode()).hexdigest()

    def add(self, tokens, block_ids):
        # Register the KV blocks of an already-computed prefix for later reuse
        self.cache[self.compute_hash(tokens)] = block_ids

    def get_cached_blocks(self, tokens):
        # Longest-prefix match: try the full prompt first, then progressively shorter prefixes
        for prefix_len in range(len(tokens), 0, -1):
            h = self.compute_hash(tokens[:prefix_len])
            if h in self.cache:
                return self.cache[h], prefix_len
        return [], 0
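A usage example for the sketch above (token ids and block ids are toy values):
cache = PrefixCache()

system_prompt = list(range(1000))                      # token ids of a shared system prompt
cache.add(system_prompt, block_ids=list(range(63)))    # 1000 tokens / 16 per block -> 63 blocks

request = system_prompt + [42, 7, 99]                  # same prefix + a unique user query
blocks, matched = cache.get_cached_blocks(request)
print(matched)                                         # 1000 -- only the 3 new tokens need prefill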
Copy-on-Write (CoW)¶
Block Table 1: [0,1,2,3] -> [unique blocks]
Block Table 2: [0,1,2,3] -> [SHARED] -> [unique]
Block Table 3: [0,1,2,3] -> [SHARED] -> [unique]
Blocks 0-3 are physically shared.
When request writes to shared block:
1. Copy the block
2. Update copy (not original)
Memory savings: 2-4x for shared prefixes
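A minimal sketch of how copy-on-write could be layered on top of the BlockManager above (ref counts and method names are illustrative, not vLLM's actual internals):
class CoWBlockManager(BlockManager):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.ref_counts = {}  # physical block id -> number of sequences sharing it

    def share(self, src_seq, dst_seq, num_blocks):
        # The new sequence points at the same physical prefix blocks -- no copy yet
        shared = self.block_tables[src_seq][:num_blocks]
        self.block_tables[dst_seq] = list(shared)
        for b in shared:
            self.ref_counts[b] = self.ref_counts.get(b, 1) + 1

    def write(self, seq_id, token_idx, keys, values):
        block_idx = token_idx // self.block_size
        table = self.block_tables.get(seq_id, [])
        if block_idx < len(table) and self.ref_counts.get(table[block_idx], 1) > 1:
            # Copy-on-write: the block is shared, so copy it before modifying
            old = table[block_idx]
            new = self.free_blocks.pop()
            self.key_cache[new] = self.key_cache[old].clone()
            self.value_cache[new] = self.value_cache[old].clone()
            self.ref_counts[old] -= 1
            table[block_idx] = new
        super().write(seq_id, token_idx, keys, values)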
5. RadixAttention (SGLang)¶
A radix tree for automatic prefix sharing with token-level granularity.
Features¶
- Token-level granularity (vs block-level в vLLM)
- Automatic prefix detection (no manual specification)
- LRU eviction policy
- O(k) lookup, k = prefix length
Code¶
class RadixTreeNode:
    def __init__(self):
        self.children = {}     # token id -> child node
        self.kv_cache = None   # KV blocks for the prefix ending at this node
        self.ref_count = 0

# Simplified: one token per edge; a real radix tree compresses runs of single-child nodes
class RadixAttention:
    def __init__(self):
        self.root = RadixTreeNode()

    def insert(self, tokens, kv_cache):
        node = self.root
        for token in tokens:
            if token not in node.children:
                node.children[token] = RadixTreeNode()
            node = node.children[token]
        node.kv_cache = kv_cache
        node.ref_count += 1

    def lookup(self, tokens):
        # Walk the tree and remember the deepest node that has cached KV
        node = self.root
        last_match, matched_len = None, 0
        for i, token in enumerate(tokens):
            if token in node.children:
                node = node.children[token]
                if node.kv_cache is not None:
                    last_match, matched_len = node, i + 1
            else:
                break
        return last_match, matched_len
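A short usage example for the simplified tree above (token ids and the cache payload are toy values):
tree = RadixAttention()

shared = [1, 2, 3, 4]                              # e.g. a tokenized system prompt
tree.insert(shared, kv_cache="blocks-for-shared-prefix")

node, matched = tree.lookup([1, 2, 3, 4, 9, 9])    # new request: same prefix + new tokens
print(matched)                                     # 4 -- only the last two tokens need prefill

node, matched = tree.lookup([5, 6])                # unrelated request
print(matched)                                     # 0 -- full prefill required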
SGLang vs vLLM¶
| Feature | vLLM | SGLang |
|---|---|---|
| Granularity | Block-level (16 tokens) | Token-level |
| Structure | Hash table | Radix tree |
| Sharing | Explicit prefix | Automatic |
| Best for | Simple prefixes, high throughput | Complex branching, agents |
Performance¶
| Scenario | No Cache | RadixAttention | Speedup |
|---|---|---|---|
| Multi-turn chat | 100ms | 35ms | 2.9x |
| System prompt reuse | 200ms | 50ms | 4x |
| RAG with same context | 500ms | 150ms | 3.3x |
6. KVFlow (NeurIPS 2025)¶
Workflow-aware KV cache for multi-agent systems.
[User Request]
|
[Agent A] <-- shared prefix (system prompt)
/ \
[Agent B] [Agent C] <-- different continuations
\ /
[Agent D] <-- aggregation
Innovations¶
| Innovation | Description |
|---|---|
| Workflow prediction | Anticipate agent execution paths |
| Smart preemption | Evict unlikely-to-be-used caches |
| Prefix sharing | Share common prefixes across agents |
| Priority scheduling | Prioritize high-probability paths |
Performance vs SGLang¶
| Scenario | SGLang | KVFlow | Speedup |
|---|---|---|---|
| Single workflow, large prompt | 1.0x | 1.83x | +83% |
| Multi-workflow | 1.0x | 2.19x | +119% |
| Agentic RAG | 1.0x | 1.65x | +65% |
7. Mooncake: Multi-Tier KV Storage¶
Tier 0: GPU Memory (hot prefixes) | Latency: <1ms
Tier 1: CPU Memory (warm prefixes) | Latency: 5-10ms
Tier 2: SSD/Disk (cold prefixes) | Latency: 50-100ms
LRU + frequency-based eviction across tiers
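A minimal sketch of the tiered lookup-and-promote idea (purely illustrative; tier names, capacities, and the pure-LRU demotion policy are assumptions, not Mooncake's implementation):
from collections import OrderedDict

class TieredKVCache:
    def __init__(self, gpu_capacity, cpu_capacity):
        # OrderedDict gives LRU ordering: move_to_end on hit, popitem(last=False) evicts the oldest
        self.tiers = {"gpu": OrderedDict(), "cpu": OrderedDict(), "disk": OrderedDict()}
        self.capacity = {"gpu": gpu_capacity, "cpu": cpu_capacity, "disk": float("inf")}

    def get(self, prefix_hash):
        for tier in ("gpu", "cpu", "disk"):
            if prefix_hash in self.tiers[tier]:
                kv = self.tiers[tier].pop(prefix_hash)
                self.put(prefix_hash, kv)   # promote a hot prefix back to the GPU tier
                return kv
        return None

    def put(self, prefix_hash, kv, tier="gpu"):
        self.tiers[tier][prefix_hash] = kv
        self.tiers[tier].move_to_end(prefix_hash)
        if len(self.tiers[tier]) > self.capacity[tier]:
            # Demote the least recently used entry one tier down (gpu -> cpu -> disk)
            victim, victim_kv = self.tiers[tier].popitem(last=False)
            next_tier = {"gpu": "cpu", "cpu": "disk"}.get(tier)
            if next_tier:
                self.put(victim, victim_kv, next_tier)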
8. LMCache: Distributed KV Cache¶
Paper: Liu et al., arXiv:2510.09665 (Oct 2025)
vLLM Engine / SGLang Engine / Other Engines
|
LMCache Connector
/ | \
GPU Memory CPU Memory Disk/Network
| Workload | Speedup |
|---|---|
| Multi-turn QA | 15x |
| Document analysis | 12x |
| Agent conversations | 8x |
9. Continuous Batching¶
Iteration-level scheduling instead of static batching.
class ContinuousBatcher:
    def __init__(self, model, max_batch_size):
        self.model = model
        self.max_batch_size = max_batch_size
        self.running = []   # requests currently being decoded
        self.waiting = []   # requests queued for admission

    def step(self):
        # 1. Retire finished requests -- their slots free up immediately
        self.running = [r for r in self.running if not r.finished]
        # 2. Admit waiting requests at iteration granularity (no waiting for a full batch)
        while len(self.running) < self.max_batch_size and self.waiting:
            self.running.append(self.waiting.pop(0))
        # 3. Run one decode step for everything currently running
        if self.running:
            tokens = [r.next_token for r in self.running]
            outputs = self.model.forward_batch(tokens)
            for r, out in zip(self.running, outputs):
                r.update(out)
- No waiting for batch fill
- Variable batch size adapts to load
- Lower first-token latency
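To make the sketch above concrete, here is a hypothetical Request object with the attributes the batcher relies on (finished, next_token, update); the names and stopping rule are illustrative:
class Request:
    def __init__(self, prompt_token, max_new_tokens, eos_token=0):
        self.next_token = prompt_token   # token fed into the next decode step
        self.generated = []
        self.max_new_tokens = max_new_tokens
        self.eos_token = eos_token
        self.finished = False

    def update(self, new_token):
        # Record the model output and decide whether this request is done
        self.generated.append(new_token)
        self.next_token = new_token
        if new_token == self.eos_token or len(self.generated) >= self.max_new_tokens:
            self.finished = True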
10. Security: Cache Timing Attacks¶
User A: [System Prompt] + [Query A]
User B: [System Prompt] + [Query B]
If User B's request is faster than expected,
attacker learns User A sent same system prompt.
Mitigations: constant-time cache lookup, per-user cache isolation, cache hit masking, differential privacy.
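The per-user isolation mitigation can be sketched as salting the prefix hash per tenant, so one user's cached prefix can never produce a cache hit (and hence a timing signal) for another user. This is illustrative, not a specific engine's implementation:
import hashlib

def prefix_hash(tokens, tenant_id, salt="deployment-secret"):
    # Same tokens, different tenant -> different hash -> no cross-tenant cache hits
    payload = f"{salt}:{tenant_id}:{tokens}".encode()
    return hashlib.sha256(payload).hexdigest()

print(prefix_hash([1, 2, 3], tenant_id="user_a") == prefix_hash([1, 2, 3], tenant_id="user_b"))  # False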
11. Best Practices¶
Prompt Design for Cache Hits¶
# BAD: User-specific data in system prompt
system_prompt = f"You are helpful. User: {username}" # Changes every request!
# GOOD: Stable prefix, variable suffix
system_prompt = "You are a helpful assistant." # Cacheable!
user_context = f"User: {username}" # Variable part after
Cache-Aware API Usage¶
# Anthropic: explicit cache control
import anthropic

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4.5",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": LONG_SYSTEM_PROMPT,            # stable, cacheable prefix
        "cache_control": {"type": "ephemeral"}
    }],
    messages=messages,
)
12. Production Benchmarks¶
Stripe Case Study (Dec 2025)¶
- 50M API calls, migrated from HF Transformers to vLLM
- Same workload served on ⅓ of the GPU fleet
- Cost reduction: 73%
Throughput (tokens/sec)¶
| Model | Traditional | vLLM | SGLang |
|---|---|---|---|
| Llama-7B | 500 | 2,000 | 2,500 |
| Llama-70B | 50 | 300 | 400 |
| Mixtral-8x7B | 100 | 800 | 1,000 |
Memory Efficiency¶
| System | KV Cache Waste | Effective Batch Size |
|---|---|---|
| HF Transformers | 60-80% | 8 |
| vLLM (PagedAttention) | <4% | 32 |
| SGLang (RadixAttention) | <1% | 64 |
13. Decision Framework¶
| Scenario | Recommended |
|---|---|
| Simple chatbot | vLLM + PagedAttention |
| Multi-turn conversations | vLLM + Prefix Caching |
| Agent workflows | SGLang + RadixAttention |
| Multi-agent systems | KVFlow |
| Very long contexts | LMCache + CPU offload |
| Large-scale multi-tenant | Mooncake + isolation |
| Maximum throughput | SGLang + Continuous batching |
| Simple deployment | vLLM (OpenAI-compatible API) |
Framework Comparison¶
| Feature | vLLM | SGLang | KVFlow | Mooncake |
|---|---|---|---|---|
| Granularity | Block | Token | Workflow | Multi-tier |
| Structure | Hash | Radix tree | Graph | Tiered |
| Agent-aware | No | No | Yes | Partial |
| Distributed | Yes | Yes | Yes | Yes |
For the Interview¶
Q: "What is the KV cache and why is it needed?"¶
The KV cache stores the key-value vectors of previous tokens so that generating a new token does not require recomputing attention over the entire sequence. Without cache: O(t^2); with cache: O(t). Size: 2 x layers x heads x dim x seq_len x bytes. For Llama-70B at 4K tokens -- ~10 GB per request. The problem: with static allocation, 60-80% of memory is wasted due to fragmentation.
Q: "Explain PagedAttention and how it solves the memory problem."¶
PagedAttention (vLLM, SOSP 2023) is an analogy with OS virtual memory. GPU memory is split into fixed-size blocks (16 tokens). Each sequence has a block table (logical -> physical block). Blocks are allocated on demand and may be non-contiguous. Result: <4% waste (vs 60-80%), 4-6x throughput, 2-4x larger batch size. Stripe: 73% cost reduction on 50M API calls. Prefix caching via Copy-on-Write: shared prefix blocks are reused across requests and copied only on write.
Q: "Compare vLLM and SGLang for prefix caching."¶
vLLM: block-level hashing (16 tokens/block), hash table, explicit prefix matching. SGLang: RadixAttention -- token-level granularity, radix tree, automatic prefix detection. SGLang is better for complex branching and agent workflows (2.9-4x speedup). vLLM is better for simple prefixes and high throughput (lower memory overhead). For multi-agent: KVFlow (NeurIPS 2025) -- workflow-aware caching, 2.19x speedup vs SGLang. For large scale: Mooncake multi-tier (GPU/CPU/SSD).
Key Numbers¶
| Fact | Value |
|---|---|
| Traditional allocation waste | 60-80% |
| PagedAttention waste | <4% |
| RadixAttention waste | <1% |
| PagedAttention throughput gain | 4-6x |
| Prefix caching speedup (system prompt) | 4x |
| KVFlow vs SGLang (multi-workflow) | 2.19x |
| LMCache multi-turn QA speedup | 15x |
| Stripe cost reduction (vLLM) | 73% |
| KV cache Llama-70B 4K tokens | ~10 GB |
| Block size (typical) | 16 tokens |
| Cache hit rate (system prompt reuse) | 80-95% |
| Cache hit rate (agent workflows) | 40-75% |
Formulas¶
KV Cache Memory¶
\(M_{KV} = 2 \times L \times H \times d_{head} \times T \times \text{bytes per element}\)
Memory Waste (Traditional)¶
\(\text{waste} = 1 - \dfrac{T_{used}}{T_{reserved}}\)
Block Allocation¶
\(\text{num\_blocks} = \left\lceil \dfrac{T}{\text{block\_size}} \right\rceil\)
Prefix Caching Speedup¶
\(\text{prefill speedup} \approx \dfrac{T_{prefix} + T_{unique}}{T_{unique}}\) (when the shared prefix is already cached)
Cache Hit Rate¶
\(\text{hit rate} = \dfrac{\text{cache hits}}{\text{cache hits} + \text{cache misses}}\)
Self-check
- Compute the KV cache for Mistral-7B (32 layers, 8 KV heads (GQA), dim=128, FP16) at sequence length 8192. Compare with LLaMA-7B (32 layers, 32 heads, dim=128).
- A system serves 50 users with a shared system prompt (1000 tokens) and unique queries (500 tokens each). Estimate the memory savings from prefix caching.
- Explain why RadixAttention (token-level) is better for agent workflows than vLLM's block-level caching (hint: think about branching).
Sources¶
- Kwon et al. -- "Efficient Memory Management for LLM Serving with PagedAttention" (SOSP 2023)
- Liu et al. -- "LMCache: Distributed KV Cache" (arXiv:2510.09665, Oct 2025)
- "KVFlow: Efficient Prefix Caching for LLM-Based Multi-Agent Workflows" (NeurIPS 2025)
- vLLM Documentation -- docs.vllm.ai
- SGLang Documentation -- github.com/sgl-project/sglang
- Mooncake -- "Serving Platform for Kimi LLM"
- Red Hat -- "How PagedAttention Resolves Memory Waste" (July 2025)
- Introl -- "vLLM Production Deployment" (Feb 2026)
- Subhash Dasyam -- "Long-Context Inference Security: KV-Cache Privacy"
- vLLM Blog -- "Inside vLLM: Anatomy of a High-Throughput LLM Inference System" (2025)
See Also¶
- vLLM PagedAttention -- PagedAttention architecture in more detail
- Inference Engines -- vLLM vs SGLang vs TensorRT-LLM
- Speculative Decoding -- another inference acceleration technique
- MQA/GQA Attention -- Multi-Query/Grouped-Query Attention reduce the KV cache by up to 8x
- Long Context -- how to handle long sequences