KV Cache Optimization¶
~7 min read
KV cache memory formula, PagedAttention (vLLM), prefix caching, RadixAttention (SGLang), KVFlow (NeurIPS 2025), Mooncake multi-tier, LMCache distributed, continuous batching, copy-on-write, block manager, cache timing attacks, production benchmarks (2025-2026)
Key Concepts¶
KV Cache -- storing the key-value vectors of previous tokens so attention over them does not have to be recomputed.
Memory Formula¶
KV Cache Memory -- the key formula for system design
\(\text{KV Cache Memory} = 2 \times L \times H \times d_{head} \times T \times \text{bytes per element}\)
where \(L\) = layers, \(H\) = attention heads, \(d_{head}\) = head dim, \(T\) = sequence length, and the factor of 2 accounts for storing both K and V. Example: LLaMA-70B, 100 users x 4K context = ~1 TB of KV cache. Without PagedAttention that does not fit even on an 8xA100 node (640 GB).
| Model | Seq Len | KV Cache per Request |
|---|---|---|
| LLaMA-7B | 4K | 2 GB |
| LLaMA-7B | 32K | 16 GB |
| LLaMA-70B | 4K | ~10 GB |
| LLaMA-70B | 4K, 100 users | ~1 TB |
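A quick sanity check of these numbers (a minimal sketch; the LLaMA-70B row assumes full multi-head attention with 80 layers and 64 heads, i.e. no GQA):
def kv_cache_bytes(layers, heads, head_dim, seq_len, bytes_per_elem=2):
    """KV cache for one request: 2 (K and V) x layers x heads x head_dim x seq_len x bytes."""
    return 2 * layers * heads * head_dim * seq_len * bytes_per_elem

GB = 1024 ** 3
print(kv_cache_bytes(32, 32, 128, 4096) / GB)               # LLaMA-7B, 4K: ~2.0 GB
print(kv_cache_bytes(32, 32, 128, 32_768) / GB)             # LLaMA-7B, 32K: ~16.0 GB
print(kv_cache_bytes(80, 64, 128, 4096) / GB)               # LLaMA-70B (MHA assumption), 4K: ~10.0 GB
print(100 * kv_cache_bytes(80, 64, 128, 4096) / 1024**4)    # 100 users: ~1.0 TB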
Optimization Landscape¶
| Technique | Memory Gain | Throughput Gain | Complexity |
|---|---|---|---|
| PagedAttention | 5x | 4-6x | Medium |
| Prefix Caching | 2-4x | 2-3x | Low |
| RadixAttention | ~5x | 3-4x | Medium |
| KVFlow | ~5x | 2.19x vs SGLang | High |
| KV Quantization | 2-8x | 1.5x | Low |
| Multi-Query Attention | 8x | 2x | High |
1. Why KV Cache¶
Without a KV cache, every new token requires recomputing attention over the ENTIRE sequence:
Cost per generated token without cache: \(O(t^2)\). With cache: \(O(t)\).
# Illustrative pseudocode: `model`, `generate`, and `attention` stand in for real implementations
# Prefill phase: compute K, V for all input tokens at once
K_cache = model.compute_K(input_tokens)  # (seq_len, d_k)
V_cache = model.compute_V(input_tokens)  # (seq_len, d_v)
# Decode phase: compute K, V only for the new token and append to the cache
for new_token in generate():
    K_new = model.compute_K(new_token)   # (1, d_k)
    V_new = model.compute_V(new_token)   # (1, d_v)
    K_cache = torch.cat([K_cache, K_new], dim=0)
    V_cache = torch.cat([V_cache, V_new], dim=0)
    output = attention(model.compute_Q(new_token), K_cache, V_cache)
KV cache scales linearly with batch size
The KV cache size above is per ONE request; when serving N concurrent requests the memory is multiplied by N. LLaMA-70B at 4K context: 1 request = ~10 GB, 100 requests = ~1 TB. This is exactly why PagedAttention and MQA/GQA are critical in production: without them even 8xA100 (640 GB) can serve only a handful of concurrent requests.
2. The Memory Fragmentation Problem¶
Traditional Allocation¶
Request 1: [====____________________] (4K used, 12K alloc)
Request 2: [==______________________] (2K used, 12K alloc)
Request 3: [======__________________] (6K used, 12K alloc)
Problem: Can't compact or share unused memory
Result: 60-80% memory waste
| Problem | Description |
|---|---|
| Internal fragmentation | Request needs 500 tokens, system reserves 4096 -- 87% waste |
| External fragmentation | Enough total free memory, but no contiguous block |
| Over-reservation | Each request reserves max length even for short answers |
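The internal-fragmentation figure from the table, checked directly with the numbers given there:
reserved, used = 4096, 500
print(f"internal fragmentation: {1 - used / reserved:.1%}")  # 87.8% of the reserved slot is never used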
3. PagedAttention (vLLM)¶
Paper: Kwon et al., SOSP 2023
OS Virtual Memory Analogy¶
| OS Concept | PagedAttention Equivalent |
|---|---|
| Virtual memory | Logical KV cache positions |
| Physical pages | GPU memory blocks |
| Page table | Block table |
| Page fault | On-demand allocation |
| Memory sharing | Prefix sharing (Copy-on-Write) |
Block Structure¶
Block = block_size x head_dim x num_heads x 2 (K+V) x bytes
LLaMA-7B: 16 tokens x 128 dim x 32 heads x 2 x 2 bytes = 256 KB/block
Block Table (per sequence):
Logical Block 0 -> Physical Block 42
Logical Block 1 -> Physical Block 17 (non-contiguous!)
Logical Block 2 -> Physical Block 89
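The same arithmetic in code, using the LLaMA-7B-like dimensions from the line above, plus how many blocks a sequence needs (the ceiling division matches the one in the kernel sketch below):
block_size, head_dim, num_heads, bytes_per_elem = 16, 128, 32, 2
block_bytes = block_size * head_dim * num_heads * 2 * bytes_per_elem  # x2 for K and V
print(block_bytes // 1024, "KB per block")                            # 256 KB

seq_len = 4096
num_blocks = (seq_len + block_size - 1) // block_size                 # ceiling division
print(num_blocks, "blocks for a 4K sequence")                         # 256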
PagedAttention Algorithm¶
import torch
import torch.nn.functional as F

def paged_attention(
    query: torch.Tensor,        # [num_tokens, num_heads, head_dim]
    key_cache: torch.Tensor,    # [num_blocks, num_heads, block_size, head_dim]
    value_cache: torch.Tensor,  # [num_blocks, num_heads, block_size, head_dim]
    block_tables: torch.Tensor, # [num_seqs, max_blocks_per_seq]
    context_lens: torch.Tensor, # [num_seqs]
    block_size: int = 16,
) -> torch.Tensor:
    # Reference implementation: gather blocks, then dense attention (causal mask omitted for brevity)
    scale = 1.0 / (query.shape[-1] ** 0.5)
    output = torch.zeros_like(query)
    token_idx = 0
    for seq_idx in range(len(context_lens)):
        seq_len = int(context_lens[seq_idx])
        block_table = block_tables[seq_idx]
        num_blocks = (seq_len + block_size - 1) // block_size
        # Gather keys/values from non-contiguous physical blocks via the block table
        keys = [key_cache[block_table[i]] for i in range(num_blocks)]
        values = [value_cache[block_table[i]] for i in range(num_blocks)]
        seq_keys = torch.cat(keys, dim=1)[:, :seq_len, :]      # [heads, seq_len, dim]
        seq_values = torch.cat(values, dim=1)[:, :seq_len, :]
        seq_query = query[token_idx:token_idx + seq_len]        # [seq_len, heads, dim]
        # Standard attention on the gathered KV
        attn = F.softmax(
            torch.einsum('thd,hsd->ths', seq_query, seq_keys) * scale, dim=-1
        )
        output[token_idx:token_idx + seq_len] = torch.einsum(
            'ths,hsd->thd', attn, seq_values
        )
        token_idx += seq_len
    return output
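A minimal usage example for the reference function above (shapes and the block-table indices are arbitrary illustrations):
import torch

num_blocks, num_heads, block_size, head_dim = 8, 4, 16, 32
key_cache = torch.randn(num_blocks, num_heads, block_size, head_dim)
value_cache = torch.randn(num_blocks, num_heads, block_size, head_dim)
# One sequence of 20 tokens -> 2 blocks; physical ids 5 and 2 are deliberately non-contiguous
block_tables = torch.tensor([[5, 2, 0, 0]])
context_lens = torch.tensor([20])
query = torch.randn(20, num_heads, head_dim)

out = paged_attention(query, key_cache, value_cache, block_tables, context_lens, block_size)
print(out.shape)  # torch.Size([20, 4, 32])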
Block Manager¶
class BlockManager:
def __init__(self, num_blocks, block_size, num_heads, head_dim):
self.block_size = block_size
self.key_cache = torch.zeros(num_blocks, num_heads, block_size, head_dim)
self.value_cache = torch.zeros(num_blocks, num_heads, block_size, head_dim)
self.free_blocks = list(range(num_blocks))
self.block_tables = {}
def allocate(self, seq_id):
block_id = self.free_blocks.pop()
self.block_tables.setdefault(seq_id, []).append(block_id)
return block_id
def free(self, seq_id):
for block_id in self.block_tables.pop(seq_id, []):
self.free_blocks.append(block_id)
def write(self, seq_id, token_idx, keys, values):
block_idx = token_idx // self.block_size
offset = token_idx % self.block_size
while len(self.block_tables[seq_id]) <= block_idx:
self.allocate(seq_id)
physical = self.block_tables[seq_id][block_idx]
self.key_cache[physical, :, offset] = keys
self.value_cache[physical, :, offset] = values
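A small usage sketch of the manager above (dimensions and the "req-1" id are hypothetical):
import torch

manager = BlockManager(num_blocks=64, block_size=16, num_heads=32, head_dim=128)
keys = torch.randn(32, 128)    # K for one token, all heads
values = torch.randn(32, 128)

for pos in range(20):          # 20 tokens -> blocks are allocated on demand as positions grow
    manager.write(seq_id="req-1", token_idx=pos, keys=keys, values=values)

print(manager.block_tables["req-1"])   # e.g. [63, 62] -- physical ids, not necessarily contiguous
manager.free("req-1")                  # blocks return to the free pool when the request finishes
print(len(manager.free_blocks))        # 64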
Performance¶
| System | Memory Utilization | Throughput |
|---|---|---|
| HuggingFace Transformers | 20-40% | 1x |
| Orca | 40-60% | 1.5x |
| vLLM (PagedAttention) | 90%+ | 4-6x |
| Model | Batch | HF Transformers | vLLM | Speedup |
|---|---|---|---|---|
| LLaMA-7B | 32 | 1.2 req/s | 5.8 req/s | 4.8x |
| LLaMA-7B | 128 | OOM | 14.2 req/s | -- |
| LLaMA-13B | 32 | 0.6 req/s | 3.1 req/s | 5.2x |
| LLaMA-70B | 8 | OOM | 0.8 req/s | -- |
4. Prefix Caching¶
Concept¶
If several requests share a common prefix (system prompt, few-shot examples, conversation history), their KV cache can be reused.
Without: 100 users x 1000-token system prompt = 100,000 prefill tokens
With: 1 shared prefill + 100 users x 0 redundant tokens = ~99% savings
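The same savings calculation as a worked example (using only the shared system prompt, as in the lines above):
users, system_prompt_tokens = 100, 1000
without_cache = users * system_prompt_tokens   # 100,000 prefill tokens for the shared prompt
with_cache = system_prompt_tokens              # the shared prefix is prefilled once and reused
print(f"prefill tokens saved: {1 - with_cache / without_cache:.0%}")  # 99%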
vLLM: Block Hashing¶
import hashlib

class PrefixCache:
    # Simplified sketch: real vLLM hashes per fixed-size block (16 tokens), not arbitrary prefixes
    def __init__(self):
        self.cache = {}  # prefix hash -> list of physical block ids

    def compute_hash(self, tokens):
        return hashlib.md5(str(tokens).encode()).hexdigest()

    def add(self, tokens, block_ids):
        # Register the KV blocks of an already-computed prefix for later reuse
        self.cache[self.compute_hash(tokens)] = block_ids

    def get_cached_blocks(self, tokens):
        # Longest-prefix match: try the full prompt first, then progressively shorter prefixes
        for prefix_len in range(len(tokens), 0, -1):
            h = self.compute_hash(tokens[:prefix_len])
            if h in self.cache:
                return self.cache[h], prefix_len
        return [], 0
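A usage example for the sketch above (token ids and block ids are toy values):
cache = PrefixCache()

system_prompt = list(range(1000))                      # token ids of a shared system prompt
cache.add(system_prompt, block_ids=list(range(63)))    # 1000 tokens / 16 per block -> 63 blocks

request = system_prompt + [42, 7, 99]                  # same prefix + a unique user query
blocks, matched = cache.get_cached_blocks(request)
print(matched)                                         # 1000 -- only the 3 new tokens need prefill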
Copy-on-Write (CoW)¶
Block Table 1: [0,1,2,3] -> [unique blocks]
Block Table 2: [0,1,2,3] -> [SHARED] -> [unique]
Block Table 3: [0,1,2,3] -> [SHARED] -> [unique]
Blocks 0-3 are physically shared.
When request writes to shared block:
1. Copy the block
2. Update copy (not original)
Memory savings: 2-4x for shared prefixes
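A minimal sketch of how copy-on-write could be layered on top of the BlockManager above (ref counts and method names are illustrative, not vLLM's actual internals):
class CoWBlockManager(BlockManager):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.ref_counts = {}  # physical block id -> number of sequences sharing it

    def share(self, src_seq, dst_seq, num_blocks):
        # The new sequence points at the same physical prefix blocks -- no copy yet
        shared = self.block_tables[src_seq][:num_blocks]
        self.block_tables[dst_seq] = list(shared)
        for b in shared:
            self.ref_counts[b] = self.ref_counts.get(b, 1) + 1

    def write(self, seq_id, token_idx, keys, values):
        block_idx = token_idx // self.block_size
        table = self.block_tables.get(seq_id, [])
        if block_idx < len(table) and self.ref_counts.get(table[block_idx], 1) > 1:
            # Copy-on-write: the block is shared, so copy it before modifying
            old = table[block_idx]
            new = self.free_blocks.pop()
            self.key_cache[new] = self.key_cache[old].clone()
            self.value_cache[new] = self.value_cache[old].clone()
            self.ref_counts[old] -= 1
            table[block_idx] = new
        super().write(seq_id, token_idx, keys, values)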
5. RadixAttention (SGLang)¶
A radix tree for automatic prefix sharing with token-level granularity.
Features¶
- Token-level granularity (vs block-level в vLLM)
- Automatic prefix detection (no manual specification)
- LRU eviction policy
- O(k) lookup, k = prefix length
Code¶
class RadixTreeNode:
    def __init__(self):
        self.children = {}     # token id -> child node
        self.kv_cache = None   # KV blocks for the prefix ending at this node
        self.ref_count = 0

# Simplified: one token per edge; a real radix tree compresses runs of single-child nodes
class RadixAttention:
    def __init__(self):
        self.root = RadixTreeNode()

    def insert(self, tokens, kv_cache):
        node = self.root
        for token in tokens:
            if token not in node.children:
                node.children[token] = RadixTreeNode()
            node = node.children[token]
        node.kv_cache = kv_cache
        node.ref_count += 1

    def lookup(self, tokens):
        # Walk the tree and remember the deepest node that has cached KV
        node = self.root
        last_match, matched_len = None, 0
        for i, token in enumerate(tokens):
            if token in node.children:
                node = node.children[token]
                if node.kv_cache is not None:
                    last_match, matched_len = node, i + 1
            else:
                break
        return last_match, matched_len
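A short usage example for the simplified tree above (token ids and the cache payload are toy values):
tree = RadixAttention()

shared = [1, 2, 3, 4]                              # e.g. a tokenized system prompt
tree.insert(shared, kv_cache="blocks-for-shared-prefix")

node, matched = tree.lookup([1, 2, 3, 4, 9, 9])    # new request: same prefix + new tokens
print(matched)                                     # 4 -- only the last two tokens need prefill

node, matched = tree.lookup([5, 6])                # unrelated request
print(matched)                                     # 0 -- full prefill required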
SGLang vs vLLM¶
| Feature | vLLM | SGLang |
|---|---|---|
| Granularity | Block-level (16 tokens) | Token-level |
| Structure | Hash table | Radix tree |
| Sharing | Explicit prefix | Automatic |
| Best for | Simple prefixes, high throughput | Complex branching, agents |
Performance¶
| Scenario | No Cache | RadixAttention | Speedup |
|---|---|---|---|
| Multi-turn chat | 100ms | 35ms | 2.9x |
| System prompt reuse | 200ms | 50ms | 4x |
| RAG with same context | 500ms | 150ms | 3.3x |
6. KVFlow (NeurIPS 2025)¶
Workflow-aware KV cache for multi-agent systems.
[User Request]
|
[Agent A] <-- shared prefix (system prompt)
/ \
[Agent B] [Agent C] <-- different continuations
\ /
[Agent D] <-- aggregation
Innovations¶
| Innovation | Description |
|---|---|
| Workflow prediction | Anticipate agent execution paths |
| Smart preemption | Evict unlikely-to-be-used caches |
| Prefix sharing | Share common prefixes across agents |
| Priority scheduling | Prioritize high-probability paths |
Performance vs SGLang¶
| Scenario | SGLang | KVFlow | Speedup |
|---|---|---|---|
| Single workflow, large prompt | 1.0x | 1.83x | +83% |
| Multi-workflow | 1.0x | 2.19x | +119% |
| Agentic RAG | 1.0x | 1.65x | +65% |
7. Mooncake: Multi-Tier KV Storage¶
Tier 0: GPU Memory (hot prefixes) | Latency: <1ms
Tier 1: CPU Memory (warm prefixes) | Latency: 5-10ms
Tier 2: SSD/Disk (cold prefixes) | Latency: 50-100ms
LRU + frequency-based eviction across tiers
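A minimal sketch of the tiered lookup-and-promote idea (purely illustrative; tier names, capacities, and the pure-LRU demotion policy are assumptions, not Mooncake's implementation):
from collections import OrderedDict

class TieredKVCache:
    def __init__(self, gpu_capacity, cpu_capacity):
        # OrderedDict gives LRU ordering: move_to_end on hit, popitem(last=False) evicts the oldest
        self.tiers = {"gpu": OrderedDict(), "cpu": OrderedDict(), "disk": OrderedDict()}
        self.capacity = {"gpu": gpu_capacity, "cpu": cpu_capacity, "disk": float("inf")}

    def get(self, prefix_hash):
        for tier in ("gpu", "cpu", "disk"):
            if prefix_hash in self.tiers[tier]:
                kv = self.tiers[tier].pop(prefix_hash)
                self.put(prefix_hash, kv)   # promote a hot prefix back to the GPU tier
                return kv
        return None

    def put(self, prefix_hash, kv, tier="gpu"):
        self.tiers[tier][prefix_hash] = kv
        self.tiers[tier].move_to_end(prefix_hash)
        if len(self.tiers[tier]) > self.capacity[tier]:
            # Demote the least recently used entry one tier down (gpu -> cpu -> disk)
            victim, victim_kv = self.tiers[tier].popitem(last=False)
            next_tier = {"gpu": "cpu", "cpu": "disk"}.get(tier)
            if next_tier:
                self.put(victim, victim_kv, next_tier)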
8. LMCache: Distributed KV Cache¶
Paper: Liu et al., arXiv:2510.09665 (Oct 2025)
vLLM Engine / SGLang Engine / Other Engines
|
LMCache Connector
/ | \
GPU Memory CPU Memory Disk/Network
| Workload | Speedup |
|---|---|
| Multi-turn QA | 15x |
| Document analysis | 12x |
| Agent conversations | 8x |
9. Continuous Batching¶
Iteration-level scheduling instead of static batching.
class ContinuousBatcher:
    def __init__(self, model, max_batch_size):
        self.model = model
        self.max_batch_size = max_batch_size
        self.running = []   # requests currently being decoded
        self.waiting = []   # requests queued for admission

    def step(self):
        # 1. Retire finished requests -- their slots free up immediately
        self.running = [r for r in self.running if not r.finished]
        # 2. Admit waiting requests at iteration granularity (no waiting for a full batch)
        while len(self.running) < self.max_batch_size and self.waiting:
            self.running.append(self.waiting.pop(0))
        # 3. Run one decode step for everything currently running
        if self.running:
            tokens = [r.next_token for r in self.running]
            outputs = self.model.forward_batch(tokens)
            for r, out in zip(self.running, outputs):
                r.update(out)
- No waiting for batch fill
- Variable batch size adapts to load
- Lower first-token latency
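To make the sketch above concrete, here is a hypothetical Request object with the attributes the batcher relies on (finished, next_token, update); the names and stopping rule are illustrative:
class Request:
    def __init__(self, prompt_token, max_new_tokens, eos_token=0):
        self.next_token = prompt_token   # token fed into the next decode step
        self.generated = []
        self.max_new_tokens = max_new_tokens
        self.eos_token = eos_token
        self.finished = False

    def update(self, new_token):
        # Record the model output and decide whether this request is done
        self.generated.append(new_token)
        self.next_token = new_token
        if new_token == self.eos_token or len(self.generated) >= self.max_new_tokens:
            self.finished = True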
10. Security: Cache Timing Attacks¶
User A: [System Prompt] + [Query A]
User B: [System Prompt] + [Query B]
If User B's request is faster than expected,
attacker learns User A sent same system prompt.
Mitigations: constant-time cache lookup, per-user cache isolation, cache hit masking, differential privacy.
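The per-user isolation mitigation can be sketched as salting the prefix hash per tenant, so one user's cached prefix can never produce a cache hit (and hence a timing signal) for another user. This is illustrative, not a specific engine's implementation:
import hashlib

def prefix_hash(tokens, tenant_id, salt="deployment-secret"):
    # Same tokens, different tenant -> different hash -> no cross-tenant cache hits
    payload = f"{salt}:{tenant_id}:{tokens}".encode()
    return hashlib.sha256(payload).hexdigest()

print(prefix_hash([1, 2, 3], tenant_id="user_a") == prefix_hash([1, 2, 3], tenant_id="user_b"))  # False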
11. Best Practices¶
Prompt Design for Cache Hits¶
# BAD: User-specific data in system prompt
system_prompt = f"You are helpful. User: {username}" # Changes every request!
# GOOD: Stable prefix, variable suffix
system_prompt = "You are a helpful assistant." # Cacheable!
user_context = f"User: {username}" # Variable part after
Cache-Aware API Usage¶
# Anthropic: explicit cache control
import anthropic

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4.5",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": LONG_SYSTEM_PROMPT,            # stable, cacheable prefix
        "cache_control": {"type": "ephemeral"}
    }],
    messages=messages,
)
12. Production Benchmarks¶
Stripe Case Study (Dec 2025)¶
- 50M API calls, migrated from HF Transformers to vLLM
- Same workload served on ⅓ of the GPU fleet
- Cost reduction: 73%
Throughput (tokens/sec)¶
| Model | Traditional | vLLM | SGLang |
|---|---|---|---|
| Llama-7B | 500 | 2,000 | 2,500 |
| Llama-70B | 50 | 300 | 400 |
| Mixtral-8x7B | 100 | 800 | 1,000 |
Memory Efficiency¶
| System | KV Cache Waste | Effective Batch Size |
|---|---|---|
| HF Transformers | 60-80% | 8 |
| vLLM (PagedAttention) | <4% | 32 |
| SGLang (RadixAttention) | <1% | 64 |
13. Decision Framework¶
| Scenario | Recommended |
|---|---|
| Simple chatbot | vLLM + PagedAttention |
| Multi-turn conversations | vLLM + Prefix Caching |
| Agent workflows | SGLang + RadixAttention |
| Multi-agent systems | KVFlow |
| Very long contexts | LMCache + CPU offload |
| Large-scale multi-tenant | Mooncake + isolation |
| Maximum throughput | SGLang + Continuous batching |
| Simple deployment | vLLM (OpenAI-compatible API) |
Framework Comparison¶
| Feature | vLLM | SGLang | KVFlow | Mooncake |
|---|---|---|---|---|
| Granularity | Block | Token | Workflow | Multi-tier |
| Structure | Hash | Radix tree | Graph | Tiered |
| Agent-aware | No | No | Yes | Partial |
| Distributed | Yes | Yes | Yes | Yes |
For the Interview¶
Q: "What is the KV cache and why is it needed?"¶
The KV cache stores the key-value vectors of previous tokens so that generating a new token does not require recomputing attention over the entire sequence. Without cache: O(t^2); with cache: O(t). Size: 2 x layers x heads x dim x seq_len x bytes. For Llama-70B at 4K tokens -- ~10 GB per request. The problem: with static allocation, 60-80% of memory is wasted due to fragmentation.
Q: "Explain PagedAttention and how it solves the memory problem."¶
PagedAttention (vLLM, SOSP 2023) is an analogy with OS virtual memory. GPU memory is split into fixed-size blocks (16 tokens). Each sequence has a block table (logical -> physical block). Blocks are allocated on demand and may be non-contiguous. Result: <4% waste (vs 60-80%), 4-6x throughput, 2-4x larger batch size. Stripe: 73% cost reduction on 50M API calls. Prefix caching via Copy-on-Write: shared prefix blocks are reused across requests and copied only on write.
Q: "Compare vLLM and SGLang for prefix caching."¶
vLLM: block-level hashing (16 tokens/block), hash table, explicit prefix matching. SGLang: RadixAttention -- token-level granularity, radix tree, automatic prefix detection. SGLang is better for complex branching and agent workflows (2.9-4x speedup). vLLM is better for simple prefixes and high throughput (lower memory overhead). For multi-agent: KVFlow (NeurIPS 2025) -- workflow-aware caching, 2.19x speedup vs SGLang. For large scale: Mooncake multi-tier (GPU/CPU/SSD).
Key Numbers¶
| Fact | Value |
|---|---|
| Traditional allocation waste | 60-80% |
| PagedAttention waste | <4% |
| RadixAttention waste | <1% |
| PagedAttention throughput gain | 4-6x |
| Prefix caching speedup (system prompt) | 4x |
| KVFlow vs SGLang (multi-workflow) | 2.19x |
| LMCache multi-turn QA speedup | 15x |
| Stripe cost reduction (vLLM) | 73% |
| KV cache Llama-70B 4K tokens | ~10 GB |
| Block size (typical) | 16 tokens |
| Cache hit rate (system prompt reuse) | 80-95% |
| Cache hit rate (agent workflows) | 40-75% |
Formulas¶
KV Cache Memory¶
\(M_{KV} = 2 \times L \times H \times d_{head} \times T \times \text{bytes per element}\)
Memory Waste (Traditional)¶
\(\text{waste} = 1 - \dfrac{T_{used}}{T_{reserved}}\)
Block Allocation¶
\(\text{num\_blocks} = \left\lceil \dfrac{T}{\text{block\_size}} \right\rceil\)
Prefix Caching Speedup¶
\(\text{prefill speedup} \approx \dfrac{T_{prefix} + T_{unique}}{T_{unique}}\) (when the shared prefix is already cached)
Cache Hit Rate¶
\(\text{hit rate} = \dfrac{\text{cache hits}}{\text{cache hits} + \text{cache misses}}\)
Self-check
- Compute the KV cache for Mistral-7B (32 layers, 8 KV heads (GQA), dim=128, FP16) at sequence length 8192. Compare with LLaMA-7B (32 layers, 32 heads, dim=128).
- A system serves 50 users with a shared system prompt (1000 tokens) and unique queries (500 tokens each). Estimate the memory savings from prefix caching.
- Explain why RadixAttention (token-level) is better for agent workflows than vLLM's block-level caching (hint: think about branching).
Sources¶
- Kwon et al. -- "Efficient Memory Management for LLM Serving with PagedAttention" (SOSP 2023)
- Liu et al. -- "LMCache: Distributed KV Cache" (arXiv:2510.09665, Oct 2025)
- "KVFlow: Efficient Prefix Caching for LLM-Based Multi-Agent Workflows" (NeurIPS 2025)
- vLLM Documentation -- docs.vllm.ai
- SGLang Documentation -- github.com/sgl-project/sglang
- Mooncake -- "Serving Platform for Kimi LLM"
- Red Hat -- "How PagedAttention Resolves Memory Waste" (July 2025)
- Introl -- "vLLM Production Deployment" (Feb 2026)
- Subhash Dasyam -- "Long-Context Inference Security: KV-Cache Privacy"
- vLLM Blog -- "Inside vLLM: Anatomy of a High-Throughput LLM Inference System" (2025)
See Also¶
- vLLM PagedAttention -- PagedAttention architecture in more detail
- Inference Engines -- vLLM vs SGLang vs TensorRT-LLM
- Speculative Decoding -- another inference acceleration technique
- MQA/GQA Attention -- Multi-Query/Grouped-Query Attention reduce the KV cache by up to 8x
- Long Context -- how to handle long sequences