Attention Sinks & Streaming LLM¶
~4 minute read
Prerequisites: Efficient Transformers | vLLM & Paged Attention
Emerging 2026: Training-free infinite-length generation via attention sink preservation
The Problem: Streaming Inference Fails¶
Challenge: LLMs on infinite input streams (chatbots, long-form generation) face two problems:
1. KV cache explosion -- memory grows linearly with sequence length
2. Performance collapse -- perplexity (PPL) increases dramatically once the sequence exceeds the training window
Naive eviction fails:
# Naive: evict the oldest tokens when the cache is full
# Result: perplexity EXPLODES (45.2 -> unusable)
def naive_eviction(cache, new_tokens, max_size):
    cache.extend(new_tokens)
    if len(cache) > max_size:
        cache = cache[-max_size:]  # keep only the most recent tokens
    return cache

# WARNING: This breaks the model completely!
Why it breaks: Attention scores must sum to 1 (softmax). When the query doesn't strongly relate to any previous token, the model allocates the excess attention to the early tokens as a "fallback" -- removing them destroys the attention distribution.
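A toy illustration of that softmax constraint (plain PyTorch, no model needed): even when every score is weak, the weights still sum to 1, so some position has to absorb the mass.

```python
import torch

# Toy example: attention scores where the query matches nothing well.
# Softmax still hands out a full unit of probability mass -- some position
# (here the first one) absorbs most of it as a "fallback".
scores = torch.tensor([0.1, -2.0, -1.8, -2.1, -1.9])
weights = torch.softmax(scores, dim=-1)
print(weights)        # ~[0.66, 0.08, 0.10, 0.07, 0.09]
print(weights.sum())  # 1.0
```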
The Attention Sink Phenomenon¶
Discovery (Xiao et al., 2024): First tokens receive disproportionately high attention scores, regardless of content.
Why it happens:
1. Softmax constraint -- attention weights must sum to 1
2. First-token visibility -- the first token is visible to ALL subsequent tokens
3. Fallback behavior -- when no strong match exists, the model "dumps" excess attention on the first token
Visual:
Token positions:   [0]   [1]   [2]   [3]   [4]   ...   [N]
Attention weights: 0.35  0.02  0.03  0.05  0.04  ...   0.02
                   ^^^^
                   SINK - absorbs excess attention
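One way to see the sink directly is to inspect the attention maps of a decoder-only model. A minimal sketch with Hugging Face transformers (GPT-2 is used here only because it is small; the paper reports the effect across Llama-2, MPT, Falcon, and Pythia):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("Attention sinks absorb the excess probability mass.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: one (batch, heads, seq, seq) tensor per layer
attn = out.attentions[-1][0]   # last layer, first batch item
to_pos0 = attn[:, -1, 0]       # attention of the last query to token 0
print(f"Mean attention to position 0: {to_pos0.mean():.3f}")
```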
StreamingLLM Solution (MIT HAN Lab)¶
Core idea: Preserve attention sinks + recent tokens, discard middle.
Cache structure:
+-------------+---------------------------+----------------------+
| Sink Tokens | Evicted (not in cache)    | Recent Tokens        |
| (always     |                           | (sliding window)     |
| kept)       |                           |                      |
+-------------+---------------------------+----------------------+
| [0,1,2,3]   | [4, 5, 6, ... N-50]       | [N-49, ... N-1, N]   |
+-------------+---------------------------+----------------------+
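The eviction policy in the diagram reduces to a simple index selection; a minimal sketch (function name and defaults are illustrative):

```python
def kept_indices(cache_len: int, n_sinks: int = 4, window_size: int = 2044) -> list:
    """Indices retained after eviction: sink positions + the most recent window."""
    if cache_len <= n_sinks + window_size:
        return list(range(cache_len))  # cache not full yet, keep everything
    return list(range(n_sinks)) + list(range(cache_len - window_size, cache_len))

print(kept_indices(10, n_sinks=2, window_size=4))   # [0, 1, 6, 7, 8, 9]
```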
Implementation:
class StreamingLLMCache:
    """Streaming LLM with attention sink preservation"""

    def __init__(self, model, n_sinks=4, window_size=2044):
        self.model = model
        self.n_sinks = n_sinks            # keep first N tokens (attention sinks)
        self.window_size = window_size    # sliding window of recent tokens
        self.max_cache = n_sinks + window_size
        self.kv_cache = []

    def generate(self, prompt: str, max_new_tokens: int):
        """Generate with streaming cache management"""
        tokens = self.model.tokenize(prompt)
        self.kv_cache = []
        for token in tokens:
            self._add_token(token)

        generated = []
        for _ in range(max_new_tokens):
            logits = self.model.forward(self.kv_cache)
            next_token = self.model.sample(logits)
            generated.append(next_token)
            self._add_token(next_token)
        return self.model.detokenize(generated)

    def _add_token(self, token):
        """Add token, evicting middle tokens if needed"""
        self.kv_cache.append(token)
        if len(self.kv_cache) > self.max_cache:
            sink_tokens = self.kv_cache[:self.n_sinks]
            recent_tokens = self.kv_cache[-self.window_size:]
            self.kv_cache = sink_tokens + recent_tokens
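The class above tracks token IDs only, for readability. In a real pipeline the same policy is applied to the KV tensors themselves; below is a hedged sketch against the legacy Hugging Face `past_key_values` layout (a tuple of per-layer `(key, value)` tensors shaped `(batch, heads, seq, head_dim)`; newer transformers versions use `Cache` objects instead). Note that StreamingLLM also assigns positional encodings relative to positions inside the cache rather than in the original text, which this sketch omits.

```python
import torch

def evict_kv(past_key_values, n_sinks=4, window_size=2044):
    """Keep sink + recent entries along the sequence dimension of each layer's KV."""
    new_past = []
    for key, value in past_key_values:                 # one (key, value) pair per layer
        seq_len = key.shape[-2]
        if seq_len <= n_sinks + window_size:
            new_past.append((key, value))              # nothing to evict yet
            continue
        k = torch.cat([key[..., :n_sinks, :], key[..., -window_size:, :]], dim=-2)
        v = torch.cat([value[..., :n_sinks, :], value[..., -window_size:, :]], dim=-2)
        new_past.append((k, v))
    return tuple(new_past)
```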
Perplexity Impact¶
| Sink Tokens | Perplexity | Status |
|---|---|---|
| 0 (naive eviction) | 45.2 | Broken |
| 1 | 18.7 | Degraded |
| 2 | 12.3 | Improving |
| 4 | 9.8 | Stable |
| 8 | 9.7 | Marginal improvement |
Rule of thumb: 4 sink tokens are sufficient for most models
Memory Savings¶
Example: Generating 10,000 tokens

- Full KV cache: 703 MB
- StreamingLLM (4 sinks + 2044 window): 144 MB
- Savings: 79.5%
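The savings fraction follows directly from the cached-length ratio: (4 + 2044) / 10,000 ≈ 20.5% of the full cache, i.e. ~79.5% saved. A rough estimator using the standard KV-cache formula (the model config below is a placeholder; absolute MB values depend on the actual architecture and dtype):

```python
def kv_cache_mb(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """2 tensors (K and V) per layer, n_kv_heads * head_dim values per token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value / 1024**2

# Hypothetical 32-layer model, 32 KV heads of dim 128, fp16
full = kv_cache_mb(10_000, n_layers=32, n_kv_heads=32, head_dim=128)
streaming = kv_cache_mb(4 + 2044, n_layers=32, n_kv_heads=32, head_dim=128)
print(f"full: {full:.0f} MB, streaming: {streaming:.0f} MB, "
      f"saved: {1 - streaming / full:.1%}")   # saved: 79.5%
```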
What StreamingLLM Does NOT Do¶
Important clarification:
- Does NOT extend the effective context window
- Does NOT let the model "remember" beyond the cache size
- DOES enable infinite generation without crashing
- DOES maintain stable perplexity
Example scenario:
Input: 1000 tokens
Attention window: 500
Answer location: tokens 100-200
Result: Model CANNOT see the answer
-> StreamingLLM doesn't solve this (use RAG instead)
Related Techniques¶
| Technique | Problem Solved | How It Works |
|---|---|---|
| StreamingLLM | Infinite generation crash | Preserve sink tokens |
| RoPE Scaling | Extend context window | Interpolate position embeddings |
| ALiBi | Position extrapolation | Relative position bias |
| LongLoRA | Efficient long-context fine-tuning | Sparse local attention |
| Ring Attention | Distributed long context | Sequence parallelism |
Interview Q&A¶
Q: What is an attention sink and why does it matter?
A: An attention sink is the phenomenon where the first tokens in a sequence receive disproportionately high attention scores, regardless of their content. The cause: the softmax constraint (attention scores sum to 1) plus the visibility of early tokens to every later query. When a query has no strong match with previous tokens, the model "dumps" excess attention onto the first token. Remove the sink tokens and perplexity explodes (45.2 vs. a normal 9.8).
Q: How does StreamingLLM work?
A: StreamingLLM (MIT HAN Lab) is a training-free method for infinite generation. Cache structure: [sink tokens (4)] + [recent window (2044)]. On eviction the middle tokens are dropped while sink and recent tokens are kept. Memory savings: 79.5% (703 MB -> 144 MB for 10K tokens). Key insight: the model stays stable as long as the attention distribution remains close to normal.
Q: What does StreamingLLM NOT do?
A: (1) It does NOT extend the effective context window -- the model only sees what is in the cache. (2) It does NOT solve "needle in a haystack" -- if the answer is outside the cache, the model will not find it. (3) It does NOT replace RAG for retrieval. StreamingLLM solves exactly one problem: infinite generation without a crash. For long context you need other techniques (RoPE scaling, Ring Attention, etc.).
Q: How many sink tokens are needed?
A: Empirically, 4 tokens are enough for most models. Perplexity comparison: 0 sinks = 45.2 (broken), 1 sink = 18.7, 4 sinks = 9.8 (stable), 8 sinks = 9.7 (marginal improvement). Rule of thumb: start with 4 and increase if you see degradation on long sequences. Position 0 (the BOS token) is almost always a sink.
See Also¶
- Efficient Transformers -- FlashAttention, MQA/GQA
- vLLM & Paged Attention -- KV cache management
- Long Context -- RoPE scaling for context extension