Attention Sinks & Streaming LLM¶
~4 minute read
Prerequisites: Efficient Transformers | vLLM & Paged Attention
Emerging 2026: Training-free infinite-length generation via attention sink preservation
The Problem: Streaming Inference Fails¶
Challenge: LLMs on infinite input streams (chatbots, long-form generation) face two problems:
1. KV cache explosion -- memory grows linearly with sequence length
2. Performance collapse -- perplexity (PPL) increases dramatically once the sequence exceeds the training window
Naive eviction fails:
# Naive: evict the oldest tokens when the cache is full
# Result: perplexity EXPLODES (45.2 -> unusable)
def naive_eviction(cache, new_tokens, max_size):
    cache.extend(new_tokens)
    if len(cache) > max_size:
        cache = cache[-max_size:]  # keep only the most recent tokens
    return cache

# WARNING: This breaks the model completely!
Why it breaks: Attention scores must sum to 1 (softmax). When the query doesn't strongly relate to any previous token, the model allocates the excess attention to the early tokens as a "fallback" -- removing them destroys the attention distribution.
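A toy illustration of that softmax constraint (plain PyTorch, no model needed): even when every score is weak, the weights still sum to 1, so some position has to absorb the mass.

```python
import torch

# Toy example: attention scores where the query matches nothing well.
# Softmax still hands out a full unit of probability mass -- some position
# (here the first one) absorbs most of it as a "fallback".
scores = torch.tensor([0.1, -2.0, -1.8, -2.1, -1.9])
weights = torch.softmax(scores, dim=-1)
print(weights)        # ~[0.66, 0.08, 0.10, 0.07, 0.09]
print(weights.sum())  # 1.0
```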
The Attention Sink Phenomenon¶
Discovery (Xiao et al., 2024): First tokens receive disproportionately high attention scores, regardless of content.
Why it happens:
1. Softmax constraint -- attention weights must sum to 1
2. First-token visibility -- the first token is visible to ALL subsequent tokens
3. Fallback behavior -- when no strong match exists, the model "dumps" excess attention on the first token
Visual:
Token positions:   [0]   [1]   [2]   [3]   [4]   ...   [N]
Attention weights: 0.35  0.02  0.03  0.05  0.04  ...   0.02
                   ^^^^
                   SINK - absorbs excess attention
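One way to see the sink directly is to inspect the attention maps of a decoder-only model. A minimal sketch with Hugging Face transformers (GPT-2 is used here only because it is small; the paper reports the effect across Llama-2, MPT, Falcon, and Pythia):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("Attention sinks absorb the excess probability mass.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: one (batch, heads, seq, seq) tensor per layer
attn = out.attentions[-1][0]   # last layer, first batch item
to_pos0 = attn[:, -1, 0]       # attention of the last query to token 0
print(f"Mean attention to position 0: {to_pos0.mean():.3f}")
```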
StreamingLLM Solution (MIT HAN Lab)¶
Core idea: Preserve attention sinks + recent tokens, discard middle.
Cache structure:
+-------------+---------------------------+----------------------+
| Sink Tokens | Evicted (not in cache)    | Recent Tokens        |
| (always     |                           | (sliding window)     |
| kept)       |                           |                      |
+-------------+---------------------------+----------------------+
| [0,1,2,3]   | [4, 5, 6, ... N-50]       | [N-49, ... N-1, N]   |
+-------------+---------------------------+----------------------+
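The eviction policy in the diagram reduces to a simple index selection; a minimal sketch (function name and defaults are illustrative):

```python
def kept_indices(cache_len: int, n_sinks: int = 4, window_size: int = 2044) -> list:
    """Indices retained after eviction: sink positions + the most recent window."""
    if cache_len <= n_sinks + window_size:
        return list(range(cache_len))  # cache not full yet, keep everything
    return list(range(n_sinks)) + list(range(cache_len - window_size, cache_len))

print(kept_indices(10, n_sinks=2, window_size=4))   # [0, 1, 6, 7, 8, 9]
```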
Implementation:
class StreamingLLMCache:
    """Streaming LLM with attention sink preservation"""

    def __init__(self, model, n_sinks=4, window_size=2044):
        self.model = model
        self.n_sinks = n_sinks            # keep first N tokens (attention sinks)
        self.window_size = window_size    # sliding window of recent tokens
        self.max_cache = n_sinks + window_size
        self.kv_cache = []

    def generate(self, prompt: str, max_new_tokens: int):
        """Generate with streaming cache management"""
        tokens = self.model.tokenize(prompt)
        self.kv_cache = []
        for token in tokens:
            self._add_token(token)

        generated = []
        for _ in range(max_new_tokens):
            logits = self.model.forward(self.kv_cache)
            next_token = self.model.sample(logits)
            generated.append(next_token)
            self._add_token(next_token)
        return self.model.detokenize(generated)

    def _add_token(self, token):
        """Add token, evicting middle tokens if needed"""
        self.kv_cache.append(token)
        if len(self.kv_cache) > self.max_cache:
            sink_tokens = self.kv_cache[:self.n_sinks]
            recent_tokens = self.kv_cache[-self.window_size:]
            self.kv_cache = sink_tokens + recent_tokens
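The class above tracks token IDs only, for readability. In a real pipeline the same policy is applied to the KV tensors themselves; below is a hedged sketch against the legacy Hugging Face `past_key_values` layout (a tuple of per-layer `(key, value)` tensors shaped `(batch, heads, seq, head_dim)`; newer transformers versions use `Cache` objects instead). Note that StreamingLLM also assigns positional encodings relative to positions inside the cache rather than in the original text, which this sketch omits.

```python
import torch

def evict_kv(past_key_values, n_sinks=4, window_size=2044):
    """Keep sink + recent entries along the sequence dimension of each layer's KV."""
    new_past = []
    for key, value in past_key_values:                 # one (key, value) pair per layer
        seq_len = key.shape[-2]
        if seq_len <= n_sinks + window_size:
            new_past.append((key, value))              # nothing to evict yet
            continue
        k = torch.cat([key[..., :n_sinks, :], key[..., -window_size:, :]], dim=-2)
        v = torch.cat([value[..., :n_sinks, :], value[..., -window_size:, :]], dim=-2)
        new_past.append((k, v))
    return tuple(new_past)
```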
Perplexity Impact¶
| Sink Tokens | Perplexity | Status |
|---|---|---|
| 0 (naive eviction) | 45.2 | Broken |
| 1 | 18.7 | Degraded |
| 2 | 12.3 | Improving |
| 4 | 9.8 | Stable |
| 8 | 9.7 | Marginal improvement |
Rule of thumb: 4 sink tokens are sufficient for most models
Memory Savings¶
Example: Generating 10,000 tokens

- Full KV cache: 703 MB
- StreamingLLM (4 sinks + 2044 window): 144 MB
- Savings: 79.5%
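The savings fraction follows directly from the cached-length ratio: (4 + 2044) / 10,000 ≈ 20.5% of the full cache, i.e. ~79.5% saved. A rough estimator using the standard KV-cache formula (the model config below is a placeholder; absolute MB values depend on the actual architecture and dtype):

```python
def kv_cache_mb(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """2 tensors (K and V) per layer, n_kv_heads * head_dim values per token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value / 1024**2

# Hypothetical 32-layer model, 32 KV heads of dim 128, fp16
full = kv_cache_mb(10_000, n_layers=32, n_kv_heads=32, head_dim=128)
streaming = kv_cache_mb(4 + 2044, n_layers=32, n_kv_heads=32, head_dim=128)
print(f"full: {full:.0f} MB, streaming: {streaming:.0f} MB, "
      f"saved: {1 - streaming / full:.1%}")   # saved: 79.5%
```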
What StreamingLLM Does NOT Do¶
Important clarification:
- Does NOT extend the effective context window
- Does NOT let the model "remember" beyond the cache size
- DOES enable infinite generation without crashing
- DOES maintain stable perplexity
Example scenario:
Input: 1000 tokens
Attention window: 500
Answer location: tokens 100-200
Result: Model CANNOT see the answer
-> StreamingLLM doesn't solve this (use RAG instead)
Related Techniques¶
| Technique | Problem Solved | How It Works |
|---|---|---|
| StreamingLLM | Infinite generation crash | Preserve sink tokens |
| RoPE Scaling | Extend context window | Interpolate position embeddings |
| ALiBi | Position extrapolation | Relative position bias |
| LongLoRA | Efficient long-context fine-tuning | Sparse local attention |
| Ring Attention | Distributed long context | Sequence parallelism |
Interview Q&A¶
Q: What is an attention sink and why does it matter?
A: An attention sink is the phenomenon where the first tokens in a sequence receive disproportionately high attention scores, regardless of their content. The cause: the softmax constraint (attention scores sum to 1) plus the visibility of early tokens to every later query. When a query has no strong match with previous tokens, the model "dumps" excess attention onto the first token. Remove the sink tokens and perplexity explodes (45.2 vs. a normal 9.8).
Q: How does StreamingLLM work?
A: StreamingLLM (MIT HAN Lab) is a training-free method for infinite generation. Cache structure: [sink tokens (4)] + [recent window (2044)]. On eviction the middle tokens are dropped while sink and recent tokens are kept. Memory savings: 79.5% (703 MB -> 144 MB for 10K tokens). Key insight: the model stays stable as long as the attention distribution remains close to normal.
Q: What does StreamingLLM NOT do?
A: (1) It does NOT extend the effective context window -- the model only sees what is in the cache. (2) It does NOT solve "needle in a haystack" -- if the answer is outside the cache, the model will not find it. (3) It does NOT replace RAG for retrieval. StreamingLLM solves exactly one problem: infinite generation without a crash. For long context you need other techniques (RoPE scaling, Ring Attention, etc.).
Q: How many sink tokens are needed?
A: Empirically, 4 tokens are enough for most models. Perplexity comparison: 0 sinks = 45.2 (broken), 1 sink = 18.7, 4 sinks = 9.8 (stable), 8 sinks = 9.7 (marginal improvement). Rule of thumb: start with 4 and increase if you see degradation on long sequences. Position 0 (the BOS token) is almost always a sink.
See Also¶
- Efficient Transformers -- FlashAttention, MQA/GQA
- vLLM & Paged Attention -- KV cache management
- Long Context -- RoPE scaling for context extension