Long context: scaling to millions of tokens¶


Prerequisites: KV Cache Optimization | RoPE and Context Extension

LLM context windows grew from 2K tokens (GPT-3, 2020) to 10M+ (Grok, 2025), a 5000x increase in 5 years. But long context is not free: the KV cache for Llama-70B at 1M tokens takes ~328 GB in FP16, more than four 80 GB A100s can hold, and recall drops to about 26% at 1M tokens (Gemini 3 Pro, MRCR v2). Ring Attention, Infini-Attention, and Star Attention attack the problem from different sides: distributed exact attention, bounded O(1) memory, and block-sparse attention with an 11x speedup. Even so, for most production tasks 128K with good RAG beats 1M-token stuffing on both quality and cost (roughly 10-30x cheaper according to the cost comparison table).

Key concepts¶

The problem: KV cache memory grows with context length¶

\[\text{KV Cache} = 2 \times L \times H_{kv} \times d_{head} \times n \times \text{bytes\_per\_element}\]

Example (Llama-70B, 1M tokens, FP16):

- $L = 80$, $H_{kv} = 8$ (GQA), $d_{head} = 128$
- KV Cache = $2 \times 80 \times 8 \times 128 \times 1{,}000{,}000 \times 2$ bytes = ~328 GB
- Without GQA ($H_{kv} = 64$) it would be ~2.6 TB
- One GPU has 80 GB, so the cache alone needs at least five of them. Distributed solutions are required.
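The arithmetic in this example can be reproduced with a small helper. This is a minimal sketch: the function names `kv_cache_bytes` and `gpus_needed` are illustrative, decimal gigabytes are assumed, and model weights and activations are ignored.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_element=2):
    """KV cache size: 2 (K and V) x layers x KV heads x head dim x tokens x element size."""
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_element


def gpus_needed(cache_bytes, gpu_memory_gb=80):
    """Smallest number of GPUs whose combined memory holds the cache (weights ignored)."""
    gpu_bytes = gpu_memory_gb * 10**9
    return -(-cache_bytes // gpu_bytes)  # ceiling division on integers


gqa = kv_cache_bytes(80, 8, 128, 1_000_000)    # Llama-70B shape with GQA
mha = kv_cache_bytes(80, 64, 128, 1_000_000)   # same shape without GQA
print(f"GQA: {gqa / 1e9:.1f} GB, {gpus_needed(gqa)} x 80 GB GPUs")
print(f"MHA: {mha / 1e9:.1f} GB, {gpus_needed(mha)} x 80 GB GPUs")
```

Because the formula is linear in every factor, halving `bytes_per_element` (8-bit KV quantization) or `num_kv_heads` halves the cache.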

Evolution of context windows¶

Year	Model	Context	Innovation
2020	GPT-3	2K	Standard attention
2023	Claude 2	100K	Specialized KV cache
2024	Gemini 1.5	1M	Multi-query + efficient
2025	Grok	10M+	Ring Attention
2025	Llama 4	10M	Context Parallelism + Sparse

Technique comparison¶

Technique	Approach	Max Context	Memory	Quality	Hardware
Ring Attention	Distributed blocks	N x devices	Distributed	Exact	Multi-GPU
Context Parallelism	Sequence sharding	N x devices	Distributed	Exact	Multi-GPU
Infini-Attention	Compressive memory	Unlimited	Bounded O(1)	Lossy	Single-GPU
Star Attention	Block-sparse	1M+	Reduced	~98%	Multi-host
Sparse Attention	Local + global	1M+	O(n*sqrt(n))	~95%	Single-GPU
KV Compression	Eviction/quantization	100K+	Reduced	~97%	Single-GPU

1. Ring Attention¶

Shards the sequence across GPUs arranged in a ring. Each GPU keeps its own query block while key/value blocks rotate around the ring, and communication is overlapped with compute.

Step 1: Each GPU computes local attention
GPU0: attend(T1, T1)    GPU1: attend(T2, T2)

Step 2: Ring communication (overlapped)
GPU0 sends K,V(T1) -> GPU1, receives K,V(T4) <- GPU3

Step 3: After N steps -> full attention

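Blockwise attention: each device accumulates partial results for its query block $Q_i$ over all key/value blocks, rescaling with a running max and softmax denominator, so the result equals full attention: $\text{Attn}_i = \text{softmax}\left(\frac{Q_i [K_1; \dots; K_N]^\top}{\sqrt{d_{head}}}\right) [V_1; \dots; V_N]$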
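That block-by-block accumulation gives the same answer as attention over all keys at once, which can be checked numerically. The sketch below is a single-process illustration in plain Python with one query vector and scalar values; the function names are illustrative and the $1/\sqrt{d_{head}}$ scaling is omitted for brevity.

```python
import math


def attend_full(q, keys, values):
    """Reference: softmax(q . k) weighted average over all keys at once."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def attend_blockwise(q, key_blocks, value_blocks):
    """Same result, visiting one K/V block at a time with running statistics."""
    m = float("-inf")  # running max of scores seen so far
    l = 0.0            # running softmax denominator
    acc = 0.0          # running unnormalized numerator
    for keys, values in zip(key_blocks, value_blocks):
        scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
        m_new = max(m, max(scores))
        scale = math.exp(m - m_new)  # rescales earlier blocks to the new max
        weights = [math.exp(s - m_new) for s in scores]
        l = l * scale + sum(weights)
        acc = acc * scale + sum(w * v for w, v in zip(weights, values))
        m = m_new
    return acc / l


q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [-1.0, 3.0]]
values = [1.0, 2.0, 3.0, 4.0]
full = attend_full(q, keys, values)
blocked = attend_blockwise(q, [keys[:2], keys[2:]], [values[:2], values[2:]])
print(abs(full - blocked) < 1e-9)
```

In Ring Attention each "block" in the loop is the K/V shard that arrives from the neighbouring GPU, so only the running statistics and one shard need to be resident at a time.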

Scaling:

GPUs	Max Sequence	Total Memory
1	128K	80 GB
8	1M	640 GB
64	8M	5 TB
512	16M	40 TB

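import torch
import torch.distributed as dist


class RingAttention(torch.nn.Module):
    """Forward-only, non-causal sketch: each rank holds one shard of the sequence."""

    def __init__(self, dim, num_heads, ring_size):
        super().__init__()
        self.num_heads, self.head_dim, self.ring_size = num_heads, dim // num_heads, ring_size
        self.q_proj, self.k_proj = torch.nn.Linear(dim, dim), torch.nn.Linear(dim, dim)
        self.v_proj, self.out_proj = torch.nn.Linear(dim, dim), torch.nn.Linear(dim, dim)

    def forward(self, x):
        B, T, D = x.shape  # T is the local shard length
        rank = dist.get_rank() if self.ring_size > 1 else 0
        Q = self.q_proj(x).view(B, T, self.num_heads, self.head_dim)
        K = self.k_proj(x).view(B, T, self.num_heads, self.head_dim)
        V = self.v_proj(x).view(B, T, self.num_heads, self.head_dim)

        O = torch.zeros_like(Q)  # running, already-normalized output
        l = torch.zeros(B, T, self.num_heads, 1, device=x.device, dtype=x.dtype)  # softmax denominator
        m = torch.full_like(l, float('-inf'))  # running row max

        K_cur, V_cur = K.contiguous(), V.contiguous()
        for step in range(self.ring_size):
            scores = torch.einsum('bthd,bshd->bths', Q, K_cur) / (self.head_dim ** 0.5)
            # Online softmax update
            m_new = torch.maximum(m, scores.max(dim=-1, keepdim=True)[0])
            p = torch.exp(scores - m_new)
            l_new = l * torch.exp(m - m_new) + p.sum(dim=-1, keepdim=True)
            O = O * (l / l_new) * torch.exp(m - m_new) + \
                torch.einsum('bths,bshd->bthd', p, V_cur) / l_new
            m, l = m_new, l_new
            # Rotate both K and V around the ring (a real implementation overlaps this with compute)
            if step < self.ring_size - 1:
                dst, src = (rank + 1) % self.ring_size, (rank - 1) % self.ring_size
                K_recv, V_recv = torch.empty_like(K_cur), torch.empty_like(V_cur)
                ops = [dist.P2POp(dist.isend, K_cur, dst), dist.P2POp(dist.irecv, K_recv, src),
                       dist.P2POp(dist.isend, V_cur, dst), dist.P2POp(dist.irecv, V_recv, src)]
                for req in dist.batch_isend_irecv(ops):
                    req.wait()
                K_cur, V_cur = K_recv, V_recv
        return self.out_proj(O.flatten(2))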

2. Context Parallelism (NVIDIA NeMo)¶

An improved version of Ring Attention: it shards all layers (LN, FFN) along the sequence dimension, not just attention, uses optimized P2P communication, and is production-ready.

Seq Length	Without CP	With CP (8 GPU)
16K	150 TFLOPS	145 TFLOPS
128K	OOM	195 TFLOPS
1M	OOM	210 TFLOPS

3. Infini-Attention (Google, 2024)¶

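Adds a compressive memory to standard attention, giving bounded memory for unbounded sequences.

\[O = \sigma(\beta) \cdot A_{memory} + (1 - \sigma(\beta)) \cdot A_{local}\]

Memory update per segment $s$: $M_s \leftarrow M_{s-1} + \phi(K)^\top V$ and $z_s \leftarrow z_{s-1} + \sum_t \phi(K_t)$, where $\phi = \text{ELU} + 1$. Memory read: $A_{memory} = \phi(Q) M_{s-1} / (\phi(Q) z_{s-1})$.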

Context	Standard Attention	Infini-Attention
128K	32 GB	0.1 GB
1M	256 GB	0.1 GB
Unlimited	Impossible	0.1 GB

Trade-off: the compression is lossy, and accuracy degrades as the number of segments grows. 1M passkey retrieval: 95% (vs 0% for a standard model with a truncated context).
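The bounded-memory idea can be illustrated with a toy compressive memory in plain Python. This is a sketch under assumptions taken from the Infini-attention paper rather than from the text above: keys and queries pass through an ELU + 1 feature map, a normalization vector `z` is kept alongside the matrix `M`, and `sigmoid(beta)` gates the memory branch. The class and function names are illustrative.

```python
import math


def elu1(x):
    """ELU(x) + 1, a strictly positive feature map."""
    return x + 1.0 if x > 0 else math.exp(x)


class CompressiveMemory:
    """Fixed-size associative memory: a d_key x d_value matrix plus a d_key normalizer."""

    def __init__(self, d_key, d_value):
        self.M = [[0.0] * d_value for _ in range(d_key)]
        self.z = [0.0] * d_key

    def update(self, keys, values):
        """Add one segment: M += phi(K)^T V and z += sum_t phi(K_t)."""
        for k, v in zip(keys, values):
            phi_k = [elu1(x) for x in k]
            for i, ki in enumerate(phi_k):
                self.z[i] += ki
                for j, vj in enumerate(v):
                    self.M[i][j] += ki * vj

    def retrieve(self, q):
        """Read: phi(q) M / (phi(q) . z). Requires at least one prior update."""
        phi_q = [elu1(x) for x in q]
        denom = sum(a * b for a, b in zip(phi_q, self.z))
        d_value = len(self.M[0])
        return [sum(phi_q[i] * self.M[i][j] for i in range(len(phi_q))) / denom
                for j in range(d_value)]


def gated_output(beta, a_memory, a_local):
    """Blend memory and local attention outputs with a learned scalar gate beta."""
    g = 1.0 / (1.0 + math.exp(-beta))
    return [g * m + (1.0 - g) * a for m, a in zip(a_memory, a_local)]


mem = CompressiveMemory(d_key=2, d_value=3)
mem.update([[0.3, -0.7]], [[1.0, 2.0, 3.0]])
print(mem.retrieve([0.3, -0.7]))  # close to the stored value while only one pair is stored
```

The memory stays at `d_key x d_value` numbers no matter how many segments are written, which is where the bounded footprint comes from. Once many pairs share the same matrix, reads return a blend of stored values, which is the lossy part of the trade-off.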

4. Star Attention (NVIDIA, ICML 2025)¶

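Two-phase block-sparse attention. Phase 1 (context encoding): the context is split into blocks, and each host attends only within its local block plus a shared anchor block of global tokens, with no host-to-host communication. Phase 2 (query and generation): the query attends to all cached tokens, and the per-host partial results are aggregated globally.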

Metric	Full Attention	Star Attention
Throughput (1M ctx)	15 tok/s	165 tok/s
Memory (1M ctx)	OOM	40 GB
Accuracy (RULER)	100%	98%
Speedup	1x	11x
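To see why the context-encoding phase is cheap, it helps to count which key blocks each context block attends to. The sketch below assumes the anchor is the first context block, as described in the Star Attention paper, and compares that plan with full causal attention at block granularity. The function names are illustrative, and the block-pair count is only a rough proxy for cost, so it does not reproduce the measured speedup in the table.

```python
def star_phase1_plan(num_blocks):
    """For each context block, the key blocks it sees: the anchor (block 0) plus itself."""
    return [[0] if i == 0 else [0, i] for i in range(num_blocks)]


def full_causal_plan(num_blocks):
    """Full causal attention: block i sees every block up to and including i."""
    return [list(range(i + 1)) for i in range(num_blocks)]


def block_pairs(plan):
    """Total number of (query block, key block) pairs that must be computed."""
    return sum(len(key_blocks) for key_blocks in plan)


n = 32
print(block_pairs(star_phase1_plan(n)), "vs", block_pairs(full_causal_plan(n)))
# prints "63 vs 528"
```

The Star plan grows linearly with the number of blocks while the full causal plan grows quadratically, and no block in the Star plan depends on another host's non-anchor block.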

5. KV Cache Compression¶

Method	Compression	Quality Loss
H2O	2-4x	<1%
Streaming LLM	2-4x	<2%
Quest	4-8x	2-5%
RocketKV	4-16x	<3%
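As an illustration of the eviction family, here is a minimal sketch of a Streaming LLM style policy: keep a few initial "attention sink" tokens plus a sliding window of the most recent tokens, and evict everything in between. The function name and the default sizes are assumptions for illustration, not values from the table.

```python
def streaming_keep(num_tokens, num_sinks=4, window=1020):
    """Positions kept in the KV cache: the first num_sinks tokens plus the last `window` tokens."""
    if num_tokens <= num_sinks + window:
        return list(range(num_tokens))  # nothing to evict yet
    return list(range(num_sinks)) + list(range(num_tokens - window, num_tokens))


print(streaming_keep(10, num_sinks=2, window=3))  # [0, 1, 7, 8, 9]
print(len(streaming_keep(100_000)))               # 1024
```

The cache size is capped at `num_sinks + window` entries regardless of how long the stream gets, at the price of losing access to evicted middle tokens.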

Details and comparisons¶

Context Rot¶

Performance degrades nonlinearly as context grows. "Lost in the middle": recall is worse for chunks in the middle of the context.

Context Size	Typical Recall
< 32K	90%+
64K-128K	80-90%
256K-512K	60-80%
1M+	26-60%

Gemini 3 Pro: MRCR v2 @ 128K = 77%, @ 1M = 26.3%.

Mitigation: strategic positioning (important info at start/end), chunking, RAG, summary layers.
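Strategic positioning can be sketched as a reordering step applied to retrieved chunks before they go into the prompt. The function below assumes the chunks arrive sorted most-relevant first and alternates them between the front and the back, so the weakest ones land in the middle. The function name is illustrative.

```python
def place_at_edges(ranked_chunks):
    """Put the most relevant chunks at the start and end, the least relevant in the middle."""
    front, back = [], []
    for i, chunk in enumerate(ranked_chunks):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]


print(place_at_edges(["a", "b", "c", "d", "e"]))  # ['a', 'c', 'e', 'd', 'b']
```

Here "a" (rank 1) opens the prompt, "b" (rank 2) closes it, and "e" (rank 5) sits in the middle, where recall is weakest.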

RAG vs Long Context¶

Approach	Strengths	Weaknesses	Best For
Long Context (1M+)	Single pass, no retrieval setup	Cost, context rot	Legal contracts, code analysis
RAG	Targeted, low cost	Retrieval quality dependent	Large corpora, dynamic data
Hybrid	Best of both	Complexity	Enterprise production

Cost comparison:

Scenario	Long Context	RAG
100K tokens, 10 queries	$10-30	$1-3
1M tokens, 10 queries	$50-300	$3-10
10M corpus, frequent	Impractical	$10-50/month
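The gap between the two columns comes mostly from how many input tokens each query sends. The sketch below models input-token cost only, under stated assumptions: a hypothetical flat price per million input tokens, no prompt caching, and no output-token, embedding, or indexing costs. It will therefore not reproduce the dollar ranges in the table, only the shape of the comparison. All function and parameter names are illustrative.

```python
def input_cost_usd(tokens_per_query, num_queries, price_per_million):
    """Input-token cost for a batch of queries at a flat per-million price."""
    return tokens_per_query * num_queries * price_per_million / 1_000_000


def long_context_cost(context_tokens, num_queries, price_per_million):
    """Every query re-sends the whole context."""
    return input_cost_usd(context_tokens, num_queries, price_per_million)


def rag_cost(top_k, chunk_tokens, num_queries, price_per_million):
    """Every query sends only the top_k retrieved chunks."""
    return input_cost_usd(top_k * chunk_tokens, num_queries, price_per_million)


lc = long_context_cost(1_000_000, num_queries=10, price_per_million=2.0)
rag = rag_cost(top_k=10, chunk_tokens=1_000, num_queries=10, price_per_million=2.0)
print(f"long context: ${lc:.2f}, RAG: ${rag:.2f}")
```

Under these assumptions the ratio between the two is simply `context_tokens / (top_k * chunk_tokens)`, so the advantage of RAG shrinks as more or larger chunks are retrieved.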

Models in 2026¶

Model	Context	Pricing (input $/1M)	Best For
Gemini 3 Pro	1M-2M	$2.00	Largest available
Claude Opus 4.6	200K (1M beta)	$5.00-$30.00	Best quality within context
GPT-5.2	400K	$1.50	Balanced
Grok 2	2M	Varies	X.AI flagship

Open-source: DeepSeek-V3 128K (MoE), Qwen2.5-72B 128K, Llama 3.1-405B 128K.

Benchmarks (ICLR 2026)¶

96 GPUs, accuracy by method:

Method	32K	128K	512K	1M
FlashAttention-2	100%	95%	OOM	OOM
Ring Attention	100%	100%	98%	95%
Context Parallel	100%	100%	99%	97%
Star Attention	98%	98%	97%	96%

Memory Requirements (KV Cache)¶

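Values follow the KV cache formula at FP16 with decimal K and M. GPU counts cover only the 70B KV cache on 80 GB devices.

Context	7B model (MHA, $H_{kv}=32$)	70B model (GQA, $H_{kv}=8$)	GPUs Needed (70B)
4K	~2.1 GB	~1.3 GB	1
128K	~67 GB	~42 GB	1
1M	~524 GB	~328 GB	5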

Latency¶

Context	Flash Attn 2	Ring Attn (8 GPU)	Infini-Attn
32K	0.5s	1.2s	0.6s
128K	2s	2.5s	1s
1M	OOM	8s	4s

Decision tree¶

graph TD
    START{"Hardware?"} -->|"Single GPU"| SINGLE{"Context?"}
    START -->|"Multi-GPU (NVLink)"| MULTI{"Exact attention?"}
    START -->|"Multi-Node"| CP["Context Parallelism (NeMo)"]

    SINGLE -->|"< 128K"| FA["FlashAttention-2"]
    SINGLE -->|"128K-512K"| INFINI["Infini-Attention / KV Compression"]
    SINGLE -->|"> 512K"| SPARSE["Sparse Attention"]

    MULTI -->|"Yes"| RING["Ring Attention / Context Parallel"]
    MULTI -->|"Approximate OK"| STAR["Star Attention"]

    style FA fill:#e8f5e9,stroke:#4caf50
    style RING fill:#e8f5e9,stroke:#4caf50
    style CP fill:#e8eaf6,stroke:#3f51b5
    style STAR fill:#fff3e0,stroke:#ef6c00
    style INFINI fill:#fff3e0,stroke:#ef6c00
    style SPARSE fill:#fce4ec,stroke:#c62828
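The same decision tree can be written as a small function, which is convenient for tests or configuration tooling. This is a direct transcription of the graph under one assumption: the boundaries at 128K and 512K are treated as 128,000 and 512,000 tokens, with exactly 128,000 falling into the middle bucket. The function name and argument values are illustrative.

```python
def choose_technique(hardware, context_tokens=0, exact=True):
    """hardware is one of 'single', 'multi_gpu', 'multi_node'."""
    if hardware == "multi_node":
        return "Context Parallelism (NeMo)"
    if hardware == "multi_gpu":
        return "Ring Attention / Context Parallel" if exact else "Star Attention"
    if hardware == "single":
        if context_tokens < 128_000:
            return "FlashAttention-2"
        if context_tokens <= 512_000:
            return "Infini-Attention / KV Compression"
        return "Sparse Attention"
    raise ValueError(f"unknown hardware: {hardware!r}")


print(choose_technique("single", context_tokens=32_000))
print(choose_technique("multi_gpu", exact=False))
```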

Interview Questions¶

1. Why is the KV cache the bottleneck for long context?¶

Red flag: "Attention is quadratic, so it is slow"

Strong answer: "The problem is not only compute ($O(n^2)$) but also memory. The KV cache stores keys and values for every position in every layer: $2 \times L \times H_{kv} \times d_{head} \times n \times bytes$. For Llama-70B @ 1M tokens with GQA that is ~328 GB, more than four 80 GB A100s can hold. Without GQA it would be ~2.6 TB. FlashAttention addresses compute and avoids materializing the $O(n^2)$ score matrix, but it does not shrink the KV cache. For memory you need distributed attention (Ring), compression (Infini), or eviction (H2O, Streaming LLM)."

2. Ring Attention vs Infini-Attention?¶

Red flag: "Ring Attention is faster, so just use it"

Strong answer: "Ring Attention: exact attention, distributed across $P$ GPUs in a ring. Communication is overlapped with compute. Scaling: $n_{max} = n_{local} \times P$. It needs a fast interconnect (NVLink). Infini-Attention: bounded $O(1)$ memory for unbounded sequences on a single GPU. But it is lossy: the compressive memory degrades. 1M passkey retrieval: 95% (vs 100% for Ring). The choice: if you need exactness, use Ring with multiple GPUs. If you are limited to one GPU, use Infini."

3. Context rot: what is it and how do you fight it?¶

Red flag: "A model with a 1M context can handle 1M tokens"

Strong answer: "Context rot: performance degrades nonlinearly as context grows. 'Lost in the middle': recall is worse for the middle parts of a document. Gemini 3 Pro: 77% recall @ 128K -> 26.3% @ 1M. Being technically able to accept 1M tokens != working well on 1M tokens. Mitigation: (1) strategic positioning, with important information at the start or end, (2) RAG instead of stuffing everything into the context, (3) chunking + summary layers, (4) Star Attention (block-sparse with anchor tokens)."

4. RAG vs Long Context: when to use which?¶

Red flag: "Long context solves everything, RAG is no longer needed"

Strong answer: "Long Context: single-document analysis, legal contracts, code review, where you need the full picture. But it is expensive: $50-300 for 1M tokens x 10 queries. RAG: large corpora, dynamic data, cost-sensitive workloads, at $3-10 for the same scenario, roughly 10-30x cheaper. Production: hybrid, with RAG retrieving the relevant chunks and long context used for synthesis. One more factor: context rot degrades quality when stuffing, while RAG feeds in only what is relevant."

Key numbers¶

Fact	Value
Llama-70B @ 1M KV cache (GQA)	~328 GB
Ring Attention scaling	Linear: N x devices
Infini-Attention memory @ any context	~0.1 GB (bounded)
Star Attention speedup	11x
Star Attention accuracy	98%
Context rot: Gemini @ 1M recall	26.3%
H2O KV compression	2-4x, <1% quality loss
Long Context cost (1M, 10 queries)	$50-300
RAG cost (same scenario)	$3-10

1M context != 1M tokens of quality understanding

Advertised context window sizes (1M, 2M, 10M) are a technical capability, not a guarantee of retrieval quality. Gemini 3 Pro at 1M context: recall drops to 26%. In practice, for most tasks 128K with good RAG is better than 1M-token stuffing. Needle-in-a-haystack tests are the minimum for validation.

Misconception: FlashAttention solves the long-context problem

FlashAttention optimizes compute (IO-aware tiling, fewer HBM accesses) and avoids materializing the full score matrix, but it does not reduce the KV cache, which still grows linearly with sequence length across all layers. For 1M tokens on a 70B model the KV cache is ~328 GB, and FlashAttention-2 on a single GPU simply runs out of memory. Truly long contexts need distributed (Ring Attention) or compressed (Infini-Attention) approaches.

Misconception: Ring Attention has no overhead compared with regular attention

Ring Attention overlaps communication with compute, but this only works with sufficiently large blocks and a fast interconnect (NVLink). On a slow interconnect or with small batch sizes, communication becomes the bottleneck. At 32K tokens Ring Attention on 8 GPUs is slower than FlashAttention-2 (1.2 s vs 0.5 s). The payoff starts where regular attention runs out of memory.

Self-check

Compute the KV cache size for Llama-7B ($L=32$, $H_{kv}=32$, $d_{head}=128$, FP16) at a 128K context. How many A100 GPUs (80 GB) are needed for the KV cache alone?
Ring Attention on 8 GPUs processes 1M tokens. Each GPU holds a local chunk of $1M/8 = 125K$ tokens. How many ring steps are needed for full attention? What is the communication overhead if it is overlapped with compute?
For an enterprise RAG application: a 10M-token corpus, 100 queries/day. Compare the cost of (1) the long-context approach (all-in-one) vs (2) RAG (top-k chunks). Use a price of $2/1M input tokens.

Sources¶

arXiv -- "Ring Attention" (Liu, Zaharia, Abbeel, 2310.01889)
arXiv -- "Infini-Attention: Leave No Context Behind" (Google, 2404.07143)
arXiv -- Star Attention (Acharya, Jia, Ginsburg) -- ICML 2025
ICLR 2026 -- "Long-Context Attention Benchmark" (OpenReview:W7sVYFJAEp)
NVIDIA Developer Blog -- "Scaling to Millions of Tokens" (June 2025)
TryChroma Research -- "Context Rot: How Increasing Input Tokens Impacts LLM Performance"
Claude5 -- "Context Window Race 2026"
AIMultiple -- "Best LLMs for Extended Context Windows 2026"
Exxact Corp -- "Context Parallelism & Ring Attention"
Towards Data Science -- "How LLMs Handle Infinite Context With Finite Memory"
TeraContext -- "Why 1M Tokens Isn't Enough: The Mathematics of Context Windows"
arXiv -- "Infinite Retrieval: Attention Enhanced LLMs"