Cheat Sheet: Positional Encoding and Attention¶

~9 min read

Prerequisite: Transformer Architecture

A reference card for attention mechanisms and positional encodings in transformers. Full self-attention costs O(T^2) in both memory and compute: at a 128K context the attention matrix has roughly 16 billion elements per head. RoPE has become the de facto standard for positional encoding in most LLMs of 2024-2026 (LLaMA, Mistral, Qwen), GQA shrinks the KV cache by 8x at a quality loss under 1%, and FlashAttention (including FlashAttention-3) reduces attention memory from O(T^2) to O(T). This cheat sheet covers RoPE, ALiBi, MHA/MQA/GQA, FlashAttention, sparse/linear attention, sliding window, and attention sinks, all of which are standard topics in LLM interviews.

Type: synthesis / interview cheat sheet. Date: February 2026. Synthesis of: Transformer architecture, attention variants, positional encoding methods

Quick Reference: Key Numbers¶

Metric	Value	Context
Full attention complexity	\(O(T^2)\)	Memory and compute
Linear attention complexity	\(O(T)\)	Performer, Linear Transformer
FlashAttention speedup	2-4×	Memory + compute optimized
MQA KV cache reduction	\(H\)× (~8× for 8 heads)	1 K/V head vs \(H\) heads
GQA quality loss	<1%	8 groups vs full heads
RoPE extrapolation	2-4×	Beyond training length

1. Positional Encoding Methods¶

Comparison Table¶

Method	Extrapolation	Relative Position	Quality	Usage
Sinusoidal	Poor	Implicit	Good	Original Transformer
Learned	Poor	No	Good	GPT-2, BERT
RoPE	Good	Implicit	Excellent	LLaMA, Mistral, Qwen
ALiBi	Excellent	Explicit	Good	MPT, BLOOM
NoPE	N/A	Implicit	Variable	Some 2025 models

Winner (2025-2026): RoPE¶

RoPE (Rotary Position Embedding) is the de facto standard for most modern LLMs.

2. RoPE (Rotary Position Embedding)¶

Core Idea¶

Rotate query and key vectors based on position:

\[\text{RoPE}(x_m, m) = \begin{pmatrix} x_m^{(1)} \\ x_m^{(2)} \\ \vdots \\ x_m^{(d-1)} \\ x_m^{(d)} \end{pmatrix} \odot \begin{pmatrix} \cos(m\theta_1) \\ \cos(m\theta_1) \\ \vdots \\ \cos(m\theta_{d/2}) \\ \cos(m\theta_{d/2}) \end{pmatrix} + \begin{pmatrix} -x_m^{(2)} \\ x_m^{(1)} \\ \vdots \\ -x_m^{(d)} \\ x_m^{(d-1)} \end{pmatrix} \odot \begin{pmatrix} \sin(m\theta_1) \\ \sin(m\theta_1) \\ \vdots \\ \sin(m\theta_{d/2}) \\ \sin(m\theta_{d/2}) \end{pmatrix}\]

Where \(\theta_i = 10000^{-2(i-1)/d}\) for \(i = 1, \dots, d/2\)

Properties¶

Relative position encoding: \(\text{Attn}(q_m, k_n)\) depends only on \(m - n\)
Long-term decay: Distant positions have less interaction
Extrapolation: Can extend 2-4× beyond training length
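
The relative-position property can be checked numerically on a single rotated pair. The sketch below is illustrative only: it treats one \((x^{(1)}, x^{(2)})\) pair as a complex number, uses \(\theta = 1\) (the first frequency), and the helper names `rope_rotate` and `score` are hypothetical, not taken from any library.

```python
import cmath
import math

def rope_rotate(pair, pos, theta):
    """Rotate one (x1, x2) pair, stored as a complex number, by pos * theta."""
    return pair * cmath.exp(1j * pos * theta)

def score(q, k):
    """Real dot product of two 2-D vectors stored as complex numbers."""
    return (q * k.conjugate()).real

theta = 1.0  # first RoPE frequency (assumed for the demo)
q, k = complex(0.3, -1.2), complex(0.7, 0.5)

a = score(rope_rotate(q, 5, theta), rope_rotate(k, 2, theta))      # m - n = 3
b = score(rope_rotate(q, 105, theta), rope_rotate(k, 102, theta))  # m - n = 3
c = score(rope_rotate(q, 5, theta), rope_rotate(k, 4, theta))      # m - n = 1

print(math.isclose(a, b, abs_tol=1e-9))  # same offset: scores agree
print(math.isclose(a, c, abs_tol=1e-9))  # different offset: scores differ
```

Shifting both positions by 100 leaves the score unchanged, while changing the offset from 3 to 1 changes it; a full implementation applies the same rotation to every pair with its own frequency.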

Code: RoPE Implementation¶

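import torch

def precompute_freqs_cis(dim, max_seq_len, theta=10000.0):
    # Frequencies theta^(-2j/dim) for j = 0 .. dim/2 - 1
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
    t = torch.arange(max_seq_len).float()
    freqs = torch.outer(t, freqs)  # (max_seq_len, dim/2): angle m * theta_j
    freqs_cis = torch.polar(torch.ones_like(freqs), freqs)  # complex e^{i * angle}
    return freqs_cis

def apply_rotary_emb(xq, xk, freqs_cis):
    # Assumed layout: xq, xk are (B, T, H, head_dim); freqs_cis is (T, head_dim/2)
    # Reshape to complex: each adjacent pair becomes one complex number
    xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    xk_ = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))

    # Add batch and head axes so the per-position rotations broadcast
    freqs_cis = freqs_cis[None, :, None, :]

    # Apply rotation
    xq_out = torch.view_as_real(xq_ * freqs_cis).flatten(-2)
    xk_out = torch.view_as_real(xk_ * freqs_cis).flatten(-2)

    return xq_out.type_as(xq), xk_out.type_as(xk)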

RoPE Extensions (2025-2026)¶

Extension	Description	Benefit
RoPE-Scaling	Interpolation for longer context	8× extrapolation
YaRN	Yet another RoPE extension	Better length extrapolation
Dynamic NTK	Adjust base during inference	Adaptive scaling
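
As a rough illustration of the interpolation idea in the first row, the sketch below rescales positions so that a longer sequence reuses the angle range seen during training. It assumes plain linear position interpolation (position scaled by train length / target length); the sizes and the `rope_angles` helper are hypothetical, and NTK-aware scaling and YaRN modify the frequencies differently.

```python
def rope_angles(pos, dim, base=10000.0, scale=1.0):
    """Rotation angles (pos * scale) * theta_j for each of the dim/2 pairs."""
    return [(pos * scale) * base ** (-2 * j / dim) for j in range(dim // 2)]

TRAIN_LEN, TARGET_LEN, DIM = 4096, 32768, 8   # hypothetical sizes
scale = TRAIN_LEN / TARGET_LEN                # 1/8 for an 8x longer context

last_pos = TARGET_LEN - 1
plain = rope_angles(last_pos, DIM)                   # angles beyond the trained range
squeezed = rope_angles(last_pos, DIM, scale=scale)   # mapped back inside it

print(plain[0], squeezed[0])
```

The trade-off is resolution: neighbouring tokens end up closer together in angle space, which is one reason the scaled model is usually fine-tuned on long sequences.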

3. ALiBi (Attention with Linear Biases)¶

Core Idea¶

Add bias to attention scores based on relative position:

\[\text{Attn}(q, k, m) = \text{softmax}(qK^T + m \cdot B) \cdot V\]

Where \(B_{ij} = -|i - j|\) (linear decay) and \(m\) is a fixed, head-specific slope

Implementation¶

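import torch

def alibi_bias(seq_len, num_heads):
    # Different slopes per head (geometric sequence 2^(-8/n), 2^(-16/n), ...);
    # this closed form assumes num_heads is a power of two
    slopes = torch.pow(2.0, torch.arange(1, num_heads + 1) * (-8.0 / num_heads))

    # Relative positions: entry (i, j) is j - i
    positions = torch.arange(seq_len)
    relative_pos = positions[None, :] - positions[:, None]

    # Linear bias (negative): -|i - j|
    bias = -torch.abs(relative_pos).float()

    # Scale by head slope; result has shape (num_heads, seq_len, seq_len)
    # and is added to the attention scores before softmax
    return bias * slopes[:, None, None]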
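
To see what the slope schedule looks like in practice, here is a small pure-Python sketch for 8 heads. It mirrors the geometric sequence used above and assumes a power-of-two head count; the helper names are illustrative.

```python
def alibi_slopes(num_heads):
    """Geometric slopes 2^(-8/n), 2^(-16/n), ... (assumes num_heads is a power of two)."""
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]

def alibi_row(query_pos, slope):
    """Bias added to the scores of keys 0..query_pos for one query (causal case)."""
    return [-slope * (query_pos - j) for j in range(query_pos + 1)]

slopes = alibi_slopes(8)
print(slopes[0], slopes[-1])      # steepest and shallowest slope
print(alibi_row(3, slopes[0]))    # bias for the query at position 3, head 0
```

Heads with a steep slope focus on nearby tokens, while heads with a shallow slope can still attend far back, which is part of why the scheme tolerates longer sequences at inference time.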

ALiBi vs RoPE¶

Aspect	ALiBi	RoPE
Extrapolation	Excellent (trained on 1024, works on 8192)	Good (needs scaling)
Quality	Slightly lower	Higher
Adoption	MPT, BLOOM	LLaMA, Mistral, Qwen
Implementation	Simpler	More complex

4. Attention Mechanisms¶

Full Self-Attention¶

\[\text{Attn}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]

Complexity: \(O(T^2)\) memory and compute
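
To make the quadratic cost concrete, the short calculation below sizes the score matrix of a single head. It assumes T = 131072 (128 × 1024) and 2-byte FP16 scores; both are illustrative choices rather than figures for a specific model (with T = 128,000 the count is closer to 16 billion).

```python
# Size of one head's T x T attention score matrix (illustrative assumptions).
T = 128 * 1024          # assumed context length: 131072 tokens
BYTES_PER_SCORE = 2     # assumed FP16/BF16 storage

elements = T * T
gib = elements * BYTES_PER_SCORE / 2**30

print(f"{elements:,} scores per head")
print(f"{gib:.0f} GiB per head if fully materialized")
```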

FlashAttention¶

Key insight: Never materialize the full \(T \times T\) attention matrix.

Standard:  Compute full matrix → Softmax → Multiply → Store
Flash:     Block-wise compute → Online softmax → Never store full matrix

Memory: \(O(T)\) instead of \(O(T^2)\)
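
The online softmax at the heart of this idea can be sketched for a single query row in pure Python. This is a simplified illustration (scalar values, no GPU tiling, hypothetical function names), not the FlashAttention kernel itself; it only shows that block-wise accumulation with rescaling reproduces the full softmax result.

```python
import math

def naive_attention_row(scores, values):
    """Softmax over all scores at once, then a weighted sum of scalar values."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total

def online_attention_row(scores, values, block_size):
    """Same result, visiting keys block by block with constant-size running state."""
    running_max = float("-inf")
    running_sum = 0.0   # softmax denominator, relative to running_max
    running_out = 0.0   # unnormalized output, relative to running_max
    for start in range(0, len(scores), block_size):
        s_blk = scores[start:start + block_size]
        v_blk = values[start:start + block_size]
        new_max = max(running_max, max(s_blk))
        # Rescale what was accumulated so far to the new maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        running_out *= correction
        for s, v in zip(s_blk, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            running_out += w * v
        running_max = new_max
    return running_out / running_sum

scores = [0.1, 2.0, -1.3, 4.2, 0.7, 3.3, -0.5]
values = [1.0, -2.0, 0.5, 3.0, 1.5, -1.0, 2.0]
print(naive_attention_row(scores, values))
print(online_attention_row(scores, values, block_size=3))
```

Only the running maximum, denominator, and output are kept between blocks, which is why the full row of scores never has to be stored.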

FlashAttention-3 (2024)¶

Hopper (H100) optimized
Uses Tensor Cores + TMA
1.5-2× faster than FlashAttention-2
FP8 support

5. Multi-Query Attention (MQA) & Grouped-Query Attention (GQA)¶

Standard Multi-Head Attention¶

\(H\) query heads
\(H\) key heads
\(H\) value heads
Memory: \(O(H \times d \times T)\) for KV cache

Multi-Query Attention (MQA)¶

\(H\) query heads
1 key head
1 value head
Memory: \(O(d \times T)\) for KV cache
8× memory reduction (for 8 heads)

Grouped-Query Attention (GQA)¶

\(H\) query heads
\(G\) key heads (\(1 < G < H\))
\(G\) value heads
Balance between MQA and MHA
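
One way to picture the three variants is as a mapping from query heads to the K/V head they read from. The sketch below assumes contiguous grouping (consecutive query heads share a K/V head); the function name is hypothetical.

```python
def kv_head_for_query_head(query_head, num_heads, num_kv_heads):
    """Index of the shared K/V head used by a given query head."""
    assert num_heads % num_kv_heads == 0, "query heads must divide evenly into groups"
    group_size = num_heads // num_kv_heads
    return query_head // group_size

H = 8
print([kv_head_for_query_head(h, H, 8) for h in range(H)])  # MHA: one K/V head each
print([kv_head_for_query_head(h, H, 2) for h in range(H)])  # GQA with G = 2
print([kv_head_for_query_head(h, H, 1) for h in range(H)])  # MQA: all share head 0
```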

Comparison¶

Method	KV Heads	Memory	Quality	Speed
MHA	\(H\)	100%	Best	Baseline
GQA	\(G\)	\(G/H\)	~99%	1.5-2×
MQA	1	\(1/H\)	~97%	2-3×

Code: GQA¶

import math

import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    def __init__(self, d_model, num_heads, num_groups):
        super().__init__()
        assert d_model % num_heads == 0 and num_heads % num_groups == 0
        self.num_heads = num_heads
        self.num_groups = num_groups
        self.head_dim = d_model // num_heads

        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, num_groups * self.head_dim)  # Fewer K
        self.v_proj = nn.Linear(d_model, num_groups * self.head_dim)  # Fewer V
        self.o_proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        B, T, C = x.shape

        # Project, split into heads, move heads before time: (B, heads, T, head_dim)
        q = self.q_proj(x).view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.num_groups, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.num_groups, self.head_dim).transpose(1, 2)

        # Repeat each K/V head for the query heads in its group
        k = k.repeat_interleave(self.num_heads // self.num_groups, dim=1)
        v = v.repeat_interleave(self.num_heads // self.num_groups, dim=1)

        # Standard (non-causal) attention over the time axis
        attn = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        attn = F.softmax(attn, dim=-1)
        out = torch.matmul(attn, v)  # (B, num_heads, T, head_dim)

        # Merge heads back into (B, T, C)
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        return self.o_proj(out)

6. Cross-Attention vs Self-Attention¶

Self-Attention¶

\[\text{SelfAttn}(X) = \text{Attn}(XW_Q, XW_K, XW_V)\]

Q, K, V all from same input
Captures internal relationships

Cross-Attention¶

\[\text{CrossAttn}(X, Y) = \text{Attn}(XW_Q, YW_K, YW_V)\]

Q from one input, K, V from another
Connects different sequences
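
The only difference between the two formulas is where K and V come from, which a small NumPy sketch makes visible through the shapes. The sizes, random weights, and the `attention` helper below are illustrative assumptions (single head, no masking).

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention for 2-D arrays: q (Tq, d), k and v (Tk, d)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
d = 16
X = rng.normal(size=(5, d))    # e.g. 5 decoder tokens
Y = rng.normal(size=(9, d))    # e.g. 9 encoder tokens
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

self_out = attention(X @ W_q, X @ W_k, X @ W_v)    # Q, K, V all from X
cross_out = attention(X @ W_q, Y @ W_k, Y @ W_v)   # Q from X; K, V from Y

print(self_out.shape, cross_out.shape)
```

In both cases the output has one row per query token; cross-attention simply lets those rows be mixtures of values from the other sequence.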

Use Cases¶

Use Case	Attention Type
Encoder-only (BERT)	Self-attention
Decoder-only (GPT)	Causal self-attention
Encoder-decoder (T5)	Self + Cross
Multi-modal (Flamingo)	Cross-attention (text attends to image features)

7. Sparse Attention Patterns¶

Why Sparse?¶

Full attention \(O(T^2)\) is prohibitive for long sequences.

Patterns¶

Pattern	Description	Complexity
Local	Attend to nearby tokens	\(O(T \times w)\)
Strided	Attend every \(k\) tokens	\(O(T^2/k)\)
Random	Random attention pattern	\(O(T \times r)\)
Block-sparse	Fixed block structure	\(O(T \times b)\)
BigBird	Local + random + global	\(O(T)\)
Longformer	Local + global	\(O(T)\)
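
To illustrate why a local + global pattern scales linearly, the sketch below counts attended (query, key) pairs for a Longformer-like mask and compares them with full attention. The window size, number of global tokens, and function name are assumptions made for the example.

```python
def local_global_pairs(T, window, num_global):
    """Count (query, key) pairs for a local band of half-width `window`
    plus `num_global` leading tokens that attend to, and are attended by, everyone."""
    count = 0
    for i in range(T):
        for j in range(T):
            is_local = abs(i - j) <= window
            is_global = i < num_global or j < num_global
            if is_local or is_global:
                count += 1
    return count

for T in (256, 512, 1024):
    sparse = local_global_pairs(T, window=16, num_global=2)
    print(T, sparse, T * T, f"{sparse / (T * T):.1%}")
```

Doubling T roughly doubles the sparse count while the full count quadruples, so the attended fraction keeps shrinking as sequences grow.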

Star Attention (Nov 2024)¶

Two-phase attention:
1. Context phase: Each block attends only within itself
2. Query phase: Query attends to all blocks

Result: 11× faster, 128K context on single GPU

8. Linear Attention¶

Core Idea¶

Approximate softmax attention with kernel feature maps:

\[\text{Attn}(Q, K, V) = \text{softmax}(QK^T)V \approx \phi(Q)(\phi(K)^TV)\]

This enables right-to-left associativity:

\[\phi(Q)(\phi(K)^TV) = \phi(Q)\sum_t (\phi(k_t) \otimes v_t)\]

Complexity: \(O(T)\) instead of \(O(T^2)\)
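
The associativity trick can be verified directly. The NumPy sketch below uses the elu(x) + 1 feature map from the Linear Transformer row; it is non-causal, the sizes are arbitrary, and the variable names are illustrative.

```python
import numpy as np

def feature_map(x):
    """elu(x) + 1: equals x + 1 for x > 0 and exp(x) otherwise, so always positive."""
    return np.where(x > 0, x + 1.0, np.exp(x))

rng = np.random.default_rng(0)
T, d = 64, 8
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))
phi_q, phi_k = feature_map(Q), feature_map(K)

# Quadratic order: builds a T x T matrix first.
quadratic = (phi_q @ phi_k.T) @ V
# Linear order: builds a d x d summary first; cost grows linearly in T.
kv_summary = phi_k.T @ V                  # shape (d, d)
linear = phi_q @ kv_summary

# Row normalization (the analogue of the softmax denominator), also linear in T.
normalizer = phi_q @ phi_k.sum(axis=0)    # shape (T,)
output = linear / normalizer[:, None]

print(np.allclose(quadratic, linear), kv_summary.shape, output.shape)
```

Both orders give the same numbers; only the size of the intermediate changes, from T × T to d × d.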

Methods¶

Method	Kernel	Quality
Linear Transformer	\(\phi(x) = \text{elu}(x) + 1\)	Good
Performer	Random features	Good
ZeroS (Feb 2026)	Zero-sum linear	Better

ZeroS (2026)¶

\[w_i^{ZeroS} = \frac{\exp(q_i \cdot k_i) - \bar{w}}{\sum_j (\exp(q_j \cdot k_j) - \bar{w})}\]

Enables negative weights (contrastive operations)
\(O(N)\) complexity
Better expressiveness

9. Sliding Window Attention¶

Concept¶

Only attend to local window of \(w\) tokens:

\[\text{Attn}(q_i, K, V) = \sum_{j=i-w}^{i} \alpha_{ij} v_j\]

Implementation¶

import math

import torch
import torch.nn.functional as F

def sliding_window_attention(q, k, v, window_size):
    B, H, T, D = q.shape

    # Causal band mask: query i sees keys i - window_size .. i
    mask = torch.ones(T, T, device=q.device).triu(diagonal=-window_size).tril(diagonal=0)

    # Compute attention with mask
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(D)
    scores = scores.masked_fill(mask == 0, float('-inf'))
    attn = F.softmax(scores, dim=-1)

    return torch.matmul(attn, v)

Mistral's Sliding Window¶

Window size: 4096
Still has global information via layer stacking
\(\text{Receptive field} = \text{Layers} \times \text{Window}\)
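
As a quick worked example of the formula, the snippet below plugs in an assumed depth of 32 layers together with the 4096-token window; the layer count is an assumption for illustration, and the result is an upper bound on how far information can propagate, not a guarantee that it is used.

```python
def receptive_field(num_layers, window):
    """Upper bound on how far back information can propagate through stacked layers."""
    return num_layers * window

# Assumed configuration: 32 layers with a 4096-token window.
print(receptive_field(32, 4096))
```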

10. Attention Sinks¶

Problem: Streaming LLM¶

When streaming, removing old tokens breaks attention because:
- Softmax needs a "sink" token to absorb attention scores
- First token often serves as sink

Solution: StreamingLLM¶

Keep first few tokens ("attention sinks") + recent window:

[SINK][SINK]........[RECENT WINDOW]

Result: Stable generation over effectively unbounded streams with fixed memory (the model attends only to the sinks and the recent window)
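
The eviction policy can be sketched as a function that returns which token positions stay in the KV cache. The function name and parameters below are hypothetical; a real implementation also has to handle how positions are re-indexed inside the cache.

```python
def streaming_cache_positions(num_tokens_seen, num_sinks, window):
    """Token positions kept in the KV cache: the first `num_sinks` tokens
    plus the most recent `window` tokens (without duplicates)."""
    sinks = list(range(min(num_sinks, num_tokens_seen)))
    recent_start = max(num_sinks, num_tokens_seen - window)
    recent = list(range(recent_start, num_tokens_seen))
    return sinks + recent

print(streaming_cache_positions(6, num_sinks=2, window=4))
print(streaming_cache_positions(100, num_sinks=2, window=4))
```

The cache never holds more than `num_sinks + window` entries, no matter how many tokens have been streamed.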

11. Common Misconceptions¶

Misconception: MQA (Multi-Query Attention) is always better than MHA because it is both faster and cheaper

MQA shrinks the KV cache by a factor of H (8x with 8 heads) but loses ~3% quality on hard tasks (multi-hop reasoning, long-context retrieval). GQA with 8 groups for 64 heads gives an 8x reduction at a loss under 1%. That is why LLaMA 2 70B adopted GQA rather than MQA. MQA is justified only under tight memory limits and in latency-critical scenarios.

Misconception: RoPE can extrapolate to any context length

RoPE extrapolates to roughly 2-4x the training length without fine-tuning, but quality degrades (roughly logarithmically) as the length grows. Going to 8x and beyond requires RoPE scaling (NTK-aware) or YaRN, with fine-tuning on long data. A model trained on a 4K context will still show degradation at 32K even with scaling. Llama 3.1 was trained with contexts up to 128K so that it supports long context natively.

Misconception: FlashAttention changes the math of attention

FlashAttention produces a result mathematically identical to standard attention. The optimization is purely algorithmic: block-wise computation with an online softmax, without materializing the full T×T matrix. This cuts memory from O(T^2) to O(T) and gives a 2-4x speedup through better use of GPU SRAM. FlashAttention-3 additionally exploits Tensor Cores and TMA on H100.

12. Interview Questions¶

Basic¶

Q: Why do transformers need positional encoding?

Weak answer: "So that the model knows the word order"

Strong answer: "Self-attention is permutation-equivariant: reordering the input tokens only reorders the outputs, so attention by itself carries no notion of order. Without positional encoding, 'the cat ate the fish' and 'the fish ate the cat' produce the same set of token representations. Sinusoidal (the original Transformer) and learned (GPT-2) encodings are absolute and extrapolate poorly. RoPE encodes relative position by rotating Q and K in the complex plane: the score between q_m and k_n depends on position only through (m-n). ALiBi adds a linear bias -|i-j|, scaled by a per-head slope, to the attention scores without modifying Q or K"

Q: What is FlashAttention?

Weak answer: "It's fast attention"

Strong answer: "FlashAttention is an IO-aware algorithm that never materializes the full T×T attention matrix in HBM. Instead it computes attention block by block in on-chip SRAM and uses an online softmax that updates the normalization as blocks are processed. The result is mathematically identical to standard attention, but with O(T) memory instead of O(T^2) and a 2-4x speedup. FlashAttention-3 (H100): Tensor Core utilization, FP8 support, and a 1.5-2x speedup over FA-2"

Q: Compare MQA, GQA, and MHA

Weak answer: "MHA is the standard one, MQA has a single KV head, GQA is in between"

Strong answer: "MHA: H query heads, H key heads, H value heads; KV cache = O(H * d * T). MQA: H query heads, 1 KV head; the KV cache shrinks by a factor of H (8x for 8 heads), at a quality loss of ~3%. GQA: H query heads, G KV groups (G < H). Llama 2 70B: 64 query heads, 8 KV groups, an 8x cache reduction at under 1% loss. GQA is the standard choice for most production LLMs. GQA memory formula: M_GQA = (G/H) * M_MHA"

Advanced¶

Q: Why is RoPE better than learned positional embeddings?

Weak answer: "RoPE is newer and more popular"

Strong answer: "Three advantages: (1) Relative position: the attention score q_m * k_n depends only on the difference (m-n), not on absolute positions, which generalizes better. (2) Extrapolation: the rotation matrix at position m+k is just an additional rotation, which works 2-4x beyond the training length. Learned embeddings are fixed: the model was trained on positions [0, max_len] and has no embedding for position max_len+1. (3) No extra parameters: the frequencies theta_i = 10000^(-2(i-1)/d) are fixed"

Q: How does linear attention work?

Weak answer: "It removes softmax to speed things up"

Strong answer: "Standard attention, softmax(QK^T)V, has to compute a T*T matrix. Linear attention computes phi(Q)(phi(K)^T V), where phi is a kernel feature map. The key trick is associativity of matrix multiplication: instead of (phi(Q) * phi(K)^T) * V, which builds a T*T matrix, we compute phi(Q) * (phi(K)^T * V), where phi(K)^T * V has size d*d and does not depend on T. Complexity is O(T*d^2), i.e. linear in T for fixed d. ZeroS (2026) adds negative weights for better expressiveness"

Q: What are attention sinks?

Weak answer: "Tokens that receive a lot of attention"

Strong answer: "In streaming inference (evicting old tokens to save memory), softmax needs a 'sink': a token onto which excess attention mass can be dumped. The first 1-2 tokens of the sequence usually become sinks (they receive disproportionately high attention even when they are semantically unimportant). The StreamingLLM solution: always keep the first K sink tokens plus a sliding window of the last N tokens. Result: stable generation over an effectively unbounded stream with fixed memory"

System Design¶

Q: Design attention for a 1M-token context

Weak answer: Listing methods without analyzing trade-offs

Strong answer: "Three approaches with different trade-offs: (1) Ring Attention: shards T/N tokens across N GPUs and scales linearly (128K * 8 GPUs = 1M), but requires multiple GPUs and high interconnect bandwidth. (2) Star Attention: two-phase block-sparse attention on a single GPU, 11x faster at 128K, but approximate. (3) Infini-Attention: compressive memory with O(1) cost per new token and unlimited context, but with quality loss on distant retrieval. Production choice: Ring Attention for quality-critical workloads, Star Attention for single-GPU setups, FlashAttention-3 below 128K"

Q: Optimize inference for multi-turn conversations

Weak answer: "Use a KV cache"

Strong answer: "Three levels: (1) KV cache with PagedAttention (vLLM): block-based allocation, <4% waste versus 60-80% with static allocation. (2) Prefix caching: the shared system prompt and dialogue history are cached across requests, saving 3-30x on repeated contexts. RadixAttention (SGLang) uses a token-level radix tree for maximum reuse. (3) GQA instead of MHA for an 8x smaller KV cache. For agent workflows, RadixAttention + EAGLE-3 speculative decoding gives a 4-40x speedup over the naive approach"

13. Formulas Quick Reference¶

Self-Attention¶

\[\text{Attn}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]

RoPE¶

\[\text{RoPE}(x_m, m) = R_m x_m\]

Where \(R_m\) is rotation matrix at position \(m\)

ALiBi¶

\[\text{score}(q_i, k_j) = q_i \cdot k_j - m \cdot |i - j|\]

Linear Attention¶

\[\text{Attn}(Q, K, V) \approx \phi(Q)(\phi(K)^TV)\]

KV Cache Memory¶

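\[M = 2 \times L \times H \times d \times T \times 2 \text{ bytes}\]

Where the leading 2 counts K and V, \(L\) is the number of layers, \(H\) the number of KV heads, \(d\) the head dimension, \(T\) the number of cached tokens, and the trailing 2 bytes assumes FP16/BF16 values (per sequence)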

GQA Memory¶

\[M_{GQA} = \frac{G}{H} \times M_{MHA}\]
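
The two memory formulas can be combined into a small calculator. The sketch below assumes a Llama-2-70B-like shape (80 layers, 64 query heads, head dimension 128), FP16 values, and a single 4096-token sequence; the function name and the chosen sequence length are illustrative.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_value=2):
    """2 (K and V) x layers x KV heads x head dim x tokens x bytes per value."""
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value

# Assumed Llama-2-70B-like shape: 80 layers, 64 query heads, head_dim 128, FP16.
LAYERS, Q_HEADS, HEAD_DIM, TOKENS = 80, 64, 128, 4096

mha = kv_cache_bytes(LAYERS, Q_HEADS, HEAD_DIM, TOKENS)   # one K/V head per query head
gqa = kv_cache_bytes(LAYERS, 8, HEAD_DIM, TOKENS)         # G = 8 K/V heads
mqa = kv_cache_bytes(LAYERS, 1, HEAD_DIM, TOKENS)         # single K/V head

for name, size in (("MHA", mha), ("GQA", gqa), ("MQA", mqa)):
    print(f"{name}: {size / 2**30:.3f} GiB per sequence")
```

The ratio between the GQA and MHA results is exactly G/H, matching the formula above; batch size multiplies every figure linearly.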
