
DL Interview: Architectures

~4 minute read


CNN, RNN/LSTM, Self-Attention, Positional Encodings, KV-Cache, Flash Attention. Architecture questions are among the most common in ML interviews.


CNN

Q: How do you compute the output size of a convolution?

A:

\[H_{out} = \lfloor\frac{H_{in} + 2P - K}{S}\rfloor + 1\]

Example: \(H_{in} = 224\), \(K = 3\), \(P = 1\), \(S = 1\): \[H_{out} = \lfloor\frac{224 + 2 - 3}{1}\rfloor + 1 = 224\]

Same padding: \(P = (K-1)/2\) for \(S=1\).
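A quick PyTorch sanity check of the formula on the example above (the channel counts are arbitrary):

import torch
import torch.nn as nn

# H_out = floor((224 + 2*1 - 3) / 1) + 1 = 224
conv = nn.Conv2d(3, 8, kernel_size=3, padding=1, stride=1)
x = torch.randn(1, 3, 224, 224)
print(conv(x).shape)  # torch.Size([1, 8, 224, 224])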

Q: Receptive field: what is it and how is it computed?

A:

Definition: The region of the input that influences a single output neuron.

Calculation:
- Layer 1 (3x3 conv, stride 1): RF = 3
- Layer 2 (3x3 conv, stride 2): RF = 3 + (3-1)*1 = 5
- Layer 3 (3x3 conv): RF = 5 + (3-1)*(1*2) = 9

Formula: \(RF_l = RF_{l-1} + (K_l - 1) \times \prod_{i=1}^{l-1} S_i\)
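A small helper that applies this formula iteratively; a minimal sketch where `layers` is a list of (kernel, stride) pairs:

# jump = product of strides of all previous layers
def receptive_field(layers):
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

print(receptive_field([(3, 1), (3, 2), (3, 1)]))  # 9, matching the example above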


RNN/LSTM

Q: How does LSTM solve the vanishing gradient problem?

A:

Key insight: The cell state \(C_t\) has a linear (additive) update: \[C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t\]

Gradient through time: \[\frac{\partial C_t}{\partial C_{t-k}} = \prod_{i=0}^{k-1} f_{t-i}\]

If \(f_t \approx 1\), the gradient is preserved!

Gate control: the network learns when to forget and when to remember.
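A minimal sketch of one LSTM cell step (biases omitted, weight shapes illustrative) to make the additive cell-state update concrete:

import torch

def lstm_step(x, h, c, W_x, W_h):
    z = x @ W_x + h @ W_h                      # all four gates in one matmul
    i, f, o, g = z.chunk(4, dim=-1)
    i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
    g = torch.tanh(g)                          # candidate cell state C~_t
    c_new = f * c + i * g                      # linear (additive) update
    h_new = o * torch.tanh(c_new)
    return h_new, c_new

d = 32
x, h, c = torch.randn(1, d), torch.zeros(1, d), torch.zeros(1, d)
W_x, W_h = torch.randn(d, 4 * d), torch.randn(d, 4 * d)
h, c = lstm_step(x, h, c, W_x, W_h)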

Q: LSTM vs GRU

A:

LSTM | GRU
3 gates (forget, input, output) | 2 gates (reset, update)
Separate cell state | No separate cell state
More parameters | Fewer parameters
Slightly better on long sequences | Faster training

Practical: GRU for simpler tasks, LSTM for complex long-range dependencies.


Attention

Q: Self-attention: how does it work?

A:

Input: Sequence \(X \in \mathbb{R}^{n \times d}\)

Projections: \[Q = XW^Q, \quad K = XW^K, \quad V = XW^V\]

Attention: \[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]

Intuition:
- \(QK^T\) = similarity scores between all pairs
- Softmax = attention weights
- Weighted sum of \(V\) = context
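A minimal single-head sketch of these three steps in PyTorch (dimensions are illustrative):

import torch
import torch.nn.functional as F

n, d, d_k = 10, 64, 64
X = torch.randn(n, d)
W_q, W_k, W_v = (torch.randn(d, d_k) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v
scores = Q @ K.T / d_k ** 0.5         # (n, n) similarity scores between all pairs
weights = F.softmax(scores, dim=-1)   # attention weights
context = weights @ V                 # weighted sum of values, shape (n, d_k)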

Q: Why divide by \(\sqrt{d_k}\)?

A:

Problem: Large \(d_k\) → large dot products → softmax becomes "one-hot" → small gradients.

Example: with \(d_k = 512\), unscaled dot products can easily reach values in the hundreds: \[\text{softmax}([500, 510, 490]) \approx [0, 1, 0]\]

Solution: Scaling by \(\sqrt{d_k}\) keeps the variance stable. For zero-mean, independent components, \[\text{Var}(q \cdot k) = d_k \cdot \text{Var}(q_i) \cdot \text{Var}(k_i),\] so dividing by \(\sqrt{d_k}\) brings it back down.
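A quick empirical check, assuming unit-variance components for q and k (sample size is arbitrary):

import torch

d_k = 512
q, k = torch.randn(10_000, d_k), torch.randn(10_000, d_k)
dots = (q * k).sum(dim=-1)
print(dots.var())                  # ~ 512
print((dots / d_k ** 0.5).var())   # ~ 1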

Q: Why multi-head attention?

A:

Single head: One set of attention patterns.

Multi-head: Each head learns different relationships:
- Head 1: syntactic dependencies
- Head 2: semantic similarity
- Head 3: position patterns
- ...

Formula: \[\text{MultiHead} = \text{Concat}(\text{head}_1, ..., \text{head}_h)W^O\]

Typical: \(h = 8\) or \(12\), \(d_k = d/h = 64\)
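A shape-level sketch of the head split and concat (the W^Q/W^K/W^V/W^O projections are omitted; sizes are illustrative):

import torch

n, d_model, h = 10, 512, 8
d_k = d_model // h                                   # 64 per head
Q = torch.randn(1, n, d_model)                       # stand-in for XW^Q
K = torch.randn(1, n, d_model)                       # stand-in for XW^K
V = torch.randn(1, n, d_model)                       # stand-in for XW^V

def split_heads(t):                                  # (1, n, d_model) -> (1, h, n, d_k)
    return t.view(1, n, h, d_k).transpose(1, 2)

scores = split_heads(Q) @ split_heads(K).transpose(-2, -1) / d_k ** 0.5  # (1, h, n, n)
per_head = torch.softmax(scores, dim=-1) @ split_heads(V)                # (1, h, n, d_k)
out = per_head.transpose(1, 2).reshape(1, n, d_model)                    # Concat(head_1..head_h)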

Q: Causal mask in the decoder

A:

Problem: During generation, a token must not attend to future tokens.

Solution: Mask future positions with \(-\infty\):

[[0, -inf, -inf, -inf],
 [0,  0, -inf, -inf],
 [0,  0,  0, -inf],
 [0,  0,  0,  0]]

After softmax: \(e^{-\infty} = 0\) → no attention to future.
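A minimal sketch of building and applying this mask in PyTorch:

import torch

n = 4
mask = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)
scores = torch.randn(n, n) + mask           # future positions become -inf
weights = torch.softmax(scores, dim=-1)     # upper triangle is exactly 0 after softmax
print(mask)
print(weights)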

Misconception: more attention heads = better

Increasing the number of heads with \(d_{model}\) fixed shrinks \(d_k = d_{model}/h\): each head attends in a less informative subspace. With \(d_k < 32\), quality degradation becomes noticeable. Llama 3 70B uses 64 heads with \(d_{model}=8192\) (\(d_k=128\)), not 128 heads with \(d_k=64\). The optimal \(d_k\) is typically 64-128.


Positional Encodings

Q: Why are positional encodings needed?

A:

Problem: Self-attention has no notion of token order: permuting the input just permutes the output (permutation equivariance). \[\text{Attention}(\text{permute}(X)) = \text{permute}(\text{Attention}(X))\]

Solution: Add positional information to the input: \[X_{input} = X_{embedding} + PE\]
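One concrete choice is the sinusoidal encoding from the original Transformer; a minimal sketch (assumes an even embedding dimension d):

import torch

def sinusoidal_pe(n, d):
    pos = torch.arange(n, dtype=torch.float32).unsqueeze(1)      # (n, 1)
    i = torch.arange(0, d, 2, dtype=torch.float32)               # even dimensions
    angles = pos / (10000 ** (i / d))                            # (n, d/2)
    pe = torch.zeros(n, d)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

X_input = torch.randn(16, 128) + sinusoidal_pe(16, 128)          # X_embedding + PE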

Q: Sinusoidal vs Learned vs RoPE

A:

Method | Pros | Cons
Sinusoidal | Fixed, extrapolates | Less flexible
Learned | Flexible | Doesn't extrapolate
RoPE | Relative, extrapolates | More complex

2025 standard: RoPE (Rotary Position Embedding), used in LLaMA, Qwen, Mistral.

Q: How does RoPE work?

A:

Key idea: Encode position as a rotation in the complex plane.

For 2D: \[f(x, pos) = \begin{pmatrix} \cos(pos \cdot \theta) & -\sin(pos \cdot \theta) \\ \sin(pos \cdot \theta) & \cos(pos \cdot \theta) \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}\]

Property: \[\langle f(q, m), f(k, n) \rangle = q^\top R_{(n-m)\theta}\, k,\] where \(R_\alpha\) denotes the rotation matrix by angle \(\alpha\).

Attention depends only on relative position \(m-n\)!
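A minimal sketch on a single 2D feature pair, showing that the score depends only on the offset (the value of theta here is arbitrary and illustrative):

import math
import torch

def rope_2d(x, pos, theta=0.1):
    c, s = math.cos(pos * theta), math.sin(pos * theta)
    x1, x2 = x[..., 0], x[..., 1]
    return torch.stack([x1 * c - x2 * s, x1 * s + x2 * c], dim=-1)

q, k = torch.randn(2), torch.randn(2)
print(rope_2d(q, 3) @ rope_2d(k, 7))     # positions 3 and 7, offset 4
print(rope_2d(q, 10) @ rope_2d(k, 14))   # same offset 4 -> same score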


Сложные вопросы (Senior+)

Q: KV-Cache: what is it and why?

A:

Problem: During autoregressive generation, each new token attends to all previous tokens; without caching, their keys and values would be recomputed at every step.

Solution: Cache \(K\) and \(V\) for previous tokens.

import torch
import torch.nn.functional as F

# Without cache: O(n^2) work per new token (K, V recomputed for the whole prefix)
# With cache:    O(n) work per new token
d = 64
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))   # toy projection weights
x = torch.randn(16, d)                                   # one embedding per generated token
k_cache = torch.empty(0, d)
v_cache = torch.empty(0, d)

# Generation
for i in range(x.size(0)):
    # Only compute Q for the new token
    q_new = x[i:i+1] @ W_q
    # Append the new token's K, V and reuse the cached ones
    k_cache = torch.cat([k_cache, x[i:i+1] @ W_k], dim=0)
    v_cache = torch.cat([v_cache, x[i:i+1] @ W_v], dim=0)
    attn = F.softmax(q_new @ k_cache.T / d ** 0.5, dim=-1)  # (1, i+1)
    out = attn @ v_cache                                     # context for the new token

Memory per layer: \(O(2 \times seq\_len \times H \times D)\) (for K and V). Total: \(O(2 \times L \times seq\_len \times H \times D)\), where \(L\) is the number of layers, \(H\) the number of heads, and \(D\) the head dimension.
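A back-of-the-envelope example for a hypothetical fp16 model (all numbers are illustrative):

# L=32 layers, H=32 heads, head dim D=128, 8k-token context, 2 bytes per value
L, H, D, seq_len, bytes_per_value = 32, 32, 128, 8192, 2
cache_bytes = 2 * L * seq_len * H * D * bytes_per_value   # factor 2 for K and V
print(cache_bytes / 2**30, "GiB per sequence")             # 4.0 GiB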

Q: Flash Attention: key idea?

A:

Problem: Standard attention = \(O(N^2)\) memory (attention matrix).

Key insight: Never materialize full \(N \times N\) attention matrix!

Algorithm:
1. Divide Q, K, V into blocks
2. Compute attention block by block
3. Use the online softmax trick (log-sum-exp; see the sketch below)
4. Write only the output, not the attention weights

Result: \(O(N)\) memory, same computation!
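A minimal sketch of the online softmax trick from step 3, ignoring the blocking of Q: the scores for one query row arrive block by block, and only a running max and normalizer are kept, so the full row is never stored (function and variable names are illustrative):

import torch

def online_softmax_weighted_sum(score_blocks, value_blocks):
    m = torch.tensor(float("-inf"))        # running max
    denom = torch.tensor(0.0)              # running sum of exp(score - m)
    acc = torch.zeros(value_blocks[0].shape[-1])
    for s, v in zip(score_blocks, value_blocks):
        m_new = torch.maximum(m, s.max())
        scale = torch.exp(m - m_new)       # rescale old accumulator to the new max
        p = torch.exp(s - m_new)
        denom = denom * scale + p.sum()
        acc = acc * scale + p @ v
        m = m_new
    return acc / denom

# Matches the "all at once" result:
scores, values = torch.randn(16), torch.randn(16, 8)
print(torch.allclose(online_softmax_weighted_sum(scores.chunk(4), values.chunk(4)),
                     torch.softmax(scores, dim=-1) @ values, atol=1e-6))  # True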

Q: Gradient checkpointing: tradeoffs?

A:

Idea: Don't store all activations; recompute them during the backward pass when needed (see the sketch below).

Trade-off:
- Memory: 50-70% reduction
- Compute: 20-30% increase (recomputation)

When to use:
- Model doesn't fit in GPU memory
- Training very deep networks
- Batch size limited by memory
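A minimal sketch using PyTorch's checkpoint_sequential utility (model and sizes are illustrative; the use_reentrant argument is available in recent PyTorch versions):

import torch
from torch.utils.checkpoint import checkpoint_sequential

# Only the boundaries of 4 segments keep activations; everything inside a
# segment is recomputed during the backward pass.
model = torch.nn.Sequential(*[torch.nn.Linear(256, 256) for _ in range(32)])
x = torch.randn(8, 256, requires_grad=True)
out = checkpoint_sequential(model, 4, x, use_reentrant=False)
out.sum().backward()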