
DL Interview: Architectures

~4 minute read


CNN, RNN/LSTM, Self-Attention, Positional Encodings, KV-Cache, Flash Attention. Architecture questions are among the most common in ML interviews.


CNN

Q: How do you compute the output size of a convolution?

A:

\[H_{out} = \lfloor\frac{H_{in} + 2P - K}{S}\rfloor + 1\]

Example: \(H_{in} = 224\), \(K = 3\), \(P = 1\), \(S = 1\): \[H_{out} = \lfloor\frac{224 + 2 - 3}{1}\rfloor + 1 = 224\]

Same padding: \(P = (K-1)/2\) for \(S=1\).
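A quick PyTorch sanity check of the formula on the example above (the channel counts are arbitrary):

import torch
import torch.nn as nn

# H_out = floor((224 + 2*1 - 3) / 1) + 1 = 224
conv = nn.Conv2d(3, 8, kernel_size=3, padding=1, stride=1)
x = torch.randn(1, 3, 224, 224)
print(conv(x).shape)  # torch.Size([1, 8, 224, 224])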

Q: Receptive field: what is it and how is it computed?

A:

Definition: The region of the input that influences a single output neuron.

Calculation:
- Layer 1 (3x3 conv, stride 1): RF = 3
- Layer 2 (3x3 conv, stride 2): RF = 3 + (3-1)*1 = 5
- Layer 3 (3x3 conv): RF = 5 + (3-1)*(1*2) = 9

Formula: \(RF_l = RF_{l-1} + (K_l - 1) \times \prod_{i=1}^{l-1} S_i\)
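A small helper that applies this formula iteratively; a minimal sketch where `layers` is a list of (kernel, stride) pairs:

# jump = product of strides of all previous layers
def receptive_field(layers):
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

print(receptive_field([(3, 1), (3, 2), (3, 1)]))  # 9, matching the example above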


RNN/LSTM

Q: How does LSTM solve the vanishing gradient problem?

A:

Key insight: The cell state \(C_t\) has a linear (additive) update: \[C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t\]

Gradient through time: \[\frac{\partial C_t}{\partial C_{t-k}} = \prod_{i=0}^{k-1} f_{t-i}\]

If \(f_t \approx 1\), the gradient is preserved!

Gate control: the network learns when to forget and when to remember.
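A minimal sketch of one LSTM cell step (biases omitted, weight shapes illustrative) to make the additive cell-state update concrete:

import torch

def lstm_step(x, h, c, W_x, W_h):
    z = x @ W_x + h @ W_h                      # all four gates in one matmul
    i, f, o, g = z.chunk(4, dim=-1)
    i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
    g = torch.tanh(g)                          # candidate cell state C~_t
    c_new = f * c + i * g                      # linear (additive) update
    h_new = o * torch.tanh(c_new)
    return h_new, c_new

d = 32
x, h, c = torch.randn(1, d), torch.zeros(1, d), torch.zeros(1, d)
W_x, W_h = torch.randn(d, 4 * d), torch.randn(d, 4 * d)
h, c = lstm_step(x, h, c, W_x, W_h)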

Q: LSTM vs GRU

A:

LSTM | GRU
3 gates (forget, input, output) | 2 gates (reset, update)
Separate cell state | No separate cell state
More parameters | Fewer parameters
Slightly better on long sequences | Faster training

Practical: GRU for simpler tasks, LSTM for complex long-range dependencies.


Attention

Q: Self-attention: how does it work?

A:

Input: Sequence \(X \in \mathbb{R}^{n \times d}\)

Projections: \[Q = XW^Q, \quad K = XW^K, \quad V = XW^V\]

Attention: \[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]

Intuition:
- \(QK^T\) = similarity scores between all pairs
- Softmax = attention weights
- Weighted sum of \(V\) = context
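A minimal single-head sketch of these three steps in PyTorch (dimensions are illustrative):

import torch
import torch.nn.functional as F

n, d, d_k = 10, 64, 64
X = torch.randn(n, d)
W_q, W_k, W_v = (torch.randn(d, d_k) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v
scores = Q @ K.T / d_k ** 0.5         # (n, n) similarity scores between all pairs
weights = F.softmax(scores, dim=-1)   # attention weights
context = weights @ V                 # weighted sum of values, shape (n, d_k)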

Q: Why divide by \(\sqrt{d_k}\)?

A:

Problem: Large \(d_k\) → large dot products → softmax becomes "one-hot" → small gradients.

Example: with \(d_k = 512\), unscaled dot products can easily reach values in the hundreds: \[\text{softmax}([500, 510, 490]) \approx [0, 1, 0]\]

Solution: Scaling by \(\sqrt{d_k}\) keeps the variance stable. For zero-mean, independent components, \[\text{Var}(q \cdot k) = d_k \cdot \text{Var}(q_i) \cdot \text{Var}(k_i),\] so dividing by \(\sqrt{d_k}\) brings it back down.
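A quick empirical check, assuming unit-variance components for q and k (sample size is arbitrary):

import torch

d_k = 512
q, k = torch.randn(10_000, d_k), torch.randn(10_000, d_k)
dots = (q * k).sum(dim=-1)
print(dots.var())                  # ~ 512
print((dots / d_k ** 0.5).var())   # ~ 1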

Q: Why multi-head attention?

A:

Single head: One set of attention patterns.

Multi-head: Each head learns different relationships:
- Head 1: syntactic dependencies
- Head 2: semantic similarity
- Head 3: position patterns
- ...

Formula: \[\text{MultiHead} = \text{Concat}(\text{head}_1, ..., \text{head}_h)W^O\]

Typical: \(h = 8\) or \(12\), \(d_k = d/h = 64\)
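A shape-level sketch of the head split and concat (the W^Q/W^K/W^V/W^O projections are omitted; sizes are illustrative):

import torch

n, d_model, h = 10, 512, 8
d_k = d_model // h                                   # 64 per head
Q = torch.randn(1, n, d_model)                       # stand-in for XW^Q
K = torch.randn(1, n, d_model)                       # stand-in for XW^K
V = torch.randn(1, n, d_model)                       # stand-in for XW^V

def split_heads(t):                                  # (1, n, d_model) -> (1, h, n, d_k)
    return t.view(1, n, h, d_k).transpose(1, 2)

scores = split_heads(Q) @ split_heads(K).transpose(-2, -1) / d_k ** 0.5  # (1, h, n, n)
per_head = torch.softmax(scores, dim=-1) @ split_heads(V)                # (1, h, n, d_k)
out = per_head.transpose(1, 2).reshape(1, n, d_model)                    # Concat(head_1..head_h)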

Q: Causal mask in the decoder

A:

Problem: During generation, a token must not attend to future tokens.

Solution: Mask future positions with \(-\infty\):

[[0, -inf, -inf, -inf],
 [0,  0, -inf, -inf],
 [0,  0,  0, -inf],
 [0,  0,  0,  0]]

After softmax: \(e^{-\infty} = 0\) → no attention to future.
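A minimal sketch of building and applying this mask in PyTorch:

import torch

n = 4
mask = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)
scores = torch.randn(n, n) + mask           # future positions become -inf
weights = torch.softmax(scores, dim=-1)     # upper triangle is exactly 0 after softmax
print(mask)
print(weights)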

Misconception: more attention heads = better

Increasing the number of heads with \(d_{model}\) fixed shrinks \(d_k = d_{model}/h\): each head attends in a less informative subspace. With \(d_k < 32\), quality degradation becomes noticeable. Llama 3 70B uses 64 heads with \(d_{model}=8192\) (\(d_k=128\)), not 128 heads with \(d_k=64\). The optimal \(d_k\) is typically 64-128.


Positional Encodings

Q: Why are positional encodings needed?

A:

Problem: Self-attention has no notion of token order: permuting the input just permutes the output (permutation equivariance). \[\text{Attention}(\text{permute}(X)) = \text{permute}(\text{Attention}(X))\]

Solution: Add positional information to the input: \[X_{input} = X_{embedding} + PE\]
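One concrete choice is the sinusoidal encoding from the original Transformer; a minimal sketch (assumes an even embedding dimension d):

import torch

def sinusoidal_pe(n, d):
    pos = torch.arange(n, dtype=torch.float32).unsqueeze(1)      # (n, 1)
    i = torch.arange(0, d, 2, dtype=torch.float32)               # even dimensions
    angles = pos / (10000 ** (i / d))                            # (n, d/2)
    pe = torch.zeros(n, d)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

X_input = torch.randn(16, 128) + sinusoidal_pe(16, 128)          # X_embedding + PE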

Q: Sinusoidal vs Learned vs RoPE

A:

Method | Pros | Cons
Sinusoidal | Fixed, extrapolates | Less flexible
Learned | Flexible | Doesn't extrapolate
RoPE | Relative, extrapolates | More complex

2025 standard: RoPE (Rotary Position Embedding), used in LLaMA, Qwen, Mistral.

Q: How does RoPE work?

A:

Key idea: Encode position as a rotation in the complex plane.

For 2D: \[f(x, pos) = \begin{pmatrix} \cos(pos \cdot \theta) & -\sin(pos \cdot \theta) \\ \sin(pos \cdot \theta) & \cos(pos \cdot \theta) \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}\]

Property: \[\langle f(q, m), f(k, n) \rangle = q^\top R_{(n-m)\theta}\, k,\] where \(R_\alpha\) denotes the rotation matrix by angle \(\alpha\).

Attention depends only on relative position \(m-n\)!
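A minimal sketch on a single 2D feature pair, showing that the score depends only on the offset (the value of theta here is arbitrary and illustrative):

import math
import torch

def rope_2d(x, pos, theta=0.1):
    c, s = math.cos(pos * theta), math.sin(pos * theta)
    x1, x2 = x[..., 0], x[..., 1]
    return torch.stack([x1 * c - x2 * s, x1 * s + x2 * c], dim=-1)

q, k = torch.randn(2), torch.randn(2)
print(rope_2d(q, 3) @ rope_2d(k, 7))     # positions 3 and 7, offset 4
print(rope_2d(q, 10) @ rope_2d(k, 14))   # same offset 4 -> same score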


Сложные вопросы (Senior+)

Q: KV-Cache: what is it and why?

A:

Problem: During autoregressive generation, each new token attends to all previous tokens; without caching, their keys and values would be recomputed at every step.

Solution: Cache \(K\) and \(V\) for previous tokens.

import torch
import torch.nn.functional as F

# Without cache: O(n^2) work per new token (K, V recomputed for the whole prefix)
# With cache:    O(n) work per new token
d = 64
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))   # toy projection weights
x = torch.randn(16, d)                                   # one embedding per generated token
k_cache = torch.empty(0, d)
v_cache = torch.empty(0, d)

# Generation
for i in range(x.size(0)):
    # Only compute Q for the new token
    q_new = x[i:i+1] @ W_q
    # Append the new token's K, V and reuse the cached ones
    k_cache = torch.cat([k_cache, x[i:i+1] @ W_k], dim=0)
    v_cache = torch.cat([v_cache, x[i:i+1] @ W_v], dim=0)
    attn = F.softmax(q_new @ k_cache.T / d ** 0.5, dim=-1)  # (1, i+1)
    out = attn @ v_cache                                     # context for the new token

Memory per layer: \(O(2 \times seq\_len \times H \times D)\) (for K and V). Total: \(O(2 \times L \times seq\_len \times H \times D)\), where \(L\) is the number of layers, \(H\) the number of heads, and \(D\) the head dimension.
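A back-of-the-envelope example for a hypothetical fp16 model (all numbers are illustrative):

# L=32 layers, H=32 heads, head dim D=128, 8k-token context, 2 bytes per value
L, H, D, seq_len, bytes_per_value = 32, 32, 128, 8192, 2
cache_bytes = 2 * L * seq_len * H * D * bytes_per_value   # factor 2 for K and V
print(cache_bytes / 2**30, "GiB per sequence")             # 4.0 GiB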

Q: Flash Attention: key idea?

A:

Problem: Standard attention = \(O(N^2)\) memory (attention matrix).

Key insight: Never materialize full \(N \times N\) attention matrix!

Algorithm:
1. Divide Q, K, V into blocks
2. Compute attention block by block
3. Use the online softmax trick (log-sum-exp; see the sketch below)
4. Write only the output, not the attention weights

Result: \(O(N)\) memory, same computation!
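A minimal sketch of the online softmax trick from step 3, ignoring the blocking of Q: the scores for one query row arrive block by block, and only a running max and normalizer are kept, so the full row is never stored (function and variable names are illustrative):

import torch

def online_softmax_weighted_sum(score_blocks, value_blocks):
    m = torch.tensor(float("-inf"))        # running max
    denom = torch.tensor(0.0)              # running sum of exp(score - m)
    acc = torch.zeros(value_blocks[0].shape[-1])
    for s, v in zip(score_blocks, value_blocks):
        m_new = torch.maximum(m, s.max())
        scale = torch.exp(m - m_new)       # rescale old accumulator to the new max
        p = torch.exp(s - m_new)
        denom = denom * scale + p.sum()
        acc = acc * scale + p @ v
        m = m_new
    return acc / denom

# Matches the "all at once" result:
scores, values = torch.randn(16), torch.randn(16, 8)
print(torch.allclose(online_softmax_weighted_sum(scores.chunk(4), values.chunk(4)),
                     torch.softmax(scores, dim=-1) @ values, atol=1e-6))  # True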

Q: Gradient checkpointing: tradeoffs?

A:

Idea: Don't store all activations; recompute them during the backward pass when needed (see the sketch below).

Trade-off:
- Memory: 50-70% reduction
- Compute: 20-30% increase (recomputation)

When to use:
- Model doesn't fit in GPU memory
- Training very deep networks
- Batch size limited by memory
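A minimal sketch using PyTorch's checkpoint_sequential utility (model and sizes are illustrative; the use_reentrant argument is available in recent PyTorch versions):

import torch
from torch.utils.checkpoint import checkpoint_sequential

# Only the boundaries of 4 segments keep activations; everything inside a
# segment is recomputed during the backward pass.
model = torch.nn.Sequential(*[torch.nn.Linear(256, 256) for _ in range(32)])
x = torch.randn(8, 256, requires_grad=True)
out = checkpoint_sequential(model, 4, x, use_reentrant=False)
out.sum().backward()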