DL Interview: Architectures
~4 minute read
CNN, RNN/LSTM, Self-Attention, Positional Encodings, KV-Cache, Flash Attention. Architecture questions are among the most common in ML interviews.
CNN
Q: How do you compute the output size of a convolution?
A:
Formula: $$H_{out} = \left\lfloor\frac{H_{in} + 2P - K}{S}\right\rfloor + 1$$
Example: \(H_{in} = 224\), \(K = 3\), \(P = 1\), \(S = 1\): $$H_{out} = \left\lfloor\frac{224 + 2 - 3}{1}\right\rfloor + 1 = 224$$
Same padding: \(P = (K-1)/2\) for \(S=1\).
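A minimal sketch of the same formula as a helper function (the name `conv_out_size` is just for illustration):

```python
def conv_out_size(h_in: int, k: int, p: int = 0, s: int = 1) -> int:
    """Output spatial size of a convolution: floor((H_in + 2P - K) / S) + 1."""
    return (h_in + 2 * p - k) // s + 1

print(conv_out_size(224, k=3, p=1, s=1))  # 224 -- "same" padding at stride 1
print(conv_out_size(224, k=3, p=1, s=2))  # 112 -- stride 2 halves the resolution
```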
Q: Receptive field -- what is it and how is it computed?
A:
Definition: The region of the input that influences a single output neuron.
Calculation (three stacked 3x3 convs, stride 1):
- Layer 1 (3x3 conv): RF = 3
- Layer 2 (3x3 conv): RF = 3 + (3-1)·1 = 5
- Layer 3 (3x3 conv): RF = 5 + (3-1)·1 = 7
Formula: \(RF_l = RF_{l-1} + (K_l - 1) \times \prod_{i=1}^{l-1} S_i\)
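The same recurrence as a short Python sketch (layers given as hypothetical (kernel, stride) pairs):

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples. Returns the RF w.r.t. the input."""
    rf, jump = 1, 1                # jump = product of strides of all previous layers
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7  -- three 3x3 stride-1 convs
print(receptive_field([(7, 2), (3, 2), (3, 1)]))  # 19 -- strides compound the growth
```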
RNN/LSTM
Q: How does the LSTM address vanishing gradients?
A:
Key insight: The cell state \(C_t\) has an additive (linear) update: $$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$
Gradient through time (treating the gates as constants): $$\frac{\partial C_t}{\partial C_{t-k}} = \prod_{i=0}^{k-1} f_{t-i}$$
If \(f_t \approx 1\), the gradient is preserved!
Gate control: the network learns when to forget and when to remember.
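A tiny numeric illustration (plain Python, purely illustrative values): a constant per-step factor below 1 vanishes over long horizons, while forget gates near 1 keep the gradient alive.

```python
# Product of per-step factors over 100 time steps
print(0.9 ** 100)    # ~2.7e-05 -- gradient effectively vanishes
print(0.999 ** 100)  # ~0.90    -- forget gate close to 1 preserves the gradient
```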
Q: LSTM vs GRU
A:
| LSTM | GRU |
|---|---|
| 3 gates (forget, input, output) | 2 gates (reset, update) |
| Separate cell state | No separate cell state |
| More parameters | Fewer parameters |
| Slightly better on long sequences | Faster training |
Practical: GRU for simpler tasks, LSTM for complex long-range dependencies.
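The parameter difference is easy to see directly in PyTorch (dimensions here are arbitrary):

```python
import torch.nn as nn

d_in, d_hidden = 256, 512
lstm = nn.LSTM(d_in, d_hidden)  # 4 gate blocks: input, forget, cell candidate, output
gru = nn.GRU(d_in, d_hidden)    # 3 gate blocks: reset, update, candidate

print(sum(p.numel() for p in lstm.parameters()))  # 4 * (d_in + d_hidden + 2) * d_hidden
print(sum(p.numel() for p in gru.parameters()))   # 3 * (d_in + d_hidden + 2) * d_hidden, ~25% fewer
```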
Attention
Q: Self-attention -- how does it work?
A:
Input: Sequence \(X \in \mathbb{R}^{n \times d}\)
Projections: $$Q = XW^Q, \quad K = XW^K, \quad V = XW^V$$
Attention: $$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
Intuition:
- \(QK^T\) = similarity scores between all pairs
- Softmax = attention weights
- Weighted sum of \(V\) = context
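A minimal single-head sketch in PyTorch (random weights, no masking; just to make the three steps concrete):

```python
import torch
import torch.nn.functional as F

n, d, d_k = 10, 64, 64                      # sequence length, model dim, head dim
X = torch.randn(n, d)
W_q, W_k, W_v = (torch.randn(d, d_k) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v         # projections
scores = Q @ K.T / d_k ** 0.5               # [n, n] similarity scores
weights = F.softmax(scores, dim=-1)         # attention weights, each row sums to 1
context = weights @ V                       # [n, d_k] weighted sum of values
```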
Q: Why divide by \(\sqrt{d_k}\)?
A:
Problem: Large \(d_k\) → large dot products → softmax becomes "one-hot" → small gradients.
Example: \(d_k = 512\) → unscaled logits become large, and softmax saturates: $$\text{softmax}([500, 510, 490]) \approx [0, 1, 0]$$
Solution: Scaling by \(\sqrt{d_k}\) keeps the variance stable. For independent, zero-mean components, $$\text{Var}(q \cdot k) = d_k \cdot \text{Var}(q_i) \cdot \text{Var}(k_i)$$ so dividing by \(\sqrt{d_k}\) brings the scores back to unit scale.
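A quick empirical check (PyTorch, random unit-variance vectors):

```python
import torch

d_k = 512
q = torch.randn(10000, d_k)               # zero-mean, unit-variance components
k = torch.randn(10000, d_k)

scores = (q * k).sum(dim=-1)              # unscaled dot products
print(scores.var().item())                # ~512 (= d_k)
print((scores / d_k ** 0.5).var().item()) # ~1 after scaling by sqrt(d_k)
```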
Q: Why multi-head attention?
A:
Single head: One set of attention patterns.
Multi-head: Each head learns different relationships:
- Head 1: syntactic dependencies
- Head 2: semantic similarity
- Head 3: position patterns
- ...
Formula: $$\text{MultiHead} = \text{Concat}(\text{head}_1, ..., \text{head}_h)W^O$$
Typical: \(h = 8\) or \(12\), \(d_k = d/h = 64\)
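The split-into-heads bookkeeping is just a reshape (PyTorch sketch, shapes only, projections omitted):

```python
import torch

n, d, h = 10, 512, 8                         # tokens, model dim, heads
d_k = d // h                                 # 64 dims per head

Q = torch.randn(n, d)
Q_heads = Q.view(n, h, d_k).transpose(0, 1)  # [h, n, d_k]: each head sees its own d_k-dim slice
# ... attention runs independently in each head ...
out = Q_heads.transpose(0, 1).reshape(n, d)  # Concat(head_1, ..., head_h); then multiply by W^O
```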
Q: Causal mask in the decoder
A:
Problem: During generation, a token must not see future tokens.
Solution: Mask future positions with \(-\infty\):
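A minimal sketch in PyTorch (upper-triangular boolean mask; `scores` stands for the raw \(QK^T/\sqrt{d_k}\) matrix):

```python
import torch
import torch.nn.functional as F

n = 5
scores = torch.randn(n, n)                                         # raw attention scores
mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)  # True above the diagonal = future
scores = scores.masked_fill(mask, float('-inf'))                   # future positions -> -inf
weights = F.softmax(scores, dim=-1)                                # each row attends only to positions <= its own
print(weights[0])                                                  # row 0 puts all its mass on position 0
```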
After softmax: \(e^{-\infty} = 0\) → no attention to future.
Misconception: more attention heads = better
Increasing the number of heads with \(d_{model}\) fixed shrinks \(d_k = d_{model}/h\): each head "sees" a less informative subspace. At \(d_k < 32\) the quality degradation becomes noticeable. Llama 3 70B uses 64 heads with \(d_{model}=8192\) (\(d_k=128\)) rather than 128 heads with \(d_k=64\). The optimal \(d_k\) is usually 64-128.
Positional Encodings
Q: Why are positional encodings needed?
A:
Problem: Self-attention is permutation equivariant: permuting the input tokens only permutes the outputs, so the model itself carries no notion of order: $$\text{Attention}(\text{permute}(X)) = \text{permute}(\text{Attention}(X))$$
Solution: Add positional information to the input: $$X_{input} = X_{embedding} + PE$$
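For example, the classic sinusoidal encoding can be sketched as follows (PyTorch, assumes an even \(d_{model}\)):

```python
import torch

def sinusoidal_pe(max_len: int, d_model: int) -> torch.Tensor:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)),  PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)   # [max_len, 1]
    i = torch.arange(0, d_model, 2, dtype=torch.float)            # even dimensions
    freq = 1.0 / (10000 ** (i / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * freq)
    pe[:, 1::2] = torch.cos(pos * freq)
    return pe

# X_input = X_embedding + sinusoidal_pe(seq_len, d_model)
```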
Q: Sinusoidal vs Learned vs RoPE
A:
| Method | Pros | Cons |
|---|---|---|
| Sinusoidal | Fixed, extrapolates | Less flexible |
| Learned | Flexible | Doesn't extrapolate |
| RoPE | Relative, extrapolates | More complex |
2025 Standard: RoPE (Rotary Position Embedding) -- LLaMA, Qwen, Mistral.
Q: How does RoPE work?
A:
Key idea: Encode position as a rotation in the complex plane.
For a 2D pair: $$f(x, pos) = \begin{pmatrix} \cos(pos \cdot \theta) & -\sin(pos \cdot \theta) \\ \sin(pos \cdot \theta) & \cos(pos \cdot \theta) \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$$
Property: the dot product of rotated queries and keys depends only on the offset: $$\langle f(q, m), f(k, n) \rangle = \langle q, f(k, n-m) \rangle = g(q, k, m-n)$$
Attention depends only on relative position \(m-n\)!
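A small numeric check of that property (plain PyTorch, 2D case, arbitrary \(\theta\)):

```python
import math
import torch

def rotate(x: torch.Tensor, pos: int, theta: float = 0.3) -> torch.Tensor:
    """Apply the 2D RoPE rotation by angle pos * theta."""
    c, s = math.cos(pos * theta), math.sin(pos * theta)
    return torch.tensor([[c, -s], [s, c]]) @ x

q, k = torch.randn(2), torch.randn(2)
# Same relative offset (m - n = 3) at different absolute positions -> identical score
print(torch.dot(rotate(q, 10), rotate(k, 7)))
print(torch.dot(rotate(q, 103), rotate(k, 100)))
```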
Advanced Questions (Senior+)
Q: KV-Cache -- what is it and why?
A:
Problem: During generation, every new token would require recomputing attention over all previous tokens.
Solution: Cache \(K\) and \(V\) for the previous tokens.
```python
import torch

# Without cache: O(n^2) work per new token (K, V recomputed for the whole prefix)
# With cache:    O(n) work per new token

# Generation loop; compute_q / compute_k / compute_v are the projection functions,
# x holds the token embeddings, and k_cache / v_cache start as empty [batch, 0, d] tensors
for i in range(max_len):
    # Only compute Q for the new token
    q_new = compute_q(x[i])
    # Reuse cached K, V; append only the new token's projections
    k_cache = torch.cat([k_cache, compute_k(x[i])], dim=1)
    v_cache = torch.cat([v_cache, compute_v(x[i])], dim=1)
    out = attention(q_new, k_cache, v_cache)
```
Memory per layer: \(O(2 \times seq\_len \times H \times D)\) (K and V). Total over \(L\) layers: \(O(2 \times L \times seq\_len \times H \times D)\), where \(H\) is the number of heads and \(D\) the head dimension.
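A back-of-the-envelope size estimate (illustrative dimensions roughly matching a 7B-class model, fp16 cache):

```python
# KV cache size = 2 (K and V) * layers * seq_len * heads * head_dim * bytes per value
layers, seq_len, heads, head_dim, bytes_fp16 = 32, 4096, 32, 128, 2
cache_bytes = 2 * layers * seq_len * heads * head_dim * bytes_fp16
print(cache_bytes / 2**30, "GiB")  # 2.0 GiB for a single 4096-token sequence
```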
Q: Flash Attention -- key idea?
A:
Problem: Standard attention = \(O(N^2)\) memory (attention matrix).
Key insight: Never materialize full \(N \times N\) attention matrix!
Algorithm:
1. Divide Q, K, V into blocks
2. Compute attention block by block
3. Use the online softmax trick (log-sum-exp)
4. Write only the output, not the attention weights
Result: \(O(N)\) memory, same computation!
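A toy demonstration of the online-softmax part (blocking over K/V only; nothing here about the SRAM/HBM tiling that is the other half of Flash Attention):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, d, block = 16, 8, 4
Q, K, V = torch.randn(n, d), torch.randn(n, d), torch.randn(n, d)

# Reference: materializes the full n x n attention matrix
ref = F.softmax(Q @ K.T / d ** 0.5, dim=-1) @ V

# Blockwise pass: keep only a running max m, normalizer l, and output accumulator acc
m = torch.full((n, 1), float('-inf'))
l = torch.zeros(n, 1)
acc = torch.zeros(n, d)
for j in range(0, n, block):
    S = Q @ K[j:j+block].T / d ** 0.5              # scores against one K block, [n, block]
    m_new = torch.maximum(m, S.max(dim=-1, keepdim=True).values)
    P = torch.exp(S - m_new)                       # unnormalized probabilities for this block
    scale = torch.exp(m - m_new)                   # rescale previously accumulated statistics
    l = l * scale + P.sum(dim=-1, keepdim=True)
    acc = acc * scale + P @ V[j:j+block]
    m = m_new
out = acc / l
print(torch.allclose(out, ref, atol=1e-5))         # True: same result, no n x n matrix stored
```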
Q: Gradient checkpointing -- tradeoffs?
A:
Idea: Don't store all activations, recompute when needed.
Trade-offs:
- Memory: 50-70% reduction
- Compute: 20-30% increase (recomputation)
When to use:
- Model doesn't fit in GPU memory
- Training very deep networks
- Batch size limited by memory
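Typical usage via `torch.utils.checkpoint` (a minimal sketch; the block and shapes are arbitrary):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
x = torch.randn(8, 1024, requires_grad=True)

# Activations inside `block` are not stored; they are recomputed during backward
out = checkpoint(block, x, use_reentrant=False)
out.sum().backward()
```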