Deep Learning: Study Materials¶
~22 minutes of reading
Prerequisites: Math for ML | Classical ML
Deep Learning accounts for roughly 35% of ML interview questions at the Middle+ level (based on Blind and levels.fyi data for 2025). This section collects 18 key topics, from backpropagation and loss functions to MoE, SSM/Mamba, and Vision Transformers. Each topic includes formulas, code, the best resources, and self-check questions.
Materials for the 18 topics in the Deep Learning category. Updated: 2026-02-11
1. Loss Functions (dl_005_loss_functions)¶
Best Resources¶
Blogs: - Lil'Log: Loss Functions — visualization - Distill: Feature Visualization
Papers: - Focal Loss Paper — Lin et al., 2017 - Contrastive Learning Survey
Key Formulas¶
| Loss | Formula | Use Case |
|---|---|---|
| MSE | \(\frac{1}{n}\sum(y-\hat{y})^2\) | Regression |
| BCE | \(-[y\log\hat{y} + (1-y)\log(1-\hat{y})]\) | Binary classification |
| CE | \(-\sum y_i\log\hat{y}_i\) | Multiclass |
| Focal | \(-(1-\hat{y}_t)^\gamma \log(\hat{y}_t)\) | Imbalanced |
| Contrastive | \(\max(0, d_{pos} - d_{neg} + m)\) | Metric learning |
| Triplet | \(\max(0, d(a,p) - d(a,n) + m)\) | Face recognition |
Misconception: MSE is suitable for classification
MSE + sigmoid gives vanishing gradients at saturation: \(\sigma'(z) \to 0\) for \(|z| > 5\). The MSE gradient contains the factor \(\sigma(z)(1-\sigma(z))\), which goes to 0. Cross-entropy does not have this problem: its gradient is proportional to \((\sigma(z) - y)\) and does not vanish. In practice, switching from MSE to CE speeds up convergence by 3-5x on classification tasks.
Misconception: Focal Loss is always better than CE for imbalanced data
Focal Loss with \(\gamma=2\) down-weights the loss for easy examples by a factor of ~100 (at \(p_t=0.9\): \((1-0.9)^2 = 0.01\)). But with moderate imbalance (1:10), plain CE with class weights often works just as well. Focal Loss matters under extreme imbalance (1:1000+), e.g. object detection, where >99% of anchors are background.
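To make the Focal-vs-CE comparison concrete, here is a minimal PyTorch sketch of binary focal loss (the gamma and alpha defaults are illustrative, not tuned values):
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Binary focal loss: down-weights easy examples by (1 - p_t)^gamma."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)               # prob of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)   # class weighting
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()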
2. Backpropagation (nn_001_backprop)¶
Best Resources¶
MUST DO: - Karpathy: micrograd — autograd с нуля - Karpathy: nn-zero-to-hero — курс
YouTube: - 3Blue1Brown: Backpropagation — визуализация - Karpathy: Let's build GPT
Blogs: - Colah's Blog: Backprop — canonical explanation - Chain Rule Explained
Key Concepts¶
Chain Rule: $\(\frac{\partial L}{\partial w} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial z} \cdot \frac{\partial z}{\partial w}\)$
Computational Graph:
graph LR
x["x"] --> f["f"]
f --> z["z"]
z --> g["g"]
g --> y["y"]
y --> L["L"]
L --> loss["loss"]
f -. "dz/dx" .-> fd["grad"]
g -. "dy/dz" .-> gd["grad"]
L -. "dL/dy" .-> Ld["grad"]
style x fill:#e8eaf6,stroke:#3f51b5
style z fill:#e8eaf6,stroke:#3f51b5
style y fill:#e8eaf6,stroke:#3f51b5
style loss fill:#fce4ec,stroke:#c62828
style f fill:#e8f5e9,stroke:#4caf50
style g fill:#e8f5e9,stroke:#4caf50
style L fill:#e8f5e9,stroke:#4caf50
Topological Sort: Backward pass visits nodes in reverse topological order.
micrograd pattern:
class Value:
    def backward(self):
        # Build a topological ordering of all nodes that feed into this output
        topo = []
        visited = set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        # Seed dL/dL = 1, then apply the chain rule in reverse topological order
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()
3. Optimizers (dl_004_optimizers)¶
Best Resources¶
Papers: - Adam Paper — Kingma & Ba, 2014 - AdamW Paper — Loshchilov & Hutter
Blogs: - Ruder: Optimization Overview — MUST READ - Sebastian Raschka: Optimizers
Evolution of Optimizers¶
| Optimizer | Update Rule | Innovation |
|---|---|---|
| SGD | \(w = w - \eta \nabla L\) | Baseline |
| Momentum | \(v = \gamma v + \eta \nabla L\) | Accumulate velocity |
| AdaGrad | \(w = w - \frac{\eta}{\sqrt{G}} \nabla L\) | Per-param LR |
| RMSprop | \(E[g^2] = \gamma E[g^2] + (1-\gamma)g^2\) | Fix AdaGrad |
| Adam | \(m = \beta_1 m + (1-\beta_1)g\), \(v = \beta_2 v + (1-\beta_2)g^2\) | Combine all |
Adam formulas: $\(m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t\)$, $\(v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2\)$, $\(\hat{m}_t = \frac{m_t}{1-\beta_1^t}\)$, $\(\hat{v}_t = \frac{v_t}{1-\beta_2^t}\)$, update: $\(w_t = w_{t-1} - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}\)$
Misconception: Adam does not need weight decay
The original Adam (2014) implements L2 regularization incorrectly: the penalty is added to the gradient BEFORE adaptive scaling, which weakens regularization for parameters with large \(v_t\). AdamW (2017) fixes this by applying weight decay directly to the weights: \(w_t = (1 - \eta\lambda)w_{t-1} - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}\). On LLM pretraining the difference in perplexity can reach 5-10%.
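In PyTorch the fix comes down to choosing the optimizer class; a minimal sketch (the lr and weight_decay values are illustrative):
import torch

model = torch.nn.Linear(128, 10)

# Adam + L2 via weight_decay: the penalty is coupled with adaptive scaling
adam = torch.optim.Adam(model.parameters(), lr=3e-4, weight_decay=0.1)

# AdamW: decoupled weight decay applied directly to the weights
adamw = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)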
4. Weight Initialization (dl_006_weight_init)¶
Best Resources¶
Papers: - Xavier Initialization — Glorot & Bengio - He Initialization — He et al.
Blogs: - Deep Learning Book: Initialization
Key Formulas¶
| Init | Variance | Activation |
|---|---|---|
| Xavier | \(\frac{1}{n_{in}}\) or \(\frac{2}{n_{in}+n_{out}}\) | tanh, sigmoid |
| He | \(\frac{2}{n_{in}}\) | ReLU |
| LeCun | \(\frac{1}{n_{in}}\) | SELU |
Why not zeros? - All neurons compute same output - Same gradients → same updates - No symmetry breaking
# PyTorch (pick the init that matches the activation)
import torch.nn as nn
layer = nn.Linear(512, 512)
nn.init.xavier_uniform_(layer.weight)                                       # tanh / sigmoid
nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')  # ReLU
5. Normalization (dl_009_batch_norm_layernorm)¶
Best Resources¶
Papers: - Batch Normalization — Ioffe & Szegedy, 2015 - Layer Normalization — Ba et al., 2016 - RMSNorm — Zhang & Sennrich, 2019
Blogs: - Lil'Log: Normalization
Comparison¶
| Method | Normalize over | Statistics | Use Case |
|---|---|---|---|
| BatchNorm | Batch dim | \(\mu_B, \sigma_B\) | CNNs |
| LayerNorm | Feature dim | \(\mu_L, \sigma_L\) | Transformers, RNNs |
| InstanceNorm | Spatial dim | \(\mu_I, \sigma_I\) | Style transfer |
| GroupNorm | Group of channels | \(\mu_G, \sigma_G\) | Small batches |
| RMSNorm | Feature dim | No mean | LLMs (LLaMA, Qwen) |
RMSNorm (2025 standard): $\(\text{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{d}\sum_i x_i^2}} \cdot \gamma\)$
Simpler than LayerNorm (no mean subtraction), works better for LLMs.
Misconception: BatchNorm is always better than no normalization
BatchNorm depends on batch size. At batch_size=1 (inference, some GANs) the batch statistics are meaningless. At batch_size < 8 the batch variance is unstable, which hurts training by ~2-5% accuracy. For Transformers and RNNs use LayerNorm; for LLMs (2025+), RMSNorm. BatchNorm remains the standard only for CNNs with batch_size >= 32.
6. LR Scheduling (dl_007_lr_scheduling)¶
Best Resources¶
Papers: - SGDR: Warm Restarts - 1cycle Policy
Common Schedules¶
| Schedule | Formula | Use Case |
|---|---|---|
| Step | \(\eta \cdot \gamma^{\lfloor epoch/d \rfloor}\) | Simple baseline |
| Cosine | \(\eta_{min} + \frac{1}{2}(\eta_{max}-\eta_{min})(1+\cos(\frac{t}{T}\pi))\) | Standard for LLMs |
| Linear Warmup | \(\eta = \eta_{base} \cdot \frac{t}{T_{warmup}}\) | Transformers |
| 1cycle | warmup → peak → anneal | Fast training |
Warmup + Cosine Decay (LLM standard):
import math

def lr_lambda(step):
    # warmup_steps and total_steps come from the training config
    if step < warmup_steps:
        return step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1 + math.cos(math.pi * progress))
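This multiplier plugs straight into PyTorch's LambdaLR. A minimal usage sketch, assuming an optimizer already exists and train_one_step is a hypothetical helper for the forward/backward pass:
from torch.optim.lr_scheduler import LambdaLR

warmup_steps, total_steps = 2000, 100_000
scheduler = LambdaLR(optimizer, lr_lambda=lr_lambda)

for step in range(total_steps):
    train_one_step()     # hypothetical: forward + backward + optimizer.step()
    scheduler.step()     # scales the base LR by lr_lambda(step)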
7. PyTorch Training Loop (dl_002_pytorch_training_loop)¶
Best Resources¶
Documentation: - PyTorch Tutorials - PyTorch Recipes
Blogs: - PyTorch Best Practices
Standard Training Loop¶
for epoch in range(num_epochs):
    model.train()  # switch back to train mode after the previous epoch's validation
    for batch_idx, (X, y) in enumerate(dataloader):
        X, y = X.to(device), y.to(device)
        # Forward
        optimizer.zero_grad()
        output = model(X)
        loss = criterion(output, y)
        # Backward
        loss.backward()
        optimizer.step()
    # Validation
    model.eval()
    with torch.no_grad():
        val_loss = evaluate(model, val_loader)
    # LR scheduling
    scheduler.step()
Common bugs:
- model.eval() missing for validation
- optimizer.zero_grad() missing
- Not using with torch.no_grad() for inference
8. CNN from Scratch (nn_002_cnn)¶
Best Resources¶
Courses: - CS231n: CNNs — MUST DO - d2l.ai: CNNs
YouTube: - 3Blue1Brown: CNNs
Key Concepts¶
Convolution: $\((f * g)[n] = \sum_{m} f[m] \cdot g[n-m]\)$
Output size: $\(H_{out} = \lfloor\frac{H_{in} + 2P - K}{S}\rfloor + 1\)$
where \(P\) = padding, \(K\) = kernel size, \(S\) = stride.
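A quick numeric check of the output-size formula against PyTorch (the layer sizes are arbitrary):
import torch
import torch.nn as nn

H_in, P, K, S = 224, 3, 7, 2
conv = nn.Conv2d(3, 64, kernel_size=K, stride=S, padding=P)
out = conv(torch.randn(1, 3, H_in, H_in))

expected = (H_in + 2 * P - K) // S + 1     # floor((224 + 6 - 7) / 2) + 1 = 112
print(out.shape[-1], expected)             # 112 112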
Backward through conv: - Gradient w.r.t. input = full convolution of error with flipped kernel - Gradient w.r.t. kernel = convolution of input with error
9. RNN/LSTM (nn_003_rnn_lstm)¶
Best Resources¶
Papers: - LSTM Paper — Hochreiter & Schmidhuber, 1997 - GRU Paper — Cho et al., 2014
Blogs: - Colah's Blog: LSTM — MUST READ - Lil'Log: RNN
Vanishing Gradient Problem¶
Vanilla RNN: $\(h_t = \tanh(W_{hh}h_{t-1} + W_{xh}x_t + b)\)$
Gradient: \(\frac{\partial h_t}{\partial h_{t-k}} = \prod_{i=0}^{k-1} W_{hh} \cdot \text{diag}(\tanh')\)
If \(\|W_{hh}\| < 1\), gradient vanishes exponentially.
LSTM Gates¶
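For reference, the standard LSTM gating equations (\(\odot\) denotes elementwise multiplication, \([h_{t-1}, x_t]\) concatenation):
| Gate | Formula | Role |
|---|---|---|
| Forget | \(f_t = \sigma(W_f[h_{t-1}, x_t] + b_f)\) | What to erase from the cell state |
| Input | \(i_t = \sigma(W_i[h_{t-1}, x_t] + b_i)\) | How much new content to write |
| Candidate | \(\tilde{C}_t = \tanh(W_C[h_{t-1}, x_t] + b_C)\) | New content |
| Cell update | \(C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t\) | Additive memory path |
| Output | \(o_t = \sigma(W_o[h_{t-1}, x_t] + b_o)\), \(h_t = o_t \odot \tanh(C_t)\) | What to expose as the hidden state |
The additive cell-state update is what mitigates the vanishing-gradient problem above: gradients flow through \(C_t\) without repeated multiplication by \(W_{hh}\).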
10. Attention Mechanism (dl_001_attention_mechanism)¶
Best Resources¶
Paper: - Attention Is All You Need — Vaswani et al., 2017
Blogs: - The Illustrated Transformer — MUST READ - Lil'Log: Attention
Scaled Dot-Product Attention¶
$\(\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\)$
Why \(\sqrt{d_k}\)? - Large \(d_k\) → dot products grow → softmax becomes peaky → small gradients - Scaling prevents this
Multi-Head Attention¶
$\(\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O\)$
where \(\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)\)
import math
import torch
import torch.nn.functional as F

def attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    attn = F.softmax(scores, dim=-1)
    return torch.matmul(attn, V)
11. Positional Encodings (dl_003_positional)¶
Best Resources¶
Papers: - Attention Is All You Need — Sinusoidal - RoPE — Rotary Position Embedding
Blogs: - RoPE Explained
Sinusoidal Encoding¶
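The encoding from the original Transformer paper: $\(PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \quad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)\)$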
Properties: - Model can generalize to longer sequences - Relative positions can be computed - Fixed (not learned)
RoPE (Rotary Position Embedding)¶
Key Idea: Encode position via rotation in complex plane.
Advantages: - Relative position via rotation composition - Better length extrapolation - Standard in LLaMA, Qwen, Mistral
12. KV-Cache Theory for LLM Inference (Gap Filler)¶
Best Resources¶
Articles (2025): - KV Cache Optimization via Multi-Head Latent Attention — PyImageSearch, Oct 2025 - vLLM Tutorial 2025: The Ultimate Guide — vife.ai, Jan 2026 - KV Cache System-level Optimizations — Sara Zan, Oct 2025
Papers: - vLLM: Easy, Fast, and Cheap LLM Serving — Kwon et al., 2023 - DeepSeek-V2: MLA — Multi-Head Latent Attention
What is KV-Cache?¶
Problem: Autoregressive generation computes attention for ALL previous tokens at each step.
Without cache: \(O(n^2)\) attention computation per token.
With cache: \(O(n)\) — only compute K/V for new token, reuse cached values.
Memory formula: $\(\text{KV Memory} = 2 \times L \times B \times S \times H \times D_h \times \text{bytes}\)$
Where: \(L\) = layers, \(B\) = batch, \(S\) = sequence length, \(H\) = heads, \(D_h\) = head dim.
Example (Llama-2-7B, 28K context): - KV cache ≈ 14 GB (comparable to model weights!) - This is why inference is memory-bandwidth bound, not compute bound.
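A quick sanity check of the memory formula in Python, assuming a Llama-2-7B-like configuration (32 layers, 32 KV heads, head dim 128, FP16); the figures are illustrative:
def kv_cache_bytes(layers, batch, seq_len, kv_heads, head_dim, bytes_per_el=2):
    # factor of 2 for K and V
    return 2 * layers * batch * seq_len * kv_heads * head_dim * bytes_per_el

gb = kv_cache_bytes(layers=32, batch=1, seq_len=28_000, kv_heads=32, head_dim=128) / 1e9
print(f"{gb:.1f} GB")   # ~14.7 GB for a single 28K-token sequence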
PagedAttention (vLLM)¶
Idea: OS-inspired virtual memory paging for KV cache.
Before: Pre-allocated contiguous blocks → 60-80% memory waste due to fragmentation.
PagedAttention: 1. Divide KV cache into fixed-size pages (e.g., 16-512 tokens per page) 2. Store pages non-contiguously 3. Virtual-to-physical mapping via page table
Results: - 24x higher throughput vs HuggingFace - <4% memory waste (vs 60-80%) - More concurrent requests on same hardware
# vLLM usage example
from vllm import LLM, SamplingParams
llm = LLM(model="meta-llama/Llama-3-8b-hf")
sampling_params = SamplingParams(temperature=0.8, max_tokens=100)
outputs = llm.generate(["Hello, my name is"], sampling_params)
Multi-Head Latent Attention (MLA) — DeepSeek-V2¶
Idea: Compress KV into shared latent space instead of caching full-resolution tensors.
Standard MHA: Cache \((n_{heads} \times d_h)\) per token per layer.
MLA: Compress to latent dimension \(D_{KV} \ll n_{heads} \times d_h\).
Math: $\(C_{KV} = X W^{KV}_{down} \quad \text{(compress to latent)}\)$
Benefits: - 8x reduction in cache size (e.g., 512 vs 4096 values per token) - On-demand reconstruction preserves accuracy - Used in DeepSeek-V2, DeepSeek-V3
MQA vs GQA vs MHA¶
| Method | KV Heads | Memory | Quality |
|---|---|---|---|
| MHA (Multi-Head) | \(H\) | Full | Best |
| MQA (Multi-Query) | 1 | \(\frac{1}{H}\) | Slight drop |
| GQA (Grouped Query) | \(G\) where \(1 < G < H\) | \(\frac{G}{H}\) | Near MHA |
Used in: - Llama-2: MHA - Llama-3: GQA (8 groups for 32 heads) - PaLM: MQA
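A minimal sketch of how GQA shares KV heads across query heads at attention time (shapes only, not an optimized kernel; the 32/8 head counts mirror the Llama-3 example):
import torch

B, T, H_q, H_kv, D = 2, 16, 32, 8, 128         # 32 query heads share 8 KV heads
q = torch.randn(B, H_q, T, D)
k = torch.randn(B, H_kv, T, D)
v = torch.randn(B, H_kv, T, D)

# Expand each KV head to serve H_q / H_kv query heads
group = H_q // H_kv
k = k.repeat_interleave(group, dim=1)           # [B, H_q, T, D]
v = v.repeat_interleave(group, dim=1)

scores = (q @ k.transpose(-2, -1)) / D ** 0.5   # [B, H_q, T, T]
out = scores.softmax(dim=-1) @ v                # [B, H_q, T, D]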
System-Level Optimizations¶
Memory Management: - PagedAttention (vLLM): Virtual memory paging - vTensor: Flexible tensor management - ChunkAttention: Prefix tree for shared prefixes
Scheduling: - BatchLLM: Group requests by prefix for cache reuse - RadixAttention: Radix tree for KV sharing - FastServe: Preemptive GPU↔CPU cache swapping
Hardware-aware: - FlashAttention: HBM↔SRAM tiling, fused kernels - FlexGen: GPU↔CPU↔SSD tiered offloading - DistServe: Separate prefill/decode across GPUs
Interview Questions¶
Q: Why does KV cache become the bottleneck in LLM inference?
A: For long sequences, KV memory scales as \(O(L \times S \times H \times D)\). In models like Llama-2-7B with 28K context, KV cache can exceed model weights (14GB+). GPUs become memory-bandwidth bound — reading/writing wide matrices is slower than compute.
Q: Explain PagedAttention in one sentence.
A: PagedAttention applies OS virtual memory concepts to KV cache — storing fixed-size pages non-contiguously with page table mapping, eliminating fragmentation and enabling 24x throughput gains.
Q: What's the difference between MQA and GQA?
A: MQA uses a single KV head shared across all query heads (minimal memory, slight quality drop). GQA uses \(G\) KV heads where \(1 < G < H\), balancing memory reduction (\(\frac{G}{H}\)) with near-MHA quality. Llama-3 uses GQA with 8 groups for 32 heads.
Q: How does MLA (Multi-Head Latent Attention) reduce KV cache?
A: MLA projects K and V into a compressed latent space (\(C_{KV}\)) before caching, typically 8x smaller. During attention, latent vectors are up-projected on-demand. This preserves modeling capacity while dramatically reducing memory bandwidth.
13. Attention Variants (Gap Filler)¶
Cross-Attention (Encoder-Decoder)¶
Used in: Translation, T5, original Transformer
Decoder queries attend to encoder outputs.
Decoder-only models (GPT, Llama): Only self-attention, no cross-attention.
Sliding Window Attention¶
Problem: Full attention is \(O(n^2)\) for long sequences.
Solution: Each token only attends to local window of \(W\) tokens.
Used in: - Longformer (global + local attention) - Mistral (sliding window 4096) - Reduces to \(O(n \cdot W)\) complexity
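A minimal sketch of a causal sliding-window mask (True marks positions a query may attend to; the window size is illustrative):
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    pos = torch.arange(seq_len)
    rel = pos.unsqueeze(0) - pos.unsqueeze(1)    # rel[i, j] = j - i
    # causal (j <= i) and within the last `window` positions
    return (rel <= 0) & (rel > -window)

mask = sliding_window_mask(seq_len=8, window=3)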
Flash Attention¶
Key insight: Attention is memory-bound (HBM reads/writes), not compute-bound.
Technique: 1. Load Q, K, V blocks into SRAM 2. Compute attention in SRAM (fused softmax, no materialization) 3. Write only final output to HBM
Speedup: 2-4x faster, same numerical results.
Versions: - FlashAttention-1 (2022): Tiling, fused kernels - FlashAttention-2 (2023): Better parallelism, work partitioning - FlashAttention-3 (2024): H100 optimized, FP8 support
Linear Attention¶
Goal: Replace the softmax with a kernel feature map to get \(O(n)\) complexity: $\(\text{Attention}(Q, K, V) \approx \frac{\phi(Q)\left(\phi(K)^T V\right)}{\phi(Q)\left(\phi(K)^T \mathbf{1}\right)}\)$
Where \(\phi\) is a kernel feature map (e.g., ELU+1). Computing \(\phi(K)^T V\) first costs \(O(n d^2)\) instead of \(O(n^2 d)\).
Used in: Linear Transformer, Performer, RWKV (partially).
14. Mixture of Experts (MoE) — Gap Filler¶
Best Resources¶
Articles (2025-2026): - Build a Mixture-of-Experts LLM from Scratch — Into AI, Jan 2026 - Router Wars: Which MoE Routing Strategy Actually Works — Cerebras, Aug 2025 - The Rise of MoE: Comparing 2025's Leading Models — Friendli AI
Papers: - Outrageously Large Neural Networks — Shazeer et al., 2017 (original MoE) - Switch Transformer — Fedus et al., 2021 - Mixtral 8x7B — Mistral AI, 2024 - DeepSeekMoE — DeepSeek, 2024
MoE Architecture Overview¶
Key Idea: Replace large FFN with multiple smaller FFNs ("experts"), route each token to top-k experts.
Benefits: - Scale parameters without scaling compute - Sparse activation (only subset of experts active per token) - Better specialization for different token types
Models using MoE: - Mixtral 8x7B (8 experts, top-2 routing) - Grok-1 (314B params, sparse MoE) - DeepSeek-V3 (256 routed experts + shared) - GPT-OSS (120B, MoE architecture)
Mathematical Details¶
Step 1: Router logits $\(l_i = x \cdot W_g^{(i)} \quad \text{for } i = 1, ..., n\)$
Step 2: Top-k selection $\(\text{Keep top-}k \text{ logits, mask rest with } -\infty\)$
Step 3: Softmax $\(w_i = \frac{\exp(l_i)}{\sum_{j \in \text{top-}k} \exp(l_j)} \quad \text{for selected experts}\)$
Step 4: Weighted combination $\(y = \sum_{i \in \text{top-}k} w_i \cdot E_i(x)\)$
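A minimal sketch implementing the four steps above as a top-k routed MoE layer (expert sizes and counts are illustrative, and the per-expert loop is written for clarity rather than speed):
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model=512, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts, bias=False)   # router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                # x: [tokens, d_model]
        logits = self.gate(x)                            # Step 1: router logits
        top_w, top_idx = logits.topk(self.k, dim=-1)     # Step 2: top-k selection
        top_w = F.softmax(top_w, dim=-1)                 # Step 3: softmax over selected
        out = torch.zeros_like(x)
        for slot in range(self.k):                       # Step 4: weighted combination
            for e, expert in enumerate(self.experts):
                sel = top_idx[:, slot] == e
                if sel.any():
                    out[sel] += top_w[sel, slot].unsqueeze(-1) * expert(x[sel])
        return out

moe = MoELayer()
y = moe(torch.randn(10, 512))   # 10 tokens, each routed to its top-2 experts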
Routing Strategies Comparison¶
| Strategy | Description | Pros | Cons |
|---|---|---|---|
| Hash Routing | Deterministic: \(\text{expert} = \text{token\_id} \mod N\) | Perfect load balance | Ignores context, low specialization |
| Learned Routing | Trainable router with aux loss | 3x better quality than hash | Router collapse risk |
| Sinkhorn Routing | Iterative normalization per layer | Hash-level balance + learned quality | Hard to scale, gradient issues |
Auxiliary Loss (for load balancing): $\(L_{aux} = \text{coeff} \cdot \sum_i f_i \cdot P_i\)$
Where \(f_i\) = fraction of tokens to expert \(i\), \(P_i\) = sum of router weights.
Router Collapse Problem¶
Symptoms: - Most tokens routed to 1-2 experts - Other experts become "dead" - Model degrades to dense performance
Fixes: - Auxiliary loss (load balancing regularization) - Expert capacity factors - Shared experts (always activated) + routed experts - Z-loss regularization (DeepSeek-V3)
DeepSeekMoE Innovations¶
Hybrid architecture: - Shared experts: Always activated (e.g., 1-2 experts) - Routed experts: Top-k selection (e.g., 256 experts, top-8)
Benefits: - Shared experts capture universal knowledge - Routed experts specialize - More stable training, better utilization
Interview Questions¶
Q: What does "8x7B" mean in Mixtral 8x7B?
A: 8 experts × 7B parameters each. But total params ≠ 56B because: (1) shared components (attention, embeddings) aren't replicated, (2) only top-2 experts active per token → ~13B active params per forward pass.
Q: Why does router collapse happen?
A: Early in training, some experts get slightly better at handling common patterns. Router learns to send more tokens to them → they improve faster → positive feedback loop. Without auxiliary loss, this compounds until most experts are unused.
Q: What's the trade-off between Hash and Learned routing?
A: Hash routing: perfect load balance, 0 overhead, but ignores context → low specialization (~1.5% loss improvement). Learned routing: context-aware, 3x better performance (~4% loss improvement), but requires auxiliary loss and can collapse. Production systems use learned + engineering tricks.
Q: How do shared experts help?
A: Shared experts (always activated) provide a stable "base" representation that all tokens receive. This: (1) prevents collapse by ensuring minimum utilization, (2) captures universal patterns, (3) allows routed experts to focus on specialization rather than general knowledge.
15. State Space Models / Mamba — Gap Filler¶
Best Resources¶
Articles (2025): - How Mamba Beats Transformers at Long Sequences — Galileo AI, Sep 2025 - A Visual Guide to Mamba and State Space Models — Maarten Grootendorst
Papers: - Mamba: Linear-Time Sequence Modeling — Gu & Dao, 2023 - Mamba-2 — Dao et al., 2024 - S4: Efficiently Modeling Long Sequences — Gu et al., 2021
The Transformer Problem¶
Self-attention: \(O(T^2)\) time and memory — every token attends to every token.
Consequence: Sequences beyond a few thousand tokens become impractical. KV cache grows linearly with length.
State Space Models (SSM) Basics¶
Continuous-time formulation: $\(h'(t) = Ah(t) + Bx(t)\)$
Discretized (for sequences): $\(h_t = \bar{A}h_{t-1} + \bar{B}x_t\)$
Where \(\bar{A} = \exp(\Delta A)\), \(\bar{B} = (\Delta A)^{-1}(\exp(\Delta A) - I) \cdot \Delta B\).
Key properties: - Training: Parallel scan \(O(T \log T)\) - Inference: Recurrent update \(O(1)\) per token - Fixed hidden state size (no growing cache)
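A minimal sketch of the discretized recurrence for a single input channel (random diagonal parameters; in Mamba, \(\bar{B}\), \(C\), and \(\Delta\) would additionally depend on the input):
import torch

def ssm_scan(x, A_bar, B_bar, C):
    """One scalar input channel; hidden state of size N with diagonal A."""
    h = torch.zeros(A_bar.shape[0])
    ys = []
    for t in range(x.shape[0]):
        h = A_bar * h + B_bar * x[t]      # h_t = A_bar h_{t-1} + B_bar x_t
        ys.append((C * h).sum())          # y_t = C h_t  ->  O(1) work per token
    return torch.stack(ys)

N = 16
y = ssm_scan(torch.randn(64), torch.rand(N) * 0.9, torch.randn(N), torch.randn(N))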
S4 vs Mamba¶
| Aspect | S4 (2021) | Mamba (2023) | Mamba-2 (2024) |
|---|---|---|---|
| Parameters | Fixed A, B, C | Input-dependent B, C, Δ | A = scalar × I |
| Selectivity | No | Yes | Yes |
| Complexity | \(O(T \log T)\) | \(O(T)\) | \(O(T)\) |
| Memory | Fixed | Fixed | Fixed |
Selective SSM (Mamba's Innovation)¶
Key insight: Make parameters functions of input, not fixed.
Selectivity means: - Can selectively remember or forget information - Skip uninformative tokens (like "the") - Devote capacity to important content
Mamba vs Transformer¶
| Aspect | Transformer | Mamba |
|---|---|---|
| Time complexity | \(O(T^2)\) | \(O(T)\) |
| Memory (inference) | \(O(T)\) KV cache | \(O(1)\) hidden state |
| Training parallelism | Full | Parallel scan |
| Inference speed | Slower for long sequences | 5× faster (>2k tokens) |
| Positional encoding | Required | Not needed (implicit in recurrence) |
Hybrid Architectures (2025 Trend)¶
Jamba (AI21): Transformer layers + Mamba layers interleaved.
Bamba (IBM): Mamba2 + attention hybrid, 2× faster inference.
When to use Mamba: - Long document Q&A - Audio/video processing - Genomics (million-token sequences) - Constant memory requirement scenarios
When to prefer Transformers: - Dense global token interactions - Complex multi-hop reasoning - Established production pipelines
Interview Questions¶
Q: Why is Mamba faster than Transformers for long sequences?
A: Transformers have \(O(T^2)\) attention complexity. Mamba uses selective SSMs with \(O(T)\) complexity. At inference, Mamba only stores a fixed-size hidden state, while Transformers grow KV cache linearly. For >2k tokens, Mamba runs 5× faster.
Q: What makes Mamba "selective"?
A: Traditional SSMs (like S4) use fixed matrices A, B, C. Mamba makes B, C, and step size Δ functions of the current input. This allows the model to selectively remember important information and forget uninformative tokens on-the-fly.
Q: When would you choose Mamba over Transformers?
A: Mamba excels at long sequences (>2k tokens) where attention becomes prohibitively expensive. Use cases: document processing, audio/video, genomics. Prefer Transformers when you need dense global interactions or have established pipelines.
Q: What's the difference between S4 and Mamba?
A: S4 (2021) introduced structured SSMs with fixed parameters and \(O(T \log T)\) training. Mamba (2023) added input-dependent parameters (selectivity) and efficient CUDA kernels for \(O(T)\) training. Mamba-2 (2024) simplified the transition matrix A to a scalar multiple of identity for easier kernel fusion.
16. Transformer Architecture Deep Dive (Pre-Norm vs Post-Norm)¶
Sources: Medium "Why Pre-Norm Became the Default in Transformers" (Jan 2025), LayerNorm papers
Layer Normalization Placement¶
Post-Norm (Original Transformer, 2017)¶
Characteristics:
- Sublayer output: LayerNorm(x + Attention(x))
- Normalization AFTER the residual addition
- The gradient must pass through LayerNorm
Pre-Norm (GPT-2, 2019 — now the standard)¶
Characteristics:
- Sublayer output: x + Attention(LayerNorm(x)) — x itself passes through the residual untouched
- Normalization BEFORE the sublayer
- The gradient flows directly along the residual connection
Why Pre-Norm Won¶
| Aspect | Post-Norm | Pre-Norm |
|---|---|---|
| Gradient flow | Through LayerNorm | Directly along the residual |
| Training stability | Requires warmup | More stable |
| Deep networks | Hard to train beyond ~12 layers | Easily 100+ layers |
| Learning rate | Sensitive | Less sensitive |
| Gradient vanishing | A problem at depth | Less of a problem |
Gradient Flow Math¶
Post-Norm gradient: $\(\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \cdot \frac{\partial \text{LayerNorm}}{\partial (x + \text{Attention}(x))}\)$
The gradient must pass through LayerNorm, which adds complexity.
Pre-Norm gradient: $\(\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \cdot \left(1 + \frac{\partial \text{Attention}}{\partial \text{LayerNorm}(x)}\right)\)$
The additive term 1 means a DIRECT gradient path — the gradient always reaches the input.
Double Norm Innovation (2024-2025)¶
Newer models (Grok, Gemma 2, Olmo 2) use Double Norm:
Double Norm Block:
y = x + Attention(LayerNorm(x)) # Pre-Norm attention
y = LayerNorm(y) # Post-Norm output
y = y + FeedForward(LayerNorm(y)) # Pre-Norm FFN
y = LayerNorm(y) # Post-Norm output
Advantages: - Pre-Norm for training stability - Post-Norm for better representations at the output
Python: Pre-Norm vs Post-Norm Block¶
import torch
import torch.nn as nn
class PreNormTransformerBlock(nn.Module):
"""Pre-Norm (современный стандарт)"""
def __init__(self, d_model, n_heads, d_ff, dropout=0.1):
super().__init__()
self.norm1 = nn.LayerNorm(d_model)
self.norm2 = nn.LayerNorm(d_model)
self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout)
self.ff = nn.Sequential(
nn.Linear(d_model, d_ff),
nn.GELU(),
nn.Linear(d_ff, d_model),
nn.Dropout(dropout)
)
self.dropout = nn.Dropout(dropout)
def forward(self, x, mask=None):
# Pre-Norm: normalize BEFORE attention
normed = self.norm1(x)
        attn_out, _ = self.attn(normed, normed, normed, attn_mask=mask)
x = x + self.dropout(attn_out) # Clean residual
# Pre-Norm: normalize BEFORE FFN
normed = self.norm2(x)
ff_out = self.ff(normed)
x = x + self.dropout(ff_out) # Clean residual
return x
class PostNormTransformerBlock(nn.Module):
"""Post-Norm (original transformer)"""
def __init__(self, d_model, n_heads, d_ff, dropout=0.1):
super().__init__()
self.norm1 = nn.LayerNorm(d_model)
self.norm2 = nn.LayerNorm(d_model)
self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout)
self.ff = nn.Sequential(
nn.Linear(d_model, d_ff),
nn.ReLU(),
nn.Linear(d_ff, d_model),
nn.Dropout(dropout)
)
self.dropout = nn.Dropout(dropout)
def forward(self, x, mask=None):
# Post-Norm: normalize AFTER residual
        attn_out, _ = self.attn(x, x, x, attn_mask=mask)
x = self.norm1(x + self.dropout(attn_out))
ff_out = self.ff(x)
x = self.norm2(x + self.dropout(ff_out))
return x
Interview Questions¶
Q: Why is Pre-Norm more stable than Post-Norm?
A: In Pre-Norm, the residual connection passes the gradient through directly: \(\frac{\partial L}{\partial x}\) contains an additive term of 1. This means the gradient always reaches the early layers without vanishing. In Post-Norm, the gradient must pass through LayerNorm, which adds nonlinearity and potential attenuation.
Q: When would you use Post-Norm?
A: Post-Norm can give better representations at the block output, since LayerNorm "evens out" the activations. Some models use Double Norm (Pre + Post) to get the benefits of both approaches.
Q: How does Pre-Norm affect learning rate warmup?
A: Pre-Norm is less sensitive to the learning rate and often lets you drop or greatly shorten warmup. Post-Norm requires careful warmup to avoid gradient explosion in the early iterations.
Q: What is Double Norm?
A: Double Norm combines Pre-Norm and Post-Norm: normalization BEFORE the sublayer for stable gradient flow, and AFTER the residual connection for a better output distribution. Used in Grok, Gemma 2, Olmo 2.
17. Distributed Training (DDP, Pipeline, Tensor, ZeRO/FSDP)¶
Sources: Datahacker.rs "LLMs from Scratch #007" (Nov 2025), Deepak Baby Blog (Dec 2025), ZeRO Paper
The Memory Wall Problem¶
Why distributed training is needed: - 7B parameters in FP32: \(7B \times 4 = 28\) GB (weights alone) - Gradients: another 28 GB - Optimizer states (Adam): 56 GB (two momentum terms) - Activations: substantial overhead - Total: 150 GB+ for a 7B model — it does not fit on a single GPU!
Three Fundamental Parallelization Strategies¶
| Strategy | What is parallelized | When to use |
|---|---|---|
| Data Parallelism | Training data | Large datasets, model fits on a single GPU |
| Model/Tensor Parallelism | Model layers | Very large layers (attention in transformers) |
| Pipeline Parallelism | Model stages | Deep models with many sequential layers |
| 3D Parallelism | Combination | Extremely large models (100B+ params) |
Distributed Data Parallel (DDP)¶
How it works: 1. Replicate the model on every GPU 2. Shard the batch across GPUs 3. Independent forward pass 4. Independent backward pass 5. AllReduce to synchronize gradients 6. Update the weights
AllReduce = Reduce-Scatter + All-Gather: $\(\text{AllReduce}(X) = \text{AllGather}(\text{ReduceScatter}(X))\)$
Cost: \(2 \times \text{size}(X)\)
| DDP Pros | DDP Cons |
|---|---|
| Simple to implement | Memory redundancy (full model per GPU) |
| Linear scaling | Model must fit on a single GPU |
| No model changes | Communication overhead (AllReduce) |
| Fault tolerance | Synchronization barrier (stragglers) |
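A minimal single-node DDP setup sketch, assuming a torchrun launch; MyModel and train_dataset are placeholders:
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = MyModel().cuda(local_rank)            # placeholder model
model = DDP(model, device_ids=[local_rank])   # gradients are AllReduced during backward()

sampler = DistributedSampler(train_dataset)   # placeholder dataset; shards data per rank
loader = DataLoader(train_dataset, batch_size=32, sampler=sampler)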
Pipeline Parallelism¶
How it works: 1. Split the model into N stages (by layers) 2. Each GPU processes its own stage 3. Micro-batching to reduce bubbles
Pipeline Bubble Formula: $\(\text{Bubble Fraction} = \frac{p - 1}{m}\)$
Where \(p\) = number of stages, \(m\) = number of micro-batches; e.g., 8 stages with 32 micro-batches give \(7/32 \approx 22\%\) idle time.
Pipeline Schedules:
| Schedule | Description | Memory | Bubble Ratio |
|---|---|---|---|
| GPipe | All forward, then all backward | High | \((p-1)/m\) |
| 1F1B | Alternates forward/backward | Lower | \((p-1)/m\) |
| Interleaved 1F1B | Virtual stages | Lowest | \((p-1)/(m \cdot v)\) |
Tensor Parallelism¶
How it works: shard individual layers (not sequential stages).
MLP Example:
# Column-parallel: shard the first linear along its output dim (here k = d_ff // 2 for 2 GPUs)
Y1 = X @ W1[:, :k] # GPU 0
Y2 = X @ W1[:, k:] # GPU 1
# Row-parallel: shard second linear
Z1 = Y1 @ W2[:k, :] # GPU 0
Z2 = Y2 @ W2[k:, :] # GPU 1
# AllReduce to combine
Z = Z1 + Z2 # AllReduce
When to use: Very large layers, within single node (NVLink required).
ZeRO (Zero Redundancy Optimizer)¶
ZeRO stages:
| Stage | What is sharded | Memory Savings |
|---|---|---|
| ZeRO-1 | Optimizer states | 4x |
| ZeRO-2 | + Gradients | 8x |
| ZeRO-3 | + Parameters | \(N \times\) (N = GPU count) |
ZeRO-3 / FSDP Process: 1. Shard: Split params/gradients/optimizer states across GPUs 2. Gather: AllGather params when needed for computation 3. Compute: Forward/backward pass 4. Scatter: Reduce-scatter gradients to owners 5. Update: Each GPU updates its shard
Fully Sharded Data Parallel (FSDP)¶
Gather-Compute-Scatter Pattern:
for layer in model:
# Gather: collect full layer params from all GPUs
all_gather(layer.params)
# Compute: forward/backward
output = layer(input)
# Scatter: return unused params, reduce-scatter grads
reduce_scatter(layer.gradients)
Sharding Strategies:
- FULL_SHARD — ZeRO-3 equivalent (max memory savings)
- SHARD_GRAD_OP — ZeRO-2 equivalent
- NO_SHARD — DDP equivalent
Python: FSDP with PyTorch¶
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy
# Initialize distributed
torch.distributed.init_process_group(backend="nccl")
# Create model
model = MyLargeModel()
# Wrap with FSDP (ZeRO-3 equivalent)
model = FSDP(
model,
sharding_strategy=ShardingStrategy.FULL_SHARD, # ZeRO-3
device_id=torch.cuda.current_device(),
)
# Training loop works normally
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
for batch in dataloader:
optimizer.zero_grad()
loss = model(batch).sum()
loss.backward()
optimizer.step()
3D Parallelism¶
Combination for extreme scale:
Example (Llama 3 405B): - Tensor Parallel: 8 GPUs per node - Pipeline Parallel: 8 stages - Data Parallel: 128 replicas - Total: \(8 \times 8 \times 128 = 8192\) GPUs
Comparison Summary¶
| Strategy | Memory | Communication | Complexity | Use Case |
|---|---|---|---|---|
| DDP | High | AllReduce | Low | Small models |
| Pipeline | Medium | Point-to-point | Medium | Deep models |
| Tensor | Low | AllGather/Reduce | High | Wide layers |
| FSDP | Very Low | AllGather/ReduceScatter | Medium | Large models |
| 3D | Lowest | Complex | Very High | 100B+ params |
Interview Questions¶
Q: What is the difference between DDP and FSDP?
A: DDP replicates the full model on every GPU and synchronizes only the gradients via AllReduce. FSDP shards parameters, gradients, and optimizer states across GPUs and gathers them on demand for computation. FSDP makes it possible to train models that do not fit in a single GPU's memory.
Q: What is a pipeline bubble and how do you reduce it?
A: A pipeline bubble is the idle time a GPU spends waiting for data from the previous stage. Formula: \(\frac{p-1}{m}\), where \(p\) = stages, \(m\) = micro-batches. It is reduced with more micro-batches, better schedules (1F1B, Interleaved), and interleaved virtual stages.
Q: When would you use Tensor Parallelism vs Pipeline Parallelism?
A: Tensor Parallelism is for very wide layers (attention), requires NVLink, and works only within a single node. Pipeline Parallelism is for deep models and works across nodes, but has bubble overhead. For 100B+ models the two are combined (3D Parallelism).
Q: Explain the ZeRO stages.
A: ZeRO-1 shards optimizer states (4x memory savings). ZeRO-2 adds gradient sharding (8x savings). ZeRO-3 also shards the parameters (\(N \times\) savings for \(N\) GPUs). ZeRO-3 is equivalent to FSDP FULL_SHARD.
18. Vision Transformers (ViT)¶
Sources: Codecademy "Vision Transformers Architecture" (Sept 2025), GeeksforGeeks ViT Architecture (2025), "An Image Is Worth 16x16 Words" paper
Core Idea¶
Vision Transformer (ViT) applies transformer architecture to images by treating them as sequences of patches, not convolutions.
Key insight: Image \(\rightarrow\) Patches \(\rightarrow\) Transformer (same as text!)
Architecture Components¶
1. Image Patching¶
Input: Image [H, W, C]
Output: N patches of size [P, P, C]
Example: 224×224×3 image → 196 patches of 16×16×3
Number of patches: N = (H × W) / (P × P) = 50176 / 256 = 196
2. Patch Embedding¶
Step 1: Flatten patches: $\(\text{Patch vector} = P^2 \times C = 16 \times 16 \times 3 = 768\)$
Step 2: Linear projection to D dimensions: $\(\mathbf{z}_0 = [\mathbf{x}_{\text{cls}}; \mathbf{x}_p^1 \mathbf{E}; \mathbf{x}_p^2 \mathbf{E}; \ldots; \mathbf{x}_p^N \mathbf{E}] + \mathbf{E}_{\text{pos}}\)$
Where \(\mathbf{E} \in \mathbb{R}^{(P^2 \cdot C) \times D}\) is the learnable linear projection.
3. Positional Encoding¶
Why needed: Transformers are permutation invariant — need to encode spatial order.
Learnable positional embeddings: $\(\mathbf{E}_{\text{pos}} \in \mathbb{R}^{(N+1) \times D}\)$
Added to patch embeddings to preserve spatial information.
4. CLS Token¶
Purpose: Learnable token prepended to sequence that aggregates global information.
The CLS token attends to ALL patches and learns image-level representation.
5. Transformer Encoder¶
Standard transformer block (Pre-LN):
Self-Attention Formula: $\(\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\)$
Multi-Head Attention: $\(\text{MSA}(X) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W_O\)$
Feed-Forward Network: $\(\text{FFN}(x) = W_2 \cdot \text{GELU}(W_1 x + b_1) + b_2\)$
6. Classification Head (MLP Head)¶
The final CLS embedding goes through a classification head: the original ViT uses an MLP with one hidden layer during pretraining and a single linear layer when fine-tuning (the code sketch below uses a single linear layer).
ViT vs CNN Comparison¶
| Feature | CNNs | ViTs |
|---|---|---|
| Attention Scope | Local (convolutions) | Global (self-attention) |
| Inductive Bias | Strong (locality, translation invariance) | Minimal, more flexible but data-hungry |
| Data Requirement | Works with small datasets | Needs large datasets |
| Feature Learning | Hierarchical (low→high) | Context-rich, long-range |
| Computational | \(O(K^2 \cdot C_{in} \cdot C_{out})\) per layer | \(O(N^2 \cdot D)\) per attention |
| Transfer Learning | Good | Excellent with pretraining |
Python: ViT from Scratch¶
import torch
import torch.nn as nn
class PatchEmbedding(nn.Module):
"""Convert image into patches and embed them."""
def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
super().__init__()
self.n_patches = (img_size // patch_size) ** 2
# Conv2d with stride=patch_size is equivalent to patch extraction + projection
self.proj = nn.Conv2d(
in_channels, embed_dim,
kernel_size=patch_size, stride=patch_size
)
def forward(self, x):
# x: [B, C, H, W] -> [B, embed_dim, H/P, W/P]
x = self.proj(x)
# Flatten: [B, embed_dim, n_patches] -> [B, n_patches, embed_dim]
x = x.flatten(2).transpose(1, 2)
return x
class VisionTransformer(nn.Module):
def __init__(self, img_size=224, patch_size=16, in_channels=3,
num_classes=1000, embed_dim=768, depth=12, num_heads=12, mlp_ratio=4.0):
super().__init__()
self.patch_embed = PatchEmbedding(img_size, patch_size, in_channels, embed_dim)
n_patches = self.patch_embed.n_patches
# CLS token + positional embeddings
self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, embed_dim))
# Transformer encoder
encoder_layer = nn.TransformerEncoderLayer(
d_model=embed_dim, nhead=num_heads,
dim_feedforward=int(embed_dim * mlp_ratio),
activation='gelu', batch_first=True, norm_first=True
)
self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
# Classification head
self.head = nn.Linear(embed_dim, num_classes)
def forward(self, x):
B = x.shape[0]
# Patch embedding
x = self.patch_embed(x) # [B, n_patches, embed_dim]
# Add CLS token
cls_tokens = self.cls_token.expand(B, -1, -1)
x = torch.cat([cls_tokens, x], dim=1) # [B, n_patches+1, embed_dim]
# Add positional embeddings
x = x + self.pos_embed
# Transformer encoder
x = self.encoder(x)
# Classification from CLS token
cls_output = x[:, 0] # [B, embed_dim]
return self.head(cls_output)
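A quick shape check of the sketch above:
model = VisionTransformer(img_size=224, patch_size=16, num_classes=1000)
logits = model(torch.randn(2, 3, 224, 224))
print(logits.shape)   # torch.Size([2, 1000])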
ViT Variants (2024-2025)¶
| Model | Key Innovation | Best For |
|---|---|---|
| DeiT | Data-efficient training, distillation | Small datasets |
| Swin | Hierarchical, shifted windows | Dense prediction |
| ConvNeXt | CNN designed like ViT | General vision |
| MaxViT | Multi-axis attention | Hybrid local-global |
| EVA-02 | CLIP-pretrained | Transfer learning |
Advantages & Limitations¶
| Advantages | Limitations |
|---|---|
| Global context (long-range deps) | Data-hungry (needs JFT/ImageNet-21K) |
| Scales well at large sizes | Quadratic attention cost \(O(N^2)\) |
| Excellent transfer learning | Longer training time |
| Unified with NLP transformers | Positional encoding dependency |
| Parallel processing | Less efficient for small datasets |
Interview Questions¶
Q: How does ViT process images?
A: ViT splits the image into fixed-size patches (e.g., 16×16), flattens each patch into a vector, and linearly projects it into the embedding dimension. It adds a learnable CLS token (as in BERT) and positional embeddings, then passes the sequence through a standard transformer encoder. The output CLS token is used for classification.
Q: What is the difference between ViT and CNN?
A: A CNN uses convolutions with local receptive fields, has a strong inductive bias (locality, translation invariance), and works on small datasets. A ViT uses global self-attention — every patch interacts with every other patch — has minimal inductive bias, requires large datasets (JFT-300M, ImageNet-21K), but transfers extremely well.
Q: Why does ViT need a CLS token?
A: The CLS token is a learnable vector prepended to the patch sequence. Through self-attention it aggregates information from ALL patches, learning a global image representation. At the encoder output, the CLS embedding is used for classification (like [CLS] in BERT for sentence classification).
Q: Why does ViT need more data than a CNN?
A: A CNN has strong inductive biases: locality (neighboring pixels are related) and translation equivariance (the same filter everywhere). A ViT treats patches as unordered tokens, and self-attention must learn all relationships from scratch. On small datasets ViT overfits. The fix: pretraining on huge datasets (JFT-300M) or DeiT-style distillation.
Connections Between Topics¶
graph TD
calc["Calculus (gradient)"] --> bp["Backpropagation"]
bp --> opt["Optimizers"]
opt --> winit["Weight Init"]
winit --> norm["Normalization"]
norm --> loop["Training Loop"]
loop --> loss["Loss Functions"]
loss --> archs["CNN / RNN / LSTM"]
archs --> attn["Attention"]
pe["Positional Encodings"] --> attn
attn --> trans["Transformers"]
trans --> llm["LLMs"]
style calc fill:#f3e5f5,stroke:#9c27b0
style bp fill:#e8eaf6,stroke:#3f51b5
style opt fill:#e8eaf6,stroke:#3f51b5
style winit fill:#e8eaf6,stroke:#3f51b5
style norm fill:#e8eaf6,stroke:#3f51b5
style loop fill:#e8f5e9,stroke:#4caf50
style loss fill:#e8f5e9,stroke:#4caf50
style archs fill:#fff3e0,stroke:#ef6c00
style attn fill:#fff3e0,stroke:#ef6c00
style pe fill:#fff3e0,stroke:#ef6c00
style trans fill:#fce4ec,stroke:#c62828
style llm fill:#fce4ec,stroke:#c62828
Recommended Study Order¶
- Week 1: Backprop (micrograd), Weight Init
- Week 2: Optimizers (implement Adam from scratch)
- Week 3: Normalization, LR Scheduling
- Week 4: CNN basics, Training Loop
- Week 5: RNN/LSTM, Vanishing Gradients
- Week 6: Attention, Positional Encodings