
Deep Learning: Study Materials

~22 minute read

Prerequisites: Mathematics for ML | Classical ML

Deep Learning accounts for ~35% of questions in Middle+ level ML interviews (per Blind and levels.fyi data for 2025). This section covers 18 key topics, from backpropagation and loss functions to MoE, SSM/Mamba, and Vision Transformers. Each topic includes formulas, code, the best sources, and self-check questions.

Materials for the 18 topics in the Deep Learning category. Updated: 2026-02-11


1. Loss Functions (dl_005_loss_functions)

Best sources

Blogs: - Lil'Log: Loss Functions — visualization - Distill: Feature Visualization

Papers: - Focal Loss Paper — Lin et al., 2017 - Contrastive Learning Survey

Key formulas

| Loss | Formula | Use Case |
|---|---|---|
| MSE | \(\frac{1}{n}\sum(y-\hat{y})^2\) | Regression |
| BCE | \(-[y\log\hat{y} + (1-y)\log(1-\hat{y})]\) | Binary classification |
| CE | \(-\sum y_i\log\hat{y}_i\) | Multiclass |
| Focal | \(-(1-\hat{y}_t)^\gamma \log(\hat{y}_t)\) | Imbalanced |
| Contrastive | \(\max(0, d_{pos} - d_{neg} + m)\) | Metric learning |
| Triplet | \(\max(0, d(a,p) - d(a,n) + m)\) | Face recognition |

Misconception: MSE is suitable for classification

MSE + sigmoid yields vanishing gradients at saturation: \(\sigma'(z) \to 0\) for \(|z| > 5\). The MSE gradient contains the factor \(\sigma(z)(1-\sigma(z))\), which goes to 0. Cross-entropy does not have this problem: its gradient is proportional to \((\sigma(z) - y)\) and does not vanish. In practice, switching from MSE to CE speeds up convergence by 3-5x on classification tasks.

Misconception: Focal Loss is always better than CE for imbalanced data

Focal Loss with \(\gamma=2\) reduces the loss of easy examples by ~100x (at \(p_t=0.9\): \((1-0.9)^2 = 0.01\)). But with moderate imbalance (1:10), plain CE with class weights often works just as well. Focal Loss is critical under extreme imbalance (1:1000+), e.g. object detection, where >99% of anchors are background; a minimal sketch follows below.
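A minimal sketch of binary Focal Loss (the function name and the \(\alpha\) balancing term are our additions; Lin et al. use \(\alpha=0.25\), \(\gamma=2\) as defaults):

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Binary focal loss sketch: down-weights easy examples by (1 - p_t)^gamma.

    logits: raw scores [N]; targets: float tensor of 0./1. [N].
    """
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class balancing
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()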


2. Backpropagation (nn_001_backprop)

Best sources

MUST DO: - Karpathy: micrograd — autograd from scratch - Karpathy: nn-zero-to-hero — course

YouTube: - 3Blue1Brown: Backpropagation — visualization - Karpathy: Let's build GPT

Blogs: - Colah's Blog: Backprop — canonical explanation - Chain Rule Explained

Key concepts

Chain Rule: \[\frac{\partial L}{\partial w} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial z} \cdot \frac{\partial z}{\partial w}\]

Computational Graph:

graph LR
    x["x"] --> f["f"]
    f --> z["z"]
    z --> g["g"]
    g --> y["y"]
    y --> L["L"]
    L --> loss["loss"]

    f -. "dz/dx" .-> fd["grad"]
    g -. "dy/dz" .-> gd["grad"]
    L -. "dL/dy" .-> Ld["grad"]

    style x fill:#e8eaf6,stroke:#3f51b5
    style z fill:#e8eaf6,stroke:#3f51b5
    style y fill:#e8eaf6,stroke:#3f51b5
    style loss fill:#fce4ec,stroke:#c62828
    style f fill:#e8f5e9,stroke:#4caf50
    style g fill:#e8f5e9,stroke:#4caf50
    style L fill:#e8f5e9,stroke:#4caf50

Topological Sort: Backward pass visits nodes in reverse topological order.

micrograd pattern:

class Value:
    def backward(self):
        # Build a reverse topological order of the computation graph.
        # Assumes each Value stores its children in _prev and a _backward
        # closure that pushes gradients into them (as in micrograd).
        topo = []
        visited = set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1.0              # seed: dL/dL = 1
        for v in reversed(topo):     # visit nodes in reverse topological order
            v._backward()


3. Optimizers (dl_004_optimizers)

Best sources

Papers: - Adam Paper — Kingma & Ba, 2014 - AdamW Paper — Loshchilov & Hutter

Blogs: - Ruder: Optimization Overview — MUST READ - Sebastian Raschka: Optimizers

Evolution of Optimizers

| Optimizer | Update Rule | Innovation |
|---|---|---|
| SGD | \(w = w - \eta \nabla L\) | Baseline |
| Momentum | \(v = \gamma v + \eta \nabla L\), \(w = w - v\) | Accumulate velocity |
| AdaGrad | \(w = w - \frac{\eta}{\sqrt{G}} \nabla L\) | Per-param LR |
| RMSprop | \(E[g^2] = \gamma E[g^2] + (1-\gamma)g^2\) | Fix AdaGrad |
| Adam | \(m = \beta_1 m + (1-\beta_1)g\), \(v = \beta_2 v + (1-\beta_2)g^2\) | Combine all |

Adam formulas: \[m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t\]

\[v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2\]
\[\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t}\]
\[w_t = w_{t-1} - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}\]
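The four formulas map line-by-line to code. A minimal NumPy sketch of a single Adam step (function and variable names are ours):

import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. w: weights, g: gradient, m/v: moment estimates, t: step >= 1."""
    m = beta1 * m + (1 - beta1) * g            # EMA of gradients
    v = beta2 * v + (1 - beta2) * g**2         # EMA of squared gradients
    m_hat = m / (1 - beta1**t)                 # bias correction
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v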

Misconception: Adam does not need weight decay

The original Adam (2014) implements L2 regularization incorrectly: it adds the penalty to the gradient BEFORE adaptive scaling, which weakens regularization for parameters with large \(v_t\). AdamW (2017) fixes this by applying weight decay directly to the weights: \(w_t = (1 - \eta\lambda)w_{t-1} - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}\). On LLM pretraining the difference in perplexity can reach 5-10%.


4. Weight Initialization (dl_006_weight_init)

Best sources

Papers: - Xavier Initialization — Glorot & Bengio - He Initialization — He et al.

Blogs: - Deep Learning Book: Initialization

Key Formulas

| Init | Variance | Activation |
|---|---|---|
| Xavier | \(\frac{1}{n_{in}}\) or \(\frac{2}{n_{in}+n_{out}}\) | tanh, sigmoid |
| He | \(\frac{2}{n_{in}}\) | ReLU |
| LeCun | \(\frac{1}{n_{in}}\) | SELU |

Why not zeros? - All neurons compute same output - Same gradients → same updates - No symmetry breaking

# PyTorch
nn.init.xavier_uniform_(layer.weight)
nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')

5. Normalization (dl_009_batch_norm_layernorm)

Best sources

Papers: - Batch Normalization — Ioffe & Szegedy, 2015 - Layer Normalization — Ba et al., 2016 - RMSNorm — Zhang & Sennrich, 2019

Blogs: - Lil'Log: Normalization

Comparison

| Method | Normalize over | Statistics | Use Case |
|---|---|---|---|
| BatchNorm | Batch dim | \(\mu_B, \sigma_B\) | CNNs |
| LayerNorm | Feature dim | \(\mu_L, \sigma_L\) | Transformers, RNNs |
| InstanceNorm | Spatial dim | \(\mu_I, \sigma_I\) | Style transfer |
| GroupNorm | Group of channels | \(\mu_G, \sigma_G\) | Small batches |
| RMSNorm | Feature dim | No mean | LLMs (LLaMA, Qwen) |

RMSNorm (2025 standard): \[\text{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{d}\sum_i x_i^2}} \cdot \gamma\]

Simpler than LayerNorm (no mean subtraction), works better for LLMs.
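A minimal PyTorch sketch of the formula above (the class name and the eps term for numerical stability are our additions):

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, d, eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(d))   # learnable scale; no bias
        self.eps = eps

    def forward(self, x):
        # Root-mean-square over the feature dim; note: no mean subtraction.
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()
        return x / rms * self.gamma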

Misconception: BatchNorm is always better than no normalization

BatchNorm depends on batch size. At batch_size=1 (inference, some GANs) batch statistics are meaningless. At batch_size < 8 the batch variance is unstable, degrading training by ~2-5% accuracy. For Transformers and RNNs use LayerNorm; for LLMs (2025+), RMSNorm. BatchNorm remains the standard only for CNNs with batch_size >= 32.


6. LR Scheduling (dl_007_lr_scheduling)

Best sources

Papers: - SGDR: Warm Restarts - 1cycle Policy

Common Schedules

| Schedule | Formula | Use Case |
|---|---|---|
| Step | \(\eta \cdot \gamma^{\lfloor epoch/d \rfloor}\) | Simple baseline |
| Cosine | \(\eta_{min} + \frac{1}{2}(\eta_{max}-\eta_{min})(1+\cos(\frac{t}{T}\pi))\) | Standard for LLMs |
| Linear Warmup | \(\eta = \eta_{base} \cdot \frac{t}{T_{warmup}}\) | Transformers |
| 1cycle | warmup → peak → anneal | Fast training |

Warmup + Cosine Decay (LLM standard):

import math

def lr_lambda(step):
    # Returns a multiplier on the base LR; warmup_steps and total_steps
    # are assumed to be defined in the enclosing scope.
    if step < warmup_steps:
        return step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1 + math.cos(math.pi * progress))


7. PyTorch Training Loop (dl_002_pytorch_training_loop)

Best sources

Documentation: - PyTorch Tutorials - PyTorch Recipes

Blogs: - PyTorch Best Practices

Standard Training Loop

for epoch in range(num_epochs):
    model.train()  # reset each epoch: the eval() below would otherwise persist
    for batch_idx, (X, y) in enumerate(dataloader):
        X, y = X.to(device), y.to(device)

        # Forward
        optimizer.zero_grad()
        output = model(X)
        loss = criterion(output, y)

        # Backward
        loss.backward()
        optimizer.step()

    # Validation
    model.eval()
    with torch.no_grad():
        val_loss = evaluate(model, val_loader)

    # LR scheduling
    scheduler.step()

Common bugs: - model.eval() missing for validation - optimizer.zero_grad() missing - Not using torch.no_grad() for inference - model.train() not restored after validation


8. CNN from Scratch (nn_002_cnn)

Best sources

Courses: - CS231n: CNNs — MUST DO - d2l.ai: CNNs

YouTube: - 3Blue1Brown: CNNs

Key Concepts

Convolution: \[(f * g)[n] = \sum_{m} f[m] \cdot g[n-m]\]

Output size: \[H_{out} = \lfloor\frac{H_{in} + 2P - K}{S}\rfloor + 1\]

where \(P\) = padding, \(K\) = kernel size, \(S\) = stride.

Backward through conv: - Gradient w.r.t. input = full convolution of error with flipped kernel - Gradient w.r.t. kernel = convolution of input with error
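A quick sketch checking the output-size formula against PyTorch (the layer parameters are arbitrary example values):

import torch
import torch.nn as nn

H_in, K, S, P = 32, 3, 2, 1
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=K, stride=S, padding=P)
out = conv(torch.randn(1, 3, H_in, H_in))

H_out = (H_in + 2 * P - K) // S + 1     # floor((32 + 2 - 3) / 2) + 1 = 16
assert out.shape[-1] == H_out == 16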


9. RNN/LSTM (nn_003_rnn_lstm)

Best sources

Papers: - LSTM Paper — Hochreiter & Schmidhuber, 1997 - GRU Paper — Cho et al., 2014

Blogs: - Colah's Blog: LSTM — MUST READ - Lil'Log: RNN

Vanishing Gradient Problem

Vanilla RNN: \[h_t = \tanh(W_{hh}h_{t-1} + W_{xh}x_t + b)\]

Gradient: \(\frac{\partial h_t}{\partial h_{t-k}} = \prod_{i=0}^{k-1} W_{hh} \cdot \text{diag}(\tanh')\)

If \(\|W_{hh}\| < 1\), gradient vanishes exponentially.

LSTM Gates

\[f_t = \sigma(W_f[h_{t-1}, x_t] + b_f) \quad \text{(forget gate)}\]
\[i_t = \sigma(W_i[h_{t-1}, x_t] + b_i) \quad \text{(input gate)}\]
\[\tilde{C}_t = \tanh(W_C[h_{t-1}, x_t] + b_C) \quad \text{(candidate)}\]
\[C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \quad \text{(cell state)}\]
\[o_t = \sigma(W_o[h_{t-1}, x_t] + b_o) \quad \text{(output gate)}\]
\[h_t = o_t \odot \tanh(C_t)\]
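The six equations above in a minimal sketch of one LSTM step (function and weight names are ours; each \(W_*\) acts on the concatenated \([h_{t-1}, x_t]\)):

import torch

def lstm_step(x, h_prev, c_prev, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o):
    """One LSTM cell update; each W_* has shape [d_h + d_x, d_h]."""
    hx = torch.cat([h_prev, x], dim=-1)
    f = torch.sigmoid(hx @ W_f + b_f)        # forget gate
    i = torch.sigmoid(hx @ W_i + b_i)        # input gate
    c_tilde = torch.tanh(hx @ W_C + b_C)     # candidate
    c = f * c_prev + i * c_tilde             # cell state
    o = torch.sigmoid(hx @ W_o + b_o)        # output gate
    h = o * torch.tanh(c)
    return h, c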

10. Attention Mechanism (dl_001_attention_mechanism)

Best sources

Paper: - Attention Is All You Need — Vaswani et al., 2017

Blogs: - The Illustrated Transformer — MUST READ - Lil'Log: Attention

Scaled Dot-Product Attention

\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]

Why \(\sqrt{d_k}\)? - Large \(d_k\) → dot products grow → softmax becomes peaky → small gradients - Scaling prevents this

Multi-Head Attention

\[\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h)W^O\]

where \(\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)\)

import math
import torch
import torch.nn.functional as F

def attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    attn = F.softmax(scores, dim=-1)
    return torch.matmul(attn, V)

11. Positional Encodings (dl_003_positional)

Best sources

Papers: - Attention Is All You Need — Sinusoidal - RoPE — Rotary Position Embedding

Blogs: - RoPE Explained

Sinusoidal Encoding

\[PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d})\]
\[PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d})\]

Properties: - Model can generalize to longer sequences - Relative positions can be computed - Fixed (not learned)
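A short sketch generating the sinusoidal table above (assumes an even embedding dimension; names are ours):

import torch

def sinusoidal_pe(max_len, d):
    """Returns [max_len, d] positional encodings per the formulas above."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # [max_len, 1]
    two_i = torch.arange(0, d, 2, dtype=torch.float32)             # 2i = 0, 2, 4, ...
    freq = 1.0 / (10000 ** (two_i / d))
    pe = torch.zeros(max_len, d)
    pe[:, 0::2] = torch.sin(pos * freq)
    pe[:, 1::2] = torch.cos(pos * freq)
    return pe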

RoPE (Rotary Position Embedding)

Key Idea: Encode position via rotation in complex plane.

\[f(x, m) = (x + ix') \cdot e^{im\theta}\]

Advantages: - Relative position via rotation composition - Better length extrapolation - Standard in LLaMA, Qwen, Mistral


12. KV-Cache Theory for LLM Inference (Gap Filler)

Best sources

Articles (2025): - KV Cache Optimization via Multi-Head Latent Attention — PyImageSearch, Oct 2025 - vLLM Tutorial 2025: The Ultimate Guide — vife.ai, Jan 2026 - KV Cache System-level Optimizations — Sara Zan, Oct 2025

Papers: - vLLM: Easy, Fast, and Cheap LLM Serving — Kwon et al., 2023 - DeepSeek-V2: MLA — Multi-Head Latent Attention

What is KV-Cache?

Problem: Autoregressive generation computes attention for ALL previous tokens at each step.

Without cache: \(O(n^2)\) attention computation per token.

With cache: \(O(n)\) — only compute K/V for new token, reuse cached values.

Memory formula: \[\text{KV Memory} = 2 \times L \times B \times S \times H \times D_h \times \text{bytes}\]

Where: \(L\) = layers, \(B\) = batch, \(S\) = sequence length, \(H\) = heads, \(D_h\) = head dim.

Example (Llama-2-7B, 28K context): - KV cache ≈ 14 GB (comparable to model weights!) - This is why inference is memory-bandwidth bound, not compute bound.
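Plugging Llama-2-7B's published shape (32 layers, 32 heads, head dim 128) into the formula reproduces the figure above; a quick sketch:

def kv_cache_bytes(layers, batch, seq_len, heads, head_dim, dtype_bytes=2):
    """Per the formula above; the leading 2 covers both K and V."""
    return 2 * layers * batch * seq_len * heads * head_dim * dtype_bytes

# Llama-2-7B at 28K context, batch 1, FP16
print(kv_cache_bytes(32, 1, 28_000, 32, 128) / 1e9)  # ~14.7 GB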

PagedAttention (vLLM)

Idea: OS-inspired virtual memory paging for KV cache.

Before: Pre-allocated contiguous blocks → 60-80% memory waste due to fragmentation.

PagedAttention: 1. Divide KV cache into fixed-size pages (e.g., 16-512 tokens per page) 2. Store pages non-contiguously 3. Virtual-to-physical mapping via page table

Results: - 24x higher throughput vs HuggingFace - <4% memory waste (vs 60-80%) - More concurrent requests on same hardware

# vLLM usage example
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3-8b-hf")
sampling_params = SamplingParams(temperature=0.8, max_tokens=100)
outputs = llm.generate(["Hello, my name is"], sampling_params)

Multi-Head Latent Attention (MLA) — DeepSeek-V2

Idea: Compress KV into shared latent space instead of caching full-resolution tensors.

Standard MHA: Cache \((n_{heads} \times d_h)\) per token per layer.

MLA: Compress to latent dimension \(D_{KV} \ll n_{heads} \times d_h\).

Math: \[C_{KV} = X W^{KV}_{down} \quad \text{(compress to latent)}\]

\[K = C_{KV} W^K_{up}, \quad V = C_{KV} W^V_{up} \quad \text{(up-project on demand)}\]

Benefits: - 8x reduction in cache size (e.g., 512 vs 4096 values per token) - On-demand reconstruction preserves accuracy - Used in DeepSeek-V2, DeepSeek-V3

MQA vs GQA vs MHA

| Method | KV Heads | Memory | Quality |
|---|---|---|---|
| MHA (Multi-Head) | \(H\) | Full | Best |
| MQA (Multi-Query) | 1 | \(\frac{1}{H}\) | Slight drop |
| GQA (Grouped Query) | \(G\) where \(1 < G < H\) | \(\frac{G}{H}\) | Near MHA |

Used in: - Llama-2: MHA - Llama-3: GQA (8 groups for 32 heads) - PaLM: MQA
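A minimal sketch of the KV-head sharing behind GQA (pure tensor bookkeeping; function name and shapes are our illustration):

import torch

def expand_kv_for_gqa(kv, n_q_heads):
    """kv: [B, n_kv_heads, S, d_h] -> [B, n_q_heads, S, d_h]."""
    n_kv_heads = kv.shape[1]
    group = n_q_heads // n_kv_heads            # query heads per KV head
    return kv.repeat_interleave(group, dim=1)  # each KV head serves a whole group

k = torch.randn(1, 8, 16, 64)                  # Llama-3 style: 8 KV heads
print(expand_kv_for_gqa(k, 32).shape)          # torch.Size([1, 32, 16, 64])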

System-Level Optimizations

Memory Management: - PagedAttention (vLLM): Virtual memory paging - vTensor: Flexible tensor management - ChunkAttention: Prefix tree for shared prefixes

Scheduling: - BatchLLM: Group requests by prefix for cache reuse - RadixAttention: Radix tree for KV sharing - FastServe: Preemptive GPU↔CPU cache swapping

Hardware-aware: - FlashAttention: HBM↔SRAM tiling, fused kernels - FlexGen: GPU↔CPU↔SSD tiered offloading - DistServe: Separate prefill/decode across GPUs

Interview Questions

Q: Why does KV cache become the bottleneck in LLM inference?

A: For long sequences, KV memory scales as \(O(L \times S \times H \times D)\). In models like Llama-2-7B with 28K context, KV cache can exceed model weights (14GB+). GPUs become memory-bandwidth bound — reading/writing wide matrices is slower than compute.

Q: Explain PagedAttention in one sentence.

A: PagedAttention applies OS virtual memory concepts to KV cache — storing fixed-size pages non-contiguously with page table mapping, eliminating fragmentation and enabling 24x throughput gains.

Q: What's the difference between MQA and GQA?

A: MQA uses a single KV head shared across all query heads (minimal memory, slight quality drop). GQA uses \(G\) KV heads where \(1 < G < H\), balancing memory reduction (\(\frac{G}{H}\)) with near-MHA quality. Llama-3 uses GQA with 8 groups for 32 heads.

Q: How does MLA (Multi-Head Latent Attention) reduce KV cache?

A: MLA projects K and V into a compressed latent space (\(C_{KV}\)) before caching, typically 8x smaller. During attention, latent vectors are up-projected on-demand. This preserves modeling capacity while dramatically reducing memory bandwidth.


13. Attention Variants (Gap Filler)

Cross-Attention (Encoder-Decoder)

Used in: Translation, T5, original Transformer

\[\text{CrossAttn}(Q, K_{enc}, V_{enc}) = \text{softmax}\left(\frac{QK_{enc}^T}{\sqrt{d_k}}\right)V_{enc}\]

Decoder queries attend to encoder outputs.

Decoder-only models (GPT, Llama): Only self-attention, no cross-attention.

Sliding Window Attention

Problem: Full attention is \(O(n^2)\) for long sequences.

Solution: Each token only attends to local window of \(W\) tokens.

\[\text{scores}_{ij} = -\infty \quad \text{if } |i - j| > W\]

Used in: - Longformer (global + local attention) - Mistral (sliding window 4096) - Reduces to \(O(n \cdot W)\) complexity

Flash Attention

Key insight: Attention is memory-bound (HBM reads/writes), not compute-bound.

Technique: 1. Load Q, K, V blocks into SRAM 2. Compute attention in SRAM (fused softmax, no materialization) 3. Write only final output to HBM

Speedup: 2-4x faster, same numerical results.

Versions: - FlashAttention-1 (2022): Tiling, fused kernels - FlashAttention-2 (2023): Better parallelism, work partitioning - FlashAttention-3 (2024): H100 optimized, FP8 support

import torch.nn.functional as F

# PyTorch 2.0+ ships fused (flash) attention behind this single call
out = F.scaled_dot_product_attention(Q, K, V, is_causal=True)

Linear Attention

Goal: Replace softmax with kernel to get \(O(n)\) complexity.

\[\text{Attn}(Q, K, V) = \text{softmax}(QK^T)V \approx \phi(Q)(\phi(K)^T V)\]

Where \(\phi\) is a kernel feature map (e.g., ELU+1).

Used in: Linear Transformer, Performer, RWKV (partially).
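A minimal non-causal sketch of the kernel trick above with \(\phi(x) = \text{ELU}(x) + 1\) (normalization term included; names are ours):

import torch
import torch.nn.functional as F

def linear_attention(Q, K, V, eps=1e-6):
    """O(n) attention: phi(Q) @ (phi(K)^T V) with phi = ELU + 1. Shapes [B, n, d]."""
    q, k = F.elu(Q) + 1, F.elu(K) + 1
    kv = torch.einsum("bnd,bne->bde", k, V)                    # sum_j phi(k_j) v_j^T
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(1)) + eps)  # normalizer
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)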


14. Mixture of Experts (MoE) — Gap Filler

Best sources

Articles (2025-2026): - Build a Mixture-of-Experts LLM from Scratch — Into AI, Jan 2026 - Router Wars: Which MoE Routing Strategy Actually Works — Cerebras, Aug 2025 - The Rise of MoE: Comparing 2025's Leading Models — Friendli AI

Papers: - Outrageously Large Neural Networks — Shazeer et al., 2017 (original MoE) - Switch Transformer — Fedus et al., 2021 - Mixtral 8x7B — Mistral AI, 2024 - DeepSeekMoE — DeepSeek, 2024

MoE Architecture Overview

Key Idea: Replace large FFN with multiple smaller FFNs ("experts"), route each token to top-k experts.

Benefits: - Scale parameters without scaling compute - Sparse activation (only subset of experts active per token) - Better specialization for different token types

Models using MoE: - Mixtral 8x7B (8 experts, top-2 routing) - Grok-1 (314B params, sparse MoE) - DeepSeek-V3 (256 routed experts + shared) - GPT-OSS (120B, MoE architecture)

Mathematical Details

Step 1: Router logits \[l_i = x \cdot W_g^{(i)} \quad \text{for } i = 1, ..., n\]

Step 2: Top-k selection: keep the top-\(k\) logits, mask the rest with \(-\infty\).

Step 3: Softmax over the selected experts \[w_i = \frac{\exp(l_i)}{\sum_{j \in \text{top-}k} \exp(l_j)}\]

Step 4: Weighted combination \[y = \sum_{i \in \text{top-}k} w_i \cdot E_i(x)\]
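The four steps above in a minimal sketch (a dense loop over experts for clarity; production kernels dispatch tokens in parallel; names are ours):

import torch
import torch.nn.functional as F

def moe_forward(x, W_g, experts, k=2):
    """x: [T, d]; W_g: [d, n_experts]; experts: list of callables [., d] -> [., d]."""
    logits = x @ W_g                               # Step 1: router logits [T, n]
    topk_vals, topk_idx = logits.topk(k, dim=-1)   # Step 2: keep top-k per token
    weights = F.softmax(topk_vals, dim=-1)         # Step 3: softmax over selected
    y = torch.zeros_like(x)
    for slot in range(k):                          # Step 4: weighted combination
        for e, expert in enumerate(experts):
            mask = topk_idx[:, slot] == e          # tokens whose slot picked expert e
            if mask.any():
                w = weights[mask, slot].unsqueeze(-1)
                y[mask] += w * expert(x[mask])
    return y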

Routing Strategies Comparison

| Strategy | Description | Pros | Cons |
|---|---|---|---|
| Hash Routing | Deterministic: \(\text{expert} = \text{token\_id} \bmod N\) | Perfect load balance | Ignores context, low specialization |
| Learned Routing | Trainable router with aux loss | 3x better quality than hash | Router collapse risk |
| Sinkhorn Routing | Iterative normalization per layer | Hash-level balance + learned quality | Hard to scale, gradient issues |

Auxiliary Loss (for load balancing): \[L_{aux} = \text{coeff} \cdot \sum_i f_i \cdot P_i\]

Where \(f_i\) = fraction of tokens routed to expert \(i\), \(P_i\) = mean router probability assigned to expert \(i\).

Router Collapse Problem

Symptoms: - Most tokens routed to 1-2 experts - Other experts become "dead" - Model degrades to dense performance

Fixes: - Auxiliary loss (load balancing regularization) - Expert capacity factors - Shared experts (always activated) + routed experts - Z-loss regularization (DeepSeek-V3)

DeepSeekMoE Innovations

Hybrid architecture: - Shared experts: Always activated (e.g., 1-2 experts) - Routed experts: Top-k selection (e.g., 256 experts, top-8)

Benefits: - Shared experts capture universal knowledge - Routed experts specialize - More stable training, better utilization

Interview Questions

Q: What does "8x7B" mean in Mixtral 8x7B?

A: 8 experts × 7B parameters each. But total params ≠ 56B because: (1) shared components (attention, embeddings) aren't replicated, (2) only top-2 experts active per token → ~13B active params per forward pass.

Q: Why does router collapse happen?

A: Early in training, some experts get slightly better at handling common patterns. Router learns to send more tokens to them → they improve faster → positive feedback loop. Without auxiliary loss, this compounds until most experts are unused.

Q: What's the trade-off between Hash and Learned routing?

A: Hash routing: perfect load balance, 0 overhead, but ignores context → low specialization (~1.5% loss improvement). Learned routing: context-aware, 3x better performance (~4% loss improvement), but requires auxiliary loss and can collapse. Production systems use learned + engineering tricks.

Q: How do shared experts help?

A: Shared experts (always activated) provide a stable "base" representation that all tokens receive. This: (1) prevents collapse by ensuring minimum utilization, (2) captures universal patterns, (3) allows routed experts to focus on specialization rather than general knowledge.


15. State Space Models / Mamba — Gap Filler

Лучшие источники

Articles (2025): - How Mamba Beats Transformers at Long Sequences — Galileo AI, Sep 2025 - A Visual Guide to Mamba and State Space Models — Maarten Grootendorst

Papers: - Mamba: Linear-Time Sequence Modeling — Gu & Dao, 2023 - Mamba-2 — Dao et al., 2024 - S4: Efficiently Modeling Long Sequences — Gu et al., 2021

The Transformer Problem

Self-attention: \(O(T^2)\) time and memory — every token attends to every token.

Consequence: Sequences beyond a few thousand tokens become impractical. KV cache grows linearly with length.

State Space Models (SSM) Basics

Continuous-time formulation: \[h'(t) = Ah(t) + Bx(t)\]

\[y(t) = Ch(t)\]

Discretized (for sequences): \[h_t = \bar{A}h_{t-1} + \bar{B}x_t\]

\[y_t = Ch_t\]

Where \(\bar{A} = \exp(\Delta A)\), \(\bar{B} = (\Delta A)^{-1}(\exp(\Delta A) - I) \cdot \Delta B\).

Key properties: - Training: Parallel scan \(O(T \log T)\) - Inference: Recurrent update \(O(1)\) per token - Fixed hidden state size (no growing cache)
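A minimal sketch of the discretized recurrence at inference time (single input channel; \(\bar{A}, \bar{B}, C\) assumed precomputed; names are ours):

import torch

def ssm_generate(x, A_bar, B_bar, C):
    """x: [T] scalar inputs; A_bar: [N, N]; B_bar, C: [N]. O(1) state per step."""
    h = torch.zeros(A_bar.shape[0])
    ys = []
    for x_t in x:                        # recurrent inference: fixed-size state
        h = A_bar @ h + B_bar * x_t      # h_t = A_bar h_{t-1} + B_bar x_t
        ys.append(C @ h)                 # y_t = C h_t
    return torch.stack(ys)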

S4 vs Mamba

| Aspect | S4 (2021) | Mamba (2023) | Mamba-2 (2024) |
|---|---|---|---|
| Parameters | Fixed A, B, C | Input-dependent B, C, Δ | A = scalar × I |
| Selectivity | No | Yes | Yes |
| Complexity | \(O(T \log T)\) | \(O(T)\) | \(O(T)\) |
| Memory | Fixed | Fixed | Fixed |

Selective SSM (Mamba's Innovation)

Key insight: Make parameters functions of input, not fixed.

\[\Delta_t = \text{Broadcast}_{\text{D}}(\text{Linear}_1(x_t))\]
\[B_t = \text{Linear}_2(x_t)\]
\[C_t = \text{Linear}_3(x_t)\]

Selectivity means: - Can selectively remember or forget information - Skip uninformative tokens (like "the") - Devote capacity to important content

Mamba vs Transformer

| Aspect | Transformer | Mamba |
|---|---|---|
| Time complexity | \(O(T^2)\) | \(O(T)\) |
| Memory (inference) | \(O(T)\) KV cache | \(O(1)\) hidden state |
| Training parallelism | Full | Parallel scan |
| Inference speed | Slower for long sequences | 5× faster (>2k tokens) |
| Positional encoding | Required | Not needed (implicit in recurrence) |

Hybrid Architectures (2025 Trend)

Jamba (AI21): Transformer layers + Mamba layers interleaved.

Bamba (IBM): Mamba2 + attention hybrid, 2× faster inference.

When to use Mamba: - Long document Q&A - Audio/video processing - Genomics (million-token sequences) - Constant memory requirement scenarios

When to prefer Transformers: - Dense global token interactions - Complex multi-hop reasoning - Established production pipelines

Interview Questions

Q: Why is Mamba faster than Transformers for long sequences?

A: Transformers have \(O(T^2)\) attention complexity. Mamba uses selective SSMs with \(O(T)\) complexity. At inference, Mamba only stores a fixed-size hidden state, while Transformers grow KV cache linearly. For >2k tokens, Mamba runs 5× faster.

Q: What makes Mamba "selective"?

A: Traditional SSMs (like S4) use fixed matrices A, B, C. Mamba makes B, C, and step size Δ functions of the current input. This allows the model to selectively remember important information and forget uninformative tokens on-the-fly.

Q: When would you choose Mamba over Transformers?

A: Mamba excels at long sequences (>2k tokens) where attention becomes prohibitively expensive. Use cases: document processing, audio/video, genomics. Prefer Transformers when you need dense global interactions or have established pipelines.

Q: What's the difference between S4 and Mamba?

A: S4 (2021) introduced structured SSMs with fixed parameters and \(O(T \log T)\) training. Mamba (2023) added input-dependent parameters (selectivity) and efficient CUDA kernels for \(O(T)\) training. Mamba-2 (2024) simplified the transition matrix A to a scalar multiple of identity for easier kernel fusion.


16. Transformer Architecture Deep Dive (Pre-Norm vs Post-Norm)

Sources: Medium "Why Pre-Norm Became the Default in Transformers" (Jan 2025), LayerNorm papers

Layer Normalization Placement

Post-Norm (Original Transformer, 2017)

Post-Norm Block:
y = LayerNorm(x + Attention(x))
y = LayerNorm(y + FeedForward(y))

Characteristics: - The residual sum x + Attention(x) is formed first - Normalization AFTER the addition - The gradient must pass through LayerNorm

Pre-Norm (GPT-2, 2019; now the standard)

Pre-Norm Block:
y = x + Attention(LayerNorm(x))
y = y + FeedForward(LayerNorm(y))

Characteristics: - Clean residual path: in x + Attention(LayerNorm(x)), x is passed through directly - Normalization BEFORE the sublayer - The gradient flows directly along the residual connection

Why Pre-Norm won

| Aspect | Post-Norm | Pre-Norm |
|---|---|---|
| Gradient Flow | Through LayerNorm | Directly along the residual |
| Training Stability | Requires warmup | More stable |
| Deep Networks | Hard to train beyond ~12 layers | 100+ layers train easily |
| Learning Rate | Sensitive | Less sensitive |
| Gradient Vanishing | A problem at depth | Much less of a problem |

Gradient Flow Math

Post-Norm gradient: \[\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \cdot \frac{\partial \text{LayerNorm}}{\partial (x + \text{Attention}(x))}\]

The gradient must pass through LayerNorm, which adds complication.

Pre-Norm gradient: \[\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \cdot \left(1 + \frac{\partial \text{Attention}}{\partial \text{LayerNorm}(x)}\right)\]

The term 1 gives a DIRECT gradient path: the gradient always reaches the input.

Double Norm Innovation (2024-2025)

Recent models (Grok, Gemma 2, Olmo 2) use Double Norm:

Double Norm Block:
y = x + Attention(LayerNorm(x))     # Pre-Norm attention
y = LayerNorm(y)                     # Post-Norm output
y = y + FeedForward(LayerNorm(y))   # Pre-Norm FFN
y = LayerNorm(y)                     # Post-Norm output

Advantages: - Pre-Norm for training stability - Post-Norm for better representations at the output

Python: Pre-Norm vs Post-Norm Block

import torch
import torch.nn as nn

class PreNormTransformerBlock(nn.Module):
    """Pre-Norm (современный стандарт)"""
    def __init__(self, d_model, n_heads, d_ff, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout)
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Pre-Norm: normalize BEFORE attention
        normed = self.norm1(x)
        attn_out, _ = self.attn(normed, normed, normed, attn_mask=mask)
        x = x + self.dropout(attn_out)  # Clean residual

        # Pre-Norm: normalize BEFORE FFN
        normed = self.norm2(x)
        ff_out = self.ff(normed)
        x = x + self.dropout(ff_out)    # Clean residual

        return x


class PostNormTransformerBlock(nn.Module):
    """Post-Norm (original transformer)"""
    def __init__(self, d_model, n_heads, d_ff, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout)
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Post-Norm: normalize AFTER residual
        attn_out, _ = self.attn(x, x, x, attn_mask=mask)
        x = self.norm1(x + self.dropout(attn_out))

        ff_out = self.ff(x)
        x = self.norm2(x + self.dropout(ff_out))

        return x

Interview Questions

Q: Why is Pre-Norm more stable than Post-Norm?

A: In Pre-Norm the residual connection passes the gradient through directly: \(\frac{\partial L}{\partial x}\) contains the term 1. This means the gradient always reaches the early layers without decay. In Post-Norm the gradient must pass through LayerNorm, which adds nonlinearity and potential attenuation.

Q: When should Post-Norm be used?

A: Post-Norm can give better representations at the block output, since LayerNorm "evens out" the activations. Some models use Double Norm (Pre + Post) to get the advantages of both approaches.

Q: How does Pre-Norm affect learning rate warmup?

A: Pre-Norm is less sensitive to learning rate and often allows removing or greatly shortening warmup. Post-Norm requires careful warmup to avoid gradient explosion in early iterations.

Q: What is Double Norm?

A: Double Norm combines Pre-Norm and Post-Norm: normalization BEFORE the sublayer for stable gradient flow, and AFTER the residual connection for a better output distribution. Used in Grok, Gemma 2, Olmo 2.


17. Distributed Training (DDP, Pipeline, Tensor, ZeRO/FSDP)

Sources: Datahacker.rs "LLMs from Scratch #007" (Nov 2025), Deepak Baby Blog (Dec 2025), ZeRO Paper

The Memory Wall Problem

Why distributed training is needed: - 7B parameters in FP32: \(7B \times 4 = 28\) GB (weights alone) - Gradients: another 28 GB - Optimizer states (Adam): 56 GB (two momentum terms) - Activations: substantial overhead - Total: 150GB+ for a 7B model; it does not fit on a single GPU!

Three Fundamental Parallelization Strategies

| Strategy | What is parallelized | When to use |
|---|---|---|
| Data Parallelism | Training data | Large datasets, model fits on a single GPU |
| Model/Tensor Parallelism | Model layers | Very large layers (attention in transformers) |
| Pipeline Parallelism | Model stages | Deep models with many sequential layers |
| 3D Parallelism | Combination | Extremely large models (100B+ params) |

Distributed Data Parallel (DDP)

How it works:

  1. Replicate the model on every GPU
  2. Shard the batch across GPUs
  3. Independent forward pass
  4. Independent backward pass
  5. AllReduce to synchronize gradients
  6. Update weights

AllReduce = Reduce-Scatter + All-Gather: \[\text{AllReduce}(X) = \text{AllGather}(\text{ReduceScatter}(X))\]

Cost: \(2 \times \text{size}(X)\)

| DDP Pros | DDP Cons |
|---|---|
| Simple implementation | Memory redundancy (full model per GPU) |
| Linear scaling | Model must fit on a single GPU |
| No model changes | Communication overhead (AllReduce) |
| Fault tolerance | Synchronization barrier (stragglers) |
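A minimal DDP setup sketch (single node, launched with torchrun; MyModel is a placeholder):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])    # set by torchrun
torch.cuda.set_device(local_rank)

model = MyModel().cuda(local_rank)            # full replica on each GPU
model = DDP(model, device_ids=[local_rank])   # gradients synced via AllReduce

# Pair the DataLoader with DistributedSampler so each rank sees its own shard.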

Pipeline Parallelism

How it works: 1. Split the model into N stages (by layers) 2. Each GPU processes its own stage 3. Micro-batching to reduce bubbles

Pipeline Bubble Formula: \[\text{Bubble Fraction} = \frac{p - 1}{m}\]

Where \(p\) = number of stages, \(m\) = number of micro-batches. For example, \(p=8\) stages with \(m=32\) micro-batches idle \(7/32 \approx 22\%\) of the time.

Pipeline Schedules:

| Schedule | Description | Memory | Bubble Ratio |
|---|---|---|---|
| GPipe | All forward, then all backward | High | \((p-1)/m\) |
| 1F1B | Alternates forward/backward | Lower | \((p-1)/m\) |
| Interleaved 1F1B | Virtual stages | Lowest | \((p-1)/(m \cdot v)\) |

Tensor Parallelism

How it works: shard individual layers (not sequential stages).

MLP Example:

# Column-parallel: shard first linear
Y1 = X @ W1[:, :k]  # GPU 0
Y2 = X @ W1[:, k:]  # GPU 1

# Row-parallel: shard second linear
Z1 = Y1 @ W2[:k, :]  # GPU 0
Z2 = Y2 @ W2[k:, :]  # GPU 1

# AllReduce to combine
Z = Z1 + Z2  # AllReduce

When to use: Very large layers, within single node (NVLink required).

ZeRO (Zero Redundancy Optimizer)

ZeRO stages:

| Stage | What is sharded | Memory Savings |
|---|---|---|
| ZeRO-1 | Optimizer states | 4x |
| ZeRO-2 | + Gradients | 8x |
| ZeRO-3 | + Parameters | \(N\times\) (N = GPU count) |

ZeRO-3 / FSDP Process:

  1. Shard: Split params/gradients/optimizer states across GPUs
  2. Gather: AllGather params when needed for computation
  3. Compute: Forward/backward pass
  4. Scatter: Reduce-scatter gradients to owners
  5. Update: Each GPU updates its shard

Fully Sharded Data Parallel (FSDP)

Gather-Compute-Scatter Pattern:

for layer in model:
    # Gather: collect full layer params from all GPUs
    all_gather(layer.params)

    # Compute: forward/backward
    output = layer(input)

    # Scatter: return unused params, reduce-scatter grads
    reduce_scatter(layer.gradients)

Sharding Strategies: - FULL_SHARD — ZeRO-3 equivalent (max memory savings) - SHARD_GRAD_OP — ZeRO-2 equivalent - NO_SHARD — DDP equivalent

Python: FSDP with PyTorch

import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy

# Initialize distributed
torch.distributed.init_process_group(backend="nccl")

# Create model
model = MyLargeModel()

# Wrap with FSDP (ZeRO-3 equivalent)
model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # ZeRO-3
    device_id=torch.cuda.current_device(),
)

# Training loop works normally
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for batch in dataloader:
    optimizer.zero_grad()
    loss = model(batch).sum()
    loss.backward()
    optimizer.step()

3D Parallelism

Combination for extreme scale:

3D Parallelism = Data Parallel × Pipeline Parallel × Tensor Parallel

Example (Llama 3 405B): - Tensor Parallel: 8 GPUs per node - Pipeline Parallel: 8 stages - Data Parallel: 128 replicas - Total: \(8 \times 8 \times 128 = 8192\) GPUs

Comparison Summary

| Strategy | Memory | Communication | Complexity | Use Case |
|---|---|---|---|---|
| DDP | High | AllReduce | Low | Small models |
| Pipeline | Medium | Point-to-point | Medium | Deep models |
| Tensor | Low | AllGather/Reduce | High | Wide layers |
| FSDP | Very Low | AllGather/ReduceScatter | Medium | Large models |
| 3D | Lowest | Complex | Very High | 100B+ params |

Interview Questions

Q: What is the difference between DDP and FSDP?

A: DDP replicates the full model on every GPU and synchronizes only gradients via AllReduce. FSDP shards parameters, gradients, and optimizer states across GPUs, gathering them on demand for computation. FSDP makes it possible to train models that do not fit in a single GPU's memory.

Q: What is a pipeline bubble and how do you reduce it?

A: A pipeline bubble is idle time while a GPU waits for data from the previous stage. Formula: \(\frac{p-1}{m}\), where \(p\) = stages, \(m\) = micro-batches. It is reduced with more micro-batches, better schedules (1F1B, Interleaved), and interleaved stages.

Q: When should you use Tensor Parallelism vs Pipeline Parallelism?

A: Tensor Parallelism suits very wide layers (attention), requires NVLink, and works only within a single node. Pipeline Parallelism suits deep models and works across nodes, but carries bubble overhead. For 100B+ models the two are combined (3D Parallelism).

Q: Explain the ZeRO stages.

A: ZeRO-1 shards optimizer states (4x memory savings). ZeRO-2 adds gradient sharding (8x savings). ZeRO-3 also shards the parameters (\(N\times\) savings for \(N\) GPUs). ZeRO-3 is equivalent to FSDP FULL_SHARD.


18. Vision Transformers (ViT)

Sources: Codecademy "Vision Transformers Architecture" (Sept 2025), GeeksforGeeks ViT Architecture (2025), "An Image Is Worth 16x16 Words" paper

Core Idea

Vision Transformer (ViT) applies transformer architecture to images by treating them as sequences of patches, not convolutions.

Key insight: Image \(\rightarrow\) Patches \(\rightarrow\) Transformer (same as text!)

Architecture Components

1. Image Patching

Input: Image [H, W, C]
Output: N patches of size [P, P, C]

Example: 224×224×3 image → 196 patches of 16×16×3
Number of patches: N = (H × W) / (P × P) = 50176 / 256 = 196

2. Patch Embedding

Step 1: Flatten patches: \[\text{Patch vector} = P^2 \times C = 16 \times 16 \times 3 = 768\]

Step 2: Linear projection to D dimensions: \[\mathbf{z}_0 = [\mathbf{x}_{\text{cls}}; \mathbf{x}_p^1 \mathbf{E}; \mathbf{x}_p^2 \mathbf{E}; \ldots; \mathbf{x}_p^N \mathbf{E}] + \mathbf{E}_{\text{pos}}\]

Where \(\mathbf{E} \in \mathbb{R}^{(P^2 \cdot C) \times D}\) is the learnable linear projection.

3. Positional Encoding

Why needed: Transformers are permutation invariant — need to encode spatial order.

Learnable positional embeddings: \[\mathbf{E}_{\text{pos}} \in \mathbb{R}^{(N+1) \times D}\]

Added to patch embeddings to preserve spatial information.

4. CLS Token

Purpose: Learnable token prepended to sequence that aggregates global information.

Sequence: [CLS, Patch_1, Patch_2, ..., Patch_N]
          └── Used for final classification

The CLS token attends to ALL patches and learns image-level representation.

5. Transformer Encoder

Standard transformer block (Pre-LN):

y = x + Attention(LayerNorm(x))
y = y + FFN(LayerNorm(y))

Self-Attention Formula: \[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]

Multi-Head Attention: \[\text{MSA}(X) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W_O\]

Feed-Forward Network: \[\text{FFN}(x) = W_2 \cdot \text{GELU}(W_1 x + b_1) + b_2\]

6. Classification Head (MLP Head)

CLS token output → MLP → Softmax → Class probabilities

ViT vs CNN Comparison

| Feature | CNNs | ViTs |
|---|---|---|
| Attention Scope | Local (convolutions) | Global (self-attention) |
| Inductive Bias | Strong (locality, translation invariance) | Minimal; more flexible but data-hungry |
| Data Requirement | Works with small datasets | Needs large datasets |
| Feature Learning | Hierarchical (low→high) | Context-rich, long-range |
| Computational | \(O(K^2 \cdot C_{in} \cdot C_{out})\) per layer | \(O(N^2 \cdot D)\) per attention |
| Transfer Learning | Good | Excellent with pretraining |

Python: ViT from Scratch

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Convert image into patches and embed them."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.n_patches = (img_size // patch_size) ** 2
        # Conv2d with stride=patch_size is equivalent to patch extraction + projection
        self.proj = nn.Conv2d(
            in_channels, embed_dim,
            kernel_size=patch_size, stride=patch_size
        )

    def forward(self, x):
        # x: [B, C, H, W] -> [B, embed_dim, H/P, W/P]
        x = self.proj(x)
        # Flatten: [B, embed_dim, n_patches] -> [B, n_patches, embed_dim]
        x = x.flatten(2).transpose(1, 2)
        return x


class VisionTransformer(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_channels=3,
                 num_classes=1000, embed_dim=768, depth=12, num_heads=12, mlp_ratio=4.0):
        super().__init__()
        self.patch_embed = PatchEmbedding(img_size, patch_size, in_channels, embed_dim)
        n_patches = self.patch_embed.n_patches

        # CLS token + positional embeddings
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, embed_dim))

        # Transformer encoder
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads,
            dim_feedforward=int(embed_dim * mlp_ratio),
            activation='gelu', batch_first=True, norm_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)

        # Classification head
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        B = x.shape[0]

        # Patch embedding
        x = self.patch_embed(x)  # [B, n_patches, embed_dim]

        # Add CLS token
        cls_tokens = self.cls_token.expand(B, -1, -1)
        x = torch.cat([cls_tokens, x], dim=1)  # [B, n_patches+1, embed_dim]

        # Add positional embeddings
        x = x + self.pos_embed

        # Transformer encoder
        x = self.encoder(x)

        # Classification from CLS token
        cls_output = x[:, 0]  # [B, embed_dim]
        return self.head(cls_output)
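A quick smoke test of the sketch above (random input, default ViT-Base-like hyperparameters):

model = VisionTransformer()
x = torch.randn(2, 3, 224, 224)   # batch of 2 RGB images
logits = model(x)
print(logits.shape)               # torch.Size([2, 1000])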

ViT Variants (2024-2025)

| Model | Key Innovation | Best For |
|---|---|---|
| DeiT | Data-efficient training, distillation | Small datasets |
| Swin | Hierarchical, shifted windows | Dense prediction |
| ConvNeXt | CNN designed like ViT | General vision |
| MaxViT | Multi-axis attention | Hybrid local-global |
| EVA-02 | CLIP-pretrained | Transfer learning |

Advantages & Limitations

| Advantages | Limitations |
|---|---|
| Global context (long-range deps) | Data-hungry (needs JFT/ImageNet-21K) |
| Scales well at large sizes | Quadratic attention cost \(O(N^2)\) |
| Excellent transfer learning | Longer training time |
| Unified with NLP transformers | Positional encoding dependency |
| Parallel processing | Less efficient for small datasets |

Interview Questions

Q: How does ViT process images?

A: ViT splits the image into fixed-size patches (e.g., 16×16), flattens each patch into a vector, and projects it linearly into the embedding dimension. It prepends a learnable CLS token (as in BERT) and adds positional embeddings. The sequence then goes through a standard transformer encoder, and the CLS token's output is used for classification.

Q: What is the difference between ViT and CNN?

A: A CNN uses convolutions with local receptive fields, has strong inductive biases (locality, translation invariance), and works on small datasets. ViT uses global self-attention: every patch interacts with every other, with minimal inductive bias. It needs large datasets (JFT-300M, ImageNet-21K) but transfers extremely well.

Q: Why does ViT need a CLS token?

A: The CLS token is a learnable vector prepended to the patch sequence. Through self-attention it aggregates information from ALL patches, learning a global image representation. At the encoder output, the CLS token's embedding is used for classification (like [CLS] in BERT for sentence classification).

Q: Why does ViT need more data than a CNN?

A: CNNs have strong inductive biases: locality (neighboring pixels are related) and translation equivariance (the same filter everywhere). ViT treats patches as unordered tokens, and self-attention must learn all relationships from scratch. On small datasets ViT overfits. Remedies: pretraining on huge datasets (JFT-300M) or DeiT-style distillation.


How the topics connect

graph TD
    calc["Calculus (gradient)"] --> bp["Backpropagation"]
    bp --> opt["Optimizers"]
    opt --> winit["Weight Init"]
    winit --> norm["Normalization"]
    norm --> loop["Training Loop"]
    loop --> loss["Loss Functions"]
    loss --> archs["CNN / RNN / LSTM"]
    archs --> attn["Attention"]
    pe["Positional Encodings"] --> attn
    attn --> trans["Transformers"]
    trans --> llm["LLMs"]

    style calc fill:#f3e5f5,stroke:#9c27b0
    style bp fill:#e8eaf6,stroke:#3f51b5
    style opt fill:#e8eaf6,stroke:#3f51b5
    style winit fill:#e8eaf6,stroke:#3f51b5
    style norm fill:#e8eaf6,stroke:#3f51b5
    style loop fill:#e8f5e9,stroke:#4caf50
    style loss fill:#e8f5e9,stroke:#4caf50
    style archs fill:#fff3e0,stroke:#ef6c00
    style attn fill:#fff3e0,stroke:#ef6c00
    style pe fill:#fff3e0,stroke:#ef6c00
    style trans fill:#fce4ec,stroke:#c62828
    style llm fill:#fce4ec,stroke:#c62828

Recommended study order

  1. Week 1: Backprop (micrograd), Weight Init
  2. Week 2: Optimizers (implement Adam from scratch)
  3. Week 3: Normalization, LR Scheduling
  4. Week 4: CNN basics, Training Loop
  5. Week 5: RNN/LSTM, Vanishing Gradients
  6. Week 6: Attention, Positional Encodings