
Deep Learning Progress 2025-2026

~7 minute read

Prerequisites: Normalization: Deep Dive | Flash Attention 3

Four directions define deep learning in 2025-2026: (1) SSL -- SelfMatch reaches 93.19% on CIFAR-10 with only 40 labels (vs 86.19% for FixMatch), making nearly label-free training practical; (2) Flow Matching replaces 1000-step diffusion with 1-10-step generation at comparable quality; (3) RMSNorm has become the standard for LLMs (LLaMA, Qwen, Mistral), replacing LayerNorm with a ~10% speedup; (4) the mixed precision + gradient checkpointing + Flash Attention stack makes it possible to train a 7B model on a single 24GB GPU.


1. Self-Supervised Learning Advances

1.1 Основные направления SSL 2025

| Method | Key Idea | Best For |
|--------|----------|----------|
| MAE (Masked Autoencoder) | Mask patches, reconstruct | Vision transformers |
| DINO/DINOv2 | Self-distillation, no labels | Dense prediction tasks |
| SimCLR | Contrastive learning | General vision |
| data2vec | Self-distillation with masking | Multi-modal |

1.2 DINO-Tracker (2024-2025)

Innovation: Test-time training on single video + DINO-ViT features

DINO-Tracker Pipeline:
1. Load pre-trained DINO-ViT
2. Fine-tune on single video (self-supervised)
3. Track points with refined features

Results: SOTA on tracking benchmarks

Key Insight: DINO's semantic features + motion adaptation = robust tracking

1.3 Hi-End-MAE (2025)

Hierarchical Encoder-driven MAE for Medical Imaging:

graph LR
    subgraph "Traditional MAE"
        A1["Encoder"] --> A2["Decoder<br/>(final layer only)"]
    end
    subgraph "Hi-End-MAE"
        B1["Encoder"] --> B2["Hierarchical Decoder<br/>(all layers)"]
        B2 --> B3["Fine-grained<br/>semantics"]
    end
    style A1 fill:#e8eaf6,stroke:#3f51b5
    style A2 fill:#fff3e0,stroke:#ef6c00
    style B1 fill:#e8eaf6,stroke:#3f51b5
    style B2 fill:#e8f5e9,stroke:#4caf50
    style B3 fill:#e8f5e9,stroke:#4caf50

Results:
- Pre-trained on 10K CT scans
- Superior transfer learning on 7 segmentation benchmarks

1.4 SSL + Few-Shot Learning (SelfMatch)

Two-Stage Training:
1. Stage 1: Contrastive SSL pre-training
2. Stage 2: Augmentation consistency fine-tuning (a minimal sketch of this loss follows the table below)

| Dataset | Labels | SelfMatch | FixMatch |
|---------|--------|-----------|----------|
| CIFAR-10 | 40 | 93.19% | 86.19% |
| SVHN | 40 | 96.05% | 92.25% |
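
A minimal sketch of the Stage-2 idea, assuming a FixMatch-style confidence-thresholded pseudo-labeling loss (the `model`, `weak_aug`, and `strong_aug` callables are hypothetical):

```python
import torch
import torch.nn.functional as F

def augmentation_consistency_loss(model, x_unlabeled, weak_aug, strong_aug, threshold=0.95):
    """Pseudo-label a weakly augmented view, enforce it on a strongly augmented view."""
    with torch.no_grad():
        probs = F.softmax(model(weak_aug(x_unlabeled)), dim=-1)
        conf, pseudo = probs.max(dim=-1)            # per-sample confidence and pseudo-label
        mask = (conf >= threshold).float()          # keep only confident pseudo-labels
    logits_strong = model(strong_aug(x_unlabeled))
    loss = F.cross_entropy(logits_strong, pseudo, reduction="none")
    return (loss * mask).mean()
```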

2. Diffusion Model Improvements

2.1 Flow Matching vs Diffusion

| Aspect | Diffusion Models | Flow Matching |
|--------|------------------|---------------|
| Process | Add noise → denoise | Learn transport map |
| Steps | 50-1000 iterations | 1-10 steps possible |
| Quality | Excellent | Comparable or better |
| Speed | Slow | Faster |
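
A minimal sketch of a Flow Matching training objective, assuming the straight-path (rectified-flow) formulation; `v_model` is a hypothetical network that predicts a velocity field from `(x_t, t)`:

```python
import torch

def flow_matching_loss(v_model, x1):
    """Regress the model onto the constant velocity (x1 - x0) of straight noise→data paths."""
    x0 = torch.randn_like(x1)                                  # source distribution: Gaussian noise
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)), device=x1.device)
    x_t = (1 - t) * x0 + t * x1                                # point on the straight path at time t
    target_v = x1 - x0                                         # velocity of that path
    pred_v = v_model(x_t, t.flatten())
    return torch.mean((pred_v - target_v) ** 2)
```

Sampling then integrates dx/dt = v_model(x, t) from noise at t=0 to data at t=1, which is why a handful of Euler steps can suffice.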

2.2 Consistency Models (2024-2025)

Goal: One-step or few-step generation

Traditional Diffusion:  x_T → x_{T-1} → ... → x_0  (1000 steps)
Consistency Model:      x_T → x_0                   (1 step)

Consistency Training Constraint: \(f_\theta(x_{t+\Delta t}, t + \Delta t) = f_\theta(x_t, t)\)

Where \(f_\theta\) maps a point at any timestep back to the trajectory origin \(x_0\).
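
A rough sketch of the consistency-training idea under simplifying assumptions (shared noise at adjacent noise levels, an EMA teacher held fixed, x0-prediction parameterization); `f_student` and `f_teacher` are hypothetical networks:

```python
import torch

def consistency_training_loss(f_student, f_teacher, x0, sigma, d_sigma):
    """Adjacent points on the same noising trajectory should map to the same origin."""
    noise = torch.randn_like(x0)
    x_t  = x0 + sigma * noise                       # point at noise level sigma
    x_tp = x0 + (sigma + d_sigma) * noise           # neighboring point at sigma + d_sigma
    with torch.no_grad():
        target = f_teacher(x_t, sigma)              # stop-gradient target from the teacher
    pred = f_student(x_tp, sigma + d_sigma)
    return torch.mean((pred - target) ** 2)
```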

2.3 Flow-Anchored Consistency Models (FACM)

July 2025:

FACM Training:
1. Train Flow Matching model
2. Use as anchor for Consistency training
3. Combine benefits of both approaches

2.4 Key Papers 2025

| Paper | Venue | Innovation |
|-------|-------|------------|
| Inverse Flow & Consistency | ICML 2025 | Unified framework |
| FACM | arXiv 2025 | Flow anchoring |
| Align Your Flow | NVIDIA | Continuous-time distillation |
| Training FM via Diffusion | CVPR 2025 | Bridge pretrained models |

3. Normalization Techniques

3.1 Comparison Table

| Technique | Formula | Pros | Cons |
|-----------|---------|------|------|
| BatchNorm | \(\frac{x - \mu_B}{\sigma_B}\) | Fast training | Batch-dependent |
| LayerNorm | \(\frac{x - \mu_L}{\sigma_L}\) | Batch-independent | Slower |
| RMSNorm | \(\frac{x}{\sqrt{\text{mean}(x^2)}}\) | Efficient | No mean centering |
| GroupNorm | Divide channels into groups | Small batch-friendly | More compute |

3.2 RMSNorm (2025 Standard for LLMs)

Formula: \(\text{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{n}\sum_{i=1}^n x_i^2 + \epsilon}} \cdot \gamma\)

vs LayerNorm: \(\text{LayerNorm}(x) = \frac{x - \mu}{\sigma} \cdot \gamma + \beta\)

Key Difference: RMSNorm removes mean subtraction
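
A minimal PyTorch sketch matching the formula above (the module and dimension names are illustrative):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Scale by the root mean square of the features: no mean subtraction, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(dim))   # learnable scale, like LayerNorm's gamma

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x / rms * self.gamma
```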

3.3 Why RMSNorm for LLMs?

| Factor | RMSNorm | LayerNorm |
|--------|---------|-----------|
| Compute | ~10% faster | Standard |
| Mean subtraction | No | Yes |
| Modern LLMs | LLaMA, Qwen, Mistral | BERT, GPT-2 |
| Convergence | Similar | Similar |

3.4 GroupNorm Use Cases

  • Small batch sizes (< 16)
  • Object detection (Faster R-CNN)
  • Video processing
  • When batch statistics are unreliable
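
A minimal usage sketch; the group count of 32 is a common choice, not a requirement:

```python
import torch
import torch.nn as nn

# GroupNorm normalizes over channel groups within each sample, so it does not
# rely on batch statistics and behaves identically at any batch size.
gn = nn.GroupNorm(num_groups=32, num_channels=256)
x = torch.randn(2, 256, 32, 32)   # tiny batch is fine: no running batch statistics
y = gn(x)                         # same shape, normalized per sample and per group
```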

4. Continual Learning & Catastrophic Forgetting

4.1 The Problem

Catastrophic Forgetting: Neural networks quickly overwrite previous knowledge when learning new tasks.

Task 1 (A) → Task 2 (B) → Task 3 (C)
Performance on A: 95% → 45% → 20%  ← Catastrophic!

4.2 Mitigation Strategies

| Strategy | How It Works | Example |
|----------|--------------|---------|
| Replay | Store/rehearse old data | GEM, A-GEM |
| Regularization | Penalize important weights | EWC, SI |
| Architecture | Separate network parts | PackNet, PNN |
| Mixed Training | Interleave old + new data | Math + NLI mixing |

4.3 Mixed Training Results (Dec 2025)

Experiment: Fine-tune Flan-T5-Base on math with/without NLI mixing

| Training | Math Acc | NLI Acc | Forgetting |
|----------|----------|---------|------------|
| Math-only | 12.0% | 16.5% | 64.5pp drop |
| Mixed 1:1 | 12.0% | 86.2% | Zero |
| Mixed 15:1 | 11.5% | 75.0% | Minimal |

Key Insight: Even 6.2% interleaved old data (15:1 mixing) keeps forgetting minimal!
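
A minimal sketch of 1:1 mixed training at the data-loading level, assuming two hypothetical map-style datasets `old_ds` (e.g. NLI) and `new_ds` (e.g. math):

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler

def mixed_loader(old_ds, new_ds, batch_size=32):
    """Draw old-task and new-task examples with equal probability, regardless of dataset sizes."""
    mixed = ConcatDataset([old_ds, new_ds])
    weights = torch.cat([
        torch.full((len(old_ds),), 0.5 / len(old_ds)),   # old data: 50% of the sampling mass
        torch.full((len(new_ds),), 0.5 / len(new_ds)),   # new data: 50% of the sampling mass
    ])
    sampler = WeightedRandomSampler(weights, num_samples=len(mixed), replacement=True)
    return DataLoader(mixed, batch_size=batch_size, sampler=sampler)
```

Changing the two mass terms (e.g. 1/16 vs 15/16) gives the 15:1 ratio from the table above.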

4.4 Does Adam Exacerbate Forgetting?

Research Finding (2021):
- Adam vs SGD comparison on forgetting
- Result: SGD sometimes experiences less forgetting than Adam
- Suggests optimizer choice matters for continual learning

4.5 Energy-Based Models for CL

EBM Advantage:
- Changes the training objective to reduce interference
- No external memory needed
- Outperforms baselines on several benchmarks


5. Efficient Training Techniques

5.1 Mixed Precision Training

Concept: Use FP16/BF16 for forward, FP32 for master weights

graph LR
    A["FP32 Master Weights"] --> B["FP16 Copy"]
    B --> C["Forward Pass"]
    C --> D["Loss (FP32)"]
    C --> E["Gradient (FP16)"]
    E --> F["FP32 Update"]
    F --> A
    style A fill:#e8eaf6,stroke:#3f51b5
    style B fill:#fff3e0,stroke:#ef6c00
    style C fill:#e8f5e9,stroke:#4caf50
    style D fill:#f3e5f5,stroke:#9c27b0
    style E fill:#fff3e0,stroke:#ef6c00
    style F fill:#e8eaf6,stroke:#3f51b5

Benefits:
- 2x memory reduction
- 1.5-3x speedup on modern GPUs
- Minimal accuracy loss

When to Use BF16 vs FP16:

| Format | Dynamic Range | Use Case |
|--------|---------------|----------|
| FP16 | Limited | Older GPUs |
| BF16 | Same as FP32 | Modern GPUs (A100, H100) |
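
A minimal sketch of one BF16 training step with `torch.autocast`; for FP16 on older GPUs you would additionally wrap the loss in `torch.cuda.amp.GradScaler`:

```python
import torch

def train_step(model, optimizer, loss_fn, x, y):
    """One mixed-precision step: BF16 forward/backward, FP32 master weights."""
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = loss_fn(model(x), y)    # activations and matmuls run in BF16
    loss.backward()                    # gradients land in the FP32 parameters
    optimizer.step()                   # FP32 master-weight update
    return loss.detach()
```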

5.2 Gradient Checkpointing

Concept: Don't store all activations, recompute during backward pass

Without Checkpointing:  Store all activations → Memory = O(n)
With Checkpointing:     Store checkpoints → Memory = O(√n)

Trade-off:
- Memory: ~50-70% reduction
- Speed: ~20% slower (recomputation)

When to Use:
- Model doesn't fit in GPU memory
- Training with larger batch sizes
- Fine-tuning large models
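
A minimal sketch using `torch.utils.checkpoint.checkpoint_sequential`, which stores only segment boundaries and recomputes the rest on the backward pass (model size and segment count are illustrative):

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

# 24 blocks split into 4 checkpointed segments → roughly O(sqrt(n)) activation memory.
blocks = [torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU()) for _ in range(24)]
model = torch.nn.Sequential(*blocks)

x = torch.randn(8, 1024, requires_grad=True)
y = checkpoint_sequential(model, 4, x)   # forward stores only segment-boundary activations
y.sum().backward()                       # inner activations are recomputed here
```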

5.3 Combined Techniques

Best Practice 2025:

1. Mixed Precision (BF16) → 2x memory reduction, faster compute
2. Gradient Checkpointing → Further memory reduction
3. Gradient Accumulation → Effective batch size increase
4. Flash Attention → Efficient attention computation

5.4 Memory-Efficient Training Stack

| Technique | Memory Saving | Speed Impact |
|-----------|---------------|--------------|
| Mixed Precision | 50% | +50-200% |
| Gradient Checkpointing | 50-70% | -20% |
| Flash Attention | 20-40% | +20-50% |
| Gradient Accumulation | Variable | None (more steps) |
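
Gradient accumulation is the one item in the table with no dedicated API; a minimal sketch of the usual loop:

```python
import torch

def train_epoch(model, optimizer, loader, loss_fn, accum_steps=8):
    """Simulate a batch that is accum_steps times larger than what fits in memory."""
    optimizer.zero_grad(set_to_none=True)
    for step, (x, y) in enumerate(loader):
        loss = loss_fn(model(x), y) / accum_steps   # scale so the accumulated sum matches a big-batch mean
        loss.backward()                             # gradients add up across micro-batches
        if (step + 1) % accum_steps == 0:
            optimizer.step()                        # one optimizer update per effective batch
            optimizer.zero_grad(set_to_none=True)
```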

6. Interview Questions

6.1 Concept Questions

Q: Compare BatchNorm, LayerNorm, RMSNorm.

A: BatchNorm: Normalize across batch dimension
     - Fast training, batch-dependent, issues with small batches

   LayerNorm: Normalize across feature dimension
     - Batch-independent, used in Transformers, slower

   RMSNorm: LayerNorm without mean subtraction
     - ~10% faster than LayerNorm, standard for LLMs

Q: What is catastrophic forgetting and how to mitigate it?

A: Catastrophic forgetting occurs when neural networks
   quickly lose previously learned knowledge when training
   on new tasks.

   Mitigation strategies:
   1. Replay: Store samples from old tasks
   2. Regularization: EWC (protect important weights)
   3. Architecture: Progressive networks
   4. Mixed training: Interleave old and new data

Q: Explain mixed precision training.

A: Mixed precision uses FP16/BF16 for computations while
   maintaining FP32 master weights for numerical stability.

   Benefits:
   - 2x memory reduction
   - 1.5-3x speedup on modern GPUs
   - Enables larger batch sizes

   Key: Loss scaling to prevent gradient underflow

6.2 Architecture Questions

Q: Design a memory-efficient training pipeline for 7B LLM.

A: Stack:
   1. BF16 Mixed Precision (2x memory)
   2. Gradient Checkpointing (50% more reduction)
   3. Flash Attention (efficient attention)
   4. Gradient Accumulation (effective batch size)
   5. LoRA/QLoRA (parameter-efficient fine-tuning)

   Expected: Fit 7B model on single 24GB GPU
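
For the LoRA/QLoRA item in that stack, a hedged configuration sketch assuming the Hugging Face `peft` library; the rank, alpha, and target module names are illustrative, not prescriptive:

```python
from peft import LoraConfig, get_peft_model

lora_cfg = LoraConfig(
    r=16,                                  # low-rank adapter dimension
    lora_alpha=32,                         # adapter scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt (model-dependent)
    task_type="CAUSAL_LM",
)
# model = get_peft_model(base_model, lora_cfg)  # only the small adapter matrices are trainable
```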

Q: Compare Flow Matching vs Diffusion models.

A: Diffusion:
   - Add noise iteratively → denoise
   - 50-1000 steps for generation
   - Well-established, high quality

   Flow Matching:
   - Learn direct transport map
   - 1-10 steps possible
   - Faster, comparable quality
   - Newer, less explored

6.3 Implementation Questions

Q: When would you use GroupNorm over BatchNorm?

A: Use GroupNorm when:
   - Batch size is small (< 16)
   - Training on variable-length sequences
   - Batch statistics are unreliable
   - Object detection (region-based)
   - Video processing

   Avoid GroupNorm when:
   - Large batch sizes available
   - Pure speed is priority

Q: How does gradient checkpointing work?

A: During forward pass:
   1. Only save "checkpoint" activations at certain layers
   2. Discard intermediate activations

   During backward pass:
   1. Recompute discarded activations from checkpoints
   2. Compute gradients normally

   Memory: O(√n) vs O(n) for full storage
   Trade-off: ~20% slower due to recomputation


7. Key Papers & Resources

| Paper/Resource | Year | Key Contribution |
|----------------|------|------------------|
| DINO-Tracker | 2024 | Test-time SSL for tracking |
| Hi-End-MAE | 2025 | Hierarchical decoder for MAE |
| FACM | 2025 | Flow-anchored consistency |
| RMSNorm paper | 2019 | Efficient normalization |
| EBM for CL | 2020 | Energy-based continual learning |
| Mixed Training for CL | 2025 | 1:1 mixing eliminates forgetting |

8. Formulas Quick Reference

RMSNorm

\[\text{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{n}\sum_{i=1}^n x_i^2 + \epsilon}} \cdot \gamma\]

LayerNorm

\[\text{LayerNorm}(x) = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \cdot \gamma + \beta\]

Consistency Model Constraint

\[f_\theta(x_t, t) = f_\theta(x_{t'}, t') \quad \forall t, t' \in [0, T]\]

Gradient Checkpointing Memory

\[\text{Memory} = O(\sqrt{n}) \text{ vs } O(n)\]

Common Misconceptions

Misconception: RMSNorm and LayerNorm give the same quality; the only difference is speed

RMSNorm removes mean subtraction (\(\mu\)), keeping only the scale (\(\gamma\)). For most LLMs this does not affect convergence, but in tasks with centered distributions (some vision tasks) the lack of mean centering can hurt results. The choice depends on the domain: RMSNorm for LLMs, BatchNorm/GroupNorm for vision.

Misconception: Gradient checkpointing is free memory reduction

Checkpointing reduces memory from O(n) to O(√n), but slows training by ~20% because activations are recomputed during the backward pass. For shallow models (< 12 layers) the overhead may not justify the savings. Best practice: enable it only when the model does not fit on the GPU without it.

Misconception: Catastrophic forgetting is a problem only for small models

An experiment with Flan-T5-Base showed that fine-tuning on math without NLI mixing drops NLI accuracy from 81% to 16.5% (a 64.5pp drop). Even large LLMs lose capabilities under naive fine-tuning. Solution: 1:1 mixed training completely prevents forgetting, and even 6.2% old data (15:1 mixing) keeps it minimal.


Interview Questions

Design a memory-efficient pipeline for training a 7B LLM on a single 24GB GPU.

❌ "Just use LoRA" -- an incomplete answer that does not cover the training stack.

✅ The full stack: (1) BF16 Mixed Precision -- 2x memory reduction with the same dynamic range as FP32; (2) Gradient Checkpointing -- another 50-70% reduction at a ~20% speed cost; (3) Flash Attention -- 20-40% memory saving plus 20-50% speedup; (4) Gradient Accumulation -- larger effective batch size with no memory cost; (5) LoRA/QLoRA -- PEFT instead of full fine-tuning. Result: a 7B model fits on a 24GB GPU.

Why is Flow Matching replacing Diffusion Models? Give concrete numbers.

❌ "Flow Matching is just faster" -- no details.

✅ Diffusion: 50-1000 steps of iteratively adding/removing noise (DDPM). Flow Matching: learns a direct transport map from noise to data, enabling generation in 1-10 steps. Consistency Models (\(f_\theta(x_{t+\Delta t}, t + \Delta t) = f_\theta(x_t, t)\)) enable one-step generation. FACM (2025) combines Flow Matching + Consistency for a better quality/speed trade-off.

When should you use BF16 instead of FP16, and why does it matter?

❌ "BF16 is just newer" -- shows no understanding of the format difference.

✅ FP16: 5 exponent bits + 10 mantissa bits -- higher precision but a limited dynamic range (max ~65504). BF16: 8 exponent bits + 7 mantissa bits -- the same dynamic range as FP32 (~3.4 × 10^38) with lower precision. For LLM training BF16 is critical: gradients can span a huge range, and FP16 overflow/underflow requires loss scaling. BF16 works without loss scaling on A100/H100. On older GPUs (V100) only FP16 is available.