Deep Learning Progress 2025-2026¶
~7 minute read
Prerequisites: Normalization: A Deep Dive | Flash Attention 3
Four directions define deep learning in 2025-2026: (1) SSL -- SelfMatch reaches 93.19% on CIFAR-10 with only 40 labels (vs 86.19% for FixMatch), making training with minimal labels realistic; (2) Flow Matching replaces 1000-step diffusion with 1-10-step generation at comparable quality; (3) RMSNorm has become the standard for LLMs (LLaMA, Qwen, Mistral), replacing LayerNorm with a ~10% speedup; (4) the mixed precision + gradient checkpointing + Flash Attention stack makes it possible to train a 7B model on a single 24GB GPU.
1. Self-Supervised Learning Advances¶
1.1 Key SSL Directions in 2025¶
| Method | Key Idea | Best For |
|---|---|---|
| MAE (Masked Autoencoder) | Mask patches, reconstruct | Vision transformers |
| DINO/DINOv2 | Self-distillation, no labels | Dense prediction tasks |
| SimCLR | Contrastive learning | General vision |
| data2vec | Self-distillation with masking | Multi-modal |
1.2 DINO-Tracker (2024-2025)¶
Innovation: Test-time training on single video + DINO-ViT features
DINO-Tracker Pipeline:
1. Load pre-trained DINO-ViT
2. Fine-tune on single video (self-supervised)
3. Track points with refined features
Results: SOTA on tracking benchmarks
Key Insight: DINO's semantic features + motion adaptation = robust tracking
1.3 Hi-End-MAE (2025)¶
Hierarchical Encoder-driven MAE for Medical Imaging:
graph LR
subgraph "Traditional MAE"
A1["Encoder"] --> A2["Decoder<br/>(final layer only)"]
end
subgraph "Hi-End-MAE"
B1["Encoder"] --> B2["Hierarchical Decoder<br/>(all layers)"]
B2 --> B3["Fine-grained<br/>semantics"]
end
style A1 fill:#e8eaf6,stroke:#3f51b5
style A2 fill:#fff3e0,stroke:#ef6c00
style B1 fill:#e8eaf6,stroke:#3f51b5
style B2 fill:#e8f5e9,stroke:#4caf50
style B3 fill:#e8f5e9,stroke:#4caf50
Results:
- Pre-trained on 10K CT scans
- Superior transfer learning on 7 segmentation benchmarks
1.4 SSL + Few-Shot Learning (SelfMatch)¶
Two-Stage Training:
1. Stage 1: Contrastive SSL pre-training
2. Stage 2: Augmentation consistency fine-tuning
| Dataset | Labels | SelfMatch | FixMatch |
|---|---|---|---|
| CIFAR-10 | 40 | 93.19% | 86.19% |
| SVHN | 40 | 96.05% | 92.25% |
2. Diffusion Model Improvements¶
2.1 Flow Matching vs Diffusion¶
| Aspect | Diffusion Models | Flow Matching |
|---|---|---|
| Process | Add noise → denoise | Learn transport map |
| Steps | 50-1000 iterations | 1-10 steps possible |
| Quality | Excellent | Comparable or better |
| Speed | Slow | Faster |
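A minimal sketch of how the "learn transport map" row can look in code: a conditional Flow Matching training step along a straight (rectified-flow-style) path from noise to data. `velocity_net` is an assumed model taking `(x_t, t)`; this is an illustrative objective, not any specific paper's exact recipe.

```python
# Sketch of one conditional Flow Matching training step (linear path, assumed setup).
import torch

def flow_matching_loss(velocity_net, x1):
    """x1: a batch of data samples; x0 is drawn from a standard Gaussian."""
    x0 = torch.randn_like(x1)                       # noise endpoint
    t = torch.rand(x1.shape[0], device=x1.device)   # uniform time in [0, 1]
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))        # broadcast t over feature dims
    xt = (1 - t_) * x0 + t_ * x1                    # point on the straight noise->data path
    target_v = x1 - x0                              # constant velocity along that path
    pred_v = velocity_net(xt, t)                    # model predicts the velocity field
    return ((pred_v - target_v) ** 2).mean()
```

Sampling then integrates dx/dt = v(x, t) from noise with a handful of Euler steps, which is where the 1-10 step budget comes from.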
2.2 Consistency Models (2024-2025)¶
Goal: One-step or few-step generation
Consistency Training Formula:
$$f_\theta(x, t) = f_\theta(f_\theta(x, t + \Delta t), t)$$
Where \(f_\theta\) maps any timestep to the origin.
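A hedged sketch of one common way to train toward this self-consistency property (the consistency-distillation variant): the online network at time `t + Δt` is regressed onto an EMA copy evaluated one ODE solver step closer to the data. `f_theta`, `f_ema`, and `ode_step` are assumed helpers, not a specific codebase's API.

```python
# Sketch of a consistency-distillation step (assumed helper functions).
import torch

def consistency_loss(f_theta, f_ema, ode_step, x_t, t, dt):
    with torch.no_grad():
        x_prev = ode_step(x_t, t + dt, t)      # teacher solver moves one step toward t
        target = f_ema(x_prev, t)              # EMA network output at the earlier time
    pred = f_theta(x_t, t + dt)                # online network at the later time
    return ((pred - target) ** 2).mean()       # enforce agreement along the trajectory
```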
2.3 Flow-Anchored Consistency Models (FACM)¶
July 2025:
FACM Training:
1. Train Flow Matching model
2. Use as anchor for Consistency training
3. Combine benefits of both approaches
2.4 Key Papers 2025¶
| Paper | Venue | Innovation |
|---|---|---|
| Inverse Flow & Consistency | ICML 2025 | Unified framework |
| FACM | arXiv 2025 | Flow anchoring |
| Align Your Flow | NVIDIA | Continuous-time distillation |
| Training FM via Diffusion | CVPR 2025 | Bridge pretrained models |
3. Normalization Techniques¶
3.1 Comparison Table¶
| Technique | Formula | Pros | Cons |
|---|---|---|---|
| BatchNorm | \(\frac{x - \mu_B}{\sigma_B}\) | Fast training | Batch-dependent |
| LayerNorm | \(\frac{x - \mu_L}{\sigma_L}\) | Batch-independent | Slower |
| RMSNorm | \(\frac{x}{\sqrt{\text{mean}(x^2)}}\) | Efficient | No mean centering |
| GroupNorm | Divide channels into groups | Small batch-friendly | More compute |
3.2 RMSNorm (2025 Standard for LLMs)¶
Formula:
$$\text{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{n}\sum_{i=1}^n x_i^2 + \epsilon}} \cdot \gamma$$
vs LayerNorm:
$$\text{LayerNorm}(x) = \frac{x - \mu}{\sigma} \cdot \gamma + \beta$$
Key Difference: RMSNorm removes mean subtraction
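A minimal PyTorch sketch of the formula above (illustrative, not a specific library's class):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learnable scale gamma

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Divide by the root-mean-square over the last dimension; no mean subtraction, no bias.
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)
```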
3.3 Why RMSNorm for LLMs?¶
| Factor | RMSNorm | LayerNorm |
|---|---|---|
| Compute | ~10% faster | Standard |
| Mean subtraction | No | Yes |
| Modern LLMs | LLaMA, Qwen, Mistral | BERT, GPT-2 |
| Convergence | Similar | Similar |
3.4 GroupNorm Use Cases¶
- Small batch sizes (< 16)
- Object detection (e.g., Faster R-CNN)
- Video processing
- When batch statistics unreliable
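A small usage example with `torch.nn.GroupNorm`: statistics are computed per sample and per channel group, so it behaves the same even at batch size 1.

```python
import torch
import torch.nn as nn

gn = nn.GroupNorm(num_groups=32, num_channels=256)  # 256 channels split into 32 groups of 8
x = torch.randn(1, 256, 14, 14)                     # batch of 1: BatchNorm stats would be unreliable
y = gn(x)                                           # normalization per sample, per group
```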
4. Continual Learning & Catastrophic Forgetting¶
4.1 The Problem¶
Catastrophic Forgetting: Neural networks quickly overwrite previous knowledge when learning new tasks.
4.2 Mitigation Strategies¶
| Strategy | How It Works | Example |
|---|---|---|
| Replay | Store/rehearse old data | GEM, A-GEM |
| Regularization | Penalize important weights | EWC, SI |
| Architecture | Separate network parts | PackNet, PNN |
| Mixed Training | Interleave old + new data | Math + NLI mixing |
4.3 Mixed Training Results (Dec 2025)¶
Experiment: Fine-tune Flan-T5-Base on math with/without NLI mixing
| Training | Math Acc | NLI Acc | Forgetting |
|---|---|---|---|
| Math-only | 12.0% | 16.5% | 64.5pp drop |
| Mixed 1:1 | 12.0% | 86.2% | Zero! |
| Mixed 15:1 | 11.5% | 75.0% | Minimal |
Key Insight: Even 6.2% interleaved old data prevents forgetting!
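A hedged sketch of the mixing idea: replay one batch of old-task data every `MIX_EVERY` new-task batches (1 gives 1:1 mixing, 15 gives 15:1, i.e. ~6.2% old data). `new_loader`, `old_loader`, `model`, `optimizer`, and `loss_fn` are assumed to exist; this is not the paper's exact setup.

```python
from itertools import cycle

MIX_EVERY = 1                                # 1 -> 1:1 mixing, 15 -> 15:1 mixing
old_iter = cycle(old_loader)                 # small replay stream of old-task data

for step, (inputs, targets) in enumerate(new_loader):
    batches = [(inputs, targets)]
    if step % MIX_EVERY == 0:                # periodically interleave one old-task batch
        batches.append(next(old_iter))
    optimizer.zero_grad()
    for x, y in batches:
        loss = loss_fn(model(x), y)
        loss.backward()                      # gradients accumulate across the mixed batches
    optimizer.step()
```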
4.4 Does Adam Exacerbate Forgetting?¶
Research Finding (2021):
- Adam vs SGD comparison on forgetting
- Result: SGD sometimes experiences less forgetting than Adam
- Suggests optimizer choice matters for continual learning
4.5 Energy-Based Models for CL¶
EBM Advantage:
- Change the training objective to reduce interference
- No external memory needed
- Outperforms baselines on several benchmarks
5. Efficient Training Techniques¶
5.1 Mixed Precision Training¶
Concept: Run the forward and backward passes in FP16/BF16 while keeping FP32 master weights for the update
graph LR
A["FP32 Master Weights"] --> B["FP16 Copy"]
B --> C["Forward Pass"]
C --> D["Loss (FP32)"]
C --> E["Gradient (FP16)"]
E --> F["FP32 Update"]
F --> A
style A fill:#e8eaf6,stroke:#3f51b5
style B fill:#fff3e0,stroke:#ef6c00
style C fill:#e8f5e9,stroke:#4caf50
style D fill:#f3e5f5,stroke:#9c27b0
style E fill:#fff3e0,stroke:#ef6c00
style F fill:#e8eaf6,stroke:#3f51b5
Benefits:
- 2x memory reduction
- 1.5-3x speedup on modern GPUs
- Minimal accuracy loss
When to Use BF16 vs FP16:

| Format | Dynamic Range | Use Case |
|---|---|---|
| FP16 | Limited | Older GPUs |
| BF16 | Same as FP32 | Modern GPUs (A100, H100) |
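A sketch of one mixed-precision step with `torch.autocast`; `model`, `loader`, `optimizer`, and `loss_fn` are assumed to exist. With BF16 the GradScaler is effectively a pass-through; with FP16 it provides the loss scaling that guards against gradient underflow.

```python
import torch

amp_dtype = torch.bfloat16                                   # or torch.float16 on older GPUs
scaler = torch.cuda.amp.GradScaler(enabled=amp_dtype == torch.float16)

for inputs, targets in loader:
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=amp_dtype):
        loss = loss_fn(model(inputs), targets)               # forward pass in reduced precision
    scaler.scale(loss).backward()                            # loss scaling protects FP16 gradients
    scaler.step(optimizer)                                   # unscale, then FP32 master-weight update
    scaler.update()
```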
5.2 Gradient Checkpointing¶
Concept: Don't store all activations, recompute during backward pass
Without Checkpointing: Store all activations → Memory = O(n)
With Checkpointing: Store checkpoints → Memory = O(√n)
Trade-off:
- Memory: ~50-70% reduction
- Speed: ~20% slower (recomputation)
When to Use:
- Model doesn't fit in GPU memory
- Training with larger batch sizes
- Fine-tuning large models
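A sketch using `torch.utils.checkpoint`: only the inputs of each checkpointed block are kept during forward, and everything inside the block is recomputed during backward. The toy MLP stack is illustrative.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedStack(nn.Module):
    def __init__(self, num_layers: int = 24, dim: int = 1024):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(num_layers)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            # Keep only the block input; intermediate activations are recomputed in backward.
            x = checkpoint(block, x, use_reentrant=False)
        return x
```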
5.3 Combined Techniques¶
Best Practice 2025:
1. Mixed Precision (BF16) → 2x memory, faster
2. Gradient Checkpointing → Further memory reduction
3. Gradient Accumulation → Effective batch size increase
4. Flash Attention → Efficient attention computation
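A hedged sketch of the combined loop (BF16 autocast plus gradient accumulation; gradient checkpointing and Flash Attention are assumed to be enabled inside `model`, and `loader`, `optimizer`, `loss_fn` are assumed to exist):

```python
import torch

ACCUM_STEPS = 8                                               # effective batch = micro-batch x 8

for step, (inputs, targets) in enumerate(loader):
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        # Divide so the accumulated gradients average over the micro-batches.
        loss = loss_fn(model(inputs), targets) / ACCUM_STEPS
    loss.backward()
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.step()                                      # one update per ACCUM_STEPS micro-batches
        optimizer.zero_grad()
```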
5.4 Memory-Efficient Training Stack¶
| Technique | Memory Saving | Speed Impact |
|---|---|---|
| Mixed Precision | 50% | +50-200% |
| Gradient Checkpointing | 50-70% | -20% |
| Flash Attention | 20-40% | +20-50% |
| Gradient Accumulation | Variable | None (more steps) |
6. Interview Questions¶
6.1 Concept Questions¶
Q: Compare BatchNorm, LayerNorm, RMSNorm.
A: BatchNorm: Normalize across batch dimension
- Fast training, batch-dependent, issues with small batches
LayerNorm: Normalize across feature dimension
- Batch-independent, used in Transformers, slower
RMSNorm: LayerNorm without mean subtraction
- ~10% faster than LayerNorm, standard for LLMs
Q: What is catastrophic forgetting and how to mitigate it?
A: Catastrophic forgetting occurs when neural networks
quickly lose previously learned knowledge when training
on new tasks.
Mitigation strategies:
1. Replay: Store samples from old tasks
2. Regularization: EWC (protect important weights)
3. Architecture: Progressive networks
4. Mixed training: Interleave old and new data
Q: Explain mixed precision training.
A: Mixed precision uses FP16/BF16 for computations while
maintaining FP32 master weights for numerical stability.
Benefits:
- 2x memory reduction
- 1.5-3x speedup on modern GPUs
- Enables larger batch sizes
Key: Loss scaling to prevent gradient underflow
6.2 Architecture Questions¶
Q: Design a memory-efficient training pipeline for 7B LLM.
A: Stack:
1. BF16 Mixed Precision (2x memory)
2. Gradient Checkpointing (50% more reduction)
3. Flash Attention (efficient attention)
4. Gradient Accumulation (effective batch size)
5. LoRA/QLoRA (parameter-efficient fine-tuning)
Expected: Fit 7B model on single 24GB GPU
Q: Compare Flow Matching vs Diffusion models.
A: Diffusion:
- Add noise iteratively → denoise
- 50-1000 steps for generation
- Well-established, high quality
Flow Matching:
- Learn direct transport map
- 1-10 steps possible
- Faster, comparable quality
- Newer, less explored
6.3 Implementation Questions¶
Q: When would you use GroupNorm over BatchNorm?
A: Use GroupNorm when:
- Batch size is small (< 16)
- Training on variable-length sequences
- Batch statistics are unreliable
- Object detection (region-based)
- Video processing
Avoid GroupNorm when:
- Large batch sizes available
- Pure speed is priority
Q: How does gradient checkpointing work?
A: During forward pass:
1. Only save "checkpoint" activations at certain layers
2. Discard intermediate activations
During backward pass:
1. Recompute discarded activations from checkpoints
2. Compute gradients normally
Memory: O(√n) vs O(n) for full storage
Trade-off: ~20% slower due to recomputation
7. Key Papers & Resources¶
| Paper/Resource | Year | Key Contribution |
|---|---|---|
| DINO-Tracker | 2024 | Test-time SSL for tracking |
| Hi-End-MAE | 2025 | Hierarchical decoder for MAE |
| FACM | 2025 | Flow-anchored consistency |
| RMSNorm paper | 2019 | Efficient normalization |
| EBM for CL | 2020 | Energy-based continual learning |
| Mixed Training for CL | 2025 | 1:1 mixing eliminates forgetting |
8. Formulas Quick Reference¶
RMSNorm¶
$$\text{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{n}\sum_{i=1}^n x_i^2 + \epsilon}} \cdot \gamma$$
LayerNorm¶
$$\text{LayerNorm}(x) = \frac{x - \mu}{\sigma} \cdot \gamma + \beta$$
Consistency Model Constraint¶
$$f_\theta(x, t) = f_\theta(f_\theta(x, t + \Delta t), t)$$
Gradient Checkpointing Memory¶
$$O(\sqrt{n}) \text{ activations stored vs } O(n) \text{ without checkpointing}$$
Common Misconceptions¶
Misconception: RMSNorm and LayerNorm give the same quality, and the only difference is speed
RMSNorm drops mean subtraction (\(\mu\)), keeping only the learned scale (\(\gamma\)). For most LLMs this does not affect convergence, but in tasks with centered distributions (some vision tasks) the missing mean centering can hurt results. The choice depends on the domain: LLMs -- RMSNorm; vision -- BatchNorm/GroupNorm.
Misconception: Gradient checkpointing is a free memory reduction
Checkpointing cuts memory from O(n) to O(sqrt(n)) but slows training by ~20% because activations are recomputed during the backward pass. For shallow models (< 12 layers) the overhead may not justify the savings. Best practice: enable it only when the model does not fit on the GPU without it.
Misconception: Catastrophic forgetting is a problem only for small models
An experiment with Flan-T5-Base showed that fine-tuning on math without NLI mixing drops NLI accuracy from 81% to 16.5% (a 64.5pp drop). Even large LLMs lose capabilities under naive fine-tuning. Solution: 1:1 mixed training; even 6.2% interleaved old data fully prevents forgetting.
Interview Questions¶
Design a memory-efficient pipeline for training a 7B LLM on a single 24GB GPU.
"Just use LoRA" is an incomplete answer -- it does not cover the training stack.
Full stack: (1) BF16 Mixed Precision -- 2x memory reduction with the same dynamic range as FP32; (2) Gradient Checkpointing -- a further 50-70% reduction at a ~20% speed cost; (3) Flash Attention -- 20-40% memory saving plus a 20-50% speedup; (4) Gradient Accumulation -- larger effective batch size at no memory cost; (5) LoRA/QLoRA -- PEFT instead of full fine-tuning. Result: a 7B model fits on a 24GB GPU.
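A hedged configuration sketch of this stack with Hugging Face `transformers` + `peft`; the argument values are illustrative, not a tuned recipe.

```python
from transformers import TrainingArguments
from peft import LoraConfig

args = TrainingArguments(
    output_dir="out",
    bf16=True,                          # (1) BF16 mixed precision
    gradient_checkpointing=True,        # (2) trade ~20% speed for activation memory
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,     # (4) effective batch size of 16
)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # (5) LoRA adapters on attention projections
)
# (3) Flash Attention: pass attn_implementation="flash_attention_2" to from_pretrained
#     if the model and installed kernels support it.
```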
Why is Flow Matching displacing Diffusion Models? Give concrete numbers.
"Flow Matching is just faster" -- no details.
Diffusion: 50-1000 steps of iteratively adding and removing noise (DDPM). Flow Matching: learns a direct transport map from noise to data, enabling generation in 1-10 steps. Consistency Models (\(f_\theta(x, t) = f_\theta(f_\theta(x, t + \Delta t), t)\)) enable one-step generation. FACM (2025) combines Flow Matching + Consistency for a better quality/speed trade-off.
When should you use BF16 instead of FP16, and why is this critical?
"BF16 is just newer" -- shows no understanding of the format difference.
FP16: 5 exponent bits + 10 mantissa bits -- higher precision but a limited dynamic range (max ~65504). BF16: 8 exponent bits + 7 mantissa bits -- the same dynamic range as FP32 (~3.4 x 10^38) but lower precision. For LLM training BF16 is critical: gradients can span a huge range, and FP16 overflow/underflow requires loss scaling. BF16 works without loss scaling on A100/H100. On older GPUs (V100) only FP16 is available.
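A quick way to verify the range difference with `torch.finfo`:

```python
import torch

print(torch.finfo(torch.float16).max)    # 65504.0   -> narrow range, needs loss scaling
print(torch.finfo(torch.bfloat16).max)   # ~3.39e38  -> same order of magnitude as FP32
print(torch.finfo(torch.float32).max)    # ~3.40e38
```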