
Deep Learning Progress 2025-2026

~7 minute read

Prerequisites: Normalization: Deep Dive | Flash Attention 3

Four directions define deep learning in 2025-2026: (1) SSL -- SelfMatch reaches 93.19% on CIFAR-10 with only 40 labels (vs 86.19% for FixMatch), making nearly label-free training practical; (2) Flow Matching replaces 1000-step diffusion with 1-10-step generation at comparable quality; (3) RMSNorm has become the standard for LLMs (LLaMA, Qwen, Mistral), replacing LayerNorm with a ~10% speedup; (4) the mixed precision + gradient checkpointing + Flash Attention stack makes it possible to train a 7B model on a single 24GB GPU.


1. Self-Supervised Learning Advances

1.1 Основные направления SSL 2025

| Method | Key Idea | Best For |
|--------|----------|----------|
| MAE (Masked Autoencoder) | Mask patches, reconstruct | Vision transformers |
| DINO/DINOv2 | Self-distillation, no labels | Dense prediction tasks |
| SimCLR | Contrastive learning | General vision |
| data2vec | Self-distillation with masking | Multi-modal |

1.2 DINO-Tracker (2024-2025)

Innovation: Test-time training on single video + DINO-ViT features

DINO-Tracker Pipeline:
1. Load pre-trained DINO-ViT
2. Fine-tune on single video (self-supervised)
3. Track points with refined features

Results: SOTA on tracking benchmarks

Key Insight: DINO's semantic features + motion adaptation = robust tracking

1.3 Hi-End-MAE (2025)

Hierarchical Encoder-driven MAE for Medical Imaging:

graph LR
    subgraph "Traditional MAE"
        A1["Encoder"] --> A2["Decoder<br/>(final layer only)"]
    end
    subgraph "Hi-End-MAE"
        B1["Encoder"] --> B2["Hierarchical Decoder<br/>(all layers)"]
        B2 --> B3["Fine-grained<br/>semantics"]
    end
    style A1 fill:#e8eaf6,stroke:#3f51b5
    style A2 fill:#fff3e0,stroke:#ef6c00
    style B1 fill:#e8eaf6,stroke:#3f51b5
    style B2 fill:#e8f5e9,stroke:#4caf50
    style B3 fill:#e8f5e9,stroke:#4caf50

Results:
- Pre-trained on 10K CT scans
- Superior transfer learning on 7 segmentation benchmarks

1.4 SSL + Few-Shot Learning (SelfMatch)

Two-Stage Training:
1. Stage 1: Contrastive SSL pre-training
2. Stage 2: Augmentation consistency fine-tuning (a minimal sketch of this loss follows the table below)

| Dataset | Labels | SelfMatch | FixMatch |
|---------|--------|-----------|----------|
| CIFAR-10 | 40 | 93.19% | 86.19% |
| SVHN | 40 | 96.05% | 92.25% |
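
A minimal sketch of the Stage-2 idea, assuming a FixMatch-style confidence-thresholded pseudo-labeling loss (the `model`, `weak_aug`, and `strong_aug` callables are hypothetical):

```python
import torch
import torch.nn.functional as F

def augmentation_consistency_loss(model, x_unlabeled, weak_aug, strong_aug, threshold=0.95):
    """Pseudo-label a weakly augmented view, enforce it on a strongly augmented view."""
    with torch.no_grad():
        probs = F.softmax(model(weak_aug(x_unlabeled)), dim=-1)
        conf, pseudo = probs.max(dim=-1)            # per-sample confidence and pseudo-label
        mask = (conf >= threshold).float()          # keep only confident pseudo-labels
    logits_strong = model(strong_aug(x_unlabeled))
    loss = F.cross_entropy(logits_strong, pseudo, reduction="none")
    return (loss * mask).mean()
```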

2. Diffusion Model Improvements

2.1 Flow Matching vs Diffusion

| Aspect | Diffusion Models | Flow Matching |
|--------|------------------|---------------|
| Process | Add noise → denoise | Learn transport map |
| Steps | 50-1000 iterations | 1-10 steps possible |
| Quality | Excellent | Comparable or better |
| Speed | Slow | Faster |
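
A minimal sketch of a Flow Matching training objective, assuming the straight-path (rectified-flow) formulation; `v_model` is a hypothetical network that predicts a velocity field from `(x_t, t)`:

```python
import torch

def flow_matching_loss(v_model, x1):
    """Regress the model onto the constant velocity (x1 - x0) of straight noise→data paths."""
    x0 = torch.randn_like(x1)                                  # source distribution: Gaussian noise
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)), device=x1.device)
    x_t = (1 - t) * x0 + t * x1                                # point on the straight path at time t
    target_v = x1 - x0                                         # velocity of that path
    pred_v = v_model(x_t, t.flatten())
    return torch.mean((pred_v - target_v) ** 2)
```

Sampling then integrates dx/dt = v_model(x, t) from noise at t=0 to data at t=1, which is why a handful of Euler steps can suffice.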

2.2 Consistency Models (2024-2025)

Goal: One-step or few-step generation

Traditional Diffusion:  x_T → x_{T-1} → ... → x_0  (1000 steps)
Consistency Model:      x_T → x_0                   (1 step)

Consistency Training Constraint: \(f_\theta(x_{t+\Delta t}, t + \Delta t) = f_\theta(x_t, t)\)

Where \(f_\theta\) maps a point at any timestep back to the trajectory origin \(x_0\).
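
A rough sketch of the consistency-training idea under simplifying assumptions (shared noise at adjacent noise levels, an EMA teacher held fixed, x0-prediction parameterization); `f_student` and `f_teacher` are hypothetical networks:

```python
import torch

def consistency_training_loss(f_student, f_teacher, x0, sigma, d_sigma):
    """Adjacent points on the same noising trajectory should map to the same origin."""
    noise = torch.randn_like(x0)
    x_t  = x0 + sigma * noise                       # point at noise level sigma
    x_tp = x0 + (sigma + d_sigma) * noise           # neighboring point at sigma + d_sigma
    with torch.no_grad():
        target = f_teacher(x_t, sigma)              # stop-gradient target from the teacher
    pred = f_student(x_tp, sigma + d_sigma)
    return torch.mean((pred - target) ** 2)
```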

2.3 Flow-Anchored Consistency Models (FACM)

July 2025:

FACM Training:
1. Train Flow Matching model
2. Use as anchor for Consistency training
3. Combine benefits of both approaches

2.4 Key Papers 2025

| Paper | Venue | Innovation |
|-------|-------|------------|
| Inverse Flow & Consistency | ICML 2025 | Unified framework |
| FACM | arXiv 2025 | Flow anchoring |
| Align Your Flow | NVIDIA | Continuous-time distillation |
| Training FM via Diffusion | CVPR 2025 | Bridge pretrained models |

3. Normalization Techniques

3.1 Comparison Table

| Technique | Formula | Pros | Cons |
|-----------|---------|------|------|
| BatchNorm | \(\frac{x - \mu_B}{\sigma_B}\) | Fast training | Batch-dependent |
| LayerNorm | \(\frac{x - \mu_L}{\sigma_L}\) | Batch-independent | Slower |
| RMSNorm | \(\frac{x}{\sqrt{\text{mean}(x^2)}}\) | Efficient | No mean centering |
| GroupNorm | Divide channels into groups | Small batch-friendly | More compute |

3.2 RMSNorm (2025 Standard for LLMs)

Formula: \(\text{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{n}\sum_{i=1}^n x_i^2 + \epsilon}} \cdot \gamma\)

vs LayerNorm: \(\text{LayerNorm}(x) = \frac{x - \mu}{\sigma} \cdot \gamma + \beta\)

Key Difference: RMSNorm removes mean subtraction
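
A minimal PyTorch sketch matching the formula above (the module and dimension names are illustrative):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Scale by the root mean square of the features: no mean subtraction, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(dim))   # learnable scale, like LayerNorm's gamma

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x / rms * self.gamma
```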

3.3 Why RMSNorm for LLMs?

| Factor | RMSNorm | LayerNorm |
|--------|---------|-----------|
| Compute | ~10% faster | Standard |
| Mean subtraction | No | Yes |
| Modern LLMs | LLaMA, Qwen, Mistral | BERT, GPT-2 |
| Convergence | Similar | Similar |

3.4 GroupNorm Use Cases

  • Small batch sizes (< 16)
  • Object detection (Faster R-CNN)
  • Video processing
  • When batch statistics are unreliable
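
A minimal usage sketch; the group count of 32 is a common choice, not a requirement:

```python
import torch
import torch.nn as nn

# GroupNorm normalizes over channel groups within each sample, so it does not
# rely on batch statistics and behaves identically at any batch size.
gn = nn.GroupNorm(num_groups=32, num_channels=256)
x = torch.randn(2, 256, 32, 32)   # tiny batch is fine: no running batch statistics
y = gn(x)                         # same shape, normalized per sample and per group
```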

4. Continual Learning & Catastrophic Forgetting

4.1 The Problem

Catastrophic Forgetting: Neural networks quickly overwrite previous knowledge when learning new tasks.

Task 1 (A) → Task 2 (B) → Task 3 (C)
Performance on A: 95% → 45% → 20%  ← Catastrophic!

4.2 Mitigation Strategies

| Strategy | How It Works | Example |
|----------|--------------|---------|
| Replay | Store/rehearse old data | GEM, A-GEM |
| Regularization | Penalize important weights | EWC, SI |
| Architecture | Separate network parts | PackNet, PNN |
| Mixed Training | Interleave old + new data | Math + NLI mixing |

4.3 Mixed Training Results (Dec 2025)

Experiment: Fine-tune Flan-T5-Base on math with/without NLI mixing

| Training | Math Acc | NLI Acc | Forgetting |
|----------|----------|---------|------------|
| Math-only | 12.0% | 16.5% | 64.5pp drop |
| Mixed 1:1 | 12.0% | 86.2% | Zero |
| Mixed 15:1 | 11.5% | 75.0% | Minimal |

Key Insight: Even 6.2% interleaved old data (15:1 mixing) keeps forgetting minimal!
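
A minimal sketch of 1:1 mixed training at the data-loading level, assuming two hypothetical map-style datasets `old_ds` (e.g. NLI) and `new_ds` (e.g. math):

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler

def mixed_loader(old_ds, new_ds, batch_size=32):
    """Draw old-task and new-task examples with equal probability, regardless of dataset sizes."""
    mixed = ConcatDataset([old_ds, new_ds])
    weights = torch.cat([
        torch.full((len(old_ds),), 0.5 / len(old_ds)),   # old data: 50% of the sampling mass
        torch.full((len(new_ds),), 0.5 / len(new_ds)),   # new data: 50% of the sampling mass
    ])
    sampler = WeightedRandomSampler(weights, num_samples=len(mixed), replacement=True)
    return DataLoader(mixed, batch_size=batch_size, sampler=sampler)
```

Changing the two mass terms (e.g. 1/16 vs 15/16) gives the 15:1 ratio from the table above.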

4.4 Does Adam Exacerbate Forgetting?

Research Finding (2021):
- Adam vs SGD comparison on forgetting
- Result: SGD sometimes experiences less forgetting than Adam
- Suggests optimizer choice matters for continual learning

4.5 Energy-Based Models for CL

EBM Advantage:
- Changes the training objective to reduce interference
- No external memory needed
- Outperforms baselines on several benchmarks


5. Efficient Training Techniques

5.1 Mixed Precision Training

Concept: Use FP16/BF16 for forward, FP32 for master weights

graph LR
    A["FP32 Master Weights"] --> B["FP16 Copy"]
    B --> C["Forward Pass"]
    C --> D["Loss (FP32)"]
    C --> E["Gradient (FP16)"]
    E --> F["FP32 Update"]
    F --> A
    style A fill:#e8eaf6,stroke:#3f51b5
    style B fill:#fff3e0,stroke:#ef6c00
    style C fill:#e8f5e9,stroke:#4caf50
    style D fill:#f3e5f5,stroke:#9c27b0
    style E fill:#fff3e0,stroke:#ef6c00
    style F fill:#e8eaf6,stroke:#3f51b5

Benefits:
- 2x memory reduction
- 1.5-3x speedup on modern GPUs
- Minimal accuracy loss

When to Use BF16 vs FP16:

| Format | Dynamic Range | Use Case |
|--------|---------------|----------|
| FP16 | Limited | Older GPUs |
| BF16 | Same as FP32 | Modern GPUs (A100, H100) |
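
A minimal sketch of one BF16 training step with `torch.autocast`; for FP16 on older GPUs you would additionally wrap the loss in `torch.cuda.amp.GradScaler`:

```python
import torch

def train_step(model, optimizer, loss_fn, x, y):
    """One mixed-precision step: BF16 forward/backward, FP32 master weights."""
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = loss_fn(model(x), y)    # activations and matmuls run in BF16
    loss.backward()                    # gradients land in the FP32 parameters
    optimizer.step()                   # FP32 master-weight update
    return loss.detach()
```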

5.2 Gradient Checkpointing

Concept: Don't store all activations, recompute during backward pass

Without Checkpointing:  Store all activations → Memory = O(n)
With Checkpointing:     Store checkpoints → Memory = O(√n)

Trade-off:
- Memory: ~50-70% reduction
- Speed: ~20% slower (recomputation)

When to Use:
- Model doesn't fit in GPU memory
- Training with larger batch sizes
- Fine-tuning large models
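
A minimal sketch using `torch.utils.checkpoint.checkpoint_sequential`, which stores only segment boundaries and recomputes the rest on the backward pass (model size and segment count are illustrative):

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

# 24 blocks split into 4 checkpointed segments → roughly O(sqrt(n)) activation memory.
blocks = [torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU()) for _ in range(24)]
model = torch.nn.Sequential(*blocks)

x = torch.randn(8, 1024, requires_grad=True)
y = checkpoint_sequential(model, 4, x)   # forward stores only segment-boundary activations
y.sum().backward()                       # inner activations are recomputed here
```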

5.3 Combined Techniques

Best Practice 2025:

1. Mixed Precision (BF16) → 2x memory reduction, faster compute
2. Gradient Checkpointing → Further memory reduction
3. Gradient Accumulation → Effective batch size increase
4. Flash Attention → Efficient attention computation

5.4 Memory-Efficient Training Stack

| Technique | Memory Saving | Speed Impact |
|-----------|---------------|--------------|
| Mixed Precision | 50% | +50-200% |
| Gradient Checkpointing | 50-70% | -20% |
| Flash Attention | 20-40% | +20-50% |
| Gradient Accumulation | Variable | None (more steps) |
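
Gradient accumulation is the one item in the table with no dedicated API; a minimal sketch of the usual loop:

```python
import torch

def train_epoch(model, optimizer, loader, loss_fn, accum_steps=8):
    """Simulate a batch that is accum_steps times larger than what fits in memory."""
    optimizer.zero_grad(set_to_none=True)
    for step, (x, y) in enumerate(loader):
        loss = loss_fn(model(x), y) / accum_steps   # scale so the accumulated sum matches a big-batch mean
        loss.backward()                             # gradients add up across micro-batches
        if (step + 1) % accum_steps == 0:
            optimizer.step()                        # one optimizer update per effective batch
            optimizer.zero_grad(set_to_none=True)
```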

6. Interview Questions

6.1 Concept Questions

Q: Compare BatchNorm, LayerNorm, RMSNorm.

A: BatchNorm: Normalize across batch dimension
     - Fast training, batch-dependent, issues with small batches

   LayerNorm: Normalize across feature dimension
     - Batch-independent, used in Transformers, slower

   RMSNorm: LayerNorm without mean subtraction
     - ~10% faster than LayerNorm, standard for LLMs

Q: What is catastrophic forgetting and how to mitigate it?

A: Catastrophic forgetting occurs when neural networks
   quickly lose previously learned knowledge when training
   on new tasks.

   Mitigation strategies:
   1. Replay: Store samples from old tasks
   2. Regularization: EWC (protect important weights)
   3. Architecture: Progressive networks
   4. Mixed training: Interleave old and new data

Q: Explain mixed precision training.

A: Mixed precision uses FP16/BF16 for computations while
   maintaining FP32 master weights for numerical stability.

   Benefits:
   - 2x memory reduction
   - 1.5-3x speedup on modern GPUs
   - Enables larger batch sizes

   Key: Loss scaling to prevent gradient underflow

6.2 Architecture Questions

Q: Design a memory-efficient training pipeline for 7B LLM.

A: Stack:
   1. BF16 Mixed Precision (2x memory)
   2. Gradient Checkpointing (50% more reduction)
   3. Flash Attention (efficient attention)
   4. Gradient Accumulation (effective batch size)
   5. LoRA/QLoRA (parameter-efficient fine-tuning)

   Expected: Fit 7B model on single 24GB GPU
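
For the LoRA/QLoRA item in that stack, a hedged configuration sketch assuming the Hugging Face `peft` library; the rank, alpha, and target module names are illustrative, not prescriptive:

```python
from peft import LoraConfig, get_peft_model

lora_cfg = LoraConfig(
    r=16,                                  # low-rank adapter dimension
    lora_alpha=32,                         # adapter scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt (model-dependent)
    task_type="CAUSAL_LM",
)
# model = get_peft_model(base_model, lora_cfg)  # only the small adapter matrices are trainable
```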

Q: Compare Flow Matching vs Diffusion models.

A: Diffusion:
   - Add noise iteratively → denoise
   - 50-1000 steps for generation
   - Well-established, high quality

   Flow Matching:
   - Learn direct transport map
   - 1-10 steps possible
   - Faster, comparable quality
   - Newer, less explored

6.3 Implementation Questions

Q: When would you use GroupNorm over BatchNorm?

A: Use GroupNorm when:
   - Batch size is small (< 16)
   - Training on variable-length sequences
   - Batch statistics are unreliable
   - Object detection (region-based)
   - Video processing

   Avoid GroupNorm when:
   - Large batch sizes available
   - Pure speed is priority

Q: How does gradient checkpointing work?

A: During forward pass:
   1. Only save "checkpoint" activations at certain layers
   2. Discard intermediate activations

   During backward pass:
   1. Recompute discarded activations from checkpoints
   2. Compute gradients normally

   Memory: O(√n) vs O(n) for full storage
   Trade-off: ~20% slower due to recomputation


7. Key Papers & Resources

| Paper/Resource | Year | Key Contribution |
|----------------|------|------------------|
| DINO-Tracker | 2024 | Test-time SSL for tracking |
| Hi-End-MAE | 2025 | Hierarchical decoder for MAE |
| FACM | 2025 | Flow-anchored consistency |
| RMSNorm paper | 2019 | Efficient normalization |
| EBM for CL | 2020 | Energy-based continual learning |
| Mixed Training for CL | 2025 | 1:1 mixing eliminates forgetting |

8. Formulas Quick Reference

RMSNorm

\[\text{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{n}\sum_{i=1}^n x_i^2 + \epsilon}} \cdot \gamma\]

LayerNorm

\[\text{LayerNorm}(x) = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \cdot \gamma + \beta\]

Consistency Model Constraint

\[f_\theta(x_t, t) = f_\theta(x_{t'}, t') \quad \forall t, t' \in [0, T]\]

Gradient Checkpointing Memory

\[\text{Memory} = O(\sqrt{n}) \text{ vs } O(n)\]

Common Misconceptions

Misconception: RMSNorm and LayerNorm give the same quality; the only difference is speed

RMSNorm removes mean subtraction (\(\mu\)), keeping only the scale (\(\gamma\)). For most LLMs this does not affect convergence, but in tasks with centered distributions (some vision tasks) the lack of mean centering can hurt results. The choice depends on the domain: RMSNorm for LLMs, BatchNorm/GroupNorm for vision.

Misconception: Gradient checkpointing is free memory reduction

Checkpointing reduces memory from O(n) to O(√n), but slows training by ~20% because activations are recomputed during the backward pass. For shallow models (< 12 layers) the overhead may not justify the savings. Best practice: enable it only when the model does not fit on the GPU without it.

Misconception: Catastrophic forgetting is a problem only for small models

An experiment with Flan-T5-Base showed that fine-tuning on math without NLI mixing drops NLI accuracy from 81% to 16.5% (a 64.5pp drop). Even large LLMs lose capabilities under naive fine-tuning. Solution: 1:1 mixed training completely prevents forgetting, and even 6.2% old data (15:1 mixing) keeps it minimal.


Interview Questions

Design a memory-efficient pipeline for training a 7B LLM on a single 24GB GPU.

❌ "Just use LoRA" -- an incomplete answer that does not cover the training stack.

✅ The full stack: (1) BF16 Mixed Precision -- 2x memory reduction with the same dynamic range as FP32; (2) Gradient Checkpointing -- another 50-70% reduction at a ~20% speed cost; (3) Flash Attention -- 20-40% memory saving plus 20-50% speedup; (4) Gradient Accumulation -- larger effective batch size with no memory cost; (5) LoRA/QLoRA -- PEFT instead of full fine-tuning. Result: a 7B model fits on a 24GB GPU.

Why is Flow Matching replacing Diffusion Models? Give concrete numbers.

❌ "Flow Matching is just faster" -- no details.

✅ Diffusion: 50-1000 steps of iteratively adding/removing noise (DDPM). Flow Matching: learns a direct transport map from noise to data, enabling generation in 1-10 steps. Consistency Models (\(f_\theta(x_{t+\Delta t}, t + \Delta t) = f_\theta(x_t, t)\)) enable one-step generation. FACM (2025) combines Flow Matching + Consistency for a better quality/speed trade-off.

When should you use BF16 instead of FP16, and why does it matter?

❌ "BF16 is just newer" -- shows no understanding of the format difference.

✅ FP16: 5 exponent bits + 10 mantissa bits -- higher precision but a limited dynamic range (max ~65504). BF16: 8 exponent bits + 7 mantissa bits -- the same dynamic range as FP32 (~3.4 × 10^38) with lower precision. For LLM training BF16 is critical: gradients can span a huge range, and FP16 overflow/underflow requires loss scaling. BF16 works without loss scaling on A100/H100. On older GPUs (V100) only FP16 is available.