Распределённое обучение: FSDP2, DeepSpeed, Megatron-LM¶

~9 минут чтения

Предварительно: LoRA и варианты файнтюнинга, Техники файнтюнинга LLM

Обучение модели на 70B параметров требует ~1.1 TB VRAM -- ни один GPU не вмещает это в одиночку. FSDP2 и DeepSpeed ZeRO-3 шардируют параметры, градиенты и optimizer state между N GPU, снижая потребление до ~140 GB на 2x H100 (Nx reduction). Но за это платишь +50% communication overhead (3P vs 2P для DDP). TorchTitan демонстрирует 99% scaling efficiency на кластере из 2000 H200 -- распределённое обучение масштабируется почти линейно при правильной настройке.

FSDP2 (DTensor, per-parameter sharding), DeepSpeed ZeRO (stages 1-3, Infinity), Megatron-LM (tensor/pipeline parallelism), DDP, communication overhead, HSDP2, TorchTitan, ms-swift, memory reduction, GPU pricing, code examples (2024-2026)

Ключевые концепции¶

Distributed training -- шардирование модели, градиентов и состояния оптимизатора между GPU.

Memory Requirements¶

Model Size	Full Fine-Tuning Memory	Min GPUs (FSDP/ZeRO-3)	Recommended
7B	~112 GB	1x RTX 4090 (24GB)	2x A100 40GB
13B	~208 GB	2x A100 40GB	4x A100 40GB
70B	~1.1 TB	2x H100 80GB	8x H100 80GB
175B+	~2.8 TB	8x H100 + ZeRO-Infinity	Multi-node

Rule of thumb: 16 GB VRAM per billion parameters (full fine-tuning).

1. Стратегии распределённого обучения¶

DDP (Distributed Data Parallel)¶

Каждый GPU хранит полную копию модели. Синхронизация через all-reduce градиентов.

\[\text{DDP communication}: 2P \cdot \frac{N-1}{N} \approx 2P \quad \text{(P = model params)}\]

FSDP / FSDP2 (Fully Sharded Data Parallel)¶

Шардирование параметров, градиентов и optimizer state между GPU.

Strategy	What's Sharded	Memory Savings
`FULL_SHARD`	Parameters + Gradients + Optimizer	Maximum
`SHARD_GRAD_OP`	Gradients + Optimizer	Moderate
`NO_SHARD`	None (DDP equivalent)	None

\[\text{FSDP/ZeRO-3 communication}: 3P \quad \text{(all-gather + reduce-scatter + all-gather)}\]

DeepSpeed ZeRO¶

Stage	What's Sharded	Memory Reduction
ZeRO-1	Optimizer states	4x
ZeRO-2	Optimizer + Gradients	8x
ZeRO-3	Optimizer + Gradients + Parameters	Nx (N = GPU count)

ZeRO-Infinity: offload на CPU/NVMe для моделей, которые не помещаются в GPU.

Megatron-LM¶

Комбинирует model parallelism + data parallelism: - Tensor parallelism -- разбивает слои между GPU - Pipeline parallelism -- разбивает стадии модели - Sequence parallelism -- для длинных контекстов - Используется NVIDIA для training крупнейших моделей

2. FSDP2 (PyTorch 2.5+)¶

DTensor Architecture¶

FSDP1 использовал FlatParameter (один 1D тензор на модуль). FSDP2 перешёл на DTensor -- per-parameter sharding.

Aspect	FSDP1	FSDP2
Parameter representation	FlatParameter (1D tensor)	DTensor (per-parameter)
Sharding granularity	Module-level flattening	Per-parameter sharding
Metadata storage	Hacks required	Native per-parameter
LoRA support	Complex	Works out-of-box
FP8 mixing	Difficult	Native support
Checkpointing	Communication required	Communication-free (SHARDED_STATE_DICT)

API Changes FSDP1 -> FSDP2¶

FSDP1	FSDP2	Notes
`--fsdp_sharding_strategy`	`--fsdp_reshard_after_forward`	true/false instead of enum
`--fsdp_backward_prefetch`	REMOVED	BACKWARD_PRE default
`--fsdp_sync_module_states`	REMOVED	Redundant with DTensor
`--fsdp_state_dict_type`	`--fsdp_state_dict_type`	SHARDED_STATE_DICT default

Production Benefits¶

Simple partial parameter freezing (LoRA works natively)
FP8 + other precision mixing in same model
Asynchronous checkpointing (CPU copy -> training -> disk save)
Deterministic memory usage (no recordStream)
Better torch.compile integration

3. FSDP vs DeepSpeed Comparison¶

Feature	FSDP/FSDP2	DeepSpeed
Native PyTorch	Yes	External library
ZeRO-3 equivalent	FULL_SHARD	Stage 3
Activation checkpointing	Yes	Yes
Mixed precision	Native	Native
CPU offloading	Yes	ZeRO-Infinity
Configuration	Python API	JSON/YAML config
Learning curve	Lower	Higher
Ecosystem	PyTorch native	Microsoft + Hugging Face

Memory Efficiency¶

Framework	7B Model	70B Model
Standard DDP	~112 GB (OOM)	~1.1 TB (OOM)
FSDP FULL_SHARD	~14 GB per GPU	~140 GB (2x H100)
DeepSpeed ZeRO-3	~14 GB per GPU	~140 GB (2x H100)
ZeRO-Infinity	~8 GB + CPU	~8 GB + CPU/NVMe

When to Choose¶

Scenario	Recommended
New project, PyTorch-native	FSDP2
Extreme scale (100B+)	DeepSpeed ZeRO-Infinity
Hugging Face ecosystem	Both work
Quick prototyping	FSDP2 (simpler API)
Production cluster	DeepSpeed (battle-tested)

ZeRO-3/FSDP FULL_SHARD: +50% communication overhead vs DDP

FSDP/ZeRO-3 требует 3P communication per step (all-gather + reduce-scatter + all-gather) vs 2P для DDP. Если модель помещается на одном GPU -- DDP быстрее. FSDP/ZeRO-3 нужен только когда модель НЕ помещается. Для 7B на 8x A100: DDP (если хватает памяти с gradient checkpointing) быстрее FSDP на 20-30%.

FSDP1 deprecated в PyTorch 2.5+ -- не начинай новый проект на нём

FSDP1 использует FlatParameter (один 1D тензор на модуль) -- ломает LoRA adapters, не поддерживает FP8 mixing, требует communication для checkpointing. FSDP2 с DTensor решает все эти проблемы. Если видишь from torch.distributed.fsdp import FullyShardedDataParallel без DTensor -- это FSDP1. Миграция: --fsdp_sharding_strategy -> --fsdp_reshard_after_forward.

4. Code Examples¶

FSDP Setup¶

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy

torch.distributed.init_process_group("nccl")

model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,
    device_id=torch.cuda.current_device(),
)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
for batch in dataloader:
    loss = model(batch)
    loss.backward()
    optimizer.step()

DeepSpeed Setup¶

import deepspeed

ds_config = {
    "train_batch_size": 32,
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "cpu"}
    },
    "fp16": {"enabled": True}
}

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, optimizer=optimizer, config=ds_config
)

for batch in dataloader:
    outputs = model_engine(batch)
    loss = criterion(outputs, labels)
    model_engine.backward(loss)
    model_engine.step()

5. Framework Ecosystem (2026)¶

Framework	Best For	Status 2026
FSDP2	General training, LoRA	Default for PyTorch 2.5+
DeepSpeed ZeRO-3	Extreme memory efficiency	Still relevant
HSDP2	Multi-node training	FSDP2 + hybrid sharding
Megatron-LM	Massive scale (NVIDIA)	Production-proven
DDP	Small models (<1B)	Fastest for small scale
FSDP1	Legacy code	Deprecated

TorchTitan (PyTorch reference impl): FSDP2 + HSDP2 + Context Parallel. Float8 rowwise training, 99% scaling efficiency on 2K H200s.

ms-swift (Alibaba): Unified interface for DDP, DeepSpeed ZeRO-⅔, FSDP/FSDP2, Megatron.

Dion (Microsoft, 2025): Distributed orthonormal updates, simple pip install.

6. Hardware (2026)¶

GPU	VRAM	Price/Hour
RTX 4000 Ada	20 GB	$0.76
RTX 6000 Ada/L40S	48 GB	$1.57
AMD MI300X	192 GB	$1.99
NVIDIA H100	80 GB	$3.39

Для интервью¶

Q: "Explain FSDP2 vs DeepSpeed ZeRO-3."¶

FSDP2 (PyTorch 2.5+): uses DTensor for per-parameter sharding (vs FSDP1 FlatParameter). Native LoRA support, FP8 mixing, communication-free checkpointing. DeepSpeed ZeRO-3: stages ½/3 progressively shard optimizer/gradients/parameters (4x/8x/Nx memory reduction). ZeRO-Infinity extends to CPU/NVMe. FSDP2 simpler (Python API), DeepSpeed more configurable (JSON). Memory efficiency similar (~14 GB/GPU for 7B). FSDP2 better torch.compile integration, DeepSpeed better for extreme scale (100B+). Communication: DDP = 2P (all-reduce), FSDP/ZeRO-3 = 3P (all-gather + reduce-scatter + all-gather).

Q: "Why did FSDP2 move from FlatParameter to DTensor?"¶

DTensor enables per-parameter sharding with native metadata (dtype, requires_grad). FlatParameter flattens all module parameters into one 1D tensor -- breaks LoRA adapters, makes FP8 mixing difficult, requires communication for checkpointing. DTensor keeps each parameter separate, enabling native LoRA, precision mixing, and communication-free SHARDED_STATE_DICT checkpointing.

Q: "Design a training pipeline for a 70B model."¶

70B = ~1.1 TB full fine-tune. Options: (1) FSDP2 FULL_SHARD on 8x H100 (80GB each), ~140 GB sharded, activation checkpointing, mixed precision. (2) DeepSpeed ZeRO-3 + gradient checkpointing on same cluster. (3) If GPU-constrained: ZeRO-Infinity with CPU/NVMe offload. (4) Megatron-LM for tensor+pipeline parallelism if multi-node. PEFT alternative: LoRA reduces memory 10-20x, single H100 sufficient for 70B.

Ключевые числа¶

Факт	Значение
Full fine-tuning memory	16 GB per billion params
ZeRO-1 memory reduction	4x
ZeRO-2 memory reduction	8x
ZeRO-3 memory reduction	Nx (N = GPU count)
PEFT memory reduction	10-20x
DDP communication	~2P per step
FSDP/ZeRO-3 communication	~3P per step
7B FSDP throughput (8x A100)	~1M tokens/sec
70B ZeRO-3 throughput (64x H100)	~500K tokens/sec
175B Megatron+ZeRO (384x A100)	~200K tokens/sec
TorchTitan scaling efficiency	99% on 2K H200s

Формулы¶

\[\text{DDP}: 2P \cdot \frac{N-1}{N} \approx 2P \quad \text{(all-reduce gradients)}\]

\[\text{FSDP/ZeRO-3}: 3P \quad \text{(all-gather params + reduce-scatter grads + all-gather params)}\]

\[\text{Memory}_{ZeRO-3} = \frac{\text{Memory}_{DDP}}{N} \quad \text{(N = GPU count)}\]

Заблуждение: FSDP/ZeRO-3 всегда быстрее DDP для multi-GPU

FSDP/ZeRO-3 имеет +50% communication overhead (3P vs 2P). Если модель помещается на одном GPU с gradient checkpointing -- DDP на 20-30% быстрее. FSDP/ZeRO-3 нужен ТОЛЬКО когда модель не помещается в VRAM. Для 7B на 8x A100 40GB: сначала попробуй DDP + gradient checkpointing.

Заблуждение: ZeRO-3 и FSDP FULL_SHARD дают одинаковую производительность

При одинаковом sharding memory efficiency похожа (~14 GB/GPU для 7B). Но FSDP2 имеет лучшую интеграцию с torch.compile и нативную поддержку LoRA/FP8, а DeepSpeed лучше для extreme scale (100B+) благодаря ZeRO-Infinity (CPU/NVMe offload). Выбор зависит от масштаба и экосистемы, не от абстрактного "что быстрее".

Заблуждение: activation checkpointing бесплатно снижает память

Activation checkpointing снижает memory в 5-10x, но пересчитывает forward pass для каждого checkpointed слоя во время backward -- увеличивает время обучения на 20-30%. Это trade-off memory vs compute. Всегда профилируй: если GPU utilization <70%, checkpointing оправдан; если >90%, лучше добавить GPU.

Interview Questions¶

Q: Объясните разницу FSDP2 vs DeepSpeed ZeRO-3. Когда что выбрать?

Red flag: "Это одно и то же, просто разные фреймворки"

Strong answer: "FSDP2 (PyTorch 2.5+) использует DTensor для per-parameter sharding -- нативная поддержка LoRA, FP8 mixing, communication-free checkpointing. DeepSpeed ZeRO-3 шардирует прогрессивно (stage ½/3 -- 4x/8x/Nx reduction). Memory efficiency схожа (~14 GB/GPU для 7B). FSDP2: проще API, лучше torch.compile, для новых проектов. DeepSpeed: extreme scale 100B+ (ZeRO-Infinity с CPU/NVMe offload), battle-tested на production кластерах. Communication: DDP = 2P, оба = 3P."

Q: Спроектируйте training pipeline для 70B модели.

Red flag: "Возьмите один H100 и всё поместится"

Strong answer: "70B = ~1.1 TB full FT memory. Варианты: (1) FSDP2 FULL_SHARD на 8x H100 80GB: ~140 GB sharded, activation checkpointing, BF16 mixed precision. (2) DeepSpeed ZeRO-3 + gradient checkpointing на таком же кластере. (3) При ограниченных GPU: ZeRO-Infinity с CPU/NVMe offload. (4) Megatron-LM для tensor+pipeline parallelism на multi-node. PEFT-альтернатива: LoRA снижает memory 10-20x, single H100 достаточно для 70B. Бюджет: 8x H100 = ~$27/час, обучение 70B SFT = 1-3 дня = $650-2000."

Q: Почему FSDP2 перешёл с FlatParameter на DTensor?

Red flag: "Для оптимизации скорости"

Strong answer: "FlatParameter сплющивает все параметры модуля в один 1D тензор -- ломает LoRA adapters (нельзя отличить frozen от trainable), не поддерживает FP8 mixing (разные слои нужны в разных precision), требует communication для checkpointing. DTensor хранит каждый параметр отдельно с нативным metadata (dtype, requires_grad), что даёт: нативную поддержку LoRA через per-parameter freezing, смешивание FP8/BF16 в одной модели, communication-free SHARDED_STATE_DICT checkpointing."

Источники¶

CodeGenes -- "PyTorch FSDP vs DeepSpeed: A Comprehensive Comparison" (Nov 2025)
HuggingFace -- FSDP1 vs FSDP2 Documentation
PyTorch TorchTitan -- FSDP2 Reference Implementation
DeepSpeed Documentation -- ZeRO Optimization
Microsoft Dion Framework -- Distributed Orthonormal Updates
PyTorch Blog -- Float8 Training on Crusoe 2K H200s
ms-swift (Alibaba) -- Distributed Training Support
DeepCompile (arXiv:2504.09983) -- Compiler-Driven Distributed Training