Diffusion Models and Flow Matching¶

Prerequisites: Vision Transformers | Normalization

DDPM, Latent Diffusion (VAE compression), DiT (Diffusion Transformers, AdaLN-Zero), Flow Matching (velocity field ODE, rectified flow, reflow), MMDiT (SD3, FLUX), Consistency Models (1-step generation, LCM-LoRA), PyTorch implementation, FID benchmarks, sampling steps comparison (2020-2026)

In six years (2020-2026), diffusion models went from 1000 sampling steps (DDPM) to a single step (Consistency Models). The roughly 500x speedup is not free: FID on COCO rose from 7.9 to 9.5 (lower is better). The key shift is the move from stochastic SDEs to deterministic ODEs (Flow Matching): instead of predicting noise, the network learns a velocity field along a straight trajectory from noise to data. Stable Diffusion 3 and FLUX are the production embodiment of these ideas: MMDiT + Rectified Flow give few-step sampling (4-8 steps in distilled variants such as FLUX.1 [schnell]) with excellent typography.

Key Concepts¶

Diffusion models are generative models that learn to reverse a process that gradually adds noise to data.

Evolution¶

graph LR
    A["DDPM<br/>2020"] --> B["Latent Diffusion<br/>2022"]
    B --> C["DiT<br/>2023"]
    C --> D["SD3 / FLUX<br/>2024"]
    D --> E["Flow Matching<br/>2025-2026"]

    style A fill:#e8eaf6,stroke:#3f51b5
    style B fill:#e8eaf6,stroke:#3f51b5
    style C fill:#f3e5f5,stroke:#9c27b0
    style D fill:#fff3e0,stroke:#ef6c00
    style E fill:#e8f5e9,stroke:#4caf50

Year	Model	Innovation
2020	DDPM	Denoising Diffusion Probabilistic Models
2021	Guided Diffusion	Classifier guidance (classifier-free guidance followed in Ho & Salimans)
2022	Latent Diffusion (SD 1.x)	U-Net в latent space
2023	DiT	Diffusion Transformers вместо U-Net
2024	SD3, FLUX.1	MMDiT, Rectified Flow
2025	SD3.5, Flow Matching	Straight paths, faster sampling
2026	Consistency Models (introduced in 2023)	1-step generation

Landscape in 2026¶

Method	Sampling Steps	Quality	Training Stability
DDPM	1000	Excellent	High
DDIM	50-100	Good	--
Flow Matching	5-50	Excellent	Medium
Rectified Flow	1-5	Good	High
Consistency Models	1-2	Good	Medium

1. Diffusion Fundamentals¶

Forward Process (Adding Noise)¶

\[x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon\]

\(x_0\) -- original image
\(x_T \approx \mathcal{N}(0, I)\) -- pure noise
\(\bar{\alpha}_t = \prod_{s=1}^t (1-\beta_s)\) -- cumulative noise schedule
\(T = 1000\) steps (typically)

Closed form: \(q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} x_0, (1-\bar{\alpha}_t) I)\)
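
To make the schedule concrete, here is a minimal sketch that computes \(\bar{\alpha}_t\) and draws \(x_t\) from the closed form for a single scalar value. The linear \(\beta\) range (1e-4 to 0.02) is the setting reported by Ho et al. and is an assumption here, as are the helper names `make_alpha_bar` and `q_sample`.

```python
import math
import random


def make_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products alpha_bar_t = prod_{s<=t} (1 - beta_s) for a linear schedule."""
    alpha_bar, running = [], 1.0
    for s in range(T):
        beta_s = beta_start + (beta_end - beta_start) * s / (T - 1)
        running *= 1.0 - beta_s
        alpha_bar.append(running)
    return alpha_bar


def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ N(sqrt(alpha_bar_t) * x0, 1 - alpha_bar_t) for a scalar x0 (t is 1-indexed)."""
    a = alpha_bar[t - 1]
    eps = rng.gauss(0.0, 1.0)
    return math.sqrt(a) * x0 + math.sqrt(1.0 - a) * eps, eps


alpha_bar = make_alpha_bar()
rng = random.Random(0)
x_t, eps = q_sample(x0=1.0, t=500, alpha_bar=alpha_bar, rng=rng)
```

With this schedule almost no signal remains at \(t = T\) (the last entry of `alpha_bar` is roughly 4e-5), which is why \(x_T\) is treated as pure noise.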

Reverse Process (Denoising)¶

\[p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))\]

Training Objective¶

\[\mathcal{L} = \mathbb{E}_{t, x_0, \epsilon} \left[ \| \epsilon - \epsilon_\theta(x_t, t) \|^2 \right]\]

The network learns to predict the added noise \(\epsilon\).
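
A minimal sketch of one evaluation of this loss in PyTorch, assuming a toy MLP denoiser on flat vectors. `TinyDenoiser`, the `t / T` timestep feature, and the linear \(\beta\) schedule are illustrative assumptions, not part of the original text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # assumed linear schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)    # alpha_bar[t - 1] for t = 1..T


class TinyDenoiser(nn.Module):
    """Hypothetical epsilon-predictor: an MLP on [x_t, t / T]."""

    def __init__(self, dim=8, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, hidden), nn.SiLU(), nn.Linear(hidden, dim))

    def forward(self, x_t, t):
        t_feat = (t.float() / T).unsqueeze(-1)   # crude timestep conditioning
        return self.net(torch.cat([x_t, t_feat], dim=-1))


def ddpm_loss(model, x0):
    """Monte Carlo estimate of E || eps - eps_theta(x_t, t) ||^2 for one batch."""
    t = torch.randint(1, T + 1, (x0.shape[0],))  # uniform timestep per sample
    eps = torch.randn_like(x0)
    a = alpha_bar[t - 1].unsqueeze(-1)
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * eps  # closed-form forward process
    return F.mse_loss(model(x_t, t), eps)


torch.manual_seed(0)
model = TinyDenoiser()
loss = ddpm_loss(model, torch.randn(16, 8))
loss.backward()  # gradients flow to the denoiser parameters
```

Each call samples fresh timesteps and noise, so the loss is a stochastic estimate of the expectation rather than a fixed number.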

SDE Formulation¶

\[dx = f(x, t)dt + g(t)dw\]

\(f(x,t)\) -- drift term
\(g(t)\) -- diffusion coefficient
\(dw\) -- Wiener process
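
As an illustration, the sketch below simulates one common instance of this SDE with the Euler-Maruyama scheme. The variance-preserving choice \(f(x,t) = -\tfrac{1}{2}\beta x\), \(g(t) = \sqrt{\beta}\) with constant \(\beta\) is an assumption made for the example; the text above only gives the general form.

```python
import math
import random


def euler_maruyama(x0, beta=5.0, t_end=1.0, num_steps=200, rng=None):
    """Simulate dx = -0.5 * beta * x dt + sqrt(beta) dw from t=0 to t_end."""
    rng = rng or random.Random()
    dt = t_end / num_steps
    x = x0
    for _ in range(num_steps):
        drift = -0.5 * beta * x                      # f(x, t)
        dw = rng.gauss(0.0, math.sqrt(dt))           # Wiener increment ~ N(0, dt)
        x = x + drift * dt + math.sqrt(beta) * dw    # g(t) = sqrt(beta)
    return x


rng = random.Random(0)
samples = [euler_maruyama(2.0, rng=rng) for _ in range(2000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

Under these assumptions the sample mean decays toward \(x_0 e^{-\beta t / 2}\) and the variance approaches 1, the continuous-time counterpart of \(x_T \approx \mathcal{N}(0, I)\).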

2. Latent Diffusion Models (LDM)¶

Problem with DDPM: operating in pixel space is expensive (512x512x3 = 786K dims).

LDM's solution: a VAE compresses images into a latent space.

graph LR
    A["Image<br/>512x512x3"] --> B["VAE Encoder"]
    B --> C["Latent z<br/>64x64x4"]
    C --> D["Diffusion"]
    D --> E["VAE Decoder"]
    E --> F["Image"]

    style A fill:#e8eaf6,stroke:#3f51b5
    style B fill:#fff3e0,stroke:#ef6c00
    style C fill:#f3e5f5,stroke:#9c27b0
    style D fill:#fce4ec,stroke:#c62828
    style E fill:#fff3e0,stroke:#ef6c00
    style F fill:#e8f5e9,stroke:#4caf50

8x spatial downsampling per side (64x64x4 = 16K dims vs 786K, about 48x fewer values). Used in Stable Diffusion 1.x/2.x.
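
The "8x" figure is the downsampling factor per spatial side; the reduction in the number of values is larger. A quick check of the arithmetic (the helper name `num_values` is illustrative):

```python
def num_values(height, width, channels):
    """Number of scalar values in a (height, width, channels) tensor."""
    return height * width * channels


pixel_dims = num_values(512, 512, 3)      # image in pixel space
latent_dims = num_values(64, 64, 4)       # VAE latent
spatial_factor = 512 // 64                # downsampling per side
total_factor = pixel_dims / latent_dims   # reduction in number of values
print(pixel_dims, latent_dims, spatial_factor, total_factor)  # 786432 16384 8 48.0
```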

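\[\mathcal{L}_{LDM} = \mathcal{L}_{rec} + \mathcal{L}_{KL} + \mathcal{L}_{diffusion}\]

In practice the terms are optimized in two stages: \(\mathcal{L}_{rec} + \mathcal{L}_{KL}\) trains the VAE, which is then frozen while \(\mathcal{L}_{diffusion}\) trains the denoiser in latent space.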

3. Diffusion Transformers (DiT)¶

Paper: Peebles & Xie, 2023. Replaces the U-Net backbone with a Vision Transformer.

graph LR
    A["Latent z"] --> B["Patchify"]
    B --> C["Tokens"]
    C --> D["Transformer Blocks<br/>AdaLN-Zero"]
    D --> E["Unpatchify"]
    E --> F["Noise prediction"]

    style A fill:#e8eaf6,stroke:#3f51b5
    style B fill:#fff3e0,stroke:#ef6c00
    style C fill:#f3e5f5,stroke:#9c27b0
    style D fill:#f3e5f5,stroke:#9c27b0
    style E fill:#fff3e0,stroke:#ef6c00
    style F fill:#e8f5e9,stroke:#4caf50

AdaLN-Zero Block¶

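# c = timestep (+ class) embedding; all six modulation vectors are regressed from it
shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = Linear(SiLU(c)).chunk(6)
x = x + gate_msa * Attention(modulate(LN(x), shift_msa, scale_msa))
x = x + gate_mlp * MLP(modulate(LN(x), shift_mlp, scale_mlp))
# modulate(h, shift, scale) = h * (1 + scale) + shift; the Linear is zero-initialized, so gates start at 0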
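
The pseudocode above can be turned into a small runnable module. This is a sketch under assumptions, not the reference DiT code: the layer sizes, the class name `AdaLNZeroBlock`, and the use of `nn.MultiheadAttention` are choices made for the example.

```python
import torch
import torch.nn as nn


def modulate(h, shift, scale):
    # Per-sample affine modulation, broadcast over the token dimension
    return h * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)


class AdaLNZeroBlock(nn.Module):
    """Transformer block conditioned on an embedding c via AdaLN-Zero."""

    def __init__(self, dim=64, num_heads=4, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(), nn.Linear(mlp_ratio * dim, dim)
        )
        # Regress 6 modulation vectors (shift/scale/gate for attention and MLP) from c
        self.ada_ln = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))
        # "Zero": with zero weights all gates start at 0, so the block is the identity
        nn.init.zeros_(self.ada_ln[1].weight)
        nn.init.zeros_(self.ada_ln[1].bias)

    def forward(self, x, c):
        shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = (
            self.ada_ln(c).chunk(6, dim=-1)
        )
        h = modulate(self.norm1(x), shift_msa, scale_msa)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + gate_msa.unsqueeze(1) * attn_out
        h = modulate(self.norm2(x), shift_mlp, scale_mlp)
        x = x + gate_mlp.unsqueeze(1) * self.mlp(h)
        return x


torch.manual_seed(0)
block = AdaLNZeroBlock()
x = torch.randn(2, 16, 64)   # (batch, tokens, dim)
c = torch.randn(2, 64)       # conditioning embedding, one vector per sample
out = block(x, c)
```

At initialization `out` equals `x`, because both gates are zero. Every block starts as an identity map and learns how much of each branch to let through.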

DiT vs U-Net¶

Aspect	U-Net	DiT
Scalability	Limited	Excellent
Inductive bias	Strong (conv)	Weak (generic)
Scaling laws	Unknown	Predictable
Used in	SD 1.x, SD 2.x	SD3, FLUX, Sora

4. Flow Matching¶

Paradigm shift: instead of stochastic diffusion, a continuous flow from noise to data.

Core Concept¶

\[\frac{dx_t}{dt} = v_\theta(x_t, t)\]

\(v_\theta\) is the velocity field the network learns. Instead of a score function and an implicit path, the model defines the ODE directly.

Optimal Transport Path¶

\[x_t = (1-t)x_0 + t \cdot x_1, \quad t \in [0,1]\]

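A straight line between noise \(x_0\) and data \(x_1\). Note that the time direction here is the reverse of the DDPM notation in section 1, where \(x_0\) denotes data.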

Flow Matching Loss¶

\[\mathcal{L}_{FM} = \mathbb{E}_{t, x_0, x_1} \left\| v_\theta(x_t, t) - (x_1 - x_0) \right\|^2\]
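
The regression target \(x_1 - x_0\) is simply the time derivative of the straight path. A small numeric sketch on scalars (the names `path` and `euler` are illustrative) confirms this and shows why a constant velocity can be integrated exactly in a single Euler step:

```python
def path(x0, x1, t):
    """Straight-line interpolation x_t = (1 - t) * x0 + t * x1."""
    return (1.0 - t) * x0 + t * x1


def euler(x, velocity, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with Euler steps."""
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = x + velocity(x, i * dt) * dt
    return x


x0, x1 = -1.5, 2.0                      # "noise" and "data" endpoints
h = 1e-6
finite_diff = (path(x0, x1, 0.3 + h) - path(x0, x1, 0.3)) / h
target = x1 - x0                        # flow matching regression target

# With the exact (constant) velocity, one Euler step already lands on x1
one_step = euler(x0, lambda x, t: target, num_steps=1)
```

A trained \(v_\theta\) only approximates this, and paths of different sample pairs cross, so the learned field is not constant in practice. That gap is what reflow tries to reduce.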

Flow Matching vs Diffusion¶

Aspect	Diffusion	Flow Matching
What is learned	Score \(\nabla \log p(x_t)\)	Velocity \(v(x_t, t)\)
Training target	Noise \(\epsilon\)	Velocity \(x_1 - x_0\)
Path	Curved (stochastic SDE)	Straight (deterministic ODE)
Sampling steps	50-1000	5-50
Quality	Excellent	Comparable

Rectified Flow¶

Key insight: straight paths -> faster sampling.

Reflow Iterations	Steps Needed	Quality
0 (1-Rectified Flow, standard FM)	50-100	Excellent
1 (2-Rectified Flow)	5-10	Excellent
2 (3-Rectified Flow)	1-2	Good
3+	1	Good

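Reflow: simulate the current flow's ODE from noise samples \(x_0\) to obtain generated endpoints \(\hat{x}_1\), then retrain on the coupled pairs \((x_0, \hat{x}_1)\) to get straighter paths. After K reflows (often combined with distillation), 1-step generation becomes feasible.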
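
A minimal sketch of how reflow training pairs can be produced, assuming the current model is available as a velocity function. The toy field \(v(x, t) = x\) stands in for a trained network and is purely illustrative: its trajectories are curved in time, while the reflow target for each pair is a constant.

```python
import math
import random


def simulate_flow(x0, velocity, num_steps=1000):
    """Euler-integrate dx/dt = velocity(x, t) from t=0 (noise) to t=1 (sample)."""
    dt = 1.0 / num_steps
    x = x0
    for i in range(num_steps):
        x = x + velocity(x, i * dt) * dt
    return x


def make_reflow_pairs(velocity, num_pairs, rng):
    """Couple each noise draw with the endpoint the current flow maps it to."""
    pairs = []
    for _ in range(num_pairs):
        x0 = rng.gauss(0.0, 1.0)
        pairs.append((x0, simulate_flow(x0, velocity)))
    return pairs


def toy_velocity(x, t):
    return x  # stand-in for a trained v_theta; exact solution is x0 * exp(t)


pairs = make_reflow_pairs(toy_velocity, num_pairs=4, rng=random.Random(0))
# The next training round regresses v_theta(x_t, t) onto x1 - x0 along
# x_t = (1 - t) * x0 + t * x1, using these coupled pairs instead of random ones.
reflow_targets = [x1 - x0 for x0, x1 in pairs]
```

Because each pair already lies on a trajectory of the current flow, the straight line between its endpoints is a better-matched target than a random noise/data pairing.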

5. MMDiT (Multimodal Diffusion Transformer)¶

Paper: Esser et al., 2024. Used in SD3, FLUX.1.

graph TD
    A["Text Encoder<br/>T5, CLIP"] --> B["Text Tokens"]
    C["Image Latent"] --> D["Patchify"] --> E["Image Tokens"]
    B --> F["Joint Transformer Blocks"]
    E --> F
    F --> G["Noise Prediction"]

    style A fill:#e8eaf6,stroke:#3f51b5
    style B fill:#f3e5f5,stroke:#9c27b0
    style C fill:#e8eaf6,stroke:#3f51b5
    style D fill:#fff3e0,stroke:#ef6c00
    style E fill:#f3e5f5,stroke:#9c27b0
    style F fill:#fce4ec,stroke:#c62828
    style G fill:#e8f5e9,stroke:#4caf50

Key Features¶

Dual-stream processing: text and image are processed separately, then combined
Separate weights: text and image have their own weight matrices
Joint attention: cross-modal interaction at every layer
Rectified Flow training

# MMDiT Block (simplified)
text_hidden = text_linear(x_text)
image_hidden = image_linear(x_image)
combined = concat([text_hidden, image_hidden], dim=seq)
attn_out = Attention(combined)
text_out, image_out = split(attn_out)
text_out = text_mlp(text_out)
image_out = image_mlp(image_out)
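
The joint-attention step of the simplified block can be sketched as a runnable PyTorch module. This is an illustration under assumptions (single head, no normalization, modulation, or residuals, hypothetical class name `JointAttention`), not the SD3 implementation.

```python
import torch
import torch.nn as nn


class JointAttention(nn.Module):
    """Single-head joint attention with separate projections per modality."""

    def __init__(self, dim=32):
        super().__init__()
        self.text_qkv = nn.Linear(dim, 3 * dim)    # text-stream weights
        self.image_qkv = nn.Linear(dim, 3 * dim)   # image-stream weights
        self.text_out = nn.Linear(dim, dim)
        self.image_out = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x_text, x_image):
        n_text = x_text.shape[1]
        # Project each modality with its own weights, then concatenate along the sequence
        qkv = torch.cat([self.text_qkv(x_text), self.image_qkv(x_image)], dim=1)
        q, k, v = qkv.chunk(3, dim=-1)
        # Every token attends to all text and image tokens
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        mixed = attn @ v
        # Split back into the two streams and apply per-modality output projections
        text_out = self.text_out(mixed[:, :n_text])
        image_out = self.image_out(mixed[:, n_text:])
        return text_out, image_out


torch.manual_seed(0)
layer = JointAttention()
x_text = torch.randn(2, 7, 32)     # (batch, text tokens, dim)
x_image = torch.randn(2, 16, 32)   # (batch, image tokens, dim)
text_out, image_out = layer(x_text, x_image)
```

The weights stay modality-specific, while the attention matrix spans both token sets. This is how text tokens can be updated by image content and vice versa at every layer.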

SD3 vs SDXL¶

Aspect	SDXL (U-Net)	SD3 (MMDiT)
Params	6.6B (base + refiner pipeline)	8B (largest variant)
Architecture	U-Net	Transformer
Training	DDPM	Rectified Flow
Steps	20-50	4-8
Typography	Poor	Excellent

6. FLUX.1 (Black Forest Labs, 2024)¶

Created by the original authors of Stable Diffusion.

Variant	Params	Speed	Quality
FLUX.1 [pro]	12B	Slow	Best
FLUX.1 [dev]	12B	Medium	Good
FLUX.1 [schnell]	12B	Fast (4 steps)	Good

Innovations: Guidance Distillation (CFG baked into the model), a Flow Matching backbone, and Rotary Position Embeddings (variable aspect ratios).

7. Consistency Models¶

Paper: Song et al., 2023. Goal: map \(x_t \to x_0\) in a single step.

\[f_\theta(x_t, t) = x_0 \quad \text{(self-consistency: } f_\theta(x_t, t) = f_\theta(x_{t'}, t') \text{)}\]

Consistency Training Loss¶

\[\mathcal{L}_{CT} = \mathbb{E} \left[ d(f_\theta(x_{t_n}, t_n), f_{\theta^-}(x_{t_{n-1}}, t_{n-1})) \right]\]

\(f_{\theta^-}\) is an EMA copy of the model; \(d\) is a distance (LPIPS, L2).
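
A hedged sketch of one consistency-training step with an L2 distance. Several things here are assumptions for illustration: the noising rule \(x_t = x + t z\), the toy network `TinyConsistencyNet`, the fixed pair of adjacent times, and the EMA decay. The boundary-condition parameterization used in real consistency models is omitted.

```python
import copy
import torch
import torch.nn as nn


class TinyConsistencyNet(nn.Module):
    """Hypothetical f_theta(x_t, t): an MLP on [x_t, t]."""

    def __init__(self, dim=8, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, hidden), nn.SiLU(), nn.Linear(hidden, dim))

    def forward(self, x_t, t):
        return self.net(torch.cat([x_t, t.unsqueeze(-1)], dim=-1))


def consistency_loss(model, ema_model, x, t_n, t_prev):
    """L2 version of d(f_theta(x_{t_n}, t_n), f_{theta^-}(x_{t_{n-1}}, t_{n-1}))."""
    z = torch.randn_like(x)                       # same noise for both points
    x_n = x + t_n.unsqueeze(-1) * z               # noisier point
    x_prev = x + t_prev.unsqueeze(-1) * z         # adjacent, less noisy point
    with torch.no_grad():                         # no gradient through the EMA target
        target = ema_model(x_prev, t_prev)
    return ((model(x_n, t_n) - target) ** 2).mean()


@torch.no_grad()
def ema_update(ema_model, model, decay=0.999):
    """theta^- <- decay * theta^- + (1 - decay) * theta."""
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1.0 - decay)


torch.manual_seed(0)
model = TinyConsistencyNet()
ema_model = copy.deepcopy(model)
x = torch.randn(16, 8)
t_n = torch.full((16,), 0.8)
t_prev = torch.full((16,), 0.7)
loss = consistency_loss(model, ema_model, x, t_n, t_prev)
loss.backward()
ema_update(ema_model, model)
```

In a real training loop an optimizer step on `model` would sit between `loss.backward()` and `ema_update`, so the EMA copy trails the online weights.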

LCM (Latent Consistency Models)¶

Applies consistency distillation in latent space. 1-4 step generation, compatible with SD.

LCM-LoRA¶

A LoRA adapter that turns a compatible SD model into a fast sampler (2-4 steps instead of 20-50).

8. PyTorch Implementation¶

Flow Matching Training¶

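import torch
import torch.nn as nn
import torch.nn.functional as F


class FlowMatchingModel(nn.Module):
    def __init__(self, unet):
        super().__init__()
        self.unet = unet  # U-Net or DiT backbone, called as unet(x_t, t)

    def forward(self, x_0, t):
        # Convention in this code: x_0 is data (t=0) and x_1 is noise (t=1),
        # the reverse of the notation in the Flow Matching section (section 4).
        x_1 = torch.randn_like(x_0)
        t_b = t.view(-1, 1, 1, 1)  # broadcast over (C, H, W)
        # Linear interpolation between data and noise
        x_t = (1 - t_b) * x_0 + t_b * x_1
        # Target velocity is dx_t/dt = x_1 - x_0, so a sampler can integrate
        # from t=1 down to t=0 with a negative step
        v_target = x_1 - x_0
        v_pred = self.unet(x_t, t)
        return v_pred, v_target


def train_flow_matching(model, dataloader, optimizer, epochs):
    model.train()
    for epoch in range(epochs):
        for batch in dataloader:
            images = batch["images"]
            # One uniformly sampled time per image
            t = torch.rand(images.shape[0], device=images.device)
            v_pred, v_target = model(images, t)
            loss = F.mse_loss(v_pred, v_target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()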

Euler Sampler¶

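def euler_sample(model, shape, num_steps=50, device="cuda"):
    # Assumes the model predicts dx_t/dt for x_t = (1 - t) * x_0 + t * x_1
    # (x_0 = data, x_1 = noise) and integrates from t=1 down to t=0.
    x = torch.randn(shape, device=device)  # start from noise (t=1)
    dt = -1.0 / num_steps  # negative step: time runs backwards
    with torch.no_grad():
        for i in range(num_steps):
            t = torch.full((shape[0],), 1.0 - i / num_steps, device=device)
            v = model.unet(x, t)
            x = x + v * dt
    return x  # x at t=0 = generated sample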

9. Practical Applications¶

Model Selection¶

Task	Recommendation
High-quality images	SD3, FLUX.1
Fast iteration	LCM, SD Turbo
Video generation	Sora (DiT), AnimateDiff
3D generation	Point-E, Shap-E
Inpainting	SDXL Inpaint, FLUX Fill

Memory Optimization¶

VAE Slicing: Process VAE batch elements separately
Attention Slicing: Split attention computation
CPU Offload: Move weights to CPU when not needed
8-bit Quantization: BitsAndBytes for DiT layers
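
If the pipeline comes from Hugging Face `diffusers` (an assumption; the list above does not name a library), the first three options correspond to the pipeline methods `enable_vae_slicing()`, `enable_attention_slicing()`, and `enable_model_cpu_offload()`. Their availability varies by pipeline class and version, so the hedged sketch below calls only the switches a pipeline object actually exposes. `apply_memory_optimizations` is a hypothetical helper name, and `FakePipeline` exists only to exercise it without downloading a model.

```python
def apply_memory_optimizations(pipe, cpu_offload=False):
    """Call the memory-saving switches that `pipe` exposes; return the names applied."""
    method_names = ["enable_vae_slicing", "enable_attention_slicing"]
    if cpu_offload:
        method_names.append("enable_model_cpu_offload")
    applied = []
    for name in method_names:
        method = getattr(pipe, name, None)
        if callable(method):      # skip switches this pipeline does not provide
            method()
            applied.append(name)
    return applied


class FakePipeline:
    """Stand-in object for demonstration; a real pipeline would come from diffusers."""

    def __init__(self):
        self.calls = []

    def enable_vae_slicing(self):
        self.calls.append("vae_slicing")

    def enable_attention_slicing(self):
        self.calls.append("attention_slicing")


pipe = FakePipeline()
applied = apply_memory_optimizations(pipe, cpu_offload=True)
```

Each switch trades speed for memory, so it is usually worth enabling them one at a time and measuring peak memory and latency.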

For Interviews¶

Q: "Compare Flow Matching vs Diffusion."¶

Diffusion: stochastic SDE, learns the score \(\nabla \log p(x_t)\), curved paths, 50-1000 steps. Flow Matching: deterministic ODE, learns the velocity field \(v_\theta(x_t, t)\) directly, straight paths, 5-50 steps. Training: diffusion predicts the noise \(\epsilon\), flow matching predicts the velocity \(x_1 - x_0\). Quality is comparable, but FM needs 10-100x fewer steps. Rectified Flow: reflow straightens the paths further, 1-2 steps after 2 reflows. SD3/FLUX use Flow Matching.

Q: "What is DiT and why it replaced U-Net?"¶

DiT (Diffusion Transformers, Peebles & Xie, 2023): a Vision Transformer in place of the U-Net backbone. AdaLN-Zero conditioning: timestep embedding -> scale, shift, gate via adaptive layer norm. Advantages: better scalability, predictable scaling laws. U-Net: strong inductive bias (conv), limited scalability. DiT: generic, excellent scaling. SD3 uses MMDiT (Multimodal DiT): dual-stream for text + image, joint attention, separate weights per modality. FLUX.1: 12B params, MMDiT + Flow Matching, 4 steps for [schnell].

Q: "How do Consistency Models achieve 1-step generation?"¶

Consistency Models (Song et al., 2023): learn \(f_\theta(x_t, t) = x_0\) for any \(t\). Self-consistency: all points on a trajectory map to the same \(x_0\). Training: distillation from a pretrained diffusion model, or from scratch. Loss: \(d(f_\theta(x_{t_n}, t_n), f_{\theta^-}(x_{t_{n-1}}, t_{n-1}))\), where \(f_{\theta^-}\) is an EMA copy. LCM: consistency distillation in latent space, compatible with SD. LCM-LoRA: a LoRA adapter that turns a compatible SD model into a 2-4 step sampler.

Key Numbers¶

Fact	Value
DDPM sampling steps	1000
Flow Matching steps	5-50
Rectified Flow (2-reflow) steps	1-2
Consistency Models steps	1-2
LDM compression	8x per spatial side (64x64x4 vs 512x512x3, ~48x fewer values)
SD3 params	8B
SDXL params	6.6B
FLUX.1 params	12B
FLUX schnell steps	4
FID DDPM (COCO)	7.9 (1000 steps)
FID Flow Matching (COCO)	7.5 (50 steps)
FID Rectified Flow 2-reflow	8.0 (2 steps)
FID Consistency Model	9.5 (1 step)
SDXL training FLOPs	6.5 x 10^23
SD3 inference FLOPs	0.8 x 10^14

Formulas¶

\[x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t} \epsilon \quad \text{(Diffusion forward)}\]

\[\mathcal{L}_{diff} = \mathbb{E}_{t,x_0,\epsilon} \| \epsilon - \epsilon_\theta(x_t, t) \|^2 \quad \text{(Diffusion loss)}\]

\[\frac{dx_t}{dt} = v_\theta(x_t, t) \quad \text{(Flow Matching ODE)}\]

\[x_t = (1-t)x_0 + t x_1 \quad \text{(Rectified Flow path)}\]

\[\mathcal{L}_{FM} = \mathbb{E} \| v_\theta(x_t, t) - (x_1 - x_0) \|^2 \quad \text{(Flow Matching loss)}\]

\[f_\theta(x_t, t) = x_0 \quad \text{(Consistency Model)}\]

\[\mathcal{L}_{LDM} = \mathcal{L}_{rec} + \mathcal{L}_{KL} + \mathcal{L}_{diffusion} \quad \text{(Latent Diffusion)}\]

Misconception: Flow Matching and Diffusion are fundamentally different approaches

Flow Matching is a reformulation of diffusion within the continuous normalizing flow (CNF) framework. Mathematically, both are described by a probability path from noise to data. The difference is in the parameterization: diffusion learns the score function \(\nabla \log p(x_t)\) via an SDE, while flow matching learns the velocity field \(v_\theta(x_t, t)\) via an ODE. One model can be converted into the other through the probability flow ODE relation \(v = -\frac{1}{2}g^2 \nabla \log p + f\).
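
The conversion formula can be checked on a case where the answer is known. The sketch below is an illustration with assumed choices: a variance-preserving SDE with \(f = -\tfrac{1}{2}\beta x\), \(g^2 = \beta\), and data that is already \(\mathcal{N}(0, 1)\). In that case the marginal \(p_t\) stays \(\mathcal{N}(0, 1)\) for all \(t\), so the velocity of the equivalent ODE should vanish everywhere.

```python
def pf_ode_velocity(x, f, g_squared, score):
    """Probability flow ODE velocity v = f - 0.5 * g^2 * score."""
    return f(x) - 0.5 * g_squared * score(x)


beta = 3.0                                  # assumed constant noise rate


def drift(x):
    return -0.5 * beta * x                  # VP-SDE drift f(x, t)


def standard_normal_score(x):
    return -x                               # grad_x log N(x; 0, 1)


velocities = [pf_ode_velocity(x, drift, beta, standard_normal_score) for x in (-2.0, 0.5, 3.0)]
```

The drift pulls samples toward zero and the score term pushes them back by the same amount. The net deterministic velocity is zero, which is consistent with a distribution that does not change over time.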

Misconception: Consistency Models match diffusion quality in 1 step

FID for a Consistency Model: 9.5 (1 step) vs DDPM: 7.9 (1000 steps) vs Flow Matching: 7.5 (50 steps). 1-step generation gives up roughly 20% in FID relative to DDPM. LCM-LoRA (2-4 steps) is the better compromise: close to diffusion quality at a 10-25x speedup. In production, 4-8 steps with Rectified Flow are typical (SD3, FLUX schnell).
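
The "~20%" figure follows from the FID numbers quoted in this document (higher FID is worse). A short check, with `relative_increase` as an illustrative helper name:

```python
def relative_increase(value, baseline):
    """Relative increase of `value` over `baseline`, in percent."""
    return 100.0 * (value - baseline) / baseline


fid = {"DDPM (1000 steps)": 7.9, "Flow Matching (50 steps)": 7.5, "Consistency Model (1 step)": 9.5}
vs_ddpm = relative_increase(fid["Consistency Model (1 step)"], fid["DDPM (1000 steps)"])
vs_fm = relative_increase(fid["Consistency Model (1 step)"], fid["Flow Matching (50 steps)"])
print(f"{vs_ddpm:.1f}% vs DDPM, {vs_fm:.1f}% vs Flow Matching")  # 20.3% vs DDPM, 26.7% vs Flow Matching
```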

Misconception: DiT is always better than U-Net

DiT wins when scaling up (predictable scaling laws), but U-Net is better at small compute budgets thanks to the strong inductive bias of convolutions. SD 1.x/2.x (U-Net) are still used in edge deployment, while DiT (SD3, FLUX) requires significantly more GPU resources. On ImageNet 256x256, DiT-XL/2 and U-Net-ADM show comparable FID at different compute.

Interview Questions¶

Q: Compare Flow Matching and Diffusion. When is each approach better?

Red flag: "Flow Matching is faster because it uses fewer steps."

Strong answer: "Diffusion: a stochastic SDE, learns the score \(\nabla \log p(x_t)\), curved paths, 50-1000 steps, target is the noise \(\epsilon\). Flow Matching: a deterministic ODE, learns the velocity field \(v_\theta(x_t, t)\), straight paths via optimal transport, 5-50 steps, target is the velocity \(x_1 - x_0\). Quality is comparable (FID 7.5 vs 7.9), but FM needs 10-100x fewer steps. Rectified Flow makes the paths even straighter: after 2 reflows, 1-2 steps. SD3/FLUX use FM. Diffusion is better for diversity sampling (stochasticity), FM for deterministic high-quality generation."

Q: How does Latent Diffusion address computational cost?

Red flag: "It compresses the image so it runs faster."

Strong answer: "Pixel space: 512x512x3 = 786K dimensions, so the quadratic cost of attention is prohibitive. LDM: the VAE encoder compresses the image into a 64x64x4 latent space = 16K dims (8x downsampling per side, about 48x fewer values). Diffusion runs in latent space, then the VAE decoder reconstructs the image. Loss: \(L_{rec} + L_{KL} + L_{diffusion}\). The VAE is trained separately and frozen during diffusion training. Trade-off: lossy compression limits fine details, which is why SD3 uses an improved VAE with lower reconstruction loss."

Q: What is AdaLN-Zero in DiT and why does it matter?

Red flag: "It is Layer Norm with extra parameters."

Strong answer: "AdaLN-Zero is adaptive layer normalization with zero-initialized gates. The timestep embedding generates the scale, shift, and gate parameters: \(x = x + gate \cdot Attention(modulate(LN(x), scale, shift))\). Zero-init: the gates start at 0, making each block an identity function, which stabilizes training of deep networks. It is a conditioning mechanism: instead of cross-attention (expensive) or concatenation (weak), modulation through LN is cheap and effective. DiT-XL/2 with AdaLN-Zero was SOTA on ImageNet at publication, and the same style of modulation is used in SD3 and FLUX."

Sources¶

Ho et al. -- "Denoising Diffusion Probabilistic Models" (2020)
Rombach et al. -- "High-Resolution Image Synthesis with Latent Diffusion Models" (2022)
Peebles & Xie -- "Scalable Diffusion Models with Transformers" (DiT, 2023)
Esser et al. -- "Scaling Rectified Flow Transformers" (SD3, 2024)
Song et al. -- "Consistency Models" (2023)
Luo et al. -- "Latent Consistency Models" (2023)
Lipman et al. -- "Flow Matching for Generative Modeling" (2023)
Liu et al. -- "Flow Straight and Fast: Rectified Flow" (2023)
Simon Coste -- "Flow Models III: Flow Matching" (scoste.fr)
ICLR Blog 2025 -- "Flow With What You Know"
Black Forest Labs -- FLUX Technical Report
MIT -- Flow Matching & Diffusion Models Course (2026)