
DL Interview: Generative Models

~6 min read


Diffusion Models (DDPM, DDIM, CFG, Latent Diffusion, Consistency Models, U-Net, DiT), Variational Autoencoders (VAE, ELBO, beta-VAE, Posterior Collapse), GAN (WGAN, DCGAN, Mode Collapse).


Diffusion Models

Sources: MIT 6.S183 Diffusion Models (2026), Lilian Weng Diffusion, Sander Dieleman Guidance

Q: How do Diffusion Models work?

A:

Two processes:

1. Forward (diffusion): gradually add noise to the data
2. Reverse (denoising): train a model to remove the noise

Forward process: \[q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}\, x_{t-1}, \beta_t I)\]

Closed form: \[x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)\]

Reverse process (learned): \[p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))\]

Training objective: \[\mathcal{L} = \mathbb{E}_{t,x_0,\epsilon}\left[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\right]\]
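
A minimal training-step sketch that combines the closed-form forward process with this objective; `model` (the noise predictor) and `alpha_bar` (a precomputed tensor of cumulative products \(\bar{\alpha}_t\)) are assumed names:

import torch
import torch.nn.functional as F

def ddpm_loss(model, x0, alpha_bar):
    # Random timestep per example, plus fresh Gaussian noise
    t = torch.randint(0, len(alpha_bar), (x0.shape[0],), device=x0.device)
    eps = torch.randn_like(x0)
    # Closed-form forward: jump from x0 straight to x_t
    a = alpha_bar[t].view(-1, 1, 1, 1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps
    # The network learns to predict the injected noise
    return F.mse_loss(model(x_t, t), eps)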

Q: DDPM vs DDIM: what's the difference?

A:

| Aspect | DDPM | DDIM |
|--------|------|------|
| Sampling | Stochastic (Markov chain) | Deterministic (non-Markovian) |
| Steps | 1000+ | 10-50 (10-100x faster) |
| Quality | Higher | Slightly lower, but fast |
| Trade-off | Time vs. quality | Adjustable via \(\eta\) |

DDPM sampling:

# DDPM reverse step (beta_t, alpha_t, alpha_bar_t, sigma_t come from the noise schedule)
for t in reversed(range(T)):
    eps = model(x_t, t)  # predicted noise
    x_t = (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_t)
    if t > 0:
        x_t += sigma_t * randn_like(x_t)  # stochastic: fresh noise every step

DDIM sampling:

# DDIM reverse step (deterministic when eta = 0)
for t in reversed(range(T)):
    eps = model(x_t, t)
    pred_x0 = (x_t - sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_bar_t)
    x_t = sqrt(alpha_bar_prev) * pred_x0 + sqrt(1 - alpha_bar_prev) * eps
    # no noise added: the trajectory is deterministic when eta = 0

Eta parameter: \(\eta=0\) → deterministic, \(\eta=1\) → DDPM equivalent

Q: What is Classifier-Free Guidance?

A:

Problem: Conditional generation is hard to control: the model may follow the condition only weakly.

Solution: Train the model on both conditional and unconditional inputs, then combine the two predictions at inference.

Training: Randomly drop the condition (10% probability):

if random() < 0.1:
    condition = None  # Unconditional

Inference: \[\tilde{\epsilon}_\theta = \epsilon_\theta(x_t, t, \emptyset) + s \cdot (\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \emptyset))\]

Guidance scale \(s\) (see the sketch below):

- \(s = 1\): no guidance (plain conditional model output)
- \(s > 1\): more faithful to the condition, less diversity
- Typical: \(s \in [7, 15]\) for Stable Diffusion
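
A minimal sketch of the guided prediction above; `eps_model` and the learned null condition `null_cond` are assumed names:

# Classifier-free guidance: extrapolate from the unconditional
# prediction toward the conditional one with scale s
def cfg_eps(eps_model, x_t, t, cond, null_cond, s=7.5):
    eps_uncond = eps_model(x_t, t, null_cond)  # condition dropped
    eps_cond = eps_model(x_t, t, cond)
    return eps_uncond + s * (eps_cond - eps_uncond)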

Q: Latent Diffusion vs Pixel Diffusion

A:

Pixel diffusion:

- Works directly on pixels
- Expensive: 256x256 = 65K dimensions per channel
- High compute cost

Latent Diffusion (LDM):

- Compress the image to a latent space (VAE encoder)
- Diffuse in the latent space (e.g., 64x64 = 4K dimensions)
- Decode back to an image

# Latent Diffusion pipeline
x = vae.encode(image)           # [B, 3, 256, 256] -> [B, 4, 32, 32]
x_latent = diffusion_process(x)  # Diffusion in latent space
image = vae.decode(x_latent)     # Back to pixel space

Benefits:

- 16x+ faster training
- Lower memory
- Comparable quality

Q: What are Consistency Models?

A:

Goal: One-step generation from diffusion-trained model.

Approach: Train model to map any \(x_t\) directly to \(x_0\) (consistency function).

\[f_\theta(x_t, t) = x_0\]

Training methods:

1. Consistency Distillation: distill from a pretrained diffusion model
2. Consistency Training (CT): train from scratch

Result: 1-2 steps vs 1000 DDPM steps, similar quality.

# Inference with a consistency model: one network evaluation
x_T = torch.randn(batch_size, C, H, W)  # start from pure noise
x_0 = consistency_model(x_T, T)         # single step!
image = x_0  # done

Q: Diffusion Architectures (U-Net, DiT)

A:

U-Net (classic):

- Encoder-decoder with skip connections
- Self-attention at low resolutions
- Time embedding via AdaGN (adaptive group norm)

import math, torch, torch.nn as nn

class TimeEmbedding(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.dim = dim

    def forward(self, t):
        # Sinusoidal embedding of the timestep (Transformer-style)
        half = self.dim // 2
        emb = math.log(10000) / (half - 1)
        emb = torch.exp(torch.arange(half) * -emb)
        emb = t[:, None] * emb[None, :]
        return torch.cat([emb.sin(), emb.cos()], dim=-1)

DiT (Diffusion Transformer):

- Pure transformer, no convolutions
- Patchify the image → tokens (see the sketch below)
- AdaLN-Zero for time conditioning
- Scales better than U-Net
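
A rough sketch of the patchify step under assumed shapes (the `patchify` helper and patch size are illustrative, not the reference DiT code):

import torch

def patchify(x, patch=8):
    # Split the latent image into non-overlapping patches and
    # flatten each patch into one transformer token
    B, C, H, W = x.shape
    x = x.unfold(2, patch, patch).unfold(3, patch, patch)  # [B, C, H/p, W/p, p, p]
    x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch * patch)
    return x  # [B, num_tokens, C * patch * patch]

tokens = patchify(torch.randn(2, 4, 32, 32))  # -> [2, 16, 256]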

Comparison:

| Aspect | U-Net | DiT |
|--------|-------|-----|
| Inductive bias | Locality (conv) | Global (attention) |
| Scaling | Plateaus | Better with data |
| Speed | Faster (small) | Slower (large) |
| SOTA | SD 1.5/2.1 | SD 3, Sora |


Variational Autoencoders (VAE)

Sources: Credmark VAE Questions, Medium VAE Overview

Q: How does a VAE work?

A:

Architecture:

1. Encoder: \(x \rightarrow (\mu, \sigma)\): the parameters of the latent distribution
2. Sampling: \(z = \mu + \sigma \cdot \epsilon\), where \(\epsilon \sim \mathcal{N}(0, I)\)
3. Decoder: \(z \rightarrow \hat{x}\): the reconstruction

Key difference from an AE: a VAE learns a distribution over latents, not point estimates.

import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, input_dim, latent_dim):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim * 2)  # mu and log_var
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid()
        )

    def reparameterize(self, mu, log_var):
        std = torch.exp(0.5 * log_var)
        eps = torch.randn_like(std)
        return mu + eps * std

    def forward(self, x):
        h = self.encoder(x)
        mu, log_var = h.chunk(2, dim=-1)
        z = self.reparameterize(mu, log_var)
        return self.decoder(z), mu, log_var

Q: What is the ELBO and why do we optimize it?

A:

Evidence Lower Bound: \[\log p(x) \geq \mathbb{E}_{q(z|x)}[\log p(x|z)] - D_{KL}(q(z|x) \| p(z))\]

Components:

1. Reconstruction term: \(\mathbb{E}_{q(z|x)}[\log p(x|z)]\): the decoder reconstructs \(x\)
2. Regularization term: \(D_{KL}\): keeps the latent distribution close to the prior

Why a lower bound: the marginal likelihood \(\log p(x) = \log \int p(x|z)p(z)\,dz\) is intractable, so we maximize the ELBO instead.

Loss function:

import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, log_var, beta=1.0):
    recon = F.binary_cross_entropy(x_recon, x, reduction='sum')
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())  # closed-form KL to N(0, I)
    return recon + beta * kl

Q: The Reparameterization Trick: why is it needed?

A:

Problem: Can't backprop through random sampling \(z \sim q(z|x)\).

Solution: Factor out the randomness: \[z = \mu + \sigma \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)\]

Now gradients flow through \(\mu\) and \(\sigma\).

Without trick:

z = torch.normal(mu, sigma)  # No gradient!

With trick:

eps = torch.randn_like(mu)
z = mu + sigma * eps  # Gradient flows through mu and sigma

Q: Posterior Collapse: what is it and how do you fix it?

A:

Problem: The decoder ignores the latent \(z\), the KL term → 0, and the VAE degenerates into an unconditional decoder.

Symptoms:

- KL divergence very small (< 0.1)
- Good reconstruction, but a meaningless latent space
- \(\mu \approx 0, \sigma \approx 1\) for all inputs

Solutions (see the snippets below):

1. KL annealing: gradually increase \(\beta\) from 0 to 1
2. Free bits: enforce a minimum KL per dimension
3. Weaker decoder: a less expressive decoder is forced to rely on the latent
4. Lower \(\beta\): down-weight the KL term so encoding information into \(z\) is cheaper

# KL annealing: ramp beta from 0 to 1 over the first `warmup` epochs
def get_beta(epoch, warmup=10):
    return min(1.0, epoch / warmup)

# Free bits (coarse, total-KL variant): stop penalizing KL below a floor
kl_min = 0.5 * latent_dim
kl_loss = torch.clamp(kl, min=kl_min)

Q: \(\beta\)-VAE and Disentanglement

A:

\(\beta\)-VAE: Scale the KL term with \(\beta > 1\): \[\mathcal{L} = \mathbb{E}[\log p(x|z)] - \beta \cdot D_{KL}(q(z|x) \| p(z))\]

Effect of \(\beta\):

- \(\beta = 1\): standard VAE
- \(\beta > 1\): more disentangled, worse reconstruction
- \(\beta \approx 4\): a good disentanglement/quality trade-off

Disentanglement: Each latent dimension captures one factor of variation (e.g., rotation, color, size).

Evaluation (a traversal sketch follows):

- Traverse each \(z_i\) independently and observe what changes
- Metrics: BetaVAE score, FactorVAE score, DCI
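
A minimal traversal sketch, assuming a trained `vae` like the one above and illustrative values for `latent_dim` and the swept dimension:

import torch

latent_dim, i = 10, 0               # assumed latent size and dimension to sweep
z = torch.zeros(9, latent_dim)
z[:, i] = torch.linspace(-3, 3, 9)  # vary one coordinate, fix the rest
images = vae.decoder(z)             # decode and inspect what changes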


Generative Adversarial Networks (GAN)

Q: Autoencoder architecture and applications

A:

Architecture:

- Encoder: \(x \rightarrow z\) (compress to a latent space)
- Decoder: \(z \rightarrow \hat{x}\) (reconstruct)

Loss: \[L = \|x - \hat{x}\|^2 + \lambda \cdot \text{regularization}\]

Applications:

- Dimensionality reduction
- Denoising (train on noisy inputs)
- Anomaly detection (high reconstruction error = anomaly)
- Feature learning

Limitation: Not generative on its own: there is no prior over \(z\), so you can't sample new data.
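
For contrast with the VAE code above, a minimal AE sketch (layer sizes are illustrative):

import torch.nn as nn

class AE(nn.Module):
    def __init__(self, input_dim, latent_dim):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim))           # point estimate, no (mu, sigma)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid())

    def forward(self, x):
        return self.decoder(self.encoder(x))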

Q: Why does a VAE generate blurrier images than a GAN?

A:

VAE issue: It optimizes reconstruction + KL and tends to average over modes.

Mathematical reason:

- The VAE maximizes likelihood: \(p(x) = \int p(x|z)p(z)dz\)
- When uncertain, the model averages over possible outputs
- Result: blurry, averaged images

GAN advantage: Adversarial loss forces sharp, realistic outputs.

Modern solutions:

- VQ-VAE: discrete latent space
- VAE + GAN: combine reconstruction + adversarial losses
- NVAE: hierarchical VAE with better priors

Q: How does adversarial training work in GANs?

A:

Two players:

- Generator G: maps noise \(z\) to fake data \(G(z)\)
- Discriminator D: classifies real vs. fake

Objective: \[\min_G \max_D \; \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]\]

Training dynamics:

- D tries to correctly classify real/fake
- G tries to fool D
- Nash equilibrium: \(p_G = p_{data}\), \(D(x) = 0.5\)

# GAN training step
# Train D (detach G(z) so the generator gets no gradient here)
d_real = D(x_real)
d_fake = D(G(z).detach())
loss_d = -torch.log(d_real).mean() - torch.log(1 - d_fake).mean()

# Train G: non-saturating loss, maximize log D(G(z))
d_fake = D(G(z))
loss_g = -torch.log(d_fake).mean()  # fool D

Q: Mode Collapse: what is it and how do you fix it?

A:

Problem: Generator produces limited variety, ignoring some modes of data distribution.

Symptoms:

- The same output for different inputs
- Missing classes/styles in generated data
- Good quality but no diversity

Solutions:

1. Feature matching: match intermediate features, not just the output (sketch below)
2. Mini-batch discrimination: D sees multiple samples jointly
3. Unrolled GAN: G optimizes against a future D
4. WGAN: Wasserstein distance → better gradients
5. Spectral normalization: stabilize both G and D
6. Progressive growing: start low-res, gradually increase resolution
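
A sketch of feature matching (item 1), assuming `d_features` exposes an intermediate layer of the discriminator:

import torch.nn.functional as F

# Match the batch statistics of D's intermediate features on real
# vs. generated data instead of only fooling D's final output
def feature_matching_loss(d_features, x_real, x_fake):
    f_real = d_features(x_real).mean(dim=0).detach()
    f_fake = d_features(x_fake).mean(dim=0)
    return F.mse_loss(f_fake, f_real)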

Q: Why is WGAN better than a vanilla GAN?

A:

Standard GAN problem: JS divergence has vanishing gradients when distributions don't overlap.

WGAN solution: Use Wasserstein distance (Earth Mover's distance).

\[W(P_r, P_g) = \inf_{\gamma \in \Pi(P_r, P_g)} \mathbb{E}_{(x,y) \sim \gamma}[\|x - y\|]\]

Kantorovich-Rubinstein duality: \[W(P_r, P_g) \approx \max_{\|f\|_L \leq 1} \mathbb{E}_{x \sim P_r}[f(x)] - \mathbb{E}_{x \sim P_g}[f(x)]\]

Key changes:

1. No sigmoid in D (now called the "critic")
2. Weight clipping OR a gradient penalty (WGAN-GP, sketch below)
3. Train the critic more often than the generator (5:1 ratio)
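
A sketch of the WGAN-GP gradient penalty (item 2), assuming a critic `D` without a sigmoid output:

import torch

# Push the critic's gradient norm toward 1 on random
# interpolations between real and fake samples
def gradient_penalty(D, x_real, x_fake, lam=10.0):
    a = torch.rand(x_real.size(0), 1, 1, 1, device=x_real.device)
    x_hat = (a * x_real + (1 - a) * x_fake).requires_grad_(True)
    grad = torch.autograd.grad(D(x_hat).sum(), x_hat, create_graph=True)[0]
    return lam * ((grad.flatten(1).norm(2, dim=1) - 1) ** 2).mean()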

Benefits: Stable training, meaningful loss, no mode collapse (mostly).

Q: DCGAN: key architectural choices

A:

Guidelines for stable GAN training:

1. Replace pooling with strided convolutions (D) and transposed convolutions (G)
2. Batch normalization in both G and D (except the G output and D input layers)
3. Remove fully connected layers in deeper architectures
4. ReLU in G (except the output: tanh)
5. LeakyReLU in D (avoids sparse gradients)

Generator architecture:

z (100) -> FC -> Reshape -> ConvT(512) -> ConvT(256) -> ConvT(128) -> ConvT(3)
         +BN     +BN       +BN          +BN          +BN        +Tanh

Discriminator architecture:

Image(64x64x3) -> Conv(64) -> Conv(128) -> Conv(256) -> Conv(512) -> FC(1)
                 +LReLU     +LReLU+BN   +LReLU+BN   +LReLU+BN   +Sigmoid

Q: When to use VAE vs GAN vs Diffusion?

A:

| Model | Pros | Cons | Best for |
|-------|------|------|----------|
| VAE | Stable, principled, latent space | Blurry outputs | Anomaly detection, interpolation |
| GAN | Sharp outputs, fast sampling | Unstable, mode collapse | Image synthesis, style transfer |
| Diffusion | High quality, diverse | Slow sampling | SOTA generation, controlled synthesis |

2025-2026 trend: Diffusion dominates image generation, VAE for representation learning, GAN for specific tasks (super-resolution, style).

Hybrid approaches:

- VAE + GAN: VAE encoder + GAN decoder
- Diffusion + VAE: latent diffusion (Stable Diffusion)
- Flow + VAE: normalizing flows for a better posterior