DL Interview: Generative Models
~6 min read
Diffusion Models (DDPM, DDIM, CFG, Latent Diffusion, Consistency Models, U-Net, DiT), Variational Autoencoders (VAE, ELBO, beta-VAE, Posterior Collapse), GAN (WGAN, DCGAN, Mode Collapse).
Diffusion Models
Sources: MIT 6.S183 Diffusion Models (2026), Lilian Weng Diffusion, Sander Dieleman Guidance
Q: How do Diffusion Models work?
A:
Two processes:

1. Forward (diffusion): gradually add noise to the data
2. Reverse (denoising): train a model to remove the noise
Forward process:

$$q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}\, x_{t-1}, \beta_t I)$$
Closed form (with \(\alpha_t = 1 - \beta_t\) and \(\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s\)):

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$
Reverse process (learned):

$$p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))$$
Training objective:

$$\mathcal{L} = \mathbb{E}_{t,x_0,\epsilon}\left[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\right]$$
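A minimal sketch of the corresponding training step, assuming a noise-prediction `model`, a precomputed `alpha_bar` schedule tensor, and `T` total steps:

```python
import torch

def ddpm_training_step(model, x0, alpha_bar, T):
    """Sample a random timestep, noise x0 in closed form, regress the noise."""
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)
    eps = torch.randn_like(x0)
    a = alpha_bar[t].view(-1, 1, 1, 1)          # broadcast over image dims
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps  # closed-form forward process
    return ((eps - model(x_t, t)) ** 2).mean()  # simplified DDPM loss
```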
Q: DDPM vs DDIM -- what's the difference?
A:
| Aspect | DDPM | DDIM |
|---|---|---|
| Sampling | Stochastic (Markov chain) | Deterministic (non-Markov) |
| Steps | 1000+ | 10-50 (10-100x faster) |
| Quality | Higher | Slightly lower but fast |
| Trade-off | Time vs quality | Adjustable via \(\eta\) |
DDPM sampling:

for t in reversed(range(T)):
    eps = model(x_t, t)
    x_t = (x_t - beta[t] / sqrt(1 - alpha_bar[t]) * eps) / sqrt(alpha[t])
    if t > 0:
        x_t += sigma[t] * torch.randn_like(x_t)  # stochastic step
DDIM sampling:

for t in reversed(range(1, T)):
    eps = model(x_t, t)
    pred_x0 = (x_t - sqrt(1 - alpha_bar[t]) * eps) / sqrt(alpha_bar[t])
    x_t = sqrt(alpha_bar[t-1]) * pred_x0 + sqrt(1 - alpha_bar[t-1]) * eps
    # no noise added -- deterministic when eta = 0
Eta parameter: \(\eta=0\) → deterministic, \(\eta=1\) → DDPM equivalent
Q: What is Classifier-Free Guidance?
A:
Problem: conditional generation is hard to control.
Solution: train one model in both conditional and unconditional modes, then combine the two predictions at inference.
Training: randomly drop the condition (e.g., with 10% probability), replacing it with a null token \(\emptyset\)
Inference:

$$\tilde{\epsilon}_\theta = \epsilon_\theta(x_t, t, \emptyset) + s \cdot (\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \emptyset))$$
Guidance scale \(s\):

- \(s = 1\): plain conditional prediction, no guidance
- \(s > 1\): more faithful to the condition, less diversity
- Typical: \(s \in [7, 15]\) for Stable Diffusion
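A minimal sketch of one guided denoising step under the formula above; `model`, `cond`, and the null token `null_cond` are assumed to exist:

```python
def cfg_eps(model, x_t, t, cond, null_cond, s=7.5):
    """Classifier-free guidance: push the unconditional prediction toward the conditional one."""
    eps_uncond = model(x_t, t, null_cond)  # condition dropped
    eps_cond = model(x_t, t, cond)         # condition provided
    return eps_uncond + s * (eps_cond - eps_uncond)
```

In practice both predictions are usually computed in one batched forward pass.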
Q: Latent Diffusion vs Pixel Diffusion
A:
Pixel diffusion:

- Works directly on images
- Expensive: 256x256 = 65K pixels per channel (~200K dimensions with RGB)
- High compute cost
Latent Diffusion (LDM):

- Compress the image into a latent space (VAE encoder)
- Run diffusion in the latent space (e.g., 32x32x4 = 4K dimensions for a 256x256 image)
- Decode back to pixel space
# Latent Diffusion pipeline
z = vae.encode(image)      # [B, 3, 256, 256] -> [B, 4, 32, 32]
z = diffusion_process(z)   # diffusion runs in the latent space
image = vae.decode(z)      # back to pixel space
Benefits:

- 16x+ faster training
- Lower memory
- Same quality
Q: What are Consistency Models?
A:
Goal: One-step generation from diffusion-trained model.
Approach: Train model to map any \(x_t\) directly to \(x_0\) (consistency function).
Training methods:

1. Consistency Distillation (CD): distill from a pretrained diffusion model
2. Consistency Training (CT): train from scratch
Result: 1-2 steps vs 1000 DDPM steps, similar quality.
# Inference with a consistency model
x_T = torch.randn(batch_size, *image_shape)  # pure noise; image_shape e.g. (3, 64, 64)
x_0 = consistency_model(x_T, T)              # single step!
image = x_0  # done
Q: Diffusion Architectures (U-Net, DiT)
A:
U-Net (classic):

- Encoder-decoder with skip connections
- Self-attention at low resolutions
- Time embedding injected via AdaGN (adaptive group norm)
import math

class TimeEmbedding(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.dim = dim

    def forward(self, t):
        # Sinusoidal timestep embedding (as in Transformer positional encoding)
        half = self.dim // 2
        emb = math.log(10000) / (half - 1)
        emb = torch.exp(torch.arange(half, device=t.device) * -emb)
        emb = t[:, None] * emb[None, :]
        return torch.cat([emb.sin(), emb.cos()], dim=-1)
DiT (Diffusion Transformer):

- Pure transformer, no convolutions
- Patchify the image into tokens (see the sketch after the table)
- AdaLN-Zero for time conditioning
- Scales better than U-Net
Comparison:

| Aspect | U-Net | DiT |
|---|---|---|
| Inductive bias | Locality (conv) | Global (attention) |
| Scaling | Plateaus | Better with data |
| Speed | Faster (small models) | Slower (large models) |
| SOTA | SD 1.5/2.1 | SD 3, Sora |
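A sketch of the patchify step mentioned above; `p` is the patch size (pure tensor ops, no external dependencies):

```python
import torch

def patchify(x, p=2):
    """Turn a [B, C, H, W] latent into [B, (H/p)*(W/p), C*p*p] tokens for a DiT."""
    B, C, H, W = x.shape
    x = x.unfold(2, p, p).unfold(3, p, p)                # [B, C, H/p, W/p, p, p]
    x = x.permute(0, 2, 3, 1, 4, 5)                      # group spatial patches first
    return x.reshape(B, (H // p) * (W // p), C * p * p)  # flatten each patch to a token
```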
Variational Autoencoders (VAE)
Sources: Credmark VAE Questions, Medium VAE Overview
Q: How does a VAE work?
A:
Architecture:

1. Encoder: \(x \rightarrow (\mu, \sigma)\) -- the parameters of the latent distribution
2. Sampling: \(z = \mu + \sigma \cdot \epsilon\), where \(\epsilon \sim \mathcal{N}(0, I)\)
3. Decoder: \(z \rightarrow \hat{x}\) -- the reconstruction
Key difference from AE: VAE learns distribution over latents, not point estimates.
class VAE(nn.Module):
    def __init__(self, input_dim, latent_dim):
        super().__init__()
self.encoder = nn.Sequential(
nn.Linear(input_dim, 256), nn.ReLU(),
nn.Linear(256, latent_dim * 2) # mu and log_var
)
self.decoder = nn.Sequential(
nn.Linear(latent_dim, 256), nn.ReLU(),
nn.Linear(256, input_dim), nn.Sigmoid()
)
def reparameterize(self, mu, log_var):
std = torch.exp(0.5 * log_var)
eps = torch.randn_like(std)
return mu + eps * std
def forward(self, x):
h = self.encoder(x)
mu, log_var = h.chunk(2, dim=-1)
z = self.reparameterize(mu, log_var)
return self.decoder(z), mu, log_var
Q: What is the ELBO and why do we optimize it?
A:
Evidence Lower Bound:

$$\log p(x) \geq \mathbb{E}_{q(z|x)}[\log p(x|z)] - D_{KL}(q(z|x) \| p(z))$$
Components:

1. Reconstruction term: \(\mathbb{E}_{q(z|x)}[\log p(x|z)]\) -- the decoder reconstructs \(x\)
2. Regularization term: \(D_{KL}\) -- keeps the latent distribution close to the prior
Why a lower bound: the true likelihood \(\log p(x) = \log \int p(x|z)p(z)dz\) is intractable; introducing \(q(z|x)\) and applying Jensen's inequality gives a tractable bound, tight when \(q(z|x) = p(z|x)\).
Loss function:
def vae_loss(x, x_recon, mu, log_var, beta=1.0):
recon = F.binary_cross_entropy(x_recon, x, reduction='sum')
kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
return recon + beta * kl
Q: Reparameterization Trick -- why is it needed?
A:
Problem: Can't backprop through random sampling \(z \sim q(z|x)\).
Solution: factor out the randomness:

$$z = \mu + \sigma \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$
Now gradients flow through \(\mu\) and \(\sigma\).
Without the trick: `z = torch.normal(mu, std)` -- the sampling node is stochastic, so no gradient reaches `mu` or `std`.

With the trick: `z = mu + std * torch.randn_like(std)` -- the randomness enters as an input, so the path to `mu` and `std` is deterministic and differentiable (see the check below).
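A quick autograd check (illustrative only) that gradients reach `mu` and `log_var` through the reparameterized sample:

```python
import torch

mu = torch.zeros(3, requires_grad=True)
log_var = torch.zeros(3, requires_grad=True)

eps = torch.randn(3)                     # randomness enters as an input
z = mu + torch.exp(0.5 * log_var) * eps  # reparameterized sample

z.sum().backward()
print(mu.grad)       # tensor([1., 1., 1.]) -- gradients flow
print(log_var.grad)  # nonzero (almost surely) as well
```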
Q: Posterior Collapse -- what is it and how to fix it?
A:
Problem: the decoder learns to ignore the latent \(z\); the KL term collapses to 0, the latent carries no information, and the VAE degenerates into an unconditional decoder.
Symptoms:

- KL divergence very small (< 0.1)
- Good reconstruction, but the latent space is meaningless
- \(\mu \approx 0, \sigma \approx 1\) for all inputs
Solutions (annealing and free bits are sketched below):

1. KL annealing: gradually increase \(\beta\) from 0 to 1
2. Free bits: enforce a minimum KL per dimension
3. Weaker decoder: a less powerful decoder is forced to rely on the latent
4. Lower \(\beta\): use \(\beta < 1\) to relax the KL pressure that drives collapse
# KL annealing schedule: beta goes 0 -> 1 over the warmup epochs
def get_beta(epoch, warmup=10):
    return min(1.0, epoch / warmup)

# Free bits: enforce a minimum KL per latent dimension
free_bits = 0.5
kl_loss = torch.clamp(kl_per_dim, min=free_bits).sum()  # kl_per_dim: KL of each z_i
Q: \(\beta\)-VAE and Disentanglement
A:
\(\beta\)-VAE: scale the KL term with \(\beta > 1\):

$$\mathcal{L} = \mathbb{E}[\log p(x|z)] - \beta \cdot D_{KL}(q(z|x) \| p(z))$$
Effect of \(\beta\):

- \(\beta = 1\): standard VAE
- \(\beta > 1\): more disentangled, worse reconstruction
- \(\beta \approx 4\): good disentanglement/quality trade-off
Disentanglement: Each latent dimension captures one factor of variation (e.g., rotation, color, size).
Evaluation:

- Traverse each \(z_i\) independently and observe the changes
- Metrics: BetaVAE score, FactorVAE score, DCI
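A hedged sketch of such a latent traversal; `decoder` and an encoded latent `z` are assumed:

```python
import torch

def traverse_latent(decoder, z, dim, values=(-3.0, -1.5, 0.0, 1.5, 3.0)):
    """Vary one latent dimension, hold the rest fixed, and decode each variant."""
    images = []
    for v in values:
        z_mod = z.clone()
        z_mod[:, dim] = v
        images.append(decoder(z_mod))
    # If dim is disentangled, exactly one factor (e.g., rotation) should change
    return torch.stack(images)
```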
Generative Adversarial Networks (GAN)
Q: Autoencoder architecture and applications
A:
Architecture:

- Encoder: \(x \rightarrow z\) (compress to a latent space)
- Decoder: \(z \rightarrow \hat{x}\) (reconstruct)
Loss:

$$L = \|x - \hat{x}\|^2 + \lambda \cdot \text{regularization}$$
Applications:

- Dimensionality reduction
- Denoising (train on noisy inputs to reconstruct clean ones)
- Anomaly detection (high reconstruction error = anomaly; see the sketch below)
- Feature learning
Limitation: Not generative -- can't sample new data.
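A minimal anomaly-detection sketch based on reconstruction error, assuming `autoencoder` was trained on normal data only and `threshold` was tuned on validation data:

```python
import torch

def anomaly_scores(autoencoder, x):
    """Per-sample reconstruction error; high error suggests an anomaly."""
    with torch.no_grad():
        x_recon = autoencoder(x)
    return ((x - x_recon) ** 2).flatten(1).mean(dim=1)

# scores = anomaly_scores(autoencoder, batch)
# anomalies = scores > threshold
```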
Q: Why does a VAE generate blurrier images than a GAN?
A:
VAE issue: Optimizes reconstruction + KL, tends to average modes.
Mathematical reason:

- VAE maximizes likelihood: \(p(x) = \int p(x|z)p(z)dz\)
- Under a Gaussian likelihood, the optimal output when uncertain is the average of the plausible images
- Result: blurry, averaged images
GAN advantage: Adversarial loss forces sharp, realistic outputs.
Modern solutions:

- VQ-VAE: discrete latent space
- VAE + GAN: combine reconstruction and adversarial losses
- NVAE: hierarchical VAE with better priors
Q: GAN -- how does adversarial training work?
A:
Two players:

- Generator G: maps noise \(z\) to fake data \(G(z)\)
- Discriminator D: classifies real vs fake
Objective:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$
Training dynamics:

- D tries to correctly classify real vs fake
- G tries to fool D
- Nash equilibrium: \(p_G = p_{data}\), \(D(x) = 0.5\)
# GAN training step
# Train D (detach G(z) so the generator isn't updated through D's loss)
d_real = D(x_real)
d_fake = D(G(z).detach())
loss_d = -torch.log(d_real).mean() - torch.log(1 - d_fake).mean()

# Train G with the non-saturating loss: maximize log D(G(z))
d_fake = D(G(z))
loss_g = -torch.log(d_fake).mean()  # fool D
Q: Mode Collapse -- what is it and how to fix it?
A:
Problem: Generator produces limited variety, ignoring some modes of data distribution.
Symptoms:

- Same output for different inputs
- Missing classes/styles in the generated data
- Good quality but no diversity
Solutions (feature matching is sketched below):

1. Feature matching: match intermediate D features, not just the output
2. Mini-batch discrimination: D sees multiple samples at once
3. Unrolled GAN: G optimizes against a future D
4. WGAN: Wasserstein distance → better gradients
5. Spectral normalization: stabilize both G and D
6. Progressive growing: start low-res, gradually increase
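A sketch of feature matching (item 1); `d_features` is a hypothetical function returning an intermediate discriminator activation:

```python
def feature_matching_loss(d_features, x_real, x_fake):
    """Match batch-averaged intermediate D features instead of the final D output."""
    f_real = d_features(x_real).mean(dim=0).detach()  # real-batch statistics, no grad
    f_fake = d_features(x_fake).mean(dim=0)
    return ((f_real - f_fake) ** 2).mean()
```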
Q: WGAN -- why is it better than a vanilla GAN?
A:
Standard GAN problem: JS divergence has vanishing gradients when distributions don't overlap.
WGAN solution: Use Wasserstein distance (Earth Mover's distance).
Kantorovich-Rubinstein duality:

$$W(P_r, P_g) = \sup_{\|f\|_L \leq 1} \mathbb{E}_{x \sim P_r}[f(x)] - \mathbb{E}_{x \sim P_g}[f(x)]$$
Key changes:

1. No sigmoid in D (now called a "critic")
2. Weight clipping OR gradient penalty (WGAN-GP; see the sketch below)
3. Train the critic more often than the generator (e.g., a 5:1 ratio)
Benefits: stable training, a loss that tracks sample quality, and (mostly) no mode collapse.
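A sketch of the gradient penalty from WGAN-GP (item 2 above); `critic` is assumed, and `lambda_gp = 10` follows the paper's default:

```python
import torch

def gradient_penalty(critic, x_real, x_fake, lambda_gp=10.0):
    """Penalize the critic's gradient norm on real/fake interpolates (soft 1-Lipschitz)."""
    alpha = torch.rand(x_real.size(0), 1, 1, 1, device=x_real.device)
    x_hat = (alpha * x_real + (1 - alpha) * x_fake).requires_grad_(True)
    grads = torch.autograd.grad(critic(x_hat).sum(), x_hat, create_graph=True)[0]
    return lambda_gp * ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()
```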
Q: DCGAN -- key architectural choices
A:
Guidelines for stable GAN training:

1. Replace pooling with strided convolutions (D) and transposed convolutions (G)
2. Batch normalization in both G and D (except the G output and D input layers)
3. Remove fully connected layers in deeper architectures
4. ReLU in G (except the output: tanh)
5. LeakyReLU in D (avoids sparse gradients)
Generator architecture:

z (100) -> FC+BN -> Reshape -> ConvT(512)+BN -> ConvT(256)+BN -> ConvT(128)+BN -> ConvT(3)+Tanh
Discriminator architecture:

Image(64x64x3) -> Conv(64)+LReLU -> Conv(128)+BN+LReLU -> Conv(256)+BN+LReLU -> Conv(512)+BN+LReLU -> FC(1)+Sigmoid
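A compact PyTorch sketch of the generator diagram above; the 1024-channel projection and the 4x4/stride-2 kernels are assumptions following the original DCGAN setup:

```python
import torch.nn as nn

class DCGANGenerator(nn.Module):
    def __init__(self, z_dim=100):
        super().__init__()
        self.fc = nn.Linear(z_dim, 1024 * 4 * 4)  # project z, reshape to [1024, 4, 4]
        self.net = nn.Sequential(
            nn.BatchNorm2d(1024), nn.ReLU(),
            nn.ConvTranspose2d(1024, 512, 4, 2, 1), nn.BatchNorm2d(512), nn.ReLU(),
            nn.ConvTranspose2d(512, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.ReLU(),
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.ConvTranspose2d(128, 3, 4, 2, 1), nn.Tanh(),  # 64x64x3, no BN on output
        )

    def forward(self, z):
        return self.net(self.fc(z).view(-1, 1024, 4, 4))
```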
Q: When should you use VAE vs GAN vs Diffusion?
A:
| Model | Pros | Cons | Best for |
|---|---|---|---|
| VAE | Stable, principled, latent space | Blurry outputs | Anomaly detection, interpolation |
| GAN | Sharp outputs, fast sampling | Unstable, mode collapse | Image synthesis, style transfer |
| Diffusion | High quality, diverse | Slow sampling | SOTA generation, controlled synthesis |
2025-2026 trend: Diffusion dominates image generation, VAE for representation learning, GAN for specific tasks (super-resolution, style).
Hybrid approaches:

- VAE + GAN: VAE encoder + GAN decoder
- Diffusion + VAE: latent diffusion (Stable Diffusion)
- Flow + VAE: normalizing flows for a better posterior