DL Interview: Training and Optimization¶
~9 min read
Second-Order Optimization (Newton, BFGS, L-BFGS, Natural Gradient, AdaHessian), Mixed Precision Training (FP16/BF16, Loss Scaling, AMP), Gradient Debugging (Vanishing/Exploding, Clipping, Monitoring), Gradient Flow Analysis, Gradient Checkpointing, Advanced Regularization (DropPath, Mixup, CutMix, Label Smoothing).
Second-Order Optimization Methods¶
Q: Why do second-order methods converge faster?¶
A:
First-order (SGD): uses only the gradient -- the descent direction.
Second-order (Newton): uses the Hessian -- curvature information.
Newton update: $$\theta_{t+1} = \theta_t - H^{-1} \nabla L(\theta_t)$$
Advantage: adaptive step size based on curvature. - Steep curvature -> small steps - Flat curvature -> large steps
Convergence: - SGD: \(O(1/k)\) or \(O(1/\sqrt{k})\) - Newton: quadratic convergence -- \(e_{k+1} = O(e_k^2)\), the number of correct digits roughly doubles at each step
Q: Why isn't Newton's method used in deep learning?¶
A:
Problems:
1. Hessian size: \(O(D^2)\) for \(D\) parameters. GPT-3 (175B params) -> ~\(10^{22}\) entries!
2. Inverse computation: \(O(D^3)\) -- infeasible
3. Saddle points: Newton is attracted to saddle points rather than repelled from them
4. Non-convexity: the Hessian may not be positive definite
Solutions: - Quasi-Newton methods (BFGS, L-BFGS) - Hessian-free optimization - Natural gradient (Fisher information)
Q: BFGS vs L-BFGS -- what's the difference?¶
A:
BFGS (Broyden-Fletcher-Goldfarb-Shanno): Quasi-Newton method, approximates \(H^{-1}\).
Update rule: $$\theta_{t+1} = \theta_t - \alpha_t M_t \nabla L(\theta_t)$$
where \(M_t \approx H^{-1}\) (iteratively refined).
L-BFGS (Limited-memory BFGS): - Does not store the full \(M_t\) (too large!) - Keeps only the last \(m\) gradient differences - Memory: \(O(mD)\) instead of \(O(D^2)\) (see the usage sketch after the table below)
| Method | Memory | When to use |
|---|---|---|
| BFGS | \(O(D^2)\) | Small problems (<10K params) |
| L-BFGS | \(O(mD)\) | Large problems, batch optimization |
| SGD/Adam | \(O(D)\) | Deep learning, stochastic |
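For completeness, PyTorch ships an L-BFGS implementation as torch.optim.LBFGS; a minimal sketch on a made-up toy quadratic (it requires a closure because the line search may re-evaluate the loss several times per step):

import torch

# Hypothetical toy problem: minimize a quadratic bowl with L-BFGS
theta = torch.randn(100, requires_grad=True)
optimizer = torch.optim.LBFGS([theta], lr=1.0, history_size=10, max_iter=20)

def closure():
    # L-BFGS may call this several times per step (line search)
    optimizer.zero_grad()
    loss = (theta ** 2).sum()
    loss.backward()
    return loss

for _ in range(5):
    optimizer.step(closure)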
Q: Natural Gradient -- what's the idea?¶
A:
Problem: the standard gradient ignores the geometry of the parameter space.
Natural gradient: the gradient in the space of probability distributions, not parameters: $$\tilde{\nabla} L = F^{-1} \nabla L$$
where \(F\) is the Fisher Information Matrix.
Intuition: Direction of steepest descent in distribution space.
Properties: - Invariant to reparameterization - Faster convergence on ill-conditioned problems - Expensive: \(F^{-1}\) is \(O(D^2)\)
Approximations: - K-FAC (Kronecker-factored Approximate Curvature) - Adam as diagonal natural gradient approximation
Q: Conjugate Gradient Method -- when should it be used?¶
A:
Idea: Find directions conjugate to previous directions, avoid redundant exploration.
Conjugate directions: \(d_i^T H d_j = 0\) for \(i \neq j\)
Algorithm: $$d_t = -g_t + \beta_t d_{t-1}$$
Beta computation (Fletcher-Reeves): $$\beta_t = \frac{g_t^T g_t}{g_{t-1}^T g_{t-1}}$$
Pros: - No Hessian storage - Guaranteed convergence in <= D steps for quadratic - Works well for large-scale optimization
Cons: - Designed for convex/quadratic problems - Requires exact line search (or good approximation) - Less effective for highly non-convex (deep learning)
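A minimal sketch of the Fletcher-Reeves update on a toy quadratic (the fixed step size is an assumption standing in for a proper line search):

import torch

def loss_fn(theta):
    return (theta ** 2).sum()

theta = torch.randn(10, requires_grad=True)
d, g_prev = None, None
for _ in range(50):
    g, = torch.autograd.grad(loss_fn(theta), theta)
    if d is None:
        d = -g
    else:
        beta = (g @ g) / (g_prev @ g_prev)  # Fletcher-Reeves beta
        d = -g + beta * d
    with torch.no_grad():
        theta += 0.1 * d  # crude fixed step instead of a line search
    g_prev = g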
Q: AdaHessian -- what is it?¶
A:
Idea: Adam + diagonal Hessian approximation for adaptive learning.
Hessian diagonal estimation: Hutchinson's method with random vectors: $$\text{diag}(H) \approx \mathbb{E}_z\left[z \odot (Hz)\right]$$
where \(z\) is drawn from a Rademacher distribution (±1).
AdaHessian update: $$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2) D_t^2, \qquad \theta_{t+1} = \theta_t - \alpha \frac{m_t}{\sqrt{v_t} + \epsilon}$$ where \(D_t\) is the Hutchinson estimate of the Hessian diagonal.
Advantage over Adam: Uses curvature information, better on ill-conditioned problems.
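A rough sketch of the Hutchinson diagonal estimate in PyTorch; the function name and structure are illustrative, not AdaHessian's actual implementation:

import torch

def hessian_diag_estimate(loss, params, n_samples=1):
    # First backward pass with create_graph=True so we can differentiate again
    grads = torch.autograd.grad(loss, params, create_graph=True)
    diag = [torch.zeros_like(p) for p in params]
    for _ in range(n_samples):
        # Rademacher vectors z in {-1, +1}
        zs = [torch.randint_like(p, 2) * 2 - 1 for p in params]
        # Hessian-vector products Hz via a second backward pass
        hvs = torch.autograd.grad(grads, params, grad_outputs=zs, retain_graph=True)
        for d, z, hv in zip(diag, zs, hvs):
            d += z * hv / n_samples  # E_z[z ⊙ Hz] ≈ diag(H)
    return diag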
Mixed Precision Training (FP16/BF16)¶
Sources: BuildAI: Mixed Precision Training (2025), RunPod: FP16/BF16/FP8 Guide (2025)
Q: What is Mixed Precision Training?¶
A:
Idea: Use lower precision (FP16/BF16) for most operations, keep FP32 for critical parts.
Motivation: - 2x memory savings: FP16 = 2 bytes vs FP32 = 4 bytes per parameter - 2-4x speedup: tensor cores (H100: 989 TFLOPS FP16 vs 67 TFLOPS FP32) - Larger batches/models: fit more in GPU memory
Key insight: Neural networks are precision-resilient -- small numerical errors don't hurt learning.
Q: FP16 vs BF16 -- what's the difference?¶
A:
| Property | FP32 | FP16 | BF16 |
|---|---|---|---|
| Bits | 32 | 16 | 16 |
| Exponent | 8 | 5 | 8 |
| Mantissa | 23 | 10 | 7 |
| Max value | 3.4e38 | 65504 | 3.4e38 |
| Min positive | 1.2e-38 | 6.1e-5 | 1.2e-38 |
| Precision | High | Medium | Low |
FP16: higher precision, smaller range -> gradient underflow risk.
BF16: same range as FP32, lower precision -> safer for LLMs.
Recommendation: BF16 preferred for modern training (TPU, Ampere+ GPUs).
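A quick way to see the range difference (a small demo, values not from the source):

import torch

x = torch.tensor(1e-8)
print(x.to(torch.float16))   # underflows to 0: below FP16's smallest subnormal (~6e-8)
print(x.to(torch.bfloat16))  # stays nonzero: BF16 shares FP32's exponent range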
Q: Why is Loss Scaling needed?¶
A:
Problem: gradients are often very small (< 6.1e-5) and underflow to zero in FP16.
Solution: Multiply loss by large constant before backward, divide gradients after:
# Conceptual
scaled_loss = loss * scale # e.g., scale = 65536
scaled_loss.backward()
grad = grad / scale
optimizer.step()
Dynamic scaling (PyTorch AMP):
from torch.cuda.amp import GradScaler, autocast
scaler = GradScaler(init_scale=65536)
with autocast():
    output = model(x)
    loss = criterion(output, y)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update() # Auto-adjusts scale
GradScaler behavior:
- If inf/nan in gradients -> skip step, reduce scale
- If no overflow for growth_interval steps -> increase scale
Q: PyTorch AMP -- how do you use it?¶
A:
import torch
import torch.nn as nn
from torch.cuda.amp import autocast, GradScaler
model = MyLargeModel().cuda()
optimizer = torch.optim.Adam(model.parameters())
scaler = GradScaler()
for batch, targets in dataloader:
    batch, targets = batch.cuda(), targets.cuda()
    optimizer.zero_grad()

    # Forward pass with autocast
    with autocast():  # Auto FP16 for matmul, FP32 for loss
        output = model(batch)
        loss = nn.CrossEntropyLoss()(output, targets)

    # Scaled backward
    scaler.scale(loss).backward()

    # Unscale for gradient clipping
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    # Step with scaling
    scaler.step(optimizer)
    scaler.update()
What autocast does: - FP16 ops: matmul, conv (whitelisted) - FP32 ops: loss, softmax, batchnorm (blacklisted for stability)
Q: When does Mixed Precision work poorly?¶
A:
Problem cases: 1. Very small gradients -> even with scaling, may underflow 2. Numerically sensitive ops -> may need FP32 3. Old GPUs -> no tensor cores, no BF16 support 4. Mixed batch norm + small batches -> unstable
Solutions:
- Use BF16 if available (Ampere+, TPU) -- see the sketch after this list
- Keep batch norm in FP32
- Use torch.set_float32_matmul_precision('high') for newer GPUs
- Monitor for inf/nan in training
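With BF16 the range matches FP32, so a GradScaler is usually unnecessary; a minimal sketch assuming model, batch, targets, criterion and optimizer are defined as in the AMP example above:

import torch

# BF16 autocast on Ampere+ GPUs; no loss scaling needed
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    output = model(batch)
    loss = criterion(output, targets)
loss.backward()
optimizer.step()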
Debug checklist:
# Check for gradient issues
for name, param in model.named_parameters():
    if param.grad is not None:
        if torch.isnan(param.grad).any():
            print(f"NaN gradient in {name}")
        if torch.isinf(param.grad).any():
            print(f"Inf gradient in {name}")
Debugging Neural Networks: Gradient Issues¶
Q: Vanishing vs Exploding Gradients -- what's the difference?¶
A:
| Problem | Cause | Symptom | Solution |
|---|---|---|---|
| Vanishing | \(\|W\| < 1\), sigmoid derivatives | Early layers don't learn, loss plateaus | ReLU, BatchNorm, skip connections |
| Exploding | \(\|W\| > 1\), deep networks | Loss spikes, NaN, divergence | Gradient clipping, smaller LR |
Chain rule insight: $$\frac{\partial L}{\partial h_1} = \frac{\partial L}{\partial h_L} \prod_{l=2}^{L} \frac{\partial h_l}{\partial h_{l-1}}$$
If each factor has norm < 1 -> the gradient decays exponentially with depth. For example, with sigmoid activations (derivative at most 0.25), fifty layers give a factor of at most \(0.25^{50} \approx 10^{-30}\).
Q: How do you detect gradient issues?¶
A:
1. Monitor gradient norms per layer:
def log_gradient_norms(model, step, logger):
    for name, param in model.named_parameters():
        if param.grad is not None:
            grad_norm = param.grad.norm().item()
            logger.log({f"grad_norm/{name}": grad_norm}, step=step)
2. Signs of vanishing gradients: - Early layer norms ~ 0 while later layers normal - Loss decreases slowly or plateaus - Model only learns shallow features
3. Signs of exploding gradients: - Loss becomes NaN - Gradient norms spike suddenly - Weights become very large
4. Visual check:
grad_norms = []
for name, param in model.named_parameters():
    if param.grad is not None:
        grad_norms.append((name, param.grad.norm().item()))

# Sort by norm
grad_norms.sort(key=lambda x: x[1], reverse=True)
for name, norm in grad_norms[:10]:
    print(f"{name}: {norm:.2e}")
Q: Gradient Clipping -- how and when should it be used?¶
A:
Gradient clipping caps the gradient norm at max_norm:
# By norm (most common)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# By value (clip each parameter)
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)
Formula: if \(\|g\|_2 > c\): $$g \leftarrow g \cdot \frac{c}{\|g\|_2}$$
When to use: - RNNs/LSTMs (almost always) - Large models with potential instabilities - Training with large learning rates - When you see occasional loss spikes
Choosing max_norm: - Start with 1.0 - Monitor: if clipping happens >50% of steps -> increase - If still getting instabilities -> decrease
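Since clip_grad_norm_ returns the pre-clip norm, clipping frequency is easy to track; a sketch assuming the usual model/optimizer/dataloader setup:

import torch

max_norm = 1.0
clip_events, total_steps = 0, 0
for batch, targets in dataloader:
    loss = criterion(model(batch), targets)
    loss.backward()
    # clip_grad_norm_ returns the total norm before clipping
    total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
    clip_events += int(total_norm > max_norm)
    total_steps += 1
    optimizer.step()
    optimizer.zero_grad()
print(f"Clipping fired on {clip_events / total_steps:.0%} of steps")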
Q: How do you stabilize training of a deep network?¶
A:
1. Weight initialization:
# Xavier (Glorot) for tanh/sigmoid
nn.init.xavier_uniform_(layer.weight)
# He initialization for ReLU
nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')
2. Normalization layers: - BatchNorm: normalizes across batch - LayerNorm: normalizes across features (transformers) - GroupNorm: middle ground
3. Activation functions:

| Activation | Pros | Cons |
|---|---|---|
| ReLU | Fast, no vanishing (positive side) | Dying neurons |
| LeakyReLU | No dying neurons | Extra hyperparameter |
| GELU | Smoother, better for transformers | Slightly slower |
| Swish | Self-gated, deep networks | Slower |
4. Learning rate warmup:
def get_lr(step, warmup_steps, d_model):
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
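One way to wire this schedule into PyTorch is LambdaLR; a sketch where warmup_steps=4000 and d_model=512 are illustrative values, and the optimizer's base lr is 1.0 so the lambda defines the actual learning rate:

import torch

scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda step: get_lr(max(step, 1), warmup_steps=4000, d_model=512),
)
# Call scheduler.step() after each optimizer.step()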
5. Skip connections (ResNet): $$y = F(x) + x$$
Gradients flow directly through identity mapping.
Q: Production debugging checklist?¶
A:
class TrainingMonitor:
    def __init__(self, model):
        self.model = model
        self.grad_history = []

    def check_gradients(self):
        total_norm = 0.0
        nan_count = 0
        inf_count = 0
        for name, param in self.model.named_parameters():
            if param.grad is not None:
                if torch.isnan(param.grad).any():
                    print(f"[WARNING] NaN in {name}")
                    nan_count += 1
                if torch.isinf(param.grad).any():
                    print(f"[WARNING] Inf in {name}")
                    inf_count += 1
                param_norm = param.grad.norm().item()
                total_norm += param_norm ** 2
                self.grad_history.append((name, param_norm))
        total_norm = total_norm ** 0.5
        return {
            'total_norm': total_norm,
            'nan_count': nan_count,
            'inf_count': inf_count,
            'status': 'OK' if nan_count == 0 and inf_count == 0 else 'ISSUE'
        }
Alert thresholds: - Gradient norm > 100 -> potential explosion - Gradient norm < 1e-7 -> potential vanishing - Any NaN/Inf -> immediate investigation
Gradient Flow Analysis¶
Q: Gradient explosion -- how do you detect it in PyTorch?¶
A:
L2 Gradient Norm:
def compute_gradient_norm(model):
    total_norm = 0
    for p in model.parameters():
        if p.grad is not None:
            param_norm = p.grad.data.norm(2)
            total_norm += param_norm.item() ** 2
    return total_norm ** 0.5

# In training loop
loss.backward()
grad_norm = compute_gradient_norm(model)
if grad_norm > 100:
    print(f"Warning: Large gradient norm = {grad_norm}")
Signs of explosion: - Norm > 100 -- possibly the start of problems - Norm > 1000 -- definitely an explosion - Rapid growth between iterations
Q: Gradient accumulation -- why and how?¶
A:
Gradient accumulation enables an effective batch size larger than what fits in GPU memory.
Problem: a large batch improves training but doesn't fit in GPU memory.
accumulation_steps = 4  # Effective batch = batch_size * 4
optimizer.zero_grad()  # Reset ONLY at start of accumulation
for i, (inputs, labels) in enumerate(dataloader):
    outputs = model(inputs)
    loss = criterion(outputs, labels) / accumulation_steps  # Scale loss!
    loss.backward()  # Accumulate gradients
    if (i + 1) % accumulation_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        optimizer.zero_grad()  # Reset for next accumulation
Critical: divide the loss by accumulation_steps so that the gradient magnitude stays correct.
Q: Diagnosing gradient flow problems -- what's the checklist?¶
A:
class GradientMonitor:
    def __init__(self, model, log_freq=100):
        self.model = model
        self.log_freq = log_freq
        self.step = 0

    def log_gradients(self):
        if self.step % self.log_freq != 0:
            self.step += 1
            return
        stats = {}
        for name, param in self.model.named_parameters():
            if param.grad is not None:
                grad = param.grad.data
                stats[name] = {
                    'norm': grad.norm().item(),
                    'max': grad.max().item(),
                    'min': grad.min().item(),
                    'mean': grad.mean().item(),
                    'has_nan': torch.isnan(grad).any().item(),
                    'has_inf': torch.isinf(grad).any().item(),
                }
        total_norm = sum(s['norm'] ** 2 for s in stats.values()) ** 0.5
        max_grad = max(s['max'] for s in stats.values())
        min_grad = min(s['min'] for s in stats.values())
        print(f"Step {self.step}:")
        print(f"  Total norm: {total_norm:.4f}")
        print(f"  Max: {max_grad:.4f}, Min: {min_grad:.4f}")
        if total_norm > 100:
            print(f"  WARNING: Large gradient norm: {total_norm}")
        if any(s['has_nan'] or s['has_inf'] for s in stats.values()):
            print("  CRITICAL: NaN/Inf detected!")
        self.step += 1
        return stats
Diagnostic table:
| Symptom | Possible Cause | Solution |
|---|---|---|
| Norm > 100 consistently | LR too high | Reduce LR, add clipping |
| Norm grows over time | Learning rate schedule | Add warmup, decay |
| NaN/Inf | Numerical instability | FP32, gradient clipping |
| Norm ~ 0 | Vanishing gradients | Better init, skip connections |
| Spikes in norm | Bad batch | Data validation, gradient clipping |
Q: How do you prevent gradient explosion architecturally?¶
A:
1. Proper Initialization:
import torch.nn.init as init
# Xavier/Glorot for tanh
init.xavier_uniform_(layer.weight)
# He/Kaiming for ReLU
init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')
2. Skip Connections: $$\mathbf{y} = f(\mathbf{x}) + \mathbf{x}$$
The gradient flows directly through the identity path: \(\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \frac{\partial f}{\partial \mathbf{x}} + \mathbf{I}\)
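A minimal residual block sketch illustrating the identity path (the structure is illustrative):

import torch.nn as nn

class ResidualBlock(nn.Module):
    # y = f(x) + x, so part of the gradient always flows through the identity branch
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return self.f(x) + x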
3. Pre-LN vs Post-LN:
# Pre-LN (more stable for deep Transformers)
x = x + attention(layer_norm(x))
x = x + ffn(layer_norm(x))
# Post-LN (original, less stable)
x = layer_norm(x + attention(x))
x = layer_norm(x + ffn(x))
Gradient Checkpointing & Activation Recomputation¶
Q: What is Gradient Checkpointing?¶
A:
Core idea: Trade compute for memory -- don't store all activations during forward pass, recompute them during backward.
Normal training:
Forward: A -> store, B -> store, C -> store, D -> store
Backward: Use stored activations for D, C, B, A
Memory: All activations in memory
With checkpointing (B and C checkpointed):
Forward: A -> store, B -> keep I/O only, C -> keep I/O only, D -> store
Backward: D backward, recompute C forward, C backward, recompute B forward, B backward, A backward
Memory: Only checkpoints + current activations
Memory savings: - Reduces activation memory from O(n) to O(sqrt(n)) with optimal partitioning - Typical: 50-70% reduction in peak memory - Cost: 20-30% increase in training time
from torch.utils.checkpoint import checkpoint, checkpoint_sequential

# Single module checkpointing
class MyModel(nn.Module):
    def forward(self, x):
        x = self.layer1(x)
        x = checkpoint(self.layer2, x)  # Recompute layer2 during backward
        x = self.layer3(x)
        return x

# Sequential checkpointing
model = nn.Sequential(layer1, layer2, layer3, layer4, layer5)
out = checkpoint_sequential(model, 2, x)  # 2 segments
Q: Gradient Checkpointing vs Gradient Accumulation¶
A:
| Technique | What it does | Trade-off | Use case |
|---|---|---|---|
| Checkpointing | Reduce activation memory | +compute time | Large model, small batch |
| Accumulation | Simulate large batch | +training steps | Small batch, want large effective batch |
# Gradient Accumulation
accumulation_steps = 4
for i, (x, y) in enumerate(dataloader):
    loss = model(x, y) / accumulation_steps
    loss.backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
Q: Pitfalls with Gradient Checkpointing¶
A:
1. RNG State: - Random operations (Dropout) may produce different values on recompute - Solution: Move dropout outside the checkpointed region (see the sketch after this list)
2. BatchNorm: - Running statistics updated twice - Solution: Don't checkpoint BatchNorm layers
3. In-place operations: - Break checkpointing contract - Solution: Avoid in-place ops in checkpointed modules
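A hedged sketch of how a module's forward might look with these pitfalls in mind (self.block and self.dropout are hypothetical attributes; use_reentrant=False selects the non-reentrant checkpoint variant available in recent PyTorch versions):

from torch.utils.checkpoint import checkpoint

def forward(self, x):
    # self.block contains no dropout/BatchNorm and no in-place ops
    x = checkpoint(self.block, x, use_reentrant=False)
    x = self.dropout(x)  # applied outside, so recomputation never re-samples the mask
    return x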
Q: What is the overhead of Gradient Checkpointing?¶
A:
Formula: Overhead ~ F_recompute / (F_total + B_total)
Typical values: - Transformers: 20-35% overhead - CNNs: 15-25% overhead
Benchmark (BERT-base, 16GB GPU):

| Config | Batch size | Peak memory | Time/step |
|---|---|---|---|
| No checkpoint | 24 | 15.2 GB | 0.42s |
| Checkpoint 50% | 64 | 14.8 GB | 0.53s |
| Checkpoint 100% | 96 | 14.1 GB | 0.68s |
Advanced Regularization¶
Q: What is Stochastic Depth (DropPath)?¶
A:
Stochastic Depth is a regularization technique for very deep networks (ResNet, Transformers) that randomly "switches off" entire residual blocks during training.
DropPath formula: $$\text{DropPath}(x, p) = \begin{cases} \frac{x}{1-p} & \text{if kept} \\ 0 & \text{if dropped} \end{cases}$$
where \(p\) is the drop probability for the given layer (scaled linearly from 0 up to the maximum with depth).
def drop_path(x, drop_prob=0., training=False):
    if drop_prob == 0. or not training:
        return x
    keep_prob = 1 - drop_prob
    shape = (x.shape[0],) + (1,) * (x.ndim - 1)
    random_tensor = keep_prob + torch.rand(shape, dtype=x.dtype, device=x.device)
    random_tensor.floor_()
    return x.div(keep_prob) * random_tensor

class DropPath(nn.Module):
    def __init__(self, drop_prob=0.):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, x):
        return drop_path(x, self.drop_prob, self.training)
Scheduled DropPath (NASNet): $$p_l = p_{max} \cdot \frac{l}{L}$$
where \(l\) is the layer index and \(L\) is the total number of layers.
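A small sketch of building the per-block schedule (depth and p_max values are illustrative):

# Per-block drop probabilities that grow linearly with depth
depth, p_max = 12, 0.1
drop_path_rates = [p_max * i / (depth - 1) for i in range(depth)]
# pass drop_path_rates[i] into the DropPath module of the i-th block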
Q: Mixup vs CutMix -- what's the difference?¶
A:
| Technique | Mix Strategy | Label Strategy | Best For |
|---|---|---|---|
| Mixup | Pixel interpolation | Linear interpolation | General classification |
| CutMix | Patch replacement | Area-weighted | Object localization |
| CutOut | Patch zeroing | Original label | Occlusion robustness |
Mixup formula: $$\tilde{x} = \lambda x_i + (1-\lambda) x_j, \qquad \tilde{y} = \lambda y_i + (1-\lambda) y_j$$
where \(\lambda \sim \text{Beta}(\alpha, \alpha)\).
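A minimal Mixup sketch (not from the source); the loss is then the lam-weighted sum over both target sets:

import numpy as np
import torch

def mixup(data, targets, alpha=1.0):
    # Sample the mixing coefficient and a random permutation of the batch
    lam = np.random.beta(alpha, alpha)
    indices = torch.randperm(data.size(0))
    mixed = lam * data + (1 - lam) * data[indices]
    return mixed, targets, targets[indices], lam

# loss = lam * criterion(output, targets_a) + (1 - lam) * criterion(output, targets_b)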
CutMix (a bounding-box region of \(x_i\) is replaced by the corresponding patch from \(x_j\)): $$\tilde{x}_{[bbx_1:bbx_2,\, bby_1:bby_2]} = (x_j)_{[bbx_1:bbx_2,\, bby_1:bby_2]}$$
import numpy as np
import torch

def cutmix(data, targets, alpha=1.0):
    # Returns patched inputs, both target sets, and the mixing weight lam
    indices = torch.randperm(data.size(0))
    shuffled_data = data[indices]
    shuffled_targets = targets[indices]
    lam = np.random.beta(alpha, alpha)
    bbx1, bby1, bbx2, bby2 = rand_bbox(data.size(), lam)  # rand_bbox: helper that samples the patch
    data[:, :, bbx1:bbx2, bby1:bby2] = shuffled_data[:, :, bbx1:bbx2, bby1:bby2]
    # Adjust lam to the actually replaced area
    lam = 1 - ((bbx2 - bbx1) * (bby2 - bby1) / (data.size(-1) * data.size(-2)))
    return data, targets, shuffled_targets, lam

# Loss is an area-weighted sum over both target sets:
# loss = lam * criterion(output, targets) + (1 - lam) * criterion(output, shuffled_targets)
Q: What is Label Smoothing?¶
A:
Label Smoothing is a regularization technique that "softens" one-hot labels, preventing overconfidence.
Formula: $$y'_k = y_k(1-\epsilon) + \frac{\epsilon}{K}$$
where \(\epsilon\) is the smoothing factor (typically 0.1) and \(K\) is the number of classes.
PyTorch CrossEntropy has built-in support:
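import torch.nn as nn

# label_smoothing is a native argument of CrossEntropyLoss since PyTorch 1.10
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)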
Q: When should each type of regularization be used?¶
A:
| Problem | Recommended Reg | Why |
|---|---|---|
| Overfitting small dataset | Mixup/CutMix + Label Smoothing | Augment data diversity |
| Very deep network | DropPath + Layer Decay | Enable training, reduce gradient issues |
| Vision tasks | CutMix + CutOut | Spatial robustness |
| NLP tasks | Dropout + Label Smoothing | Sequence-level regularization |
| Calibration needed | Label Smoothing + Temperature scaling | Better probability estimates |
Common configs (Vision Transformers):
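For reference, values in the spirit of DeiT-style ViT recipes (illustrative assumptions, not an exact published config):

# Illustrative values, roughly in line with DeiT-style recipes (assumption, not a spec)
vit_regularization = {
    "mixup_alpha": 0.8,
    "cutmix_alpha": 1.0,
    "label_smoothing": 0.1,
    "drop_path_rate": 0.1,  # scaled linearly over depth
    "dropout": 0.0,
}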