
DL Interview: Training and Optimization

~9 min read


Second-Order Optimization (Newton, BFGS, L-BFGS, Natural Gradient, AdaHessian), Mixed Precision Training (FP16/BF16, Loss Scaling, AMP), Gradient Debugging (Vanishing/Exploding, Clipping, Monitoring), Gradient Flow Analysis, Gradient Checkpointing, Advanced Regularization (DropPath, Mixup, CutMix, Label Smoothing).


Second-Order Optimization Methods

Q: Why do second-order methods converge faster?

A:

First-order (SGD): uses only the gradient -- the descent direction.

Second-order (Newton): uses the Hessian -- curvature information.

Newton update: \[\theta_{t+1} = \theta_t - H^{-1} \nabla L(\theta_t)\]

Advantage: adaptive step size based on curvature. - Steep curvature -> small steps - Flat curvature -> large steps

Convergence: - SGD: \(O(1/k)\) or \(O(1/\sqrt{k})\) - Newton: quadratic convergence -- \(e_{k+1} = O(e_k^2)\), the number of correct digits doubles each step
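
For intuition, a minimal sketch on an assumed toy quadratic (not from the source): on an ill-conditioned problem a single Newton step lands exactly on the minimizer, while a fixed-step gradient step barely moves along the flat direction.

import torch

# Toy quadratic L(theta) = 0.5 * theta^T A theta - b^T theta (ill-conditioned A)
A = torch.tensor([[100.0, 0.0], [0.0, 1.0]])
b = torch.tensor([1.0, 1.0])

theta = torch.zeros(2)
grad = A @ theta - b           # gradient of the quadratic
H = A                          # Hessian of a quadratic is constant

# Newton: one step reaches the exact minimizer A^{-1} b
theta_newton = theta - torch.linalg.solve(H, grad)

# Gradient descent with a curvature-limited lr crawls along the flat direction
theta_gd = theta - 0.01 * grad
print(theta_newton)  # tensor([0.0100, 1.0000])
print(theta_gd)      # tensor([0.0100, 0.0100])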

Q: Why don't we use Newton's method in deep learning?

A:

Problems: 1. Hessian size: \(O(D^2)\) for \(D\) parameters. GPT-3 (175B params) -> \(\sim 10^{22}\) entries! 2. Inverse computation: \(O(D^3)\) -- infeasible 3. Saddle points: Newton is attracted TO saddle points, not away from them 4. Non-convexity: the Hessian may not be positive definite

Solutions: - Quasi-Newton methods (BFGS, L-BFGS) - Hessian-free optimization - Natural gradient (Fisher information)

Q: BFGS vs L-BFGS -- what's the difference?

A:

BFGS (Broyden-Fletcher-Goldfarb-Shanno): Quasi-Newton method, approximates \(H^{-1}\).

Update rule: \[\theta_{t+1} = \theta_t - \alpha_t M_t \nabla L(\theta_t)\]

where \(M_t \approx H^{-1}\) (iteratively refined).

L-BFGS (Limited-memory BFGS): - Does not store the full \(M_t\) (too large!) - Keeps only the last \(m\) gradient differences - Memory: \(O(mD)\) instead of \(O(D^2)\)

| Method | Memory | When to use |
|--------|--------|-------------|
| BFGS | \(O(D^2)\) | Small problems (<10K params) |
| L-BFGS | \(O(mD)\) | Large problems, batch optimization |
| SGD/Adam | \(O(D)\) | Deep learning, stochastic |
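
PyTorch ships an L-BFGS implementation (torch.optim.LBFGS) intended for full-batch problems; it requires a closure that re-evaluates the loss. A minimal sketch on an assumed toy least-squares problem:

import torch

# Toy full-batch least-squares problem (assumed data)
X, y = torch.randn(100, 10), torch.randn(100, 1)
w = torch.zeros(10, 1, requires_grad=True)

optimizer = torch.optim.LBFGS([w], lr=1.0, max_iter=20, history_size=10)

def closure():
    # L-BFGS may evaluate the objective several times per step
    optimizer.zero_grad()
    loss = ((X @ w - y) ** 2).mean()
    loss.backward()
    return loss

for _ in range(5):
    optimizer.step(closure)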

Q: Natural Gradient -- what's the idea?

A:

Problem: the standard gradient ignores the geometry of the parameter space.

Natural gradient: the gradient in the space of probability distributions, not of parameters.

\[\tilde{\nabla} L = F^{-1} \nabla L\]

where \(F\) is the Fisher Information Matrix.

Intuition: Direction of steepest descent in distribution space.

Properties: - Invariant to reparameterization - Faster convergence на ill-conditioned problems - Expensive: \(F^{-1}\) is \(O(D^2)\)

Approximations: - K-FAC (Kronecker-factored Approximate Curvature) - Adam as diagonal natural gradient approximation

Q: Conjugate Gradient Method -- when to use it?

A:

Idea: Find directions conjugate to previous directions, avoid redundant exploration.

Conjugate directions: \(d_i^T H d_j = 0\) for \(i \neq j\)

Algorithm: \[d_t = -g_t + \beta_t d_{t-1}\]

\[\theta_{t+1} = \theta_t + \alpha_t d_t\]

\(\beta_t\) computation (Fletcher-Reeves): \[\beta_t = \frac{g_t^T g_t}{g_{t-1}^T g_{t-1}}\]

Pros: - No Hessian storage - Guaranteed convergence in <= D steps for quadratic - Works well for large-scale optimization

Cons: - Designed for convex/quadratic problems - Requires exact line search (or good approximation) - Less effective for highly non-convex (deep learning)
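
For illustration, a minimal linear conjugate gradient loop on an assumed toy quadratic (SPD matrix, not from the source); for a quadratic it terminates in at most \(D\) iterations:

import torch

def conjugate_gradient(A, b, steps=None, tol=1e-10):
    # Minimizes 0.5 x^T A x - b^T x, i.e. solves A x = b, for SPD A
    x = torch.zeros_like(b)
    r = b - A @ x          # residual = negative gradient
    d = r.clone()          # first direction = steepest descent
    for _ in range(steps or b.numel()):
        Ad = A @ d
        alpha = (r @ r) / (d @ Ad)        # exact line search along d
        x = x + alpha * d
        r_new = r - alpha * Ad
        if r_new.norm() < tol:
            break
        beta = (r_new @ r_new) / (r @ r)  # Fletcher-Reeves
        d = r_new + beta * d              # new direction, conjugate to the previous ones
        r = r_new
    return x

A = torch.tensor([[4.0, 1.0], [1.0, 3.0]])
b = torch.tensor([1.0, 2.0])
print(conjugate_gradient(A, b))  # solution of A x = b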

Q: AdaHessian -- what is it?

A:

Idea: Adam + diagonal Hessian approximation for adaptive learning.

Hessian diagonal estimation: Use Hutchinson's method with random vectors.

\[\text{diag}(H) \approx \mathbb{E}[z \odot Hz]\]

where \(z \sim\) Rademacher distribution (±1).

AdaHessian update: \[m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t\]

\[v_t = \beta_2 v_{t-1} + (1-\beta_2) \text{diag}(H_t)^2\]
\[\theta_{t+1} = \theta_t - \alpha \frac{m_t}{\sqrt{v_t} + \epsilon}\]

Advantage over Adam: Uses curvature information, better on ill-conditioned problems.
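
A minimal sketch of the Hutchinson diagonal estimate via a Hessian-vector product (assumed toy linear model, single probe vector); this is the quantity AdaHessian feeds into its Adam-style second moment:

import torch

model = torch.nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = torch.nn.functional.mse_loss(model(x), y)

params = [p for p in model.parameters() if p.requires_grad]
grads = torch.autograd.grad(loss, params, create_graph=True)  # keep graph for the Hessian-vector product

# Rademacher probe vectors z in {-1, +1}
zs = [torch.randint(0, 2, p.shape, dtype=p.dtype) * 2 - 1 for p in params]

# Hessian-vector product Hz = d(g^T z)/dtheta via double backprop
Hz = torch.autograd.grad(grads, params, grad_outputs=zs)

# diag(H) ~ E[z * Hz]; in practice averaged over several z samples
hess_diag = [z * hz for z, hz in zip(zs, Hz)]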


Mixed Precision Training (FP16/BF16)

Sources: BuildAI: Mixed Precision Training (2025), RunPod: FP16/BF16/FP8 Guide (2025)

Q: What is Mixed Precision Training?

A:

Idea: Use lower precision (FP16/BF16) for most operations, keep FP32 for critical parts.

Motivation: - 2x memory savings: FP16 = 2 bytes vs FP32 = 4 bytes per parameter - 2-4x speedup: Tensor cores (H100: ~989 TFLOPS FP16 vs 67 TFLOPS FP32) - Larger batches/models: fit more in GPU memory

Key insight: Neural networks are precision-resilient -- small numerical errors don't hurt learning.

Q: FP16 vs BF16 -- what's the difference?

A:

| Property | FP32 | FP16 | BF16 |
|----------|------|------|------|
| Bits | 32 | 16 | 16 |
| Exponent | 8 | 5 | 8 |
| Mantissa | 23 | 10 | 7 |
| Max value | 3.4e38 | 65504 | 3.4e38 |
| Min positive | 1.2e-38 | 6.1e-5 | 1.2e-38 |
| Precision | High | Medium | Low |

FP16: higher precision, smaller range -> gradient underflow risk
BF16: same range as FP32, lower precision -> safer for LLMs

Recommendation: BF16 preferred for modern training (TPU, Ampere+ GPUs).
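
The range/precision numbers above can be checked directly with torch.finfo:

import torch

for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(dtype, "max:", info.max, "min normal:", info.tiny, "eps:", info.eps)

# FP16 underflows where BF16 does not
print(torch.tensor(1e-5, dtype=torch.float16))   # ~1e-5 (still representable as subnormal)
print(torch.tensor(1e-8, dtype=torch.float16))   # 0. -- underflow
print(torch.tensor(1e-8, dtype=torch.bfloat16))  # ~1e-8 -- fine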

Q: Why is Loss Scaling needed?

A:

Problem: gradients are often very small (< 6.1e-5) and underflow to zero in FP16.

Solution: Multiply loss by large constant before backward, divide gradients after:

# Conceptual
scaled_loss = loss * scale  # e.g., scale = 65536
scaled_loss.backward()
grad = grad / scale
optimizer.step()

Dynamic scaling (PyTorch AMP):

from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler(init_scale=65536)

with autocast():
    output = model(x)
    loss = criterion(output, y)

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()  # Auto-adjusts scale

GradScaler behavior: - If inf/nan in gradients -> skip step, reduce scale - If no overflow for growth_interval steps -> increase scale

Q: PyTorch AMP -- how to use it?

A:

from torch.cuda.amp import autocast, GradScaler
import torch.nn as nn

model = MyLargeModel().cuda()
optimizer = torch.optim.Adam(model.parameters())
scaler = GradScaler()

for batch, targets in dataloader:
    batch, targets = batch.cuda(), targets.cuda()

    optimizer.zero_grad()

    # Forward pass with autocast
    with autocast():  # Auto FP16 for matmul, FP32 for loss
        output = model(batch)
        loss = nn.CrossEntropyLoss()(output, targets)

    # Scaled backward
    scaler.scale(loss).backward()

    # Unscale for gradient clipping
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    # Step with scaling
    scaler.step(optimizer)
    scaler.update()

What autocast does: - FP16 ops: matmul, conv (whitelisted) - FP32 ops: loss, softmax, batchnorm (blacklisted for stability)

Q: When does Mixed Precision work poorly?

A:

Problem cases: 1. Very small gradients -> even with scaling, may underflow 2. Numerically sensitive ops -> may need FP32 3. Old GPUs -> no tensor cores, no BF16 support 4. Mixed batch norm + small batches -> unstable

Solutions: - Use BF16 if available (Ampere+, TPU) - Keep batch norm in FP32 - Use torch.set_float32_matmul_precision('high') for newer GPUs - Monitor for inf/nan in training
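
If BF16 is available, loss scaling is usually unnecessary because the exponent range matches FP32; a minimal sketch with the generic torch.autocast API (model, dataloader, criterion, optimizer as in the AMP example above):

import torch

# BF16: same exponent range as FP32, so GradScaler is typically not needed
for batch, targets in dataloader:
    batch, targets = batch.cuda(), targets.cuda()
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        output = model(batch)
        loss = criterion(output, targets)
    loss.backward()
    optimizer.step()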

Debug checklist:

# Check for gradient issues
for name, param in model.named_parameters():
    if param.grad is not None:
        if torch.isnan(param.grad).any():
            print(f"NaN gradient in {name}")
        if torch.isinf(param.grad).any():
            print(f"Inf gradient in {name}")


Debugging Neural Networks: Gradient Issues

Sources: Neptune.ai: Vanishing/Exploding Gradients (2025)

Q: Vanishing vs Exploding Gradients -- what's the difference?

A:

| Problem | Cause | Symptom | Solution |
|---------|-------|---------|----------|
| Vanishing | \(\|W\| < 1\), sigmoid derivatives | Early layers don't learn, loss plateaus | ReLU, BatchNorm, skip connections |
| Exploding | \(\|W\| > 1\), deep networks | Loss spikes, NaN, divergence | Gradient clipping, smaller LR |

Chain rule insight: \[\frac{\partial L}{\partial h_1} = \frac{\partial L}{\partial h_L} \prod_{l=2}^{L} \frac{\partial h_l}{\partial h_{l-1}}\] where \(h_l\) is the activation of layer \(l\).

If each term < 1 -> gradient decays exponentially with depth.
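
A quick way to see the effect, on an assumed toy deep sigmoid MLP (not from the source): early-layer gradient norms come out orders of magnitude smaller than late-layer norms.

import torch
import torch.nn as nn

# 20 sigmoid layers: each sigmoid derivative is <= 0.25, so the product shrinks fast
layers = []
for _ in range(20):
    layers += [nn.Linear(64, 64), nn.Sigmoid()]
model = nn.Sequential(*layers)

loss = model(torch.randn(8, 64)).pow(2).mean()
loss.backward()

print("first layer grad norm:", f"{model[0].weight.grad.norm().item():.2e}")
print("last layer grad norm: ", f"{model[-2].weight.grad.norm().item():.2e}")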

Q: How to detect gradient issues?

A:

1. Monitor gradient norms per layer:

def log_gradient_norms(model, step, logger):
    for name, param in model.named_parameters():
        if param.grad is not None:
            grad_norm = param.grad.norm().item()
            logger.log({f"grad_norm/{name}": grad_norm}, step=step)

2. Signs of vanishing gradients: - Early layer norms ~ 0 while later layers normal - Loss decreases slowly or plateaus - Model only learns shallow features

3. Signs of exploding gradients: - Loss becomes NaN - Gradient norms spike suddenly - Weights become very large

4. Visual check:

grad_norms = []
for name, param in model.named_parameters():
    if param.grad is not None:
        grad_norms.append((name, param.grad.norm().item()))

# Sort by norm
grad_norms.sort(key=lambda x: x[1], reverse=True)
for name, norm in grad_norms[:10]:
    print(f"{name}: {norm:.2e}")

Q: Gradient Clipping -- how and when to use it?

A:

Gradient clipping limits the gradient norm to max_norm:

# By norm (most common)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# By value (clip each parameter)
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)

Formula: If \(\|g\|_2 > c\): \[g \leftarrow g \cdot \frac{c}{\|g\|_2}\]

When to use: - RNNs/LSTMs (almost always) - Large models with potential instabilities - Training with large learning rates - When you see occasional loss spikes

Choosing max_norm: - Start with 1.0 - Monitor: if clipping happens >50% of steps -> increase - If still getting instabilities -> decrease
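
One way to monitor the clipping rate: clip_grad_norm_ returns the total norm computed before clipping. A fragment for the training loop (the counters are assumed to be initialized to 0 outside the loop):

# After loss.backward() in the training loop
max_norm = 1.0
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)

total_steps += 1
if total_norm > max_norm:   # clip_grad_norm_ returns the pre-clipping norm
    clipped_steps += 1

if total_steps % 100 == 0:
    print(f"clipping rate: {clipped_steps / total_steps:.0%}")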

Q: How to stabilize training of a deep network?

A:

1. Weight initialization:

# Xavier (Glorot) for tanh/sigmoid
nn.init.xavier_uniform_(layer.weight)

# He initialization for ReLU
nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')

2. Normalization layers: - BatchNorm: normalizes across batch - LayerNorm: normalizes across features (transformers) - GroupNorm: middle ground

3. Activation functions:

| Activation | Pros | Cons |
|------------|------|------|
| ReLU | Fast, no vanishing (positive) | Dying neurons |
| LeakyReLU | No dying neurons | Extra hyperparameter |
| GELU | Smoother, better for transformers | Slightly slower |
| Swish | Self-gated, deep networks | Slower |

4. Learning rate warmup:

def get_lr(step, warmup_steps, d_model):
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

5. Skip connections (ResNet): \[y = F(x) + x\]

Gradients flow directly through identity mapping.
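
A minimal residual block sketch showing the identity path:

import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x):
        # The identity path adds + I to the Jacobian, so gradients never vanish entirely
        return x + self.body(x)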

Q: Production debugging checklist?

A:

class TrainingMonitor:
    def __init__(self, model):
        self.model = model
        self.grad_history = []

    def check_gradients(self):
        total_norm = 0.0
        nan_count = 0
        inf_count = 0

        for name, param in self.model.named_parameters():
            if param.grad is not None:
                if torch.isnan(param.grad).any():
                    print(f"[WARNING] NaN in {name}")
                    nan_count += 1
                if torch.isinf(param.grad).any():
                    print(f"[WARNING] Inf in {name}")
                    inf_count += 1

                param_norm = param.grad.norm().item()
                total_norm += param_norm ** 2
                self.grad_history.append((name, param_norm))

        total_norm = total_norm ** 0.5
        return {
            'total_norm': total_norm,
            'nan_count': nan_count,
            'inf_count': inf_count,
            'status': 'OK' if nan_count == 0 and inf_count == 0 else 'ISSUE'
        }

Alert thresholds: - Gradient norm > 100 -> potential explosion - Gradient norm < 1e-7 -> potential vanishing - Any NaN/Inf -> immediate investigation
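
A possible way to wire the monitor above into a training loop (dataloader, criterion, optimizer assumed from context):

monitor = TrainingMonitor(model)

for batch, targets in dataloader:
    optimizer.zero_grad()
    loss = criterion(model(batch), targets)
    loss.backward()

    report = monitor.check_gradients()
    if report['status'] != 'OK' or report['total_norm'] > 100:
        print(f"Skipping step: {report}")
        optimizer.zero_grad()  # drop the suspicious gradients
        continue

    optimizer.step()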


Gradient Flow Analysis

Q: Gradient explosion -- how to detect it in PyTorch?

A:

L2 Gradient Norm:

def compute_gradient_norm(model):
    total_norm = 0
    for p in model.parameters():
        if p.grad is not None:
            param_norm = p.grad.data.norm(2)
            total_norm += param_norm.item() ** 2
    return total_norm ** 0.5

# In training loop
loss.backward()
grad_norm = compute_gradient_norm(model)
if grad_norm > 100:
    print(f"Warning: Large gradient norm = {grad_norm}")

Signs of explosion: - Norm > 100 -- possibly the start of problems - Norm > 1000 -- definitely an explosion - Rapid growth between iterations

Q: Gradient accumulation -- why and how?

A:

Gradient accumulation allows an effective batch size larger than what fits in GPU memory.

Problem: a large batch improves training but doesn't fit in GPU memory.

accumulation_steps = 4  # Effective batch = batch_size * 4
optimizer.zero_grad()   # Reset ONLY at start of accumulation

for i, (inputs, labels) in enumerate(dataloader):
    outputs = model(inputs)
    loss = criterion(outputs, labels) / accumulation_steps  # Scale loss!
    loss.backward()  # Accumulate gradients

    if (i + 1) % accumulation_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        optimizer.zero_grad()  # Reset for next accumulation

Critical: divide the loss by accumulation_steps so the gradient magnitude stays correct.
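
Gradient accumulation composes with the AMP GradScaler from the mixed-precision section; a sketch under the same assumptions (model, criterion, dataloader, optimizer from context):

import torch

accumulation_steps = 4
scaler = torch.cuda.amp.GradScaler()
optimizer.zero_grad()

for i, (inputs, labels) in enumerate(dataloader):
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = criterion(model(inputs), labels) / accumulation_steps
    scaler.scale(loss).backward()      # accumulate scaled gradients

    if (i + 1) % accumulation_steps == 0:
        scaler.unscale_(optimizer)     # unscale once, then clip
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()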

Q: Diagnosing gradient flow problems -- a checklist?

A:

class GradientMonitor:
    def __init__(self, model, log_freq=100):
        self.model = model
        self.log_freq = log_freq
        self.step = 0

    def log_gradients(self):
        if self.step % self.log_freq != 0:
            self.step += 1
            return

        stats = {}
        for name, param in self.model.named_parameters():
            if param.grad is not None:
                grad = param.grad.data
                stats[name] = {
                    'norm': grad.norm().item(),
                    'max': grad.max().item(),
                    'min': grad.min().item(),
                    'mean': grad.mean().item(),
                    'has_nan': torch.isnan(grad).any().item(),
                    'has_inf': torch.isinf(grad).any().item(),
                }

        total_norm = sum(s['norm']**2 for s in stats.values())**0.5
        max_grad = max(s['max'] for s in stats.values())
        min_grad = min(s['min'] for s in stats.values())

        print(f"Step {self.step}:")
        print(f"  Total norm: {total_norm:.4f}")
        print(f"  Max: {max_grad:.4f}, Min: {min_grad:.4f}")

        if total_norm > 100:
            print(f"  WARNING: Large gradient norm: {total_norm}")
        if any(s['has_nan'] or s['has_inf'] for s in stats.values()):
            print(f"  CRITICAL: NaN/Inf detected!")

        self.step += 1
        return stats

Diagnostic table:

| Symptom | Possible Cause | Solution |
|---------|----------------|----------|
| Norm > 100 consistently | LR too high | Reduce LR, add clipping |
| Norm grows over time | Learning rate schedule | Add warmup, decay |
| NaN/Inf | Numerical instability | FP32, gradient clipping |
| Norm ~ 0 | Vanishing gradients | Better init, skip connections |
| Spikes in norm | Bad batch | Data validation, gradient clipping |

Q: How to prevent gradient explosion architecturally?

A:

1. Proper Initialization:

import torch.nn.init as init

# Xavier/Glorot for tanh
init.xavier_uniform_(layer.weight)

# He/Kaiming for ReLU
init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')

2. Skip Connections: \[\mathbf{y} = f(\mathbf{x}) + \mathbf{x}\]

The gradient flows directly through the identity: \(\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \frac{\partial f}{\partial \mathbf{x}} + \mathbf{I}\)

3. Pre-LN vs Post-LN:

# Pre-LN (more stable for deep Transformers)
x = x + attention(layer_norm(x))
x = x + ffn(layer_norm(x))

# Post-LN (original, less stable)
x = layer_norm(x + attention(x))
x = layer_norm(x + ffn(x))
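
A minimal Pre-LN block as an nn.Module (dimensions and attention module choice are assumptions, not from the source):

import torch.nn as nn

class PreLNBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        # Normalization happens inside the residual branch; the skip path stays untouched
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.ffn(self.ln2(x))
        return x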


Gradient Checkpointing & Activation Recomputation

Q: What is Gradient Checkpointing?

A:

Core idea: Trade compute for memory -- don't store all activations during forward pass, recompute them during backward.

Normal training:

Forward:  A -> store, B -> store, C -> store, D -> store
Backward: Use stored activations for D, C, B, A
Memory:   All activations in memory

With checkpointing (B and C checkpointed):

Forward:  A -> store, B -> keep I/O only, C -> keep I/O only, D -> store
Backward: D backward, recompute C forward, C backward, recompute B forward, B backward, A backward
Memory:   Only checkpoints + current activations

Memory savings: - Reduces activation memory from O(n) to O(sqrt(n)) with optimal partitioning - Typical: 50-70% reduction in peak memory - Cost: 20-30% increase in training time

from torch.utils.checkpoint import checkpoint, checkpoint_sequential

# Single module checkpointing
class MyModel(nn.Module):
    def forward(self, x):
        x = self.layer1(x)
        x = checkpoint(self.layer2, x)  # Recompute layer2 during backward
        x = self.layer3(x)
        return x

# Sequential checkpointing
model = nn.Sequential(layer1, layer2, layer3, layer4, layer5)
out = checkpoint_sequential(model, 2, x)  # 2 segments

Q: Gradient Checkpointing vs Gradient Accumulation

A:

| Technique | What it does | Trade-off | Use case |
|-----------|--------------|-----------|----------|
| Checkpointing | Reduce activation memory | +compute time | Large model, small batch |
| Accumulation | Simulate large batch | +training steps | Small batch, want large effective batch |

# Gradient Accumulation
accumulation_steps = 4
for i, (x, y) in enumerate(dataloader):
    loss = model(x, y) / accumulation_steps
    loss.backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

Q: Pitfalls with Gradient Checkpointing

A:

1. RNG State: - Random operations (Dropout) may produce different values on recompute - Solution: Move dropout outside checkpointed region

2. BatchNorm: - Running statistics updated twice - Solution: Don't checkpoint BatchNorm layers

3. In-place operations: - Break checkpointing contract - Solution: Avoid in-place ops in checkpointed modules
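
A sketch of the dropout workaround: keep the random op outside the checkpointed region (use_reentrant=False selects the non-reentrant checkpoint implementation available in recent PyTorch versions):

import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.heavy = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.dropout = nn.Dropout(0.1)   # random op kept outside the checkpointed region

    def forward(self, x):
        # Only the deterministic heavy part is recomputed during backward
        h = checkpoint(self.heavy, x, use_reentrant=False)
        return self.dropout(h)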

Q: What is the overhead of Gradient Checkpointing?

A:

Formula: Overhead ~ F_recompute / (F_total + B_total)

Typical values: - Transformers: 20-35% overhead - CNNs: 15-25% overhead

Benchmark (BERT-base, 16GB GPU):

| Config | Batch size | Peak memory | Time/step |
|--------|------------|-------------|-----------|
| No checkpoint | 24 | 15.2 GB | 0.42s |
| Checkpoint 50% | 64 | 14.8 GB | 0.53s |
| Checkpoint 100% | 96 | 14.1 GB | 0.68s |


Advanced Regularization

Q: What is Stochastic Depth (DropPath)?

A:

Stochastic Depth is a regularization technique for very deep networks (ResNet, Transformers) that randomly "switches off" entire residual blocks during training.

DropPath formula: \[\text{DropPath}(x, p) = \begin{cases} \frac{x}{1-p} & \text{if kept} \\ 0 & \text{if dropped} \end{cases}\]

where \(p\) is the drop probability for the given layer (scaled linearly with depth from 0 up to the maximum).

def drop_path(x, drop_prob=0., training=False):
    if drop_prob == 0. or not training:
        return x
    keep_prob = 1 - drop_prob
    shape = (x.shape[0],) + (1,) * (x.ndim - 1)
    random_tensor = keep_prob + torch.rand(shape, dtype=x.dtype, device=x.device)
    random_tensor.floor_()
    return x.div(keep_prob) * random_tensor

class DropPath(nn.Module):
    def __init__(self, drop_prob=0.):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, x):
        return drop_path(x, self.drop_prob, self.training)

Scheduled DropPath (NASNet): \[p_l = p_{max} \cdot \frac{l}{L}\]

where \(l\) is the layer index and \(L\) is the total number of layers.
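
Per-layer rates are usually built as a linear ramp over depth (a common ViT-style recipe); a sketch reusing the DropPath module defined above:

import torch

L = 12        # total residual blocks
p_max = 0.1   # drop probability at the deepest block

# Linear ramp: shallow blocks are almost never dropped, deep blocks most often
drop_rates = torch.linspace(0, p_max, steps=L).tolist()
blocks = [DropPath(drop_prob=p) for p in drop_rates]
print([round(p, 3) for p in drop_rates])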

Q: Mixup vs CutMix -- what's the difference?

A:

| Technique | Mix Strategy | Label Strategy | Best For |
|-----------|--------------|----------------|----------|
| Mixup | Pixel interpolation | Linear interpolation | General classification |
| CutMix | Patch replacement | Area-weighted | Object localization |
| CutOut | Patch zeroing | Original label | Occlusion robustness |

Mixup formula: \[\tilde{x} = \lambda x_i + (1-\lambda) x_j\]

\[\tilde{y} = \lambda y_i + (1-\lambda) y_j\]

where \(\lambda \sim \text{Beta}(\alpha, \alpha)\)
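
A minimal Mixup sketch matching the formula (model and criterion in the usage comment are placeholders):

import numpy as np
import torch

def mixup(data, targets, alpha=0.8):
    # Sample the mixing ratio and a random partner for each sample
    lam = np.random.beta(alpha, alpha)
    indices = torch.randperm(data.size(0))
    mixed = lam * data + (1 - lam) * data[indices]
    return mixed, targets, targets[indices], lam

# In the training loop:
# mixed, y_a, y_b, lam = mixup(x, y)
# output = model(mixed)
# loss = lam * criterion(output, y_a) + (1 - lam) * criterion(output, y_b)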

CutMix: \[\tilde{x}_{[bbx_1:bbx_2,\, bby_1:bby_2]} = x_j\]

\[\tilde{y} = \lambda y_i + (1-\lambda) y_j\]

def cutmix(data, targets, alpha=1.0):
    # Pick a random mixing partner for each sample
    indices = torch.randperm(data.size(0))
    shuffled_data = data[indices]
    shuffled_targets = targets[indices]

    # Sample mixing ratio and paste a random patch from the partner image
    lam = np.random.beta(alpha, alpha)
    bbx1, bby1, bbx2, bby2 = rand_bbox(data.size(), lam)  # bbox helper from the CutMix paper
    data[:, :, bbx1:bbx2, bby1:bby2] = shuffled_data[:, :, bbx1:bbx2, bby1:bby2]

    # Recompute lambda from the actual patch area
    lam = 1 - ((bbx2 - bbx1) * (bby2 - bby1) / (data.size(-1) * data.size(-2)))
    return data, targets, shuffled_targets, lam

# In the training loop:
# data, targets, shuffled_targets, lam = cutmix(x, y)
# output = model(data)
# loss = lam * criterion(output, targets) + (1 - lam) * criterion(output, shuffled_targets)

Q: What is Label Smoothing?

A:

Label Smoothing is a regularization technique that "softens" one-hot labels, preventing overconfidence.

Formula: \[y'_k = y_k(1-\epsilon) + \frac{\epsilon}{K}\]

where \(\epsilon\) is the smoothing factor (usually 0.1) and \(K\) is the number of classes.

One-hot:     [0, 0, 1, 0, 0]  (hard)
Smoothed:    [0.02, 0.02, 0.92, 0.02, 0.02]  (for epsilon=0.1, K=5)

PyTorch CrossEntropy has built-in support:

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
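
A quick sanity check that the built-in option matches the formula above (toy logits assumed):

import torch
import torch.nn.functional as F

K, eps = 5, 0.1
logits = torch.randn(1, K)
target = torch.tensor([2])

# Manual: cross-entropy against the smoothed target distribution y'_k
smoothed = torch.full((1, K), eps / K)
smoothed[0, target] = 1 - eps + eps / K
manual = -(smoothed * F.log_softmax(logits, dim=-1)).sum()

builtin = F.cross_entropy(logits, target, label_smoothing=eps)
print(manual.item(), builtin.item())  # should agree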

Q: When to use which type of regularization?

A:

| Problem | Recommended Reg | Why |
|---------|-----------------|-----|
| Overfitting small dataset | Mixup/CutMix + Label Smoothing | Augment data diversity |
| Very deep network | DropPath + Layer Decay | Enable training, reduce gradient issues |
| Vision tasks | CutMix + CutOut | Spatial robustness |
| NLP tasks | Dropout + Label Smoothing | Sequence-level regularization |
| Calibration needed | Label Smoothing + Temperature scaling | Better probability estimates |

Common configs (Vision Transformers):

# ViT-B/16 typical regularization
drop_path_rate = 0.1      # Stochastic depth
drop_rate = 0.0           # Dropout (usually 0 for ViT)
label_smoothing = 0.1     # Label smoothing
mixup_alpha = 0.8         # Mixup
cutmix_alpha = 1.0        # CutMix