DL Interview: Fundamentals

~6 min read

Bayesian Neural Networks, Loss Functions, Backpropagation, Optimizers, Weight Initialization, Normalization. According to 2025 ML interview statistics (Blind, levels.fyi), backpropagation, optimizers, and normalization are among the top-5 DL topics.


Bayesian Neural Networks & Uncertainty

Q: Why do we need Bayesian Neural Networks?

A:

Problem with point estimates:
- Standard NNs give a point prediction with no uncertainty
- Overconfident predictions on out-of-distribution data
- Unsuitable for safety-critical systems (medical, autonomous driving)

BNN solution:
- Weights are distributions, not fixed values
- Prediction = average over the weight posterior
- Natural uncertainty quantification

\[P(y|x,D) = \int P(y|x,w) P(w|D) dw\]

Q: Epistemic vs Aleatoric Uncertainty

A:

| Type | Source | Reducible? | Example |
|---|---|---|---|
| Epistemic | Model ignorance | Yes (more data) | New region of input space |
| Aleatoric | Data noise | No | Measurement noise |

In practice:
- Epistemic: Different models give different predictions (ensemble variance)
- Aleatoric: All models agree but prediction uncertain (inherent noise)

Code pattern:

import torch

# Epistemic: MC Dropout / Deep Ensemble -- variance across stochastic forward passes
predictions = torch.stack([model(x) for _ in range(n_samples)])
epistemic = predictions.var(dim=0)  # Model uncertainty

# Aleatoric: the network predicts a variance for each input
mean, var = model(x)  # Model outputs mean and variance
aleatoric = var.mean()  # Data uncertainty

Q: Variational Inference in BNNs -- how does it work?

A:

Goal: approximate the posterior \(P(w|D)\) with a tractable distribution \(q_\theta(w)\).

ELBO (Evidence Lower Bound): \[\mathcal{L} = \mathbb{E}_{q(w)}[\log P(D|w)] - D_{KL}(q(w) \| P(w))\]

Trade-off:
- First term: fit data (likelihood)
- Second term: stay close to prior (regularization)

Practical implementations:
1. Mean-field VI: Factorized Gaussian \(q(w) = \prod_i N(\mu_i, \sigma_i^2)\)
2. Bayes by Backprop: Reparameterization trick for gradients (see the sketch below)
3. MC Dropout: Dropout at inference = variational approximation
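
A minimal Bayes by Backprop sketch, assuming a single mean-field Gaussian layer; the class name BayesianLinear and the fixed prior_std are illustrative, not taken from a specific library.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesianLinear(nn.Module):
    """Mean-field Gaussian posterior over weights, sampled via the reparameterization trick."""
    def __init__(self, in_features, out_features, prior_std=1.0):
        super().__init__()
        self.w_mu = nn.Parameter(torch.zeros(out_features, in_features))
        self.w_rho = nn.Parameter(torch.full((out_features, in_features), -3.0))  # sigma = softplus(rho)
        self.prior_std = prior_std

    def forward(self, x):
        sigma = F.softplus(self.w_rho)
        w = self.w_mu + sigma * torch.randn_like(sigma)          # w = mu + sigma * eps
        # KL(q(w) || P(w)) between diagonal Gaussians: the regularization term of the ELBO
        self.kl = (torch.log(self.prior_std / sigma)
                   + (sigma ** 2 + self.w_mu ** 2) / (2 * self.prior_std ** 2) - 0.5).sum()
        return F.linear(x, w)

The training loss is then the negative log-likelihood plus the accumulated KL terms, i.e. the negative ELBO above.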

Q: MC Dropout for uncertainty estimation

A:

Key insight (Gal & Ghahramani): Dropout at inference ~ approximate variational inference.

Procedure:

def predict_with_uncertainty(model, x, n_samples=100):
    model.train()  # Enable dropout at inference!
    predictions = []
    for _ in range(n_samples):
        pred = model(x)
        predictions.append(pred)
    predictions = torch.stack(predictions)
    mean = predictions.mean(dim=0)
    std = predictions.std(dim=0)  # Uncertainty
    return mean, std

Why model.train(): dropout must stay active during inference (in practice, only the dropout layers are usually set to train mode, so BatchNorm keeps its running statistics).

Pros: Easy to implement, works with existing models.
Cons: Approximate, may underestimate uncertainty.

Q: Deep Ensembles for uncertainty

A:

Method: Train \(M\) independent models with different random seeds.

\[\text{Uncertainty} = \frac{1}{M}\sum_{m=1}^{M} (f_m(x) - \bar{f}(x))^2\]

vs MC Dropout:

| Aspect | Deep Ensembles | MC Dropout |
|---|---|---|
| Training cost | M x full training | 1x full training |
| Inference cost | M x forward pass | N x forward pass |
| Uncertainty quality | Better | Good |
| Implementation | Easy | Easiest |

Best practice: Combine both -- ensemble of MC dropout models.
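
A minimal sketch of ensemble prediction implementing the variance formula above, assuming models is a list of M independently trained networks (the function name is illustrative).

import torch

def ensemble_predict(models, x):
    with torch.no_grad():
        preds = torch.stack([m(x) for m in models])      # (M, batch, ...)
    mean = preds.mean(dim=0)                              # ensemble prediction, \bar{f}(x)
    uncertainty = preds.var(dim=0, unbiased=False)        # 1/M spread across members = epistemic uncertainty
    return mean, uncertainty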

Q: Calibration -- predicted probability vs actual accuracy

A:

Calibrated model: if it predicts 80% confidence, the accuracy on those examples is ~80%.

Expected Calibration Error (ECE): \[ECE = \sum_{m=1}^{M} \frac{|B_m|}{n} |\text{acc}(B_m) - \text{conf}(B_m)|\]
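
A sketch of ECE with equal-width bins, assuming confidences and correct are NumPy arrays of per-example max probabilities and 0/1 correctness (the function name is illustrative).

import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            acc = correct[mask].mean()            # acc(B_m)
            conf = confidences[mask].mean()       # conf(B_m)
            ece += mask.mean() * abs(acc - conf)  # weighted by |B_m| / n
    return ece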

BNN advantage: naturally better calibrated because of explicit uncertainty.

Post-hoc calibration:
- Temperature scaling: \(p' = \text{softmax}(z/T)\)
- Optimize T on validation set (see the sketch below)
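
A temperature scaling sketch: fit a single scalar T on validation logits by minimizing NLL. L-BFGS is a common choice here; names and hyperparameters are illustrative.

import torch
import torch.nn.functional as F

def fit_temperature(val_logits, val_labels):
    log_t = torch.zeros(1, requires_grad=True)               # optimize log T so that T stays positive
    optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=100)

    def closure():
        optimizer.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_t.exp().item()   # at inference: softmax(z / T)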


Loss Functions

Q: Why cross-entropy instead of MSE for classification?

A:

MSE + sigmoid problem: \[\frac{\partial L_{MSE}}{\partial z} = (\sigma(z)-y) \cdot \sigma(z)(1-\sigma(z))\]

When the sigmoid saturates (\(|z| \gg 0\)): \(\sigma(z)(1-\sigma(z)) \to 0\) -- vanishing gradient!

Cross-entropy solution: \[\frac{\partial L_{CE}}{\partial z} = \sigma(z) - y\]

The gradient is proportional to the error and does not vanish.
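
A tiny numeric check of the two gradients in the saturated regime (the values are illustrative).

import torch

z = torch.tensor([6.0], requires_grad=True)   # deep in the saturated region of sigmoid
y = torch.tensor([0.0])                        # confident but wrong prediction

mse = (torch.sigmoid(z) - y).pow(2).mean()
mse.backward()
print(z.grad)                                  # ~0.005: gradient has almost vanished

z.grad = None
ce = torch.nn.functional.binary_cross_entropy_with_logits(z, y)
ce.backward()
print(z.grad)                                  # ~1.0: proportional to the error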

Q: When should you use Focal Loss?

A: Focal Loss is for imbalanced classification.

\[L_{focal} = -\alpha_t (1-p_t)^\gamma \log(p_t)\]

Effect:
- Easy examples (\(p_t \to 1\)): loss is close to 0
- Hard examples (\(p_t \to 0\)): full loss

Typical: \(\gamma = 2\), \(\alpha\) = class weight
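
A minimal binary focal loss sketch; here \(\alpha\) is applied uniformly rather than per class, and all names are illustrative.

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")  # -log(p_t)
    p_t = torch.exp(-ce)                              # probability assigned to the true class
    return (alpha * (1 - p_t) ** gamma * ce).mean()   # down-weights easy examples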

Q: Contrastive Loss vs Triplet Loss

A:

| Contrastive | Triplet |
|---|---|
| Pairs (a, b) | Triples (a, p, n) |
| One margin | Relative margin |
| \(L = y \cdot d^2 + (1-y)(\max(0, m-d))^2\) | \(L = \max(0, d(a,p) - d(a,n) + m)\) |

When Triplet > Contrastive: Need relative similarity (face recognition).
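
Minimal sketches of both losses from the table, assuming the inputs are embedding tensors; PyTorch also ships nn.TripletMarginLoss for the second case.

import torch.nn.functional as F

def contrastive_loss(a, b, y, margin=1.0):
    # y = 1 for similar pairs, 0 for dissimilar
    d = F.pairwise_distance(a, b)
    return (y * d.pow(2) + (1 - y) * F.relu(margin - d).pow(2)).mean()

def triplet_loss(anchor, positive, negative, margin=0.2):
    d_ap = F.pairwise_distance(anchor, positive)
    d_an = F.pairwise_distance(anchor, negative)
    return F.relu(d_ap - d_an + margin).mean()   # push d(a,n) beyond d(a,p) by the margin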

Misconception: Label smoothing always improves the model

Label smoothing (\(y_{smooth} = (1-\epsilon)y + \epsilon/K\)) helps calibration but hurts knowledge distillation -- the teacher produces "soft" probabilities, and the student receives an even more blurred signal. With \(\epsilon > 0.2\) the model starts to underfit. The usual optimal value is \(\epsilon = 0.1\), and it is not useful for every task (e.g., not for regression).
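
In PyTorch, label smoothing is available directly in the loss; a one-line usage sketch:

import torch.nn as nn

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # y_smooth = (1 - 0.1) * y + 0.1 / K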

Misconception: Vanishing gradients are solved only by ReLU

ReLU solves the saturation problem (\(\text{ReLU}' = 1\) for \(x > 0\)) but creates dying neurons -- a neuron with \(x < 0\) outputs 0 forever and never updates. In networks with 20+ layers, up to 40% of neurons can "die". Residual connections (\(y = F(x) + x\)) are a more reliable fix: the gradient through the skip connection is 1 regardless of \(F\).


Backpropagation

Q: Explain backprop on the example \(f(x,y,z) = (x+y)z\)

A:

Forward:

x=3, y=-4, z=2
q = x + y = -1
f = q * z = -2

Backward (chain rule): \[\frac{\partial f}{\partial z} = q = -1\]

\[\frac{\partial f}{\partial q} = z = 2\]
\[\frac{\partial q}{\partial x} = 1, \quad \frac{\partial q}{\partial y} = 1\]

By chain rule: \[\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \cdot \frac{\partial q}{\partial x} = 2 \cdot 1 = 2\]

\[\frac{\partial f}{\partial y} = 2 \cdot 1 = 2\]
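
The same example checked with autograd (a sanity-check sketch):

import torch

x = torch.tensor(3.0, requires_grad=True)
y = torch.tensor(-4.0, requires_grad=True)
z = torch.tensor(2.0, requires_grad=True)

q = x + y      # q = -1
f = q * z      # f = -2
f.backward()   # reverse-mode chain rule

print(x.grad, y.grad, z.grad)   # tensor(2.), tensor(2.), tensor(-1.)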

Q: Why does backprop work via a computational graph?

A: Any function can be represented as a composition of simple operations.

Chain rule guarantee: for \(f(g(h(x)))\): \[\frac{df}{dx} = \frac{df}{dg} \cdot \frac{dg}{dh} \cdot \frac{dh}{dx}\]

Algorithm:
1. Forward pass: compute all intermediate values
2. Backward pass: apply chain rule in reverse topological order

Efficiency: Each node computes only local gradient, chain rule handles composition.

Q: Vanishing gradients in deep networks -- how to fix them?

A:

Causes:
- Sigmoid/tanh saturation: \(\sigma'(x) \to 0\)
- Many layers: product of small gradients

Solutions:
1. ReLU: \(\text{ReLU}'(x) = 1\) for \(x > 0\)
2. Residual connections: \(y = F(x) + x\) -- gradient = 1 (see the sketch below)
3. BatchNorm: Keeps activations in non-saturated regime
4. Proper initialization: Xavier/He
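
A minimal residual block sketch showing the \(y = F(x) + x\) structure (layer sizes are illustrative):

import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return self.f(x) + x   # identity path: gradient 1 regardless of F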


Optimizers

Q: Adam vs SGD -- when to use which?

A:

| Adam | SGD + Momentum |
|---|---|
| Adaptive LR per param | Same LR for all |
| Fast convergence | Better generalization |
| Minimal tuning | More tuning needed |
| Default choice | State-of-the-art results |

Practical:
- Start with Adam (quick experiments)
- Use SGD + momentum for final training (research)
- AdamW (weight decay fix) > Adam

Q: Why might Adam fail to converge?

A:

Issues:
1. Non-uniform learning rates across parameters
2. Missing weight decay (fixed in AdamW)
3. \(\beta_2 = 0.999\) too high -- slow \(v_t\) adaptation

Solutions:
- AdamW with proper weight decay (configuration sketch below)
- AMSGrad (bounded \(v_t\))
- Lower \(\beta_2\) (e.g., 0.98)
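
A typical AdamW configuration sketch; the hyperparameters are illustrative and model is assumed to be an existing nn.Module.

import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4,
                              betas=(0.9, 0.98),    # lower beta_2 for faster v_t adaptation
                              weight_decay=0.01)    # decoupled weight decay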

Q: Learning rate warmup -- why is it needed?

A:

Problem: Early training -- random weights -- large gradients -- Adam's \(v_t\) explodes -- tiny effective LR.

Solution: Start with small LR, gradually increase.

# Linear warmup
if step < warmup_steps:
    lr = base_lr * step / warmup_steps

Standard for Transformers: First 1-2% of training.
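
A sketch of linear warmup via LambdaLR, assuming an optimizer already exists (warmup_steps is illustrative):

import torch

warmup_steps = 1000

def warmup_lambda(step):
    # Linear warmup to the base LR, then constant (swap the constant part for a decay if needed)
    return min(1.0, step / warmup_steps)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_lambda)
# call scheduler.step() after every optimizer.step()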


Weight Initialization

Q: Why can't we initialize weights with zeros?

A:

Symmetry problem:
1. All neurons in layer compute same output
2. All receive same gradient
3. All update identically
4. Network = single neuron

Solution: Random initialization breaks symmetry.

Q: Xavier vs He initialization

A:

| Xavier | He |
|---|---|
| \(\text{Var}(w) = \frac{2}{n_{in} + n_{out}}\) | \(\text{Var}(w) = \frac{2}{n_{in}}\) |
| tanh, sigmoid | ReLU, variants |
| Preserves variance through layers | Accounts for ReLU killing half |

Why different? ReLU zeros half the activations -- need 2x variance.
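
A usage sketch with PyTorch's built-in initializers; model is an assumed nn.Module, and the function name is illustrative.

import torch.nn as nn

def init_weights(module):
    if isinstance(module, nn.Linear):
        # He (Kaiming) init for ReLU nets; use nn.init.xavier_uniform_ with tanh/sigmoid instead
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)

model.apply(init_weights)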


Normalization

Q: BatchNorm during training vs inference

A:

Training: Normalize using batch statistics: \[\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}\]

Inference: Batch may not exist or be size 1. Use running statistics: \[\mu_{running} = \alpha \cdot \mu_{running} + (1-\alpha) \cdot \mu_B\]

In PyTorch: model.eval() switches to running stats.

Q: LayerNorm vs BatchNorm для Transformers

A:

BatchNorm problems:
- Depends on batch size (small batches = bad stats)
- Training/inference discrepancy
- Not suited for variable-length sequences

LayerNorm advantages:
- Normalizes over features, not batch
- Same computation train/inference
- Works for any sequence length

2025 Standard: RMSNorm (simpler, faster, same quality).

Q: Why does RMSNorm work for LLMs?

A:

LayerNorm: \(\text{LN}(x) = \frac{x - \mu}{\sigma} \cdot \gamma + \beta\)

RMSNorm: \(\text{RMSNorm}(x) = \frac{x}{\text{RMS}(x)} \cdot \gamma\), where \(\text{RMS}(x) = \sqrt{\frac{1}{d}\sum x_i^2}\)

Why it works:
- Mean centering not critical for Transformers
- Simpler computation (no \(\mu\))
- Same expressiveness (just \(\beta\) absorbed into attention)
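
A minimal RMSNorm implementation sketch matching the formula above (the eps value is illustrative):

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()  # RMS(x)
        return x / rms * self.gamma   # no mean subtraction, no beta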