DL Interview: Fundamentals

~6 min read

Bayesian Neural Networks, Loss Functions, Backpropagation, Optimizers, Weight Initialization, Normalization. According to 2025 ML interview statistics (Blind, levels.fyi), backpropagation, optimizers, and normalization are among the top-5 DL topics.


Bayesian Neural Networks & Uncertainty

Q: Why do we need Bayesian Neural Networks?

A:

Problem with point estimates:
- Standard NNs give a point prediction with no uncertainty
- Overconfident predictions on out-of-distribution data
- Unsuitable for safety-critical systems (medical, autonomous driving)

BNN solution:
- Weights are distributions, not fixed values
- Prediction = average over the weight posterior
- Natural uncertainty quantification

\[P(y|x,D) = \int P(y|x,w) P(w|D) dw\]

Q: Epistemic vs Aleatoric Uncertainty

A:

| Type | Source | Reducible? | Example |
|---|---|---|---|
| Epistemic | Model ignorance | Yes (more data) | New region of input space |
| Aleatoric | Data noise | No | Measurement noise |

In practice:
- Epistemic: Different models give different predictions (ensemble variance)
- Aleatoric: All models agree but prediction uncertain (inherent noise)

Code pattern:

import torch

# Epistemic: MC Dropout / Deep Ensemble -- variance across stochastic forward passes
predictions = torch.stack([model(x) for _ in range(n_samples)])
epistemic = predictions.var(dim=0)  # Model uncertainty

# Aleatoric: the network predicts a variance for each input
mean, var = model(x)  # Model outputs mean and variance
aleatoric = var.mean()  # Data uncertainty

Q: Variational Inference in BNNs -- how does it work?

A:

Goal: approximate the posterior \(P(w|D)\) with a tractable distribution \(q_\theta(w)\).

ELBO (Evidence Lower Bound): \[\mathcal{L} = \mathbb{E}_{q(w)}[\log P(D|w)] - D_{KL}(q(w) \| P(w))\]

Trade-off:
- First term: fit data (likelihood)
- Second term: stay close to prior (regularization)

Practical implementations:
1. Mean-field VI: Factorized Gaussian \(q(w) = \prod_i N(\mu_i, \sigma_i^2)\)
2. Bayes by Backprop: Reparameterization trick for gradients (see the sketch below)
3. MC Dropout: Dropout at inference = variational approximation
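
A minimal Bayes by Backprop sketch, assuming a single mean-field Gaussian layer; the class name BayesianLinear and the fixed prior_std are illustrative, not taken from a specific library.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesianLinear(nn.Module):
    """Mean-field Gaussian posterior over weights, sampled via the reparameterization trick."""
    def __init__(self, in_features, out_features, prior_std=1.0):
        super().__init__()
        self.w_mu = nn.Parameter(torch.zeros(out_features, in_features))
        self.w_rho = nn.Parameter(torch.full((out_features, in_features), -3.0))  # sigma = softplus(rho)
        self.prior_std = prior_std

    def forward(self, x):
        sigma = F.softplus(self.w_rho)
        w = self.w_mu + sigma * torch.randn_like(sigma)          # w = mu + sigma * eps
        # KL(q(w) || P(w)) between diagonal Gaussians: the regularization term of the ELBO
        self.kl = (torch.log(self.prior_std / sigma)
                   + (sigma ** 2 + self.w_mu ** 2) / (2 * self.prior_std ** 2) - 0.5).sum()
        return F.linear(x, w)

The training loss is then the negative log-likelihood plus the accumulated KL terms, i.e. the negative ELBO above.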

Q: MC Dropout for uncertainty estimation

A:

Key insight (Gal & Ghahramani): Dropout at inference ~ approximate variational inference.

Procedure:

def predict_with_uncertainty(model, x, n_samples=100):
    model.train()  # Enable dropout at inference!
    predictions = []
    for _ in range(n_samples):
        pred = model(x)
        predictions.append(pred)
    predictions = torch.stack(predictions)
    mean = predictions.mean(dim=0)
    std = predictions.std(dim=0)  # Uncertainty
    return mean, std

Why model.train(): dropout must stay active during inference (in practice, only the dropout layers are usually set to train mode, so BatchNorm keeps its running statistics).

Pros: Easy to implement, works with existing models.
Cons: Approximate, may underestimate uncertainty.

Q: Deep Ensembles for uncertainty

A:

Method: Train \(M\) independent models with different random seeds.

\[\text{Uncertainty} = \frac{1}{M}\sum_{m=1}^{M} (f_m(x) - \bar{f}(x))^2\]

vs MC Dropout:

| Aspect | Deep Ensembles | MC Dropout |
|---|---|---|
| Training cost | M x full training | 1x full training |
| Inference cost | M x forward pass | N x forward pass |
| Uncertainty quality | Better | Good |
| Implementation | Easy | Easiest |

Best practice: Combine both -- ensemble of MC dropout models.
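
A minimal sketch of ensemble prediction implementing the variance formula above, assuming models is a list of M independently trained networks (the function name is illustrative).

import torch

def ensemble_predict(models, x):
    with torch.no_grad():
        preds = torch.stack([m(x) for m in models])      # (M, batch, ...)
    mean = preds.mean(dim=0)                              # ensemble prediction, \bar{f}(x)
    uncertainty = preds.var(dim=0, unbiased=False)        # 1/M spread across members = epistemic uncertainty
    return mean, uncertainty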

Q: Calibration -- predicted probability vs actual accuracy

A:

Calibrated model: if it predicts 80% confidence, the accuracy on those examples is ~80%.

Expected Calibration Error (ECE): \[ECE = \sum_{m=1}^{M} \frac{|B_m|}{n} |\text{acc}(B_m) - \text{conf}(B_m)|\]
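
A sketch of ECE with equal-width bins, assuming confidences and correct are NumPy arrays of per-example max probabilities and 0/1 correctness (the function name is illustrative).

import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            acc = correct[mask].mean()            # acc(B_m)
            conf = confidences[mask].mean()       # conf(B_m)
            ece += mask.mean() * abs(acc - conf)  # weighted by |B_m| / n
    return ece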

BNN advantage: naturally better calibrated because of explicit uncertainty.

Post-hoc calibration:
- Temperature scaling: \(p' = \text{softmax}(z/T)\)
- Optimize T on validation set (see the sketch below)
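
A temperature scaling sketch: fit a single scalar T on validation logits by minimizing NLL. L-BFGS is a common choice here; names and hyperparameters are illustrative.

import torch
import torch.nn.functional as F

def fit_temperature(val_logits, val_labels):
    log_t = torch.zeros(1, requires_grad=True)               # optimize log T so that T stays positive
    optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=100)

    def closure():
        optimizer.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_t.exp().item()   # at inference: softmax(z / T)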


Loss Functions

Q: Why cross-entropy instead of MSE for classification?

A:

MSE + sigmoid problem: \[\frac{\partial L_{MSE}}{\partial z} = (\sigma(z)-y) \cdot \sigma(z)(1-\sigma(z))\]

When the sigmoid saturates (\(|z| \gg 0\)): \(\sigma(z)(1-\sigma(z)) \to 0\) -- vanishing gradient!

Cross-entropy solution: \[\frac{\partial L_{CE}}{\partial z} = \sigma(z) - y\]

The gradient is proportional to the error and does not vanish.
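
A tiny numeric check of the two gradients in the saturated regime (the values are illustrative).

import torch

z = torch.tensor([6.0], requires_grad=True)   # deep in the saturated region of sigmoid
y = torch.tensor([0.0])                        # confident but wrong prediction

mse = (torch.sigmoid(z) - y).pow(2).mean()
mse.backward()
print(z.grad)                                  # ~0.005: gradient has almost vanished

z.grad = None
ce = torch.nn.functional.binary_cross_entropy_with_logits(z, y)
ce.backward()
print(z.grad)                                  # ~1.0: proportional to the error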

Q: When should you use Focal Loss?

A: Focal Loss is for imbalanced classification.

\[L_{focal} = -\alpha_t (1-p_t)^\gamma \log(p_t)\]

Effect:
- Easy examples (\(p_t \to 1\)): loss is close to 0
- Hard examples (\(p_t \to 0\)): full loss

Typical: \(\gamma = 2\), \(\alpha\) = class weight
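
A minimal binary focal loss sketch; here \(\alpha\) is applied uniformly rather than per class, and all names are illustrative.

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")  # -log(p_t)
    p_t = torch.exp(-ce)                              # probability assigned to the true class
    return (alpha * (1 - p_t) ** gamma * ce).mean()   # down-weights easy examples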

Q: Contrastive Loss vs Triplet Loss

A:

| Contrastive | Triplet |
|---|---|
| Pairs (a, b) | Triples (a, p, n) |
| One margin | Relative margin |
| \(L = y \cdot d^2 + (1-y)(\max(0, m-d))^2\) | \(L = \max(0, d(a,p) - d(a,n) + m)\) |

When Triplet > Contrastive: Need relative similarity (face recognition).
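
Minimal sketches of both losses from the table, assuming the inputs are embedding tensors; PyTorch also ships nn.TripletMarginLoss for the second case.

import torch.nn.functional as F

def contrastive_loss(a, b, y, margin=1.0):
    # y = 1 for similar pairs, 0 for dissimilar
    d = F.pairwise_distance(a, b)
    return (y * d.pow(2) + (1 - y) * F.relu(margin - d).pow(2)).mean()

def triplet_loss(anchor, positive, negative, margin=0.2):
    d_ap = F.pairwise_distance(anchor, positive)
    d_an = F.pairwise_distance(anchor, negative)
    return F.relu(d_ap - d_an + margin).mean()   # push d(a,n) beyond d(a,p) by the margin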

Misconception: Label smoothing always improves the model

Label smoothing (\(y_{smooth} = (1-\epsilon)y + \epsilon/K\)) helps calibration but hurts knowledge distillation -- the teacher produces "soft" probabilities, and the student receives an even more blurred signal. With \(\epsilon > 0.2\) the model starts to underfit. The usual optimal value is \(\epsilon = 0.1\), and it is not useful for every task (e.g., not for regression).
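
In PyTorch, label smoothing is available directly in the loss; a one-line usage sketch:

import torch.nn as nn

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # y_smooth = (1 - 0.1) * y + 0.1 / K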

Misconception: Vanishing gradients are solved only by ReLU

ReLU solves the saturation problem (\(\text{ReLU}' = 1\) for \(x > 0\)) but creates dying neurons -- a neuron with \(x < 0\) outputs 0 forever and never updates. In networks with 20+ layers, up to 40% of neurons can "die". Residual connections (\(y = F(x) + x\)) are a more reliable fix: the gradient through the skip connection is 1 regardless of \(F\).


Backpropagation

Q: Explain backprop on the example \(f(x,y,z) = (x+y)z\)

A:

Forward:

x=3, y=-4, z=2
q = x + y = -1
f = q * z = -2

Backward (chain rule): \[\frac{\partial f}{\partial z} = q = -1\]

\[\frac{\partial f}{\partial q} = z = 2\]
\[\frac{\partial q}{\partial x} = 1, \quad \frac{\partial q}{\partial y} = 1\]

By chain rule: \[\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \cdot \frac{\partial q}{\partial x} = 2 \cdot 1 = 2\]

\[\frac{\partial f}{\partial y} = 2 \cdot 1 = 2\]
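
The same example checked with autograd (a sanity-check sketch):

import torch

x = torch.tensor(3.0, requires_grad=True)
y = torch.tensor(-4.0, requires_grad=True)
z = torch.tensor(2.0, requires_grad=True)

q = x + y      # q = -1
f = q * z      # f = -2
f.backward()   # reverse-mode chain rule

print(x.grad, y.grad, z.grad)   # tensor(2.), tensor(2.), tensor(-1.)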

Q: Why does backprop work via a computational graph?

A: Any function can be represented as a composition of simple operations.

Chain rule guarantee: for \(f(g(h(x)))\): \[\frac{df}{dx} = \frac{df}{dg} \cdot \frac{dg}{dh} \cdot \frac{dh}{dx}\]

Algorithm:
1. Forward pass: compute all intermediate values
2. Backward pass: apply chain rule in reverse topological order

Efficiency: Each node computes only local gradient, chain rule handles composition.

Q: Vanishing gradients in deep networks -- how to fix them?

A:

Causes:
- Sigmoid/tanh saturation: \(\sigma'(x) \to 0\)
- Many layers: product of small gradients

Solutions:
1. ReLU: \(\text{ReLU}'(x) = 1\) for \(x > 0\)
2. Residual connections: \(y = F(x) + x\) -- gradient = 1 (see the sketch below)
3. BatchNorm: Keeps activations in non-saturated regime
4. Proper initialization: Xavier/He
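
A minimal residual block sketch showing the \(y = F(x) + x\) structure (layer sizes are illustrative):

import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return self.f(x) + x   # identity path: gradient 1 regardless of F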


Optimizers

Q: Adam vs SGD -- when to use which?

A:

| Adam | SGD + Momentum |
|---|---|
| Adaptive LR per param | Same LR for all |
| Fast convergence | Better generalization |
| Minimal tuning | More tuning needed |
| Default choice | State-of-the-art results |

Practical:
- Start with Adam (quick experiments)
- Use SGD + momentum for final training (research)
- AdamW (weight decay fix) > Adam

Q: Why might Adam fail to converge?

A:

Issues:
1. Non-uniform learning rates across parameters
2. Missing weight decay (fixed in AdamW)
3. \(\beta_2 = 0.999\) too high -- slow \(v_t\) adaptation

Solutions:
- AdamW with proper weight decay (configuration sketch below)
- AMSGrad (bounded \(v_t\))
- Lower \(\beta_2\) (e.g., 0.98)
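
A typical AdamW configuration sketch; the hyperparameters are illustrative and model is assumed to be an existing nn.Module.

import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4,
                              betas=(0.9, 0.98),    # lower beta_2 for faster v_t adaptation
                              weight_decay=0.01)    # decoupled weight decay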

Q: Learning rate warmup -- why is it needed?

A:

Problem: Early training -- random weights -- large gradients -- Adam's \(v_t\) explodes -- tiny effective LR.

Solution: Start with small LR, gradually increase.

# Linear warmup
if step < warmup_steps:
    lr = base_lr * step / warmup_steps

Standard for Transformers: First 1-2% of training.
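
A sketch of linear warmup via LambdaLR, assuming an optimizer already exists (warmup_steps is illustrative):

import torch

warmup_steps = 1000

def warmup_lambda(step):
    # Linear warmup to the base LR, then constant (swap the constant part for a decay if needed)
    return min(1.0, step / warmup_steps)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_lambda)
# call scheduler.step() after every optimizer.step()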


Weight Initialization

Q: Why can't we initialize weights with zeros?

A:

Symmetry problem:
1. All neurons in layer compute same output
2. All receive same gradient
3. All update identically
4. Network = single neuron

Solution: Random initialization breaks symmetry.

Q: Xavier vs He initialization

A:

| Xavier | He |
|---|---|
| \(\text{Var}(w) = \frac{2}{n_{in} + n_{out}}\) | \(\text{Var}(w) = \frac{2}{n_{in}}\) |
| tanh, sigmoid | ReLU, variants |
| Preserves variance through layers | Accounts for ReLU killing half |

Why different? ReLU zeros half the activations -- need 2x variance.
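
A usage sketch with PyTorch's built-in initializers; model is an assumed nn.Module, and the function name is illustrative.

import torch.nn as nn

def init_weights(module):
    if isinstance(module, nn.Linear):
        # He (Kaiming) init for ReLU nets; use nn.init.xavier_uniform_ with tanh/sigmoid instead
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)

model.apply(init_weights)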


Normalization

Q: BatchNorm during training vs inference

A:

Training: Normalize using batch statistics: \[\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}\]

Inference: Batch may not exist or be size 1. Use running statistics: \[\mu_{running} = \alpha \cdot \mu_{running} + (1-\alpha) \cdot \mu_B\]

In PyTorch: model.eval() switches to running stats.

Q: LayerNorm vs BatchNorm для Transformers

A:

BatchNorm problems:
- Depends on batch size (small batches = bad stats)
- Training/inference discrepancy
- Not suited for variable-length sequences

LayerNorm advantages:
- Normalizes over features, not batch
- Same computation train/inference
- Works for any sequence length

2025 Standard: RMSNorm (simpler, faster, same quality).

Q: Why does RMSNorm work for LLMs?

A:

LayerNorm: \(\text{LN}(x) = \frac{x - \mu}{\sigma} \cdot \gamma + \beta\)

RMSNorm: \(\text{RMSNorm}(x) = \frac{x}{\text{RMS}(x)} \cdot \gamma\), where \(\text{RMS}(x) = \sqrt{\frac{1}{d}\sum x_i^2}\)

Why it works:
- Mean centering not critical for Transformers
- Simpler computation (no \(\mu\))
- Same expressiveness (just \(\beta\) absorbed into attention)
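
A minimal RMSNorm implementation sketch matching the formula above (the eps value is illustrative):

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()  # RMS(x)
        return x / rms * self.gamma   # no mean subtraction, no beta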