DL Interview: Fundamentals¶
~6 min read
Bayesian Neural Networks, Loss Functions, Backpropagation, Optimizers, Weight Initialization, Normalization. According to 2025 ML interview statistics (Blind, levels.fyi), backpropagation, optimizers, and normalization are among the top-5 DL topics.
Bayesian Neural Networks & Uncertainty¶
Q: Why do we need Bayesian Neural Networks?¶
A:
The problem with point estimates:
- Standard NNs give a point prediction without uncertainty
- Overconfident predictions on out-of-distribution data
- Unsuitable for safety-critical systems (medical, autonomous driving)
BNN solution:
- Weights are distributions, not fixed values
- Prediction = average over the weight posterior
- Natural uncertainty quantification
Q: Epistemic vs Aleatoric Uncertainty¶
A:
| Type | Source | Reducible? | Example |
|---|---|---|---|
| Epistemic | Model ignorance | Yes (more data) | New region of input space |
| Aleatoric | Data noise | No | Measurement noise |
In practice:
- Epistemic: different models give different predictions (ensemble variance)
- Aleatoric: all models agree but the prediction is uncertain (inherent noise)
Code pattern:
```python
import torch

# Epistemic: MC Dropout / Deep Ensemble -- variance across stochastic forward passes
predictions = torch.stack([model(x) for _ in range(n_samples)])
epistemic = predictions.var(dim=0)  # model (epistemic) uncertainty

# Aleatoric: the network predicts a variance alongside the mean
mean, var = model(x)    # model outputs mean and variance
aleatoric = var.mean()  # data (aleatoric) uncertainty
```
Q: How does Variational Inference work in BNNs?¶
A:
Goal: Approximate the posterior \(P(w|D)\) with a tractable distribution \(q_\theta(w)\).
ELBO (Evidence Lower Bound): $$\mathcal{L} = \mathbb{E}_{q(w)}[\log P(D|w)] - D_{KL}(q(w) \| P(w))$$
Trade-off:
- First term: fit the data (likelihood)
- Second term: stay close to the prior (regularization)
Practical implementations:
1. Mean-field VI: factorized Gaussian \(q(w) = \prod_i N(\mu_i, \sigma_i^2)\)
2. Bayes by Backprop: reparameterization trick for gradients (sketch below)
3. MC Dropout: dropout at inference = variational approximation
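A minimal sketch of the Bayes by Backprop idea: a mean-field Gaussian posterior per weight plus the reparameterization trick. The class name `BayesianLinear` and the ELBO wiring are illustrative assumptions, not a reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesianLinear(nn.Module):
    """Mean-field variational linear layer (bias omitted for brevity)."""
    def __init__(self, n_in, n_out, prior_std=1.0):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(n_out, n_in))
        self.rho = nn.Parameter(torch.full((n_out, n_in), -5.0))  # sigma = softplus(rho) > 0
        self.prior_std = prior_std

    def forward(self, x):
        sigma = F.softplus(self.rho)
        eps = torch.randn_like(sigma)
        w = self.mu + sigma * eps  # reparameterization trick: gradients flow through mu, rho
        # Closed-form KL(q(w) || N(0, prior_std^2)) for factorized Gaussians
        kl = (torch.log(self.prior_std / sigma)
              + (sigma ** 2 + self.mu ** 2) / (2 * self.prior_std ** 2) - 0.5).sum()
        return x @ w.t(), kl

# ELBO objective (negated) per batch: NLL + KL scaled by dataset size, e.g.
# logits, kl = layer(x); loss = F.cross_entropy(logits, y) + kl / num_train_samples
```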
Q: MC Dropout для uncertainty estimation¶
A:
Key insight (Gal & Ghahramani): Dropout at inference ~ approximate variational inference.
Procedure:
```python
import torch

def predict_with_uncertainty(model, x, n_samples=100):
    model.train()  # enable dropout at inference!
    predictions = []
    with torch.no_grad():
        for _ in range(n_samples):
            predictions.append(model(x))
    predictions = torch.stack(predictions)
    mean = predictions.mean(dim=0)
    std = predictions.std(dim=0)  # uncertainty
    return mean, std
```
Why model.train(): Need dropout active during inference.
Pros: easy to implement, works with existing models.
Cons: approximate, may underestimate uncertainty.
Q: Deep Ensembles для uncertainty¶
A:
Method: Train \(M\) independent models with different random seeds.
vs MC Dropout:
| Aspect | Deep Ensembles | MC Dropout |
|---|---|---|
| Training cost | M x full training | 1x full training |
| Inference cost | M x forward pass | N x forward pass |
| Uncertainty quality | Better | Good |
| Implementation | Easy | Easiest |
Best practice: Combine both -- ensemble of MC dropout models.
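A minimal sketch of ensemble prediction, assuming `models` is a list of independently trained classifiers that return logits (the function name is illustrative):

```python
import torch

def ensemble_predict(models, x):
    """Average the members' softmax outputs; their disagreement estimates epistemic uncertainty."""
    with torch.no_grad():
        probs = torch.stack([m(x).softmax(dim=-1) for m in models])  # (M, batch, classes)
    mean = probs.mean(dim=0)                  # ensemble prediction
    epistemic = probs.var(dim=0).sum(dim=-1)  # spread across members, per example
    return mean, epistemic
```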
Q: Calibration -- predicted probability vs actual accuracy¶
A:
Calibrated model: if it predicts 80% confidence, the accuracy on those examples is ~80%.
Expected Calibration Error (ECE): $$ECE = \sum_{m=1}^{M} \frac{|B_m|}{n} |\text{acc}(B_m) - \text{conf}(B_m)|$$
BNN advantage: naturally better calibrated thanks to uncertainty modeling.
Post-hoc calibration:
- Temperature scaling: \(p' = \text{softmax}(z/T)\)
- Optimize T on a validation set (sketch below)
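A minimal temperature-scaling sketch, assuming `val_logits` and `val_labels` were collected from a held-out validation set (the helper name is illustrative):

```python
import torch
import torch.nn.functional as F

def fit_temperature(val_logits, val_labels):
    """Optimize a single scalar T on validation NLL; apply as softmax(z / T) at test time."""
    log_t = torch.zeros(1, requires_grad=True)  # optimize log T so that T stays positive
    optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=100)

    def closure():
        optimizer.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_t.exp().item()
```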
Loss Functions¶
Q: Why cross-entropy instead of MSE for classification?¶
A:
MSE + sigmoid problem: $$\frac{\partial L_{MSE}}{\partial z} = (\sigma(z)-y) \cdot \sigma(z)(1-\sigma(z))$$
When the sigmoid saturates (\(|z| \gg 0\)): \(\sigma(z)(1-\sigma(z)) \to 0\) -- vanishing gradient!
Cross-entropy solution: $$\frac{\partial L_{CE}}{\partial z} = \sigma(z) - y$$
The gradient is proportional to the error and does not vanish.
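A quick numeric check of the two gradients in the saturated regime (printed values are approximate):

```python
import torch
import torch.nn.functional as F

y = torch.tensor([0.0])                       # true label
z = torch.tensor([6.0], requires_grad=True)   # saturated sigmoid region
mse = 0.5 * (torch.sigmoid(z) - y).pow(2).sum()
mse.backward()
print(z.grad)   # ~0.0025: (sigma(z) - y) * sigma(z) * (1 - sigma(z)), vanishing

z = torch.tensor([6.0], requires_grad=True)
ce = F.binary_cross_entropy_with_logits(z, y)
ce.backward()
print(z.grad)   # ~0.9975: sigma(z) - y, proportional to the error
```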
Q: When should you use Focal Loss?¶
A: Focal Loss is for imbalanced classification: \(FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log p_t\).
Effect:
- Easy examples (\(p_t \to 1\)): loss is close to 0
- Hard examples (\(p_t \to 0\)): nearly the full CE loss
Typical: \(\gamma = 2\), \(\alpha\) = class weight
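A binary focal-loss sketch built on BCE; the function name and default values are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """FL = -alpha_t * (1 - p_t)^gamma * log(p_t); BCE already equals -log(p_t)."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # per-class weight
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```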
Q: Contrastive Loss vs Triplet Loss¶
A:
| Contrastive | Triplet |
|---|---|
| Pairs (a, b) | Triples (a, p, n) |
| One margin | Relative margin |
| \(L = y \cdot d^2 + (1-y)(\max(0, m-d))^2\) | \(L = \max(0, d(a,p) - d(a,n) + m)\) |
When Triplet > Contrastive: Need relative similarity (face recognition).
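A minimal triplet-loss sketch on batches of embeddings (PyTorch also provides `nn.TripletMarginLoss` for this):

```python
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Pull (anchor, positive) together and push (anchor, negative) apart by at least margin."""
    d_ap = F.pairwise_distance(anchor, positive)
    d_an = F.pairwise_distance(anchor, negative)
    return F.relu(d_ap - d_an + margin).mean()
```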
Misconception: Label smoothing always improves the model
Label smoothing (\(y_{smooth} = (1-\epsilon)y + \epsilon/K\)) helps calibration but hurts knowledge distillation -- the teacher outputs "soft" probabilities, and the student then receives an even more smeared signal. With \(\epsilon > 0.2\) the model starts to underfit. The usual optimum is \(\epsilon = 0.1\), and it is not useful for every task (e.g., not for regression).
Misconception: Vanishing gradients are solved only by ReLU
ReLU fixes the saturation problem (\(\text{ReLU}' = 1\) for \(x > 0\)) but creates dying neurons -- a neuron stuck at \(x < 0\) outputs 0 forever and never updates. In networks with 20+ layers, up to 40% of neurons can "die". Residual connections (\(y = F(x) + x\)) are a more reliable fix: the gradient through the skip connection is 1 regardless of \(F\).
Backpropagation¶
Q: Explain backprop on the example \(f(x,y,z) = (x+y)z\)¶
A:
Forward: introduce the intermediate \(q = x + y\), then \(f = q \cdot z\). With the values used below, \(q = -1\) and \(z = 2\), so \(f = -2\).
Backward (chain rule): $$\frac{\partial f}{\partial z} = q = -1, \qquad \frac{\partial f}{\partial q} = z = 2, \qquad \frac{\partial q}{\partial x} = \frac{\partial q}{\partial y} = 1$$
By the chain rule: $$\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \cdot \frac{\partial q}{\partial x} = 2 \cdot 1 = 2$$
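The same example checked with autograd; the concrete values (x = 1, y = -2, z = 2) are an assumption chosen to match \(q = -1\) above:

```python
import torch

x = torch.tensor(1.0, requires_grad=True)
y = torch.tensor(-2.0, requires_grad=True)
z = torch.tensor(2.0, requires_grad=True)

q = x + y   # q = -1
f = q * z   # f = -2
f.backward()

print(x.grad, y.grad, z.grad)   # df/dx = z = 2, df/dy = z = 2, df/dz = q = -1
```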
Q: Why does backprop work through a computational graph?¶
A: Любую функцию можно представить как composition простых операций.
Chain rule guarantee: for \(f(g(h(x)))\): $$\frac{df}{dx} = \frac{df}{dg} \cdot \frac{dg}{dh} \cdot \frac{dh}{dx}$$
Algorithm:
1. Forward pass: compute all intermediate values
2. Backward pass: apply the chain rule in reverse topological order
Efficiency: Each node computes only local gradient, chain rule handles composition.
Q: Vanishing gradients in deep networks -- how to solve them?¶
A:
Causes:
- Sigmoid/tanh saturation: \(\sigma'(x) \to 0\)
- Many layers: product of small gradients
Solutions (see the sketch below):
1. ReLU: \(\text{ReLU}'(x) = 1\) for \(x > 0\)
2. Residual connections: \(y = F(x) + x\) -- gradient = 1
3. BatchNorm: keeps activations in the non-saturated regime
4. Proper initialization: Xavier/He
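A minimal residual-block sketch showing the identity path that keeps gradients flowing; the layer sizes are illustrative:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.f(x)   # y = F(x) + x: gradient through the skip path is 1
```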
Optimizers¶
Q: Adam vs SGD -- when to use which?¶
A:
| Adam | SGD + Momentum |
|---|---|
| Adaptive LR per param | Same LR for all |
| Fast convergence | Better generalization |
| Minimal tuning | More tuning needed |
| Default choice | State-of-the-art results |
Practical:
- Start with Adam (quick experiments)
- Use SGD + momentum for final training (research)
- AdamW (weight decay fix) > Adam
Q: Why might Adam fail to converge?¶
A:
Issues:
1. Non-uniform learning rates across parameters
2. Missing weight decay (fixed in AdamW)
3. \(\beta_2 = 0.999\) too high -- slow \(v_t\) adaptation
Solutions (typical setup sketched below):
- AdamW with proper weight decay
- AMSGrad (bounded \(v_t\))
- Lower \(\beta_2\) (e.g., 0.98)
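A typical PyTorch setup reflecting these fixes; the placeholder model and hyperparameter values are assumptions, not universal defaults:

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10)   # placeholder model
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,
    betas=(0.9, 0.98),   # lower beta_2: faster v_t adaptation
    weight_decay=0.01,   # decoupled weight decay (the AdamW fix)
)
# AMSGrad variant with bounded v_t: torch.optim.Adam(model.parameters(), amsgrad=True)
```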
Q: Learning rate warmup -- why?¶
A:
Problem: early training → random weights → large gradients → Adam's \(v_t\) explodes → tiny effective LR.
Solution: Start with small LR, gradually increase.
Standard for Transformers: First 1-2% of training.
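A minimal linear-warmup sketch using `LambdaLR`; the warmup length, placeholder model, and optimizer settings are assumptions:

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10)   # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

warmup_steps = 1000
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps),  # ramp LR from ~0 to base LR
)
# Call scheduler.step() after each optimizer.step(); a decay schedule can follow the warmup.
```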
Weight Initialization¶
Q: Why can't weights be initialized to zero?¶
A:
Symmetry problem:
1. All neurons in a layer compute the same output
2. All receive the same gradient
3. All update identically
4. The network collapses to a single neuron
Solution: Random initialization breaks symmetry.
Q: Xavier vs He initialization¶
A:
| Xavier | He |
|---|---|
| \(\text{Var}(w) = \frac{2}{n_{in} + n_{out}}\) | \(\text{Var}(w) = \frac{2}{n_{in}}\) |
| tanh, sigmoid | ReLU, variants |
| Preserves variance through layers | Accounts for ReLU killing half |
Why different? ReLU zeros half the activations -- need 2x variance.
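A sketch of applying He/Xavier initialization in PyTorch via `model.apply`; the helper name and layer sizes are assumptions:

```python
import torch.nn as nn

def init_weights(module):
    if isinstance(module, nn.Linear):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")  # He: Var(w) = 2 / n_in
        nn.init.zeros_(module.bias)
        # For tanh/sigmoid networks use Xavier instead:
        # nn.init.xavier_normal_(module.weight)

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model.apply(init_weights)
```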
Normalization¶
Q: BatchNorm during training vs inference¶
A:
Training: normalize using batch statistics: $$\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$
Inference: a batch may not exist or may have size 1, so use running statistics accumulated during training: $$\mu_{running} = \alpha \cdot \mu_{running} + (1-\alpha) \cdot \mu_B$$
In PyTorch: model.eval() switches to running stats.
Q: LayerNorm vs BatchNorm для Transformers¶
A:
BatchNorm problems:
- Depends on batch size (small batches = bad statistics)
- Training/inference discrepancy
- Not suited for variable-length sequences
LayerNorm advantages:
- Normalizes over features, not the batch
- Same computation at train and inference
- Works for any sequence length
2025 Standard: RMSNorm (simpler, faster, same quality).
Q: Why does RMSNorm work for LLMs?¶
A:
LayerNorm: \(\text{LN}(x) = \frac{x - \mu}{\sigma} \cdot \gamma + \beta\)
RMSNorm: \(\text{RMSNorm}(x) = \frac{x}{\text{RMS}(x)} \cdot \gamma\), where \(\text{RMS}(x) = \sqrt{\frac{1}{d}\sum x_i^2}\)
Why it works (sketch below):
- Mean centering is not critical for Transformers
- Simpler computation (no \(\mu\))
- Same expressiveness (just \(\beta\) absorbed into attention)
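A minimal RMSNorm sketch matching the formula above (the eps term is an added assumption for numerical stability):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))  # learned scale, no beta shift
        self.eps = eps

    def forward(self, x):
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x / rms * self.gamma
```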