DL Interview: Special Topics
~11 min read
Time Series DL (DeepAR, TFT), Uncertainty Quantification (MC Dropout, Deep Ensembles, Conformal Prediction, OOD Detection), t-SNE & UMAP, Speculative Decoding, Mamba & State Space Models (SSM), Sequence Modeling Beyond RNN (TCN, WaveNet).
Time Series Deep Learning
Sources: TFT Tutorial (Towards Data Science), Google TFT Paper
Q: DeepAR vs TFT vs Prophet -- comparison
A:
| Model | Type | Key Features | When to Use |
|---|---|---|---|
| Prophet | Additive regression | Trend, seasonality, holidays | Baseline, simple patterns |
| DeepAR | Autoregressive RNN | Probabilistic, learns across series | Many similar series, cold start |
| TFT | Transformer + attention | Interpretable, multi-horizon | Complex patterns, need interpretability |
DeepAR (Amazon): Autoregressive RNN with likelihood output, learns global model from multiple series.
TFT (Google): Variable Selection Network + Multi-head attention, supports static + time-varying + known future features, interpretable attention.
Q: What is the Temporal Fusion Transformer (TFT)?
A:
Architecture:
1. Variable Selection Network: learns which features matter
2. LSTM Encoder-Decoder: captures local temporal patterns
3. Multi-head Attention: captures long-term dependencies
4. Gated Residual Network: non-linear processing
Feature types:
- Time-varying known: known into the future (holidays, promotions)
- Time-varying unknown: observed in the past only (sales, demand)
- Static real/categorical: constant per series (product ID)
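As an illustration, here is how these feature types map onto pytorch-forecasting's TFT implementation; a minimal sketch where `df`, the column names, and all hyperparameters are hypothetical placeholders:

```python
from pytorch_forecasting import TimeSeriesDataSet, TemporalFusionTransformer

# df and every column name below are hypothetical placeholders
training = TimeSeriesDataSet(
    df,
    time_idx="time_idx",
    target="sales",
    group_ids=["product_id"],
    static_categoricals=["product_id"],             # static per series
    time_varying_known_reals=["holiday", "promo"],  # known into the future
    time_varying_unknown_reals=["sales"],           # observed past only
    max_encoder_length=60,
    max_prediction_length=7,
)
tft = TemporalFusionTransformer.from_dataset(
    training, hidden_size=32, attention_head_size=4, dropout=0.1
)
```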
Q: Interpretable attention in TFT -- how does it work?
A:
Three types of interpretability:
1. Seasonality-wise: attention weights show which past timesteps matter
2. Feature-wise: the Variable Selection Network outputs a per-feature importance
3. Extreme events: analyze behavior on rare value ranges
Uncertainty Quantification (UQ)
Sources: Kendall & Gal "What Uncertainties Do We Need in Bayesian Deep Learning" (2017), Lakshminarayanan et al. "Deep Ensembles" (2017), Gal & Ghahramani "MC Dropout" (2016)
Q: What are aleatoric vs epistemic uncertainty?
A:
| Type | Definition | Can be reduced? | Source |
|---|---|---|---|
| Aleatoric | Irreducible noise in data | No (more of the same data won't help) | Sensor noise, label ambiguity |
| Epistemic | Uncertainty about model | Yes (better data/model) | Limited data, model misspecification |
Mathematical formulation: $$p(y|x, D) = \int p(y|x, \theta)\, p(\theta|D)\, d\theta$$
- \(p(y|x, \theta)\) captures aleatoric (noise in labels)
- \(p(\theta|D)\) captures epistemic (uncertainty in weights)
Why distinguish:
- High epistemic -> collect more data
- High aleatoric -> improve sensors, reduce label noise
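Aleatoric uncertainty is typically modeled directly with a heteroscedastic head trained on the Gaussian NLL (as in Kendall & Gal); a minimal sketch, where the two-output head and loss form follow the standard recipe:

```python
import torch
import torch.nn as nn

class HeteroscedasticHead(nn.Module):
    """Predicts mean and log-variance; log-variance keeps variance positive."""
    def __init__(self, d_in):
        super().__init__()
        self.mean = nn.Linear(d_in, 1)
        self.log_var = nn.Linear(d_in, 1)

    def forward(self, h):
        return self.mean(h), self.log_var(h)

def gaussian_nll(mean, log_var, y):
    # NLL of y under N(mean, exp(log_var)); the model can "explain away"
    # noisy targets by inflating the predicted variance
    return 0.5 * (torch.exp(-log_var) * (y - mean) ** 2 + log_var).mean()
```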
Q: How does Monte Carlo (MC) Dropout work?
A:
Key insight: Dropout at test time approximates Bayesian inference.
```python
import torch

def mc_dropout_predict(model, x, T=100):
    model.eval()
    # Re-enable only the dropout layers (model.train() would also switch
    # BatchNorm to training mode, which we don't want at test time)
    for m in model.modules():
        if isinstance(m, torch.nn.Dropout):
            m.train()
    predictions = []
    for _ in range(T):
        with torch.no_grad():
            predictions.append(model(x))
    predictions = torch.stack(predictions)
    return predictions.mean(dim=0), predictions.var(dim=0)
```
Advantages:
- No model changes needed
- Cheap to set up; inference costs T forward passes
- Strong baseline for UQ
Limitations:
- Requires dropout in the architecture
- Variance tends to be underestimated
Q: Deep Ensembles for uncertainty estimation?
A:
Approach: Train M models with different random seeds, aggregate predictions.
```python
import torch

class DeepEnsemble:
    def __init__(self, models):
        self.models = [m.eval() for m in models]

    @torch.no_grad()
    def predict(self, x):
        # Each member was trained from a different random seed;
        # disagreement between members signals epistemic uncertainty
        predictions = torch.stack([model(x) for model in self.models])
        return predictions.mean(dim=0), predictions.var(dim=0)
```
Comparison with MC Dropout:

| Aspect | MC Dropout | Deep Ensembles |
|---|---|---|
| Training cost | Same | M x more |
| Inference cost | T forward passes | M forward passes |
| Diversity source | Stochastic dropout | Different minima |
| Typical T/M | 10-100 | 5-10 |
Best practice: Deep Ensembles + Temperature Scaling
Q: How to evaluate model calibration?
A:
Expected Calibration Error (ECE): $$ECE = \sum_{m=1}^{M} \frac{|B_m|}{n} \left|\text{acc}(B_m) - \text{conf}(B_m)\right|$$
Temperature Scaling:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def temperature_scale(logits, temperature):
    return F.softmax(logits / temperature, dim=-1)

# Learn T on a validation set (LBFGS requires a closure, shown below)
temperature = nn.Parameter(torch.ones(1))
optimizer = torch.optim.LBFGS([temperature], lr=0.01, max_iter=50)
```
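The LBFGS step needs a closure that recomputes the calibration loss; a minimal sketch, where `val_logits` and `val_labels` are an assumed held-out validation split:

```python
def closure():
    optimizer.zero_grad()
    # NLL on held-out data is the standard calibration objective
    loss = F.cross_entropy(val_logits / temperature, val_labels)
    loss.backward()
    return loss

optimizer.step(closure)
# Learned T > 1 softens overconfident predictions; T < 1 sharpens them
```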
Q: Conformal Prediction -- what is it?
A:
Goal: Distribution-free, finite-sample valid prediction sets.
Key property: under exchangeability, coverage is guaranteed: $$P(Y_{n+1} \in \hat{C}(X_{n+1})) \geq 1 - \alpha$$
Split Conformal procedure:
1. Compute calibration scores: \(s_i = 1 - f(x_i)_{y_i}\)
2. Find the quantile: \(\hat{q}\) = the \(\lceil (n+1)(1-\alpha) \rceil\)-th smallest score
3. Prediction set: \(\{y : f(x_{new})_y \geq 1 - \hat{q}\}\)
```python
import torch
import torch.nn.functional as F

def conformal_prediction(cal_logits, cal_labels, test_logits, alpha=0.1):
    # Nonconformity score: 1 - probability assigned to the true class
    cal_probs = F.softmax(cal_logits, dim=-1)
    scores = 1 - cal_probs[torch.arange(len(cal_labels)), cal_labels]
    n = len(scores)
    # Finite-sample corrected quantile level; 'higher' keeps the coverage guarantee
    q_level = min(1.0, (n + 1) * (1 - alpha) / n)
    q = torch.quantile(scores, q_level, interpolation="higher")
    test_probs = F.softmax(test_logits, dim=-1)
    # Include every class whose probability clears the calibrated threshold
    return test_probs >= (1 - q)
```
Advantages:
- No distribution assumptions
- Works with any model
- Guaranteed coverage
Q: OOD Detection -- how to detect out-of-distribution inputs?
A:
Baselines:
1. Max Softmax Probability (MSP):
```python
import torch.nn.functional as F

def msp_ood_score(logits):
    probs = F.softmax(logits, dim=-1)
    return 1 - probs.max(dim=-1).values  # higher score = more likely OOD
```
2. Energy-based (a one-line sketch follows the Mahalanobis code below): $$E(x) = -T \log \sum_i e^{f_i(x)/T}$$
3. Mahalanobis distance (in feature space):
```python
import torch

def mahalanobis_ood(features, class_means, cov):
    # Distance to the nearest class centroid under the shared covariance;
    # a larger minimum distance = more likely OOD
    cov_inv = torch.inverse(cov)
    scores = []
    for mean in class_means:
        diff = features - mean                   # (n_samples, d)
        m = (diff @ cov_inv * diff).sum(dim=-1)  # per-sample squared distance
        scores.append(m)
    return torch.min(torch.stack(scores), dim=0).values
```
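The energy score from item 2 above is essentially a one-liner over the logits; a minimal sketch (T = 1 is a common default):

```python
import torch

def energy_ood_score(logits, T=1.0):
    # E(x) = -T * logsumexp(f(x)/T); higher energy = more likely OOD
    return -T * torch.logsumexp(logits / T, dim=-1)
```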
Evaluation metrics: AUROC, AUPR, FPR@95TPR
Q: Selective Prediction -- when should the model abstain?
A:
Idea: Model can abstain when uncertain, trading coverage for accuracy.
```python
import torch
import torch.nn.functional as F

def selective_predict(model, x, threshold=0.9):
    with torch.no_grad():
        probs = F.softmax(model(x), dim=-1)
    max_prob, pred = probs.max(dim=-1)
    accepted = max_prob >= threshold  # abstain where confidence is below threshold
    return pred[accepted], accepted
```
Coverage-accuracy trade-off:
- Higher threshold -> higher accuracy, lower coverage
- Choose the threshold based on business requirements (see the sweep sketch below)
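One practical way to pick the threshold is to sweep it on a validation set and read off the coverage/selective-accuracy curve; a minimal sketch, where `probs`, `preds`, and `labels` are assumed precomputed on held-out data:

```python
import torch

def coverage_accuracy_curve(probs, preds, labels, thresholds):
    # probs: max softmax confidence per sample; preds/labels: predictions & ground truth
    for t in thresholds:
        accepted = probs >= t
        coverage = accepted.float().mean().item()
        if accepted.any():
            acc = (preds[accepted] == labels[accepted]).float().mean().item()
            print(f"threshold={t:.2f} coverage={coverage:.1%} selective_acc={acc:.1%}")
```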
Q: Which UQ method to use when?
A:
| Scenario | Recommended Method |
|---|---|
| Quick baseline | Temperature scaling + MC Dropout |
| Maximum reliability | Deep Ensembles + Temperature scaling |
| Safety-critical | Conformal prediction + Ensembles |
| Need sets, not scores | Conformal prediction |
| OOD detection | Energy-based + Mahalanobis |
| Resource-constrained | MC Dropout (T=10-20) |
| Aleatoric uncertainty | Heteroscedastic loss (predict variance) |
Dimensionality Reduction for Visualization (t-SNE, UMAP)
Sources: AI Under the Hood: t-SNE & UMAP (2025)
Q: t-SNE -- how does it work?
A:
t-Distributed Stochastic Neighbor Embedding (van der Maaten & Hinton, 2008).
Core idea: Preserve local neighborhoods -- similar points in high-D should be close in low-D.
Algorithm:

1. Compute similarities in high-D:
$$p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}$$

2. Compute similarities in low-D (Student t-distribution):
$$q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}}$$

3. Minimize the KL divergence:
$$C = KL(P\|Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}$$
Why Student t? Heavy tails allow dissimilar points to be far apart without crushing local structure.
Perplexity: Controls effective number of neighbors. Typical: 5-50.
```python
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, perplexity=30, learning_rate=200,
            n_iter=1000, random_state=42)  # n_iter is renamed max_iter in sklearn >= 1.5
embedding = tsne.fit_transform(X_high_dim)
```
Q: UMAP -- how does it work and how does it differ from t-SNE?
A:
Uniform Manifold Approximation and Projection (McInnes et al., 2018).
| Aspect | t-SNE | UMAP |
|---|---|---|
| Theory | Probabilistic (KL divergence) | Topological (fuzzy simplicial sets) |
| Global structure | Poor | Better preserved |
| Speed | Slow (O(n^2)) | Faster (O(n log n)) |
| New data | Can't embed | Can transform new points |
| Scalability | ~100K samples | Millions of samples |
| Parameters | perplexity | n_neighbors, min_dist, metric |
Key parameters:
- n_neighbors (default 15): Local vs global balance
- min_dist (default 0.1): Spread of points
```python
import umap

reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, metric='euclidean')
embedding = reducer.fit_transform(X_high_dim)

# Unlike t-SNE, UMAP can embed new points after fitting
new_embedding = reducer.transform(X_new)
```
Q: When to use t-SNE vs UMAP vs PCA?
A:
| Method | Speed | Global Structure | New Data | Use Case |
|---|---|---|---|---|
| PCA | Very fast | Preserved | Yes | Initial exploration, linear data |
| t-SNE | Slow | Poor | No | Cluster visualization, small datasets |
| UMAP | Medium | Better | Yes | General purpose, large datasets |
Q: Common pitfalls when visualizing embeddings?
A:
- Random seed sensitivity: different runs produce different shapes -- always set random_state
- Cluster size does not reflect actual size: t-SNE/UMAP distort distances
- Global distances are meaningless: far-apart clusters may not be far in high-D
- Parameters matter: same data, different perplexity -> very different visualizations
- Over-interpretation: patterns may be artifacts -- validate with other methods
Q: Visualizing BERT/ResNet embeddings -- best practices?
A:
```python
import torch
import umap
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('bert-base-uncased')
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model.eval()

def get_embeddings(texts):
    inputs = tokenizer(texts, return_tensors='pt', padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[:, 0, :].numpy()  # [CLS] token embedding

embeddings = get_embeddings(texts)

# UMAP for visualization (cosine suits text embeddings better than euclidean)
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, metric='cosine')
vis = reducer.fit_transform(embeddings)
```
Best practices:
- Use metric='cosine' for text/image embeddings
- Preprocess with PCA if >100 dimensions
- Try multiple n_neighbors values
- Color by class/cluster to validate separation
Speculative Decoding (LLM Inference Acceleration)
Sources: Michael Brenndoerfer: Speculative Decoding (2026), BentoML Blog (2025)
Q: What is Speculative Decoding and why is it needed?
A:
Speculative Decoding is a technique that accelerates LLM inference by 2-3x without degrading generation quality.
Problem: LLM inference is memory-bound, not compute-bound.
- A 70B model requires loading ~140 GB from memory for every token (fp16 weights)
- A100: 2 TB/s bandwidth -> ~70 ms just to load the weights
- GPU compute sits idle ~95% of the time
Solution: generate multiple tokens per large-model forward pass.
Q: How does Speculative Decoding work?
A:
Architecture: two models -- a draft model (small and fast) plus a target model (large).
Algorithm (per round):
1. Draft phase: the small model generates K candidate tokens (typically K=4-8)
2. Verify phase: the large model checks all K tokens in a single forward pass
3. Accept/reject: compare draft vs target probabilities
4. Correction: on rejection, sample from the adjusted distribution
Key insight: verification is nearly free -- the same ~70 ms of weight loading, but K tokens get checked at once.
Q: Acceptance criterion -- how do we decide to accept a token?
A:
Formula: $$\alpha(x) = \min\left(1, \frac{p(x)}{q(x)}\right)$$
Where:
- \(p(x)\) -- probability under the target model
- \(q(x)\) -- probability under the draft model
Interpretation:
- If \(p(x) \geq q(x)\) (the target likes the token at least as much) -> always accept (\(\alpha = 1\))
- If \(p(x) < q(x)\) (the draft overestimated it) -> accept with probability \(p/q\)
Correction distribution: $$p_{\text{correction}}(x) = \frac{\max(0, p(x) - q(x))}{\sum_{x'} \max(0, p(x') - q(x'))}$$
Guarantee: the output distribution matches the target model exactly -- lossless acceleration.
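A minimal sketch of a single accept/reject step under this criterion (`p` and `q` are assumed to be the target and draft probability vectors over the vocabulary, `x` the drafted token id):

```python
import torch

def accept_or_resample(p, q, x):
    # Accept the drafted token x with probability min(1, p(x)/q(x))
    if torch.rand(()).item() < min(1.0, (p[x] / q[x]).item()):
        return x
    # On rejection, resample from the correction distribution max(0, p - q)
    residual = torch.clamp(p - q, min=0.0)
    return torch.multinomial(residual / residual.sum(), 1).item()
```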
Q: How to choose a draft model?
A:
| Approach | Acceptance Rate | Pros | Cons |
|---|---|---|---|
| Same family (LLaMA-7B -> LLaMA-70B) | 70-80% | Easy, no training | Limited to model families |
| Distilled | 85%+ | Best alignment | Requires training infra |
| Self-drafting (early exit) | 60-70% | No extra model | Complex implementation |
Requirements:
1. Speed: draft generation must be cheaper than a target forward pass (rule of thumb: ~10-15% of the target's parameters)
2. Alignment: probability distributions similar to the target's
3. Vocabulary: exactly the same tokenizer
Q: When does Speculative Decoding not work?
A:
Bad cases:
1. Low acceptance rate (< 60%): the overhead exceeds the benefit
2. Small target model: the draft overhead doesn't pay off
3. Very diverse outputs (creative writing): hard to predict
4. Batch inference: memory conflicts, harder to implement
Good cases:
- Large models (>30B parameters)
- Repetitive/deterministic outputs (code, structured data)
- A same-family draft model is available
- Latency-critical applications
```python
# vLLM with speculative decoding
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    speculative_model="meta-llama/Llama-2-7b-hf",
    num_speculative_tokens=5,
)
```
Mamba & State Space Models (SSM)
Q: What are State Space Models and how do they relate to RNNs?
A:
State Space Models (SSMs) are an architecture for sequence modeling with linear complexity \(O(N)\).
Continuous SSM (from control theory): $$\frac{d\mathbf{h}(t)}{dt} = \mathbf{A}\mathbf{h}(t) + \mathbf{B}x(t), \qquad y(t) = \mathbf{C}\mathbf{h}(t)$$
Discretization (Euler method): $$\mathbf{h}_{t+1} = \mathbf{A}^*\mathbf{h}_t + \mathbf{B}^*x_t$$
Connection to RNNs: discrete SSMs are linear RNNs (RNNs with identity activation).
Mamba discretization (zero-order hold): $$\mathbf{A}^* = \exp(\Delta\mathbf{A}), \qquad \mathbf{B}^* = (\Delta\mathbf{A})^{-1}(\exp(\Delta\mathbf{A}) - \mathbf{I})\,\Delta\mathbf{B}$$
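To make the linear-RNN view concrete, a minimal sketch for a scalar input signal (all names and shapes are illustrative; `A_bar`, `B_bar`, `C` are the discretized parameters):

```python
import torch

def ssm_recurrence(A_bar, B_bar, C, x):
    # h_t = A* h_{t-1} + B* x_t ; y_t = C h_t  -- an RNN with identity activation
    h = torch.zeros(A_bar.shape[0])
    ys = []
    for x_t in x:                      # sequential over timesteps
        h = A_bar @ h + B_bar * x_t    # linear state update, no nonlinearity
        ys.append(C @ h)
    return torch.stack(ys)
```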
Q: Why are SSMs better than vanilla RNNs?
A:
RNN problems:
1. Difficulty modeling long-range dependencies
2. Slow training (sequential computation)
SSM solutions:
1. HiPPO framework -- principled initialization of the \(\mathbf{A}\) matrix (Sequential MNIST: 68% -> 98%)
2. Parallel training via convolution with a precomputed kernel: $$\mathbf{y} = \mathbf{K} * \mathbf{x}, \qquad \mathbf{K} = (\mathbf{C}\mathbf{B}^*, \mathbf{C}\mathbf{A}^*\mathbf{B}^*, \mathbf{C}(\mathbf{A}^*)^2\mathbf{B}^*, \ldots)$$
S4 model (2022): Linear time computation via structured \(\mathbf{A}\) matrix
Q: What is the key difference between Mamba and S4?
A:
Selective State Space Models: Mamba makes the parameters input-dependent.
S4 (static parameters): \(\Delta, \mathbf{B}, \mathbf{C}\) are fixed for all tokens.
Mamba (input-dependent parameters): $$\Delta_t, \mathbf{B}_t, \mathbf{C}_t = f(x_t)$$
Effect of selectivity:
- Content-aware reasoning: the model can "forget" irrelevant context and "remember" what matters
- Variable memory: different tokens use different amounts of effective context
- Attention-like behavior: via gating mechanisms
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MambaBlock(nn.Module):
    def __init__(self, d_model, d_state=16, d_conv=4, expand=2):
        super().__init__()
        self.d_inner = d_model * expand
        self.proj = nn.Linear(d_model, self.d_inner * 2)
        # Depthwise causal conv; the extra right padding is trimmed in forward()
        self.conv = nn.Conv1d(self.d_inner, self.d_inner, d_conv,
                              padding=d_conv - 1, groups=self.d_inner)
        # Input-dependent (selective) projections for Δ, B, C
        self.proj_dt = nn.Linear(self.d_inner, self.d_inner)
        self.proj_B = nn.Linear(self.d_inner, d_state)
        self.proj_C = nn.Linear(self.d_inner, d_state)
        # Diagonal state matrix per channel, negative for a stable (decaying) state
        self.A = nn.Parameter(-torch.rand(self.d_inner, d_state))

    def forward(self, x):
        B, L, D = x.shape
        xz = self.proj(x)
        x, z = xz.chunk(2, dim=-1)
        x = self.conv(x.transpose(1, 2))[:, :, :L].transpose(1, 2)
        dt = F.softplus(self.proj_dt(x))  # per-token step size Δ_t > 0
        B_t = self.proj_B(x)
        C_t = self.proj_C(x)
        y = selective_scan(x, dt, B_t, C_t, self.A)  # sequential sketch below
        return y * F.silu(z)  # gated output branch
```
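The `selective_scan` call above is left abstract; below is a sequential reference implementation consistent with the shapes in this sketch (the real Mamba kernel fuses this loop into a hardware-aware parallel scan):

```python
def selective_scan(x, dt, B, C, A):
    """Sequential reference scan. Assumed shapes (matching the sketch above):
    x, dt: (batch, L, d_inner); B, C: (batch, L, d_state); A: (d_inner, d_state)."""
    batch, L, d_inner = x.shape
    h = x.new_zeros(batch, d_inner, A.shape[-1])
    ys = []
    for t in range(L):
        dA = torch.exp(dt[:, t, :, None] * A)  # ZOH: exp(Δ_t A), diagonal A
        dBx = dt[:, t, :, None] * B[:, t, None, :] * x[:, t, :, None]  # Δ_t B_t x_t
        h = dA * h + dBx                       # selective state update
        ys.append((h * C[:, t, None, :]).sum(-1))  # y_t = C_t h_t
    return torch.stack(ys, dim=1)
```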
Q: Mamba vs Transformer -- complexity comparison?
A:
| Aspect | Transformer | Mamba/SSM |
|---|---|---|
| Training | \(O(N^2 \cdot d)\) | \(O(N \cdot d^2)\) |
| Inference | \(O(N \cdot d^2)\) cache | \(O(d^2)\) constant |
| Memory | \(O(N \cdot d)\) KV cache | \(O(d)\) state |
| Long sequences | Quadratic bottleneck | Linear scaling |
When to prefer Mamba:
- Very long sequences (DNA, audio, video)
- Real-time streaming applications
- Memory-constrained inference

When to prefer a Transformer:
- In-context learning tasks
- Retrieval-heavy workloads
- Already-optimized infrastructure
Q: How does Mamba's hardware-aware implementation work?
A:
Mamba uses a parallel associative scan for efficient GPU computation.
Challenge: SSM -- inherently sequential: \(\mathbf{h}_{t+1} = \mathbf{A}^*_t\mathbf{h}_t + \mathbf{B}^*_t x_t\)
Solution: express the update as an associative operation, enabling a parallel prefix-sum (scan) algorithm (see the sketch below):
- \(O(\log N)\) parallel steps
- Full GPU utilization
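Why this works: the linear update \(h \mapsto a h + b\) composes associatively, so pairs \((a, b)\) can be combined in any grouping by a scan; a minimal sketch of the combine operation:

```python
def combine(f, g):
    # f = (a1, b1) represents h -> a1*h + b1; applying f then g gives
    # h -> a2*(a1*h + b1) + b2 = (a1*a2)*h + (a2*b1 + b2) -- the same form
    a1, b1 = f
    a2, b2 = g
    return (a1 * a2, a2 * b1 + b2)
```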
Memory optimization:
- Recomputation: don't store intermediate states; recompute them during the backward pass
- Chunked computation: process the sequence in chunks and merge states

Mamba-2 improvements:
- State Space Duality (SSD) framework
- 2-8x faster than Mamba-1
- Better scaling laws
Q: Hybrid Mamba + Transformer architectures -- when and why?
A:
Hybrid models (Jamba, Bamba) combine SSM and attention layers.
Architecture pattern -- Jamba (AI21, 2024):
- 12B parameters (94B with MoE)
- 4:1 ratio of Mamba:Attention layers
- 3x longer effective context than a pure Transformer
- 140K context window
When to use a hybrid:
- Need both long context AND in-context learning
- Retrieval-augmented generation
- Multi-modal (text + long context)
Sequence Modeling Beyond RNN (TCN, WaveNet)
Sources: Shadecoder TCN Guide (2025), DeepMind WaveNet paper (2016)
Q: What is a Temporal Convolutional Network (TCN)?
A:
A TCN is a sequence-modeling architecture that uses convolutional layers instead of recurrent ones.
Three key ideas:
1. Causal convolutions: the output at time \(t\) depends only on \(t\) and earlier
2. Dilated convolutions: exponential growth of the receptive field without parameter growth
3. Residual connections: stabilize training of deep networks
Dilated convolution: $$y(t) = \sum_{i=0}^{k-1} f(i) \cdot x(t - d \cdot i)$$
Receptive field (with dilations doubling per layer, \(d_i = 2^{i-1}\)): $$R = 1 + (k-1) \cdot \sum_{i=1}^{n} d_i = 1 + (k-1) \cdot (2^n - 1)$$
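Worked example: with kernel size \(k = 3\) and \(n = 8\) layers (dilations 1, 2, 4, ..., 128), \(R = 1 + 2 \cdot (2^8 - 1) = 511\) timesteps -- eight layers already cover about half a thousand steps of history.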
```python
import torch
import torch.nn as nn

class TCNBlock(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size, dilation, dropout=0.2):
        super().__init__()
        # Pad both sides, then "chomp" the right side -> causal convolution
        self.padding = (kernel_size - 1) * dilation
        self.conv1 = nn.Conv1d(in_channels, out_channels, kernel_size,
                               padding=self.padding, dilation=dilation)
        self.conv2 = nn.Conv1d(out_channels, out_channels, kernel_size,
                               padding=self.padding, dilation=dilation)
        self.bn1 = nn.BatchNorm1d(out_channels)
        self.bn2 = nn.BatchNorm1d(out_channels)
        self.dropout = nn.Dropout(dropout)
        self.relu = nn.ReLU()
        # 1x1 conv matches channel counts for the residual connection
        self.residual = nn.Conv1d(in_channels, out_channels, 1) \
            if in_channels != out_channels else nn.Identity()

    def _chomp(self, x):
        # Trim the extra right-side padding after EACH conv so lengths stay aligned
        return x[:, :, :-self.padding] if self.padding > 0 else x

    def forward(self, x):
        residual = self.residual(x)
        out = self.relu(self.bn1(self._chomp(self.conv1(x))))
        out = self.dropout(out)
        out = self.relu(self.bn2(self._chomp(self.conv2(out))))
        out = self.dropout(out)
        return self.relu(out + residual)
```
Q: TCN vs LSTM/GRU -- when to use which?
A:
| Aspect | TCN | LSTM/GRU |
|---|---|---|
| Parallelization | Full (train faster) | Sequential (slow) |
| Training stability | Excellent (no vanishing gradients) | Requires careful init |
| Long-term memory | Fixed receptive field | Theoretically infinite |
| Inference speed | O(1) with precomputed | O(sequence length) |
| Variable length | Requires padding/masking | Natural handling |
When a TCN is better:
1. Long sequences where parallel training matters
2. Need stable gradients without special tricks
3. Real-time/low-latency inference
4. Time series forecasting, anomaly detection

When LSTM/GRU is better:
1. Variable-length sequences without padding
2. Need truly unbounded memory
3. Irregularly sampled data
Q: What is WaveNet and how does it relate to TCN?
A:
WaveNet (DeepMind, 2016) is a generative model for raw audio waveforms that uses dilated causal convolutions.
Key innovations:
1. Raw waveform modeling -- no vocoder, no spectrograms
2. Exponentially dilated convolutions -- receptive field of ~16K samples
3. Gated activations -- \(\tanh(W_f * x) \odot \sigma(W_g * x)\) (see the sketch below)
4. mu-law companding -- quantize 16-bit audio to 256 values
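The gated activation from item 3 in code form; a minimal sketch, where the two parallel convolutions (`conv_f`, `conv_g`) are my illustrative names and causal padding is omitted for brevity:

```python
import torch
import torch.nn as nn

class GatedActivation(nn.Module):
    def __init__(self, channels, kernel_size=2, dilation=1):
        super().__init__()
        self.conv_f = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.conv_g = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):
        # "Filter" path decides content; "gate" path decides how much passes through
        return torch.tanh(self.conv_f(x)) * torch.sigmoid(self.conv_g(x))
```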
| Feature | WaveNet | TCN |
|---|---|---|
| Purpose | Audio generation | General sequence |
| Gating | Yes (\(\tanh \odot \sigma\)) | No (ReLU) |
| Output | Autoregressive | Single pass |
| Speed | Slow (sample-by-sample) | Fast |
Q: Killer question -- why didn't TCNs replace LSTMs everywhere?
A:
Reasons LSTM/GRU remain popular:

1. Fixed receptive field -- a TCN only sees N steps back, while an LSTM can in theory remember indefinitely
2. Variable-length sequences -- an LSTM handles varying lengths naturally; a TCN requires padding/masking
3. Streaming inference latency -- a TCN must keep the whole receptive-field window; an LSTM keeps only its hidden state
4. Inductive bias -- an LSTM is built for sequential dependencies; a TCN looks for local patterns
5. Ecosystem momentum -- decades of LSTM code and tutorials
Hybrid approaches (2025+):
- TCN + Attention (the best of both)
- Conformer for speech (CNN + Transformer)
- S4/Mamba for truly long sequences