
DL Interview: Special Topics

~11 min read


Time Series DL (DeepAR, TFT), Uncertainty Quantification (MC Dropout, Deep Ensembles, Conformal Prediction, OOD Detection), t-SNE & UMAP, Speculative Decoding, Mamba & State Space Models (SSM), Sequence Modeling Beyond RNN (TCN, WaveNet).


Time Series Deep Learning

Sources: TFT Tutorial (Towards Data Science), Google TFT Paper

Q: DeepAR vs TFT vs Prophet -- a comparison

A:

| Model | Type | Key Features | When to Use |
|-------|------|--------------|-------------|
| Prophet | Additive regression | Trend, seasonality, holidays | Baseline, simple patterns |
| DeepAR | Autoregressive RNN | Probabilistic, learns across series | Many similar series, cold start |
| TFT | Transformer + attention | Interpretable, multi-horizon | Complex patterns, need interpretability |

DeepAR (Amazon): Autoregressive RNN with likelihood output, learns global model from multiple series.

TFT (Google): Variable Selection Network + Multi-head attention, supports static + time-varying + known future features, interpretable attention.

Q: What is the Temporal Fusion Transformer (TFT)?

A:

Architecture:
  1. Variable Selection Network: learns which features matter
  2. LSTM Encoder-Decoder: local temporal patterns
  3. Multi-head Attention: long-term dependencies
  4. Gated Residual Network: non-linear processing

Feature types:
  • Time-varying known: known in the future (holidays, promotions)
  • Time-varying unknown: observed past only (sales, demand)
  • Static real/categorical: constant per series (product ID)
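A hedged sketch of how these three feature types map onto a training setup with the pytorch-forecasting library; the dataframe df and its column names are hypothetical, so check the TimeSeriesDataSet docs for the full signature:

from pytorch_forecasting import TimeSeriesDataSet, TemporalFusionTransformer

# df: hypothetical long-format dataframe, one row per (series, time) pair
dataset = TimeSeriesDataSet(
    df,
    time_idx="time_idx",
    target="sales",
    group_ids=["product_id"],
    static_categoricals=["product_id"],                # constant per series
    time_varying_known_reals=["price", "is_holiday"],  # known in the future
    time_varying_unknown_reals=["sales"],              # observed past only
    max_encoder_length=36,
    max_prediction_length=6,
)
tft = TemporalFusionTransformer.from_dataset(dataset, hidden_size=32)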

Q: Interpretable attention in TFT -- how does it work?

A:

Three types of interpretability:
  1. Seasonality-wise: attention weights show which past timesteps matter
  2. Feature-wise: the Variable Selection Network outputs an importance per feature
  3. Extreme events: analyze behavior on rare value ranges


Uncertainty Quantification (UQ)

Sources: Kendall & Gal "What Uncertainties Do We Need in Bayesian Deep Learning" (2017), Lakshminarayanan et al. "Deep Ensembles" (2017), Gal & Ghahramani "MC Dropout" (2016)

Q: Aleatoric vs epistemic uncertainty -- what is the difference?

A:

| Type | Definition | Can it be reduced? | Source |
|------|------------|--------------------|--------|
| Aleatoric | Irreducible noise in the data | No (more of the same data won't help) | Sensor noise, label ambiguity |
| Epistemic | Uncertainty about the model | Yes (better data/model) | Limited data, model misspecification |

Mathematical formulation:
\[ p(y|x, D) = \int p(y|x, \theta) \, p(\theta|D) \, d\theta \]

  • \(p(y|x, \theta)\) captures aleatoric (noise in labels)
  • \(p(\theta|D)\) captures epistemic (uncertainty in weights)

Why distinguish:
  • High epistemic -> collect more data
  • High aleatoric -> improve sensors, reduce label noise (or model the noise directly, as sketched below)
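A minimal sketch of separating the two in a regression setup: a heteroscedastic head learns a per-input variance (aleatoric), while the spread of the predicted mean across MC Dropout samples or ensemble members (see below) estimates the epistemic part:

import torch
import torch.nn as nn

class HeteroscedasticRegressor(nn.Module):
    """Predicts a per-input mean and log-variance (aleatoric uncertainty)."""
    def __init__(self, d_in, d_hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.mean_head = nn.Linear(d_hidden, 1)
        self.log_var_head = nn.Linear(d_hidden, 1)  # log-variance for stability

    def forward(self, x):
        h = self.body(x)
        return self.mean_head(h), self.log_var_head(h)

def gaussian_nll(mean, log_var, y):
    # NLL of y under N(mean, exp(log_var)): the model can "explain away"
    # noisy regions by inflating variance instead of overfitting them
    return 0.5 * (log_var + (y - mean) ** 2 / log_var.exp()).mean()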

Q: How does Monte Carlo (MC) Dropout work?

A:

Key insight: Dropout at test time approximates Bayesian inference.

import torch

def mc_dropout_predict(model, x, T=100):
    # model.train() keeps dropout active; note it also puts BatchNorm in
    # training mode -- enable only the dropout layers if that matters
    model.train()
    predictions = []

    for _ in range(T):
        with torch.no_grad():
            pred = model(x)
            predictions.append(pred)

    predictions = torch.stack(predictions)  # (T, batch, ...)
    mean = predictions.mean(dim=0)          # predictive mean
    variance = predictions.var(dim=0)       # predictive variance (uncertainty)

    return mean, variance

Advantages:
  • No model changes needed
  • Cheap to adopt (training cost unchanged; inference costs T forward passes)
  • Strong baseline for UQ

Limitations:
  • Requires dropout in the architecture
  • Variance can be underestimated

Q: Deep Ensembles for uncertainty estimation?

A:

Approach: Train M models with different random seeds, aggregate predictions.

import torch

class DeepEnsemble:
    def __init__(self, models):
        self.models = models  # M independently trained models (different seeds)

    def predict(self, x):
        with torch.no_grad():
            predictions = torch.stack([model(x) for model in self.models])

        mean = predictions.mean(dim=0)      # ensemble prediction
        variance = predictions.var(dim=0)   # disagreement = epistemic uncertainty
        return mean, variance

Comparison with MC Dropout:

| Aspect | MC Dropout | Deep Ensembles |
|--------|------------|----------------|
| Training cost | Same | M x more |
| Inference cost | T forward passes | M forward passes |
| Diversity source | Stochastic dropout | Different minima |
| Typical T/M | 10-100 | 5-10 |

Best practice: Deep Ensembles + Temperature Scaling

Q: How do you assess a model's calibration?

A:

Expected Calibration Error (ECE):
\[ ECE = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \text{acc}(B_m) - \text{conf}(B_m) \right| \]
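A minimal sketch of computing ECE with equal-width confidence bins (M=15 is a common default):

import torch
import torch.nn.functional as F

def expected_calibration_error(logits, labels, n_bins=15):
    probs = F.softmax(logits, dim=-1)
    conf, preds = probs.max(dim=-1)
    correct = preds.eq(labels).float()
    edges = torch.linspace(0, 1, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            # |B_m|/n * |acc(B_m) - conf(B_m)|
            ece += in_bin.float().mean() * (correct[in_bin].mean() - conf[in_bin].mean()).abs()
    return float(ece)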

Temperature Scaling:

import torch
import torch.nn as nn
import torch.nn.functional as F

def temperature_scale(logits, temperature):
    return F.softmax(logits / temperature, dim=-1)

# Learn T on a validation set by minimizing NLL
# (LBFGS expects a closure that recomputes the loss)
temperature = nn.Parameter(torch.ones(1))
optimizer = torch.optim.LBFGS([temperature], lr=0.01, max_iter=50)

Q: Conformal Prediction -- what is it?

A:

Goal: Distribution-free, finite-sample valid prediction sets.

Key property: Under exchangeability, coverage is guaranteed:
\[ P(Y_{n+1} \in \hat{C}(X_{n+1})) \geq 1 - \alpha \]

Split Conformal procedure:
  1. Compute calibration scores: \(s_i = 1 - f(x_i)_{y_i}\)
  2. Find the quantile: \(\hat{q}\) = the \(\lceil (n+1)(1-\alpha) \rceil\)-th smallest score (the finite-sample-corrected \(1-\alpha\) quantile)
  3. Prediction set: \(\{y : f(x_{new})_y \geq 1 - \hat{q}\}\)

import torch
import torch.nn.functional as F

def conformal_prediction(cal_logits, cal_labels, test_logits, alpha=0.1):
    cal_probs = F.softmax(cal_logits, dim=-1)
    scores = 1 - cal_probs[torch.arange(len(cal_labels)), cal_labels]

    n = len(scores)
    q_level = min((n + 1) * (1 - alpha) / n, 1.0)  # finite-sample correction
    # 'higher' interpolation keeps the coverage guarantee valid
    q = torch.quantile(scores, q_level, interpolation='higher')

    test_probs = F.softmax(test_logits, dim=-1)
    prediction_sets = test_probs >= (1 - q)  # boolean mask over classes

    return prediction_sets

Advantages:
  • No distribution assumptions
  • Works with any model
  • Guaranteed coverage

Q: OOD detection -- how do you detect out-of-distribution inputs?

A:

Baselines:

1. Max Softmax Probability (MSP):

import torch
import torch.nn.functional as F

def msp_ood_score(logits):
    probs = F.softmax(logits, dim=-1)
    # .max returns (values, indices); higher score = more OOD
    return 1 - probs.max(dim=-1).values

2. Energy-based:
\[ E(x) = -T \log \sum_i e^{f_i(x)/T} \]

def energy_ood_score(logits, T=1.0):
    return -T * torch.logsumexp(logits / T, dim=-1)

3. Mahalanobis distance (in feature space):

def mahalanobis_ood(features, class_means, cov):
    # features: (N, d); class_means: iterable of (d,); cov: shared (d, d)
    cov_inv = torch.inverse(cov)
    scores = []
    for mean in class_means:
        diff = features - mean
        # per-sample squared Mahalanobis distance to this class mean
        scores.append((diff @ cov_inv * diff).sum(dim=-1))
    # distance to the nearest class; higher = more OOD
    return torch.min(torch.stack(scores), dim=0).values

Evaluation metrics: AUROC, AUPR, FPR@95TPR
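A sketch of computing these metrics from arrays of ID and OOD scores, with scores oriented so that higher = more OOD:

import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def ood_metrics(id_scores, ood_scores):
    y = np.concatenate([np.zeros(len(id_scores)), np.ones(len(ood_scores))])
    s = np.concatenate([id_scores, ood_scores])
    auroc = roc_auc_score(y, s)
    aupr = average_precision_score(y, s)
    # FPR@95TPR: pick the threshold at which 95% of OOD samples are detected,
    # then measure how much ID data is falsely flagged as OOD
    thresh = np.percentile(ood_scores, 5)
    fpr95 = float((np.asarray(id_scores) >= thresh).mean())
    return auroc, aupr, fpr95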

Q: Selective prediction -- when should the model abstain?

A:

Idea: Model can abstain when uncertain, trading coverage for accuracy.

import torch
import torch.nn.functional as F

def selective_predict(model, x, threshold=0.9):
    probs = F.softmax(model(x), dim=-1)
    max_prob, pred = probs.max(dim=-1)

    accepted = max_prob >= threshold  # abstain on the rest
    return pred[accepted], accepted

Coverage-accuracy trade-off:
  • Higher threshold -> higher accuracy, lower coverage
  • Choose the threshold based on business requirements (a threshold sweep is sketched below)
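A sketch of sweeping the threshold to trace the trade-off; logits and labels are assumed precomputed on a validation set:

import torch
import torch.nn.functional as F

def risk_coverage_curve(logits, labels, thresholds):
    probs = F.softmax(logits, dim=-1)
    conf, preds = probs.max(dim=-1)
    curve = []
    for t in thresholds:
        mask = conf >= t
        coverage = mask.float().mean().item()          # fraction answered
        accuracy = (preds[mask] == labels[mask]).float().mean().item() \
            if mask.any() else float('nan')            # accuracy on answered
        curve.append((t, coverage, accuracy))
    return curve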

Q: Which UQ method should be used when?

A:

| Scenario | Recommended Method |
|----------|--------------------|
| Quick baseline | Temperature scaling + MC Dropout |
| Maximum reliability | Deep Ensembles + temperature scaling |
| Safety-critical | Conformal prediction + ensembles |
| Need sets, not scores | Conformal prediction |
| OOD detection | Energy-based + Mahalanobis |
| Resource-constrained | MC Dropout (T=10-20) |
| Aleatoric uncertainty | Heteroscedastic loss (predict variance) |

Dimensionality Reduction for Visualization (t-SNE, UMAP)

Sources: AI Under the Hood: t-SNE & UMAP (2025)

Q: t-SNE -- how does it work?

A:

t-Distributed Stochastic Neighbor Embedding (van der Maaten & Hinton, 2008).

Core idea: Preserve local neighborhoods -- similar points in high-D should be close in low-D.

Algorithm:
  1. Compute similarities in high-D:
\[ p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)} \]
  2. Compute similarities in low-D (Student t-distribution):
\[ q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}} \]
  3. Minimize the KL divergence:
\[ C = KL(P\|Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}} \]

Why Student t? Heavy tails allow dissimilar points to be far apart without crushing local structure.

Perplexity: Controls effective number of neighbors. Typical: 5-50.

from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, perplexity=30, learning_rate=200,
            n_iter=1000, random_state=42)
embedding = tsne.fit_transform(X_high_dim)

Q: UMAP -- how does it work and how does it differ from t-SNE?

A:

Uniform Manifold Approximation and Projection (McInnes et al., 2018).

| Aspect | t-SNE | UMAP |
|--------|-------|------|
| Theory | Probabilistic (KL divergence) | Topological (fuzzy simplicial sets) |
| Global structure | Poor | Better preserved |
| Speed | Slow (O(n^2)) | Faster (O(n log n)) |
| New data | Can't embed | Can transform new points |
| Scalability | ~100K samples | Millions of samples |
| Parameters | perplexity | n_neighbors, min_dist, metric |

Key parameters:
  • n_neighbors (default 15): local vs global balance
  • min_dist (default 0.1): spread of points

import umap

reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, metric='euclidean')
embedding = reducer.fit_transform(X_high_dim)

# Can transform new data!
new_embedding = reducer.transform(X_new)

Q: When to use t-SNE vs UMAP vs PCA?

A:

| Method | Speed | Global Structure | New Data | Use Case |
|--------|-------|------------------|----------|----------|
| PCA | Very fast | Preserved | Yes | Initial exploration, linear data |
| t-SNE | Slow | Poor | No | Cluster visualization, small datasets |
| UMAP | Medium | Better | Yes | General purpose, large datasets |

Q: Common pitfalls when visualizing embeddings?

A:

  1. Random seed sensitivity: Different runs produce different shapes -- always set random_state
  2. Cluster size does not equal actual size: t-SNE/UMAP distort distances
  3. Global distances meaningless: Far clusters may not be far in high-D
  4. Parameters matter: Same data, different perplexity -> very different visualizations
  5. Over-interpretation: Patterns may be artifacts -- validate with other methods

Q: Visualizing BERT/ResNet embeddings -- best practices?

A:

import torch
from transformers import AutoModel, AutoTokenizer
import umap

model = AutoModel.from_pretrained('bert-base-uncased')
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model.eval()  # disable dropout for deterministic embeddings

def get_embeddings(texts):
    inputs = tokenizer(texts, return_tensors='pt', padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[:, 0, :].numpy()  # [CLS] token embedding

embeddings = get_embeddings(texts)  # texts: a list of strings to visualize

# UMAP for visualization
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, metric='cosine')
vis = reducer.fit_transform(embeddings)

Best practices:
  • Use metric='cosine' for text/image embeddings
  • Preprocess with PCA if >100 dimensions
  • Try multiple n_neighbors values
  • Color by class/cluster to validate separation


Speculative Decoding (LLM Inference Acceleration)

Sources: Michael Brenndoerfer: Speculative Decoding (2026), BentoML Blog (2025)

Q: What is speculative decoding and why is it needed?

A:

Speculative decoding is a technique that accelerates LLM inference by 2-3x with no loss in generation quality.

The problem: LLM inference is memory-bound, not compute-bound.
  • A 70B model needs ~140GB of weights read from memory for every token
  • A100: 2 TB/s bandwidth -> ~70ms just to load the weights
  • GPU compute sits idle ~95% of the time

Solution: Generate multiple tokens per large model forward pass.

Q: How does speculative decoding work?

A:

Architecture: two models -- a draft model (small, fast) plus a target model (large).

Algorithm (per round):
  1. Draft phase: the small model generates K candidate tokens (typically K=4-8)
  2. Verify phase: the large model checks all K tokens in a single forward pass
  3. Accept/reject: compare draft vs target probabilities
  4. Correction: on rejection, sample from an adjusted distribution

Key insight: verification is almost free -- the same ~70ms of weight loading, but K tokens are checked at once.

Q: Acceptance criterion -- how is it decided whether to accept a token?

A:

Formula:
\[ \alpha(x) = \min\left(1, \frac{p(x)}{q(x)}\right) \]

Where:
  • \(p(x)\) -- the probability under the target model
  • \(q(x)\) -- the probability under the draft model

Interpretation:
  • If \(p(x) \geq q(x)\) (the target likes the token at least as much) -> always accept (\(\alpha = 1\))
  • If \(p(x) < q(x)\) (the draft overestimated it) -> accept with probability \(p/q\)

Correction distribution:
\[ p_{\text{correction}}(x) = \frac{\max(0, p(x) - q(x))}{\sum_{x'} \max(0, p(x') - q(x'))} \]

Guarantee: the output distribution exactly matches the target model's -- lossless acceleration.
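A minimal sketch of one accept/reject step for a single drafted token, mirroring the formulas above; p and q are the target's and draft's probability vectors over the vocabulary:

import torch

def accept_or_resample(p, q, x):
    # Accept the drafted token x with probability min(1, p(x)/q(x))
    if torch.rand(()) < torch.clamp(p[x] / q[x], max=1.0):
        return x
    # Rejected: sample from the correction distribution max(0, p - q), renormalized
    residual = torch.clamp(p - q, min=0)
    return torch.multinomial(residual / residual.sum(), 1).item()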

Q: How do you choose a draft model?

A:

| Approach | Acceptance Rate | Pros | Cons |
|----------|-----------------|------|------|
| Same family (LLaMA-7B -> LLaMA-70B) | 70-80% | Easy, no training | Limited to model families |
| Distilled draft | 85%+ | Best alignment | Requires training infra |
| Self-drafting (early exit) | 60-70% | No extra model | Complex implementation |

Requirements:
  1. Speed: draft generation must cost far less than a target forward pass (rule of thumb: ~10-15% of the target's parameters)
  2. Alignment: probability distributions similar to the target's
  3. Vocabulary: exactly the same tokenizer

Q: When does speculative decoding not work?

A:

Bad cases:
  1. Low acceptance rate (<60%): the overhead outweighs the benefit
  2. Small target model: the draft overhead doesn't pay off
  3. Very diverse outputs (creative writing): hard to predict
  4. Batch inference: memory conflicts, harder to implement

Good cases:
  • Large models (>30B parameters)
  • Repetitive/deterministic outputs (code, structured data)
  • A same-family draft model is available
  • Latency-critical applications

# vLLM with speculative decoding (argument names vary across vLLM versions)
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    speculative_model="meta-llama/Llama-2-7b-hf",
    num_speculative_tokens=5,
)

Mamba & State Space Models (SSM)

Q: Что такое State Space Models и как они связаны с RNN?

A:

State Space Models (SSMs) are an architecture for sequence modeling with linear \(O(N)\) complexity.

Continuous SSM (from control theory):
\[ \frac{d\mathbf{h}(t)}{dt} = \mathbf{A}\mathbf{h}(t) + \mathbf{B}x(t) \]

\[y(t) = \mathbf{Ch}(t) + \mathbf{D}x(t)\]

Discretization (Euler method):
\[ \mathbf{h}_{t+1} = \mathbf{A}^*\mathbf{h}_t + \mathbf{B}^*x_t \]

\[y_t = \mathbf{Ch}_t\]

Relation to RNNs: discrete SSMs are linear RNNs (an RNN with identity activation).

Mamba discretization (zero-order hold):
\[ \mathbf{A}^* = \exp(\Delta\mathbf{A}) \]

\[\mathbf{B}^* = (\Delta\mathbf{A})^{-1}(\exp(\Delta\mathbf{A}) - I)\Delta\mathbf{B}\]
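A sketch of this discretization and the resulting linear-RNN recurrence, assuming a single-input single-output SSM with state size N (shapes: A is (N, N), B and C are (N,)):

import torch

def discretize_zoh(A, B, delta):
    # A* = exp(ΔA);  B* = (ΔA)^{-1} (exp(ΔA) - I) ΔB
    dA = torch.matrix_exp(delta * A)
    dB = torch.linalg.solve(delta * A, (dA - torch.eye(len(A))) @ (delta * B))
    return dA, dB

def ssm_scan(x, dA, dB, C):
    # Discrete SSM = linear RNN: h_{t+1} = A* h_t + B* x_t,  y_t = C h_t
    h = torch.zeros(len(dA))
    ys = []
    for x_t in x:
        h = dA @ h + dB * x_t
        ys.append(C @ h)
    return torch.stack(ys)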

Q: Why are SSMs better than vanilla RNNs?

A:

RNN problems:
  1. Difficulty modeling long-range dependencies
  2. Slow training (sequential computation)

SSM solutions:
  1. HiPPO framework -- principled initialization of the \(\mathbf{A}\) matrix (Sequential MNIST: 68% -> 98%)
  2. Parallel training via convolution with a kernel \(\overline{\mathbf{K}}\) precomputed from the SSM parameters:
\[ \mathbf{y} = \overline{\mathbf{K}} * \mathbf{x} \]

S4 model (2022): Linear time computation via structured \(\mathbf{A}\) matrix

Q: What is the key difference between Mamba and S4?

A:

Selective State Space Models -- Mamba makes the SSM parameters input-dependent.

S4 (static parameters): \(\Delta, \mathbf{B}, \mathbf{C}\) are fixed for all tokens

Mamba (input-dependent parameters):
\[ \Delta_t, \mathbf{B}_t, \mathbf{C}_t = f(x_t) \]

The effect of selectivity:
  • Content-aware reasoning: the model can "forget" irrelevant tokens and "remember" important ones
  • Variable memory: different tokens use different amounts of effective context
  • Attention-like behavior: via gating mechanisms

import torch
import torch.nn as nn
import torch.nn.functional as F

class MambaBlock(nn.Module):
    """Simplified selective SSM block (schematic; real Mamba uses a fused scan)."""
    def __init__(self, d_model, d_state=16, d_conv=4, expand=2):
        super().__init__()
        self.d_inner = d_model * expand
        self.proj = nn.Linear(d_model, self.d_inner * 2)
        self.conv = nn.Conv1d(self.d_inner, self.d_inner, d_conv,
                              padding=d_conv - 1, groups=self.d_inner)

        # Input-dependent (selective) projections for Δ, B, C
        self.proj_dt = nn.Linear(self.d_inner, self.d_inner)
        self.proj_B = nn.Linear(self.d_inner, d_state)
        self.proj_C = nn.Linear(self.d_inner, d_state)
        self.A_log = nn.Parameter(torch.zeros(self.d_inner, d_state))
        self.out_proj = nn.Linear(self.d_inner, d_model)

    def forward(self, x):
        Bsz, L, _ = x.shape
        x, z = self.proj(x).chunk(2, dim=-1)
        # Causal depthwise conv: trim the right overhang back to length L
        x = self.conv(x.transpose(1, 2))[:, :, :L].transpose(1, 2)
        x = F.silu(x)

        dt = F.softplus(self.proj_dt(x))   # (B, L, d_inner), Δ_t > 0
        B_t = self.proj_B(x)               # (B, L, d_state)
        C_t = self.proj_C(x)               # (B, L, d_state)

        y = self.selective_scan(x, dt, B_t, C_t)
        return self.out_proj(y * F.silu(z))

    def selective_scan(self, x, dt, B_t, C_t):
        # Naive sequential reference; the real model uses a parallel scan
        A = -torch.exp(self.A_log)                          # diagonal, negative
        h = x.new_zeros(x.shape[0], self.d_inner, A.shape[1])
        ys = []
        for t in range(x.shape[1]):
            dA = torch.exp(dt[:, t, :, None] * A)           # A* = exp(Δ_t A)
            dB = dt[:, t, :, None] * B_t[:, t, None, :]     # B* ≈ Δ_t B_t
            h = dA * h + dB * x[:, t, :, None]
            ys.append((h * C_t[:, t, None, :]).sum(-1))     # y_t = C_t h_t
        return torch.stack(ys, dim=1)

Q: Mamba vs Transformer -- complexity comparison?

A:

| Aspect | Transformer | Mamba/SSM |
|--------|-------------|-----------|
| Training | \(O(N^2 \cdot d)\) | \(O(N \cdot d^2)\) |
| Inference | \(O(N \cdot d^2)\) with KV cache | \(O(d^2)\), constant |
| Memory | \(O(N \cdot d)\) KV cache | \(O(d)\) state |
| Long sequences | Quadratic bottleneck | Linear scaling |

When to prefer Mamba:
  • Very long sequences (DNA, audio, video)
  • Real-time streaming applications
  • Memory-constrained inference

When to prefer a Transformer:
  • In-context learning tasks
  • Retrieval-heavy workloads
  • Already-optimized infrastructure

Q: How does Mamba's hardware-aware implementation work?

A:

Mamba uses a parallel associative scan for efficient GPU computation.

Challenge: the SSM recurrence is inherently sequential: \(\mathbf{h}_{t+1} = \mathbf{A}^*_t\mathbf{h}_t + \mathbf{B}^*_t x_t\)

Solution: express the update as an associative operation, enabling a parallel prefix-sum (scan) algorithm (see the sketch below):
  • \(O(\log N)\) parallel steps
  • Full GPU utilization
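The trick, as a sketch: each step of the recurrence is the affine map \(h \mapsto a h + b\), and composing two such maps is itself an affine map, which gives the associativity a prefix scan needs:

import torch

def combine(e1, e2):
    # (a1, b1) then (a2, b2): h -> a2*(a1*h + b1) + b2 = (a2*a1)*h + (a2*b1 + b2)
    a1, b1 = e1
    a2, b2 = e2
    return a2 * a1, a2 * b1 + b2

def sequential_scan(a_seq, b_seq):
    # Reference O(N) recurrence; because `combine` is associative, the same
    # prefix states can be computed in O(log N) depth on a GPU (for example
    # with jax.lax.associative_scan); Mamba fuses this into one CUDA kernel.
    h, states = torch.zeros_like(b_seq[0]), []
    for a_t, b_t in zip(a_seq, b_seq):
        h = a_t * h + b_t
        states.append(h)
    return torch.stack(states)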

Memory optimization:
  • Recomputation: don't store intermediate states; recompute them during the backward pass
  • Chunked computation: process the sequence in chunks, then merge states

Mamba-2 improvements:
  • State Space Duality (SSD) framework
  • 2-8x faster than Mamba-1
  • Better scaling laws

Q: Hybrid Mamba + Transformer architectures -- when and why?

A:

Hybrid models (Jamba, Bamba) combine SSM and attention layers.

Architecture pattern:

[SSM] [SSM] [Attn] [SSM] [SSM] [Attn] ...  # Jamba-style

Jamba (AI21, 2024):
  • 12B parameters (94B with MoE)
  • 4:1 ratio of Mamba:Attention layers
  • 3x longer effective context than a pure Transformer
  • 140K context window

When to use a hybrid:
  • Need both long context AND in-context learning
  • Retrieval-augmented generation
  • Multi-modal (text + long context)


Sequence Modeling Beyond RNN (TCN, WaveNet)

Sources: Shadecoder TCN Guide (2025), DeepMind WaveNet paper (2016)

Q: What is a Temporal Convolutional Network (TCN)?

A:

A TCN is a sequence-modeling architecture that uses convolutional layers instead of recurrent ones.

Three key ideas:
  1. Causal convolutions: the output at time \(t\) depends only on \(t\) and earlier
  2. Dilated convolutions: exponential receptive-field growth with no growth in parameters per layer
  3. Residual connections: stabilize the training of deep networks

Dilated convolution:
\[ y(t) = \sum_{i=0}^{k-1} f(i) \cdot x(t - d \cdot i) \]

Receptive field (with dilations doubling per layer, \(d_i = 2^{i-1}\)):
\[ R = 1 + (k-1) \cdot \sum_{i=1}^{n} d_i = 1 + (k-1) \cdot (2^n - 1) \]

import torch
import torch.nn as nn

class TCNBlock(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size, dilation, dropout=0.2):
        super().__init__()
        self.chomp = (kernel_size - 1) * dilation  # left padding to trim on the right

        self.conv1 = nn.Conv1d(in_channels, out_channels, kernel_size,
                               padding=self.chomp, dilation=dilation)
        self.conv2 = nn.Conv1d(out_channels, out_channels, kernel_size,
                               padding=self.chomp, dilation=dilation)
        self.bn1 = nn.BatchNorm1d(out_channels)
        self.bn2 = nn.BatchNorm1d(out_channels)
        self.dropout = nn.Dropout(dropout)
        self.relu = nn.ReLU()

        self.residual = nn.Conv1d(in_channels, out_channels, 1) \
            if in_channels != out_channels else nn.Identity()

    def _causal(self, x):
        # Trim the right overhang after EACH conv so lengths match the residual
        return x[:, :, :-self.chomp] if self.chomp > 0 else x

    def forward(self, x):
        residual = self.residual(x)
        out = self.relu(self.bn1(self._causal(self.conv1(x))))
        out = self.dropout(out)
        out = self.relu(self.bn2(self._causal(self.conv2(out))))
        out = self.dropout(out)
        return self.relu(out + residual)
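A sketch of stacking these blocks with exponentially increasing dilations, which realizes the receptive-field formula above (each block here has two convs, so the effective field is roughly twice the single-conv formula):

import torch.nn as nn

def build_tcn(in_channels, hidden_channels, n_blocks, kernel_size=3):
    layers, ch = [], in_channels
    for i in range(n_blocks):
        layers.append(TCNBlock(ch, hidden_channels, kernel_size, dilation=2 ** i))
        ch = hidden_channels
    return nn.Sequential(*layers)

# 8 blocks, k=3: R = 1 + (k-1) * (2^8 - 1) = 511 timesteps per conv stack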

Q: TCN vs LSTM/GRU -- when to use which?

A:

| Aspect | TCN | LSTM/GRU |
|--------|-----|----------|
| Parallelization | Full (faster training) | Sequential (slow) |
| Training stability | Excellent (no vanishing gradients) | Requires careful init |
| Long-term memory | Fixed receptive field | Theoretically unbounded |
| Inference speed | O(1) with precomputed buffers | O(sequence length) |
| Variable length | Requires padding/masking | Natural handling |

When TCN is better:
  1. Long sequences where parallel training matters
  2. Need stable gradients without special tricks
  3. Real-time/low-latency inference
  4. Time series forecasting, anomaly detection

When LSTM/GRU is better:
  1. Variable-length sequences without padding
  2. Need truly unbounded memory
  3. Irregularly sampled data

Q: What is WaveNet and how does it relate to TCN?

A:

WaveNet (DeepMind, 2016) is a generative model for raw audio waveforms built on dilated causal convolutions.

Key innovations:
  1. Raw waveform modeling -- no vocoder, no spectrograms
  2. Exponentially dilated convolutions -- receptive field of ~16K samples
  3. Gated activations -- \(\tanh(W_f * x) \odot \sigma(W_g * x)\) (see the sketch below)
  4. mu-law companding -- quantizes 16-bit audio to 256 values
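A sketch of the gated activation unit as a pair of causal dilated convs; trimming the right overhang keeps the output causal:

import torch
import torch.nn as nn

class GatedActivation(nn.Module):
    # tanh(W_f * x) ⊙ sigmoid(W_g * x), in the spirit of the WaveNet paper
    def __init__(self, channels, kernel_size, dilation):
        super().__init__()
        pad = (kernel_size - 1) * dilation
        self.filter = nn.Conv1d(channels, channels, kernel_size,
                                padding=pad, dilation=dilation)
        self.gate = nn.Conv1d(channels, channels, kernel_size,
                              padding=pad, dilation=dilation)

    def forward(self, x):
        L = x.shape[-1]
        f = self.filter(x)[:, :, :L]  # keep only causal outputs
        g = self.gate(x)[:, :, :L]
        return torch.tanh(f) * torch.sigmoid(g)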

| Feature | WaveNet | TCN |
|---------|---------|-----|
| Purpose | Audio generation | General sequence modeling |
| Gating | Yes (\(\tanh \odot \sigma\)) | No (ReLU) |
| Output | Autoregressive | Single pass |
| Speed | Slow (sample-by-sample) | Fast |

Q: Killer question -- why hasn't TCN replaced LSTM everywhere?

A:

Reasons why LSTM/GRU remain popular:

  1. Fixed receptive field -- a TCN sees only N steps back; an LSTM can, in theory, remember indefinitely

  2. Variable-length sequences -- an LSTM handles varying lengths naturally; a TCN requires padding/masking

  3. Streaming inference latency -- a TCN must keep the entire receptive-field window; an LSTM keeps only its hidden state

  4. Inductive bias -- LSTMs are built for sequential dependencies; TCNs look for local patterns

  5. Ecosystem momentum -- decades of LSTM code and tutorials

Hybrid approaches (2025+):
  • TCN + Attention (the best of both)
  • Conformer for speech (CNN + Transformer)
  • S4/Mamba for truly long sequences