
DL Interview: Special Topics

~11 min read


Time Series DL (DeepAR, TFT), Uncertainty Quantification (MC Dropout, Deep Ensembles, Conformal Prediction, OOD Detection), t-SNE & UMAP, Speculative Decoding, Mamba & State Space Models (SSM), Sequence Modeling Beyond RNN (TCN, WaveNet).


Time Series Deep Learning

Sources: TFT Tutorial (Towards Data Science), Google TFT Paper

Q: DeepAR vs TFT vs Prophet -- a comparison

A:

| Model | Type | Key Features | When to Use |
|-------|------|--------------|-------------|
| Prophet | Additive regression | Trend, seasonality, holidays | Baseline, simple patterns |
| DeepAR | Autoregressive RNN | Probabilistic, learns across series | Many similar series, cold start |
| TFT | Transformer + attention | Interpretable, multi-horizon | Complex patterns, need interpretability |

DeepAR (Amazon): Autoregressive RNN with likelihood output, learns global model from multiple series.

TFT (Google): Variable Selection Network + Multi-head attention, supports static + time-varying + known future features, interpretable attention.

Q: What is the Temporal Fusion Transformer (TFT)?

A:

Architecture:
  1. Variable Selection Network: learns which features matter
  2. LSTM Encoder-Decoder: local temporal patterns
  3. Multi-head Attention: long-term dependencies
  4. Gated Residual Network: non-linear processing

Feature types:
  • Time-varying known: known in the future (holidays, promotions)
  • Time-varying unknown: observed past only (sales, demand)
  • Static real/categorical: constant per series (product ID)
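A hedged sketch of how these three feature types map onto a training setup with the pytorch-forecasting library; the dataframe df and its column names are hypothetical, so check the TimeSeriesDataSet docs for the full signature:

from pytorch_forecasting import TimeSeriesDataSet, TemporalFusionTransformer

# df: hypothetical long-format dataframe, one row per (series, time) pair
dataset = TimeSeriesDataSet(
    df,
    time_idx="time_idx",
    target="sales",
    group_ids=["product_id"],
    static_categoricals=["product_id"],                # constant per series
    time_varying_known_reals=["price", "is_holiday"],  # known in the future
    time_varying_unknown_reals=["sales"],              # observed past only
    max_encoder_length=36,
    max_prediction_length=6,
)
tft = TemporalFusionTransformer.from_dataset(dataset, hidden_size=32)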

Q: Interpretable attention in TFT -- how does it work?

A:

Three types of interpretability:
  1. Seasonality-wise: attention weights show which past timesteps matter
  2. Feature-wise: the Variable Selection Network outputs an importance per feature
  3. Extreme events: analyze behavior on rare value ranges


Uncertainty Quantification (UQ)

Sources: Kendall & Gal "What Uncertainties Do We Need in Bayesian Deep Learning" (2017), Lakshminarayanan et al. "Deep Ensembles" (2017), Gal & Ghahramani "MC Dropout" (2016)

Q: Aleatoric vs epistemic uncertainty -- what is the difference?

A:

| Type | Definition | Can it be reduced? | Source |
|------|------------|--------------------|--------|
| Aleatoric | Irreducible noise in the data | No (more of the same data won't help) | Sensor noise, label ambiguity |
| Epistemic | Uncertainty about the model | Yes (better data/model) | Limited data, model misspecification |

Mathematical formulation:
\[ p(y|x, D) = \int p(y|x, \theta) \, p(\theta|D) \, d\theta \]

  • \(p(y|x, \theta)\) captures aleatoric (noise in labels)
  • \(p(\theta|D)\) captures epistemic (uncertainty in weights)

Why distinguish:
  • High epistemic -> collect more data
  • High aleatoric -> improve sensors, reduce label noise (or model the noise directly, as sketched below)
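A minimal sketch of separating the two in a regression setup: a heteroscedastic head learns a per-input variance (aleatoric), while the spread of the predicted mean across MC Dropout samples or ensemble members (see below) estimates the epistemic part:

import torch
import torch.nn as nn

class HeteroscedasticRegressor(nn.Module):
    """Predicts a per-input mean and log-variance (aleatoric uncertainty)."""
    def __init__(self, d_in, d_hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.mean_head = nn.Linear(d_hidden, 1)
        self.log_var_head = nn.Linear(d_hidden, 1)  # log-variance for stability

    def forward(self, x):
        h = self.body(x)
        return self.mean_head(h), self.log_var_head(h)

def gaussian_nll(mean, log_var, y):
    # NLL of y under N(mean, exp(log_var)): the model can "explain away"
    # noisy regions by inflating variance instead of overfitting them
    return 0.5 * (log_var + (y - mean) ** 2 / log_var.exp()).mean()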

Q: How does Monte Carlo (MC) Dropout work?

A:

Key insight: Dropout at test time approximates Bayesian inference.

import torch

def mc_dropout_predict(model, x, T=100):
    # model.train() keeps dropout active; note it also puts BatchNorm in
    # training mode -- enable only the dropout layers if that matters
    model.train()
    predictions = []

    for _ in range(T):
        with torch.no_grad():
            pred = model(x)
            predictions.append(pred)

    predictions = torch.stack(predictions)  # (T, batch, ...)
    mean = predictions.mean(dim=0)          # predictive mean
    variance = predictions.var(dim=0)       # predictive variance (uncertainty)

    return mean, variance

Advantages:
  • No model changes needed
  • Cheap to adopt (training cost unchanged; inference costs T forward passes)
  • Strong baseline for UQ

Limitations:
  • Requires dropout in the architecture
  • Variance can be underestimated

Q: Deep Ensembles for uncertainty estimation?

A:

Approach: Train M models with different random seeds, aggregate predictions.

import torch

class DeepEnsemble:
    def __init__(self, models):
        self.models = models  # M independently trained models (different seeds)

    def predict(self, x):
        with torch.no_grad():
            predictions = torch.stack([model(x) for model in self.models])

        mean = predictions.mean(dim=0)      # ensemble prediction
        variance = predictions.var(dim=0)   # disagreement = epistemic uncertainty
        return mean, variance

Comparison with MC Dropout:

| Aspect | MC Dropout | Deep Ensembles |
|--------|------------|----------------|
| Training cost | Same | M x more |
| Inference cost | T forward passes | M forward passes |
| Diversity source | Stochastic dropout | Different minima |
| Typical T/M | 10-100 | 5-10 |

Best practice: Deep Ensembles + Temperature Scaling

Q: How do you assess a model's calibration?

A:

Expected Calibration Error (ECE):
\[ ECE = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \text{acc}(B_m) - \text{conf}(B_m) \right| \]
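A minimal sketch of computing ECE with equal-width confidence bins (M=15 is a common default):

import torch
import torch.nn.functional as F

def expected_calibration_error(logits, labels, n_bins=15):
    probs = F.softmax(logits, dim=-1)
    conf, preds = probs.max(dim=-1)
    correct = preds.eq(labels).float()
    edges = torch.linspace(0, 1, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            # |B_m|/n * |acc(B_m) - conf(B_m)|
            ece += in_bin.float().mean() * (correct[in_bin].mean() - conf[in_bin].mean()).abs()
    return float(ece)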

Temperature Scaling:

import torch
import torch.nn as nn
import torch.nn.functional as F

def temperature_scale(logits, temperature):
    return F.softmax(logits / temperature, dim=-1)

# Learn T on a validation set by minimizing NLL
# (LBFGS expects a closure that recomputes the loss)
temperature = nn.Parameter(torch.ones(1))
optimizer = torch.optim.LBFGS([temperature], lr=0.01, max_iter=50)

Q: Conformal Prediction -- what is it?

A:

Goal: Distribution-free, finite-sample valid prediction sets.

Key property: Under exchangeability, coverage is guaranteed:
\[ P(Y_{n+1} \in \hat{C}(X_{n+1})) \geq 1 - \alpha \]

Split Conformal procedure:
  1. Compute calibration scores: \(s_i = 1 - f(x_i)_{y_i}\)
  2. Find the quantile: \(\hat{q}\) = the \(\lceil (n+1)(1-\alpha) \rceil\)-th smallest score (the finite-sample-corrected \(1-\alpha\) quantile)
  3. Prediction set: \(\{y : f(x_{new})_y \geq 1 - \hat{q}\}\)

import torch
import torch.nn.functional as F

def conformal_prediction(cal_logits, cal_labels, test_logits, alpha=0.1):
    cal_probs = F.softmax(cal_logits, dim=-1)
    scores = 1 - cal_probs[torch.arange(len(cal_labels)), cal_labels]

    n = len(scores)
    q_level = min((n + 1) * (1 - alpha) / n, 1.0)  # finite-sample correction
    # 'higher' interpolation keeps the coverage guarantee valid
    q = torch.quantile(scores, q_level, interpolation='higher')

    test_probs = F.softmax(test_logits, dim=-1)
    prediction_sets = test_probs >= (1 - q)  # boolean mask over classes

    return prediction_sets

Advantages:
  • No distribution assumptions
  • Works with any model
  • Guaranteed coverage

Q: OOD detection -- how do you detect out-of-distribution inputs?

A:

Baselines:

1. Max Softmax Probability (MSP):

import torch
import torch.nn.functional as F

def msp_ood_score(logits):
    probs = F.softmax(logits, dim=-1)
    # .max returns (values, indices); higher score = more OOD
    return 1 - probs.max(dim=-1).values

2. Energy-based:
\[ E(x) = -T \log \sum_i e^{f_i(x)/T} \]

def energy_ood_score(logits, T=1.0):
    return -T * torch.logsumexp(logits / T, dim=-1)

3. Mahalanobis distance (in feature space):

def mahalanobis_ood(features, class_means, cov):
    # features: (N, d); class_means: iterable of (d,); cov: shared (d, d)
    cov_inv = torch.inverse(cov)
    scores = []
    for mean in class_means:
        diff = features - mean
        # per-sample squared Mahalanobis distance to this class mean
        scores.append((diff @ cov_inv * diff).sum(dim=-1))
    # distance to the nearest class; higher = more OOD
    return torch.min(torch.stack(scores), dim=0).values

Evaluation metrics: AUROC, AUPR, FPR@95TPR
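A sketch of computing these metrics from arrays of ID and OOD scores, with scores oriented so that higher = more OOD:

import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def ood_metrics(id_scores, ood_scores):
    y = np.concatenate([np.zeros(len(id_scores)), np.ones(len(ood_scores))])
    s = np.concatenate([id_scores, ood_scores])
    auroc = roc_auc_score(y, s)
    aupr = average_precision_score(y, s)
    # FPR@95TPR: pick the threshold at which 95% of OOD samples are detected,
    # then measure how much ID data is falsely flagged as OOD
    thresh = np.percentile(ood_scores, 5)
    fpr95 = float((np.asarray(id_scores) >= thresh).mean())
    return auroc, aupr, fpr95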

Q: Selective prediction -- when should the model abstain?

A:

Idea: Model can abstain when uncertain, trading coverage for accuracy.

import torch
import torch.nn.functional as F

def selective_predict(model, x, threshold=0.9):
    probs = F.softmax(model(x), dim=-1)
    max_prob, pred = probs.max(dim=-1)

    accepted = max_prob >= threshold  # abstain on the rest
    return pred[accepted], accepted

Coverage-accuracy trade-off:
  • Higher threshold -> higher accuracy, lower coverage
  • Choose the threshold based on business requirements (a threshold sweep is sketched below)
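A sketch of sweeping the threshold to trace the trade-off; logits and labels are assumed precomputed on a validation set:

import torch
import torch.nn.functional as F

def risk_coverage_curve(logits, labels, thresholds):
    probs = F.softmax(logits, dim=-1)
    conf, preds = probs.max(dim=-1)
    curve = []
    for t in thresholds:
        mask = conf >= t
        coverage = mask.float().mean().item()          # fraction answered
        accuracy = (preds[mask] == labels[mask]).float().mean().item() \
            if mask.any() else float('nan')            # accuracy on answered
        curve.append((t, coverage, accuracy))
    return curve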

Q: Which UQ method should be used when?

A:

| Scenario | Recommended Method |
|----------|--------------------|
| Quick baseline | Temperature scaling + MC Dropout |
| Maximum reliability | Deep Ensembles + temperature scaling |
| Safety-critical | Conformal prediction + ensembles |
| Need sets, not scores | Conformal prediction |
| OOD detection | Energy-based + Mahalanobis |
| Resource-constrained | MC Dropout (T=10-20) |
| Aleatoric uncertainty | Heteroscedastic loss (predict variance) |

Dimensionality Reduction for Visualization (t-SNE, UMAP)

Sources: AI Under the Hood: t-SNE & UMAP (2025)

Q: t-SNE -- how does it work?

A:

t-Distributed Stochastic Neighbor Embedding (van der Maaten & Hinton, 2008).

Core idea: Preserve local neighborhoods -- similar points in high-D should be close in low-D.

Algorithm:
  1. Compute similarities in high-D:
\[ p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)} \]
  2. Compute similarities in low-D (Student t-distribution):
\[ q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}} \]
  3. Minimize the KL divergence:
\[ C = KL(P\|Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}} \]

Why Student t? Heavy tails allow dissimilar points to be far apart without crushing local structure.

Perplexity: Controls effective number of neighbors. Typical: 5-50.

from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, perplexity=30, learning_rate=200,
            n_iter=1000, random_state=42)
embedding = tsne.fit_transform(X_high_dim)

Q: UMAP -- how does it work and how does it differ from t-SNE?

A:

Uniform Manifold Approximation and Projection (McInnes et al., 2018).

| Aspect | t-SNE | UMAP |
|--------|-------|------|
| Theory | Probabilistic (KL divergence) | Topological (fuzzy simplicial sets) |
| Global structure | Poor | Better preserved |
| Speed | Slow (O(n^2)) | Faster (O(n log n)) |
| New data | Can't embed | Can transform new points |
| Scalability | ~100K samples | Millions of samples |
| Parameters | perplexity | n_neighbors, min_dist, metric |

Key parameters:
  • n_neighbors (default 15): local vs global balance
  • min_dist (default 0.1): spread of points

import umap

reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, metric='euclidean')
embedding = reducer.fit_transform(X_high_dim)

# Can transform new data!
new_embedding = reducer.transform(X_new)

Q: When to use t-SNE vs UMAP vs PCA?

A:

| Method | Speed | Global Structure | New Data | Use Case |
|--------|-------|------------------|----------|----------|
| PCA | Very fast | Preserved | Yes | Initial exploration, linear data |
| t-SNE | Slow | Poor | No | Cluster visualization, small datasets |
| UMAP | Medium | Better | Yes | General purpose, large datasets |

Q: Common pitfalls when visualizing embeddings?

A:

  1. Random seed sensitivity: Different runs produce different shapes -- always set random_state
  2. Cluster size does not equal actual size: t-SNE/UMAP distort distances
  3. Global distances meaningless: Far clusters may not be far in high-D
  4. Parameters matter: Same data, different perplexity -> very different visualizations
  5. Over-interpretation: Patterns may be artifacts -- validate with other methods

Q: Visualizing BERT/ResNet embeddings -- best practices?

A:

import torch
from transformers import AutoModel, AutoTokenizer
import umap

model = AutoModel.from_pretrained('bert-base-uncased')
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model.eval()  # disable dropout for deterministic embeddings

def get_embeddings(texts):
    inputs = tokenizer(texts, return_tensors='pt', padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[:, 0, :].numpy()  # [CLS] token embedding

embeddings = get_embeddings(texts)  # texts: a list of strings to visualize

# UMAP for visualization
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, metric='cosine')
vis = reducer.fit_transform(embeddings)

Best practices:
  • Use metric='cosine' for text/image embeddings
  • Preprocess with PCA if >100 dimensions
  • Try multiple n_neighbors values
  • Color by class/cluster to validate separation


Speculative Decoding (LLM Inference Acceleration)

Sources: Michael Brenndoerfer: Speculative Decoding (2026), BentoML Blog (2025)

Q: What is speculative decoding and why is it needed?

A:

Speculative decoding is a technique that accelerates LLM inference by 2-3x with no loss in generation quality.

The problem: LLM inference is memory-bound, not compute-bound.
  • A 70B model needs ~140GB of weights read from memory for every token
  • A100: 2 TB/s bandwidth -> ~70ms just to load the weights
  • GPU compute sits idle ~95% of the time

Solution: Generate multiple tokens per large model forward pass.

Q: How does speculative decoding work?

A:

Architecture: two models -- a draft model (small, fast) plus a target model (large).

Algorithm (per round):
  1. Draft phase: the small model generates K candidate tokens (typically K=4-8)
  2. Verify phase: the large model checks all K tokens in a single forward pass
  3. Accept/reject: compare draft vs target probabilities
  4. Correction: on rejection, sample from an adjusted distribution

Key insight: verification is almost free -- the same ~70ms of weight loading, but K tokens are checked at once.

Q: Acceptance criterion -- how is it decided whether to accept a token?

A:

Formula:
\[ \alpha(x) = \min\left(1, \frac{p(x)}{q(x)}\right) \]

Where:
  • \(p(x)\) -- the probability under the target model
  • \(q(x)\) -- the probability under the draft model

Interpretation:
  • If \(p(x) \geq q(x)\) (the target likes the token at least as much) -> always accept (\(\alpha = 1\))
  • If \(p(x) < q(x)\) (the draft overestimated it) -> accept with probability \(p/q\)

Correction distribution:
\[ p_{\text{correction}}(x) = \frac{\max(0, p(x) - q(x))}{\sum_{x'} \max(0, p(x') - q(x'))} \]

Guarantee: the output distribution exactly matches the target model's -- lossless acceleration.
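A minimal sketch of one accept/reject step for a single drafted token, mirroring the formulas above; p and q are the target's and draft's probability vectors over the vocabulary:

import torch

def accept_or_resample(p, q, x):
    # Accept the drafted token x with probability min(1, p(x)/q(x))
    if torch.rand(()) < torch.clamp(p[x] / q[x], max=1.0):
        return x
    # Rejected: sample from the correction distribution max(0, p - q), renormalized
    residual = torch.clamp(p - q, min=0)
    return torch.multinomial(residual / residual.sum(), 1).item()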

Q: How do you choose a draft model?

A:

| Approach | Acceptance Rate | Pros | Cons |
|----------|-----------------|------|------|
| Same family (LLaMA-7B -> LLaMA-70B) | 70-80% | Easy, no training | Limited to model families |
| Distilled draft | 85%+ | Best alignment | Requires training infra |
| Self-drafting (early exit) | 60-70% | No extra model | Complex implementation |

Requirements:
  1. Speed: draft generation must cost far less than a target forward pass (rule of thumb: ~10-15% of the target's parameters)
  2. Alignment: probability distributions similar to the target's
  3. Vocabulary: exactly the same tokenizer

Q: When does speculative decoding not work?

A:

Bad cases:
  1. Low acceptance rate (<60%): the overhead outweighs the benefit
  2. Small target model: the draft overhead doesn't pay off
  3. Very diverse outputs (creative writing): hard to predict
  4. Batch inference: memory conflicts, harder to implement

Good cases:
  • Large models (>30B parameters)
  • Repetitive/deterministic outputs (code, structured data)
  • A same-family draft model is available
  • Latency-critical applications

# vLLM with speculative decoding (argument names vary across vLLM versions)
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    speculative_model="meta-llama/Llama-2-7b-hf",
    num_speculative_tokens=5,
)

Mamba & State Space Models (SSM)

Q: Что такое State Space Models и как они связаны с RNN?

A:

State Space Models (SSMs) are an architecture for sequence modeling with linear \(O(N)\) complexity.

Continuous SSM (from control theory):
\[ \frac{d\mathbf{h}(t)}{dt} = \mathbf{A}\mathbf{h}(t) + \mathbf{B}x(t) \]

\[y(t) = \mathbf{Ch}(t) + \mathbf{D}x(t)\]

Discretization (Euler method):
\[ \mathbf{h}_{t+1} = \mathbf{A}^*\mathbf{h}_t + \mathbf{B}^*x_t \]

\[y_t = \mathbf{Ch}_t\]

Relation to RNNs: discrete SSMs are linear RNNs (an RNN with identity activation).

Mamba discretization (zero-order hold):
\[ \mathbf{A}^* = \exp(\Delta\mathbf{A}) \]

\[\mathbf{B}^* = (\Delta\mathbf{A})^{-1}(\exp(\Delta\mathbf{A}) - I)\Delta\mathbf{B}\]
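A sketch of this discretization and the resulting linear-RNN recurrence, assuming a single-input single-output SSM with state size N (shapes: A is (N, N), B and C are (N,)):

import torch

def discretize_zoh(A, B, delta):
    # A* = exp(ΔA);  B* = (ΔA)^{-1} (exp(ΔA) - I) ΔB
    dA = torch.matrix_exp(delta * A)
    dB = torch.linalg.solve(delta * A, (dA - torch.eye(len(A))) @ (delta * B))
    return dA, dB

def ssm_scan(x, dA, dB, C):
    # Discrete SSM = linear RNN: h_{t+1} = A* h_t + B* x_t,  y_t = C h_t
    h = torch.zeros(len(dA))
    ys = []
    for x_t in x:
        h = dA @ h + dB * x_t
        ys.append(C @ h)
    return torch.stack(ys)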

Q: Why are SSMs better than vanilla RNNs?

A:

RNN problems:
  1. Difficulty modeling long-range dependencies
  2. Slow training (sequential computation)

SSM solutions:
  1. HiPPO framework -- principled initialization of the \(\mathbf{A}\) matrix (Sequential MNIST: 68% -> 98%)
  2. Parallel training via convolution with a kernel \(\overline{\mathbf{K}}\) precomputed from the SSM parameters:
\[ \mathbf{y} = \overline{\mathbf{K}} * \mathbf{x} \]

S4 model (2022): Linear time computation via structured \(\mathbf{A}\) matrix

Q: What is the key difference between Mamba and S4?

A:

Selective State Space Models -- Mamba makes the SSM parameters input-dependent.

S4 (static parameters): \(\Delta, \mathbf{B}, \mathbf{C}\) are fixed for all tokens

Mamba (input-dependent parameters):
\[ \Delta_t, \mathbf{B}_t, \mathbf{C}_t = f(x_t) \]

The effect of selectivity:
  • Content-aware reasoning: the model can "forget" irrelevant tokens and "remember" important ones
  • Variable memory: different tokens use different amounts of effective context
  • Attention-like behavior: via gating mechanisms

import torch
import torch.nn as nn
import torch.nn.functional as F

class MambaBlock(nn.Module):
    """Simplified selective SSM block (schematic; real Mamba uses a fused scan)."""
    def __init__(self, d_model, d_state=16, d_conv=4, expand=2):
        super().__init__()
        self.d_inner = d_model * expand
        self.proj = nn.Linear(d_model, self.d_inner * 2)
        self.conv = nn.Conv1d(self.d_inner, self.d_inner, d_conv,
                              padding=d_conv - 1, groups=self.d_inner)

        # Input-dependent (selective) projections for Δ, B, C
        self.proj_dt = nn.Linear(self.d_inner, self.d_inner)
        self.proj_B = nn.Linear(self.d_inner, d_state)
        self.proj_C = nn.Linear(self.d_inner, d_state)
        self.A_log = nn.Parameter(torch.zeros(self.d_inner, d_state))
        self.out_proj = nn.Linear(self.d_inner, d_model)

    def forward(self, x):
        Bsz, L, _ = x.shape
        x, z = self.proj(x).chunk(2, dim=-1)
        # Causal depthwise conv: trim the right overhang back to length L
        x = self.conv(x.transpose(1, 2))[:, :, :L].transpose(1, 2)
        x = F.silu(x)

        dt = F.softplus(self.proj_dt(x))   # (B, L, d_inner), Δ_t > 0
        B_t = self.proj_B(x)               # (B, L, d_state)
        C_t = self.proj_C(x)               # (B, L, d_state)

        y = self.selective_scan(x, dt, B_t, C_t)
        return self.out_proj(y * F.silu(z))

    def selective_scan(self, x, dt, B_t, C_t):
        # Naive sequential reference; the real model uses a parallel scan
        A = -torch.exp(self.A_log)                          # diagonal, negative
        h = x.new_zeros(x.shape[0], self.d_inner, A.shape[1])
        ys = []
        for t in range(x.shape[1]):
            dA = torch.exp(dt[:, t, :, None] * A)           # A* = exp(Δ_t A)
            dB = dt[:, t, :, None] * B_t[:, t, None, :]     # B* ≈ Δ_t B_t
            h = dA * h + dB * x[:, t, :, None]
            ys.append((h * C_t[:, t, None, :]).sum(-1))     # y_t = C_t h_t
        return torch.stack(ys, dim=1)

Q: Mamba vs Transformer -- complexity comparison?

A:

| Aspect | Transformer | Mamba/SSM |
|--------|-------------|-----------|
| Training | \(O(N^2 \cdot d)\) | \(O(N \cdot d^2)\) |
| Inference | \(O(N \cdot d^2)\) with KV cache | \(O(d^2)\), constant |
| Memory | \(O(N \cdot d)\) KV cache | \(O(d)\) state |
| Long sequences | Quadratic bottleneck | Linear scaling |

When to prefer Mamba:
  • Very long sequences (DNA, audio, video)
  • Real-time streaming applications
  • Memory-constrained inference

When to prefer a Transformer:
  • In-context learning tasks
  • Retrieval-heavy workloads
  • Already-optimized infrastructure

Q: How does Mamba's hardware-aware implementation work?

A:

Mamba uses a parallel associative scan for efficient GPU computation.

Challenge: the SSM recurrence is inherently sequential: \(\mathbf{h}_{t+1} = \mathbf{A}^*_t\mathbf{h}_t + \mathbf{B}^*_t x_t\)

Solution: express the update as an associative operation, enabling a parallel prefix-sum (scan) algorithm (see the sketch below):
  • \(O(\log N)\) parallel steps
  • Full GPU utilization
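The trick, as a sketch: each step of the recurrence is the affine map \(h \mapsto a h + b\), and composing two such maps is itself an affine map, which gives the associativity a prefix scan needs:

import torch

def combine(e1, e2):
    # (a1, b1) then (a2, b2): h -> a2*(a1*h + b1) + b2 = (a2*a1)*h + (a2*b1 + b2)
    a1, b1 = e1
    a2, b2 = e2
    return a2 * a1, a2 * b1 + b2

def sequential_scan(a_seq, b_seq):
    # Reference O(N) recurrence; because `combine` is associative, the same
    # prefix states can be computed in O(log N) depth on a GPU (for example
    # with jax.lax.associative_scan); Mamba fuses this into one CUDA kernel.
    h, states = torch.zeros_like(b_seq[0]), []
    for a_t, b_t in zip(a_seq, b_seq):
        h = a_t * h + b_t
        states.append(h)
    return torch.stack(states)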

Memory optimization:
  • Recomputation: don't store intermediate states; recompute them during the backward pass
  • Chunked computation: process the sequence in chunks, then merge states

Mamba-2 improvements:
  • State Space Duality (SSD) framework
  • 2-8x faster than Mamba-1
  • Better scaling laws

Q: Hybrid Mamba + Transformer architectures -- when and why?

A:

Hybrid models (Jamba, Bamba) combine SSM and attention layers.

Architecture pattern:

[SSM] [SSM] [Attn] [SSM] [SSM] [Attn] ...  # Jamba-style

Jamba (AI21, 2024):
  • 12B parameters (94B with MoE)
  • 4:1 ratio of Mamba:Attention layers
  • 3x longer effective context than a pure Transformer
  • 140K context window

When to use a hybrid:
  • Need both long context AND in-context learning
  • Retrieval-augmented generation
  • Multi-modal (text + long context)


Sequence Modeling Beyond RNN (TCN, WaveNet)

Sources: Shadecoder TCN Guide (2025), DeepMind WaveNet paper (2016)

Q: What is a Temporal Convolutional Network (TCN)?

A:

A TCN is a sequence-modeling architecture that uses convolutional layers instead of recurrent ones.

Three key ideas:
  1. Causal convolutions: the output at time \(t\) depends only on \(t\) and earlier
  2. Dilated convolutions: exponential receptive-field growth with no growth in parameters per layer
  3. Residual connections: stabilize the training of deep networks

Dilated convolution:
\[ y(t) = \sum_{i=0}^{k-1} f(i) \cdot x(t - d \cdot i) \]

Receptive field (with dilations doubling per layer, \(d_i = 2^{i-1}\)):
\[ R = 1 + (k-1) \cdot \sum_{i=1}^{n} d_i = 1 + (k-1) \cdot (2^n - 1) \]

import torch
import torch.nn as nn

class TCNBlock(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size, dilation, dropout=0.2):
        super().__init__()
        self.chomp = (kernel_size - 1) * dilation  # left padding to trim on the right

        self.conv1 = nn.Conv1d(in_channels, out_channels, kernel_size,
                               padding=self.chomp, dilation=dilation)
        self.conv2 = nn.Conv1d(out_channels, out_channels, kernel_size,
                               padding=self.chomp, dilation=dilation)
        self.bn1 = nn.BatchNorm1d(out_channels)
        self.bn2 = nn.BatchNorm1d(out_channels)
        self.dropout = nn.Dropout(dropout)
        self.relu = nn.ReLU()

        self.residual = nn.Conv1d(in_channels, out_channels, 1) \
            if in_channels != out_channels else nn.Identity()

    def _causal(self, x):
        # Trim the right overhang after EACH conv so lengths match the residual
        return x[:, :, :-self.chomp] if self.chomp > 0 else x

    def forward(self, x):
        residual = self.residual(x)
        out = self.relu(self.bn1(self._causal(self.conv1(x))))
        out = self.dropout(out)
        out = self.relu(self.bn2(self._causal(self.conv2(out))))
        out = self.dropout(out)
        return self.relu(out + residual)
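A sketch of stacking these blocks with exponentially increasing dilations, which realizes the receptive-field formula above (each block here has two convs, so the effective field is roughly twice the single-conv formula):

import torch.nn as nn

def build_tcn(in_channels, hidden_channels, n_blocks, kernel_size=3):
    layers, ch = [], in_channels
    for i in range(n_blocks):
        layers.append(TCNBlock(ch, hidden_channels, kernel_size, dilation=2 ** i))
        ch = hidden_channels
    return nn.Sequential(*layers)

# 8 blocks, k=3: R = 1 + (k-1) * (2^8 - 1) = 511 timesteps per conv stack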

Q: TCN vs LSTM/GRU -- when to use which?

A:

| Aspect | TCN | LSTM/GRU |
|--------|-----|----------|
| Parallelization | Full (faster training) | Sequential (slow) |
| Training stability | Excellent (no vanishing gradients) | Requires careful init |
| Long-term memory | Fixed receptive field | Theoretically unbounded |
| Inference speed | O(1) with precomputed buffers | O(sequence length) |
| Variable length | Requires padding/masking | Natural handling |

When TCN is better:
  1. Long sequences where parallel training matters
  2. Need stable gradients without special tricks
  3. Real-time/low-latency inference
  4. Time series forecasting, anomaly detection

When LSTM/GRU is better:
  1. Variable-length sequences without padding
  2. Need truly unbounded memory
  3. Irregularly sampled data

Q: What is WaveNet and how does it relate to TCN?

A:

WaveNet (DeepMind, 2016) is a generative model for raw audio waveforms built on dilated causal convolutions.

Key innovations:
  1. Raw waveform modeling -- no vocoder, no spectrograms
  2. Exponentially dilated convolutions -- receptive field of ~16K samples
  3. Gated activations -- \(\tanh(W_f * x) \odot \sigma(W_g * x)\) (see the sketch below)
  4. mu-law companding -- quantizes 16-bit audio to 256 values
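A sketch of the gated activation unit as a pair of causal dilated convs; trimming the right overhang keeps the output causal:

import torch
import torch.nn as nn

class GatedActivation(nn.Module):
    # tanh(W_f * x) ⊙ sigmoid(W_g * x), in the spirit of the WaveNet paper
    def __init__(self, channels, kernel_size, dilation):
        super().__init__()
        pad = (kernel_size - 1) * dilation
        self.filter = nn.Conv1d(channels, channels, kernel_size,
                                padding=pad, dilation=dilation)
        self.gate = nn.Conv1d(channels, channels, kernel_size,
                              padding=pad, dilation=dilation)

    def forward(self, x):
        L = x.shape[-1]
        f = self.filter(x)[:, :, :L]  # keep only causal outputs
        g = self.gate(x)[:, :, :L]
        return torch.tanh(f) * torch.sigmoid(g)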

| Feature | WaveNet | TCN |
|---------|---------|-----|
| Purpose | Audio generation | General sequence modeling |
| Gating | Yes (\(\tanh \odot \sigma\)) | No (ReLU) |
| Output | Autoregressive | Single pass |
| Speed | Slow (sample-by-sample) | Fast |

Q: Killer question -- why hasn't TCN replaced LSTM everywhere?

A:

Reasons why LSTM/GRU remain popular:

  1. Fixed receptive field -- a TCN sees only N steps back; an LSTM can, in theory, remember indefinitely

  2. Variable-length sequences -- an LSTM handles varying lengths naturally; a TCN requires padding/masking

  3. Streaming inference latency -- a TCN must keep the entire receptive-field window; an LSTM keeps only its hidden state

  4. Inductive bias -- LSTMs are built for sequential dependencies; TCNs look for local patterns

  5. Ecosystem momentum -- decades of LSTM code and tutorials

Hybrid approaches (2025+):
  • TCN + Attention (the best of both)
  • Conformer for speech (CNN + Transformer)
  • S4/Mamba for truly long sequences