
Classical ML: Interview Q&A

~58 minute read

Typical interview questions for 2025-2026. Format: Q: question / A: detailed answer. Updated: 2026-02-11

Prerequisites: Materials


Contents

  1. K-Nearest Neighbors
  2. Logistic Regression
  3. K-Means
  4. Naive Bayes
  5. Decision Trees
  6. SVM
  7. Gradient Boosting
  8. Feature Engineering
  9. Feature Selection
  10. Hard Questions (Senior+)
  11. Model Interpretability (SHAP & LIME) — NEW 2026
  12. Reinforcement Learning Basics
  13. NLP: Word Embeddings (Word2Vec, GloVe)
  14. NLP: Named Entity Recognition (NER) & Sequence Labeling
  15. Hyperparameter Optimization
  16. Active Learning
  17. Time Series: Deep Learning Methods
  18. Explainable AI (XAI): SHAP & LIME
  19. Neural Architecture Search (NAS)
  20. Cost-Sensitive Learning
  21. Missing Data Handling
  22. Model Debugging
  23. AutoML Theory
  24. Federated Learning
  25. TabPFN — Foundation Model for Tabular Data
  26. Production ML Deployment Patterns
  27. Data Drift Detection
  28. Hyperparameter Interactions & Learning Curves
  29. Cross-Validation Edge Cases

K-Nearest Neighbors

Q: Why does KNN perform poorly in high dimensions?

A: Curse of dimensionality.

Reason: in high-dimensional space:

  1. All points are "far" from each other
  2. The volume of the unit ball \(\to 0\) exponentially
  3. Distances become indistinguishable

\[\frac{\text{max distance} - \text{min distance}}{\text{min distance}} \to 0 \text{ as } d \to \infty\]

Solution: dimensionality reduction (PCA) or a different algorithm.

Q: How do you choose k in KNN?

A:

  • k=1: overfitting, sensitive to noise
  • k=n: underfitting, always predicts the majority class
  • Rule of thumb: \(k \approx \sqrt{n}\) (but use CV)

Practical: cross-validation to find the optimal k. Use an odd k for binary classification (to avoid ties).
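The CV-based selection above can be sketched in a few lines (a minimal sketch on a synthetic dataset; the grid of odd k values is illustrative):

```python
# Sketch: pick k by cross-validation, restricting to odd k to avoid ties
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=42)

k_values = list(range(1, 32, 2))  # odd k only
scores = [
    cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    for k in k_values
]
best_k = k_values[int(np.argmax(scores))]
```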

Q: Which distance metric should you choose?

A:

| Metric | Formula | When to use |
|---|---|---|
| Euclidean | \(\sqrt{\sum_i (x_i-y_i)^2}\) | Default; continuous, scaled features |
| Manhattan | \(\sum_i \lvert x_i-y_i \rvert\) | High-dim (more robust to the curse of dimensionality), sparse data |
| Cosine | \(1 - \frac{x \cdot y}{\|x\|\|y\|}\) | Text embeddings, TF-IDF; when the angle matters, not the magnitude |
| Mahalanobis | \(\sqrt{(x-y)^T S^{-1} (x-y)}\) | Correlated features; accounts for covariance |

Gotcha: Euclidean and Manhattan require feature scaling. Cosine does not (it is scale-invariant).

Q: Weighted KNN -- why and how?

A:

Problem: standard KNN gives all k neighbors equal weight -- a distant neighbor counts as much as the closest one.

Solution: inverse-distance weighting: \[\hat{y} = \frac{\sum_{i \in N_k} w_i y_i}{\sum_{i \in N_k} w_i}, \quad w_i = \frac{1}{d(x, x_i)^p}\]

Scikit-learn: KNeighborsClassifier(weights='distance')

When it helps: uneven data density, boundary regions between classes.

Q: How do you speed up KNN? Brute force is \(O(nd)\) per query.

A:

| Method | Query complexity | When |
|---|---|---|
| Brute force | \(O(nd)\) | \(n < 10K\) or \(d > 20\) |
| KD-tree | \(O(d \log n)\) avg | \(d < 20\), dense data |
| Ball tree | \(O(d \log n)\) avg | Any metric, \(d < 40\) |
| ANN (approximate) | \(O(d \log n)\) | \(n > 100K\), some error tolerable |

Approximate Nearest Neighbors (ANN):

| Library | Algorithm | Strengths |
|---|---|---|
| FAISS (Meta) | IVF + PQ | GPU, billions of vectors, production standard |
| Annoy (Spotify) | Random projections | Read-only, fast build, mmap |
| HNSW (hnswlib) | Hierarchical NSW graph | Best recall/speed trade-off |
| ScaNN (Google) | Anisotropic quantization | Optimized for inner product |

Scikit-learn: KNeighborsClassifier(algorithm='auto') chooses between brute/kd_tree/ball_tree automatically.
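A minimal sketch of an exact tree-based index with scikit-learn's `NearestNeighbors` (synthetic data; the choice of `ball_tree` is illustrative):

```python
# Sketch: exact neighbor search via a tree index instead of brute force
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))

# algorithm='auto' would pick brute/kd_tree/ball_tree from n, d, and the metric
index = NearestNeighbors(n_neighbors=5, algorithm='ball_tree').fit(X)
dist, idx = index.kneighbors(X[:1])  # query with the first point itself

# a point in the index is its own nearest neighbor at distance 0
```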

Q: KNN for regression vs classification -- what differs?

A:

| | Classification | Regression |
|---|---|---|
| Prediction | Majority vote among k neighbors | Mean/median of the k neighbors' targets |
| Weighted | Weighted vote | Weighted average |
| Metrics | Accuracy, F1 | MSE, MAE |

Scikit-learn: KNeighborsClassifier vs KNeighborsRegressor.

Gotcha: for regression, weighted KNN is almost always better than uniform -- distant points contribute noise.

Q: Why is feature scaling mandatory for KNN?

A:

Problem: KNN is distance-based, so the feature with the larger scale dominates:

Feature A: salary (30000-150000)
Feature B: age (18-65)
→ the distance is determined almost entirely by salary

Solution:

| Method | When |
|---|---|
| StandardScaler | Gaussian-like features |
| MinMaxScaler | Bounded features, [0, 1] |
| RobustScaler | Outliers present |

Gotcha: fit the scaler ONLY on the train set, then transform both train and test. Anything else is data leakage.
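The leakage-safe pattern is to put the scaler inside a Pipeline, so it is re-fit on the training folds only (a minimal sketch on synthetic data):

```python
# Sketch: StandardScaler inside a Pipeline -- no train/test leakage
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)

pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
# cross_val_score re-fits the scaler inside each training fold
scores = cross_val_score(pipe, X, y, cv=5)
```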

Misconception: KNN is always worse than complex models

On small datasets (\(n < 1000\), \(d < 20\)) with clean data, KNN often beats Random Forest and SVM. Reason: KNN carries no bias from an assumed functional form -- it is purely data-driven. Problems start at \(d > 20\) (curse of dimensionality) or \(n > 50K\) (speed).


Logistic Regression

Q: Why is logistic regression called "regression"?

A: Historically -- because it uses a linear combination of features:

\[z = w^Tx + b\]

Technically it is a classifier (the sigmoid turns z into a probability), but the underlying model is linear regression plus an activation.

Q: How does multiclass logistic regression work?

A:

One-vs-Rest (OvR): K binary classifiers, each with a sigmoid: \[P(y=k|x) = \sigma(w_k^Tx + b_k) = \frac{1}{1 + e^{-(w_k^Tx + b_k)}}\]

Softmax (Multinomial): a single model, all classes at once: \[P(y=k|x) = \frac{e^{w_k^Tx}}{\sum_j e^{w_j^Tx}}\]

Scikit-learn: multi_class='ovr' or 'multinomial'

Q: L1 vs L2 regularization in Logistic Regression

A:

| L1 (Lasso) | L2 (Ridge) |
|---|---|
| Sparse coefficients | Small coefficients |
| Feature selection | Handles multicollinearity |
| Non-differentiable at 0 | Smooth |
| Coordinate descent | Gradient descent |

Practical: L1 if you need interpretability via feature selection, L2 if all features matter.
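The sparsity difference is easy to see directly (a minimal sketch; the C=0.1 strength and dataset shape are illustrative):

```python
# Sketch: L1 zeroes out coefficients, L2 only shrinks them
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

l1 = LogisticRegression(penalty='l1', solver='liblinear', C=0.1).fit(X, y)
l2 = LogisticRegression(penalty='l2', solver='liblinear', C=0.1).fit(X, y)

n_zero_l1 = int(np.sum(l1.coef_ == 0))  # many noise features eliminated
n_zero_l2 = int(np.sum(l2.coef_ == 0))  # shrunk, but rarely exactly zero
```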


K-Means

Q: Does K-means always converge?

A: Yes, but not necessarily to the global optimum.

Theorem: K-means converges to a local minimum in a finite number of steps.

Reason: at each step:

  1. The assignment step decreases J or leaves it unchanged
  2. The update step decreases J or leaves it unchanged
  3. J is bounded below

Problem: it can get stuck in a bad local minimum.

Solution: K-means++ initialization, multiple restarts.

Q: How does K-Means++ initialization work and why does it matter?

A:

The problem with random init: bad starting centroids mean a bad local minimum. Random init produces a result 2-5x worse than optimal in roughly 20% of runs.

K-Means++ algorithm:

  1. Pick the first centroid uniformly at random
  2. For each point, compute the distance \(D(x)\) to its nearest centroid
  3. Pick the next centroid with probability \(\frac{D(x)^2}{\sum D(x)^2}\)
  4. Repeat steps 2-3 until there are k centroids

Guarantee: \(O(\log k)\)-competitive with the optimal solution (Arthur & Vassilvitskii, 2007).

Scikit-learn: KMeans(init='k-means++') -- the default.

Q: How do you choose k in K-means?

A:

| Method | How it works | Pros/cons |
|---|---|---|
| Elbow | Plot \(J(k)\) vs \(k\), look for the "elbow" | Subjective; a clear elbow doesn't always exist |
| Silhouette | \(s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}\) | Objective, \(s \in [-1, 1]\), but \(O(n^2)\) |
| Gap Statistic | Compare against a uniform reference distribution | Statistically grounded, expensive |
| Calinski-Harabasz | \(\frac{B/(k-1)}{W/(n-k)}\) (between/within variance ratio) | Fast, but biased toward convex clusters |

Practical: - Domain knowledge beats metrics (if the business defines 3 customer segments, take k=3) - Silhouette + elbow is the standard combination - If silhouette < 0.25, the clustering is poor: the data has no cluster structure
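A minimal sketch of silhouette-based selection on synthetic blobs (the three cluster centers are chosen by hand so the true k is 3):

```python
# Sketch: pick k by maximizing the silhouette score
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=[[0, 0], [5, 5], [-5, 5]],
                  cluster_std=0.8, random_state=42)

sil = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    sil[k] = silhouette_score(X, labels)

best_k = max(sil, key=sil.get)  # the well-separated blobs give best_k = 3
```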

Q: K-means vs K-medoids

A:

| K-means | K-medoids (PAM) |
|---|---|
| Centroid = mean | Centroid = actual data point |
| Sensitive to outliers | Robust to outliers |
| Euclidean only | Any distance metric |
| \(O(nkd)\) per iteration | \(O(n^2kd)\) -- much slower |

When K-medoids: outliers, non-Euclidean distances, or when the centroid must be interpretable (a real data point).

Q: When does K-Means fail?

A:

| Situation | Why it breaks | Alternative |
|---|---|---|
| Non-convex clusters (half-moons, rings) | K-Means partitions space into Voronoi cells | DBSCAN, Spectral Clustering |
| Clusters of different sizes | A small cluster gets "absorbed" by a large one | GMM, HDBSCAN |
| Clusters of different densities | Sparse points get misassigned to the dense cluster | DBSCAN, OPTICS |
| High-dimensional (\(d > 50\)) | Distances lose meaning (curse of dimensionality) | Spectral Clustering, PCA + K-Means |
| Unknown number of clusters | K must be set manually | DBSCAN, HDBSCAN, X-Means |

Q: K-Means vs DBSCAN vs GMM -- when to use which?

A:

| Aspect | K-Means | DBSCAN | GMM |
|---|---|---|---|
| Cluster shape | Spherical | Arbitrary | Elliptical |
| K specified? | Yes | No (\(\epsilon\), min_pts) | Yes |
| Outliers | No (every point is clustered) | Yes (noise points) | No (but low probability) |
| Soft assignment | No (hard) | No (hard) | Yes (\(P(z_k \mid x)\)) |
| Speed | \(O(nkd)\) | \(O(n \log n)\) with an index | \(O(nkd^2)\) per EM step |
| Scalability | Mini-batch up to millions | Poor > 100K | Poor > 50K |

Rules of thumb: - K-Means: spherical clusters, known k, speed matters - DBSCAN: unknown k, outliers present, non-convex shapes - GMM: soft membership (probabilities) needed, overlapping clusters

Q: Mini-batch K-Means -- why?

A:

Problem: standard K-Means passes over ALL \(n\) points each iteration. At \(n > 1M\) this is slow.

Mini-batch: each iteration uses a random sample of \(b\) points (batch_size=1000 typically).

| | Standard K-Means | Mini-batch K-Means |
|---|---|---|
| Per iteration | \(O(nkd)\) | \(O(bkd)\), \(b \ll n\) |
| Convergence | Stable | Slightly noisier |
| Quality | Baseline | ~1-3% worse inertia |
| Speed | Slow for \(n > 100K\) | 10-100x faster |

Scikit-learn: MiniBatchKMeans(n_clusters=k, batch_size=1000)

Q: Clustering quality metrics -- with and without labels?

A:

External (ground truth available):

| Metric | Formula/Idea | Range |
|---|---|---|
| ARI (Adjusted Rand Index) | Rand Index corrected for chance | \([-1, 1]\), 1 = perfect |
| NMI (Normalized Mutual Information) | \(\frac{2 \cdot MI(U,V)}{H(U) + H(V)}\) | \([0, 1]\) |
| Homogeneity / Completeness | Each cluster = one class / each class = one cluster | \([0, 1]\) |

Internal (no ground truth):

| Metric | Idea | Range |
|---|---|---|
| Silhouette | Cohesion vs separation | \([-1, 1]\) |
| Calinski-Harabasz | Between/within variance | \([0, \infty)\), higher = better |
| Davies-Bouldin | Avg cluster similarity | \([0, \infty)\), lower = better |

Gotcha: silhouette is biased toward convex clusters. For DBSCAN results, prefer DBCV (Density-Based Clustering Validation).

Misconception: K-Means always finds the global optimum

K-Means is guaranteed to converge to a LOCAL minimum in a finite number of steps (monotonic decrease of J), but NOT to the global one. In practice: run with n_init=10 (the scikit-learn default) -- 10 runs with different inits, keep the best. K-Means++ reduces the spread between runs but does not eliminate it.

Misconception: feature scaling is unnecessary for K-Means

K-Means uses Euclidean distance, so a feature with a larger scale dominates. StandardScaler or MinMaxScaler is MANDATORY before K-Means. The only exception: all features already share one scale (e.g., one-hot encoded).


Naive Bayes

Q: Why "naive"? What happens when the assumption is violated?

A:

Naive assumption: features are conditionally independent given the class.

\[P(x|y) = \prod_i P(x_i|y)\]

Reality: features are correlated.

It still works because:

  1. Only the ORDERING of class posteriors matters, not exact probabilities
  2. Overestimation affects all classes and largely cancels out
  3. The strong signal comes from the truly informative features

Q: Gaussian vs Multinomial vs Bernoulli Naive Bayes

A:

| Type | Data | Distribution |
|---|---|---|
| Gaussian | Continuous | Normal |
| Multinomial | Counts (text) | Multinomial |
| Bernoulli | Binary features | Bernoulli |

For text: Multinomial NB is the standard (on term counts; TF-IDF features also work in practice).
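A minimal sketch of the standard text setup (the toy spam/ham corpus is made up for illustration):

```python
# Sketch: Multinomial NB on raw term counts
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["free money win prize", "win cash free offer",
        "meeting agenda project", "project deadline meeting notes"]
labels = ["spam", "spam", "ham", "ham"]

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(docs, labels)

pred = clf.predict(["free prize offer"])[0]  # all three words are spam-only
```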


Decision Trees

Q: Gini vs Entropy -- which to choose?

A:

Gini: \(1 - \sum p_k^2\)

Entropy: \(-\sum p_k \log_2 p_k\)

In practice: nearly identical results.

Differences: - Gini: faster (no log) - Entropy: slightly more sensitive to pure nodes - Gini is the sklearn default

Recommendation: use the default (Gini); only revisit this if training is time-critical.

Q: How do you prevent overfitting in Decision Trees?

A:

Pre-pruning: - max_depth: limit tree depth - min_samples_split: minimum samples to split - min_samples_leaf: minimum samples in leaf - max_leaf_nodes: maximum leaves

Post-pruning: - Cost-complexity pruning (α penalty) - Reduced error pruning

Rule of thumb: Start with max_depth=5-10, tune.

Q: How is feature importance computed in Decision Trees?

A:

Gini Importance (Mean Decrease in Impurity): \[\text{Importance}(j) = \sum_{t \in T} p(t) \cdot \Delta i(t) \cdot \mathbb{1}(j \text{ used at } t)\]

where \(p(t)\) = fraction of samples at node \(t\), \(\Delta i(t)\) = impurity decrease.

Warning: biased towards high-cardinality features!

Alternative: permutation importance (more reliable).
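A minimal sketch of permutation importance with scikit-learn (synthetic data; measuring on a held-out split is the key point):

```python
# Sketch: permutation importance -- shuffle a column, measure the score drop
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=10, n_informative=3,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
# computed on held-out data, so it reflects generalization, not training fit
result = permutation_importance(model, X_te, y_te, n_repeats=10,
                                random_state=0)
```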


SVM

Q: What are support vectors?

A: Support vectors are the points lying on the margin boundary or inside the margin.

Mathematically: the points with \(\alpha_i > 0\) in the dual formulation.

Properties: - Only the support vectors determine the hyperplane - Removing non-support vectors does not change the model - Typically 10-30% of training samples

Follow-up: this makes SVM memory-efficient -- prediction depends only on the support vectors, not on the whole dataset.

Q: Hard margin vs soft margin -- what's the difference, and why C?

A:

Hard margin (linearly separable data): \[\min_{w,b} \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w^Tx_i + b) \geq 1\]

Soft margin (real, noisy data): \[\min_{w,b,\xi} \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i \quad \text{s.t.} \quad y_i(w^Tx_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0\]

The parameter C is a trade-off:

| C | Margin | Train errors | Risk |
|---|---|---|---|
| Small (0.01) | Wide | More allowed | Underfitting |
| Large (100) | Narrow | Heavily penalized | Overfitting |

Practical: tune via CV. sklearn default: C=1.0. Typical grid: [0.01, 0.1, 1, 10, 100].

Q: Why does the SVM kernel trick work?

A:

Idea: map the data to a higher dimension \(\phi(x)\), but compute only the kernel \(K(x,x') = \phi(x)^T\phi(x')\).

Dual formulation: \[f(x) = \sum_i \alpha_i y_i K(x_i, x) + b\]

Key insight: \(\phi(x)\) is never computed explicitly!

RBF kernel: \(\exp(-\gamma\|x-x'\|^2)\) corresponds to an infinite-dimensional mapping. Caveat: the kernel matrix costs \(O(n^2)\) space/time; for large \(n\) (>10K) use approximations (Nystrom, random Fourier features).

Q: How do you choose a kernel?

A:

| Kernel | Formula | When to use |
|---|---|---|
| Linear | \(x^Tx'\) | \(d > n\) (text, genomics), linearly separable data |
| Polynomial | \((\gamma x^Tx' + r)^d\) | Known polynomial relationship, NLP (degree 2-3) |
| RBF (Gaussian) | \(\exp(-\gamma\|x-x'\|^2)\) | Default when the data's structure is unknown |
| Sigmoid | \(\tanh(\gamma x^Tx' + r)\) | Rare; resembles a neural net with 1 hidden layer |

Decision flow:

  1. Start with a linear SVM (LinearSVC) -- fast and often sufficient
  2. If accuracy < target -- RBF SVM (SVC(kernel='rbf'))
  3. Tune \(\gamma\) and \(C\) via GridSearchCV

\(\gamma\) in RBF: - Small \(\gamma\): each point has long-range influence (smoother boundary, underfitting) - Large \(\gamma\): each point influences only its neighbors (complex boundary, overfitting)
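The tuning step above can be sketched with GridSearchCV (synthetic data; the grid values are illustrative, and the scaler is included because SVM requires scaled features):

```python
# Sketch: joint C / gamma tuning for an RBF SVM
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
grid = GridSearchCV(
    pipe,
    {'svc__C': [0.1, 1, 10], 'svc__gamma': ['scale', 0.01, 0.1]},
    cv=5,
)
grid.fit(X, y)
```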

Q: SVM vs Logistic Regression -- when to use which?

A:

| Aspect | SVM | Logistic Regression |
|---|---|---|
| Objective | Max margin (hinge loss) | Max likelihood (log loss) |
| Output | Decision value (not a probability) | Probability \(P(y=1 \mid x)\) |
| Outliers | Less sensitive (hinge is flat beyond the margin) | More sensitive (log loss keeps growing) |
| Feature scaling | Mandatory | Recommended |
| Kernels | Yes (non-linear) | No (linear only) |
| Speed | \(O(n^2)\) - \(O(n^3)\) | \(O(nd)\) |
| Large data | Poor > 50K | Fine on millions |

Rules of thumb: - Need probabilities → Logistic Regression - Little data (\(n < 10K\)), non-linear → SVM with RBF - Lots of data (\(n > 50K\)) → Logistic Regression (or LinearSVC) - Text classification → LinearSVC (often beats LR on sparse data)

Q: Multiclass SVM -- OvO vs OvR

A:

| | One-vs-Rest (OvR) | One-vs-One (OvO) |
|---|---|---|
| Classifiers | \(k\) | \(\frac{k(k-1)}{2}\) |
| Training | Each on \(n\) samples | Each on \(\approx \frac{2n}{k}\) samples |
| Prediction | Max confidence score | Majority voting |
| Better when | Large \(k\), large \(n\) | Small \(n\), kernel SVM |

Scikit-learn: SVC uses OvO by default; LinearSVC uses OvR. For more than 10 classes, OvR is faster.

Q: SVM for regression (SVR)

A:

Idea: an \(\epsilon\)-insensitive tube -- errors inside \(\epsilon\) are not penalized.

\[\min_{w,b} \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}(\xi_i + \xi_i^*)\]
\[\text{s.t.} \quad |y_i - (w^Tx_i + b)| \leq \epsilon + \xi_i\]

| Parameter | Effect |
|---|---|
| \(\epsilon\) | Tube width (error tolerance) |
| \(C\) | Penalty for leaving the tube |

When SVR: small dataset, non-linear dependencies, outliers (the tube ignores them).

Q: SVM for imbalanced data

A:

Class weights: \[C_+ = C \cdot \frac{n}{2 \cdot n_+}, \quad C_- = C \cdot \frac{n}{2 \cdot n_-}\]

Scikit-learn:

SVC(class_weight='balanced')
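The 'balanced' mode computes exactly the \(\frac{n}{k \cdot n_c}\) weights from the formula above; a minimal sketch with a made-up 9:1 split:

```python
# Sketch: 'balanced' class weights reproduce n / (n_classes * n_c)
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 90 + [1] * 10)  # 9:1 imbalance

weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=y)
# class 0: 100 / (2 * 90), class 1: 100 / (2 * 10)
```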

Q: SVM scalability -- when NOT to use it?

A:

| \(n\) (samples) | Recommendation |
|---|---|
| < 10K | SVM with any kernel |
| 10K-100K | LinearSVC (or SGDClassifier with hinge loss) |
| > 100K | Not SVM. Use LogReg, GBDT, neural nets |

Reason: kernel SVM builds an \(n \times n\) kernel matrix -- \(O(n^2)\) memory, \(O(n^3)\) training.

Large-scale alternatives:

| Method | Complexity |
|---|---|
| LinearSVC (liblinear) | \(O(nd)\) |
| SGDClassifier(loss='hinge') | \(O(nd)\), online |
| Nystrom approximation | \(O(nm^2)\), \(m \ll n\) |
| Random Fourier Features | \(O(nDd)\), \(D\) = projection dim |

Q: Where is SVM still relevant in 2026?

A:

| Task | Why SVM |
|---|---|
| Text classification (small corpus) | LinearSVC on TF-IDF often beats fine-tuned BERT at \(n < 5K\) |
| Bioinformatics (gene expression) | \(d \gg n\), kernel methods are natural |
| Anomaly detection (One-Class SVM) | No anomalous examples needed for training |
| Small dataset + non-linear | RBF SVM at \(n < 1K\) often beats RF/GBDT |

Where SVM lost: tabular data > 10K samples (GBDT wins), vision (CNN), NLP (Transformers), anything large-scale.

Misconception: SVM outputs probabilities

SVC.predict_proba() in sklearn uses Platt scaling (sigmoid calibration on top of decision values). This is NOT a native probability -- it is post-hoc calibration, slow (\(O(n^2)\) cross-validation), and it can be inaccurate. If you need probabilities, use Logistic Regression.

Misconception: the RBF kernel always beats linear

At \(d > n\) (high-dimensional, sparse data) a linear kernel is often better than RBF. Reason: in high-dimensional space the data is often linearly separable, so RBF adds unnecessary complexity and overfits. Rule: text/genomics → linear, low-dimensional tabular → RBF.


Gradient Boosting

Q: Gradient Boosting vs Random Forest — when to use which?

A:

| Gradient Boosting | Random Forest |
|---|---|
| Sequential training | Parallel training |
| Low bias, higher variance | Higher bias, low variance |
| Prone to overfitting | Resistant to overfitting |
| Requires careful tuning | Easy to tune |
| Better accuracy (potential) | Good baseline |

Practical: - Start with RF for baseline - Use GBDT if need max accuracy - XGBoost/LightGBM/CatBoost > sklearn GBDT

Q: How do you choose the learning rate in Gradient Boosting?

A:

Trade-off: - Low LR (0.01): More trees needed, better generalization - High LR (0.3): Fewer trees, faster, may overfit

Rule of thumb: - Start with LR=0.1, n_estimators=100 - If overfitting: decrease LR, increase n_estimators - Typical: LR=0.01-0.1

Relation: \(n\_estimators \propto 1/LR\)
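A minimal sketch of the trade-off on synthetic data (the specific LR/tree pairs are illustrative): a quarter of the learning rate with 4x the trees tends to match or beat the faster configuration.

```python
# Sketch: low LR + many trees vs high LR + few trees
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

fast = GradientBoostingClassifier(learning_rate=0.2, n_estimators=50,
                                  random_state=0).fit(X_tr, y_tr)
slow = GradientBoostingClassifier(learning_rate=0.05, n_estimators=200,
                                  random_state=0).fit(X_tr, y_tr)

acc_fast = fast.score(X_te, y_te)
acc_slow = slow.score(X_te, y_te)
```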

Q: XGBoost vs LightGBM vs CatBoost

A:

| Feature | XGBoost | LightGBM | CatBoost |
|---|---|---|---|
| Tree growth | Level-wise | Leaf-wise | Symmetric |
| Categorical | Manual | Native | Native (best) |
| Missing values | Native | Native | Native |
| Speed | Good | Fastest | Good |
| Memory | Medium | Low | Medium |
| Tuning | Complex | Medium | Easy |

Practical: - CatBoost: Best for categorical, minimal tuning - LightGBM: Fastest for large datasets - XGBoost: Most mature, good default


Feature Engineering

Q: Target Encoding vs One-Hot for high-cardinality features

A:

| One-Hot | Target Encoding |
|---|---|
| 1 column per category | 1 column total |
| No leakage risk | Leakage risk |
| Works for tree models | Works for linear models |
| O(k) dimensions | O(1) dimension |

Target Encoding risks: - Leakage если не использовать CV - Overfitting на rare categories

Solution: Leave-one-out, smoothing, CV-based encoding.
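A minimal sketch of CV-based encoding with smoothing (the `city` column, toy values, and the smoothing strength m=10 are made up for illustration):

```python
# Sketch: out-of-fold target encoding with smoothing
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({
    'city': ['a', 'a', 'b', 'b', 'c', 'c', 'a', 'b'],
    'y':    [1,   0,   1,   1,   0,   0,   1,   1],
})
global_mean = df['y'].mean()
m = 10  # smoothing strength: pulls rare categories toward the global mean

df['city_te'] = global_mean
for tr_idx, val_idx in KFold(n_splits=4, shuffle=True, random_state=0).split(df):
    # statistics come from the training fold only -- no target leakage
    stats = df.iloc[tr_idx].groupby('city')['y'].agg(['mean', 'count'])
    smoothed = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)
    df.loc[df.index[val_idx], 'city_te'] = (
        df.iloc[val_idx]['city'].map(smoothed).fillna(global_mean).values
    )
```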

Q: When should you use a log transform?

A:

Для: - Right-skewed distributions (income, prices) - Positive values only - Multiplicative relationships

Effect: - Reduces skewness - Stabilizes variance - Makes relationships more linear

# Log transform
import numpy as np
X_log = np.log1p(X)  # log(1+X): handles zeros, requires X >= 0

# Power transform (Box-Cox requires strictly positive values;
# use method='yeo-johnson' if X contains zeros or negatives)
from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer(method='box-cox')
X_transformed = pt.fit_transform(X)

Feature Selection

Q: RFE vs Feature Importance — which is better?

A:

RFE (Recursive Feature Elimination): - Trains model, removes weakest feature, repeat - Computationally expensive - More reliable ranking

Feature Importance: - Single model training - Faster - May be biased (high-cardinality)

Practical: - Quick: Feature importance from Random Forest - Important: RFE with cross-validation

Q: Mutual Information vs Correlation for feature selection

A:

| Correlation | Mutual Information |
|---|---|
| Linear only | Any relationship |
| [-1, 1] scale | [0, ∞) scale |
| Fast to compute | Slower |
| Gaussian assumption | No assumption |

When MI > Correlation: - Non-linear relationships - Categorical features - Complex interactions
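The classic demonstration is a purely quadratic relationship: linear correlation is near zero while mutual information clearly detects it (a minimal sketch on synthetic data):

```python
# Sketch: y = x^2 -- no linear correlation, but a strong MI signal
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=2000)
y = x ** 2 + rng.normal(scale=0.01, size=2000)

corr = np.corrcoef(x, y)[0, 1]  # near 0: no linear relationship
mi = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0]  # > 0
```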


Hard Questions (Senior+)

Q: Derive the gradient for Logistic Regression with L2 regularization

A:

\[L = -\sum[y_i\log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)] + \frac{\lambda}{2}\|w\|^2\]
\[\frac{\partial L}{\partial w} = X^T(\hat{y} - y) + \lambda w\]
\[\frac{\partial L}{\partial b} = \sum(\hat{y}_i - y_i)\]

Q: How does early stopping work?

A:

Algorithm: 1. Split train → train_sub + validation 2. Train, evaluate on validation each epoch 3. Track best validation score 4. Stop if no improvement for patience epochs 5. Return best model (not last!)

Practical:

from copy import deepcopy
import numpy as np
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.1)

best_score = 0
best_model = None
patience = 10
wait = 0
for epoch in range(max_epochs):
    # incremental training: partial_fit continues from the previous epoch
    # (a plain .fit() would retrain from scratch every iteration);
    # model must support partial_fit, e.g. SGDClassifier
    model.partial_fit(X_train, y_train, classes=np.unique(y))
    score = model.score(X_val, y_val)
    if score > best_score:
        best_score = score
        best_model = deepcopy(model)  # sklearn's clone() would drop the fitted state
        wait = 0
    else:
        wait += 1
        if wait >= patience:
            break

Q: Stratified K-Fold vs K-Fold — when to use which?

A:

| K-Fold | Stratified K-Fold |
|---|---|
| Random split | Preserves class ratios |
| Works for regression | Classification only |
| Folds may be unbalanced | Balanced folds |

When Stratified: - Imbalanced classification - Small datasets - Rare classes

When K-Fold: - Regression - Large balanced datasets - Time series (use TimeSeriesSplit instead)


Model Interpretability (SHAP & LIME) — NEW 2026

Q: Why does model interpretability matter?

A:

Business reasons: - Regulatory compliance (GDPR "right to explanation") - Trust building with stakeholders - Debug model biases and errors - Feature leakage detection

Technical reasons: - Validate model behavior matches domain knowledge - Identify spurious correlations - Debug poor performance on specific cases

Q: SHAP vs LIME — what's the difference?

A:

| SHAP | LIME |
|---|---|
| Game-theoretic (Shapley values) | Local surrogate model |
| Consistent, additive | May be inconsistent |
| Global + local explanations | Local only |
| Slower (especially KernelSHAP) | Faster |
| Feature interactions visible | No interactions by default |

SHAP formula: \[\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!(|N|-|S|-1)!}{|N|!}[f(S \cup \{i\}) - f(S)]\]

LIME formula: \[\xi(x) = \arg\min_{g \in G} \mathcal{L}(f, g, \pi_x) + \Omega(g)\]

Q: How do you interpret SHAP values?

A:

Global interpretation: - Mean |SHAP| = feature importance - SHAP distribution = effect direction (positive/negative) - Dependence plots = feature interaction effects

Local interpretation: - SHAP value = contribution of feature to this prediction - Sum of all SHAP values = prediction - base_value - Positive SHAP = increases prediction - Negative SHAP = decreases prediction

import shap

# TreeExplainer for tree models (fast)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Visualization
shap.summary_plot(shap_values, X)
shap.dependence_plot("feature_name", shap_values, X)

# Local explanation
shap.force_plot(explainer.expected_value, shap_values[0], X.iloc[0])

Q: When to use SHAP, and when LIME?

A:

Use SHAP when: - Need consistent explanations - Want global feature importance - Tree models (TreeExplainer is fast) - Budget allows for computation

Use LIME when: - Need quick local explanations - Any model type (model-agnostic) - Limited compute resources - Text or image explanations

Production tip: Pre-compute SHAP for common cases, use LIME for real-time ad-hoc explanations.

Q: SHAP/LIME problems in production

A:

Challenges: 1. Computational cost: KernelSHAP needs many model calls 2. Stability: Explanations may vary between runs 3. Counterfactual: Doesn't tell "what if" (need different tools) 4. Human interpretation: Still requires ML knowledge to understand

Solutions: - TreeExplainer for tree models (exact, fast) - Pre-compute explanations for common inputs - Cache results - Use for debugging, not as sole explanation


Reinforcement Learning Basics

Q: What's the difference between value-based and policy-based methods?

A:

| Value-based | Policy-based |
|---|---|
| Learn Q(s,a) or V(s) | Learn π(a\|s) directly |
| Pick the action via argmax | Sample from the distribution |
| DQN, Q-learning | REINFORCE, A3C, PPO |
| Discrete actions | Continuous actions |
| Sample efficient | Needs many episodes |
| Lower variance (off-policy) | High variance (Monte Carlo returns) |

Hybrid approach (Actor-Critic): the actor learns the policy, the critic estimates the value function.

Q: Explain the Q-learning algorithm

A:

Idea: iteratively update Q-values via the Bellman equation.

Q-learning update: \[Q(s,a) \leftarrow Q(s,a) + \alpha [r + \gamma \max_{a'} Q(s',a') - Q(s,a)]\]

Key components: - \(\alpha\) — learning rate - \(\gamma\) — discount factor (0.9-0.99) - \(\epsilon\)-greedy — exploration vs exploitation

Deep Q-Network (DQN): - the Q-function is approximated by a neural network - experience replay — learn from past transitions - target network — a separate network for stability

# Q-learning update
q_table[state, action] += lr * (
    reward + gamma * np.max(q_table[next_state]) - q_table[state, action]
)
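The update rule above can be run end to end on a toy environment; a minimal sketch on a made-up 5-state chain where reward 1 is given only for reaching the right end:

```python
# Sketch: tabular Q-learning on a 5-state chain MDP
import numpy as np

n_states, n_actions = 5, 2  # actions: 0 = left, 1 = right; state 4 is terminal
q_table = np.zeros((n_states, n_actions))
lr, gamma, epsilon = 0.5, 0.9, 0.2
rng = np.random.default_rng(0)

for _ in range(500):
    state = 0
    while state != n_states - 1:
        # epsilon-greedy action selection
        if rng.random() < epsilon:
            action = int(rng.integers(n_actions))
        else:
            action = int(np.argmax(q_table[state]))
        next_state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
        reward = 1.0 if next_state == n_states - 1 else 0.0
        # the Bellman update from the snippet above
        q_table[state, action] += lr * (
            reward + gamma * np.max(q_table[next_state]) - q_table[state, action]
        )
        state = next_state

greedy_policy = np.argmax(q_table[:-1], axis=1)  # learned: always go right
```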

Q: What is a policy gradient? REINFORCE?

A:

Policy Gradient Theorem: \[\nabla J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla \log \pi_\theta(a|s) \cdot Q^{\pi}(s,a)]\]

Intuition: increase the probability of actions that led to high reward.

REINFORCE algorithm:

  1. Sample a trajectory \(\tau\) from \(\pi_\theta\)
  2. Compute the return \(G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k\)
  3. Update: \(\theta \leftarrow \theta + \alpha \nabla \log \pi_\theta(a_t|s_t) G_t\)

The problem with REINFORCE: high variance (one trajectory = a noisy estimate).

Q: PPO — why is it popular?

A:

Proximal Policy Optimization fixes the instability of vanilla policy gradients.

Key idea: don't change the policy too much in one step.

Clipped objective: \[L^{CLIP}(\theta) = \mathbb{E}_t[\min(r_t(\theta)\hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t)]\]

where \(r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\) is the probability ratio.

Advantages: - Stable (clipping prevents large updates) - Sample efficient (reuses samples) - Simple to implement - SOTA for many RL tasks

Q: Exploration vs exploitation — how to balance them?

A:

Problem: you must both try new actions (exploration) and use the best known ones (exploitation).

Strategies:

  1. \(\epsilon\)-greedy: with probability \(\epsilon\) take a random action, otherwise the best action. Decay \(\epsilon\) from 1.0 to 0.01.

  2. Upper Confidence Bound (UCB): \[a_t = \arg\max_a [Q_t(a) + c\sqrt{\frac{\ln t}{N_t(a)}}]\] Balances exploitation (the Q-value) against exploration (the uncertainty term).

  3. Thompson Sampling: a Bayesian approach — sample from the posterior over Q-values.

  4. Entropy regularization: add \(-\beta H(\pi)\) to the loss to encourage diversity.

In practice: \(\epsilon\)-greedy for simplicity, UCB for bandits, entropy regularization for continuous control.
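The UCB rule above can be sketched on a toy bandit (the three arm probabilities and c=2.0 are made up for illustration):

```python
# Sketch: UCB1 action selection on a 3-armed Bernoulli bandit
import numpy as np

rng = np.random.default_rng(0)
true_p = [0.2, 0.5, 0.8]  # hidden reward probability per arm
n_arms, c = 3, 2.0
counts = np.zeros(n_arms)
values = np.zeros(n_arms)

for t in range(1, 2001):
    if 0 in counts:  # play each arm once before using the UCB formula
        arm = int(np.argmin(counts))
    else:
        ucb = values + c * np.sqrt(np.log(t) / counts)  # Q + uncertainty bonus
        arm = int(np.argmax(ucb))
    reward = float(rng.random() < true_p[arm])
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean

best_arm = int(np.argmax(counts))  # pulls concentrate on the best arm
```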


NLP: Word Embeddings (Word2Vec, GloVe)

Q: What's the difference between CBOW and Skip-gram?

A:

| CBOW | Skip-gram |
|---|---|
| Predicts the center word from the context | Predicts the context from the center word |
| Faster; stronger on frequent words | Stronger on rare words |
| Smooths the context (averaging) | Exact for each context word |
| \(P(w_t \mid w_{t-c}, \ldots, w_{t+c})\) | \(P(w_{t+j} \mid w_t)\) |

CBOW: input — one-hot context words → averaging → hidden layer → softmax over the center word.

Skip-gram: input — one-hot center word → hidden layer → an independent softmax for each context word.

In practice: Skip-gram with negative sampling is the standard (the Google News word2vec vectors).

Q: How does Negative Sampling work?

A:

Problem: a softmax over the whole vocabulary (100K+ words) is expensive at every training step.

Solution: replace the softmax with a binary classification per example.

Original softmax: \[P(w_o | w_i) = \frac{\exp(v_{w_o}^T v_{w_i})}{\sum_{w \in V} \exp(v_w^T v_{w_i})}\]

Negative Sampling objective: \[L = \log \sigma(v_{w_o}^T v_{w_i}) + \sum_{k=1}^{K} \mathbb{E}_{w_k \sim P_n(w)}[\log \sigma(-v_{w_k}^T v_{w_i})]\]

where K = 5-20 negative samples and \(P_n(w) \propto f(w)^{3/4}\) (the 0.75 power up-weights rare words).

Idea: one positive pair (center, context) plus K negative pairs (center, random word).

# Negative sampling loss (sketch: center and context are 1-D embedding
# tensors of size dim; negative_samples is a (K, dim) tensor)
import torch

def negative_sampling_loss(center, context, negative_samples):
    # Positive: the observed center-context pair should score high
    pos_score = torch.dot(center, context)
    pos_loss = -torch.log(torch.sigmoid(pos_score))

    # Negative: K random words should score low against the center
    neg_scores = torch.matmul(negative_samples, center)
    neg_loss = -torch.sum(torch.log(torch.sigmoid(-neg_scores)))

    return pos_loss + neg_loss

Q: Word2Vec vs GloVe — what's the difference?

A:

| Word2Vec | GloVe |
|---|---|
| Predictive (local context) | Count-based (global co-occurrence) |
| Skip-gram / CBOW | Matrix factorization |
| Sliding window | Co-occurrence matrix |
| Online learning | Batch (matrix ops) |
| No explicit global info | Captures global statistics |

GloVe objective: \[J = \sum_{i,j=1}^{V} f(X_{ij})(w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij})^2\]

where \(X_{ij}\) is the co-occurrence count and \(f\) is a weighting function (down-weights overly frequent pairs).

In practice: GloVe is often better on analogies, Word2Vec on downstream tasks with fine-tuning.

Q: Why do word embeddings capture semantics?

A:

Distributional Hypothesis: "You shall know a word by the company it keeps" (Firth, 1957).

Mechanism:

  1. Similar words appear in similar contexts
  2. The model learns to predict context → similar vectors for similar contexts
  3. The vector space reflects distributional similarity

Analogies: \(\vec{king} - \vec{man} + \vec{woman} \approx \vec{queen}\)

Limitations: - Polysemy: "bank" (river vs financial) gets a single vector - Antonyms can end up close (similar contexts) - No compositional meaning

Modern fix: contextualized embeddings (BERT, ELMo) — different vectors for different contexts.

Q: What is FastText and how does it differ from Word2Vec?

A:

FastText (Facebook AI Research, 2016) extends Word2Vec with subword information.

Key difference: a word is represented as a bag of character n-grams with boundary markers, e.g. for n=3 "apple" → ["<ap", "app", "ppl", "ple", "le>"] plus the whole-word token "<apple>". A word's vector is the sum of its n-gram vectors.

Formula: \[\vec{w} = \sum_{g \in G_w} \vec{z}_g\]

where \(G_w\) is the set of n-grams of word \(w\).

Advantages:

  1. OOV handling: can build an embedding for unseen words
  2. Morphology: captures that "running", "runner", "runs" share patterns
  3. Rare words: better thanks to shared subwords
  4. Multilingual: works well for morphologically rich languages (Russian, German)

Disadvantages:

  1. Memory: more vectors (n-grams vs words)
  2. Noise: subwords can introduce noise
  3. Slower: more parameters

import fasttext

# Train FastText
model = fasttext.train_unsupervised(
    'data.txt',
    model='skipgram',
    dim=300,
    ws=5,          # window size
    minCount=5,
    minn=3,        # min n-gram
    maxn=6         # max n-gram
)

# OOV handling — works!
embedding = model.get_word_vector('unprecedentedword')

Q: Word2Vec vs GloVe vs FastText — when to use which?

A:

| Criteria | Word2Vec | GloVe | FastText |
|---|---|---|---|
| Training | Predictive (local) | Count-based (global) | Predictive + subwords |
| OOV handling | ❌ No | ❌ No | ✅ Yes (via subwords) |
| Memory | Low | High (co-occurrence matrix) | Medium-High |
| Speed | Fast | Medium | Medium |
| Rare words | Poor | Medium | Good |
| Morphology | No | No | Yes |
| Best for | General NLP, speed | Analogy tasks | OOV, morphological langs |

Decision framework:

# Choose Word2Vec when:
# - Speed priority
# - Well-defined vocabulary (no OOV expected)
# - Limited compute

# Choose GloVe when:
# - Global context matters
# - Analogy tasks important
# - Clean, large corpus available

# Choose FastText when:
# - OOV words common (user-generated content)
# - Morphologically rich language (Russian, Finnish)
# - Domain-specific vocabulary

2026 Context: Static embeddings → less common with transformers, but still useful for: - Lightweight production systems - Resource-constrained environments - Baseline comparisons - Word similarity tasks

Q: Как оценить качество word embeddings?

A:

Intrinsic evaluation: 1. Analogy tests: a:b :: c:? (Google analogy dataset) 2. Similarity correlation: Spearman с human judgments (WordSim-353, SimLex-999) 3. Concept categorization: Clustering quality (WordNet)

Extrinsic evaluation: 1. Downstream task performance (NER, sentiment, QA) 2. Probe tasks (part-of-speech, syntactic tree depth)

Practical: Intrinsic — для разработки, Extrinsic — для production.

# Cosine similarity для word vectors
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def word_similarity(w1, w2, embeddings):
    v1 = embeddings[w1]
    v2 = embeddings[w2]
    return cosine_similarity([v1], [v2])[0][0]

# Analogy: king - man + woman = ?
def analogy(a, b, c, embeddings):
    """Find d such that a:b :: c:d (assumes gensim-style KeyedVectors)"""
    target = embeddings[a] - embeddings[b] + embeddings[c]
    # Find nearest word to target (attribute is index_to_key in gensim >= 4)
    similarities = cosine_similarity([target], embeddings.vectors)[0]
    return embeddings.index_to_key[int(np.argmax(similarities))]

NLP: Named Entity Recognition (NER) & Sequence Labeling

Q: Что такое NER и как оценивать?

A:

Named Entity Recognition — задача извлечения именованных сущностей (Person, Organization, Location, Date, etc.) из текста.

Формат: BIO tagging (Begin-Inside-Outside) - B-PER, I-PER — person name - B-ORG, I-ORG — organization - O — not an entity
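A minimal sketch of converting entity spans to BIO tags; the `spans_to_bio` helper and its `(start, end, type)` span format (token indices, end exclusive) are illustrative, not a specific library's API:

```python
def spans_to_bio(tokens, spans):
    """Convert entity spans to BIO tags. Spans are (start, end, type)
    with token indices, end exclusive — a made-up format for
    illustration, not a specific library's API."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"          # Begin
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"          # Inside
    return tags                              # everything else stays O

tokens = ["Barack", "Obama", "visited", "Google", "HQ"]
tags = spans_to_bio(tokens, [(0, 2, "PER"), (3, 4, "ORG")])
# ["B-PER", "I-PER", "O", "B-ORG", "O"]
```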

Метрики:

Token-level: - Precision, Recall, F1 для каждого класса - Micro vs Macro averaging

Entity-level (строже): - Exact match: границы и тип должны совпасть - Partial match: overlap > threshold

CoNLL-2003 standard: Entity-level F1.
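Entity-level exact-match F1 can be computed directly on sets of spans; a small sketch (the set-of-tuples representation is assumed for illustration):

```python
def entity_f1(gold, pred):
    """Exact-match entity-level F1 (CoNLL-style): an entity counts
    only if both boundaries and type match. gold/pred are sets of
    (start, end, type) tuples."""
    tp = len(gold & pred)                       # exact span+type matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {(0, 2, "PER"), (3, 4, "ORG")}
pred = {(0, 2, "PER"), (3, 5, "ORG")}  # ORG boundary is off by one
score = entity_f1(gold, pred)          # 0.5 — the boundary error costs the whole entity
```

Note how much stricter this is than token-level F1: the ORG prediction overlaps the gold entity but earns zero credit.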

Q: CRF vs BiLSTM vs BERT для NER?

A:

| CRF | BiLSTM-CRF | BERT |
|-----|------------|------|
| Hand-crafted features | Learned features | Contextualized embeddings |
| No deep learning | Sequence model | Pretrained transformer |
| Fast inference | Medium | Slow (but fine-tuning helps) |
| Works on small data | Needs more data | Transfer learning |

BiLSTM-CRF: - BiLSTM: contextual representations - CRF layer: learns transition constraints (I-PER after B-PER, not I-ORG)

BERT for NER: - Fine-tune BERT + linear classifier - Subword tokenization → use first subword for entity - SOTA on CoNLL-2003 (93+ F1)

Q: Как обрабатывать nested entities?

A:

Problem: "University of California" — ORG, но "California" внутри — LOC.

Approaches: 1. Flat NER: Игнорировать вложенность (стандартный подход) 2. Layered NER: Два прохода — сначала outer, потом inner entities 3. Hypergraph decoding: Joint prediction всех уровней 4. Seq2seq: Generate entity spans with markers

Практика: Большинство систем — flat NER, nested — отдельная post-processing или специализированные модели.

Q: POS Tagging — основные подходы?

A:

Part-of-Speech Tagging — присвоение грамматических категорий (NOUN, VERB, ADJ, etc.) словам.

Approaches:

1. HMM (Hidden Markov Model):

\[P(t_1^n, w_1^n) = \prod_{i=1}^{n} P(w_i | t_i)\, P(t_i | t_{i-1})\]

   - Emission: \(P(w|t)\) — word given tag
   - Transition: \(P(t_i | t_{i-1})\) — tag bigram
   - Viterbi decoding

2. CRF (Conditional Random Field):

\[P(t|w) = \frac{1}{Z(w)} \exp\left(\sum_i \theta \cdot f(t_{i-1}, t_i, w, i)\right)\]

   - Features: word, suffix, prefix, neighboring tags
   - Global normalization

3. BiLSTM / BiLSTM-CRF:

   - Learned features, no manual feature engineering

4. BERT fine-tuning:

   - Contextual representations
   - 97%+ accuracy на Penn Treebank

Practical: BERT для high accuracy, HMM/CRF для скорости и interpretability.
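The Viterbi decoding step mentioned above can be sketched in plain numpy; the 2-tag HMM below (`pi`, `A`, `B`) uses made-up probabilities purely for illustration:

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Viterbi decoding for an HMM tagger, in log-space.
    obs: observation (word) indices; pi: initial tag probs (K,);
    A: transition probs (K, K); B: emission probs (K, V).
    Returns the most likely tag sequence."""
    K, T = len(pi), len(obs)
    log_pi, log_A, log_B = np.log(pi), np.log(A), np.log(B)
    delta = np.zeros((T, K))           # best log-prob ending in tag k at step t
    psi = np.zeros((T, K), dtype=int)  # backpointers
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A   # (K_prev, K_cur)
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
    # Backtrack from the best final tag
    tags = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        tags.append(int(psi[t][tags[-1]]))
    return tags[::-1]

# Toy 2-tag, 2-word HMM (illustrative numbers)
pi = np.array([0.9, 0.1])
A = np.array([[0.8, 0.2], [0.3, 0.7]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
path = viterbi([0, 1, 1], pi, A, B)  # [0, 1, 1]
```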


Hyperparameter Optimization

Q: Parameters vs Hyperparameters — разница?

A:

Parameters Hyperparameters
Learned from data during training Set before training
Internal to model (weights, biases) Control learning process
Optimized by optimizer (SGD, Adam) Set by practitioner or search
Examples: weights in NN, coefficients in regression Examples: learning rate, batch size, num layers

Hyperparameters determine HOW model learns, parameters are WHAT model learns.

Q: Grid Search vs Random Search — когда что использовать?

A:

Grid Search: Exhaustive search over all combinations.

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['rbf', 'linear'],
    'gamma': ['scale', 'auto']
}
grid_search = GridSearchCV(SVC(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

Random Search: Sample random combinations.

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import loguniform

param_dist = {
    'C': loguniform(1e-3, 1e3),
    'kernel': ['rbf', 'linear'],
    'gamma': ['scale', 'auto']
}
random_search = RandomizedSearchCV(SVC(), param_dist, n_iter=50, cv=5)

When Random > Grid: - Most hyperparameters don't matter much (only a few are important) - Random search explores more values for important params - Paper: "Random Search for Hyper-Parameter Optimization" (Bergstra & Bengio, 2012)

Q: Что такое Bayesian Optimization?

A:

Idea: Build probabilistic model of objective function, use it to guide search.

Components: 1. Surrogate model: Gaussian Process (GP) approximates f(x) 2. Acquisition function: Decides where to sample next (balance exploration vs exploitation)

Acquisition functions: - Expected Improvement (EI): \(EI(x) = \mathbb{E}[\max(f(x) - f^*, 0)]\) - Upper Confidence Bound (UCB): \(UCB(x) = \mu(x) + \beta \sigma(x)\) - Probability of Improvement (PI): \(PI(x) = P(f(x) > f^*)\)
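The Expected Improvement formula above can be written out directly; this sketch assumes a GP posterior mean `mu` and std `sigma` at the candidate points are already available (no GP fitting here):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best, xi=0.01):
    """EI acquisition for maximization given GP posterior mean/std.
    xi > 0 shifts the balance toward exploration."""
    sigma = np.maximum(sigma, 1e-12)          # avoid division by zero
    z = (mu - f_best - xi) / sigma
    # Closed form: exploitation term + exploration term
    return (mu - f_best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

mu = np.array([0.5, 0.7, 0.9])      # posterior means at 3 candidates
sigma = np.array([0.2, 0.1, 0.01])  # posterior stds
ei = expected_improvement(mu, sigma, f_best=0.8)
next_point = int(ei.argmax())        # candidate to evaluate next
```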

Optuna example:

import optuna

def objective(trial):
    lr = trial.suggest_float('lr', 1e-5, 1e-1, log=True)
    batch_size = trial.suggest_categorical('batch_size', [32, 64, 128])
    layers = trial.suggest_int('layers', 1, 5)

    model = build_model(lr, batch_size, layers)
    score = train_and_evaluate(model)
    return score

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)

When Bayesian > Random: - Expensive evaluations (training takes hours) - Low-dimensional search space (<20 params) - Smooth objective function

Q: Optuna vs Ray Tune — когда что?

A:

Aspect Optuna Ray Tune
Focus Single-node optimization Distributed at scale
Sampling TPE, CMA-ES, GP Same + population-based
Distributed Via RDB/Redis Native Ray cluster
Early stopping Pruning (Median, Async) PBT, ASHA, Hyperband
Integration Sklearn, PyTorch, TF PyTorch, TF, XGBoost
Ease of use Simpler API More complex

Use Optuna when: - Single machine or small cluster - Need sophisticated sampling (TPE, CMA-ES) - Simpler setup

Use Ray Tune when: - Large-scale distributed training - Population-Based Training (PBT) - Already using Ray ecosystem

Q: Что такое Early Stopping в HPO?

A:

Problem: Many trials are bad — stop them early to save compute.

Approaches:

1. Median Pruning (Optuna):

pruner = optuna.pruners.MedianPruner(n_startup_trials=5, n_warmup_steps=10)
study = optuna.create_study(pruner=pruner)

Mechanism: At step k, if trial's intermediate value < median of previous trials, prune.

2. ASHA (Async Successive Halving): - Run many trials with minimal resources - Promote top performers to more resources - Early stop underperformers

3. Hyperband: - Multiple brackets of ASHA with different resource allocations - Better theoretical guarantees
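The successive-halving core of ASHA/Hyperband can be illustrated with a toy loop; the `evaluate` function here is a deterministic stand-in for partial training:

```python
import numpy as np

def successive_halving(configs, eta=3, budget=1):
    """One run of successive halving: score every config on a small
    budget, keep the top 1/eta, multiply the budget by eta, repeat
    until a single survivor remains."""
    def evaluate(cfg, budget):
        # stand-in for partial training; deterministic for the sketch
        return cfg["quality"]
    survivors = list(configs)
    while len(survivors) > 1:
        scores = [evaluate(c, budget) for c in survivors]
        k = max(1, len(survivors) // eta)
        top = np.argsort(scores)[::-1][:k]       # promote top performers
        survivors = [survivors[i] for i in top]
        budget *= eta                            # survivors get more budget
    return survivors[0]

configs = [{"id": i, "quality": q}
           for i, q in enumerate([0.2, 0.5, 0.9, 0.4, 0.6, 0.3])]
best = successive_halving(configs)  # the id=2 config survives both rungs
```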

Q: Как приоритизировать гиперпараметры для тюнинга?

A:

Priority order (higher = tune first):

  1. Learning rate — biggest impact on convergence
  2. Batch size — affects generalization and speed
  3. Optimizer — Adam vs SGD with momentum
  4. Architecture — layers, units per layer
  5. Regularization — dropout, weight decay
  6. Data augmentation — for vision

Coarse-to-fine strategy:

# Stage 1: Coarse search
lr_range = [1e-4, 1e-3, 1e-2, 1e-1]  # Log scale

# Stage 2: Fine search around best
best_lr = 1e-3
lr_range = [5e-4, 1e-3, 2e-3, 5e-3]

Q: Nested Cross-Validation для HPO — зачем?

A:

Problem: Using same CV split for HPO and evaluation → overfitting to validation set.

Solution: Nested CV — inner loop for HPO, outer loop for evaluation.

Outer CV (k=5 folds):
  For each fold:
    Inner CV (k=5 folds):
      GridSearchCV on training portion
    Evaluate best model on outer test fold

from sklearn.model_selection import KFold, GridSearchCV, cross_val_score

# Inner: HPO
inner_cv = KFold(n_splits=5)
clf = GridSearchCV(estimator=model, param_grid=param_grid, cv=inner_cv)

# Outer: Evaluation
outer_cv = KFold(n_splits=5)
nested_score = cross_val_score(clf, X, y, cv=outer_cv)

Trade-off: 5×5 = 25 model fits per HPO candidate → expensive but unbiased.

Q: Sensitivity Analysis для гиперпараметров?

A:

Goal: Understand which hyperparameters matter most.

Methods:

1. One-at-a-time (OAT): Vary one param, fix others. - Simple but misses interactions

2. Morris Method: Measure elementary effects.

\[EE_i = \frac{f(x_1, ..., x_i + \Delta, ..., x_k) - f(x)}{\Delta}\]

3. Sobol Indices: Variance-based decomposition. - \(S_i\) = first-order (main effect) - \(S_{Ti}\) = total effect (including interactions)

4. fANOVA (for Optuna):

import optuna
from optuna.importance import FanovaImportanceEvaluator

study = optuna.create_study()
study.optimize(objective, n_trials=100)

importance = optuna.importance.get_param_importances(
    study, evaluator=FanovaImportanceEvaluator()
)

Output: {'lr': 0.45, 'batch_size': 0.30, 'layers': 0.15, 'dropout': 0.10}

Q: Multi-objective HPO — как балансировать accuracy и latency?

A:

Problem: Maximize accuracy, minimize latency — conflicting objectives.

Approaches:

1. Scalarization:

\[L = \alpha \cdot (1 - \text{accuracy}) + (1 - \alpha) \cdot \frac{\text{latency}}{\text{max\_latency}}\]

2. Pareto Front: Find set of solutions where no objective can improve without worsening another.

Optuna multi-objective:

def objective(trial):
    lr = trial.suggest_float('lr', 1e-4, 1e-1, log=True)

    accuracy, latency = train_and_profile(model)

    return accuracy, latency  # maximize accuracy, minimize latency

study = optuna.create_study(directions=['maximize', 'minimize'])
study.optimize(objective, n_trials=100)

# Get Pareto front
pareto_trials = study.best_trials

Decision: Choose from Pareto front based on business constraints.


Active Learning

Q: Что такое Active Learning?

A:

Definition: ML paradigm where algorithm strategically selects most informative samples for labeling, reducing annotation cost.

Key insight: Not all samples equally valuable — some provide more information than others.

Active learning loop: 1. Start with small labeled set \(L\), large unlabeled pool \(U\) 2. Train model on \(L\) 3. Query oracle for labels of most informative samples from \(U\) 4. Add newly labeled samples to \(L\) 5. Repeat until budget exhausted or target accuracy reached

Goal: Achieve target accuracy with minimum labeling cost.
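The loop above as a minimal pool-based sketch with least-confidence sampling; a cheap sklearn model plays the learner, and the "oracle" simply reads the known label `y[query]`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Pool-based active learning loop (uncertainty sampling)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
labeled = list(range(10))                              # small initial set L
pool = [i for i in range(len(X)) if i not in labeled]  # unlabeled pool U

model = LogisticRegression(max_iter=1000)
for _ in range(20):                        # labeling budget
    model.fit(X[labeled], y[labeled])
    proba = model.predict_proba(X[pool])
    uncertainty = 1 - proba.max(axis=1)    # least-confidence score
    query = pool[int(np.argmax(uncertainty))]
    labeled.append(query)                  # "oracle" provides y[query]
    pool.remove(query)
```

In practice the oracle is a human annotator, and queries are batched rather than requested one at a time.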

Q: Query strategies — Uncertainty Sampling?

A:

Core idea: Query samples where model is most uncertain.

Metrics:

1. Least Confidence:

\[x^* = \arg\max_x (1 - P(\hat{y}|x))\]

Query samples with lowest max probability.

import numpy as np

def least_confidence(probas):
    # probas: (n_samples, n_classes)
    max_proba = probas.max(axis=1)
    return np.argmax(1 - max_proba)  # Most uncertain

2. Margin Sampling:

\[x^* = \arg\min_x (P(\hat{y}_1|x) - P(\hat{y}_2|x))\]

Query samples where top two classes are closest.

def margin_sampling(probas):
    # Sort probabilities
    sorted_probas = np.sort(probas, axis=1)[:, ::-1]
    margins = sorted_probas[:, 0] - sorted_probas[:, 1]
    return np.argmin(margins)  # Smallest margin

3. Entropy:

\[x^* = \arg\max_x \left(-\sum_c P(y_c|x) \log P(y_c|x)\right)\]

Query samples with highest prediction entropy.

def entropy_sampling(probas):
    # Entropy: -sum(p * log(p))
    eps = 1e-10
    entropy = -np.sum(probas * np.log(probas + eps), axis=1)
    return np.argmax(entropy)  # Highest entropy

Comparison:

| Strategy | Best for | Limitation |
|----------|----------|------------|
| Least Confidence | Binary classification | Ignores class distribution |
| Margin | Multi-class | Only considers top 2 |
| Entropy | Multi-class | Computationally heavier |

Q: Query-by-Committee (QBC)?

A:

Idea: Train multiple models (committee), query samples with highest disagreement.

Disagreement measures:

1. Vote Entropy:

\[x^* = \arg\max_x \left(-\sum_c \frac{V_c}{C} \log \frac{V_c}{C}\right)\]

Where \(V_c\) = votes for class \(c\), \(C\) = committee size.

2. Kullback-Leibler Divergence:

\[x^* = \arg\max_x \frac{1}{C} \sum_{c=1}^{C} D_{KL}\left(P(y|x;\theta_c) \,\|\, P(y|x)\right)\]

Implementation:

import numpy as np

class QueryByCommittee:
    def __init__(self, n_models=5):
        self.models = [create_model() for _ in range(n_models)]

    def fit(self, X, y):
        for model in self.models:
            # Bootstrap sample for diversity
            idx = np.random.choice(len(X), len(X), replace=True)
            model.fit(X[idx], y[idx])

    def query(self, X_pool, n_samples=1):
        # Collect predictions
        predictions = np.array([
            model.predict_proba(X_pool) for model in self.models
        ])  # (n_models, n_samples, n_classes)

        # Vote entropy
        votes = np.argmax(predictions, axis=2)  # (n_models, n_samples)
        vote_counts = np.apply_along_axis(
            lambda x: np.bincount(x, minlength=predictions.shape[2]),
            axis=0, arr=votes
        )  # (n_classes, n_samples)
        vote_probas = vote_counts / len(self.models)
        entropy = -np.sum(vote_probas * np.log(vote_probas + 1e-10), axis=0)

        return np.argsort(entropy)[-n_samples:]

Q: Expected Model Change?

A:

Idea: Query samples that would cause largest change in model if labeled.

Expected Gradient Length (EGL):

\[x^* = \arg\max_x \mathbb{E}_{y \sim P(y|x)} \|\nabla L(x, y)\|\]

Intuition: If gradient would be large regardless of label, sample is informative.

import torch

def expected_gradient_length(model, x, possible_labels):
    total_grad_norm = 0

    for y in possible_labels:
        # Compute loss gradient for this label
        loss = compute_loss(model, x, y)
        grads = torch.autograd.grad(loss, model.parameters())
        grad_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))

        # Weight by probability of this label
        prob = model.predict_proba(x)[y]
        total_grad_norm += prob * grad_norm

    return total_grad_norm

Pros: Theoretically motivated, considers impact on model Cons: Computationally expensive (requires gradients for each candidate)

Q: Diversity-based sampling?

A:

Problem: Uncertainty sampling may select redundant samples.

Solution: Balance uncertainty with diversity.

Core-set selection:

\[\min_{S \subseteq U} \max_{x \in U} \min_{s \in S} d(x, s)\]

Find subset \(S\) that covers unlabeled pool well.

Coreset via k-Center:

import numpy as np
from scipy.spatial.distance import cdist

def k_center_selection(X_pool, n_samples, already_labeled=None):
    """Greedy k-center for diverse selection."""
    selected = []

    if already_labeled is not None:
        # Start with distances to already labeled
        dist_matrix = cdist(X_pool, already_labeled)
        min_distances = dist_matrix.min(axis=1)
    else:
        min_distances = np.full(len(X_pool), np.inf)
        # Start with random point
        selected.append(np.random.randint(len(X_pool)))
        min_distances[selected[0]] = 0

    while len(selected) < n_samples:
        # Find point furthest from any selected
        next_idx = np.argmax(min_distances)
        selected.append(next_idx)

        # Update distances
        new_dists = cdist(X_pool, X_pool[selected[-1:]])
        min_distances = np.minimum(min_distances, new_dists.flatten())

    return selected

BADGE (Batch Active Learning by Diverse Gradient Embeddings): - Combine uncertainty + diversity - Embed samples using gradient embeddings - k-means++ selection in embedding space

Q: Когда Active Learning НЕ эффективен?

A:

Failure cases:

  1. Very small initial labeled set:
     - Model too weak to identify informative samples
     - Random sampling may be better initially

  2. Highly imbalanced data:
     - May oversample minority class unnecessarily
     - Or ignore rare but important samples

  3. Clustered data structure:
     - May miss entire clusters if initial samples don't cover them
     - Solution: Combine with diversity sampling

  4. Noisy labels:
     - Querying uncertain samples may amplify noise
     - Solution: Label smoothing, robust loss

  5. Budget too small:
     - Active learning overhead > benefit
     - Random sampling competitive for <100 samples

Rule of thumb: Active learning shines when: - Labeling cost >> computation cost - 100+ queries budget - Model has reasonable base accuracy (>50%)

Q: Active Learning в production — best practices?

A:

Implementation checklist:

  1. Start with diversity:

    # First batch: stratified or k-center
    if len(labeled) < initial_size:
        return diversity_sampling(X_unlabeled, batch_size)
    else:
        return uncertainty_sampling(model, X_unlabeled, batch_size)

  2. Combine strategies:

    # 70% uncertainty + 30% diversity
    n_uncertain = int(0.7 * batch_size)
    n_diverse = batch_size - n_uncertain

    uncertain = uncertainty_sampling(model, X_pool, n_uncertain)
    remaining_pool = np.setdiff1d(np.arange(len(X_pool)), uncertain)
    diverse = diversity_sampling(X_pool[remaining_pool], n_diverse)

    return np.concatenate([uncertain, diverse])

  3. Cold start handling:
     - First 50-100 samples: random or stratified
     - After model shows promise: switch to active learning

  4. Human-in-the-loop:
     - Show model confidence to annotator
     - Allow annotator to flag "don't know" or "bad sample"
     - Track annotator agreement

  5. Stopping criteria:
     - Model accuracy plateaus
     - Budget exhausted
     - Remaining samples all low uncertainty

Tools: - modAL: Python active learning framework - Label Studio: Annotation with active learning plugin - SuperAnnotate: Computer vision active learning - Prodigy: NLP active learning


Time Series: Deep Learning Methods

Q: DeepAR — как работает?

A:

Architecture: Autoregressive RNN с probabilistic output.

Key features: 1. Global model: Learns from multiple related time series 2. Autoregressive: Uses past values as input 3. Probabilistic: Outputs distribution (Gaussian with mean + std) 4. Covariates: Can include time-dependent and static features

Training:

\[p(y_{t:T} \mid y_{1:t}, x_{1:T}) = \prod_{t'=t}^{T} p(y_{t'} \mid y_{1:t'-1}, x_{1:T}, \theta)\]

Inference: Sample from predicted distribution → prediction intervals.

# DeepAR prediction (conceptual)
def predict_deepar(model, context, num_samples=100):
    samples = []
    for _ in range(num_samples):
        # Sample from predicted distribution at each step
        pred_dist = model(context)  # Gaussian(mean, std)
        sample = pred_dist.sample()
        samples.append(sample)
    return {
        'mean': np.mean(samples, axis=0),
        'std': np.std(samples, axis=0),
        'quantiles': np.quantile(samples, [0.1, 0.5, 0.9], axis=0)
    }

Advantages over ARIMA: - Handles multiple related series (learns globally) - Works with covariates - Produces probabilistic forecasts - Can handle cold start with item features

Q: Temporal Fusion Transformer (TFT)?

A:

Architecture: 1. Variable Selection Network: Learns which features are important 2. Static Covariate Encoder: Processes time-invariant features 3. Gated Residual Network (GRN): Non-linear processing with skip connections 4. Multi-head Attention: Learns temporal dependencies + interpretability 5. Quantile Regression: Predicts multiple quantiles for intervals

Three input types: - Static: Product category, store location - Known future: Holidays, promotions (available at prediction time) - Historical: Past sales, weather (only available from past)

Key innovation — Interpretability: - Variable importance: Which features matter - Attention weights: Which past time steps matter - Seasonal patterns: Via attention visualization

# TFT attention interpretation
attention_weights = model.get_attention_weights(x)  # (batch, heads, seq_len)
# Identify which past steps influence predictions
important_steps = attention_weights.mean(dim=(0, 1)).argsort(descending=True)[:5]

When to use TFT: - Multiple known future covariates - Need interpretability - Complex temporal patterns - Long-range dependencies

Q: Prophet vs ARIMA vs Deep Learning?

A:

| Method | Strengths | Weaknesses | Best For |
|--------|-----------|------------|----------|
| ARIMA | Interpretable, well-understood | Single series, manual tuning | Clean univariate |
| Prophet | Multiple seasonalities, holidays | Less accurate, limited covariate support | Business forecasting |
| DeepAR | Global learning, covariates | Needs many series | Related series |
| TFT | Interpretability, all covariates | Complex, needs data | Complex systems |
| N-BEATS | Pure DL, no features | Black box | Pure DL forecasting |

Prophet model:

\[y(t) = g(t) + s(t) + h(t) + \varepsilon_t\]

Where: - \(g(t)\) = trend (piecewise linear or logistic) - \(s(t)\) = seasonality (Fourier series) - \(h(t)\) = holiday effects

from prophet import Prophet

model = Prophet(
    yearly_seasonality=True,
    weekly_seasonality=True,
    daily_seasonality=False
)
model.add_country_holidays(country_name='US')
model.fit(df)  # df with 'ds' (date) and 'y' (value) columns
forecast = model.predict(future_df)

Q: Time Series Cross-Validation?

A:

Critical: Never use random split — temporal order must be preserved!

Rolling origin (expanding window):

Fold 1: Train [0:100],  Test [100:120]
Fold 2: Train [0:120],  Test [120:140]
Fold 3: Train [0:140],  Test [140:160]

Sliding window:

Fold 1: Train [0:100],  Test [100:120]
Fold 2: Train [20:120], Test [120:140]
Fold 3: Train [40:140], Test [140:160]

from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # Train and evaluate

Metrics: - MAPE: Mean Absolute Percentage Error = \(\frac{100\%}{n}\sum|\frac{y_i - \hat{y}_i}{y_i}|\) - MASE: Mean Absolute Scaled Error = \(\frac{MAE}{MAE_{naive}}\) - RMSE: Root Mean Squared Error - WMAPE: Weighted MAPE = \(\frac{\sum|y_i - \hat{y}_i|}{\sum|y_i|}\)
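The metrics above in plain numpy — a minimal sketch, where MASE scales by the m-step (seasonal-)naive forecast on the training series:

```python
import numpy as np

def mape(y, yhat):
    """Mean Absolute Percentage Error (undefined when y contains 0)."""
    return 100 * np.mean(np.abs((y - yhat) / y))

def wmape(y, yhat):
    """Weighted MAPE: robust to near-zero individual values."""
    return 100 * np.sum(np.abs(y - yhat)) / np.sum(np.abs(y))

def mase(y, yhat, y_train, m=1):
    """MASE: MAE scaled by the m-step naive forecast's MAE on train.
    Values < 1 mean 'better than naive'."""
    naive_mae = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return np.mean(np.abs(y - yhat)) / naive_mae

y_train = np.array([10.0, 12, 11, 13, 12])
y = np.array([14.0, 13])
yhat = np.array([13.0, 14])
# here forecast MAE = 1 and naive MAE = 1.5, so MASE = 2/3
```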

Q: N-BEATS architecture?

A:

Key idea: Stack of fully-connected blocks with forward and backward residuals.

Architecture: 1. Each block has two outputs: - Forward: Forecast (contribution to final prediction) - Backward: Backcast (explains input, removed for next block)

  2. Two configurations:
     - Generic: Learns any pattern
     - Interpretable: Separate trend + seasonality blocks

Formula:

\[\hat{y} = \sum_{b=1}^{B} \hat{y}_b, \quad x_{b+1} = x_b - \hat{x}_b\]

Where \(\hat{y}_b\) = forecast from block \(b\), \(\hat{x}_b\) = backcast from block \(b\).

Advantages: - Pure deep learning (no feature engineering) - Interpretable mode separates trend/seasonality - Competitive with M4 competition winner (Smyl's ES-RNN); outperformed other neural methods on M4 benchmarks


Explainable AI (XAI): SHAP & LIME

Q: Зачем нужен XAI в production?

A: 4 ключевые причины:

  1. Regulatory Compliance: EU AI Act (2024), GDPR right to explanation — high-risk AI системы обязаны объяснять решения
  2. Trust Building: 78% enterprise AI rejected due to lack of interpretability (2025)
  3. Debugging: XAI помогает найти почему модель ошибается
  4. Bias Detection: Выявление unfair patterns в predictions

Q: SHAP vs LIME — в чём разница?

A:

| Критерий | SHAP | LIME |
|----------|------|------|
| Теория | Game theory (Shapley values) | Local surrogate models |
| Гарантии | Consistency, Additivity, Efficiency | Local fidelity only |
| Скорость | TreeSHAP: ~65ms, KernelSHAP: ~450ms | ~85ms (tabular) |
| Stability | 95% | 82% |
| Memory | TreeSHAP: 78MB, KernelSHAP: 680MB | 92MB |
| Model-specific | TreeSHAP, DeepSHAP, LinearSHAP | Model-agnostic |

SHAP formula:

\[\phi_i(f, x) = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(|N|-|S|-1)!}{|N|!} \left[f(S \cup \{i\}) - f(S)\right]\]

LIME formula:

\[\xi(x) = \arg\min_{g \in G} \mathcal{L}(f, g, \pi_x) + \Omega(g)\]

Где \(\pi_x(z) = \exp(-D(x, z)^2 / \sigma^2)\) — kernel weighting.

Q: Когда использовать SHAP, а когда LIME?

A:

Выбирай SHAP когда: - Tree-based модели (Random Forest, XGBoost, LightGBM) — TreeSHAP exact + fast - Нужны global explanations (summary plots, dependence plots) - Regulated industries (finance, healthcare) — theoretical rigor важен - Comparing feature importance across instances

Выбирай LIME когда: - Novel architectures без SHAP implementation - Нужно quick explanation для single prediction - Stakeholders non-technical — local linear понятнее - Ограниченные compute resources

Best practice: Hybrid approach — SHAP для production monitoring, LIME для ad-hoc investigations.

Q: Как SHAP обеспечивает consistency?

A: 4 математических свойства (axioms):

  1. Efficiency: \(\sum_{i=1}^{M} \phi_i = f(x) - E[f(X)]\) — сумма SHAP values = deviation from baseline
  2. Symmetry: Если features вносят одинаковый вклад во все coalitions → равные SHAP values
  3. Dummy: Features которые не влияют на prediction → SHAP = 0
  4. Additivity: Для ensemble: SHAP_total = SHAP_model1 + SHAP_model2

Эти гарантии делают SHAP единственным методом удовлетворяющим всем desiderata одновременно.
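The Efficiency axiom can be verified by brute-force Shapley computation on a tiny model; in this sketch, features "absent" from a coalition are imputed with the background mean (an interventional-style baseline), and all model and data values are made up for illustration:

```python
import itertools
import math
import numpy as np

# Brute-force Shapley values for a small linear model, checking
# Efficiency: sum(phi_i) = f(x) - f(baseline)
rng = np.random.default_rng(0)
X_bg = rng.normal(size=(100, 3))        # background dataset
w = np.array([2.0, -1.0, 0.5])
f = lambda z: z @ w                     # the model to explain
x = np.array([1.0, 2.0, 3.0])           # instance to explain
baseline = X_bg.mean(axis=0)

def f_coalition(S):
    """Model output when only features in S take their true values."""
    z = baseline.copy()
    z[list(S)] = x[list(S)]
    return f(z)

n = len(x)
phi = np.zeros(n)
for i in range(n):
    others = [j for j in range(n) if j != i]
    for r in range(n):
        for S in itertools.combinations(others, r):
            # Shapley weight |S|! (n - |S| - 1)! / n!
            weight = (math.factorial(len(S)) * math.factorial(n - len(S) - 1)
                      / math.factorial(n))
            phi[i] += weight * (f_coalition(S + (i,)) - f_coalition(S))

assert np.isclose(phi.sum(), f(x) - f(baseline))  # Efficiency holds
```

For a linear model this recovers the closed form \(\phi_i = w_i (x_i - \bar{x}_i)\), which is what LinearSHAP computes directly.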

Q: LIME instability — как решить?

A: Problem: Small input changes → significantly different explanations (18% variance).

Solutions:

  1. Multiple runs + average:

    from lime.lime_tabular import LimeTabularExplainer

    explanations = []
    for seed in range(5):
        # the seed is set on the explainer, not on explain_instance
        explainer = LimeTabularExplainer(X_train, random_state=seed)
        exp = explainer.explain_instance(x, predict_fn)
        explanations.append(exp.as_map())
    stable_explanation = average_explanations(explanations)
    

  2. Increase num_samples: Default 5000, increase to 15000+ for stability

  3. Cross-validation on explanations: Run LIME multiple times, check variance
  4. Use SHAP instead: 95% stability vs 82% for LIME

Q: Как интерпретировать SHAP values?

A:

For single prediction: - \(\phi_i > 0\) → feature i pushes prediction UP - \(\phi_i < 0\) → feature i pushes prediction DOWN - \(|\phi_i|\) = magnitude of contribution

Example (Credit Approval):

Income:     +0.35  (pushes toward approval)
CreditScore: +0.30  (pushes toward approval)
Debt:       -0.22  (pushes toward rejection)
Age:        +0.08  (small positive)
Base value: 0.50 (average approval rate)
Final:      0.50 + 0.35 + 0.30 - 0.22 + 0.08 = 1.01 → APPROVED

Global interpretation: - Summary plot: Feature importance ranking across all instances - Dependence plot: How feature value affects SHAP value - Interaction plot: Feature interactions

Q: SHAP для deep learning — какие подходы?

A:

  1. DeepSHAP: Combines SHAP with DeepLIFT backpropagation
     - Fast for neural networks
     - Uses gradient × input decomposition

  2. GradientSHAP: Integrates gradients with SHAP
     - Works for any differentiable model
     - More expensive but theoretically sound

  3. PartitionSHAP: For hierarchical models (Transformers)
     - Handles attention layers properly

import shap

# DeepSHAP for PyTorch
explainer = shap.DeepExplainer(model, background_data)
shap_values = explainer.shap_values(test_data)

# GradientSHAP
explainer = shap.GradientExplainer(model, background_data)
shap_values = explainer.shap_values(test_data)

Q: Production XAI pipeline — как построить?

A:

Architecture:

Client Request → API Gateway → XAI Engine → Explanation Cache → Response
                                  ↕
                  Model Registry + Monitoring Service

Key components:

  1. Precomputation: Cache common explanations during training
  2. Adaptive sampling: Early stopping when explanation stabilizes
  3. Redis cache: Store precomputed SHAP values
  4. Fallback: LIME for cold-start, SHAP for cached

Performance optimizations: - Caching: Reduce latency from 2.1s → 120ms - Batch explanations: Compute SHAP for multiple instances together - TreeSHAP for tree models: 10x faster than KernelSHAP

Q: Common failure modes XAI?

A:

  1. Correlated features: SHAP/LIME underestimate importance when features are highly correlated
     - Solution: Group correlated features, use conditional expectations

  2. Out-of-distribution: Explanations unreliable for OOD samples
     - Error can exceed 40% for far-from-training instances
     - Solution: Flag OOD samples, don't trust explanations blindly

  3. Feature interactions: Linear explanations miss non-linear interactions
     - Solution: Use SHAP interaction values (expensive: O(n²))

  4. Baseline dependency: Results sensitive to background dataset
     - Solution: Use representative background, document choice

Neural Architecture Search (NAS)

Q: Что такое NAS и зачем он нужен?

A: NAS — автоматический поиск оптимальной архитектуры нейросети.

3 компонента: 1. Search Space: Какие операции/связи допустимы (conv, pooling, attention) 2. Search Strategy: Как исследовать пространство (RL, EA, gradient-based) 3. Performance Estimation: Как быстро оценить candidate (proxy tasks, weight sharing)

Зачем: - Ручной дизайн требует 120,000+ GPU hours/month (Tesla, 2025) - NAS находит architectures которые люди не придумают - Hardware-aware NAS оптимизирует под конкретное устройство

Q: Какие search strategies в NAS?

A:

| Strategy | Как работает | Pros | Cons |
|----------|--------------|------|------|
| RL | RNN controller генерирует architectures, reward = accuracy | Осваивает сложные spaces | 1800 GPU-days (NASNet) |
| Evolutionary | Population, mutation, crossover, selection | Простой, parallelizable | Expensive evaluation |
| DARTS | Continuous relaxation, gradient descent on architecture params | 1 GPU-day | Discretization gap |
| Bayesian | Gaussian process models performance | Sample-efficient | Struggles with high-dim |
| Random | Uniform sampling | Baseline, simple | Slow for large spaces |
| One-Shot | Train supernet once, sample subnets | Fast evaluation | Weight sharing bias |

DARTS key insight:

\[\bar{o}(x) = \sum_{o \in \mathcal{O}} \frac{\exp(\alpha_o)}{\sum_{o'} \exp(\alpha_{o'})}\, o(x)\]

Где \(\alpha\) — learnable architecture parameters. После training: argmax \(\alpha\) → discrete architecture.
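The mixed-operation relaxation can be sketched in numpy; the 1-D "operations" below are toy stand-ins for conv/pool candidates:

```python
import numpy as np

# DARTS-style mixed operation: softmax over architecture logits alpha
# weights the candidate ops; after search, the argmax op is kept.
ops = {
    "identity": lambda x: x,
    "double":   lambda x: 2 * x,
    "zero":     lambda x: 0 * x,
}

def mixed_op(x, alpha):
    weights = np.exp(alpha) / np.exp(alpha).sum()     # softmax(alpha)
    return sum(w * op(x) for w, op in zip(weights, ops.values()))

alpha = np.array([0.1, 2.0, -1.0])    # learned architecture logits
out = mixed_op(np.array([1.0, -1.0]), alpha)

# Discretization step: keep only the strongest operation
best_op = list(ops)[int(np.argmax(alpha))]  # "double"
```

The discretization gap mentioned above is exactly the difference between the soft mixture `mixed_op` used during search and the single `best_op` kept afterwards.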

Q: Что такое cell-based search space?

A: Instead of searching whole network, search small reusable cell.

NASNet cells: - Normal cell: Same spatial resolution - Reduction cell: Halves resolution (stride-2)

Cell = DAG: - Nodes = operations (3x3 conv, 5x5 conv, pooling) - Edges = connections - Stacked N times → full network

Advantages: - Transferable (CIFAR → ImageNet) - Smaller search space - Faster search

Limitations: - Low variance among found architectures - Constrained expressiveness

Q: Hardware-Aware NAS — как работает?

A: Incorporate hardware constraints into search.

Metrics added to objective: - Latency: \(\mathcal{L} = \text{Accuracy} - \lambda \cdot \text{Latency}\) - Memory: Peak memory usage - Energy: FLOPs × power per operation

Approaches:

  1. ProxylessNAS: Learn to prune paths, measure on target device
  2. MnasNet: Multi-objective optimization (accuracy + latency)
  3. Once-for-All: Train supernet, specialize for different devices

Example (Mobile optimization):

# Loss with latency constraint
loss = ce_loss + lam * (latency / target_latency - 1) ** 2  # lam = trade-off weight

Q: One-Shot NAS — в чём идея?

A: Train one supernet containing all architectures, evaluate by sampling.

Once-for-All Network (OFA): 1. Train supernet supporting all configurations 2. At inference: sample subnet with desired constraints 3. No retraining needed

Weight sharing benefits: - 10,000x faster than training from scratch - Single training → multiple deployment targets

Challenge: Weight sharing bias — shared weights may not reflect standalone performance.

Solutions: - Progressive shrinking (OFA): Train large, gradually add smaller configs - Sandwich rule: Train min, max, random each step

Q: Когда NAS НЕ стоит использовать?

A:

Не используйте NAS когда: 1. Small scale (<7B params): Overhead не окупается 2. Single-domain task: Нет benefit от specialization 3. Latency-critical: Search overhead too high 4. Limited compute: Search может занять недели 5. Strong baseline exists: ResNet/EfficientNet достаточно

Rule of thumb: NAS оправдан когда: - Unique hardware constraints (edge, mobile) - Novel task без established architectures - Budget ≥ 100 GPU-days for search - Expected significant efficiency gains

Q: EfficientNet — как NAS помог?

A: EfficientNet = NAS + Compound Scaling.

Шаг 1: NAS (baseline) - Found EfficientNet-B0 via MnasNet - Optimized for accuracy + latency

Шаг 2: Compound Scaling - Scale all dimensions together:

\[\text{depth} = \alpha^\phi, \quad \text{width} = \beta^\phi, \quad \text{resolution} = \gamma^\phi\]

  • Constraint: \(\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2\) (FLOPs растут примерно как \(2^\phi\))
  • \(\phi\) = user-specified coefficient (B0→B7)

Result: 8.4x smaller + 6.1x faster than GPipe while achieving similar accuracy.
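Арифметику compound scaling можно проверить коротким наброском (α, β, γ — коэффициенты из статьи EfficientNet; базовые depth/width/resolution здесь условные):

```python
alpha, beta, gamma = 1.2, 1.1, 1.15   # найдены grid search'ем в оригинальной работе

def scale(phi, base_depth=16, base_width=32, base_res=224):
    """Масштабирует все три измерения одним коэффициентом phi."""
    return (round(base_depth * alpha ** phi),
            round(base_width * beta ** phi),
            round(base_res * gamma ** phi))

# FLOPs растут примерно как (alpha * beta^2 * gamma^2)^phi ≈ 2^phi
print(round(alpha * beta ** 2 * gamma ** 2, 2))  # 1.92
print(scale(phi=3))                              # (28, 43, 341)
```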



Cost-Sensitive Learning

Источники: CodeGenes: Cost-Sensitive Learning in PyTorch (2025), LinkedIn: Class Weights & Cost-Sensitive Learning (2025), Elkan (2001)

Q: Что такое Cost-Sensitive Learning?

A:

Definition: ML technique where different misclassification errors have different costs.

Example: Medical diagnosis - FN (sick → healthy): Missing cancer = very costly - FP (healthy → sick): Unnecessary tests = less costly

Cost matrix: $\(C = \begin{bmatrix} 0 & c_{01} \\ c_{10} & 0 \end{bmatrix}\)$

Where \(C_{ij}\) = cost of predicting class \(j\) when true class is \(i\).
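Из cost matrix вытекает Bayes-optimal правило: предсказывать класс с минимальной ожидаемой стоимостью \(\text{argmin}_j \sum_i P(i|x) C_{ij}\). Минимальный набросок (numpy; вероятности условные):

```python
import numpy as np

def cost_sensitive_predict(proba, cost_matrix):
    """Выбирает класс с минимальной ожидаемой стоимостью."""
    expected_cost = proba @ cost_matrix   # (batch, n_classes)
    return expected_cost.argmin(axis=1)

C = np.array([[0, 1],
              [10, 0]])                   # FN в 10 раз дороже FP
proba = np.array([[0.95, 0.05],
                  [0.70, 0.30]])
print(cost_sensitive_predict(proba, C))   # [0 1] — второй пример "перещёлкивается" в positive
```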

Q: Как реализовать cost-sensitive learning в PyTorch?

A:

Method 1: Weighted Cross-Entropy

import torch
import torch.nn as nn

# Define class weights (higher for minority/costly class)
class_weights = torch.tensor([1.0, 10.0])  # [class 0, class 1]

criterion = nn.CrossEntropyLoss(weight=class_weights)
loss = criterion(predictions, targets)

Method 2: Custom Cost-Sensitive Loss

import torch
import torch.nn.functional as F

def cost_sensitive_loss(predictions, targets, cost_matrix):
    """
    predictions: (batch, n_classes) logits
    targets: (batch,) class indices
    cost_matrix: (n_classes, n_classes)
    """
    n_classes = cost_matrix.shape[0]
    one_hot = torch.eye(n_classes)[targets]  # (batch, n_classes)

    # Get cost for each sample's true class
    costs = one_hot @ cost_matrix  # (batch, n_classes)

    # Weighted log-likelihood
    log_probs = F.log_softmax(predictions, dim=1)
    loss = -torch.sum(costs * log_probs, dim=1).mean()

    return loss

# Example: FN costs 10x more than FP
cost_matrix = torch.tensor([
    [0, 1],    # True 0: cost of predicting 0, 1
    [10, 0]    # True 1: cost of predicting 0, 1
], dtype=torch.float32)

Method 3: Sample-wise weights

# Different weight for each sample
sample_weights = torch.tensor([1, 5, 1, 10, ...])

# Compute per-sample loss
losses = F.cross_entropy(predictions, targets, reduction='none')
weighted_loss = (losses * sample_weights).mean()

Q: Когда использовать cost-sensitive learning?

A:

| Scenario | Approach | Cost Matrix Example |
|---|---|---|
| Medical diagnosis | High FN cost | FN=10, FP=1 |
| Fraud detection | High FN cost | FN=100, FP=1 |
| Spam filter | High FP cost | FN=1, FP=10 |
| Loan approval | Asymmetric | Default=50, Rejection=1 |

Rule of thumb: - Set cost ratio = inverse of acceptable error ratio - If FN is 10x worse than FP → weight(class_1) = 10 * weight(class_0)
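Rule of thumb выше эквивалентен сдвигу порога классификации по Elkan (2001): предсказывать positive при \(P(y=1|x) \ge c_{FP}/(c_{FP}+c_{FN})\). Набросок:

```python
def cost_threshold(c_fp, c_fn):
    """Оптимальный порог по Elkan (2001) для бинарной классификации."""
    return c_fp / (c_fp + c_fn)

p_star = cost_threshold(c_fp=1, c_fn=10)
print(round(p_star, 3))                    # 0.091 — сильно ниже дефолтных 0.5

probs = [0.05, 0.2, 0.6]
print([int(p >= p_star) for p in probs])   # [0, 1, 1]
```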

Q: Cost-sensitive vs class imbalance — в чём разница?

A:

| Aspect | Class Imbalance | Cost-Sensitive |
|---|---|---|
| Focus | Sample frequency | Error cost |
| Solution | Resampling, class weights | Cost matrix, threshold adjustment |
| When to use | Minority class underrepresented | Errors have different costs |

They're related but not identical: - Class imbalance: 99% negative, 1% positive - Cost-sensitive: Missing a positive costs 100x more

Combined approach:

# Weighted loss for imbalanced + cost-sensitive
class_weights = compute_class_weight('balanced', classes, y_train)
# Adjust for costs
class_weights[1] *= 10  # Further upweight positive class

criterion = nn.CrossEntropyLoss(weight=torch.tensor(class_weights))

Q: Как оценить cost-sensitive model?

A:

1. Cost-weighted accuracy:

def cost_weighted_accuracy(y_true, y_pred, cost_matrix):
    total_cost = 0
    for t, p in zip(y_true, y_pred):
        total_cost += cost_matrix[t, p]
    return total_cost / len(y_true)

2. Expected cost: $\(\text{Expected Cost} = \sum_{i,j} C_{ij} \cdot P(\text{predict } j | \text{true } i) \cdot P(\text{true } i)\)$

3. Cost curves: Plot cost vs threshold for different operating points

4. Business metrics: Connect to actual business KPIs - Fraud: $ caught vs $ lost - Medical: Lives saved vs unnecessary procedures


16. Missing Data Handling

Basic

Q: Какие типы missing data существуют?

A: Rubin's Classification (1976):

| Type | Full Name | Definition | Example | Strategy |
|---|---|---|---|---|
| MCAR | Missing Completely At Random | P(missing) independent of all variables | Data entry error, random sensor failure | Deletion OK |
| MAR | Missing At Random | P(missing) depends on observed data | Men less likely to report depression | Imputation OK |
| MNAR | Missing Not At Random | P(missing) depends on missing value itself | High earners don't report salary | Model missingness |

Important: MCAR is the only case where deletion is unbiased. MAR/MNAR require imputation.
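Почему deletion безопасен только при MCAR, видно на игрушечной симуляции (все числа условные; зависимость пропуска от самого значения — строго говоря MNAR):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(50, 10, 100_000)       # истинное среднее = 50

mcar_mask = rng.random(x.size) < 0.3  # MCAR: пропуски не зависят ни от чего
# MNAR: чем больше значение, тем выше шанс пропуска
mnar_mask = rng.random(x.size) < np.clip((x - 40) / 40, 0, 1)

print(round(x[~mcar_mask].mean(), 1))  # ≈ 50 — deletion не смещает оценку
print(round(x[~mnar_mask].mean(), 1))  # заметно ниже 50 — deletion смещает
```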

Q: Когда drop vs impute missing values?

A:

Drop (listwise deletion) when: - MCAR mechanism confirmed - < 5% missing per column - Large dataset, small impact

Impute when: - MAR or MNAR mechanism - > 5% missing per column - Small dataset - Missingness is informative

Code check:

# Check if missingness is related to target
import missingno as msno
msno.matrix(df)  # Visualize patterns
msno.heatmap(df)  # Correlations in missingness

Medium

Q: Какие методы imputation существуют?

A:

| Method | Description | Best For | Bias Risk |
|---|---|---|---|
| Mean/Median | Replace with central tendency | Numerical, MCAR | Underestimates variance |
| Mode | Most frequent value | Categorical | Same as mean |
| Forward/Backward fill | Use adjacent values | Time series | Temporal leakage |
| KNN Imputer | k-nearest neighbors | Numerical patterns | Computationally expensive |
| MICE | Multiple Imputation by Chained Equations | Any | Gold standard for MAR |
| Iterative | Model-based (Bayesian Ridge) | Complex patterns | Assumes MAR |

MICE implementation:

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

imputer = IterativeImputer(max_iter=10, random_state=42)
X_imputed = imputer.fit_transform(X)

Q: Что такое Multiple Imputation и зачем нужна?

A: Single imputation problem: Imputed values are treated as certain → underestimates variance.

Multiple Imputation (MI) solution: 1. Create m datasets with different imputed values 2. Analyze each dataset separately 3. Pool results using Rubin's Rules

Rubin's Rules for pooling: $\(\bar{Q} = \frac{1}{m}\sum_{i=1}^{m} \hat{Q}_i\)$

(pooled estimate)

\[\bar{U} = \frac{1}{m}\sum_{i=1}^{m} U_i\]

(within-imputation variance)

\[B = \frac{1}{m-1}\sum_{i=1}^{m}(\hat{Q}_i - \bar{Q})^2\]

(between-imputation variance)

\[T = \bar{U} + (1 + \frac{1}{m})B\]

(total variance)

When to use: MAR mechanism, research/analysis context, need valid confidence intervals.
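Сам pooling по Rubin's Rules умещается в несколько строк (оценки \(\hat{Q}_i\) и дисперсии \(U_i\) здесь условные):

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pooling m оценок по Rubin's Rules: возвращает (Q_bar, total variance T)."""
    m = len(estimates)
    q_bar = np.mean(estimates)             # pooled estimate
    u_bar = np.mean(variances)             # within-imputation variance
    b = np.var(estimates, ddof=1)          # between-imputation variance
    t = u_bar + (1 + 1 / m) * b            # total variance
    return q_bar, t

q_hat = [2.1, 2.3, 1.9, 2.2, 2.0]          # коэффициент из m=5 imputed datasets
u = [0.04, 0.05, 0.04, 0.06, 0.05]         # его дисперсия в каждом из них
q_bar, t = pool_rubin(q_hat, u)
print(round(q_bar, 2), round(t, 3))        # 2.1 0.078
```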

Q: Как обрабатывать missing values в categorical features?

A:

Strategies: 1. New category: "Unknown" or "Missing" — simplest, preserves missingness info 2. Mode imputation: Most frequent — can distort distribution 3. Model-based: Predict category from other features 4. Weight of Evidence (WoE): For binary classification, encode as WoE value

# Strategy 1: New category
df['category'].fillna('Missing', inplace=True)

# Strategy 3: Model-based (using other features)
from sklearn.ensemble import RandomForestClassifier
mask = df['category'].isna()
if mask.sum() > 0:
    clf = RandomForestClassifier()
    clf.fit(df.loc[~mask, other_features], df.loc[~mask, 'category'])
    df.loc[mask, 'category'] = clf.predict(df.loc[mask, other_features])

Killer

Q: Спроектируйте missing data strategy для fraud detection pipeline.

A:

Analysis Phase:

# 1. Diagnose missingness mechanism
def diagnose_missingness(df, target_col):
    """Check if missingness predicts target"""
    df['is_missing'] = df[target_col].isna().astype(int)
    from scipy.stats import chi2_contingency
    for col in df.select_dtypes(include='object').columns:
        contingency = pd.crosstab(df['is_missing'], df[col])
        chi2, p, _, _ = chi2_contingency(contingency)
        if p < 0.05:
            print(f"Missingness of {target_col} related to {col}: p={p:.4f}")

Pipeline Architecture:

Raw Data → Missing Flag Creation → Imputation Model → Feature Engineering → Model
     ↓              ↓                      ↓
  [is_X_missing=1]  [Predicted value]  [Original + Flag + Imputed]

Implementation:

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.compose import ColumnTransformer

# Create missing indicators for important features
important_features = ['transaction_amount', 'user_age', 'device_score']
for col in important_features:
    df[f'{col}_missing'] = df[col].isna().astype(int)

# Different strategies for different columns
preprocessor = ColumnTransformer([
    ('num_knn', KNNImputer(n_neighbors=5), numerical_cols),
    ('cat_mode', SimpleImputer(strategy='most_frequent'), categorical_cols),
    ('cat_new', SimpleImputer(strategy='constant', fill_value='Unknown'), high_missing_cols)
])

pipeline = Pipeline([
    ('imputer', preprocessor),
    ('scaler', StandardScaler()),
    ('model', XGBClassifier())
])

Key decisions: - Flag missingness for high-value features (model can learn "missing = suspicious") - KNN for numerical with patterns - "Unknown" category for categorical with > 10% missing - Monitor: imputation quality, drift in missingness patterns


17. Model Debugging

Basic

Q: Что такое slice-based evaluation?

A: Slice-based evaluation — анализ model performance на подмножествах (slices) данных вместо одного aggregate metric.

Зачем: Aggregate metrics скрывают проблемы на underrepresented groups.

Slice types: - Demographic: gender, age, geography - Behavioral: new vs returning users, device type - Data-driven: high confidence vs low confidence, feature-based

# Slice-based evaluation
def evaluate_slices(model, X, y, slice_cols):
    results = {}
    for col in slice_cols:
        for value in X[col].unique():
            mask = X[col] == value
            if mask.sum() >= 50:  # Minimum samples
                results[f"{col}={value}"] = {
                    'accuracy': accuracy_score(y[mask], model.predict(X[mask])),
                    'count': mask.sum()
                }
    return results

Q: Как проводить error analysis для ML модели?

A: Systematic Error Analysis Process:

  1. Collect errors: All misclassified samples
  2. Categorize: By error type (FP, FN), feature values, prediction confidence
  3. Pattern hunt: What do errors have in common?
  4. Hypothesis: Why is model making these errors?
  5. Fix: More data, new features, different model

Code:

# Error analysis
errors = X_test[y_test != y_pred].copy()
errors['true'] = y_test[y_test != y_pred]
errors['pred'] = y_pred[y_test != y_pred]
errors['confidence'] = y_proba[y_test != y_pred].max(axis=1)

# Look for patterns
for col in X_test.columns:
    print(f"\n{col} distribution in errors vs all:")
    print(errors[col].value_counts(normalize=True).head())
    print(X_test[col].value_counts(normalize=True).head())

Medium

Q: Что такое data debugging и как его делать?

A: Data debugging — поиск проблем в данных, которые вызывают model issues.

Common data bugs: - Label noise: Incorrect labels in training data - Feature leakage: Target information in features - Distribution shift: Train/test different distributions - Outliers: Extreme values affecting model - Duplicates: Same samples causing overfitting

Debugging techniques:

# 1. Check label consistency
from cleanlab.classification import CleanLearning
clf = CleanLearning(clf=XGBClassifier())
clf.fit(X_train, y_train)
label_issues = clf.get_label_issues()  # Potentially mislabeled samples

# 2. Check for leakage
from sklearn.feature_selection import mutual_info_classif
mi = mutual_info_classif(X_train, y_train)
suspicious = [f for f, score in zip(X_train.columns, mi) if score > 0.8]  # Too predictive

# 3. Distribution check
from scipy.stats import ks_2samp
for col in X_train.columns:
    stat, p = ks_2samp(X_train[col], X_test[col])
    if p < 0.01:
        print(f"Distribution shift in {col}: p={p:.4f}")

Q: Как организовать regression testing для ML моделей?

A: ML Regression Testing — автоматическая проверка что новая модель не хуже старой на критических сценариях.

Test suite components: 1. Golden dataset: Curated examples representing key scenarios 2. Performance thresholds: Min acceptable metrics 3. Slice-specific checks: Must not degrade on important slices 4. Prediction stability: Similar inputs → similar outputs

class ModelRegressionTest:
    def __init__(self, baseline_model, golden_data, thresholds, slices=None):
        self.baseline = baseline_model
        self.golden_X, self.golden_y = golden_data
        self.thresholds = thresholds  # {'accuracy': 0.85, 'slice_degradation': 0.02}
        self.slices = slices or {}    # {'slice_name': boolean_mask} для slice-specific checks

    def test(self, new_model):
        # 1. Overall performance
        baseline_acc = accuracy_score(self.golden_y, self.baseline.predict(self.golden_X))
        new_acc = accuracy_score(self.golden_y, new_model.predict(self.golden_X))
        assert new_acc >= self.thresholds['accuracy'], f"Accuracy below threshold: {new_acc}"

        # 2. No significant regression
        assert new_acc >= baseline_acc - self.thresholds['slice_degradation'], \
               f"Regression from baseline: {baseline_acc} → {new_acc}"

        # 3. Slice-specific checks
        for slice_name, mask in self.slices.items():
            baseline_slice = accuracy_score(self.golden_y[mask], self.baseline.predict(self.golden_X[mask]))
            new_slice = accuracy_score(self.golden_y[mask], new_model.predict(self.golden_X[mask]))
            assert new_slice >= baseline_slice - 0.05, f"Regression on {slice_name}"

        return {"status": "PASSED", "baseline_acc": baseline_acc, "new_acc": new_acc}

Killer

Q: Спроектируйте model debugging workflow для production recommendation system.

A:

Architecture:

Production Logs → Error Collector → Pattern Analyzer → Alerting → Root Cause → Fix
      ↓               ↓                  ↓              ↓          ↓
  [predictions]    [misclassifies]    [slices]      [on-call]  [retrain]
  [features]       [low confidence]   [drifts]                 [features]
  [outcomes]       [edge cases]       [biases]

Implementation:

class ModelDebugger:
    def __init__(self, model, feature_store):
        self.model = model
        self.fs = feature_store
        self.error_buffer = []
        self.slice_metrics = defaultdict(list)

    def log_prediction(self, user_id, item_id, features, prediction, outcome=None):
        """Log every prediction for debugging."""
        record = {
            'timestamp': datetime.now(),
            'user_id': user_id,
            'item_id': item_id,
            'features': features,
            'prediction': prediction,
            'confidence': prediction.max(),
            'outcome': outcome  # Filled later if available
        }
        self.error_buffer.append(record)

    def analyze_errors(self):
        """Periodic error analysis."""
        # 1. Low confidence predictions
        low_conf = [r for r in self.error_buffer if r['confidence'] < 0.6]
        if len(low_conf) > 100:
            self.alert(f"High rate of low-confidence predictions: {len(low_conf)}")

        # 2. Slice-based analysis
        for slice_col in ['user_segment', 'item_category', 'device']:
            for slice_val in set(r['features'].get(slice_col) for r in self.error_buffer):
                slice_errors = [r for r in self.error_buffer
                               if r['features'].get(slice_col) == slice_val and r.get('outcome') == 'error']
                error_rate = len(slice_errors) / max(1, len([r for r in self.error_buffer
                                                            if r['features'].get(slice_col) == slice_val]))
                if error_rate > 0.1:
                    self.alert(f"High error rate on {slice_col}={slice_val}: {error_rate:.2%}")

        # 3. Feature drift
        recent_features = pd.DataFrame([r['features'] for r in self.error_buffer[-1000:]])
        baseline_features = self.fs.get_historical_features()
        for col in recent_features.columns:
            drift = self._compute_psi(recent_features[col], baseline_features[col])
            if drift > 0.2:
                self.alert(f"Feature drift detected in {col}: PSI={drift:.2f}")

    def _compute_psi(self, expected, actual, buckets=10):
        """Population Stability Index."""
        if len(actual.unique()) == 1:
            return 0
        breakpoints = np.arange(0, buckets + 1) / buckets * 100
        breakpoints = np.nanpercentile(actual, breakpoints)
        expected_percents = np.histogram(expected, breakpoints)[0] / len(expected)
        actual_percents = np.histogram(actual, breakpoints)[0] / len(actual)
        # eps в числителе и знаменателе защищает от пустых бинов
        psi_value = np.sum((actual_percents - expected_percents)
                           * np.log((actual_percents + 1e-4) / (expected_percents + 1e-4)))
        return psi_value

Key metrics to monitor: - Error rate by slice (user segment, item category) - Low confidence rate - Feature drift (PSI > 0.2) - Prediction distribution shift - Latency by model version


18. AutoML Theory

Basic

Q: Что такое AutoML и какие проблемы решает?

A: AutoML (Automated Machine Learning) автоматизирует полный ML pipeline: - Hyperparameter Optimization (HPO): Поиск оптимальных гиперпараметров - Neural Architecture Search (NAS): Автоматический поиск архитектуры - Feature Engineering: Автоматическое создание фичей - Model Selection: Выбор лучшего алгоритма - Ensembling: Автоматическое объединение моделей

Проблемы которые решает: - Эксперты тратят 60-80% времени на tuning - Человеческие ошибки и предвзятость - Непоследовательность между инженерами - Сложность для новичков

Medium

Q: Как работает Bayesian Optimization для HPO?

A: Bayesian Optimization — model-based подход к поиску гиперпараметров.

Формула оптимизации: $\(x^* = \text{argmax}_{x \in X} f(x)\)$

где \(x\) — конфигурация гиперпараметров, \(f(x)\) — performance metric.

Gaussian Process Prior: $\(f(x) \sim GP(\mu(x), k(x, x'))\)$

Acquisition Functions:

  1. Expected Improvement (EI): $\(EI(x) = E[\max(f(x) - f(x^*), 0)] = \int_{-\infty}^{\infty} \max(f - f^*, 0) p(f|x) df\)$

  2. Probability of Improvement (PI): $\(PI(x) = P(f(x) > f(x^*)) = \Phi\left(\frac{\mu(x) - f(x^*) - \xi}{\sigma(x)}\right)\)$

  3. Upper Confidence Bound (UCB): $\(UCB(x) = \mu(x) + \beta \sigma(x)\)$

Python implementation:

import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

class BayesianOptimizer:
    def __init__(self, param_bounds, n_initial=5):
        self.bounds = param_bounds  # {'lr': (0.0001, 0.1), 'batch_size': (16, 256)}
        self.n_initial = n_initial
        self.X_observed = []
        self.y_observed = []
        self.gp = GaussianProcessRegressor()

    def expected_improvement(self, X, xi=0.01):
        mu, sigma = self.gp.predict(X, return_std=True)
        sigma = np.maximum(sigma, 1e-9)  # avoid div by zero
        f_best = np.max(self.y_observed)

        with np.errstate(divide='warn'):
            imp = mu - f_best - xi
            Z = imp / sigma
            ei = imp * norm.cdf(Z) + sigma * norm.pdf(Z)
            ei[sigma == 0.0] = 0.0
        return ei

    def suggest_next(self, n_candidates=1000):
        if len(self.X_observed) < self.n_initial:
            return self._random_sample()

        self.gp.fit(np.array(self.X_observed), np.array(self.y_observed))
        candidates = self._generate_candidates(n_candidates)
        ei = self.expected_improvement(candidates)
        return candidates[np.argmax(ei)]

    def update(self, x, y):
        self.X_observed.append(x)
        self.y_observed.append(y)

Сравнение методов HPO:

| Method | Efficiency | Parallelizable | Best For |
|---|---|---|---|
| Grid Search | 45% | Yes (embarrassingly) | Small param spaces |
| Random Search | 65% | Yes | Baseline, early exploration |
| Bayesian (GP) | 95% | Limited (sequential) | Expensive evaluations |
| TPE | 90% | Limited | High-dimensional spaces |
| Multi-fidelity | 95%+ | Yes | Large datasets, deep learning |

Q: В чём разница между Grid Search, Random Search и Bayesian?

A:

Grid Search: - Перебирает все комбинации на сетке - Экспоненциальный рост: \(O(n^d)\) где \(d\) — число параметров - Неэффективен: многие комбинации бесполезны - Пример: 3 params × 10 values = 1000 trials

Random Search: - Случайная выборка из пространства - Лучшая эффективность при том же бюджете - Не учитывает предыдущие результаты - Формула: \(P(\text{top 5\%}) = 1 - (1 - 0.05)^n\)

Bayesian Optimization: - Строит surrogate model (GP) по результатам - Баланс exploration vs exploitation - Каждая новая точка информативна - Идеален для дорогих вычислений

Когда что использовать: - < 10 trials → Random Search - 10-100 trials → Bayesian (GP/TPE) - Cheap evaluation (seconds) → Grid/Random - Expensive (hours) → Bayesian + early stopping
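Формулу \(P(\text{top 5\%}) = 1 - 0.95^n\) для Random Search легко проверить численно:

```python
def p_top5(n):
    """Вероятность, что хотя бы один из n случайных trials попадёт в топ-5% конфигураций."""
    return 1 - (1 - 0.05) ** n

for n in (10, 60, 100):
    print(n, round(p_top5(n), 3))
# уже 60 случайных trials дают ~95% шанс попасть в топ-5%
```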

Killer

Q: Спроектируйте AutoML систему для команды из 50 DS.

A:

Requirements: 50 DS, 1000+ experiments/week, diverse workloads (tabular, CV, NLP).

Architecture:

graph TD
    subgraph CTRL["AutoML Controller"]
        HPO["HPO Engine<br/>Optuna/TPE"]
        NAS["NAS Engine<br/>DARTS"]
        FE["Feature Engine"]
        ENS["Ensemble Engine"]
        HPO --> SCHED
        NAS --> SCHED
        FE --> SCHED
        ENS --> SCHED
        SCHED["Trial Scheduler (Ray Tune)<br/>Resource allocation, ASHA, 100+ parallel"]
    end
    SCHED --> REG["Model Registry (MLflow)"]

    style HPO fill:#e8eaf6,stroke:#3f51b5
    style NAS fill:#e8eaf6,stroke:#3f51b5
    style FE fill:#e8eaf6,stroke:#3f51b5
    style ENS fill:#e8eaf6,stroke:#3f51b5
    style SCHED fill:#e8f5e9,stroke:#4caf50
    style REG fill:#fff3e0,stroke:#ef6c00

Key components:

  1. HPO Engine: Optuna с TPE sampler для high-dimensional spaces
  2. NAS: DARTS для CV, AutoML-Text для NLP
  3. Early Stopping: ASHA (Asynchronous Successive Halving)
  4. Multi-fidelity: Сначала на 10% данных, потом на 100%

Cost optimization: - Warm-starting: transfer learning между похожими задачами - Budget-aware: остановить если не превосходит baseline после N trials - Meta-learning: использовать историю команды для инициализации

Governance: - Auto-logging всех экспериментов - Comparison vs baseline required для promotion - Weekly AutoML reports: savings, best practices discovered

Q: Что такое Multi-Fidelity Optimization?

A: Multi-Fidelity использует дешевые аппроксимации для ускорения HPO.

Идея: Сначала evaluate на маленьком subset данных/short training, потом только лучшие на full fidelity.

Методы:

  1. Successive Halving (SH):

    # Start with N configs, train for r epochs
    # Keep top 1/η, train for r*η epochs
    # Repeat until 1 config at max epochs
    def successive_halving(configs, r_min=1, eta=3):
        n = len(configs)
        r = r_min
        while n > 1:
            results = [train(c, epochs=r) for c in configs]  # train() — внешняя функция
            n_keep = max(1, n // eta)
            configs = top_k(configs, results, k=n_keep)      # top_k — отбор лучших
            n, r = n_keep, r * eta
        return configs[0]
    

  2. ASHA (Asynchronous SHA):

  - Параллельная версия SH
  - Configs запускаются асинхронно
  - Promote когда достигли milestone

  3. Hyperband:

  - Комбинирует SH с разными budget allocations
  - Робастен к разным типам задач

Формула Hyperband: $\(s_{max} = \lfloor \log_\eta(R/r_{min}) \rfloor\)$

\[B = (s_{max} + 1) R\]
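Перечисление brackets по этим формулам (классический пример из статьи Hyperband: R=81, η=3):

```python
import math

def hyperband_brackets(R=81, r_min=1, eta=3):
    """Для каждого bracket s: стартовое число конфигураций n и стартовый budget r."""
    s_max = math.floor(math.log(R / r_min, eta) + 1e-9)  # eps против float-ошибок
    B = (s_max + 1) * R
    brackets = []
    for s in range(s_max, -1, -1):
        n = math.ceil(B / R * eta ** s / (s + 1))
        r = R / eta ** s
        brackets.append((s, n, r))
    return brackets

for s, n, r in hyperband_brackets():
    print(f"bracket s={s}: {n} configs, стартовый budget {r:g}")
```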

19. Federated Learning

Basic

Q: Что такое Federated Learning?

A: ML парадигма, где модель обучается на распределённых данных без перемещения данных на центральный сервер.

Ключевые принципы: - Data stays on device (privacy) - Only model updates are shared - Central server aggregates updates - Model improves collaboratively

Q: Как работает FedAvg (Federated Averaging)?

A:

FedAvg Algorithm (McMahan et al.): 1. Server initializes global model \(w^0\) 2. For each round \(t\): - Server sends \(w^t\) to selected clients \(S_t\) - Each client \(k\) trains locally: \(w_k^{t+1} = w^t - \eta \nabla L_k(w^t)\) - Clients send updates back - Server aggregates: \(w^{t+1} = \sum_{k \in S_t} \frac{n_k}{n} w_k^{t+1}\)

где \(n_k\) — количество samples у клиента \(k\), \(n = \sum n_k\)

Medium

Q: Какие проблемы FedAvg и как их решают?

A:

| Problem | Cause | Solution |
|---|---|---|
| Client Drift | Heterogeneous data | FedProx (proximal term) |
| Communication cost | Large model updates | Compression, sparse updates |
| Stragglers | Slow clients | Async aggregation |
| Non-IID data | Different distributions | Data sharing, clustering |

FedProx: $\(\min_w L_k(w) + \frac{\mu}{2}\|w - w^t\|^2\)$

Proximal term keeps local updates close to global model.
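Эффект proximal term виден на toy-квадратичном лоссе (numpy; все имена и числа условные):

```python
import numpy as np

def fedprox_grad(w, w_global, grad_loss, mu):
    """Градиент FedProx-объектива: grad L_k(w) + mu * (w - w_global)."""
    return grad_loss + mu * (w - w_global)

w_global = np.zeros(3)
w_k_opt = np.array([4.0, -2.0, 1.0])   # локальный оптимум клиента k
w = w_global.copy()

# toy-лосс L_k(w) = 0.5 * ||w - w_k_opt||^2  =>  grad = w - w_k_opt
for _ in range(200):                    # локальные шаги градиентного спуска
    g = fedprox_grad(w, w_global, grad_loss=(w - w_k_opt), mu=1.0)
    w -= 0.1 * g

print(np.round(w, 2))  # ≈ [2, -1, 0.5] — на полпути между глобальной моделью и локальным оптимумом
```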

Q: Local vs Global updates — в чём разница?

A:

Local updates (client-side): - Multiple SGD steps before sending to server - More computation, less communication - Formula: \(w_k \leftarrow w_k - \eta \sum_{i} \nabla \ell(x_i, y_i; w_k)\) for \(E\) epochs

Communication-efficiency trade-off: - More local epochs \(E\) → less communication, but more drift - Typical: \(E \in [1, 5]\) for stability

Q: Что такое Differential Privacy в Federated Learning?

A:

DP-FedAvg: Add noise to updates before sending to server $\(\tilde{g}_k = g_k + \mathcal{N}(0, \sigma^2 C^2)\)$

где \(C\) — clipping norm, \(\sigma\) — noise scale

Privacy guarantee: \((\epsilon, \delta)\)-DP - Lower \(\epsilon\) → stronger privacy, more noise - Trade-off: privacy vs accuracy
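Сам шаг sanitization (clip + noise) выглядит примерно так (numpy; имя `dp_sanitize` условное):

```python
import numpy as np

def dp_sanitize(update, clip_norm=1.0, sigma=0.5, rng=None):
    """Обрезает update по L2-норме до C, затем добавляет шум N(0, (sigma*C)^2)."""
    rng = rng or np.random.default_rng(0)
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / max(norm, 1e-12))
    return clipped + rng.normal(0.0, sigma * clip_norm, size=update.shape)

update = np.array([3.0, 4.0])     # норма 5 -> будет обрезан до clip_norm=1
noisy = dp_sanitize(update)
print(noisy.shape)                # (2,)
```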

Killer

Q: Спроектируйте FL систему для предсказания клавиатуры на мобильных устройствах.

A:

Architecture:

[User Devices] → [Secure Aggregation] → [FL Server] → [Global Model]
      ↑                                          ↓
      ←——————— Model Distribution ←———————

Key decisions:

  1. Model: LSTM/Transformer, ~5-10M params (must fit on device)
  2. Participation: Sample 100-1000 users per round from millions
  3. Local training: 1-5 epochs on user's typing data
  4. Aggregation: Weighted by data size \(n_k\)
  5. Privacy: DP-FedAvg with \(\epsilon \approx 8\)

Python (simplified):

def fedavg_round(server_model, client_updates, client_sizes):
    total_size = sum(client_sizes)
    new_weights = {}
    for name, param in server_model.named_parameters():
        # взвешенное среднее по клиентам: вес = доля данных клиента
        new_weights[name] = sum(
            client_sizes[i] / total_size * client_updates[i][name]
            for i in range(len(client_updates))
        )
    return new_weights

Challenges: - Device heterogeneity (battery, compute) - Non-IID data (different users, different vocab) - Concept drift (new slang, languages)

Q: FedAvg vs FedProx vs SCAFFOLD — когда что использовать?

A:

| Algorithm | Best For | Key Innovation |
|---|---|---|
| FedAvg | IID-ish data, stable clients | Baseline, simple |
| FedProx | Heterogeneous data | Proximal term reduces drift |
| SCAFFOLD | Highly non-IID | Control variates correct drift |

SCAFFOLD insight: Client drift = \(\nabla L_k(w) - \nabla L(w)\) - Maintains control variates \(c_k\) to estimate drift - Updates: \(w_k \leftarrow w_k - \eta(g_k - c_k + c)\) - Achieves 45% faster convergence on non-IID data (2025 benchmarks)


20. TabPFN — Foundation Model for Tabular Data

Basic

Q: Что такое TabPFN?

A: Tabular Prior-data Fitted Network — foundation model для tabular data, использующий in-context learning вместо gradient descent.

Ключевые характеристики: - Pre-trained на synthetic tabular datasets - Zero-shot prediction (no training on your data) - Transformer-based architecture - Outperforms XGBoost/LightGBM on small datasets (<10K samples)

Q: В чём разница TabPFN vs традиционные ML модели?

A:

| Aspect | Traditional (XGBoost) | TabPFN |
|---|---|---|
| Training | Gradient descent on data | Pre-trained, no training |
| Data requirement | More data = better | Small data specialist |
| Inference | Fast tree traversal | Forward pass through transformer |
| Hyperparameters | Many (lr, depth, etc.) | Minimal (none for basic use) |
| Max samples | Unlimited | 50K (TabPFN-2.5) |

Medium

Q: Как работает TabPFN?

A:

Pre-training Phase: 1. Generate synthetic tabular datasets from priors 2. Train transformer to predict labels given (X_train, y_train, x_test) 3. Model learns general tabular patterns

Inference (In-Context Learning):

from tabpfn import TabPFNClassifier

classifier = TabPFNClassifier()
classifier.fit(X_train, y_train)  # No actual training!
predictions = classifier.predict(X_test)

Architecture: - Input: Training set + test sample as sequence - Encoder: Feature embedding + positional encoding - Decoder: Transformer predicts label probabilities

Q: Какие ограничения у TabPFN?

A:

| Limitation | TabPFN v2 | TabPFN-2.5 |
|---|---|---|
| Max samples | 10,000 | 50,000 |
| Max features | 100 | 2,000 |
| Max classes | 10 | ~100 |
| GPU required | Yes | Yes |

Practical limitations: - Slow on large datasets (O(n²) attention) - Categorical features need preprocessing - No native support for missing values - Regression needs separate model

Q: Когда использовать TabPFN vs XGBoost?

A:

Use TabPFN when: - Dataset < 50K samples - Limited time for hyperparameter tuning - Quick baseline needed - Data is clean (no missing values)

Use XGBoost/LightGBM when: - Large datasets (>50K) - Need feature importance - Complex preprocessing needed - Production deployment (no GPU)

Benchmarks (2025 Nature paper): - TabPFN outperforms on 57% of datasets <10K samples - Average accuracy gain: +2.7% vs best competitor

Killer

Q: Как интегрировать TabPFN в production pipeline?

A:

Hybrid approach:

def smart_classifier(X_train, y_train, X_test):
    n_samples = len(X_train)

    if n_samples < 5000:
        # TabPFN for small data
        model = TabPFNClassifier()
        model.fit(X_train, y_train)
        return model.predict(X_test)
    elif n_samples < 50000:
        # Compare TabPFN vs XGBoost
        tabpfn_score = cross_val_score(TabPFNClassifier(), X_train, y_train).mean()
        xgb_score = cross_val_score(XGBClassifier(), X_train, y_train).mean()

        if tabpfn_score > xgb_score:
            return TabPFNClassifier().fit(X_train, y_train).predict(X_test)
        else:
            return XGBClassifier().fit(X_train, y_train).predict(X_test)
    else:
        # Large data: traditional methods
        return XGBClassifier().fit(X_train, y_train).predict(X_test)

Production considerations: - GPU required for inference - Batch inference for throughput - Fallback to XGBoost on timeout - Model versioning (TabPFN updates)

Q: Что нового в TabPFN-2.5?

A:

Key improvements (Nov 2025): - 20x increase in data cells (50K samples × 2K features) - Better handling of high-cardinality categorical features - Improved regression support - Faster inference (optimized attention)

When to upgrade: - Datasets near v2 limits - Need for more classes - Large feature sets


21. Production ML Deployment Patterns

Источники: MatterAI Deployment Strategies (Jan 2026), ML Journey Shadow vs Canary (Sept 2025), Raghu's Deployment Patterns, FICO Champion/Challenger (Dec 2025)

Basic

Q: Какие основные паттерны deployment для ML моделей?

A:

| Pattern | Описание | Risk Level | Use Case |
|---|---|---|---|
| Blue-Green | Two identical environments, instant switch | Low | Critical systems, zero downtime |
| Canary | Gradual traffic shift (1%→100%) | Medium | Risk mitigation with real users |
| Shadow | Parallel run, no user impact | None | Model validation, load testing |
| A/B Testing | Deterministic routing by user | Medium | Statistical comparison |
| Champion-Challenger | Continuous model competition | Low | Continuous improvement |

Q: What is Blue-Green deployment?

A: Maintain two identical production environments:

- Blue: current production version
- Green: new version being deployed

Process:

  1. Deploy new model to Green
  2. Run validation tests
  3. Switch traffic via load balancer: Blue (0%) → Green (100%)
  4. Blue becomes standby for instant rollback

Infrastructure (Kubernetes + Istio):

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: inference-service
spec:
  hosts:
    - inference-service
  http:
    - route:
        - destination:
            host: inference-service
            subset: blue
          weight: 0
        - destination:
            host: inference-service
            subset: green
          weight: 100

Medium

Q: How does Canary deployment work?

A: Gradual rollout with progressive traffic shifting:

Traffic Ramp:

1% → 5% → 10% → 25% → 50% → 100%

Kubernetes Argo Rollouts:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: model-inference
spec:
  replicas: 10
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: {duration: 10m}
        - setWeight: 20
        - pause: {duration: 10m}
        - setWeight: 50
        - pause: {duration: 10m}
      analysis:
        templates:
          - templateName: success-rate

Automated Gates (trigger rollback if breached):

- P95 latency < 200ms
- Error rate < 0.1%
- Prediction distribution drift (KL divergence < 0.1)
- Business metrics (conversion rate stable)
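A minimal sketch of the prediction-drift gate: bin the prediction scores, compute KL divergence between canary and baseline histograms, and compare against the 0.1 threshold. Function names, the [0, 1] score range, and the binning scheme are illustrative assumptions.

```python
import numpy as np

def kl_divergence(p_counts, q_counts, eps=1e-10):
    """KL(P || Q) between two histograms of prediction scores."""
    p = p_counts / p_counts.sum() + eps
    q = q_counts / q_counts.sum() + eps
    return float(np.sum(p * np.log(p / q)))

def canary_gate(baseline_scores, canary_scores, bins=10, threshold=0.1):
    """True if the canary's prediction distribution stays within the gate."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    p, _ = np.histogram(canary_scores, bins=edges)
    q, _ = np.histogram(baseline_scores, bins=edges)
    return kl_divergence(p.astype(float), q.astype(float)) < threshold
```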

Q: What is Shadow deployment and when should it be used?

A: The shadow model receives the same input data as production, but its predictions do NOT affect users.

Architecture:

[Request] → [Production Model] → [User Response]
         ↘ [Shadow Model] → [Log for Analysis]

Implementation:

from datetime import datetime

class ShadowDeployment:
    def __init__(self, production_model, shadow_model):
        self.prod = production_model
        self.shadow = shadow_model
        self.logger = PredictionLogger()

    async def predict(self, features):
        # Production prediction (returned to user)
        prod_pred = await self.prod.predict(features)

        # Shadow prediction (logged, not returned)
        shadow_pred = await self.shadow.predict(features)
        self.logger.log(
            features=features,
            prod_prediction=prod_pred,
            shadow_prediction=shadow_pred,
            timestamp=datetime.now()
        )

        return prod_pred  # Only production prediction

Use Cases:

- Validate new model on real traffic (risk-free)
- Compare prediction distributions
- Load testing new infrastructure
- Data drift detection
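Logged shadow predictions are typically compared offline. A minimal sketch of that analysis follows; the 95% agreement threshold and all names here are illustrative assumptions, not a standard API.

```python
import numpy as np

def analyze_shadow_logs(prod_preds, shadow_preds, agreement_threshold=0.95):
    """Compare logged production vs shadow predictions offline."""
    prod = np.asarray(prod_preds)
    shadow = np.asarray(shadow_preds)
    agreement = float(np.mean(prod == shadow))
    return {
        "agreement_rate": agreement,
        "n_disagreements": int(np.sum(prod != shadow)),
        "ready_for_canary": agreement >= agreement_threshold,
    }

report = analyze_shadow_logs([1, 0, 1, 1], [1, 0, 1, 0])
```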

Q: How does A/B Testing differ from Canary?

A:

| Aspect | Canary | A/B Testing |
|--------|--------|-------------|
| Traffic split | Random percentage | Deterministic (user ID hash) |
| Purpose | Risk mitigation | Statistical comparison |
| User consistency | May see different models | Same user sees same model |
| Duration | Until full rollout | Fixed experiment period |
| Analysis | Operational metrics | Business metrics + significance |

A/B User Segmentation:

import hashlib

def get_model_variant(user_id, variants=['v1', 'v2']):
    hash_value = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    index = hash_value % len(variants)
    return variants[index]

# Consistent routing: same user always sees same model
variant = get_model_variant("user_12345")

Killer

Q: Design a Champion-Challenger pipeline for a recommendation system.

A:

Architecture:

graph TD
    subgraph REG["Model Registry"]
        CHAMP["Champion v2.3<br/>87.2%"]
        CH1["Challenger 1<br/>v2.4-alpha, 86.8%"]
        CH2["Challenger 2<br/>v2.4-beta, 87.5%"]
    end

    CHAMP -->|"90%"| ROUTER["Traffic Router"]
    CH1 -->|"5% shadow"| ROUTER
    CH2 -->|"5%"| ROUTER

    ROUTER --> METRICS["Metrics Collector<br/>CTR, Conversion, Revenue, Latency"]
    METRICS --> DECISION{"Promotion Decision<br/>challenger > champion by 2%+<br/>for 7 days?"}
    DECISION -->|"Yes"| PROMOTE["Promote to Champion"]
    DECISION -->|"No"| KEEP["Keep current Champion"]

    style CHAMP fill:#e8f5e9,stroke:#4caf50
    style CH1 fill:#e8eaf6,stroke:#3f51b5
    style CH2 fill:#e8eaf6,stroke:#3f51b5
    style ROUTER fill:#fff3e0,stroke:#ef6c00
    style METRICS fill:#f3e5f5,stroke:#9c27b0
    style PROMOTE fill:#e8f5e9,stroke:#4caf50
    style KEEP fill:#fce4ec,stroke:#c62828

Implementation:

class ChampionChallengerPipeline:
    def __init__(self, registry, traffic_router, metrics):
        self.registry = registry
        self.router = traffic_router
        self.metrics = metrics
        self.promotion_threshold = 0.02  # 2% improvement
        self.min_observation_days = 7

    def get_model(self, user_id, context):
        champion = self.registry.get_champion()
        challengers = self.registry.get_challengers()

        # Route traffic
        assignment = self.router.assign(user_id)

        if assignment == 'champion':
            return champion
        else:
            # Shadow: return champion prediction but log challenger
            challenger = challengers[assignment]
            return self.shadow_predict(champion, challenger, context)

    async def evaluate_promotion(self):
        champion = self.registry.get_champion()
        challengers = self.registry.get_challengers()

        for challenger in challengers:
            if challenger.observation_days < self.min_observation_days:
                continue

            # Statistical significance test
            improvement = self.metrics.compare(
                challenger, champion, metric='conversion_rate'
            )

            if (improvement > self.promotion_threshold and
                self.metrics.is_significant(challenger, champion)):
                await self.promote(challenger)

    async def promote(self, new_champion):
        old_champion = self.registry.get_champion()
        self.registry.demote(old_champion)
        self.registry.promote(new_champion)
        self.router.update_weights(champion=1.0)

Promotion Criteria:

- Metric improvement > threshold (e.g., 2%)
- Statistical significance (p < 0.05)
- Minimum observation period
- No degradation on critical slices
- Stakeholder approval (for major changes)
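The `is_significant` check in the pipeline above is left abstract; for conversion rates a common concrete choice is a two-proportion z-test. A stdlib-only sketch (function name and the example counts are illustrative):

```python
import math

def conversion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test for champion (A) vs challenger (B).

    conv_* are conversion counts, n_* are user counts.
    Returns (z, two-sided p-value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# 1.0% vs 1.3% conversion on 100k users each
z, p = conversion_z_test(1000, 100_000, 1300, 100_000)
```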

Q: When should each deployment pattern be used?

A:

Decision Tree:

Is zero downtime required?
├── Yes → Blue-Green (critical systems: payments, auth)
└── No → Is risk tolerance low?
          ├── Yes → Shadow → Canary → Full
          └── No → Canary (fast iteration)

Need statistical comparison?
└── Yes → A/B Testing with significance analysis

Continuous improvement culture?
└── Yes → Champion-Challenger with automation

Pattern Combinations (Best Practice):

  1. Shadow + Canary: Shadow for 2 weeks → Canary 1%→100%
  2. Champion-Challenger + Shadow: Multiple challengers in shadow mode
  3. A/B + Canary: A/B test on canary traffic only

Cost Comparison:

| Pattern | Infra Cost | Rollback Speed | Real-User Validation |
|---------|------------|----------------|----------------------|
| Blue-Green | High (2x) | Instant | No |
| Canary | Medium (1.2x) | Fast | Yes |
| Shadow | Medium (1.5x) | N/A | No |
| A/B Testing | Medium | Fast | Yes |
| Champion-Challenger | Medium | Fast | Yes |

Q: How do you implement automated rollback for ML deployments?

A:

class AutomatedRollback:
    def __init__(self, monitoring, thresholds=None):
        # Default gates; callers can override per deployment
        self.thresholds = thresholds or {
            'p95_latency_ms': 200,
            'error_rate': 0.001,
            'prediction_drift_kl': 0.1,
            'conversion_rate_drop': 0.05,
        }
        self.monitoring = monitoring

    async def check_and_rollback(self, deployment):
        metrics = await self.monitoring.get_metrics(deployment)

        for metric, threshold in self.thresholds.items():
            current = metrics.get(metric, 0)

            if self._breaches_threshold(metric, current, threshold):
                await self.rollback(deployment)
                await self.alert(
                    f"Rollback triggered: {metric}={current}, threshold={threshold}"
                )
                return True

        return False

    def _breaches_threshold(self, metric, current, threshold):
        # All tracked metrics (latency, error rate, drift, metric drop)
        # are "higher is worse", so one comparison covers every case
        return current > threshold

    async def rollback(self, deployment):
        # Switch back to previous stable version
        await deployment.switch_to_previous()
        await deployment.scale_down_canary()

Rollback Triggers:

  1. Latency spike > 2x baseline
  2. Error rate > 0.1%
  3. Prediction distribution shift (PSI > 0.2)
  4. Business metric drop > 5%
  5. Manual trigger from on-call


22. Data Drift Detection

Sources: AllDays Tech Model Drift 2026, Label Your Data Drift Detection, Towards Data Science Drift (Jan 2026)

Basic

Q: What is data drift and why is it a problem?

A: Data drift is a change in the distribution of input data over time:

\[P_{t_0}(X) \neq P_t(X), \quad t > t_0\]

Типы Drift:

| Type | Definition | Example |
|------|------------|---------|
| Data Drift | Input distribution changes | New user demographics, seasonality |
| Concept Drift | P(y\|X) changes | Fraud patterns evolve, buying behavior shifts |
| Label Drift | P(y) changes | Class imbalance shifts, policy changes |

Formal decomposition: \(P(X, y) = P(X) \times P(y|X)\)

Q: Why is drift inevitable in production?

A:

  1. Real-world change: Seasonality, macro events, adversaries adapt
  2. Product change: New features, UI changes, pricing changes
  3. Pipeline change: Schema changes, logging changes, feature computation bugs

Medium

Q: What methods exist for detecting drift?

A:

| Method | Use Case | Formula/Approach |
|--------|----------|------------------|
| KS Test | Continuous features | \(D = \max_x \lvert F_1(x) - F_2(x) \rvert\) |
| Chi-Square | Categorical features | \(\chi^2 = \sum \frac{(O-E)^2}{E}\) |
| PSI | Score/bin distribution | \(\sum (Actual\% - Expected\%) \times \ln\frac{Actual\%}{Expected\%}\) |
| Wasserstein | Continuous, sensitive | Earth Mover's Distance |

PSI Thresholds:

- PSI < 0.1: No significant drift
- 0.1 ≤ PSI < 0.25: Moderate drift, monitor
- PSI ≥ 0.25: Significant drift, investigate
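For continuous features, the KS test from the table above is available directly as `scipy.stats.ks_2samp`. A sketch with synthetic data (the 0.05 significance level is the usual convention; the mean shift of 0.3 is an arbitrary example):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
reference = rng.normal(0, 1, 5000)     # feature values at training time
production = rng.normal(0.3, 1, 5000)  # shifted production values

stat, p_value = ks_2samp(reference, production)
drift_detected = p_value < 0.05
```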

Q: How do you implement PSI (Population Stability Index)?

A:

import numpy as np

def compute_psi(expected, actual, buckets=10):
    """Compute Population Stability Index."""
    breakpoints = np.percentile(expected, np.linspace(0, 100, buckets + 1))
    expected_counts, _ = np.histogram(expected, bins=breakpoints)
    actual_counts, _ = np.histogram(actual, bins=breakpoints)

    expected_pct = expected_counts / len(expected) + 1e-10
    actual_pct = actual_counts / len(actual) + 1e-10

    psi = np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct))
    return psi

Q: What is Adversarial Validation?

A: A method for measuring how different the train and production distributions are:

  1. Label train data as 0, production data as 1
  2. Train classifier to distinguish them
  3. If AUC ≈ 0.5 → distributions similar (good)
  4. If AUC > 0.7 → significant drift (problem)
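The recipe above can be sketched as follows; RandomForest and 3-fold CV are arbitrary choices, not a prescribed setup:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def adversarial_validation_auc(X_train, X_prod):
    """AUC of a classifier trained to distinguish train rows from production rows.
    AUC ~ 0.5 means the distributions look alike; AUC >> 0.5 means drift."""
    X = np.vstack([X_train, X_prod])
    y = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_prod))])
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    return cross_val_score(clf, X, y, cv=3, scoring='roc_auc').mean()
```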

Killer

Q: When is retraining necessary vs when is monitoring enough?

A:

| Drift Type | Performance Impact | Action |
|------------|--------------------|--------|
| Data drift only | No degradation | Monitor, no action |
| Data drift + perf drop | Model degrading | Investigate root cause |
| Concept drift | Always impacts | Retrain with recent data |
| Pipeline bug | Varies | Fix pipeline first |

Retrain Triggers:

- Business metric drop > 5%
- Model accuracy drops below threshold
- Multiple features showing drift simultaneously
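A hypothetical policy combining these triggers; every name and threshold here is illustrative, not a standard API:

```python
def should_retrain(feature_psi, accuracy_drop, business_metric_drop,
                   psi_threshold=0.25, min_drifted_features=3):
    """Decide whether to retrain based on the triggers listed above.

    feature_psi: dict of feature name -> PSI; drops are fractions (0.05 = 5%).
    Returns (decision, reasons)."""
    reasons = []
    if business_metric_drop > 0.05:
        reasons.append("business metric drop > 5%")
    if accuracy_drop > 0.03:
        reasons.append("model accuracy below threshold")
    n_drifted = sum(psi >= psi_threshold for psi in feature_psi.values())
    if n_drifted >= min_drifted_features:
        reasons.append(f"{n_drifted} features drifting simultaneously")
    return bool(reasons), reasons
```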


23. Hyperparameter Interactions & Learning Curves

Comprehensive guide to hyperparameter tuning strategies and training diagnostics.

Hyperparameter Tuning Strategies Comparison

| Aspect | Grid Search | Random Search | Bayesian Optimization |
|--------|-------------|---------------|-----------------------|
| Strategy | Exhaustive, all combinations | Random sampling | Probabilistic modeling |
| Efficiency | Exponential growth | Efficient for large spaces | Very efficient, fewer evaluations |
| Implementation | Easy (sklearn GridSearchCV) | Easy (sklearn RandomizedSearchCV) | Complex (Optuna, Hyperopt) |
| Best For | Small spaces (<10 params) | High-dimensional spaces | Expensive evaluations |
| Scalability | Limited | Good | Excellent |
| Exploration | Thorough but wasteful | Broad coverage | Smart exploration/exploitation |

Grid Search Details

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10]
}

# Total combinations: 3 * 4 * 3 = 36
grid_search = GridSearchCV(
    RandomForestClassifier(),
    param_grid,
    cv=5,
    scoring='f1',
    n_jobs=-1
)
grid_search.fit(X_train, y_train)

Pros: Comprehensive, simple, reproducible
Cons: \(O(n^k)\) complexity, wastes resources on unimportant dimensions

Random Search Details

from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from scipy.stats import loguniform, randint

param_distributions = {
    'n_estimators': randint(50, 500),
    'max_depth': randint(3, 20),
    'learning_rate': loguniform(1e-4, 1e-1),
    'min_samples_split': randint(2, 20)
}

random_search = RandomizedSearchCV(
    GradientBoostingClassifier(),
    param_distributions,
    n_iter=50,  # Number of random samples
    cv=5,
    scoring='f1',
    n_jobs=-1
)
random_search.fit(X_train, y_train)

Key Insight (Bergstra & Bengio 2012): Random search often finds better configs in fewer trials because:

- Not all hyperparameters are equally important
- Random sampling covers more distinct values per dimension

Bayesian Optimization

import optuna
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 500),
        'max_depth': trial.suggest_int('max_depth', 3, 20),
        'learning_rate': trial.suggest_float('learning_rate', 1e-4, 1e-1, log=True),
        'min_samples_split': trial.suggest_int('min_samples_split', 2, 20)
    }

    model = GradientBoostingClassifier(**params)
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='f1')
    return scores.mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)

How it works:

  1. Builds a probabilistic model (Gaussian Process) of the objective function
  2. Uses an acquisition function (EI, UCB) to select the next hyperparameters
  3. Balances exploration (new regions) vs exploitation (known good regions)


Learning Curves Interpretation

Learning curves plot training/validation error vs training set size or epochs.

Well-Fitted Model

Error
  │  Train ----___
  │              ---___
  │  Val   -------___
  │                  ---
  └───────────────────────── Size/Epochs
- Small gap between train and validation
- Both curves converge to low error
- Action: Model is ready

Overfitting Model

Error
  │  Train ----------
  │                   (approaches zero)
  │  Val   ----___
  │              ___/‾‾‾  (increases!)
  └───────────────────────── Size/Epochs
- Training error very low, validation error high
- Large gap between curves
- Actions: More data, regularization, simpler model, early stopping

Underfitting Model

Error
  │  Train ------ (high)
  │  Val   ------- (high, similar)
  └───────────────────────── Size/Epochs
- Both errors high and plateau
- Small gap but poor performance
- Actions: More complex model, more features, less regularization

Learning Curve Analysis Code

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

def plot_learning_curve(estimator, X, y, cv=5):
    train_sizes, train_scores, val_scores = learning_curve(
        estimator, X, y,
        train_sizes=np.linspace(0.1, 1.0, 10),
        cv=cv,
        scoring='neg_mean_squared_error',
        n_jobs=-1
    )

    train_mean = -np.mean(train_scores, axis=1)
    val_mean = -np.mean(val_scores, axis=1)
    train_std = np.std(train_scores, axis=1)
    val_std = np.std(val_scores, axis=1)

    plt.figure(figsize=(10, 6))
    plt.plot(train_sizes, train_mean, 'o-', label='Training error')
    plt.plot(train_sizes, val_mean, 'o-', label='Validation error')
    plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.1)
    plt.fill_between(train_sizes, val_mean - val_std, val_mean + val_std, alpha=0.1)
    plt.xlabel('Training set size')
    plt.ylabel('MSE')
    plt.legend()
    plt.grid(True)
    plt.show()

Early Stopping Strategies

Early stopping prevents overfitting by stopping training when validation performance stops improving.

Basic Early Stopping


class EarlyStopping:
    def __init__(self, patience=5, min_delta=0.001):
        self.patience = patience
        self.min_delta = min_delta
        self.counter = 0
        self.best_score = None
        self.should_stop = False

    def __call__(self, val_score):
        if self.best_score is None:
            self.best_score = val_score
        elif val_score < self.best_score + self.min_delta:
            self.counter += 1
            if self.counter >= self.patience:
                self.should_stop = True
        else:
            self.best_score = val_score
            self.counter = 0
        return self.should_stop

# Usage in training loop
early_stopping = EarlyStopping(patience=10, min_delta=0.001)
for epoch in range(max_epochs):
    train_loss = train_one_epoch(model, train_loader)
    val_loss = validate(model, val_loader)

    if early_stopping(-val_loss):  # Negative because we want to maximize
        print(f"Early stopping at epoch {epoch}")
        break

Early Stopping in Gradient Boosting

from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(
    n_estimators=1000,
    learning_rate=0.01,
    validation_fraction=0.1,
    n_iter_no_change=10,  # Early stopping patience
    tol=1e-4  # Minimum improvement
)
model.fit(X_train, y_train)
print(f"Actual n_estimators used: {model.n_estimators_}")

PyTorch Early Stopping with Checkpointing

import torch

def train_with_early_stopping(model, train_loader, val_loader, epochs, patience=5):
    optimizer = torch.optim.Adam(model.parameters())
    criterion = torch.nn.CrossEntropyLoss()

    best_val_loss = float('inf')
    patience_counter = 0
    best_model_state = None

    for epoch in range(epochs):
        # Training
        model.train()
        for X, y in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(X), y)
            loss.backward()
            optimizer.step()

        # Validation
        model.eval()
        val_loss = 0
        with torch.no_grad():
            for X, y in val_loader:
                val_loss += criterion(model(X), y).item()

        val_loss /= len(val_loader)

        # Early stopping logic
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            patience_counter = 0
            # Deep-copy tensors: dict.copy() is shallow and would track later updates
            best_model_state = {k: v.detach().clone() for k, v in model.state_dict().items()}
        else:
            patience_counter += 1
            if patience_counter >= patience:
                print(f"Early stopping at epoch {epoch}")
                model.load_state_dict(best_model_state)
                break

    return model

Decision Framework: Which Tuning Strategy to Use

| Scenario | Recommended Strategy | Why |
|----------|----------------------|-----|
| <5 hyperparameters | Grid Search | Small space, comprehensive |
| 5-20 hyperparameters | Random Search | Efficient exploration |
| >20 hyperparameters | Bayesian (Optuna) | Smart search |
| Expensive training (>1hr) | Bayesian + Early Stopping | Minimize evaluations |
| Limited compute budget | Random (n=50) + Early Stopping | Good coverage, low cost |
| Production deployment | Bayesian + Cross-validation | Robust, reproducible |

Best Practices

  1. Start coarse, then fine: Wide search first, narrow later
  2. Use domain knowledge: Set sensible ranges based on experience
  3. Monitor learning curves: Diagnose over/underfitting early
  4. Apply early stopping: Save compute, prevent overfitting
  5. Document experiments: Track all configurations and results
  6. Cross-validation: Use k-fold CV for reliable estimates
  7. Parallelize: Use n_jobs=-1 or distributed tuning (Ray Tune)

Sources: AICompetence Grid vs Random vs Bayesian (May 2025), GeeksforGeeks Learning Curves (Jul 2025), Bergstra & Bengio (2012), Snoek et al. (2012)

Basic

Q: Why does Random Search often work better than Grid Search?

A: Bergstra & Bengio (2012) showed:

  1. Not all parameters matter: Usually only ~1-2 parameters are important, the rest have little effect
  2. Grid wastes resources: It enumerates every combination of the unimportant parameters
  3. Random covers more: With the same budget it explores more values of the important parameters
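A toy illustration of the argument: with a budget of 9 trials over one important and one unimportant parameter, a 3×3 grid only ever tries 3 distinct values on the important axis, while random search tries 9. The parameter ranges are arbitrary.

```python
import itertools
import random

random.seed(0)
budget = 9

# Grid search: 3 x 3 combinations over (important, unimportant)
grid = list(itertools.product([0.1, 0.5, 0.9], [1, 2, 3]))
grid_values = {important for important, _ in grid}

# Random search: every trial draws a fresh value of the important parameter
random_values = {random.uniform(0, 1) for _ in range(budget)}
```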

Q: What does a Learning Curve show?

A: A plot of error against training set size or epochs:

- X-axis: Training set size or epochs
- Y-axis: Error (MSE, loss) or accuracy
- Two lines: Training error and Validation error

Medium

Q: How do you diagnose overfitting from a Learning Curve?

A:

| Symptom | Training Error | Validation Error | Gap |
|---------|----------------|------------------|-----|
| Overfitting | Very low | High, increasing | Large |
| Underfitting | High | High | Small |
| Good fit | Low | Low | Small |

Actions for overfitting: More data, regularization, early stopping, simpler model

Q: How does Early Stopping work?

A: Training stops when the validation loss stops improving:

if val_loss < best_val_loss - min_delta:
    best_val_loss = val_loss
    counter = 0
else:
    counter += 1
    if counter >= patience:
        stop_training()

Parameters: patience (how many epochs to wait), min_delta (minimum improvement)

Killer

Q: How do you choose a tuning strategy for a production system?

A:

  1. Budget assessment: How much time/compute is available?
  2. Model complexity: Deep learning → Bayesian, Classical ML → Random/Grid
  3. Iteration cost: Expensive training → Bayesian + early stopping
  4. Risk tolerance: Production → k-fold CV + multiple runs

Recommended pipeline:

# Stage 1: Coarse random search
random_search = RandomizedSearchCV(..., n_iter=50, cv=3)

# Stage 2: Fine Bayesian around best region
study = optuna.create_study()
study.optimize(objective, n_trials=100)

# Stage 3: Final validation with full CV
final_cv = cross_val_score(best_model, X, y, cv=10)

Q: How do you avoid overfitting the validation set during tuning?

A:

  1. Nested CV: Inner loop for tuning, outer loop for evaluation
  2. Hold-out test set: Never use it for tuning at all
  3. Limit trials: Don't sweep thousands of combinations
  4. Early stopping: Don't keep tuning until the model merely fits the validation set

# Nested CV prevents overfitting to validation
from sklearn.model_selection import cross_val_score, GridSearchCV, KFold

inner_cv = KFold(n_splits=3)
outer_cv = KFold(n_splits=5)

clf = GridSearchCV(estimator=model, param_grid=param_grid, cv=inner_cv)
nested_score = cross_val_score(clf, X, y, cv=outer_cv)

Updated: 2026-02-12, Ralph iteration 106: added Cross-Validation Edge Cases (Section 24)


24. Cross-Validation Edge Cases

Advanced cross-validation techniques for robust model evaluation.

Nested Cross-Validation

Problem: When tuning hyperparameters, standard CV causes optimism bias — we use validation data both to select hyperparameters AND to report performance.

Solution: Nested CV separates model selection from evaluation:

- Outer loop (evaluation): Honest test of tuned model generalization
- Inner loop (selection): Hyperparameter tuning on training data only

from sklearn.model_selection import GridSearchCV, cross_val_score, KFold
from sklearn.ensemble import RandomForestClassifier

# Inner loop: hyperparameter search
inner_cv = KFold(n_splits=3, shuffle=True, random_state=1)
param_grid = {'n_estimators': [50, 100], 'max_depth': [3, 5, 10]}
grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=inner_cv)

# Outer loop: evaluation
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)
nested_scores = cross_val_score(grid, X, y, cv=outer_cv)

print(f"Nested CV score: {nested_scores.mean():.3f} (+/- {nested_scores.std():.3f})")

Nested vs Standard CV Comparison

| Aspect | Standard CV (GridSearchCV) | Nested CV |
|--------|----------------------------|-----------|
| Purpose | Tune hyperparameters | Evaluate tuning pipeline |
| Data leakage | Possible (optimistic bias) | Prevented |
| Computation | \(k \times n_{params}\) | \(k_{outer} \times k_{inner} \times n_{params}\) |
| When to use | Final model selection | Model comparison, publication |

Time Series Cross-Validation

Problem: Standard K-fold CV breaks temporal structure — training on future data to predict past = data leakage.

Walk-Forward Validation (Expanding Window)

from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    print(f"Fold {fold}: Train={len(train_idx)}, Test={len(test_idx)}")

Sliding Window (Fixed Size)

import numpy as np

class SlidingWindowCV:
    def __init__(self, window_size, step=1):
        self.window_size = window_size
        self.step = step

    def split(self, X):
        n = len(X)
        for i in range(self.window_size, n, self.step):
            train_idx = np.arange(i - self.window_size, i)
            test_idx = np.arange(i, min(i + self.step, n))
            yield train_idx, test_idx

Blocked Time Series CV (with Embargo)

import numpy as np

class BlockedTimeSeriesCV:
    def __init__(self, n_splits=5, embargo=0):
        self.n_splits = n_splits
        self.embargo = embargo  # Gap between train and test

    def split(self, X):
        n = len(X)
        k = n // (self.n_splits + 1)
        for i in range(self.n_splits):
            test_start = i * k + k
            test_end = test_start + k
            train_end = test_start - self.embargo
            yield np.arange(0, train_end), np.arange(test_start, test_end)

Time Series CV Comparison

| Method | Window | Memory | Best For |
|--------|--------|--------|----------|
| Expanding | Grows | All history | Stable systems |
| Sliding | Fixed | Recent only | Concept drift |
| Blocked | Fixed + gap | No leakage | Financial data |

Bootstrap .632 Estimator

Formula: \(\hat{Err}^{.632} = 0.368 \times \overline{err} + 0.632 \times \hat{Err}^{(1)}\)

Where \(\overline{err}\) = training error, \(\hat{Err}^{(1)}\) = OOB error

import numpy as np
from sklearn.utils import resample

def bootstrap_632_score(model, X, y, n_bootstraps=100):
    n = len(y)
    oob_errors, train_errors = [], []

    for _ in range(n_bootstraps):
        indices = resample(range(n), n_samples=n, replace=True)
        oob_mask = ~np.isin(range(n), indices)
        if oob_mask.sum() == 0: continue

        model.fit(X[indices], y[indices])
        train_errors.append(np.mean(model.predict(X[indices]) != y[indices]))
        oob_errors.append(np.mean(model.predict(X[oob_mask]) != y[oob_mask]))

    return 0.368 * np.mean(train_errors) + 0.632 * np.mean(oob_errors)

Why 0.632? Bootstrap sample includes ~63.2% of data (1 - 1/e).
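The 63.2% figure, the expected fraction of distinct rows in a bootstrap sample, \(1 - 1/e\), is easy to confirm with a quick stdlib-only simulation:

```python
import random

random.seed(42)
n, trials = 1000, 200
fractions = []
for _ in range(trials):
    sample = [random.randrange(n) for _ in range(n)]  # bootstrap indices
    fractions.append(len(set(sample)) / n)            # fraction of unique rows

mean_unique = sum(fractions) / trials  # close to 1 - 1/e ~= 0.632
```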


Repeated K-Fold CV

from sklearn.model_selection import RepeatedKFold

cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=42)
scores = cross_val_score(model, X, y, cv=cv)  # 50 evaluations

Decision Framework: Which CV to Use

| Data Type | CV Method | Why |
|-----------|-----------|-----|
| Standard (i.i.d.) | K-fold (k=5 or 10) | Good bias-variance trade-off |
| Small dataset | LOOCV or .632 Bootstrap | Maximize training data |
| Imbalanced | Stratified K-fold | Preserve class ratios |
| Time series | Walk-forward / Blocked | Respect temporal order |
| Grouped (clusters) | GroupKFold | Keep groups together |
| Hyperparameter tuning | Nested CV | Prevent optimism bias |
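For the grouped-data case, `GroupKFold` guarantees that no group is split across train and test. A minimal check with toy data (the group values stand in for e.g. patient or user IDs):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(12).reshape(6, 2)
y = np.array([0, 1, 0, 1, 0, 1])
groups = np.array([1, 1, 2, 2, 3, 3])  # e.g. patient or user IDs

gkf = GroupKFold(n_splits=3)
splits = list(gkf.split(X, y, groups))
```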

Sources: Medium Nested CV (May 2025), MLMastery Time Series CV (Jan 2026)

Basic

Q: Why is Nested Cross-Validation needed?

A: Plain GridSearchCV gives an optimistic bias: the validation data is used twice (both to select hyperparameters AND to estimate quality).

Nested CV: Outer loop = honest evaluation, Inner loop = hyperparameter tuning.

Q: Why can't standard K-fold be used for Time Series?

A: Random shuffling breaks the temporal order: training on the future to test on the past = data leakage.

Medium

Q: What is the difference between Expanding and Sliding Window?

A: Expanding grows (full history), Sliding is fixed (recent data only). Expanding suits stable systems, Sliding suits concept drift.

Q: What is Bootstrap .632?

A: A combination of training error and OOB error: \(0.368 \times \text{train\_err} + 0.632 \times \text{OOB\_err}\). Reduces bias on small datasets.

Killer

Q: How do you correctly organize a CV pipeline with preprocessing?

A: CRITICAL: the Pipeline goes inside CV, not outside:

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# WRONG - data leakage: scaler is fit on validation folds too
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)
cross_val_score(model, X_scaled, y, cv=5)

# CORRECT - scaler is re-fit inside each training fold
pipeline = Pipeline([('scaler', StandardScaler()), ('model', LogisticRegression())])
cross_val_score(pipeline, X, y, cv=5)


Common Misconceptions

Misconception: Overfitting can only be detected from validation metrics

Learning curves (train vs val) are a necessary but not sufficient tool. According to Kaggle 2025 data, 34% of "overfitting" cases are actually caused by data leakage in the preprocessing pipeline (e.g., fitting the scaler on all data before the split). Always check the entire pipeline via nested cross-validation.

Misconception: Feature selection always improves model quality

In practice, aggressive feature selection can remove features with a weak-but-useful signal. A study (Boulesteix et al., 2024) showed that on datasets with >50 features, L1 regularization (Lasso) averages 2-4% worse AUC than Ridge (L2) with all features when the features are highly correlated. Use Elastic Net as a compromise.

Misconception: Gradient Boosting always beats Random Forest

According to the AutoML Benchmark 2025 meta-analysis on 104 tabular datasets, Random Forest beats GBDT in 38% of cases, especially on noisy data (SNR < 2), small samples (<500 rows), and class imbalance of 1:50+. RF is also far more robust to hyperparameters: default RF trails tuned RF by 1-2%, while default XGBoost trails tuned XGBoost by 5-8%.


See Also