
Classical ML: Interview Q&A

~58 minute read

Typical interview questions for 2025-2026. Format: Q: question / A: detailed answer. Updated: 2026-02-11

Prerequisites: Materials


Contents

  1. K-Nearest Neighbors
  2. Logistic Regression
  3. K-Means
  4. Naive Bayes
  5. Decision Trees
  6. SVM
  7. Gradient Boosting
  8. Feature Engineering
  9. Feature Selection
  10. Hard Questions (Senior+)
  11. Model Interpretability (SHAP & LIME) — NEW 2026
  12. Reinforcement Learning Basics
  13. NLP: Word Embeddings (Word2Vec, GloVe)
  14. NLP: Named Entity Recognition (NER) & Sequence Labeling
  15. Hyperparameter Optimization
  16. Active Learning
  17. Time Series: Deep Learning Methods
  18. Explainable AI (XAI): SHAP & LIME
  19. Neural Architecture Search (NAS)
  20. Cost-Sensitive Learning
  21. Missing Data Handling
  22. Model Debugging
  23. AutoML Theory
  24. Federated Learning
  25. TabPFN — Foundation Model for Tabular Data
  26. Production ML Deployment Patterns
  27. Data Drift Detection
  28. Hyperparameter Interactions & Learning Curves
  29. Cross-Validation Edge Cases

K-Nearest Neighbors

Q: Why does KNN perform poorly in high dimensions?

A: Curse of dimensionality.

Reason: in high-dimensional space:

  1. All points are "far" from each other
  2. The volume of the unit ball \(\to 0\) exponentially
  3. Distances become indistinguishable

\[\frac{\text{max distance} - \text{min distance}}{\text{min distance}} \to 0 \text{ as } d \to \infty\]

Solution: dimensionality reduction (PCA) or a different algorithm.

Q: How do you choose k in KNN?

A:

  • k=1: overfitting, sensitive to noise
  • k=n: underfitting, always predicts the majority class
  • Rule of thumb: \(k \approx \sqrt{n}\) (but use CV)

Practical: cross-validation to find the optimal k. Use an odd k for binary classification (to avoid ties).
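The CV-based selection above can be sketched in a few lines (a minimal sketch on a synthetic dataset; the grid of odd k values is illustrative):

```python
# Sketch: pick k by cross-validation, restricting to odd k to avoid ties
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=42)

k_values = list(range(1, 32, 2))  # odd k only
scores = [
    cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    for k in k_values
]
best_k = k_values[int(np.argmax(scores))]
```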

Q: Which distance metric should you choose?

A:

| Metric | Formula | When to use |
|---|---|---|
| Euclidean | \(\sqrt{\sum_i (x_i-y_i)^2}\) | Default; continuous, scaled features |
| Manhattan | \(\sum_i \lvert x_i-y_i \rvert\) | High-dim (more robust to the curse of dimensionality), sparse data |
| Cosine | \(1 - \frac{x \cdot y}{\|x\|\|y\|}\) | Text embeddings, TF-IDF; when the angle matters, not the magnitude |
| Mahalanobis | \(\sqrt{(x-y)^T S^{-1} (x-y)}\) | Correlated features; accounts for covariance |

Gotcha: Euclidean and Manhattan require feature scaling. Cosine does not (it is scale-invariant).

Q: Weighted KNN -- why and how?

A:

Problem: standard KNN gives all k neighbors equal weight -- a distant neighbor counts as much as the closest one.

Solution: inverse-distance weighting: \[\hat{y} = \frac{\sum_{i \in N_k} w_i y_i}{\sum_{i \in N_k} w_i}, \quad w_i = \frac{1}{d(x, x_i)^p}\]

Scikit-learn: KNeighborsClassifier(weights='distance')

When it helps: uneven data density, boundary regions between classes.

Q: How do you speed up KNN? Brute force is \(O(nd)\) per query.

A:

| Method | Query complexity | When |
|---|---|---|
| Brute force | \(O(nd)\) | \(n < 10K\) or \(d > 20\) |
| KD-tree | \(O(d \log n)\) avg | \(d < 20\), dense data |
| Ball tree | \(O(d \log n)\) avg | Any metric, \(d < 40\) |
| ANN (approximate) | \(O(d \log n)\) | \(n > 100K\), some error tolerable |

Approximate Nearest Neighbors (ANN):

| Library | Algorithm | Strengths |
|---|---|---|
| FAISS (Meta) | IVF + PQ | GPU, billions of vectors, production standard |
| Annoy (Spotify) | Random projections | Read-only, fast build, mmap |
| HNSW (hnswlib) | Hierarchical NSW graph | Best recall/speed trade-off |
| ScaNN (Google) | Anisotropic quantization | Optimized for inner product |

Scikit-learn: KNeighborsClassifier(algorithm='auto') chooses between brute/kd_tree/ball_tree automatically.
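A minimal sketch of an exact tree-based index with scikit-learn's `NearestNeighbors` (synthetic data; the choice of `ball_tree` is illustrative):

```python
# Sketch: exact neighbor search via a tree index instead of brute force
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))

# algorithm='auto' would pick brute/kd_tree/ball_tree from n, d, and the metric
index = NearestNeighbors(n_neighbors=5, algorithm='ball_tree').fit(X)
dist, idx = index.kneighbors(X[:1])  # query with the first point itself

# a point in the index is its own nearest neighbor at distance 0
```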

Q: KNN for regression vs classification -- what differs?

A:

| | Classification | Regression |
|---|---|---|
| Prediction | Majority vote among k neighbors | Mean/median of the k neighbors' targets |
| Weighted | Weighted vote | Weighted average |
| Metrics | Accuracy, F1 | MSE, MAE |

Scikit-learn: KNeighborsClassifier vs KNeighborsRegressor.

Gotcha: for regression, weighted KNN is almost always better than uniform -- distant points contribute noise.

Q: Why is feature scaling mandatory for KNN?

A:

Problem: KNN is distance-based, so the feature with the larger scale dominates:

Feature A: salary (30000-150000)
Feature B: age (18-65)
→ the distance is determined almost entirely by salary

Solution:

| Method | When |
|---|---|
| StandardScaler | Gaussian-like features |
| MinMaxScaler | Bounded features, [0, 1] |
| RobustScaler | Outliers present |

Gotcha: fit the scaler ONLY on the train set, then transform both train and test. Anything else is data leakage.
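The leakage-safe pattern is to put the scaler inside a Pipeline, so it is re-fit on the training folds only (a minimal sketch on synthetic data):

```python
# Sketch: StandardScaler inside a Pipeline -- no train/test leakage
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)

pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
# cross_val_score re-fits the scaler inside each training fold
scores = cross_val_score(pipe, X, y, cv=5)
```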

Misconception: KNN is always worse than complex models

On small datasets (\(n < 1000\), \(d < 20\)) with clean data, KNN often beats Random Forest and SVM. Reason: KNN carries no bias from an assumed functional form -- it is purely data-driven. Problems start at \(d > 20\) (curse of dimensionality) or \(n > 50K\) (speed).


Logistic Regression

Q: Why is logistic regression called "regression"?

A: Historically -- because it uses a linear combination of features:

\[z = w^Tx + b\]

Technically it is a classifier (the sigmoid turns z into a probability), but the underlying model is linear regression plus an activation.

Q: How does multiclass logistic regression work?

A:

One-vs-Rest (OvR): K binary classifiers, each with a sigmoid: \[P(y=k|x) = \sigma(w_k^Tx + b_k) = \frac{1}{1 + e^{-(w_k^Tx + b_k)}}\]

Softmax (Multinomial): a single model, all classes at once: \[P(y=k|x) = \frac{e^{w_k^Tx}}{\sum_j e^{w_j^Tx}}\]

Scikit-learn: multi_class='ovr' or 'multinomial'

Q: L1 vs L2 regularization in Logistic Regression

A:

| L1 (Lasso) | L2 (Ridge) |
|---|---|
| Sparse coefficients | Small coefficients |
| Feature selection | Handles multicollinearity |
| Non-differentiable at 0 | Smooth |
| Coordinate descent | Gradient descent |

Practical: L1 if you need interpretability via feature selection, L2 if all features matter.
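The sparsity difference is easy to see directly (a minimal sketch; the C=0.1 strength and dataset shape are illustrative):

```python
# Sketch: L1 zeroes out coefficients, L2 only shrinks them
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

l1 = LogisticRegression(penalty='l1', solver='liblinear', C=0.1).fit(X, y)
l2 = LogisticRegression(penalty='l2', solver='liblinear', C=0.1).fit(X, y)

n_zero_l1 = int(np.sum(l1.coef_ == 0))  # many noise features eliminated
n_zero_l2 = int(np.sum(l2.coef_ == 0))  # shrunk, but rarely exactly zero
```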


K-Means

Q: Does K-means always converge?

A: Yes, but not necessarily to the global optimum.

Theorem: K-means converges to a local minimum in a finite number of steps.

Reason: at each step:

  1. The assignment step decreases J or leaves it unchanged
  2. The update step decreases J or leaves it unchanged
  3. J is bounded below

Problem: it can get stuck in a bad local minimum.

Solution: K-means++ initialization, multiple restarts.

Q: How does K-Means++ initialization work and why does it matter?

A:

The problem with random init: bad starting centroids mean a bad local minimum. Random init produces a result 2-5x worse than optimal in roughly 20% of runs.

K-Means++ algorithm:

  1. Pick the first centroid uniformly at random
  2. For each point, compute the distance \(D(x)\) to its nearest centroid
  3. Pick the next centroid with probability \(\frac{D(x)^2}{\sum D(x)^2}\)
  4. Repeat steps 2-3 until there are k centroids

Guarantee: \(O(\log k)\)-competitive with the optimal solution (Arthur & Vassilvitskii, 2007).

Scikit-learn: KMeans(init='k-means++') -- the default.

Q: How do you choose k in K-means?

A:

| Method | How it works | Pros/cons |
|---|---|---|
| Elbow | Plot \(J(k)\) vs \(k\), look for the "elbow" | Subjective; a clear elbow doesn't always exist |
| Silhouette | \(s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}\) | Objective, \(s \in [-1, 1]\), but \(O(n^2)\) |
| Gap Statistic | Compare against a uniform reference distribution | Statistically grounded, expensive |
| Calinski-Harabasz | \(\frac{B/(k-1)}{W/(n-k)}\) (between/within variance ratio) | Fast, but biased toward convex clusters |

Practical: - Domain knowledge beats metrics (if the business defines 3 customer segments, take k=3) - Silhouette + elbow is the standard combination - If silhouette < 0.25, the clustering is poor: the data has no cluster structure
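A minimal sketch of silhouette-based selection on synthetic blobs (the three cluster centers are chosen by hand so the true k is 3):

```python
# Sketch: pick k by maximizing the silhouette score
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=[[0, 0], [5, 5], [-5, 5]],
                  cluster_std=0.8, random_state=42)

sil = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    sil[k] = silhouette_score(X, labels)

best_k = max(sil, key=sil.get)  # the well-separated blobs give best_k = 3
```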

Q: K-means vs K-medoids

A:

| K-means | K-medoids (PAM) |
|---|---|
| Centroid = mean | Centroid = actual data point |
| Sensitive to outliers | Robust to outliers |
| Euclidean only | Any distance metric |
| \(O(nkd)\) per iteration | \(O(n^2kd)\) -- much slower |

When K-medoids: outliers, non-Euclidean distances, or when the centroid must be interpretable (a real data point).

Q: When does K-Means fail?

A:

| Situation | Why it breaks | Alternative |
|---|---|---|
| Non-convex clusters (half-moons, rings) | K-Means partitions space into Voronoi cells | DBSCAN, Spectral Clustering |
| Clusters of different sizes | A small cluster gets "absorbed" by a large one | GMM, HDBSCAN |
| Clusters of different densities | Sparse points get misassigned to the dense cluster | DBSCAN, OPTICS |
| High-dimensional (\(d > 50\)) | Distances lose meaning (curse of dimensionality) | Spectral Clustering, PCA + K-Means |
| Unknown number of clusters | K must be set manually | DBSCAN, HDBSCAN, X-Means |

Q: K-Means vs DBSCAN vs GMM -- when to use which?

A:

| Aspect | K-Means | DBSCAN | GMM |
|---|---|---|---|
| Cluster shape | Spherical | Arbitrary | Elliptical |
| K specified? | Yes | No (\(\epsilon\), min_pts) | Yes |
| Outliers | No (every point is clustered) | Yes (noise points) | No (but low probability) |
| Soft assignment | No (hard) | No (hard) | Yes (\(P(z_k \mid x)\)) |
| Speed | \(O(nkd)\) | \(O(n \log n)\) with an index | \(O(nkd^2)\) per EM step |
| Scalability | Mini-batch up to millions | Poor > 100K | Poor > 50K |

Rules of thumb: - K-Means: spherical clusters, known k, speed matters - DBSCAN: unknown k, outliers present, non-convex shapes - GMM: soft membership (probabilities) needed, overlapping clusters

Q: Mini-batch K-Means -- why?

A:

Problem: standard K-Means passes over ALL \(n\) points each iteration. At \(n > 1M\) this is slow.

Mini-batch: each iteration uses a random sample of \(b\) points (batch_size=1000 typically).

| | Standard K-Means | Mini-batch K-Means |
|---|---|---|
| Per iteration | \(O(nkd)\) | \(O(bkd)\), \(b \ll n\) |
| Convergence | Stable | Slightly noisier |
| Quality | Baseline | ~1-3% worse inertia |
| Speed | Slow for \(n > 100K\) | 10-100x faster |

Scikit-learn: MiniBatchKMeans(n_clusters=k, batch_size=1000)

Q: Clustering quality metrics -- with and without labels?

A:

External (ground truth available):

| Metric | Formula/Idea | Range |
|---|---|---|
| ARI (Adjusted Rand Index) | Rand Index corrected for chance | \([-1, 1]\), 1 = perfect |
| NMI (Normalized Mutual Information) | \(\frac{2 \cdot MI(U,V)}{H(U) + H(V)}\) | \([0, 1]\) |
| Homogeneity / Completeness | Each cluster = one class / each class = one cluster | \([0, 1]\) |

Internal (no ground truth):

| Metric | Idea | Range |
|---|---|---|
| Silhouette | Cohesion vs separation | \([-1, 1]\) |
| Calinski-Harabasz | Between/within variance | \([0, \infty)\), higher = better |
| Davies-Bouldin | Avg cluster similarity | \([0, \infty)\), lower = better |

Gotcha: silhouette is biased toward convex clusters. For DBSCAN results, prefer DBCV (Density-Based Clustering Validation).

Misconception: K-Means always finds the global optimum

K-Means is guaranteed to converge to a LOCAL minimum in a finite number of steps (monotonic decrease of J), but NOT to the global one. In practice: run with n_init=10 (the scikit-learn default) -- 10 runs with different inits, keep the best. K-Means++ reduces the spread between runs but does not eliminate it.

Misconception: feature scaling is unnecessary for K-Means

K-Means uses Euclidean distance, so a feature with a larger scale dominates. StandardScaler or MinMaxScaler is MANDATORY before K-Means. The only exception: all features already share one scale (e.g., one-hot encoded).


Naive Bayes

Q: Why "naive"? What happens when the assumption is violated?

A:

Naive assumption: features are conditionally independent given the class.

\[P(x|y) = \prod_i P(x_i|y)\]

Reality: features are correlated.

It still works because:

  1. Only the ORDERING of class posteriors matters, not exact probabilities
  2. Overestimation affects all classes and largely cancels out
  3. The strong signal comes from the truly informative features

Q: Gaussian vs Multinomial vs Bernoulli Naive Bayes

A:

| Type | Data | Distribution |
|---|---|---|
| Gaussian | Continuous | Normal |
| Multinomial | Counts (text) | Multinomial |
| Bernoulli | Binary features | Bernoulli |

For text: Multinomial NB is the standard (on term counts; TF-IDF features also work in practice).
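A minimal sketch of the standard text setup (the toy spam/ham corpus is made up for illustration):

```python
# Sketch: Multinomial NB on raw term counts
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["free money win prize", "win cash free offer",
        "meeting agenda project", "project deadline meeting notes"]
labels = ["spam", "spam", "ham", "ham"]

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(docs, labels)

pred = clf.predict(["free prize offer"])[0]  # all three words are spam-only
```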


Decision Trees

Q: Gini vs Entropy -- which to choose?

A:

Gini: \(1 - \sum p_k^2\)

Entropy: \(-\sum p_k \log_2 p_k\)

In practice: nearly identical results.

Differences: - Gini: faster (no log) - Entropy: slightly more sensitive to pure nodes - Gini is the sklearn default

Recommendation: use the default (Gini); only revisit this if training is time-critical.

Q: How do you prevent overfitting in Decision Trees?

A:

Pre-pruning: - max_depth: limit tree depth - min_samples_split: minimum samples to split - min_samples_leaf: minimum samples in leaf - max_leaf_nodes: maximum leaves

Post-pruning: - Cost-complexity pruning (α penalty) - Reduced error pruning

Rule of thumb: Start with max_depth=5-10, tune.

Q: How is feature importance computed in Decision Trees?

A:

Gini Importance (Mean Decrease in Impurity): \[\text{Importance}(j) = \sum_{t \in T} p(t) \cdot \Delta i(t) \cdot \mathbb{1}(j \text{ used at } t)\]

where \(p(t)\) = fraction of samples at node \(t\), \(\Delta i(t)\) = impurity decrease.

Warning: biased towards high-cardinality features!

Alternative: permutation importance (more reliable).
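A minimal sketch of permutation importance with scikit-learn (synthetic data; measuring on a held-out split is the key point):

```python
# Sketch: permutation importance -- shuffle a column, measure the score drop
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=10, n_informative=3,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
# computed on held-out data, so it reflects generalization, not training fit
result = permutation_importance(model, X_te, y_te, n_repeats=10,
                                random_state=0)
```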


SVM

Q: What are support vectors?

A: Support vectors are the points lying on the margin boundary or inside the margin.

Mathematically: the points with \(\alpha_i > 0\) in the dual formulation.

Properties: - Only the support vectors determine the hyperplane - Removing non-support vectors does not change the model - Typically 10-30% of training samples

Follow-up: this makes SVM memory-efficient -- prediction depends only on the support vectors, not on the whole dataset.

Q: Hard margin vs soft margin -- what's the difference, and why C?

A:

Hard margin (linearly separable data): \[\min_{w,b} \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w^Tx_i + b) \geq 1\]

Soft margin (real, noisy data): \[\min_{w,b,\xi} \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i \quad \text{s.t.} \quad y_i(w^Tx_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0\]

The parameter C is a trade-off:

| C | Margin | Train errors | Risk |
|---|---|---|---|
| Small (0.01) | Wide | More allowed | Underfitting |
| Large (100) | Narrow | Heavily penalized | Overfitting |

Practical: tune via CV. sklearn default: C=1.0. Typical grid: [0.01, 0.1, 1, 10, 100].

Q: Why does the SVM kernel trick work?

A:

Idea: map the data to a higher dimension \(\phi(x)\), but compute only the kernel \(K(x,x') = \phi(x)^T\phi(x')\).

Dual formulation: \[f(x) = \sum_i \alpha_i y_i K(x_i, x) + b\]

Key insight: \(\phi(x)\) is never computed explicitly!

RBF kernel: \(\exp(-\gamma\|x-x'\|^2)\) corresponds to an infinite-dimensional mapping. Caveat: the kernel matrix costs \(O(n^2)\) space/time; for large \(n\) (>10K) use approximations (Nystrom, random Fourier features).

Q: How do you choose a kernel?

A:

| Kernel | Formula | When to use |
|---|---|---|
| Linear | \(x^Tx'\) | \(d > n\) (text, genomics), linearly separable data |
| Polynomial | \((\gamma x^Tx' + r)^d\) | Known polynomial relationship, NLP (degree 2-3) |
| RBF (Gaussian) | \(\exp(-\gamma\|x-x'\|^2)\) | Default when the data's structure is unknown |
| Sigmoid | \(\tanh(\gamma x^Tx' + r)\) | Rare; resembles a neural net with 1 hidden layer |

Decision flow:

  1. Start with a linear SVM (LinearSVC) -- fast and often sufficient
  2. If accuracy < target -- RBF SVM (SVC(kernel='rbf'))
  3. Tune \(\gamma\) and \(C\) via GridSearchCV

\(\gamma\) in RBF: - Small \(\gamma\): each point has long-range influence (smoother boundary, underfitting) - Large \(\gamma\): each point influences only its neighbors (complex boundary, overfitting)
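The tuning step above can be sketched with GridSearchCV (synthetic data; the grid values are illustrative, and the scaler is included because SVM requires scaled features):

```python
# Sketch: joint C / gamma tuning for an RBF SVM
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
grid = GridSearchCV(
    pipe,
    {'svc__C': [0.1, 1, 10], 'svc__gamma': ['scale', 0.01, 0.1]},
    cv=5,
)
grid.fit(X, y)
```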

Q: SVM vs Logistic Regression -- when to use which?

A:

| Aspect | SVM | Logistic Regression |
|---|---|---|
| Objective | Max margin (hinge loss) | Max likelihood (log loss) |
| Output | Decision value (not a probability) | Probability \(P(y=1 \mid x)\) |
| Outliers | Less sensitive (hinge is flat beyond the margin) | More sensitive (log loss keeps growing) |
| Feature scaling | Mandatory | Recommended |
| Kernels | Yes (non-linear) | No (linear only) |
| Speed | \(O(n^2)\) - \(O(n^3)\) | \(O(nd)\) |
| Large data | Poor > 50K | Fine on millions |

Rules of thumb: - Need probabilities → Logistic Regression - Little data (\(n < 10K\)), non-linear → SVM with RBF - Lots of data (\(n > 50K\)) → Logistic Regression (or LinearSVC) - Text classification → LinearSVC (often beats LR on sparse data)

Q: Multiclass SVM -- OvO vs OvR

A:

| | One-vs-Rest (OvR) | One-vs-One (OvO) |
|---|---|---|
| Classifiers | \(k\) | \(\frac{k(k-1)}{2}\) |
| Training | Each on \(n\) samples | Each on \(\approx \frac{2n}{k}\) samples |
| Prediction | Max confidence score | Majority voting |
| Better when | Large \(k\), large \(n\) | Small \(n\), kernel SVM |

Scikit-learn: SVC uses OvO by default; LinearSVC uses OvR. For more than 10 classes, OvR is faster.

Q: SVM for regression (SVR)

A:

Idea: an \(\epsilon\)-insensitive tube -- errors inside \(\epsilon\) are not penalized.

\[\min_{w,b} \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}(\xi_i + \xi_i^*)\]
\[\text{s.t.} \quad |y_i - (w^Tx_i + b)| \leq \epsilon + \xi_i\]

| Parameter | Effect |
|---|---|
| \(\epsilon\) | Tube width (error tolerance) |
| \(C\) | Penalty for leaving the tube |

When SVR: small dataset, non-linear dependencies, outliers (the tube ignores them).

Q: SVM for imbalanced data

A:

Class weights: \[C_+ = C \cdot \frac{n}{2 \cdot n_+}, \quad C_- = C \cdot \frac{n}{2 \cdot n_-}\]

Scikit-learn:

SVC(class_weight='balanced')
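The 'balanced' mode computes exactly the \(\frac{n}{k \cdot n_c}\) weights from the formula above; a minimal sketch with a made-up 9:1 split:

```python
# Sketch: 'balanced' class weights reproduce n / (n_classes * n_c)
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 90 + [1] * 10)  # 9:1 imbalance

weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=y)
# class 0: 100 / (2 * 90), class 1: 100 / (2 * 10)
```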

Q: SVM scalability -- when NOT to use it?

A:

| \(n\) (samples) | Recommendation |
|---|---|
| < 10K | SVM with any kernel |
| 10K-100K | LinearSVC (or SGDClassifier with hinge loss) |
| > 100K | Not SVM. Use LogReg, GBDT, neural nets |

Reason: kernel SVM builds an \(n \times n\) kernel matrix -- \(O(n^2)\) memory, \(O(n^3)\) training.

Large-scale alternatives:

| Method | Complexity |
|---|---|
| LinearSVC (liblinear) | \(O(nd)\) |
| SGDClassifier(loss='hinge') | \(O(nd)\), online |
| Nystrom approximation | \(O(nm^2)\), \(m \ll n\) |
| Random Fourier Features | \(O(nDd)\), \(D\) = projection dim |

Q: Where is SVM still relevant in 2026?

A:

| Task | Why SVM |
|---|---|
| Text classification (small corpus) | LinearSVC on TF-IDF often beats fine-tuned BERT at \(n < 5K\) |
| Bioinformatics (gene expression) | \(d \gg n\), kernel methods are natural |
| Anomaly detection (One-Class SVM) | No anomalous examples needed for training |
| Small dataset + non-linear | RBF SVM at \(n < 1K\) often beats RF/GBDT |

Where SVM lost: tabular data > 10K samples (GBDT wins), vision (CNN), NLP (Transformers), anything large-scale.

Misconception: SVM outputs probabilities

SVC.predict_proba() in sklearn uses Platt scaling (sigmoid calibration on top of decision values). This is NOT a native probability -- it is post-hoc calibration, slow (\(O(n^2)\) cross-validation), and it can be inaccurate. If you need probabilities, use Logistic Regression.

Misconception: the RBF kernel always beats linear

At \(d > n\) (high-dimensional, sparse data) a linear kernel is often better than RBF. Reason: in high-dimensional space the data is often linearly separable, so RBF adds unnecessary complexity and overfits. Rule: text/genomics → linear, low-dimensional tabular → RBF.


Gradient Boosting

Q: Gradient Boosting vs Random Forest — when to use which?

A:

| Gradient Boosting | Random Forest |
|---|---|
| Sequential training | Parallel training |
| Low bias, higher variance | Higher bias, low variance |
| Prone to overfitting | Resistant to overfitting |
| Requires careful tuning | Easy to tune |
| Better accuracy (potential) | Good baseline |

Practical: - Start with RF for baseline - Use GBDT if need max accuracy - XGBoost/LightGBM/CatBoost > sklearn GBDT

Q: How do you choose the learning rate in Gradient Boosting?

A:

Trade-off: - Low LR (0.01): More trees needed, better generalization - High LR (0.3): Fewer trees, faster, may overfit

Rule of thumb: - Start with LR=0.1, n_estimators=100 - If overfitting: decrease LR, increase n_estimators - Typical: LR=0.01-0.1

Relation: \(n\_estimators \propto 1/LR\)
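A minimal sketch of the trade-off on synthetic data (the specific LR/tree pairs are illustrative): a quarter of the learning rate with 4x the trees tends to match or beat the faster configuration.

```python
# Sketch: low LR + many trees vs high LR + few trees
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

fast = GradientBoostingClassifier(learning_rate=0.2, n_estimators=50,
                                  random_state=0).fit(X_tr, y_tr)
slow = GradientBoostingClassifier(learning_rate=0.05, n_estimators=200,
                                  random_state=0).fit(X_tr, y_tr)

acc_fast = fast.score(X_te, y_te)
acc_slow = slow.score(X_te, y_te)
```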

Q: XGBoost vs LightGBM vs CatBoost

A:

| Feature | XGBoost | LightGBM | CatBoost |
|---|---|---|---|
| Tree growth | Level-wise | Leaf-wise | Symmetric |
| Categorical | Manual | Native | Native (best) |
| Missing values | Native | Native | Native |
| Speed | Good | Fastest | Good |
| Memory | Medium | Low | Medium |
| Tuning | Complex | Medium | Easy |

Practical: - CatBoost: Best for categorical, minimal tuning - LightGBM: Fastest for large datasets - XGBoost: Most mature, good default


Feature Engineering

Q: Target Encoding vs One-Hot for high-cardinality features

A:

| One-Hot | Target Encoding |
|---|---|
| 1 column per category | 1 column total |
| No leakage risk | Leakage risk |
| Works for tree models | Works for linear models |
| O(k) dimensions | O(1) dimension |

Target Encoding risks: - Leakage если не использовать CV - Overfitting на rare categories

Solution: Leave-one-out, smoothing, CV-based encoding.
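A minimal sketch of CV-based encoding with smoothing (the `city` column, toy values, and the smoothing strength m=10 are made up for illustration):

```python
# Sketch: out-of-fold target encoding with smoothing
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({
    'city': ['a', 'a', 'b', 'b', 'c', 'c', 'a', 'b'],
    'y':    [1,   0,   1,   1,   0,   0,   1,   1],
})
global_mean = df['y'].mean()
m = 10  # smoothing strength: pulls rare categories toward the global mean

df['city_te'] = global_mean
for tr_idx, val_idx in KFold(n_splits=4, shuffle=True, random_state=0).split(df):
    # statistics come from the training fold only -- no target leakage
    stats = df.iloc[tr_idx].groupby('city')['y'].agg(['mean', 'count'])
    smoothed = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)
    df.loc[df.index[val_idx], 'city_te'] = (
        df.iloc[val_idx]['city'].map(smoothed).fillna(global_mean).values
    )
```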

Q: When should you use a log transform?

A:

Для: - Right-skewed distributions (income, prices) - Positive values only - Multiplicative relationships

Effect: - Reduces skewness - Stabilizes variance - Makes relationships more linear

# Log transform
import numpy as np
X_log = np.log1p(X)  # log(1+X): handles zeros, requires X >= 0

# Power transform (Box-Cox requires strictly positive values;
# use method='yeo-johnson' if X contains zeros or negatives)
from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer(method='box-cox')
X_transformed = pt.fit_transform(X)

Feature Selection

Q: RFE vs Feature Importance — which is better?

A:

RFE (Recursive Feature Elimination): - Trains model, removes weakest feature, repeat - Computationally expensive - More reliable ranking

Feature Importance: - Single model training - Faster - May be biased (high-cardinality)

Practical: - Quick: Feature importance from Random Forest - Important: RFE with cross-validation

Q: Mutual Information vs Correlation for feature selection

A:

| Correlation | Mutual Information |
|---|---|
| Linear only | Any relationship |
| [-1, 1] scale | [0, ∞) scale |
| Fast to compute | Slower |
| Gaussian assumption | No assumption |

When MI > Correlation: - Non-linear relationships - Categorical features - Complex interactions
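The classic demonstration is a purely quadratic relationship: linear correlation is near zero while mutual information clearly detects it (a minimal sketch on synthetic data):

```python
# Sketch: y = x^2 -- no linear correlation, but a strong MI signal
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=2000)
y = x ** 2 + rng.normal(scale=0.01, size=2000)

corr = np.corrcoef(x, y)[0, 1]  # near 0: no linear relationship
mi = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0]  # > 0
```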


Hard Questions (Senior+)

Q: Derive the gradient for Logistic Regression with L2 regularization

A:

\[L = -\sum[y_i\log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)] + \frac{\lambda}{2}\|w\|^2\]
\[\frac{\partial L}{\partial w} = X^T(\hat{y} - y) + \lambda w\]
\[\frac{\partial L}{\partial b} = \sum(\hat{y}_i - y_i)\]

Q: How does early stopping work?

A:

Algorithm: 1. Split train → train_sub + validation 2. Train, evaluate on validation each epoch 3. Track best validation score 4. Stop if no improvement for patience epochs 5. Return best model (not last!)

Practical:

from copy import deepcopy
import numpy as np
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.1)

best_score = 0
best_model = None
patience = 10
wait = 0
for epoch in range(max_epochs):
    # incremental training: partial_fit continues from the previous epoch
    # (a plain .fit() would retrain from scratch every iteration);
    # model must support partial_fit, e.g. SGDClassifier
    model.partial_fit(X_train, y_train, classes=np.unique(y))
    score = model.score(X_val, y_val)
    if score > best_score:
        best_score = score
        best_model = deepcopy(model)  # sklearn's clone() would drop the fitted state
        wait = 0
    else:
        wait += 1
        if wait >= patience:
            break

Q: Stratified K-Fold vs K-Fold — when to use which?

A:

| K-Fold | Stratified K-Fold |
|---|---|
| Random split | Preserves class ratios |
| Works for regression | Classification only |
| Folds may be unbalanced | Balanced folds |

When Stratified: - Imbalanced classification - Small datasets - Rare classes

When K-Fold: - Regression - Large balanced datasets - Time series (use TimeSeriesSplit instead)


Model Interpretability (SHAP & LIME) — NEW 2026

Q: Why does model interpretability matter?

A:

Business reasons: - Regulatory compliance (GDPR "right to explanation") - Trust building with stakeholders - Debug model biases and errors - Feature leakage detection

Technical reasons: - Validate model behavior matches domain knowledge - Identify spurious correlations - Debug poor performance on specific cases

Q: SHAP vs LIME — what's the difference?

A:

| SHAP | LIME |
|---|---|
| Game-theoretic (Shapley values) | Local surrogate model |
| Consistent, additive | May be inconsistent |
| Global + local explanations | Local only |
| Slower (especially KernelSHAP) | Faster |
| Feature interactions visible | No interactions by default |

SHAP formula: \[\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!(|N|-|S|-1)!}{|N|!}[f(S \cup \{i\}) - f(S)]\]

LIME formula: \[\xi(x) = \arg\min_{g \in G} \mathcal{L}(f, g, \pi_x) + \Omega(g)\]

Q: How do you interpret SHAP values?

A:

Global interpretation: - Mean |SHAP| = feature importance - SHAP distribution = effect direction (positive/negative) - Dependence plots = feature interaction effects

Local interpretation: - SHAP value = contribution of feature to this prediction - Sum of all SHAP values = prediction - base_value - Positive SHAP = increases prediction - Negative SHAP = decreases prediction

import shap

# TreeExplainer for tree models (fast)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Visualization
shap.summary_plot(shap_values, X)
shap.dependence_plot("feature_name", shap_values, X)

# Local explanation
shap.force_plot(explainer.expected_value, shap_values[0], X.iloc[0])

Q: When to use SHAP, and when LIME?

A:

Use SHAP when: - Need consistent explanations - Want global feature importance - Tree models (TreeExplainer is fast) - Budget allows for computation

Use LIME when: - Need quick local explanations - Any model type (model-agnostic) - Limited compute resources - Text or image explanations

Production tip: Pre-compute SHAP for common cases, use LIME for real-time ad-hoc explanations.

Q: SHAP/LIME problems in production

A:

Challenges: 1. Computational cost: KernelSHAP needs many model calls 2. Stability: Explanations may vary between runs 3. Counterfactual: Doesn't tell "what if" (need different tools) 4. Human interpretation: Still requires ML knowledge to understand

Solutions: - TreeExplainer for tree models (exact, fast) - Pre-compute explanations for common inputs - Cache results - Use for debugging, not as sole explanation


Reinforcement Learning Basics

Q: What's the difference between value-based and policy-based methods?

A:

| Value-based | Policy-based |
|---|---|
| Learn Q(s,a) or V(s) | Learn π(a\|s) directly |
| Pick the action via argmax | Sample from the distribution |
| DQN, Q-learning | REINFORCE, A3C, PPO |
| Discrete actions | Continuous actions |
| Sample efficient | Needs many episodes |
| Lower variance (off-policy) | High variance (Monte Carlo returns) |

Hybrid approach (Actor-Critic): the actor learns the policy, the critic estimates the value function.

Q: Explain the Q-learning algorithm

A:

Idea: iteratively update Q-values via the Bellman equation.

Q-learning update: \[Q(s,a) \leftarrow Q(s,a) + \alpha [r + \gamma \max_{a'} Q(s',a') - Q(s,a)]\]

Key components: - \(\alpha\) — learning rate - \(\gamma\) — discount factor (0.9-0.99) - \(\epsilon\)-greedy — exploration vs exploitation

Deep Q-Network (DQN): - the Q-function is approximated by a neural network - experience replay — learn from past transitions - target network — a separate network for stability

# Q-learning update
q_table[state, action] += lr * (
    reward + gamma * np.max(q_table[next_state]) - q_table[state, action]
)
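The update rule above can be run end to end on a toy environment; a minimal sketch on a made-up 5-state chain where reward 1 is given only for reaching the right end:

```python
# Sketch: tabular Q-learning on a 5-state chain MDP
import numpy as np

n_states, n_actions = 5, 2  # actions: 0 = left, 1 = right; state 4 is terminal
q_table = np.zeros((n_states, n_actions))
lr, gamma, epsilon = 0.5, 0.9, 0.2
rng = np.random.default_rng(0)

for _ in range(500):
    state = 0
    while state != n_states - 1:
        # epsilon-greedy action selection
        if rng.random() < epsilon:
            action = int(rng.integers(n_actions))
        else:
            action = int(np.argmax(q_table[state]))
        next_state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
        reward = 1.0 if next_state == n_states - 1 else 0.0
        # the Bellman update from the snippet above
        q_table[state, action] += lr * (
            reward + gamma * np.max(q_table[next_state]) - q_table[state, action]
        )
        state = next_state

greedy_policy = np.argmax(q_table[:-1], axis=1)  # learned: always go right
```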

Q: What is a policy gradient? REINFORCE?

A:

Policy Gradient Theorem: \[\nabla J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla \log \pi_\theta(a|s) \cdot Q^{\pi}(s,a)]\]

Intuition: increase the probability of actions that led to high reward.

REINFORCE algorithm:

  1. Sample a trajectory \(\tau\) from \(\pi_\theta\)
  2. Compute the return \(G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k\)
  3. Update: \(\theta \leftarrow \theta + \alpha \nabla \log \pi_\theta(a_t|s_t) G_t\)

The problem with REINFORCE: high variance (one trajectory = a noisy estimate).

Q: PPO — why is it popular?

A:

Proximal Policy Optimization fixes the instability of vanilla policy gradients.

Key idea: don't change the policy too much in one step.

Clipped objective: \[L^{CLIP}(\theta) = \mathbb{E}_t[\min(r_t(\theta)\hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t)]\]

where \(r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\) is the probability ratio.

Advantages: - Stable (clipping prevents large updates) - Sample efficient (reuses samples) - Simple to implement - SOTA for many RL tasks

Q: Exploration vs exploitation — how to balance them?

A:

Problem: you must both try new actions (exploration) and use the best known ones (exploitation).

Strategies:

  1. \(\epsilon\)-greedy: with probability \(\epsilon\) take a random action, otherwise the best action. Decay \(\epsilon\) from 1.0 to 0.01.

  2. Upper Confidence Bound (UCB): \[a_t = \arg\max_a [Q_t(a) + c\sqrt{\frac{\ln t}{N_t(a)}}]\] Balances exploitation (the Q-value) against exploration (the uncertainty term).

  3. Thompson Sampling: a Bayesian approach — sample from the posterior over Q-values.

  4. Entropy regularization: add \(-\beta H(\pi)\) to the loss to encourage diversity.

In practice: \(\epsilon\)-greedy for simplicity, UCB for bandits, entropy regularization for continuous control.
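The UCB rule above can be sketched on a toy bandit (the three arm probabilities and c=2.0 are made up for illustration):

```python
# Sketch: UCB1 action selection on a 3-armed Bernoulli bandit
import numpy as np

rng = np.random.default_rng(0)
true_p = [0.2, 0.5, 0.8]  # hidden reward probability per arm
n_arms, c = 3, 2.0
counts = np.zeros(n_arms)
values = np.zeros(n_arms)

for t in range(1, 2001):
    if 0 in counts:  # play each arm once before using the UCB formula
        arm = int(np.argmin(counts))
    else:
        ucb = values + c * np.sqrt(np.log(t) / counts)  # Q + uncertainty bonus
        arm = int(np.argmax(ucb))
    reward = float(rng.random() < true_p[arm])
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean

best_arm = int(np.argmax(counts))  # pulls concentrate on the best arm
```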


NLP: Word Embeddings (Word2Vec, GloVe)

Q: What's the difference between CBOW and Skip-gram?

A:

| CBOW | Skip-gram |
|---|---|
| Predicts the center word from the context | Predicts the context from the center word |
| Faster; stronger on frequent words | Stronger on rare words |
| Smooths the context (averaging) | Exact for each context word |
| \(P(w_t \mid w_{t-c}, \ldots, w_{t+c})\) | \(P(w_{t+j} \mid w_t)\) |

CBOW: input — one-hot context words → averaging → hidden layer → softmax over the center word.

Skip-gram: input — one-hot center word → hidden layer → an independent softmax for each context word.

In practice: Skip-gram with negative sampling is the standard (the Google News word2vec vectors).

Q: How does Negative Sampling work?

A:

Problem: a softmax over the whole vocabulary (100K+ words) is expensive at every training step.

Solution: replace the softmax with a binary classification per example.

Original softmax: \[P(w_o | w_i) = \frac{\exp(v_{w_o}^T v_{w_i})}{\sum_{w \in V} \exp(v_w^T v_{w_i})}\]

Negative Sampling objective: \[L = \log \sigma(v_{w_o}^T v_{w_i}) + \sum_{k=1}^{K} \mathbb{E}_{w_k \sim P_n(w)}[\log \sigma(-v_{w_k}^T v_{w_i})]\]

where K = 5-20 negative samples and \(P_n(w) \propto f(w)^{3/4}\) (the 0.75 power up-weights rare words).

Idea: one positive pair (center, context) plus K negative pairs (center, random word).

# Negative sampling loss (sketch: center and context are 1-D embedding
# tensors of size dim; negative_samples is a (K, dim) tensor)
import torch

def negative_sampling_loss(center, context, negative_samples):
    # Positive: the observed center-context pair should score high
    pos_score = torch.dot(center, context)
    pos_loss = -torch.log(torch.sigmoid(pos_score))

    # Negative: K random words should score low against the center
    neg_scores = torch.matmul(negative_samples, center)
    neg_loss = -torch.sum(torch.log(torch.sigmoid(-neg_scores)))

    return pos_loss + neg_loss

Q: Word2Vec vs GloVe — what's the difference?

A:

| Word2Vec | GloVe |
|---|---|
| Predictive (local context) | Count-based (global co-occurrence) |
| Skip-gram / CBOW | Matrix factorization |
| Sliding window | Co-occurrence matrix |
| Online learning | Batch (matrix ops) |
| No explicit global info | Captures global statistics |

GloVe objective: \[J = \sum_{i,j=1}^{V} f(X_{ij})(w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij})^2\]

where \(X_{ij}\) is the co-occurrence count and \(f\) is a weighting function (down-weights overly frequent pairs).

In practice: GloVe is often better on analogies, Word2Vec on downstream tasks with fine-tuning.

Q: Why do word embeddings capture semantics?

A:

Distributional Hypothesis: "You shall know a word by the company it keeps" (Firth, 1957).

Mechanism:

  1. Similar words appear in similar contexts
  2. The model learns to predict context → similar vectors for similar contexts
  3. The vector space reflects distributional similarity

Analogies: \(\vec{king} - \vec{man} + \vec{woman} \approx \vec{queen}\)

Limitations: - Polysemy: "bank" (river vs financial) gets a single vector - Antonyms can end up close (similar contexts) - No compositional meaning

Modern fix: contextualized embeddings (BERT, ELMo) — different vectors for different contexts.

Q: What is FastText and how does it differ from Word2Vec?

A:

FastText (Facebook AI Research, 2016) extends Word2Vec with subword information.

Key difference: a word is represented as a bag of character n-grams with boundary markers, e.g. for n=3 "apple" → ["<ap", "app", "ppl", "ple", "le>"] plus the whole-word token "<apple>". A word's vector is the sum of its n-gram vectors.

Formula: \[\vec{w} = \sum_{g \in G_w} \vec{z}_g\]

where \(G_w\) is the set of n-grams of word \(w\).

Advantages:

  1. OOV handling: can build an embedding for unseen words
  2. Morphology: captures that "running", "runner", "runs" share patterns
  3. Rare words: better thanks to shared subwords
  4. Multilingual: works well for morphologically rich languages (Russian, German)

Disadvantages:

  1. Memory: more vectors (n-grams vs words)
  2. Noise: subwords can introduce noise
  3. Slower: more parameters

import fasttext

# Train FastText
model = fasttext.train_unsupervised(
    'data.txt',
    model='skipgram',
    dim=300,
    ws=5,          # window size
    minCount=5,
    minn=3,        # min n-gram
    maxn=6         # max n-gram
)

# OOV handling — works!
embedding = model.get_word_vector('unprecedentedword')

Q: Word2Vec vs GloVe vs FastText — when to use which?

A:

| Criteria | Word2Vec | GloVe | FastText |
|---|---|---|---|
| Training | Predictive (local) | Count-based (global) | Predictive + subwords |
| OOV handling | ❌ No | ❌ No | ✅ Yes (via subwords) |
| Memory | Low | High (co-occurrence matrix) | Medium-High |
| Speed | Fast | Medium | Medium |
| Rare words | Poor | Medium | Good |
| Morphology | No | No | Yes |
| Best for | General NLP, speed | Analogy tasks | OOV, morphological langs |

Decision framework:

# Choose Word2Vec when:
# - Speed priority
# - Well-defined vocabulary (no OOV expected)
# - Limited compute

# Choose GloVe when:
# - Global context matters
# - Analogy tasks important
# - Clean, large corpus available

# Choose FastText when:
# - OOV words common (user-generated content)
# - Morphologically rich language (Russian, Finnish)
# - Domain-specific vocabulary

2026 Context: Static embeddings → less common with transformers, but still useful for: - Lightweight production systems - Resource-constrained environments - Baseline comparisons - Word similarity tasks

Q: Как оценить качество word embeddings?

A:

Intrinsic evaluation: 1. Analogy tests: a:b :: c:? (Google analogy dataset) 2. Similarity correlation: Spearman с human judgments (WordSim-353, SimLex-999) 3. Concept categorization: Clustering quality (WordNet)

Extrinsic evaluation: 1. Downstream task performance (NER, sentiment, QA) 2. Probe tasks (part-of-speech, syntactic tree depth)

Practical: Intrinsic — для разработки, Extrinsic — для production.

# Cosine similarity для word vectors
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def word_similarity(w1, w2, embeddings):
    v1 = embeddings[w1]
    v2 = embeddings[w2]
    return cosine_similarity([v1], [v2])[0][0]

# Analogy: king - man + woman = ?
def analogy(a, b, c, embeddings):
    """Find d such that a:b :: c:d (assumes gensim-style KeyedVectors)"""
    target = embeddings[a] - embeddings[b] + embeddings[c]
    # Find nearest word to target (attribute is index_to_key in gensim >= 4)
    similarities = cosine_similarity([target], embeddings.vectors)[0]
    return embeddings.index_to_key[int(np.argmax(similarities))]

NLP: Named Entity Recognition (NER) & Sequence Labeling

Q: Что такое NER и как оценивать?

A:

Named Entity Recognition — задача извлечения именованных сущностей (Person, Organization, Location, Date, etc.) из текста.

Формат: BIO tagging (Begin-Inside-Outside) - B-PER, I-PER — person name - B-ORG, I-ORG — organization - O — not an entity
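A minimal sketch of converting entity spans to BIO tags; the `spans_to_bio` helper and its `(start, end, type)` span format (token indices, end exclusive) are illustrative, not a specific library's API:

```python
def spans_to_bio(tokens, spans):
    """Convert entity spans to BIO tags. Spans are (start, end, type)
    with token indices, end exclusive — a made-up format for
    illustration, not a specific library's API."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"          # Begin
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"          # Inside
    return tags                              # everything else stays O

tokens = ["Barack", "Obama", "visited", "Google", "HQ"]
tags = spans_to_bio(tokens, [(0, 2, "PER"), (3, 4, "ORG")])
# ["B-PER", "I-PER", "O", "B-ORG", "O"]
```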

Метрики:

Token-level: - Precision, Recall, F1 для каждого класса - Micro vs Macro averaging

Entity-level (строже): - Exact match: границы и тип должны совпасть - Partial match: overlap > threshold

CoNLL-2003 standard: Entity-level F1.
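Entity-level exact-match F1 can be computed directly on sets of spans; a small sketch (the set-of-tuples representation is assumed for illustration):

```python
def entity_f1(gold, pred):
    """Exact-match entity-level F1 (CoNLL-style): an entity counts
    only if both boundaries and type match. gold/pred are sets of
    (start, end, type) tuples."""
    tp = len(gold & pred)                       # exact span+type matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {(0, 2, "PER"), (3, 4, "ORG")}
pred = {(0, 2, "PER"), (3, 5, "ORG")}  # ORG boundary is off by one
score = entity_f1(gold, pred)          # 0.5 — the boundary error costs the whole entity
```

Note how much stricter this is than token-level F1: the ORG prediction overlaps the gold entity but earns zero credit.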

Q: CRF vs BiLSTM vs BERT для NER?

A:

| CRF | BiLSTM-CRF | BERT |
|-----|------------|------|
| Hand-crafted features | Learned features | Contextualized embeddings |
| No deep learning | Sequence model | Pretrained transformer |
| Fast inference | Medium | Slow (but fine-tuning helps) |
| Works on small data | Needs more data | Transfer learning |

BiLSTM-CRF: - BiLSTM: contextual representations - CRF layer: learns transition constraints (I-PER after B-PER, not I-ORG)

BERT for NER: - Fine-tune BERT + linear classifier - Subword tokenization → use first subword for entity - SOTA on CoNLL-2003 (93+ F1)

Q: Как обрабатывать nested entities?

A:

Problem: "University of California" — ORG, но "California" внутри — LOC.

Approaches: 1. Flat NER: Игнорировать вложенность (стандартный подход) 2. Layered NER: Два прохода — сначала outer, потом inner entities 3. Hypergraph decoding: Joint prediction всех уровней 4. Seq2seq: Generate entity spans with markers

Практика: Большинство систем — flat NER, nested — отдельная post-processing или специализированные модели.

Q: POS Tagging — основные подходы?

A:

Part-of-Speech Tagging — присвоение грамматических категорий (NOUN, VERB, ADJ, etc.) словам.

Approaches:

1. HMM (Hidden Markov Model):

\[P(t_1^n, w_1^n) = \prod_{i=1}^{n} P(w_i | t_i)\, P(t_i | t_{i-1})\]

   - Emission: \(P(w|t)\) — word given tag
   - Transition: \(P(t_i | t_{i-1})\) — tag bigram
   - Viterbi decoding

2. CRF (Conditional Random Field):

\[P(t|w) = \frac{1}{Z(w)} \exp\left(\sum_i \theta \cdot f(t_{i-1}, t_i, w, i)\right)\]

   - Features: word, suffix, prefix, neighboring tags
   - Global normalization

3. BiLSTM / BiLSTM-CRF:

   - Learned features, no manual feature engineering

4. BERT fine-tuning:

   - Contextual representations
   - 97%+ accuracy на Penn Treebank

Practical: BERT для high accuracy, HMM/CRF для скорости и interpretability.
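The Viterbi decoding step mentioned above can be sketched in plain numpy; the 2-tag HMM below (`pi`, `A`, `B`) uses made-up probabilities purely for illustration:

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Viterbi decoding for an HMM tagger, in log-space.
    obs: observation (word) indices; pi: initial tag probs (K,);
    A: transition probs (K, K); B: emission probs (K, V).
    Returns the most likely tag sequence."""
    K, T = len(pi), len(obs)
    log_pi, log_A, log_B = np.log(pi), np.log(A), np.log(B)
    delta = np.zeros((T, K))           # best log-prob ending in tag k at step t
    psi = np.zeros((T, K), dtype=int)  # backpointers
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A   # (K_prev, K_cur)
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
    # Backtrack from the best final tag
    tags = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        tags.append(int(psi[t][tags[-1]]))
    return tags[::-1]

# Toy 2-tag, 2-word HMM (illustrative numbers)
pi = np.array([0.9, 0.1])
A = np.array([[0.8, 0.2], [0.3, 0.7]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
path = viterbi([0, 1, 1], pi, A, B)  # [0, 1, 1]
```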


Hyperparameter Optimization

Q: Parameters vs Hyperparameters — разница?

A:

Parameters Hyperparameters
Learned from data during training Set before training
Internal to model (weights, biases) Control learning process
Optimized by optimizer (SGD, Adam) Set by practitioner or search
Examples: weights in NN, coefficients in regression Examples: learning rate, batch size, num layers

Hyperparameters determine HOW model learns, parameters are WHAT model learns.

Q: Grid Search vs Random Search — когда что использовать?

A:

Grid Search: Exhaustive search over all combinations.

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['rbf', 'linear'],
    'gamma': ['scale', 'auto']
}
grid_search = GridSearchCV(SVC(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

Random Search: Sample random combinations.

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import loguniform

param_dist = {
    'C': loguniform(1e-3, 1e3),
    'kernel': ['rbf', 'linear'],
    'gamma': ['scale', 'auto']
}
random_search = RandomizedSearchCV(SVC(), param_dist, n_iter=50, cv=5)

When Random > Grid: - Most hyperparameters don't matter much (only a few are important) - Random search explores more values for important params - Paper: "Random Search for Hyper-Parameter Optimization" (Bergstra & Bengio, 2012)

Q: Что такое Bayesian Optimization?

A:

Idea: Build probabilistic model of objective function, use it to guide search.

Components: 1. Surrogate model: Gaussian Process (GP) approximates f(x) 2. Acquisition function: Decides where to sample next (balance exploration vs exploitation)

Acquisition functions: - Expected Improvement (EI): \(EI(x) = \mathbb{E}[\max(f(x) - f^*, 0)]\) - Upper Confidence Bound (UCB): \(UCB(x) = \mu(x) + \beta \sigma(x)\) - Probability of Improvement (PI): \(PI(x) = P(f(x) > f^*)\)
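The Expected Improvement formula above can be written out directly; this sketch assumes a GP posterior mean `mu` and std `sigma` at the candidate points are already available (no GP fitting here):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best, xi=0.01):
    """EI acquisition for maximization given GP posterior mean/std.
    xi > 0 shifts the balance toward exploration."""
    sigma = np.maximum(sigma, 1e-12)          # avoid division by zero
    z = (mu - f_best - xi) / sigma
    # Closed form: exploitation term + exploration term
    return (mu - f_best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

mu = np.array([0.5, 0.7, 0.9])      # posterior means at 3 candidates
sigma = np.array([0.2, 0.1, 0.01])  # posterior stds
ei = expected_improvement(mu, sigma, f_best=0.8)
next_point = int(ei.argmax())        # candidate to evaluate next
```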

Optuna example:

import optuna

def objective(trial):
    lr = trial.suggest_float('lr', 1e-5, 1e-1, log=True)
    batch_size = trial.suggest_categorical('batch_size', [32, 64, 128])
    layers = trial.suggest_int('layers', 1, 5)

    model = build_model(lr, batch_size, layers)
    score = train_and_evaluate(model)
    return score

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)

When Bayesian > Random: - Expensive evaluations (training takes hours) - Low-dimensional search space (<20 params) - Smooth objective function

Q: Optuna vs Ray Tune — когда что?

A:

Aspect Optuna Ray Tune
Focus Single-node optimization Distributed at scale
Sampling TPE, CMA-ES, GP Same + population-based
Distributed Via RDB/Redis Native Ray cluster
Early stopping Pruning (Median, Async) PBT, ASHA, Hyperband
Integration Sklearn, PyTorch, TF PyTorch, TF, XGBoost
Ease of use Simpler API More complex

Use Optuna when: - Single machine or small cluster - Need sophisticated sampling (TPE, CMA-ES) - Simpler setup

Use Ray Tune when: - Large-scale distributed training - Population-Based Training (PBT) - Already using Ray ecosystem

Q: Что такое Early Stopping в HPO?

A:

Problem: Many trials are bad — stop them early to save compute.

Approaches:

1. Median Pruning (Optuna):

pruner = optuna.pruners.MedianPruner(n_startup_trials=5, n_warmup_steps=10)
study = optuna.create_study(pruner=pruner)

Mechanism: At step k, if trial's intermediate value < median of previous trials, prune.

2. ASHA (Async Successive Halving): - Run many trials with minimal resources - Promote top performers to more resources - Early stop underperformers

3. Hyperband: - Multiple brackets of ASHA with different resource allocations - Better theoretical guarantees
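The successive-halving core of ASHA/Hyperband can be illustrated with a toy loop; the `evaluate` function here is a deterministic stand-in for partial training:

```python
import numpy as np

def successive_halving(configs, eta=3, budget=1):
    """One run of successive halving: score every config on a small
    budget, keep the top 1/eta, multiply the budget by eta, repeat
    until a single survivor remains."""
    def evaluate(cfg, budget):
        # stand-in for partial training; deterministic for the sketch
        return cfg["quality"]
    survivors = list(configs)
    while len(survivors) > 1:
        scores = [evaluate(c, budget) for c in survivors]
        k = max(1, len(survivors) // eta)
        top = np.argsort(scores)[::-1][:k]       # promote top performers
        survivors = [survivors[i] for i in top]
        budget *= eta                            # survivors get more budget
    return survivors[0]

configs = [{"id": i, "quality": q}
           for i, q in enumerate([0.2, 0.5, 0.9, 0.4, 0.6, 0.3])]
best = successive_halving(configs)  # the id=2 config survives both rungs
```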

Q: Как приоритизировать гиперпараметры для тюнинга?

A:

Priority order (higher = tune first):

  1. Learning rate — biggest impact on convergence
  2. Batch size — affects generalization and speed
  3. Optimizer — Adam vs SGD with momentum
  4. Architecture — layers, units per layer
  5. Regularization — dropout, weight decay
  6. Data augmentation — for vision

Coarse-to-fine strategy:

# Stage 1: Coarse search
lr_range = [1e-4, 1e-3, 1e-2, 1e-1]  # Log scale

# Stage 2: Fine search around best
best_lr = 1e-3
lr_range = [5e-4, 1e-3, 2e-3, 5e-3]

Q: Nested Cross-Validation для HPO — зачем?

A:

Problem: Using same CV split for HPO and evaluation → overfitting to validation set.

Solution: Nested CV — inner loop for HPO, outer loop for evaluation.

Outer CV (k=5 folds):
  For each fold:
    Inner CV (k=5 folds):
      GridSearchCV on training portion
    Evaluate best model on outer test fold

from sklearn.model_selection import KFold, GridSearchCV, cross_val_score

# Inner: HPO
inner_cv = KFold(n_splits=5)
clf = GridSearchCV(estimator=model, param_grid=param_grid, cv=inner_cv)

# Outer: Evaluation
outer_cv = KFold(n_splits=5)
nested_score = cross_val_score(clf, X, y, cv=outer_cv)

Trade-off: 5×5 = 25 model fits per HPO candidate → expensive but unbiased.

Q: Sensitivity Analysis для гиперпараметров?

A:

Goal: Understand which hyperparameters matter most.

Methods:

1. One-at-a-time (OAT): Vary one param, fix others. - Simple but misses interactions

2. Morris Method: Measure elementary effects.

\[EE_i = \frac{f(x_1, ..., x_i + \Delta, ..., x_k) - f(x)}{\Delta}\]

3. Sobol Indices: Variance-based decomposition. - \(S_i\) = first-order (main effect) - \(S_{Ti}\) = total effect (including interactions)

4. fANOVA (for Optuna):

import optuna
from optuna.importance import FanovaImportanceEvaluator

study = optuna.create_study()
study.optimize(objective, n_trials=100)

importance = optuna.importance.get_param_importances(
    study, evaluator=FanovaImportanceEvaluator()
)

Output: {'lr': 0.45, 'batch_size': 0.30, 'layers': 0.15, 'dropout': 0.10}

Q: Multi-objective HPO — как балансировать accuracy и latency?

A:

Problem: Maximize accuracy, minimize latency — conflicting objectives.

Approaches:

1. Scalarization:

\[L = \alpha \cdot (1 - \text{accuracy}) + (1 - \alpha) \cdot \frac{\text{latency}}{\text{max\_latency}}\]

2. Pareto Front: Find set of solutions where no objective can improve without worsening another.

Optuna multi-objective:

def objective(trial):
    lr = trial.suggest_float('lr', 1e-4, 1e-1, log=True)

    accuracy, latency = train_and_profile(model)

    return accuracy, latency  # maximize accuracy, minimize latency

study = optuna.create_study(directions=['maximize', 'minimize'])
study.optimize(objective, n_trials=100)

# Get Pareto front
pareto_trials = study.best_trials

Decision: Choose from Pareto front based on business constraints.


Active Learning

Q: Что такое Active Learning?

A:

Definition: ML paradigm where algorithm strategically selects most informative samples for labeling, reducing annotation cost.

Key insight: Not all samples equally valuable — some provide more information than others.

Active learning loop: 1. Start with small labeled set \(L\), large unlabeled pool \(U\) 2. Train model on \(L\) 3. Query oracle for labels of most informative samples from \(U\) 4. Add newly labeled samples to \(L\) 5. Repeat until budget exhausted or target accuracy reached

Goal: Achieve target accuracy with minimum labeling cost.
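The loop above as a minimal pool-based sketch with least-confidence sampling; a cheap sklearn model plays the learner, and the "oracle" simply reads the known label `y[query]`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Pool-based active learning loop (uncertainty sampling)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
labeled = list(range(10))                              # small initial set L
pool = [i for i in range(len(X)) if i not in labeled]  # unlabeled pool U

model = LogisticRegression(max_iter=1000)
for _ in range(20):                        # labeling budget
    model.fit(X[labeled], y[labeled])
    proba = model.predict_proba(X[pool])
    uncertainty = 1 - proba.max(axis=1)    # least-confidence score
    query = pool[int(np.argmax(uncertainty))]
    labeled.append(query)                  # "oracle" provides y[query]
    pool.remove(query)
```

In practice the oracle is a human annotator, and queries are batched rather than requested one at a time.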

Q: Query strategies — Uncertainty Sampling?

A:

Core idea: Query samples where model is most uncertain.

Metrics:

1. Least Confidence:

\[x^* = \arg\max_x (1 - P(\hat{y}|x))\]

Query samples with lowest max probability.

import numpy as np

def least_confidence(probas):
    # probas: (n_samples, n_classes)
    max_proba = probas.max(axis=1)
    return np.argmax(1 - max_proba)  # Most uncertain

2. Margin Sampling:

\[x^* = \arg\min_x (P(\hat{y}_1|x) - P(\hat{y}_2|x))\]

Query samples where top two classes are closest.

def margin_sampling(probas):
    # Sort probabilities
    sorted_probas = np.sort(probas, axis=1)[:, ::-1]
    margins = sorted_probas[:, 0] - sorted_probas[:, 1]
    return np.argmin(margins)  # Smallest margin

3. Entropy:

\[x^* = \arg\max_x \left(-\sum_c P(y_c|x) \log P(y_c|x)\right)\]

Query samples with highest prediction entropy.

def entropy_sampling(probas):
    # Entropy: -sum(p * log(p))
    eps = 1e-10
    entropy = -np.sum(probas * np.log(probas + eps), axis=1)
    return np.argmax(entropy)  # Highest entropy

Comparison:

| Strategy | Best for | Limitation |
|----------|----------|------------|
| Least Confidence | Binary classification | Ignores class distribution |
| Margin | Multi-class | Only considers top 2 |
| Entropy | Multi-class | Computationally heavier |

Q: Query-by-Committee (QBC)?

A:

Idea: Train multiple models (committee), query samples with highest disagreement.

Disagreement measures:

1. Vote Entropy:

\[x^* = \arg\max_x \left(-\sum_c \frac{V_c}{C} \log \frac{V_c}{C}\right)\]

Where \(V_c\) = votes for class \(c\), \(C\) = committee size.

2. Kullback-Leibler Divergence:

\[x^* = \arg\max_x \frac{1}{C} \sum_{c=1}^{C} D_{KL}\left(P(y|x;\theta_c) \,\|\, P(y|x)\right)\]

Implementation:

import numpy as np

class QueryByCommittee:
    def __init__(self, n_models=5):
        self.models = [create_model() for _ in range(n_models)]

    def fit(self, X, y):
        for model in self.models:
            # Bootstrap sample for diversity
            idx = np.random.choice(len(X), len(X), replace=True)
            model.fit(X[idx], y[idx])

    def query(self, X_pool, n_samples=1):
        # Collect predictions
        predictions = np.array([
            model.predict_proba(X_pool) for model in self.models
        ])  # (n_models, n_samples, n_classes)

        # Vote entropy
        votes = np.argmax(predictions, axis=2)  # (n_models, n_samples)
        vote_counts = np.apply_along_axis(
            lambda x: np.bincount(x, minlength=predictions.shape[2]),
            axis=0, arr=votes
        )  # (n_classes, n_samples)
        vote_probas = vote_counts / len(self.models)
        entropy = -np.sum(vote_probas * np.log(vote_probas + 1e-10), axis=0)

        return np.argsort(entropy)[-n_samples:]

Q: Expected Model Change?

A:

Idea: Query samples that would cause largest change in model if labeled.

Expected Gradient Length (EGL):

\[x^* = \arg\max_x \mathbb{E}_{y \sim P(y|x)} \|\nabla L(x, y)\|\]

Intuition: If gradient would be large regardless of label, sample is informative.

import torch

def expected_gradient_length(model, x, possible_labels):
    total_grad_norm = 0

    for y in possible_labels:
        # Compute loss gradient for this label
        loss = compute_loss(model, x, y)
        grads = torch.autograd.grad(loss, model.parameters())
        grad_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))

        # Weight by probability of this label
        prob = model.predict_proba(x)[y]
        total_grad_norm += prob * grad_norm

    return total_grad_norm

Pros: Theoretically motivated, considers impact on model Cons: Computationally expensive (requires gradients for each candidate)

Q: Diversity-based sampling?

A:

Problem: Uncertainty sampling may select redundant samples.

Solution: Balance uncertainty with diversity.

Core-set selection:

\[\min_{S \subseteq U} \max_{x \in U} \min_{s \in S} d(x, s)\]

Find subset \(S\) that covers unlabeled pool well.

Coreset via k-Center:

import numpy as np
from scipy.spatial.distance import cdist

def k_center_selection(X_pool, n_samples, already_labeled=None):
    """Greedy k-center for diverse selection."""
    selected = []

    if already_labeled is not None:
        # Start with distances to already labeled
        dist_matrix = cdist(X_pool, already_labeled)
        min_distances = dist_matrix.min(axis=1)
    else:
        min_distances = np.full(len(X_pool), np.inf)
        # Start with random point
        selected.append(np.random.randint(len(X_pool)))
        min_distances[selected[0]] = 0

    while len(selected) < n_samples:
        # Find point furthest from any selected
        next_idx = np.argmax(min_distances)
        selected.append(next_idx)

        # Update distances
        new_dists = cdist(X_pool, X_pool[selected[-1:]])
        min_distances = np.minimum(min_distances, new_dists.flatten())

    return selected

BADGE (Batch Active Learning by Diverse Gradient Embeddings): - Combine uncertainty + diversity - Embed samples using gradient embeddings - k-means++ selection in embedding space

Q: Когда Active Learning НЕ эффективен?

A:

Failure cases:

  1. Very small initial labeled set:
     - Model too weak to identify informative samples
     - Random sampling may be better initially

  2. Highly imbalanced data:
     - May oversample minority class unnecessarily
     - Or ignore rare but important samples

  3. Clustered data structure:
     - May miss entire clusters if initial samples don't cover them
     - Solution: Combine with diversity sampling

  4. Noisy labels:
     - Querying uncertain samples may amplify noise
     - Solution: Label smoothing, robust loss

  5. Budget too small:
     - Active learning overhead > benefit
     - Random sampling competitive for <100 samples

Rule of thumb: Active learning shines when: - Labeling cost >> computation cost - 100+ queries budget - Model has reasonable base accuracy (>50%)

Q: Active Learning в production — best practices?

A:

Implementation checklist:

  1. Start with diversity:

    # First batch: stratified or k-center
    if len(labeled) < initial_size:
        return diversity_sampling(X_unlabeled, batch_size)
    else:
        return uncertainty_sampling(model, X_unlabeled, batch_size)

  2. Combine strategies:

    # 70% uncertainty + 30% diversity
    n_uncertain = int(0.7 * batch_size)
    n_diverse = batch_size - n_uncertain

    uncertain = uncertainty_sampling(model, X_pool, n_uncertain)
    remaining_pool = np.setdiff1d(np.arange(len(X_pool)), uncertain)
    diverse = diversity_sampling(X_pool[remaining_pool], n_diverse)

    return np.concatenate([uncertain, diverse])

  3. Cold start handling:
     - First 50-100 samples: random or stratified
     - After model shows promise: switch to active learning

  4. Human-in-the-loop:
     - Show model confidence to annotator
     - Allow annotator to flag "don't know" or "bad sample"
     - Track annotator agreement

  5. Stopping criteria:
     - Model accuracy plateaus
     - Budget exhausted
     - Remaining samples all low uncertainty

Tools: - modAL: Python active learning framework - Label Studio: Annotation with active learning plugin - SuperAnnotate: Computer vision active learning - Prodigy: NLP active learning


Time Series: Deep Learning Methods

Q: DeepAR — как работает?

A:

Architecture: Autoregressive RNN с probabilistic output.

Key features: 1. Global model: Learns from multiple related time series 2. Autoregressive: Uses past values as input 3. Probabilistic: Outputs distribution (Gaussian with mean + std) 4. Covariates: Can include time-dependent and static features

Training:

\[p(y_{t:T} \mid y_{1:t}, x_{1:T}) = \prod_{t'=t}^{T} p(y_{t'} \mid y_{1:t'-1}, x_{1:T}, \theta)\]

Inference: Sample from predicted distribution → prediction intervals.

# DeepAR prediction (conceptual)
def predict_deepar(model, context, num_samples=100):
    samples = []
    for _ in range(num_samples):
        # Sample from predicted distribution at each step
        pred_dist = model(context)  # Gaussian(mean, std)
        sample = pred_dist.sample()
        samples.append(sample)
    return {
        'mean': np.mean(samples, axis=0),
        'std': np.std(samples, axis=0),
        'quantiles': np.quantile(samples, [0.1, 0.5, 0.9], axis=0)
    }

Advantages over ARIMA: - Handles multiple related series (learns globally) - Works with covariates - Produces probabilistic forecasts - Can handle cold start with item features

Q: Temporal Fusion Transformer (TFT)?

A:

Architecture: 1. Variable Selection Network: Learns which features are important 2. Static Covariate Encoder: Processes time-invariant features 3. Gated Residual Network (GRN): Non-linear processing with skip connections 4. Multi-head Attention: Learns temporal dependencies + interpretability 5. Quantile Regression: Predicts multiple quantiles for intervals

Three input types: - Static: Product category, store location - Known future: Holidays, promotions (available at prediction time) - Historical: Past sales, weather (only available from past)

Key innovation — Interpretability: - Variable importance: Which features matter - Attention weights: Which past time steps matter - Seasonal patterns: Via attention visualization

# TFT attention interpretation
attention_weights = model.get_attention_weights(x)  # (batch, heads, seq_len)
# Identify which past steps influence predictions
important_steps = attention_weights.mean(dim=(0, 1)).argsort(descending=True)[:5]

When to use TFT: - Multiple known future covariates - Need interpretability - Complex temporal patterns - Long-range dependencies

Q: Prophet vs ARIMA vs Deep Learning?

A:

| Method | Strengths | Weaknesses | Best For |
|--------|-----------|------------|----------|
| ARIMA | Interpretable, well-understood | Single series, manual tuning | Clean univariate |
| Prophet | Multiple seasonalities, holidays | Less accurate, limited covariate support | Business forecasting |
| DeepAR | Global learning, covariates | Needs many series | Related series |
| TFT | Interpretability, all covariates | Complex, needs data | Complex systems |
| N-BEATS | Pure DL, no features | Black box | Pure DL forecasting |

Prophet model:

\[y(t) = g(t) + s(t) + h(t) + \varepsilon_t\]

Where: - \(g(t)\) = trend (piecewise linear or logistic) - \(s(t)\) = seasonality (Fourier series) - \(h(t)\) = holiday effects

from prophet import Prophet

model = Prophet(
    yearly_seasonality=True,
    weekly_seasonality=True,
    daily_seasonality=False
)
model.add_country_holidays(country_name='US')
model.fit(df)  # df with 'ds' (date) and 'y' (value) columns
forecast = model.predict(future_df)

Q: Time Series Cross-Validation?

A:

Critical: Never use random split — temporal order must be preserved!

Rolling origin (expanding window):

Fold 1: Train [0:100],  Test [100:120]
Fold 2: Train [0:120],  Test [120:140]
Fold 3: Train [0:140],  Test [140:160]

Sliding window:

Fold 1: Train [0:100],  Test [100:120]
Fold 2: Train [20:120], Test [120:140]
Fold 3: Train [40:140], Test [140:160]

from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # Train and evaluate

Metrics: - MAPE: Mean Absolute Percentage Error = \(\frac{100\%}{n}\sum|\frac{y_i - \hat{y}_i}{y_i}|\) - MASE: Mean Absolute Scaled Error = \(\frac{MAE}{MAE_{naive}}\) - RMSE: Root Mean Squared Error - WMAPE: Weighted MAPE = \(\frac{\sum|y_i - \hat{y}_i|}{\sum|y_i|}\)
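The metrics above in plain numpy — a minimal sketch, where MASE scales by the m-step (seasonal-)naive forecast on the training series:

```python
import numpy as np

def mape(y, yhat):
    """Mean Absolute Percentage Error (undefined when y contains 0)."""
    return 100 * np.mean(np.abs((y - yhat) / y))

def wmape(y, yhat):
    """Weighted MAPE: robust to near-zero individual values."""
    return 100 * np.sum(np.abs(y - yhat)) / np.sum(np.abs(y))

def mase(y, yhat, y_train, m=1):
    """MASE: MAE scaled by the m-step naive forecast's MAE on train.
    Values < 1 mean 'better than naive'."""
    naive_mae = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return np.mean(np.abs(y - yhat)) / naive_mae

y_train = np.array([10.0, 12, 11, 13, 12])
y = np.array([14.0, 13])
yhat = np.array([13.0, 14])
# here forecast MAE = 1 and naive MAE = 1.5, so MASE = 2/3
```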

Q: N-BEATS architecture?

A:

Key idea: Stack of fully-connected blocks with forward and backward residuals.

Architecture: 1. Each block has two outputs: - Forward: Forecast (contribution to final prediction) - Backward: Backcast (explains input, removed for next block)

  2. Two configurations:
     - Generic: Learns any pattern
     - Interpretable: Separate trend + seasonality blocks

Formula:

\[\hat{y} = \sum_{b=1}^{B} \hat{y}_b, \quad x_{b+1} = x_b - \hat{x}_b\]

Where \(\hat{y}_b\) = forecast from block \(b\), \(\hat{x}_b\) = backcast from block \(b\).

Advantages: - Pure deep learning (no feature engineering) - Interpretable mode separates trend/seasonality - Competitive with M4 competition winner (Smyl's ES-RNN); outperformed other neural methods on M4 benchmarks


Explainable AI (XAI): SHAP & LIME

Q: Зачем нужен XAI в production?

A: 4 ключевые причины:

  1. Regulatory Compliance: EU AI Act (2024), GDPR right to explanation — high-risk AI системы обязаны объяснять решения
  2. Trust Building: 78% enterprise AI rejected due to lack of interpretability (2025)
  3. Debugging: XAI помогает найти почему модель ошибается
  4. Bias Detection: Выявление unfair patterns в predictions

Q: SHAP vs LIME — в чём разница?

A:

| Критерий | SHAP | LIME |
|----------|------|------|
| Теория | Game theory (Shapley values) | Local surrogate models |
| Гарантии | Consistency, Additivity, Efficiency | Local fidelity only |
| Скорость | TreeSHAP: ~65ms, KernelSHAP: ~450ms | ~85ms (tabular) |
| Stability | 95% | 82% |
| Memory | TreeSHAP: 78MB, KernelSHAP: 680MB | 92MB |
| Model-specific | TreeSHAP, DeepSHAP, LinearSHAP | Model-agnostic |

SHAP formula:

\[\phi_i(f, x) = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(|N|-|S|-1)!}{|N|!} \left[f(S \cup \{i\}) - f(S)\right]\]

LIME formula:

\[\xi(x) = \arg\min_{g \in G} \mathcal{L}(f, g, \pi_x) + \Omega(g)\]

Где \(\pi_x(z) = \exp(-D(x, z)^2 / \sigma^2)\) — kernel weighting.

Q: Когда использовать SHAP, а когда LIME?

A:

Выбирай SHAP когда: - Tree-based модели (Random Forest, XGBoost, LightGBM) — TreeSHAP exact + fast - Нужны global explanations (summary plots, dependence plots) - Regulated industries (finance, healthcare) — theoretical rigor важен - Comparing feature importance across instances

Выбирай LIME когда: - Novel architectures без SHAP implementation - Нужно quick explanation для single prediction - Stakeholders non-technical — local linear понятнее - Ограниченные compute resources

Best practice: Hybrid approach — SHAP для production monitoring, LIME для ad-hoc investigations.

Q: Как SHAP обеспечивает consistency?

A: 4 математических свойства (axioms):

  1. Efficiency: \(\sum_{i=1}^{M} \phi_i = f(x) - E[f(X)]\) — сумма SHAP values = deviation from baseline
  2. Symmetry: Если features вносят одинаковый вклад во все coalitions → равные SHAP values
  3. Dummy: Features которые не влияют на prediction → SHAP = 0
  4. Additivity: Для ensemble: SHAP_total = SHAP_model1 + SHAP_model2

Эти гарантии делают SHAP единственным методом удовлетворяющим всем desiderata одновременно.
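The Efficiency axiom can be verified by brute-force Shapley computation on a tiny model; in this sketch, features "absent" from a coalition are imputed with the background mean (an interventional-style baseline), and all model and data values are made up for illustration:

```python
import itertools
import math
import numpy as np

# Brute-force Shapley values for a small linear model, checking
# Efficiency: sum(phi_i) = f(x) - f(baseline)
rng = np.random.default_rng(0)
X_bg = rng.normal(size=(100, 3))        # background dataset
w = np.array([2.0, -1.0, 0.5])
f = lambda z: z @ w                     # the model to explain
x = np.array([1.0, 2.0, 3.0])           # instance to explain
baseline = X_bg.mean(axis=0)

def f_coalition(S):
    """Model output when only features in S take their true values."""
    z = baseline.copy()
    z[list(S)] = x[list(S)]
    return f(z)

n = len(x)
phi = np.zeros(n)
for i in range(n):
    others = [j for j in range(n) if j != i]
    for r in range(n):
        for S in itertools.combinations(others, r):
            # Shapley weight |S|! (n - |S| - 1)! / n!
            weight = (math.factorial(len(S)) * math.factorial(n - len(S) - 1)
                      / math.factorial(n))
            phi[i] += weight * (f_coalition(S + (i,)) - f_coalition(S))

assert np.isclose(phi.sum(), f(x) - f(baseline))  # Efficiency holds
```

For a linear model this recovers the closed form \(\phi_i = w_i (x_i - \bar{x}_i)\), which is what LinearSHAP computes directly.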

Q: LIME instability — как решить?

A: Problem: Small input changes → significantly different explanations (18% variance).

Solutions:

  1. Multiple runs + average:

    from lime.lime_tabular import LimeTabularExplainer

    explanations = []
    for seed in range(5):
        # the seed is set on the explainer, not on explain_instance
        explainer = LimeTabularExplainer(X_train, random_state=seed)
        exp = explainer.explain_instance(x, predict_fn)
        explanations.append(exp.as_map())
    stable_explanation = average_explanations(explanations)
    

  2. Increase num_samples: Default 5000, increase to 15000+ for stability

  3. Cross-validation on explanations: Run LIME multiple times, check variance
  4. Use SHAP instead: 95% stability vs 82% for LIME

Q: Как интерпретировать SHAP values?

A:

For single prediction: - \(\phi_i > 0\) → feature i pushes prediction UP - \(\phi_i < 0\) → feature i pushes prediction DOWN - \(|\phi_i|\) = magnitude of contribution

Example (Credit Approval):

Income:     +0.35  (pushes toward approval)
CreditScore: +0.30  (pushes toward approval)
Debt:       -0.22  (pushes toward rejection)
Age:        +0.08  (small positive)
Base value: 0.50 (average approval rate)
Final:      0.50 + 0.35 + 0.30 - 0.22 + 0.08 = 1.01 → APPROVED

Global interpretation: - Summary plot: Feature importance ranking across all instances - Dependence plot: How feature value affects SHAP value - Interaction plot: Feature interactions

Q: SHAP для deep learning — какие подходы?

A:

  1. DeepSHAP: Combines SHAP with DeepLIFT backpropagation
     - Fast for neural networks
     - Uses gradient × input decomposition

  2. GradientSHAP: Integrates gradients with SHAP
     - Works for any differentiable model
     - More expensive but theoretically sound

  3. PartitionSHAP: For hierarchical models (Transformers)
     - Handles attention layers properly

import shap

# DeepSHAP for PyTorch
explainer = shap.DeepExplainer(model, background_data)
shap_values = explainer.shap_values(test_data)

# GradientSHAP
explainer = shap.GradientExplainer(model, background_data)
shap_values = explainer.shap_values(test_data)

Q: Production XAI pipeline — как построить?

A:

Architecture:

Client Request → API Gateway → XAI Engine → Explanation Cache → Response
                                  ↕
                  Model Registry + Monitoring Service

Key components:

  1. Precomputation: Cache common explanations during training
  2. Adaptive sampling: Early stopping when explanation stabilizes
  3. Redis cache: Store precomputed SHAP values
  4. Fallback: LIME for cold-start, SHAP for cached

Performance optimizations: - Caching: Reduce latency from 2.1s → 120ms - Batch explanations: Compute SHAP for multiple instances together - TreeSHAP for tree models: 10x faster than KernelSHAP

Q: Common failure modes XAI?

A:

  1. Correlated features: SHAP/LIME underestimate importance when features are highly correlated
     - Solution: Group correlated features, use conditional expectations

  2. Out-of-distribution: Explanations unreliable for OOD samples
     - Error can exceed 40% for far-from-training instances
     - Solution: Flag OOD samples, don't trust explanations blindly

  3. Feature interactions: Linear explanations miss non-linear interactions
     - Solution: Use SHAP interaction values (expensive: O(n²))

  4. Baseline dependency: Results sensitive to background dataset
     - Solution: Use representative background, document choice

Neural Architecture Search (NAS)

Q: Что такое NAS и зачем он нужен?

A: NAS — автоматический поиск оптимальной архитектуры нейросети.

3 компонента: 1. Search Space: Какие операции/связи допустимы (conv, pooling, attention) 2. Search Strategy: Как исследовать пространство (RL, EA, gradient-based) 3. Performance Estimation: Как быстро оценить candidate (proxy tasks, weight sharing)

Зачем: - Ручной дизайн требует 120,000+ GPU hours/month (Tesla, 2025) - NAS находит architectures которые люди не придумают - Hardware-aware NAS оптимизирует под конкретное устройство

Q: Какие search strategies в NAS?

A:

| Strategy | Как работает | Pros | Cons |
|----------|--------------|------|------|
| RL | RNN controller генерирует architectures, reward = accuracy | Осваивает сложные spaces | 1800 GPU-days (NASNet) |
| Evolutionary | Population, mutation, crossover, selection | Простой, parallelizable | Expensive evaluation |
| DARTS | Continuous relaxation, gradient descent on architecture params | 1 GPU-day | Discretization gap |
| Bayesian | Gaussian process models performance | Sample-efficient | Struggles with high-dim |
| Random | Uniform sampling | Baseline, simple | Slow for large spaces |
| One-Shot | Train supernet once, sample subnets | Fast evaluation | Weight sharing bias |

DARTS key insight:

\[\bar{o}(x) = \sum_{o \in \mathcal{O}} \frac{\exp(\alpha_o)}{\sum_{o'} \exp(\alpha_{o'})}\, o(x)\]

Где \(\alpha\) — learnable architecture parameters. После training: argmax \(\alpha\) → discrete architecture.
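The mixed-operation relaxation can be sketched in numpy; the 1-D "operations" below are toy stand-ins for conv/pool candidates:

```python
import numpy as np

# DARTS-style mixed operation: softmax over architecture logits alpha
# weights the candidate ops; after search, the argmax op is kept.
ops = {
    "identity": lambda x: x,
    "double":   lambda x: 2 * x,
    "zero":     lambda x: 0 * x,
}

def mixed_op(x, alpha):
    weights = np.exp(alpha) / np.exp(alpha).sum()     # softmax(alpha)
    return sum(w * op(x) for w, op in zip(weights, ops.values()))

alpha = np.array([0.1, 2.0, -1.0])    # learned architecture logits
out = mixed_op(np.array([1.0, -1.0]), alpha)

# Discretization step: keep only the strongest operation
best_op = list(ops)[int(np.argmax(alpha))]  # "double"
```

The discretization gap mentioned above is exactly the difference between the soft mixture `mixed_op` used during search and the single `best_op` kept afterwards.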

Q: Что такое cell-based search space?

A: Instead of searching whole network, search small reusable cell.

NASNet cells: - Normal cell: Same spatial resolution - Reduction cell: Halves resolution (stride-2)

Cell = DAG: - Nodes = operations (3x3 conv, 5x5 conv, pooling) - Edges = connections - Stacked N times → full network

Advantages: - Transferable (CIFAR → ImageNet) - Smaller search space - Faster search

Limitations: - Low variance among found architectures - Constrained expressiveness

Q: Hardware-Aware NAS — как работает?

A: Incorporate hardware constraints into search.

Metrics added to objective: - Latency: \(\mathcal{L} = \text{Accuracy} - \lambda \cdot \text{Latency}\) - Memory: Peak memory usage - Energy: FLOPs × power per operation

Approaches:

  1. ProxylessNAS: Learn to prune paths, measure on target device
  2. MnasNet: Multi-objective optimization (accuracy + latency)
  3. Once-for-All: Train supernet, specialize for different devices

Example (Mobile optimization):

# Loss with latency constraint
loss = ce_loss + lam * (latency / target_latency - 1) ** 2  # lam = trade-off weight

Q: One-Shot NAS — в чём идея?

A: Train one supernet containing all architectures, evaluate by sampling.

Once-for-All Network (OFA): 1. Train supernet supporting all configurations 2. At inference: sample subnet with desired constraints 3. No retraining needed

Weight sharing benefits: - 10,000x faster than training from scratch - Single training → multiple deployment targets

Challenge: Weight sharing bias — shared weights may not reflect standalone performance.

Solutions: - Progressive shrinking (OFA): Train large, gradually add smaller configs - Sandwich rule: Train min, max, random each step

Q: Когда NAS НЕ стоит использовать?

A:

Не используйте NAS когда: 1. Small scale (<7B params): Overhead не окупается 2. Single-domain task: Нет benefit от specialization 3. Latency-critical: Search overhead too high 4. Limited compute: Search может занять недели 5. Strong baseline exists: ResNet/EfficientNet достаточно

Rule of thumb: NAS оправдан когда: - Unique hardware constraints (edge, mobile) - Novel task без established architectures - Budget ≥ 100 GPU-days for search - Expected significant efficiency gains

Q: EfficientNet — как NAS помог?

A: EfficientNet = NAS + Compound Scaling.

Шаг 1: NAS (baseline) - Found EfficientNet-B0 via MnasNet - Optimized for accuracy + latency

Шаг 2: Compound Scaling - Scale all dimensions together:

\[\text{depth} = \alpha^\phi, \quad \text{width} = \beta^\phi, \quad \text{resolution} = \gamma^\phi\]

  • Constraint: \(\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2\) (FLOPs растут примерно как \(2^\phi\))
  • \(\phi\) = user-specified coefficient (B0→B7)

Result: 8.4x smaller + 6.1x faster than GPipe while achieving similar accuracy.
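Арифметику compound scaling можно проверить коротким наброском (α, β, γ — коэффициенты из статьи EfficientNet; базовые depth/width/resolution здесь условные):

```python
alpha, beta, gamma = 1.2, 1.1, 1.15   # найдены grid search'ем в оригинальной работе

def scale(phi, base_depth=16, base_width=32, base_res=224):
    """Масштабирует все три измерения одним коэффициентом phi."""
    return (round(base_depth * alpha ** phi),
            round(base_width * beta ** phi),
            round(base_res * gamma ** phi))

# FLOPs растут примерно как (alpha * beta^2 * gamma^2)^phi ≈ 2^phi
print(round(alpha * beta ** 2 * gamma ** 2, 2))  # 1.92
print(scale(phi=3))                              # (28, 43, 341)
```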



Cost-Sensitive Learning

Источники: CodeGenes: Cost-Sensitive Learning in PyTorch (2025), LinkedIn: Class Weights & Cost-Sensitive Learning (2025), Elkan (2001)

Q: Что такое Cost-Sensitive Learning?

A:

Definition: ML technique where different misclassification errors have different costs.

Example: Medical diagnosis - FN (sick → healthy): Missing cancer = very costly - FP (healthy → sick): Unnecessary tests = less costly

Cost matrix: $\(C = \begin{bmatrix} 0 & c_{01} \\ c_{10} & 0 \end{bmatrix}\)$

Where \(C_{ij}\) = cost of predicting class \(j\) when true class is \(i\).
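Из cost matrix вытекает Bayes-optimal правило: предсказывать класс с минимальной ожидаемой стоимостью \(\text{argmin}_j \sum_i P(i|x) C_{ij}\). Минимальный набросок (numpy; вероятности условные):

```python
import numpy as np

def cost_sensitive_predict(proba, cost_matrix):
    """Выбирает класс с минимальной ожидаемой стоимостью."""
    expected_cost = proba @ cost_matrix   # (batch, n_classes)
    return expected_cost.argmin(axis=1)

C = np.array([[0, 1],
              [10, 0]])                   # FN в 10 раз дороже FP
proba = np.array([[0.95, 0.05],
                  [0.70, 0.30]])
print(cost_sensitive_predict(proba, C))   # [0 1] — второй пример "перещёлкивается" в positive
```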

Q: Как реализовать cost-sensitive learning в PyTorch?

A:

Method 1: Weighted Cross-Entropy

import torch
import torch.nn as nn

# Define class weights (higher for minority/costly class)
class_weights = torch.tensor([1.0, 10.0])  # [class 0, class 1]

criterion = nn.CrossEntropyLoss(weight=class_weights)
loss = criterion(predictions, targets)

Method 2: Custom Cost-Sensitive Loss

import torch
import torch.nn.functional as F

def cost_sensitive_loss(predictions, targets, cost_matrix):
    """
    predictions: (batch, n_classes) logits
    targets: (batch,) class indices
    cost_matrix: (n_classes, n_classes)
    """
    n_classes = cost_matrix.shape[0]
    one_hot = torch.eye(n_classes)[targets]  # (batch, n_classes)

    # Get cost for each sample's true class
    costs = one_hot @ cost_matrix  # (batch, n_classes)

    # Weighted log-likelihood
    log_probs = F.log_softmax(predictions, dim=1)
    loss = -torch.sum(costs * log_probs, dim=1).mean()

    return loss

# Example: FN costs 10x more than FP
cost_matrix = torch.tensor([
    [0, 1],    # True 0: cost of predicting 0, 1
    [10, 0]    # True 1: cost of predicting 0, 1
], dtype=torch.float32)

Method 3: Sample-wise weights

# Different weight for each sample
sample_weights = torch.tensor([1, 5, 1, 10, ...])

# Compute per-sample loss
losses = F.cross_entropy(predictions, targets, reduction='none')
weighted_loss = (losses * sample_weights).mean()

Q: Когда использовать cost-sensitive learning?

A:

| Scenario | Approach | Cost Matrix Example |
|---|---|---|
| Medical diagnosis | High FN cost | FN=10, FP=1 |
| Fraud detection | High FN cost | FN=100, FP=1 |
| Spam filter | High FP cost | FN=1, FP=10 |
| Loan approval | Asymmetric | Default=50, Rejection=1 |

Rule of thumb: - Set cost ratio = inverse of acceptable error ratio - If FN is 10x worse than FP → weight(class_1) = 10 * weight(class_0)
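Rule of thumb выше эквивалентен сдвигу порога классификации по Elkan (2001): предсказывать positive при \(P(y=1|x) \ge c_{FP}/(c_{FP}+c_{FN})\). Набросок:

```python
def cost_threshold(c_fp, c_fn):
    """Оптимальный порог по Elkan (2001) для бинарной классификации."""
    return c_fp / (c_fp + c_fn)

p_star = cost_threshold(c_fp=1, c_fn=10)
print(round(p_star, 3))                    # 0.091 — сильно ниже дефолтных 0.5

probs = [0.05, 0.2, 0.6]
print([int(p >= p_star) for p in probs])   # [0, 1, 1]
```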

Q: Cost-sensitive vs class imbalance — в чём разница?

A:

| Aspect | Class Imbalance | Cost-Sensitive |
|---|---|---|
| Focus | Sample frequency | Error cost |
| Solution | Resampling, class weights | Cost matrix, threshold adjustment |
| When to use | Minority class underrepresented | Errors have different costs |

They're related but not identical: - Class imbalance: 99% negative, 1% positive - Cost-sensitive: Missing a positive costs 100x more

Combined approach:

# Weighted loss for imbalanced + cost-sensitive
class_weights = compute_class_weight('balanced', classes, y_train)
# Adjust for costs
class_weights[1] *= 10  # Further upweight positive class

criterion = nn.CrossEntropyLoss(weight=torch.tensor(class_weights))

Q: Как оценить cost-sensitive model?

A:

1. Cost-weighted accuracy:

def cost_weighted_accuracy(y_true, y_pred, cost_matrix):
    total_cost = 0
    for t, p in zip(y_true, y_pred):
        total_cost += cost_matrix[t, p]
    return total_cost / len(y_true)

2. Expected cost: $\(\text{Expected Cost} = \sum_{i,j} C_{ij} \cdot P(\text{predict } j | \text{true } i) \cdot P(\text{true } i)\)$

3. Cost curves: Plot cost vs threshold for different operating points

4. Business metrics: Connect to actual business KPIs - Fraud: $ caught vs $ lost - Medical: Lives saved vs unnecessary procedures


16. Missing Data Handling

Basic

Q: Какие типы missing data существуют?

A: Rubin's Classification (1976):

| Type | Full Name | Definition | Example | Strategy |
|---|---|---|---|---|
| MCAR | Missing Completely At Random | P(missing) independent of all variables | Data entry error, random sensor failure | Deletion OK |
| MAR | Missing At Random | P(missing) depends on observed data | Men less likely to report depression | Imputation OK |
| MNAR | Missing Not At Random | P(missing) depends on missing value itself | High earners don't report salary | Model missingness |

Important: MCAR is the only case where deletion is unbiased. MAR/MNAR require imputation.
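Почему deletion безопасен только при MCAR, видно на игрушечной симуляции (все числа условные; зависимость пропуска от самого значения — строго говоря MNAR):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(50, 10, 100_000)       # истинное среднее = 50

mcar_mask = rng.random(x.size) < 0.3  # MCAR: пропуски не зависят ни от чего
# MNAR: чем больше значение, тем выше шанс пропуска
mnar_mask = rng.random(x.size) < np.clip((x - 40) / 40, 0, 1)

print(round(x[~mcar_mask].mean(), 1))  # ≈ 50 — deletion не смещает оценку
print(round(x[~mnar_mask].mean(), 1))  # заметно ниже 50 — deletion смещает
```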

Q: Когда drop vs impute missing values?

A:

Drop (listwise deletion) when: - MCAR mechanism confirmed - < 5% missing per column - Large dataset, small impact

Impute when: - MAR or MNAR mechanism - > 5% missing per column - Small dataset - Missingness is informative

Code check:

# Check if missingness is related to target
import missingno as msno
msno.matrix(df)  # Visualize patterns
msno.heatmap(df)  # Correlations in missingness

Medium

Q: Какие методы imputation существуют?

A:

| Method | Description | Best For | Bias Risk |
|---|---|---|---|
| Mean/Median | Replace with central tendency | Numerical, MCAR | Underestimates variance |
| Mode | Most frequent value | Categorical | Same as mean |
| Forward/Backward fill | Use adjacent values | Time series | Temporal leakage |
| KNN Imputer | k-nearest neighbors | Numerical patterns | Computationally expensive |
| MICE | Multiple Imputation by Chained Equations | Any | Gold standard for MAR |
| Iterative | Model-based (Bayesian Ridge) | Complex patterns | Assumes MAR |

MICE implementation:

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

imputer = IterativeImputer(max_iter=10, random_state=42)
X_imputed = imputer.fit_transform(X)

Q: Что такое Multiple Imputation и зачем нужна?

A: Single imputation problem: Imputed values are treated as certain → underestimates variance.

Multiple Imputation (MI) solution: 1. Create m datasets with different imputed values 2. Analyze each dataset separately 3. Pool results using Rubin's Rules

Rubin's Rules for pooling: $\(\bar{Q} = \frac{1}{m}\sum_{i=1}^{m} \hat{Q}_i\)$

(pooled estimate)

\[\bar{U} = \frac{1}{m}\sum_{i=1}^{m} U_i\]

(within-imputation variance)

\[B = \frac{1}{m-1}\sum_{i=1}^{m}(\hat{Q}_i - \bar{Q})^2\]

(between-imputation variance)

\[T = \bar{U} + (1 + \frac{1}{m})B\]

(total variance)

When to use: MAR mechanism, research/analysis context, need valid confidence intervals.
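Сам pooling по Rubin's Rules умещается в несколько строк (оценки \(\hat{Q}_i\) и дисперсии \(U_i\) здесь условные):

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pooling m оценок по Rubin's Rules: возвращает (Q_bar, total variance T)."""
    m = len(estimates)
    q_bar = np.mean(estimates)             # pooled estimate
    u_bar = np.mean(variances)             # within-imputation variance
    b = np.var(estimates, ddof=1)          # between-imputation variance
    t = u_bar + (1 + 1 / m) * b            # total variance
    return q_bar, t

q_hat = [2.1, 2.3, 1.9, 2.2, 2.0]          # коэффициент из m=5 imputed datasets
u = [0.04, 0.05, 0.04, 0.06, 0.05]         # его дисперсия в каждом из них
q_bar, t = pool_rubin(q_hat, u)
print(round(q_bar, 2), round(t, 3))        # 2.1 0.078
```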

Q: Как обрабатывать missing values в categorical features?

A:

Strategies: 1. New category: "Unknown" or "Missing" — simplest, preserves missingness info 2. Mode imputation: Most frequent — can distort distribution 3. Model-based: Predict category from other features 4. Weight of Evidence (WoE): For binary classification, encode as WoE value

# Strategy 1: New category
df['category'].fillna('Missing', inplace=True)

# Strategy 3: Model-based (using other features)
from sklearn.ensemble import RandomForestClassifier
mask = df['category'].isna()
if mask.sum() > 0:
    clf = RandomForestClassifier()
    clf.fit(df.loc[~mask, other_features], df.loc[~mask, 'category'])
    df.loc[mask, 'category'] = clf.predict(df.loc[mask, other_features])

Killer

Q: Спроектируйте missing data strategy для fraud detection pipeline.

A:

Analysis Phase:

# 1. Diagnose missingness mechanism
def diagnose_missingness(df, target_col):
    """Check if missingness predicts target"""
    df['is_missing'] = df[target_col].isna().astype(int)
    from scipy.stats import chi2_contingency
    for col in df.select_dtypes(include='object').columns:
        contingency = pd.crosstab(df['is_missing'], df[col])
        chi2, p, _, _ = chi2_contingency(contingency)
        if p < 0.05:
            print(f"Missingness of {target_col} related to {col}: p={p:.4f}")

Pipeline Architecture:

Raw Data → Missing Flag Creation → Imputation Model → Feature Engineering → Model
     ↓              ↓                      ↓
  [is_X_missing=1]  [Predicted value]  [Original + Flag + Imputed]

Implementation:

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.compose import ColumnTransformer

# Create missing indicators for important features
important_features = ['transaction_amount', 'user_age', 'device_score']
for col in important_features:
    df[f'{col}_missing'] = df[col].isna().astype(int)

# Different strategies for different columns
preprocessor = ColumnTransformer([
    ('num_knn', KNNImputer(n_neighbors=5), numerical_cols),
    ('cat_mode', SimpleImputer(strategy='most_frequent'), categorical_cols),
    ('cat_new', SimpleImputer(strategy='constant', fill_value='Unknown'), high_missing_cols)
])

pipeline = Pipeline([
    ('imputer', preprocessor),
    ('scaler', StandardScaler()),
    ('model', XGBClassifier())
])

Key decisions: - Flag missingness for high-value features (model can learn "missing = suspicious") - KNN for numerical with patterns - "Unknown" category for categorical with > 10% missing - Monitor: imputation quality, drift in missingness patterns


17. Model Debugging

Basic

Q: Что такое slice-based evaluation?

A: Slice-based evaluation — анализ model performance на подмножествах (slices) данных вместо одного aggregate metric.

Зачем: Aggregate metrics скрывают проблемы на underrepresented groups.

Slice types: - Demographic: gender, age, geography - Behavioral: new vs returning users, device type - Data-driven: high confidence vs low confidence, feature-based

# Slice-based evaluation
def evaluate_slices(model, X, y, slice_cols):
    results = {}
    for col in slice_cols:
        for value in X[col].unique():
            mask = X[col] == value
            if mask.sum() >= 50:  # Minimum samples
                results[f"{col}={value}"] = {
                    'accuracy': accuracy_score(y[mask], model.predict(X[mask])),
                    'count': mask.sum()
                }
    return results

Q: Как проводить error analysis для ML модели?

A: Systematic Error Analysis Process:

  1. Collect errors: All misclassified samples
  2. Categorize: By error type (FP, FN), feature values, prediction confidence
  3. Pattern hunt: What do errors have in common?
  4. Hypothesis: Why is model making these errors?
  5. Fix: More data, new features, different model

Code:

# Error analysis
errors = X_test[y_test != y_pred].copy()
errors['true'] = y_test[y_test != y_pred]
errors['pred'] = y_pred[y_test != y_pred]
errors['confidence'] = y_proba[y_test != y_pred].max(axis=1)

# Look for patterns
for col in X_test.columns:
    print(f"\n{col} distribution in errors vs all:")
    print(errors[col].value_counts(normalize=True).head())
    print(X_test[col].value_counts(normalize=True).head())

Medium

Q: Что такое data debugging и как его делать?

A: Data debugging — поиск проблем в данных, которые вызывают model issues.

Common data bugs: - Label noise: Incorrect labels in training data - Feature leakage: Target information in features - Distribution shift: Train/test different distributions - Outliers: Extreme values affecting model - Duplicates: Same samples causing overfitting

Debugging techniques:

# 1. Check label consistency
from cleanlab.classification import CleanLearning
clf = CleanLearning(clf=XGBClassifier())
clf.fit(X_train, y_train)
label_issues = clf.get_label_issues()  # Potentially mislabeled samples

# 2. Check for leakage
from sklearn.feature_selection import mutual_info_classif
mi = mutual_info_classif(X_train, y_train)
suspicious = [f for f, score in zip(X_train.columns, mi) if score > 0.8]  # Too predictive

# 3. Distribution check
from scipy.stats import ks_2samp
for col in X_train.columns:
    stat, p = ks_2samp(X_train[col], X_test[col])
    if p < 0.01:
        print(f"Distribution shift in {col}: p={p:.4f}")

Q: Как организовать regression testing для ML моделей?

A: ML Regression Testing — автоматическая проверка что новая модель не хуже старой на критических сценариях.

Test suite components: 1. Golden dataset: Curated examples representing key scenarios 2. Performance thresholds: Min acceptable metrics 3. Slice-specific checks: Must not degrade on important slices 4. Prediction stability: Similar inputs → similar outputs

class ModelRegressionTest:
    def __init__(self, baseline_model, golden_data, thresholds, slices=None):
        self.baseline = baseline_model
        self.golden_X, self.golden_y = golden_data
        self.thresholds = thresholds  # {'accuracy': 0.85, 'slice_degradation': 0.02}
        self.slices = slices or {}    # {'slice_name': boolean_mask} для slice-specific checks

    def test(self, new_model):
        # 1. Overall performance
        baseline_acc = accuracy_score(self.golden_y, self.baseline.predict(self.golden_X))
        new_acc = accuracy_score(self.golden_y, new_model.predict(self.golden_X))
        assert new_acc >= self.thresholds['accuracy'], f"Accuracy below threshold: {new_acc}"

        # 2. No significant regression
        assert new_acc >= baseline_acc - self.thresholds['slice_degradation'], \
               f"Regression from baseline: {baseline_acc} → {new_acc}"

        # 3. Slice-specific checks
        for slice_name, mask in self.slices.items():
            baseline_slice = accuracy_score(self.golden_y[mask], self.baseline.predict(self.golden_X[mask]))
            new_slice = accuracy_score(self.golden_y[mask], new_model.predict(self.golden_X[mask]))
            assert new_slice >= baseline_slice - 0.05, f"Regression on {slice_name}"

        return {"status": "PASSED", "baseline_acc": baseline_acc, "new_acc": new_acc}

Killer

Q: Спроектируйте model debugging workflow для production recommendation system.

A:

Architecture:

Production Logs → Error Collector → Pattern Analyzer → Alerting → Root Cause → Fix
      ↓               ↓                  ↓              ↓          ↓
  [predictions]    [misclassifies]    [slices]      [on-call]  [retrain]
  [features]       [low confidence]   [drifts]                 [features]
  [outcomes]       [edge cases]       [biases]

Implementation:

class ModelDebugger:
    def __init__(self, model, feature_store):
        self.model = model
        self.fs = feature_store
        self.error_buffer = []
        self.slice_metrics = defaultdict(list)

    def log_prediction(self, user_id, item_id, features, prediction, outcome=None):
        """Log every prediction for debugging."""
        record = {
            'timestamp': datetime.now(),
            'user_id': user_id,
            'item_id': item_id,
            'features': features,
            'prediction': prediction,
            'confidence': prediction.max(),
            'outcome': outcome  # Filled later if available
        }
        self.error_buffer.append(record)

    def analyze_errors(self):
        """Periodic error analysis."""
        # 1. Low confidence predictions
        low_conf = [r for r in self.error_buffer if r['confidence'] < 0.6]
        if len(low_conf) > 100:
            self.alert(f"High rate of low-confidence predictions: {len(low_conf)}")

        # 2. Slice-based analysis
        for slice_col in ['user_segment', 'item_category', 'device']:
            for slice_val in set(r['features'].get(slice_col) for r in self.error_buffer):
                slice_errors = [r for r in self.error_buffer
                               if r['features'].get(slice_col) == slice_val and r.get('outcome') == 'error']
                error_rate = len(slice_errors) / max(1, len([r for r in self.error_buffer
                                                            if r['features'].get(slice_col) == slice_val]))
                if error_rate > 0.1:
                    self.alert(f"High error rate on {slice_col}={slice_val}: {error_rate:.2%}")

        # 3. Feature drift
        recent_features = pd.DataFrame([r['features'] for r in self.error_buffer[-1000:]])
        baseline_features = self.fs.get_historical_features()
        for col in recent_features.columns:
            drift = self._compute_psi(recent_features[col], baseline_features[col])
            if drift > 0.2:
                self.alert(f"Feature drift detected in {col}: PSI={drift:.2f}")

    def _compute_psi(self, expected, actual, buckets=10):
        """Population Stability Index."""
        if len(actual.unique()) == 1:
            return 0
        breakpoints = np.arange(0, buckets + 1) / buckets * 100
        breakpoints = np.nanpercentile(actual, breakpoints)
        expected_percents = np.histogram(expected, breakpoints)[0] / len(expected)
        actual_percents = np.histogram(actual, breakpoints)[0] / len(actual)
        # eps в числителе и знаменателе защищает от пустых бинов
        psi_value = np.sum((actual_percents - expected_percents)
                           * np.log((actual_percents + 1e-4) / (expected_percents + 1e-4)))
        return psi_value

Key metrics to monitor: - Error rate by slice (user segment, item category) - Low confidence rate - Feature drift (PSI > 0.2) - Prediction distribution shift - Latency by model version


18. AutoML Theory

Basic

Q: Что такое AutoML и какие проблемы решает?

A: AutoML (Automated Machine Learning) автоматизирует полный ML pipeline: - Hyperparameter Optimization (HPO): Поиск оптимальных гиперпараметров - Neural Architecture Search (NAS): Автоматический поиск архитектуры - Feature Engineering: Автоматическое создание фичей - Model Selection: Выбор лучшего алгоритма - Ensembling: Автоматическое объединение моделей

Проблемы которые решает: - Эксперты тратят 60-80% времени на tuning - Человеческие ошибки и предвзятость - Непоследовательность между инженерами - Сложность для новичков

Medium

Q: Как работает Bayesian Optimization для HPO?

A: Bayesian Optimization — model-based подход к поиску гиперпараметров.

Формула оптимизации: $\(x^* = \text{argmax}_{x \in X} f(x)\)$

где \(x\) — конфигурация гиперпараметров, \(f(x)\) — performance metric.

Gaussian Process Prior: $\(f(x) \sim GP(\mu(x), k(x, x'))\)$

Acquisition Functions:

  1. Expected Improvement (EI): $\(EI(x) = E[\max(f(x) - f(x^*), 0)] = \int_{-\infty}^{\infty} \max(f - f^*, 0) p(f|x) df\)$

  2. Probability of Improvement (PI): $\(PI(x) = P(f(x) > f(x^*)) = \Phi\left(\frac{\mu(x) - f(x^*) - \xi}{\sigma(x)}\right)\)$

  3. Upper Confidence Bound (UCB): $\(UCB(x) = \mu(x) + \beta \sigma(x)\)$

Python implementation:

import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

class BayesianOptimizer:
    def __init__(self, param_bounds, n_initial=5):
        self.bounds = param_bounds  # {'lr': (0.0001, 0.1), 'batch_size': (16, 256)}
        self.n_initial = n_initial
        self.X_observed = []
        self.y_observed = []
        self.gp = GaussianProcessRegressor()

    def expected_improvement(self, X, xi=0.01):
        mu, sigma = self.gp.predict(X, return_std=True)
        sigma = np.maximum(sigma, 1e-9)  # avoid div by zero
        f_best = np.max(self.y_observed)

        with np.errstate(divide='warn'):
            imp = mu - f_best - xi
            Z = imp / sigma
            ei = imp * norm.cdf(Z) + sigma * norm.pdf(Z)
            ei[sigma == 0.0] = 0.0
        return ei

    def suggest_next(self, n_candidates=1000):
        if len(self.X_observed) < self.n_initial:
            return self._random_sample()

        self.gp.fit(np.array(self.X_observed), np.array(self.y_observed))
        candidates = self._generate_candidates(n_candidates)
        ei = self.expected_improvement(candidates)
        return candidates[np.argmax(ei)]

    def update(self, x, y):
        self.X_observed.append(x)
        self.y_observed.append(y)

Сравнение методов HPO:

| Method | Efficiency | Parallelizable | Best For |
|---|---|---|---|
| Grid Search | 45% | Yes (embarrassingly) | Small param spaces |
| Random Search | 65% | Yes | Baseline, early exploration |
| Bayesian (GP) | 95% | Limited (sequential) | Expensive evaluations |
| TPE | 90% | Limited | High-dimensional spaces |
| Multi-fidelity | 95%+ | Yes | Large datasets, deep learning |

Q: В чём разница между Grid Search, Random Search и Bayesian?

A:

Grid Search: - Перебирает все комбинации на сетке - Экспоненциальный рост: \(O(n^d)\) где \(d\) — число параметров - Неэффективен: многие комбинации бесполезны - Пример: 3 params × 10 values = 1000 trials

Random Search: - Случайная выборка из пространства - Лучшая эффективность при том же бюджете - Не учитывает предыдущие результаты - Формула: \(P(\text{top 5\%}) = 1 - (1 - 0.05)^n\)

Bayesian Optimization: - Строит surrogate model (GP) по результатам - Баланс exploration vs exploitation - Каждая новая точка информативна - Идеален для дорогих вычислений

Когда что использовать: - < 10 trials → Random Search - 10-100 trials → Bayesian (GP/TPE) - Cheap evaluation (seconds) → Grid/Random - Expensive (hours) → Bayesian + early stopping
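Формулу \(P(\text{top 5\%}) = 1 - 0.95^n\) для Random Search легко проверить численно:

```python
def p_top5(n):
    """Вероятность, что хотя бы один из n случайных trials попадёт в топ-5% конфигураций."""
    return 1 - (1 - 0.05) ** n

for n in (10, 60, 100):
    print(n, round(p_top5(n), 3))
# уже 60 случайных trials дают ~95% шанс попасть в топ-5%
```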

Killer

Q: Спроектируйте AutoML систему для команды из 50 DS.

A:

Requirements: 50 DS, 1000+ experiments/week, diverse workloads (tabular, CV, NLP).

Architecture:

graph TD
    subgraph CTRL["AutoML Controller"]
        HPO["HPO Engine<br/>Optuna/TPE"]
        NAS["NAS Engine<br/>DARTS"]
        FE["Feature Engine"]
        ENS["Ensemble Engine"]
        HPO --> SCHED
        NAS --> SCHED
        FE --> SCHED
        ENS --> SCHED
        SCHED["Trial Scheduler (Ray Tune)<br/>Resource allocation, ASHA, 100+ parallel"]
    end
    SCHED --> REG["Model Registry (MLflow)"]

    style HPO fill:#e8eaf6,stroke:#3f51b5
    style NAS fill:#e8eaf6,stroke:#3f51b5
    style FE fill:#e8eaf6,stroke:#3f51b5
    style ENS fill:#e8eaf6,stroke:#3f51b5
    style SCHED fill:#e8f5e9,stroke:#4caf50
    style REG fill:#fff3e0,stroke:#ef6c00

Key components:

  1. HPO Engine: Optuna с TPE sampler для high-dimensional spaces
  2. NAS: DARTS для CV, AutoML-Text для NLP
  3. Early Stopping: ASHA (Asynchronous Successive Halving)
  4. Multi-fidelity: Сначала на 10% данных, потом на 100%

Cost optimization: - Warm-starting: transfer learning между похожими задачами - Budget-aware: остановить если не превосходит baseline после N trials - Meta-learning: использовать историю команды для инициализации

Governance: - Auto-logging всех экспериментов - Comparison vs baseline required для promotion - Weekly AutoML reports: savings, best practices discovered

Q: Что такое Multi-Fidelity Optimization?

A: Multi-Fidelity использует дешевые аппроксимации для ускорения HPO.

Идея: Сначала evaluate на маленьком subset данных/short training, потом только лучшие на full fidelity.

Методы:

  1. Successive Halving (SH):

    # Start with N configs, train for r epochs
    # Keep top 1/η, train for r*η epochs
    # Repeat until 1 config at max epochs
    def successive_halving(configs, r_min=1, eta=3):
        n = len(configs)
        r = r_min
        while n > 1:
            results = [train(c, epochs=r) for c in configs]  # train() — внешняя функция
            n_keep = max(1, n // eta)
            configs = top_k(configs, results, k=n_keep)      # top_k — отбор лучших
            n, r = n_keep, r * eta
        return configs[0]
    

  2. ASHA (Asynchronous SHA):

  - Параллельная версия SH
  - Configs запускаются асинхронно
  - Promote когда достигли milestone

  3. Hyperband:

  - Комбинирует SH с разными budget allocations
  - Робастен к разным типам задач

Формула Hyperband: $\(s_{max} = \lfloor \log_\eta(R/r_{min}) \rfloor\)$

\[B = (s_{max} + 1) R\]
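Перечисление brackets по этим формулам (классический пример из статьи Hyperband: R=81, η=3):

```python
import math

def hyperband_brackets(R=81, r_min=1, eta=3):
    """Для каждого bracket s: стартовое число конфигураций n и стартовый budget r."""
    s_max = math.floor(math.log(R / r_min, eta) + 1e-9)  # eps против float-ошибок
    B = (s_max + 1) * R
    brackets = []
    for s in range(s_max, -1, -1):
        n = math.ceil(B / R * eta ** s / (s + 1))
        r = R / eta ** s
        brackets.append((s, n, r))
    return brackets

for s, n, r in hyperband_brackets():
    print(f"bracket s={s}: {n} configs, стартовый budget {r:g}")
```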

19. Federated Learning

Basic

Q: Что такое Federated Learning?

A: ML парадигма, где модель обучается на распределённых данных без перемещения данных на центральный сервер.

Ключевые принципы: - Data stays on device (privacy) - Only model updates are shared - Central server aggregates updates - Model improves collaboratively

Q: Как работает FedAvg (Federated Averaging)?

A:

FedAvg Algorithm (McMahan et al.): 1. Server initializes global model \(w^0\) 2. For each round \(t\): - Server sends \(w^t\) to selected clients \(S_t\) - Each client \(k\) trains locally: \(w_k^{t+1} = w^t - \eta \nabla L_k(w^t)\) - Clients send updates back - Server aggregates: \(w^{t+1} = \sum_{k \in S_t} \frac{n_k}{n} w_k^{t+1}\)

где \(n_k\) — количество samples у клиента \(k\), \(n = \sum n_k\)

Medium

Q: Какие проблемы FedAvg и как их решают?

A:

| Problem | Cause | Solution |
|---|---|---|
| Client Drift | Heterogeneous data | FedProx (proximal term) |
| Communication cost | Large model updates | Compression, sparse updates |
| Stragglers | Slow clients | Async aggregation |
| Non-IID data | Different distributions | Data sharing, clustering |

FedProx: $\(\min_w L_k(w) + \frac{\mu}{2}\|w - w^t\|^2\)$

Proximal term keeps local updates close to global model.
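Эффект proximal term виден на toy-квадратичном лоссе (numpy; все имена и числа условные):

```python
import numpy as np

def fedprox_grad(w, w_global, grad_loss, mu):
    """Градиент FedProx-объектива: grad L_k(w) + mu * (w - w_global)."""
    return grad_loss + mu * (w - w_global)

w_global = np.zeros(3)
w_k_opt = np.array([4.0, -2.0, 1.0])   # локальный оптимум клиента k
w = w_global.copy()

# toy-лосс L_k(w) = 0.5 * ||w - w_k_opt||^2  =>  grad = w - w_k_opt
for _ in range(200):                    # локальные шаги градиентного спуска
    g = fedprox_grad(w, w_global, grad_loss=(w - w_k_opt), mu=1.0)
    w -= 0.1 * g

print(np.round(w, 2))  # ≈ [2, -1, 0.5] — на полпути между глобальной моделью и локальным оптимумом
```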

Q: Local vs Global updates — в чём разница?

A:

Local updates (client-side): - Multiple SGD steps before sending to server - More computation, less communication - Formula: \(w_k \leftarrow w_k - \eta \sum_{i} \nabla \ell(x_i, y_i; w_k)\) for \(E\) epochs

Communication-efficiency trade-off: - More local epochs \(E\) → less communication, but more drift - Typical: \(E \in [1, 5]\) for stability

Q: Что такое Differential Privacy в Federated Learning?

A:

DP-FedAvg: Add noise to updates before sending to server $\(\tilde{g}_k = g_k + \mathcal{N}(0, \sigma^2 C^2)\)$

где \(C\) — clipping norm, \(\sigma\) — noise scale

Privacy guarantee: \((\epsilon, \delta)\)-DP - Lower \(\epsilon\) → stronger privacy, more noise - Trade-off: privacy vs accuracy
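Сам шаг sanitization (clip + noise) выглядит примерно так (numpy; имя `dp_sanitize` условное):

```python
import numpy as np

def dp_sanitize(update, clip_norm=1.0, sigma=0.5, rng=None):
    """Обрезает update по L2-норме до C, затем добавляет шум N(0, (sigma*C)^2)."""
    rng = rng or np.random.default_rng(0)
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / max(norm, 1e-12))
    return clipped + rng.normal(0.0, sigma * clip_norm, size=update.shape)

update = np.array([3.0, 4.0])     # норма 5 -> будет обрезан до clip_norm=1
noisy = dp_sanitize(update)
print(noisy.shape)                # (2,)
```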

Killer

Q: Спроектируйте FL систему для предсказания клавиатуры на мобильных устройствах.

A:

Architecture:

[User Devices] → [Secure Aggregation] → [FL Server] → [Global Model]
      ↑                                          ↓
      ←——————— Model Distribution ←———————

Key decisions:

  1. Model: LSTM/Transformer, ~5-10M params (must fit on device)
  2. Participation: Sample 100-1000 users per round from millions
  3. Local training: 1-5 epochs on user's typing data
  4. Aggregation: Weighted by data size \(n_k\)
  5. Privacy: DP-FedAvg with \(\epsilon \approx 8\)

Python (simplified):

def fedavg_round(server_model, client_updates, client_sizes):
    total_size = sum(client_sizes)
    new_weights = {}
    for name, param in server_model.named_parameters():
        # взвешенное среднее по клиентам: вес = доля данных клиента
        new_weights[name] = sum(
            client_sizes[i] / total_size * client_updates[i][name]
            for i in range(len(client_updates))
        )
    return new_weights

Challenges: - Device heterogeneity (battery, compute) - Non-IID data (different users, different vocab) - Concept drift (new slang, languages)

Q: FedAvg vs FedProx vs SCAFFOLD — когда что использовать?

A:

| Algorithm | Best For | Key Innovation |
|---|---|---|
| FedAvg | IID-ish data, stable clients | Baseline, simple |
| FedProx | Heterogeneous data | Proximal term reduces drift |
| SCAFFOLD | Highly non-IID | Control variates correct drift |

SCAFFOLD insight: Client drift = \(\nabla L_k(w) - \nabla L(w)\) - Maintains control variates \(c_k\) to estimate drift - Updates: \(w_k \leftarrow w_k - \eta(g_k - c_k + c)\) - Achieves 45% faster convergence on non-IID data (2025 benchmarks)


20. TabPFN — Foundation Model for Tabular Data

Basic

Q: Что такое TabPFN?

A: Tabular Prior-data Fitted Network — foundation model для tabular data, использующий in-context learning вместо gradient descent.

Ключевые характеристики: - Pre-trained на synthetic tabular datasets - Zero-shot prediction (no training on your data) - Transformer-based architecture - Outperforms XGBoost/LightGBM on small datasets (<10K samples)

Q: В чём разница TabPFN vs традиционные ML модели?

A:

| Aspect | Traditional (XGBoost) | TabPFN |
|---|---|---|
| Training | Gradient descent on data | Pre-trained, no training |
| Data requirement | More data = better | Small data specialist |
| Inference | Fast tree traversal | Forward pass through transformer |
| Hyperparameters | Many (lr, depth, etc.) | Minimal (none for basic use) |
| Max samples | Unlimited | 50K (TabPFN-2.5) |

Medium

Q: Как работает TabPFN?

A:

Pre-training Phase: 1. Generate synthetic tabular datasets from priors 2. Train transformer to predict labels given (X_train, y_train, x_test) 3. Model learns general tabular patterns

Inference (In-Context Learning):

from tabpfn import TabPFNClassifier

classifier = TabPFNClassifier()
classifier.fit(X_train, y_train)  # No actual training!
predictions = classifier.predict(X_test)

Architecture: - Input: Training set + test sample as sequence - Encoder: Feature embedding + positional encoding - Decoder: Transformer predicts label probabilities

Q: Какие ограничения у TabPFN?

A:

| Limitation | TabPFN v2 | TabPFN-2.5 |
|---|---|---|
| Max samples | 10,000 | 50,000 |
| Max features | 100 | 2,000 |
| Max classes | 10 | ~100 |
| GPU required | Yes | Yes |

Practical limitations: - Slow on large datasets (O(n²) attention) - Categorical features need preprocessing - No native support for missing values - Regression needs separate model

Q: Когда использовать TabPFN vs XGBoost?

A:

Use TabPFN when: - Dataset < 50K samples - Limited time for hyperparameter tuning - Quick baseline needed - Data is clean (no missing values)

Use XGBoost/LightGBM when: - Large datasets (>50K) - Need feature importance - Complex preprocessing needed - Production deployment (no GPU)

Benchmarks (2025 Nature paper): - TabPFN outperforms on 57% of datasets <10K samples - Average accuracy gain: +2.7% vs best competitor

Killer

Q: Как интегрировать TabPFN в production pipeline?

A:

Hybrid approach:

def smart_classifier(X_train, y_train, X_test):
    n_samples = len(X_train)

    if n_samples < 5000:
        # TabPFN for small data
        model = TabPFNClassifier()
        model.fit(X_train, y_train)
        return model.predict(X_test)
    elif n_samples < 50000:
        # Compare TabPFN vs XGBoost
        tabpfn_score = cross_val_score(TabPFNClassifier(), X_train, y_train).mean()
        xgb_score = cross_val_score(XGBClassifier(), X_train, y_train).mean()

        if tabpfn_score > xgb_score:
            return TabPFNClassifier().fit(X_train, y_train).predict(X_test)
        else:
            return XGBClassifier().fit(X_train, y_train).predict(X_test)
    else:
        # Large data: traditional methods
        return XGBClassifier().fit(X_train, y_train).predict(X_test)

Production considerations: - GPU required for inference - Batch inference for throughput - Fallback to XGBoost on timeout - Model versioning (TabPFN updates)

Q: Что нового в TabPFN-2.5?

A:

Key improvements (Nov 2025): - 20x increase in data cells (50K samples × 2K features) - Better handling of high-cardinality categorical features - Improved regression support - Faster inference (optimized attention)

When to upgrade: - Datasets near v2 limits - Need for more classes - Large feature sets


21. Production ML Deployment Patterns

Источники: MatterAI Deployment Strategies (Jan 2026), ML Journey Shadow vs Canary (Sept 2025), Raghu's Deployment Patterns, FICO Champion/Challenger (Dec 2025)

Basic

Q: Какие основные паттерны deployment для ML моделей?

A:

| Pattern | Описание | Risk Level | Use Case |
|---|---|---|---|
| Blue-Green | Two identical environments, instant switch | Low | Critical systems, zero downtime |
| Canary | Gradual traffic shift (1%→100%) | Medium | Risk mitigation with real users |
| Shadow | Parallel run, no user impact | None | Model validation, load testing |
| A/B Testing | Deterministic routing by user | Medium | Statistical comparison |
| Champion-Challenger | Continuous model competition | Low | Continuous improvement |

Q: What is Blue-Green deployment?

A: Maintain two identical production environments:

- Blue: current production version
- Green: new version being deployed

Process:

  1. Deploy new model to Green
  2. Run validation tests
  3. Switch traffic via load balancer: Blue (0%) → Green (100%)
  4. Blue becomes standby for instant rollback

Infrastructure (Kubernetes + Istio):

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: inference-service
spec:
  hosts:
    - inference-service
  http:
    - route:
        - destination:
            host: inference-service
            subset: blue
          weight: 0
        - destination:
            host: inference-service
            subset: green
          weight: 100

Medium

Q: How does Canary deployment work?

A: Gradual rollout with progressive traffic shifting:

Traffic Ramp:

1% → 5% → 10% → 25% → 50% → 100%

Kubernetes Argo Rollouts:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: model-inference
spec:
  replicas: 10
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: {duration: 10m}
        - setWeight: 20
        - pause: {duration: 10m}
        - setWeight: 50
        - pause: {duration: 10m}
      analysis:
        templates:
          - templateName: success-rate

Automated Gates (trigger rollback if breached):

- P95 latency < 200ms
- Error rate < 0.1%
- Prediction distribution drift (KL divergence < 0.1)
- Business metrics (conversion rate stable)
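A minimal sketch of the prediction-drift gate: bin the prediction scores, compute KL divergence between canary and baseline histograms, and compare against the 0.1 threshold. Function names, the [0, 1] score range, and the binning scheme are illustrative assumptions.

```python
import numpy as np

def kl_divergence(p_counts, q_counts, eps=1e-10):
    """KL(P || Q) between two histograms of prediction scores."""
    p = p_counts / p_counts.sum() + eps
    q = q_counts / q_counts.sum() + eps
    return float(np.sum(p * np.log(p / q)))

def canary_gate(baseline_scores, canary_scores, bins=10, threshold=0.1):
    """True if the canary's prediction distribution stays within the gate."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    p, _ = np.histogram(canary_scores, bins=edges)
    q, _ = np.histogram(baseline_scores, bins=edges)
    return kl_divergence(p.astype(float), q.astype(float)) < threshold
```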

Q: What is Shadow deployment and when should it be used?

A: The shadow model receives the same input data as production, but its predictions do NOT affect users.

Architecture:

[Request] → [Production Model] → [User Response]
         ↘ [Shadow Model] → [Log for Analysis]

Implementation:

from datetime import datetime

class ShadowDeployment:
    def __init__(self, production_model, shadow_model):
        self.prod = production_model
        self.shadow = shadow_model
        self.logger = PredictionLogger()

    async def predict(self, features):
        # Production prediction (returned to user)
        prod_pred = await self.prod.predict(features)

        # Shadow prediction (logged, not returned)
        shadow_pred = await self.shadow.predict(features)
        self.logger.log(
            features=features,
            prod_prediction=prod_pred,
            shadow_prediction=shadow_pred,
            timestamp=datetime.now()
        )

        return prod_pred  # Only production prediction

Use Cases:

- Validate new model on real traffic (risk-free)
- Compare prediction distributions
- Load testing new infrastructure
- Data drift detection
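Logged shadow predictions are typically compared offline. A minimal sketch of that analysis follows; the 95% agreement threshold and all names here are illustrative assumptions, not a standard API.

```python
import numpy as np

def analyze_shadow_logs(prod_preds, shadow_preds, agreement_threshold=0.95):
    """Compare logged production vs shadow predictions offline."""
    prod = np.asarray(prod_preds)
    shadow = np.asarray(shadow_preds)
    agreement = float(np.mean(prod == shadow))
    return {
        "agreement_rate": agreement,
        "n_disagreements": int(np.sum(prod != shadow)),
        "ready_for_canary": agreement >= agreement_threshold,
    }

report = analyze_shadow_logs([1, 0, 1, 1], [1, 0, 1, 0])
```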

Q: How does A/B Testing differ from Canary?

A:

| Aspect | Canary | A/B Testing |
|--------|--------|-------------|
| Traffic split | Random percentage | Deterministic (user ID hash) |
| Purpose | Risk mitigation | Statistical comparison |
| User consistency | May see different models | Same user sees same model |
| Duration | Until full rollout | Fixed experiment period |
| Analysis | Operational metrics | Business metrics + significance |

A/B User Segmentation:

import hashlib

def get_model_variant(user_id, variants=['v1', 'v2']):
    hash_value = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    index = hash_value % len(variants)
    return variants[index]

# Consistent routing: same user always sees same model
variant = get_model_variant("user_12345")

Killer

Q: Design a Champion-Challenger pipeline for a recommendation system.

A:

Architecture:

graph TD
    subgraph REG["Model Registry"]
        CHAMP["Champion v2.3<br/>87.2%"]
        CH1["Challenger 1<br/>v2.4-alpha, 86.8%"]
        CH2["Challenger 2<br/>v2.4-beta, 87.5%"]
    end

    CHAMP -->|"90%"| ROUTER["Traffic Router"]
    CH1 -->|"5% shadow"| ROUTER
    CH2 -->|"5%"| ROUTER

    ROUTER --> METRICS["Metrics Collector<br/>CTR, Conversion, Revenue, Latency"]
    METRICS --> DECISION{"Promotion Decision<br/>challenger > champion by 2%+<br/>for 7 days?"}
    DECISION -->|"Yes"| PROMOTE["Promote to Champion"]
    DECISION -->|"No"| KEEP["Keep current Champion"]

    style CHAMP fill:#e8f5e9,stroke:#4caf50
    style CH1 fill:#e8eaf6,stroke:#3f51b5
    style CH2 fill:#e8eaf6,stroke:#3f51b5
    style ROUTER fill:#fff3e0,stroke:#ef6c00
    style METRICS fill:#f3e5f5,stroke:#9c27b0
    style PROMOTE fill:#e8f5e9,stroke:#4caf50
    style KEEP fill:#fce4ec,stroke:#c62828

Implementation:

class ChampionChallengerPipeline:
    def __init__(self, registry, traffic_router, metrics):
        self.registry = registry
        self.router = traffic_router
        self.metrics = metrics
        self.promotion_threshold = 0.02  # 2% improvement
        self.min_observation_days = 7

    def get_model(self, user_id, context):
        champion = self.registry.get_champion()
        challengers = self.registry.get_challengers()

        # Route traffic
        assignment = self.router.assign(user_id)

        if assignment == 'champion':
            return champion
        else:
            # Shadow: return champion prediction but log challenger
            challenger = challengers[assignment]
            return self.shadow_predict(champion, challenger, context)

    async def evaluate_promotion(self):
        champion = self.registry.get_champion()
        challengers = self.registry.get_challengers()

        for challenger in challengers:
            if challenger.observation_days < self.min_observation_days:
                continue

            # Statistical significance test
            improvement = self.metrics.compare(
                challenger, champion, metric='conversion_rate'
            )

            if (improvement > self.promotion_threshold and
                self.metrics.is_significant(challenger, champion)):
                await self.promote(challenger)

    async def promote(self, new_champion):
        old_champion = self.registry.get_champion()
        self.registry.demote(old_champion)
        self.registry.promote(new_champion)
        self.router.update_weights(champion=1.0)

Promotion Criteria:

- Metric improvement > threshold (e.g., 2%)
- Statistical significance (p < 0.05)
- Minimum observation period
- No degradation on critical slices
- Stakeholder approval (for major changes)
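The `is_significant` check in the pipeline above is left abstract; for conversion rates a common concrete choice is a two-proportion z-test. A stdlib-only sketch (function name and the example counts are illustrative):

```python
import math

def conversion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test for champion (A) vs challenger (B).

    conv_* are conversion counts, n_* are user counts.
    Returns (z, two-sided p-value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# 1.0% vs 1.3% conversion on 100k users each
z, p = conversion_z_test(1000, 100_000, 1300, 100_000)
```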

Q: When should each deployment pattern be used?

A:

Decision Tree:

Is zero downtime required?
├── Yes → Blue-Green (critical systems: payments, auth)
└── No → Is risk tolerance low?
          ├── Yes → Shadow → Canary → Full
          └── No → Canary (fast iteration)

Need statistical comparison?
└── Yes → A/B Testing with significance analysis

Continuous improvement culture?
└── Yes → Champion-Challenger with automation

Pattern Combinations (Best Practice):

  1. Shadow + Canary: Shadow for 2 weeks → Canary 1%→100%
  2. Champion-Challenger + Shadow: Multiple challengers in shadow mode
  3. A/B + Canary: A/B test on canary traffic only

Cost Comparison:

| Pattern | Infra Cost | Rollback Speed | Real-User Validation |
|---------|------------|----------------|----------------------|
| Blue-Green | High (2x) | Instant | No |
| Canary | Medium (1.2x) | Fast | Yes |
| Shadow | Medium (1.5x) | N/A | No |
| A/B Testing | Medium | Fast | Yes |
| Champion-Challenger | Medium | Fast | Yes |

Q: How do you implement automated rollback for ML deployments?

A:

class AutomatedRollback:
    def __init__(self, monitoring, thresholds=None):
        # Default gates; callers can override per deployment
        self.thresholds = thresholds or {
            'p95_latency_ms': 200,
            'error_rate': 0.001,
            'prediction_drift_kl': 0.1,
            'conversion_rate_drop': 0.05,
        }
        self.monitoring = monitoring

    async def check_and_rollback(self, deployment):
        metrics = await self.monitoring.get_metrics(deployment)

        for metric, threshold in self.thresholds.items():
            current = metrics.get(metric, 0)

            if self._breaches_threshold(metric, current, threshold):
                await self.rollback(deployment)
                await self.alert(
                    f"Rollback triggered: {metric}={current}, threshold={threshold}"
                )
                return True

        return False

    def _breaches_threshold(self, metric, current, threshold):
        # All tracked metrics (latency, error rate, drift, metric drop)
        # are "higher is worse", so one comparison covers every case
        return current > threshold

    async def rollback(self, deployment):
        # Switch back to previous stable version
        await deployment.switch_to_previous()
        await deployment.scale_down_canary()

Rollback Triggers:

  1. Latency spike > 2x baseline
  2. Error rate > 0.1%
  3. Prediction distribution shift (PSI > 0.2)
  4. Business metric drop > 5%
  5. Manual trigger from on-call


22. Data Drift Detection

Sources: AllDays Tech Model Drift 2026, Label Your Data Drift Detection, Towards Data Science Drift (Jan 2026)

Basic

Q: What is data drift and why is it a problem?

A: Data drift is a change in the distribution of input data over time:

\[P_{t_0}(X) \neq P_t(X), \quad t > t_0\]

Типы Drift:

| Type | Definition | Example |
|------|------------|---------|
| Data Drift | Input distribution changes | New user demographics, seasonality |
| Concept Drift | P(y\|X) changes | Fraud patterns evolve, buying behavior shifts |
| Label Drift | P(y) changes | Class imbalance shifts, policy changes |

Formal decomposition: \(P(X, y) = P(X) \times P(y|X)\)

Q: Why is drift inevitable in production?

A:

  1. Real-world change: Seasonality, macro events, adversaries adapt
  2. Product change: New features, UI changes, pricing changes
  3. Pipeline change: Schema changes, logging changes, feature computation bugs

Medium

Q: What methods exist for detecting drift?

A:

| Method | Use Case | Formula/Approach |
|--------|----------|------------------|
| KS Test | Continuous features | \(D = \max_x \lvert F_1(x) - F_2(x) \rvert\) |
| Chi-Square | Categorical features | \(\chi^2 = \sum \frac{(O-E)^2}{E}\) |
| PSI | Score/bin distribution | \(\sum (Actual\% - Expected\%) \times \ln\frac{Actual\%}{Expected\%}\) |
| Wasserstein | Continuous, sensitive | Earth Mover's Distance |

PSI Thresholds:

- PSI < 0.1: No significant drift
- 0.1 ≤ PSI < 0.25: Moderate drift, monitor
- PSI ≥ 0.25: Significant drift, investigate
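For continuous features, the KS test from the table above is available directly as `scipy.stats.ks_2samp`. A sketch with synthetic data (the 0.05 significance level is the usual convention; the mean shift of 0.3 is an arbitrary example):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
reference = rng.normal(0, 1, 5000)     # feature values at training time
production = rng.normal(0.3, 1, 5000)  # shifted production values

stat, p_value = ks_2samp(reference, production)
drift_detected = p_value < 0.05
```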

Q: How do you implement PSI (Population Stability Index)?

A:

import numpy as np

def compute_psi(expected, actual, buckets=10):
    """Compute Population Stability Index."""
    breakpoints = np.percentile(expected, np.linspace(0, 100, buckets + 1))
    expected_counts, _ = np.histogram(expected, bins=breakpoints)
    actual_counts, _ = np.histogram(actual, bins=breakpoints)

    expected_pct = expected_counts / len(expected) + 1e-10
    actual_pct = actual_counts / len(actual) + 1e-10

    psi = np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct))
    return psi

Q: What is Adversarial Validation?

A: A method for measuring how different the train and production distributions are:

  1. Label train data as 0, production data as 1
  2. Train classifier to distinguish them
  3. If AUC ≈ 0.5 → distributions similar (good)
  4. If AUC > 0.7 → significant drift (problem)
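The recipe above can be sketched as follows; RandomForest and 3-fold CV are arbitrary choices, not a prescribed setup:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def adversarial_validation_auc(X_train, X_prod):
    """AUC of a classifier trained to distinguish train rows from production rows.
    AUC ~ 0.5 means the distributions look alike; AUC >> 0.5 means drift."""
    X = np.vstack([X_train, X_prod])
    y = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_prod))])
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    return cross_val_score(clf, X, y, cv=3, scoring='roc_auc').mean()
```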

Killer

Q: When is retraining necessary vs when is monitoring enough?

A:

| Drift Type | Performance Impact | Action |
|------------|--------------------|--------|
| Data drift only | No degradation | Monitor, no action |
| Data drift + perf drop | Model degrading | Investigate root cause |
| Concept drift | Always impacts | Retrain with recent data |
| Pipeline bug | Varies | Fix pipeline first |

Retrain Triggers:

- Business metric drop > 5%
- Model accuracy drops below threshold
- Multiple features showing drift simultaneously
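A hypothetical policy combining these triggers; every name and threshold here is illustrative, not a standard API:

```python
def should_retrain(feature_psi, accuracy_drop, business_metric_drop,
                   psi_threshold=0.25, min_drifted_features=3):
    """Decide whether to retrain based on the triggers listed above.

    feature_psi: dict of feature name -> PSI; drops are fractions (0.05 = 5%).
    Returns (decision, reasons)."""
    reasons = []
    if business_metric_drop > 0.05:
        reasons.append("business metric drop > 5%")
    if accuracy_drop > 0.03:
        reasons.append("model accuracy below threshold")
    n_drifted = sum(psi >= psi_threshold for psi in feature_psi.values())
    if n_drifted >= min_drifted_features:
        reasons.append(f"{n_drifted} features drifting simultaneously")
    return bool(reasons), reasons
```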


23. Hyperparameter Interactions & Learning Curves

Comprehensive guide to hyperparameter tuning strategies and training diagnostics.

Hyperparameter Tuning Strategies Comparison

| Aspect | Grid Search | Random Search | Bayesian Optimization |
|--------|-------------|---------------|-----------------------|
| Strategy | Exhaustive, all combinations | Random sampling | Probabilistic modeling |
| Efficiency | Exponential growth | Efficient for large spaces | Very efficient, fewer evaluations |
| Implementation | Easy (sklearn GridSearchCV) | Easy (sklearn RandomizedSearchCV) | Complex (Optuna, Hyperopt) |
| Best For | Small spaces (<10 params) | High-dimensional spaces | Expensive evaluations |
| Scalability | Limited | Good | Excellent |
| Exploration | Thorough but wasteful | Broad coverage | Smart exploration/exploitation |

Grid Search Details

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10]
}

# Total combinations: 3 * 4 * 3 = 36
grid_search = GridSearchCV(
    RandomForestClassifier(),
    param_grid,
    cv=5,
    scoring='f1',
    n_jobs=-1
)
grid_search.fit(X_train, y_train)

Pros: Comprehensive, simple, reproducible
Cons: \(O(n^k)\) complexity, wastes resources on unimportant dimensions

Random Search Details

from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from scipy.stats import loguniform, randint

param_distributions = {
    'n_estimators': randint(50, 500),
    'max_depth': randint(3, 20),
    'learning_rate': loguniform(1e-4, 1e-1),
    'min_samples_split': randint(2, 20)
}

random_search = RandomizedSearchCV(
    GradientBoostingClassifier(),
    param_distributions,
    n_iter=50,  # Number of random samples
    cv=5,
    scoring='f1',
    n_jobs=-1
)
random_search.fit(X_train, y_train)

Key Insight (Bergstra & Bengio 2012): Random search often finds better configs in fewer trials because:

- Not all hyperparameters are equally important
- Random sampling covers more distinct values per dimension

Bayesian Optimization

import optuna
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 500),
        'max_depth': trial.suggest_int('max_depth', 3, 20),
        'learning_rate': trial.suggest_float('learning_rate', 1e-4, 1e-1, log=True),
        'min_samples_split': trial.suggest_int('min_samples_split', 2, 20)
    }

    model = GradientBoostingClassifier(**params)
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='f1')
    return scores.mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)

How it works:

  1. Builds a probabilistic model (Gaussian Process) of the objective function
  2. Uses an acquisition function (EI, UCB) to select the next hyperparameters
  3. Balances exploration (new regions) vs exploitation (known good regions)


Learning Curves Interpretation

Learning curves plot training/validation error vs training set size or epochs.

Well-Fitted Model

Error
  │  Train ----___
  │              ---___
  │  Val   -------___
  │                  ---
  └───────────────────────── Size/Epochs
- Small gap between train and validation
- Both curves converge to low error
- Action: Model is ready

Overfitting Model

Error
  │  Train ----------
  │                   (approaches zero)
  │  Val   ----___
  │              ___/‾‾‾  (increases!)
  └───────────────────────── Size/Epochs
- Training error very low, validation error high
- Large gap between curves
- Actions: More data, regularization, simpler model, early stopping

Underfitting Model

Error
  │  Train ------ (high)
  │  Val   ------- (high, similar)
  └───────────────────────── Size/Epochs
- Both errors high and plateau
- Small gap but poor performance
- Actions: More complex model, more features, less regularization

Learning Curve Analysis Code

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

def plot_learning_curve(estimator, X, y, cv=5):
    train_sizes, train_scores, val_scores = learning_curve(
        estimator, X, y,
        train_sizes=np.linspace(0.1, 1.0, 10),
        cv=cv,
        scoring='neg_mean_squared_error',
        n_jobs=-1
    )

    train_mean = -np.mean(train_scores, axis=1)
    val_mean = -np.mean(val_scores, axis=1)
    train_std = np.std(train_scores, axis=1)
    val_std = np.std(val_scores, axis=1)

    plt.figure(figsize=(10, 6))
    plt.plot(train_sizes, train_mean, 'o-', label='Training error')
    plt.plot(train_sizes, val_mean, 'o-', label='Validation error')
    plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.1)
    plt.fill_between(train_sizes, val_mean - val_std, val_mean + val_std, alpha=0.1)
    plt.xlabel('Training set size')
    plt.ylabel('MSE')
    plt.legend()
    plt.grid(True)
    plt.show()

Early Stopping Strategies

Early stopping prevents overfitting by stopping training when validation performance stops improving.

Basic Early Stopping


class EarlyStopping:
    def __init__(self, patience=5, min_delta=0.001):
        self.patience = patience
        self.min_delta = min_delta
        self.counter = 0
        self.best_score = None
        self.should_stop = False

    def __call__(self, val_score):
        if self.best_score is None:
            self.best_score = val_score
        elif val_score < self.best_score + self.min_delta:
            self.counter += 1
            if self.counter >= self.patience:
                self.should_stop = True
        else:
            self.best_score = val_score
            self.counter = 0
        return self.should_stop

# Usage in training loop
early_stopping = EarlyStopping(patience=10, min_delta=0.001)
for epoch in range(max_epochs):
    train_loss = train_one_epoch(model, train_loader)
    val_loss = validate(model, val_loader)

    if early_stopping(-val_loss):  # Negative because we want to maximize
        print(f"Early stopping at epoch {epoch}")
        break

Early Stopping in Gradient Boosting

from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(
    n_estimators=1000,
    learning_rate=0.01,
    validation_fraction=0.1,
    n_iter_no_change=10,  # Early stopping patience
    tol=1e-4  # Minimum improvement
)
model.fit(X_train, y_train)
print(f"Actual n_estimators used: {model.n_estimators_}")

PyTorch Early Stopping with Checkpointing

import torch

def train_with_early_stopping(model, train_loader, val_loader, epochs, patience=5):
    optimizer = torch.optim.Adam(model.parameters())
    criterion = torch.nn.CrossEntropyLoss()

    best_val_loss = float('inf')
    patience_counter = 0
    best_model_state = None

    for epoch in range(epochs):
        # Training
        model.train()
        for X, y in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(X), y)
            loss.backward()
            optimizer.step()

        # Validation
        model.eval()
        val_loss = 0
        with torch.no_grad():
            for X, y in val_loader:
                val_loss += criterion(model(X), y).item()

        val_loss /= len(val_loader)

        # Early stopping logic
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            patience_counter = 0
            # Deep-copy tensors: dict.copy() is shallow and would track later updates
            best_model_state = {k: v.detach().clone() for k, v in model.state_dict().items()}
        else:
            patience_counter += 1
            if patience_counter >= patience:
                print(f"Early stopping at epoch {epoch}")
                model.load_state_dict(best_model_state)
                break

    return model

Decision Framework: Which Tuning Strategy to Use

| Scenario | Recommended Strategy | Why |
|----------|----------------------|-----|
| <5 hyperparameters | Grid Search | Small space, comprehensive |
| 5-20 hyperparameters | Random Search | Efficient exploration |
| >20 hyperparameters | Bayesian (Optuna) | Smart search |
| Expensive training (>1hr) | Bayesian + Early Stopping | Minimize evaluations |
| Limited compute budget | Random (n=50) + Early Stopping | Good coverage, low cost |
| Production deployment | Bayesian + Cross-validation | Robust, reproducible |

Best Practices

  1. Start coarse, then fine: Wide search first, narrow later
  2. Use domain knowledge: Set sensible ranges based on experience
  3. Monitor learning curves: Diagnose over/underfitting early
  4. Apply early stopping: Save compute, prevent overfitting
  5. Document experiments: Track all configurations and results
  6. Cross-validation: Use k-fold CV for reliable estimates
  7. Parallelize: Use n_jobs=-1 or distributed tuning (Ray Tune)

Sources: AICompetence Grid vs Random vs Bayesian (May 2025), GeeksforGeeks Learning Curves (Jul 2025), Bergstra & Bengio (2012), Snoek et al. (2012)

Basic

Q: Why does Random Search often work better than Grid Search?

A: Bergstra & Bengio (2012) showed:

  1. Not all parameters matter: Usually only ~1-2 parameters are important, the rest have little effect
  2. Grid wastes resources: It enumerates every combination of the unimportant parameters
  3. Random covers more: With the same budget it explores more values of the important parameters
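A toy illustration of the argument: with a budget of 9 trials over one important and one unimportant parameter, a 3×3 grid only ever tries 3 distinct values on the important axis, while random search tries 9. The parameter ranges are arbitrary.

```python
import itertools
import random

random.seed(0)
budget = 9

# Grid search: 3 x 3 combinations over (important, unimportant)
grid = list(itertools.product([0.1, 0.5, 0.9], [1, 2, 3]))
grid_values = {important for important, _ in grid}

# Random search: every trial draws a fresh value of the important parameter
random_values = {random.uniform(0, 1) for _ in range(budget)}
```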

Q: What does a Learning Curve show?

A: A plot of error against training set size or epochs:

- X-axis: Training set size or epochs
- Y-axis: Error (MSE, loss) or accuracy
- Two lines: Training error and Validation error

Medium

Q: How do you diagnose overfitting from a Learning Curve?

A:

| Symptom | Training Error | Validation Error | Gap |
|---------|----------------|------------------|-----|
| Overfitting | Very low | High, increasing | Large |
| Underfitting | High | High | Small |
| Good fit | Low | Low | Small |

Actions for overfitting: More data, regularization, early stopping, simpler model

Q: How does Early Stopping work?

A: Training stops when the validation loss stops improving:

if val_loss < best_val_loss - min_delta:
    best_val_loss = val_loss
    counter = 0
else:
    counter += 1
    if counter >= patience:
        stop_training()

Parameters: patience (how many epochs to wait), min_delta (minimum improvement)

Killer

Q: How do you choose a tuning strategy for a production system?

A:

  1. Budget assessment: How much time/compute is available?
  2. Model complexity: Deep learning → Bayesian, Classical ML → Random/Grid
  3. Iteration cost: Expensive training → Bayesian + early stopping
  4. Risk tolerance: Production → k-fold CV + multiple runs

Recommended pipeline:

# Stage 1: Coarse random search
random_search = RandomizedSearchCV(..., n_iter=50, cv=3)

# Stage 2: Fine Bayesian around best region
study = optuna.create_study()
study.optimize(objective, n_trials=100)

# Stage 3: Final validation with full CV
final_cv = cross_val_score(best_model, X, y, cv=10)

Q: How do you avoid overfitting the validation set during tuning?

A:

  1. Nested CV: Inner loop for tuning, outer loop for evaluation
  2. Hold-out test set: Never use it for tuning at all
  3. Limit trials: Don't sweep thousands of combinations
  4. Early stopping: Don't keep tuning until the model merely fits the validation set

# Nested CV prevents overfitting to validation
from sklearn.model_selection import cross_val_score, GridSearchCV, KFold

inner_cv = KFold(n_splits=3)
outer_cv = KFold(n_splits=5)

clf = GridSearchCV(estimator=model, param_grid=param_grid, cv=inner_cv)
nested_score = cross_val_score(clf, X, y, cv=outer_cv)

Updated: 2026-02-12, Ralph iteration 106: added Cross-Validation Edge Cases (Section 24)


24. Cross-Validation Edge Cases

Advanced cross-validation techniques for robust model evaluation.

Nested Cross-Validation

Problem: When tuning hyperparameters, standard CV causes optimism bias — we use validation data both to select hyperparameters AND to report performance.

Solution: Nested CV separates model selection from evaluation:

- Outer loop (evaluation): Honest test of tuned model generalization
- Inner loop (selection): Hyperparameter tuning on training data only

from sklearn.model_selection import GridSearchCV, cross_val_score, KFold
from sklearn.ensemble import RandomForestClassifier

# Inner loop: hyperparameter search
inner_cv = KFold(n_splits=3, shuffle=True, random_state=1)
param_grid = {'n_estimators': [50, 100], 'max_depth': [3, 5, 10]}
grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=inner_cv)

# Outer loop: evaluation
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)
nested_scores = cross_val_score(grid, X, y, cv=outer_cv)

print(f"Nested CV score: {nested_scores.mean():.3f} (+/- {nested_scores.std():.3f})")

Nested vs Standard CV Comparison

| Aspect | Standard CV (GridSearchCV) | Nested CV |
|--------|----------------------------|-----------|
| Purpose | Tune hyperparameters | Evaluate tuning pipeline |
| Data leakage | Possible (optimistic bias) | Prevented |
| Computation | \(k \times n_{params}\) | \(k_{outer} \times k_{inner} \times n_{params}\) |
| When to use | Final model selection | Model comparison, publication |

Time Series Cross-Validation

Problem: Standard K-fold CV breaks temporal structure — training on future data to predict past = data leakage.

Walk-Forward Validation (Expanding Window)

from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    print(f"Fold {fold}: Train={len(train_idx)}, Test={len(test_idx)}")

Sliding Window (Fixed Size)

import numpy as np

class SlidingWindowCV:
    def __init__(self, window_size, step=1):
        self.window_size = window_size
        self.step = step

    def split(self, X):
        n = len(X)
        for i in range(self.window_size, n, self.step):
            train_idx = np.arange(i - self.window_size, i)
            test_idx = np.arange(i, min(i + self.step, n))
            yield train_idx, test_idx

Blocked Time Series CV (with Embargo)

import numpy as np

class BlockedTimeSeriesCV:
    def __init__(self, n_splits=5, embargo=0):
        self.n_splits = n_splits
        self.embargo = embargo  # Gap between train and test

    def split(self, X):
        n = len(X)
        k = n // (self.n_splits + 1)
        for i in range(self.n_splits):
            test_start = i * k + k
            test_end = test_start + k
            train_end = test_start - self.embargo
            yield np.arange(0, train_end), np.arange(test_start, test_end)

Time Series CV Comparison

| Method | Window | Memory | Best For |
|--------|--------|--------|----------|
| Expanding | Grows | All history | Stable systems |
| Sliding | Fixed | Recent only | Concept drift |
| Blocked | Fixed + gap | No leakage | Financial data |

Bootstrap .632 Estimator

Formula: \(\hat{Err}^{.632} = 0.368 \times \overline{err} + 0.632 \times \hat{Err}^{(1)}\)

Where \(\overline{err}\) = training error, \(\hat{Err}^{(1)}\) = OOB error

import numpy as np
from sklearn.utils import resample

def bootstrap_632_score(model, X, y, n_bootstraps=100):
    n = len(y)
    oob_errors, train_errors = [], []

    for _ in range(n_bootstraps):
        indices = resample(range(n), n_samples=n, replace=True)
        oob_mask = ~np.isin(range(n), indices)
        if oob_mask.sum() == 0: continue

        model.fit(X[indices], y[indices])
        train_errors.append(np.mean(model.predict(X[indices]) != y[indices]))
        oob_errors.append(np.mean(model.predict(X[oob_mask]) != y[oob_mask]))

    return 0.368 * np.mean(train_errors) + 0.632 * np.mean(oob_errors)

Why 0.632? Bootstrap sample includes ~63.2% of data (1 - 1/e).
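The 63.2% figure, the expected fraction of distinct rows in a bootstrap sample, \(1 - 1/e\), is easy to confirm with a quick stdlib-only simulation:

```python
import random

random.seed(42)
n, trials = 1000, 200
fractions = []
for _ in range(trials):
    sample = [random.randrange(n) for _ in range(n)]  # bootstrap indices
    fractions.append(len(set(sample)) / n)            # fraction of unique rows

mean_unique = sum(fractions) / trials  # close to 1 - 1/e ~= 0.632
```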


Repeated K-Fold CV

from sklearn.model_selection import RepeatedKFold

cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=42)
scores = cross_val_score(model, X, y, cv=cv)  # 50 evaluations

Decision Framework: Which CV to Use

| Data Type | CV Method | Why |
|-----------|-----------|-----|
| Standard (i.i.d.) | K-fold (k=5 or 10) | Good bias-variance trade-off |
| Small dataset | LOOCV or .632 Bootstrap | Maximize training data |
| Imbalanced | Stratified K-fold | Preserve class ratios |
| Time series | Walk-forward / Blocked | Respect temporal order |
| Grouped (clusters) | GroupKFold | Keep groups together |
| Hyperparameter tuning | Nested CV | Prevent optimism bias |
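For the grouped-data case, `GroupKFold` guarantees that no group is split across train and test. A minimal check with toy data (the group values stand in for e.g. patient or user IDs):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(12).reshape(6, 2)
y = np.array([0, 1, 0, 1, 0, 1])
groups = np.array([1, 1, 2, 2, 3, 3])  # e.g. patient or user IDs

gkf = GroupKFold(n_splits=3)
splits = list(gkf.split(X, y, groups))
```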

Sources: Medium Nested CV (May 2025), MLMastery Time Series CV (Jan 2026)

Basic

Q: Why is Nested Cross-Validation needed?

A: Plain GridSearchCV gives an optimistic bias: the validation data is used twice (both to select hyperparameters AND to estimate quality).

Nested CV: Outer loop = honest evaluation, Inner loop = hyperparameter tuning.

Q: Why can't standard K-fold be used for Time Series?

A: Random shuffling breaks the temporal order: training on the future to test on the past = data leakage.

Medium

Q: What is the difference between Expanding and Sliding Window?

A: Expanding grows (full history), Sliding is fixed (recent data only). Expanding suits stable systems, Sliding suits concept drift.

Q: What is Bootstrap .632?

A: A combination of training error and OOB error: \(0.368 \times \text{train\_err} + 0.632 \times \text{OOB\_err}\). Reduces bias on small datasets.

Killer

Q: How do you correctly organize a CV pipeline with preprocessing?

A: CRITICAL: the Pipeline goes inside CV, not outside:

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# WRONG - data leakage: scaler is fit on validation folds too
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)
cross_val_score(model, X_scaled, y, cv=5)

# CORRECT - scaler is re-fit inside each training fold
pipeline = Pipeline([('scaler', StandardScaler()), ('model', LogisticRegression())])
cross_val_score(pipeline, X, y, cv=5)


Common Misconceptions

Misconception: Overfitting can only be detected from validation metrics

Learning curves (train vs val) are a necessary but not sufficient tool. According to Kaggle 2025 data, 34% of "overfitting" cases are actually caused by data leakage in the preprocessing pipeline (e.g., fitting the scaler on all data before the split). Always check the entire pipeline via nested cross-validation.

Misconception: Feature selection always improves model quality

In practice, aggressive feature selection can remove features with a weak-but-useful signal. A study (Boulesteix et al., 2024) showed that on datasets with >50 features, L1 regularization (Lasso) averages 2-4% worse AUC than Ridge (L2) with all features when the features are highly correlated. Use Elastic Net as a compromise.

Misconception: Gradient Boosting always beats Random Forest

According to the AutoML Benchmark 2025 meta-analysis on 104 tabular datasets, Random Forest beats GBDT in 38% of cases, especially on noisy data (SNR < 2), small samples (<500 rows), and class imbalance of 1:50+. RF is also far more robust to hyperparameters: default RF trails tuned RF by 1-2%, while default XGBoost trails tuned XGBoost by 5-8%.


See Also