Classical ML: Interview Q&A¶
~58 minutes of reading
Typical interview questions for 2025-2026. Format: Q: question / A: detailed answer. Updated: 2026-02-11
Prerequisites: Materials
Содержание¶
- K-Nearest Neighbors
- Logistic Regression
- K-Means
- Naive Bayes
- Decision Trees
- SVM
- Gradient Boosting
- Feature Engineering
- Feature Selection
- Сложные вопросы (Senior+)
- Model Interpretability (SHAP & LIME) — NEW 2026
- Reinforcement Learning Basics
- NLP: Word Embeddings (Word2Vec, GloVe)
- NLP: Named Entity Recognition (NER) & Sequence Labeling
- Hyperparameter Optimization
- Active Learning
- Time Series: Deep Learning Methods
- Explainable AI (XAI): SHAP & LIME
- Neural Architecture Search (NAS)
- Cost-Sensitive Learning
- Missing Data Handling
- Model Debugging
- AutoML Theory
- Federated Learning
- TabPFN — Foundation Model for Tabular Data
- Production ML Deployment Patterns
- Data Drift Detection
- Hyperparameter Interactions & Learning Curves
- Cross-Validation Edge Cases
K-Nearest Neighbors¶
Q: Почему KNN плохо работает в высоких размерностях?¶
A: Curse of Dimensionality.
Причина: В высокоразмерном пространстве: 1. Все точки "далеко" друг от друга 2. Volume unit ball \(\to 0\) exponentially 3. Distances становятся неразличимыми
Solution: Dimensionality reduction (PCA) или другие алгоритмы.
Q: Как выбрать k в KNN?¶
A:
- k=1: Overfitting, чувствителен к noise
- k=n: Underfitting, всегда majority class
- Rule of thumb: \(k \approx \sqrt{n}\) (but use CV)
Practical: Cross-validation для optimal k. Обычно нечётное k для бинарной классификации (избежать ties).
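A minimal sketch of picking k with cross-validation (scikit-learn; X, y assumed to be an already scaled dataset):
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

param_grid = {'n_neighbors': [3, 5, 7, 9, 11, 15, 21]}  # odd values around sqrt(n)
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)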
Q: Какую метрику расстояния выбрать?¶
A:
| Метрика | Формула | Когда использовать |
|---|---|---|
| Euclidean | \(\sqrt{\sum(x_i-y_i)^2}\) | Default, continuous features, масштабированные данные |
| Manhattan | \(\sum\|x_i-y_i\|\) | High-dim (более устойчив к curse of dimensionality), sparse |
| Cosine | \(1 - \frac{x \cdot y}{\|x\|\|y\|}\) | Text embeddings, TF-IDF, когда важен угол, а не magnitude |
| Mahalanobis | \(\sqrt{(x-y)^T S^{-1} (x-y)}\) | Коррелированные фичи, учитывает ковариацию |
Gotcha: Euclidean и Manhattan требуют feature scaling. Cosine -- нет (инвариантен к scale).
Q: Weighted KNN -- зачем и как?¶
A:
Проблема: Стандартный KNN даёт равный вес всем k соседям -- далёкий сосед влияет так же, как ближайший.
Решение: Взвешивание по обратному расстоянию: $\(\hat{y} = \frac{\sum_{i \in N_k} w_i y_i}{\sum_{i \in N_k} w_i}, \quad w_i = \frac{1}{d(x, x_i)^p}\)$
Scikit-learn: KNeighborsClassifier(weights='distance')
Когда помогает: Неравномерная плотность данных, граничные зоны между классами.
Q: Как ускорить KNN? Brute force O(nd) на каждый query.¶
A:
| Метод | Сложность query | Когда |
|---|---|---|
| Brute force | \(O(nd)\) | \(n < 10K\) или \(d > 20\) |
| KD-tree | \(O(d \log n)\) avg | \(d < 20\), dense data |
| Ball tree | \(O(d \log n)\) avg | Любая метрика, \(d < 40\) |
| ANN (приближённые) | \(O(d \log n)\) | \(n > 100K\), допустима ошибка |
Approximate Nearest Neighbors (ANN):
| Библиотека | Алгоритм | Плюсы |
|---|---|---|
| FAISS (Meta) | IVF + PQ | GPU, миллиарды векторов, production-standard |
| Annoy (Spotify) | Random projections | Read-only, быстрый build, mmap |
| HNSW (hnswlib) | Hierarchical NSW graph | Лучший recall/speed trade-off |
| ScaNN (Google) | Anisotropic quantization | Оптимизирован для inner product |
Scikit-learn: KNeighborsClassifier(algorithm='auto') выбирает между brute/kd_tree/ball_tree автоматически.
Q: KNN для регрессии vs классификации -- в чём разница?¶
A:
| Classification | Regression | |
|---|---|---|
| Prediction | Majority vote среди k соседей | Mean/median значений k соседей |
| Weighted | Взвешенный vote | Взвешенное среднее |
| Метрики | Accuracy, F1 | MSE, MAE |
Scikit-learn: KNeighborsClassifier vs KNeighborsRegressor.
Gotcha: Для регрессии weighted KNN почти всегда лучше uniform -- далёкие точки вносят шум.
Q: Почему feature scaling обязателен для KNN?¶
A:
Проблема: KNN основан на расстоянии. Фича с бОльшим масштабом доминирует:
Фича A: зарплата (30000-150000)
Фича B: возраст (18-65)
→ Расстояние определяется почти полностью зарплатой
Решение:
| Метод | Когда |
|---|---|
| StandardScaler | Gaussian-like features |
| MinMaxScaler | Bounded features, [0,1] |
| RobustScaler | Есть outliers |
Gotcha: Fit scaler ТОЛЬКО на train set, transform и train и test. Иначе -- data leakage.
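A leak-free sketch using a Pipeline (data and split assumed): the scaler is fit inside the pipeline on training data only and then applied to test data.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipe = Pipeline([
    ('scaler', StandardScaler()),                     # fit on train only
    ('knn', KNeighborsClassifier(n_neighbors=7, weights='distance')),
])
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))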
Заблуждение: KNN всегда хуже complex моделей
На малых датасетах (\(n < 1000\), \(d < 20\)) с чистыми данными KNN часто побеждает Random Forest и SVM. Причина: KNN не имеет bias от функциональной формы -- чисто data-driven. Проблемы начинаются при \(d > 20\) (curse of dimensionality) или \(n > 50K\) (скорость).
Logistic Regression¶
Q: Почему logistic regression называется "regression"?¶
A: Historically -- because it models a linear combination of features: $\(\log\frac{P(y=1|x)}{1 - P(y=1|x)} = w^Tx + b\)$
Технически это classification (sigmoid превращает в probability), но underlying model — linear regression + activation.
Q: Multiclass logistic regression — как работает?¶
A:
One-vs-Rest (OvR): K бинарных классификаторов, каждый с sigmoid: $\(P(y=k|x) = \sigma(w_k^Tx + b_k) = \frac{1}{1 + e^{-(w_k^Tx + b_k)}}\)$
Softmax (Multinomial): Единая модель, все классы одновременно: $\(P(y=k|x) = \frac{e^{w_k^Tx}}{\sum_j e^{w_j^Tx}}\)$
Scikit-learn: multi_class='ovr' или 'multinomial'
Q: L1 vs L2 regularization в Logistic Regression¶
A:
| L1 (Lasso) | L2 (Ridge) |
|---|---|
| Sparse coefficients | Small coefficients |
| Feature selection | Handles multicollinearity |
| Non-differentiable at 0 | Smooth |
| Coordinate descent | Gradient descent |
Practical: L1 если нужна интерпретируемость, L2 если все признаки важны.
K-Means¶
Q: K-means всегда сходится?¶
A: Да, но не обязательно к глобальному optimum.
Теорема: K-means сходится к локальному минимуму за конечное число шагов.
Причина: На каждом шаге: 1. Assignment step уменьшает J или оставляет 2. Update step уменьшает J или оставляет 3. J ограничено снизу
Problem: Может застрять в плохом local minimum.
Solution: K-means++ initialization, multiple restarts.
Q: Как работает K-Means++ initialization и почему она важна?¶
A:
Проблема random init: Плохие стартовые центроиды = плохой local minimum. Random init в ~20% случаев даёт результат в 2-5x хуже optimal.
K-Means++ алгоритм: 1. Выбрать первый центроид случайно 2. Для каждой точки вычислить расстояние до ближайшего центроида \(D(x)\) 3. Выбрать следующий центроид с вероятностью \(\frac{D(x)^2}{\sum D(x)^2}\) 4. Повторять шаги 2-3 пока не будет k центроидов
Гарантия: \(O(\log k)\)-competitive с оптимальным решением (Arthur & Vassilvitskii, 2007).
Scikit-learn: KMeans(init='k-means++') -- default.
Q: Как выбрать k в K-means?¶
A:
| Метод | Как работает | Плюсы/Минусы |
|---|---|---|
| Elbow | Plot \(J(k)\) vs \(k\), найти "локоть" | Субъективный, не всегда есть чёткий elbow |
| Silhouette | \(s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}\) | Объективный, \(s \in [-1, 1]\), но \(O(n^2)\) |
| Gap Statistic | Сравнение с uniform distribution | Статистически обоснован, дорогой |
| Calinski-Harabasz | \(\frac{B/(k-1)}{W/(n-k)}\) (between/within variance ratio) | Быстрый, но biased к convex clusters |
Practical: - Domain knowledge > метрики (если знаешь бизнес = 3 сегмента клиентов, бери k=3) - Silhouette + Elbow -- стандартная комбинация - Если Silhouette < 0.25 -- кластеризация плохая, данные не имеют кластерной структуры
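A sketch of the standard Elbow + Silhouette scan (X assumed scaled):
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 11):
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    labels = km.fit_predict(X)
    print(k, km.inertia_, silhouette_score(X, labels))
# Pick k at the inertia "elbow" with a reasonable silhouette (> ~0.25)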
Q: K-means vs K-medoids¶
A:
| K-means | K-medoids (PAM) |
|---|---|
| Centroid = mean | Centroid = actual data point |
| Sensitive to outliers | Robust to outliers |
| Только Euclidean | Любая метрика расстояния |
| \(O(nkd)\) per iteration | \(O(n^2kd)\) -- значительно медленнее |
Когда K-medoids: Outliers, non-Euclidean distances, когда центроид должен быть интерпретируемым (реальная точка данных).
Q: Когда K-Means не работает?¶
A:
| Ситуация | Почему ломается | Альтернатива |
|---|---|---|
| Non-convex кластеры (полумесяцы, кольца) | K-Means делит пространство Voronoi-разбиением | DBSCAN, Spectral Clustering |
| Кластеры разного размера | Маленький кластер "поглощается" большим | GMM, HDBSCAN |
| Кластеры разной плотности | Разреженные точки ошибочно присваиваются плотному кластеру | DBSCAN, OPTICS |
| High-dimensional (\(d > 50\)) | Расстояния теряют смысл (curse of dimensionality) | Spectral Clustering, PCA + K-Means |
| Неизвестное количество кластеров | K задаётся вручную | DBSCAN, HDBSCAN, X-Means |
Q: K-Means vs DBSCAN vs GMM -- когда что?¶
A:
| Аспект | K-Means | DBSCAN | GMM |
|---|---|---|---|
| Форма кластеров | Сферические | Произвольная | Эллиптическая |
| K задаётся? | Да | Нет (\(\epsilon\), min_pts) | Да |
| Outliers | Нет (все точки в кластерах) | Да (noise points) | Нет (но low probability) |
| Soft assignment | Нет (hard) | Нет (hard) | Да (\(P(z_k\|x)\)) |
| Скорость | \(O(nkd)\) | \(O(n \log n)\) с index | \(O(nk d^2)\) per EM step |
| Масштабируемость | Mini-batch до миллионов | Плохо > 100K | Плохо > 50K |
Rules of thumb: - K-Means: Сферические кластеры, знаешь k, нужна скорость - DBSCAN: Не знаешь k, есть outliers, non-convex формы - GMM: Нужна soft membership (вероятности), overlapping clusters
Q: Mini-batch K-Means -- зачем?¶
A:
Проблема: Стандартный K-Means: каждый iteration проходит по ВСЕМ \(n\) точкам. При \(n > 1M\) -- медленно.
Mini-batch: Каждый iteration -- случайная выборка \(b\) точек (batch_size=1000 typically).
| Standard K-Means | Mini-batch K-Means | |
|---|---|---|
| Per iteration | \(O(nkd)\) | \(O(bkd)\), \(b \ll n\) |
| Convergence | Стабильная | Slightly noisier |
| Quality | Baseline | ~1-3% хуже inertia |
| Speed | Slow for \(n > 100K\) | 10-100x faster |
Scikit-learn: MiniBatchKMeans(n_clusters=k, batch_size=1000)
Q: Метрики качества кластеризации -- с labels и без?¶
A:
Внешние (есть ground truth):
| Метрика | Формула/Суть | Диапазон |
|---|---|---|
| ARI (Adjusted Rand Index) | Корректировка Rand Index за chance | \([-1, 1]\), 1 = perfect |
| NMI (Normalized Mutual Information) | \(\frac{2 \cdot MI(U,V)}{H(U) + H(V)}\) | \([0, 1]\) |
| Homogeneity / Completeness | Кластер = один класс / класс = один кластер | \([0, 1]\) |
Внутренние (нет ground truth):
| Метрика | Суть | Диапазон |
|---|---|---|
| Silhouette | Cohesion vs separation | \([-1, 1]\) |
| Calinski-Harabasz | Between/within variance | \([0, \infty)\), higher = better |
| Davies-Bouldin | Avg cluster similarity | \([0, \infty)\), lower = better |
Gotcha: Silhouette biased к convex clusters. Для DBSCAN результатов лучше использовать DBCV (Density-Based Clustering Validation).
Заблуждение: K-Means всегда находит глобальный оптимум
K-Means гарантирует сходимость к ЛОКАЛЬНОМУ минимуму за конечное число шагов (monotonic decrease of J), но НЕ гарантирует глобальный. На практике: запускай n_init=10 (scikit-learn default) -- 10 запусков с разными init, берёт лучший. K-Means++ снижает разброс между запусками, но не устраняет полностью.
Заблуждение: Feature scaling не нужен для K-Means
K-Means использует Euclidean distance -- фича с большим масштабом доминирует. StandardScaler или MinMaxScaler ОБЯЗАТЕЛЬНЫ перед K-Means. Единственное исключение: если все фичи уже в одном масштабе (e.g., one-hot encoded).
Naive Bayes¶
Q: Почему "naive"? Что если assumption нарушается?¶
A:
Naive assumption: Features условно независимы при фиксированном классе.
Reality: Features коррелируют.
Но работает потому что: 1. Нужно только ORDERING, не точные probabilities 2. Overestimation всех вероятностей сокращается 3. Strong signal от truly informative features
Q: Gaussian vs Multinomial vs Bernoulli Naive Bayes¶
A:
| Type | Data | Distribution |
|---|---|---|
| Gaussian | Continuous | Normal |
| Multinomial | Counts (text) | Multinomial |
| Bernoulli | Binary features | Bernoulli |
Для текста: Multinomial NB standard (TF-IDF counts).
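A minimal text-classification sketch (texts/labels are placeholder variables):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

clf = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2), min_df=2)),
    ('nb', MultinomialNB(alpha=0.1)),  # alpha = Laplace/Lidstone smoothing
])
clf.fit(texts_train, labels_train)
print(clf.score(texts_test, labels_test))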
Decision Trees¶
Q: Gini vs Entropy — что выбрать?¶
A:
Gini: \(1 - \sum p_k^2\)
Entropy: \(-\sum p_k \log_2 p_k\)
Практически: Почти идентичные результаты.
Различия: - Gini: быстрее (нет log) - Entropy: чуть более sensitive к pure nodes - Gini чаще в sklearn default
Recommendation: Use default (Gini), только если time-critical.
Q: Как предотвратить overfitting в Decision Trees?¶
A:
Pre-pruning:
- max_depth: limit tree depth
- min_samples_split: minimum samples to split
- min_samples_leaf: minimum samples in leaf
- max_leaf_nodes: maximum leaves
Post-pruning: - Cost-complexity pruning (α penalty) - Reduced error pruning
Rule of thumb: Start with max_depth=5-10, tune.
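A sketch of post-pruning via cost-complexity pruning (ccp_alpha) in scikit-learn; X_train/y_train assumed:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# Candidate alphas come from the pruning path computed on the training data
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    {'ccp_alpha': path.ccp_alphas},
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_)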
Q: Feature importance в Decision Trees — как считается?¶
A:
Gini Importance (Mean Decrease in Impurity): $\(\text{Importance}(j) = \sum_{t \in T} p(t) \cdot \Delta i(t) \cdot \mathbb{1}(j \text{ used at } t)\)$
где \(p(t)\) = fraction of samples at node \(t\), \(\Delta i(t)\) = impurity decrease.
Warning: Biased towards high-cardinality features!
Alternative: Permutation importance (more reliable).
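A sketch comparing the two (a fitted tree-based model and a held-out set assumed):
from sklearn.inspection import permutation_importance

# Impurity-based importance: fast, but biased toward high-cardinality features
print(model.feature_importances_)

# Permutation importance on a held-out set: slower, more reliable
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)
print(result.importances_mean)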
SVM¶
Q: Что такое support vectors?¶
A: Support vectors -- точки, лежащие на границе margin или внутри.
Математически: Точки с \(\alpha_i > 0\) в dual formulation.
Свойства: - Только support vectors определяют hyperplane - Удаление non-support vectors не меняет модель - Обычно 10-30% training samples
Follow-up: Это делает SVM memory-efficient -- prediction зависит только от support vectors, а не от всего dataset.
Q: Hard margin vs Soft margin -- в чём разница и зачем C?¶
A:
Hard margin (линейно разделимые данные): $\(\min_{w,b} \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w^Tx_i + b) \geq 1\)$
Soft margin (реальные данные с шумом): $\(\min_{w,b,\xi} \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i \quad \text{s.t.} \quad y_i(w^Tx_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0\)$
Параметр C -- trade-off:
| C | Margin | Ошибки на train | Risk |
|---|---|---|---|
| Маленький (0.01) | Широкий | Допускает больше | Underfitting |
| Большой (100) | Узкий | Штрафует сильно | Overfitting |
Practical: Подбирать через CV. Default в sklearn: C=1.0. Типичный grid: [0.01, 0.1, 1, 10, 100].
Q: Почему SVM kernel trick работает?¶
A:
Idea: Map data to higher dimension \(\phi(x)\), but compute only kernel \(K(x,x') = \phi(x)^T\phi(x')\).
Dual formulation: $\(f(x) = \sum_i \alpha_i y_i K(x_i, x) + b\)$
Key insight: Никогда не вычисляем \(\phi(x)\) явно!
RBF kernel: \(\exp(-\gamma\|x-x'\|^2)\) = infinite dimensional mapping. Caveat: kernel matrix \(O(n^2)\) space/time; for large \(n\) (>10K) use approximations (Nystrom, random Fourier features).
Q: Как выбрать kernel?¶
A:
| Kernel | Формула | Когда использовать |
|---|---|---|
| Linear | \(x^Tx'\) | \(d > n\) (text, genomics), линейно разделимые данные |
| Polynomial | \((\gamma x^Tx' + r)^d\) | Известна полиномиальная связь, NLP (degree 2-3) |
| RBF (Gaussian) | \(\exp(-\gamma\|x-x'\|^2)\) | Default. Не знаешь структуру данных |
| Sigmoid | \(\tanh(\gamma x^Tx' + r)\) | Редко. Аналог нейросети с 1 hidden layer |
Decision flow:
1. Начни с Linear SVM (LinearSVC) -- быстрый, часто достаточен
2. Если accuracy < target -- RBF SVM (SVC(kernel='rbf'))
3. Tune \(\gamma\) и \(C\) через GridSearchCV
\(\gamma\) в RBF: - Маленький \(\gamma\): каждая точка влияет далеко (smoother boundary, underfitting) - Большой \(\gamma\): каждая точка влияет только на ближайших (complex boundary, overfitting)
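A sketch of the usual log-spaced grid over C and gamma (X assumed scaled):
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'gamma': [1e-3, 1e-2, 1e-1, 1, 'scale'],
}
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_)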
Q: SVM vs Logistic Regression -- когда что?¶
A:
| Аспект | SVM | Logistic Regression |
|---|---|---|
| Objective | Max margin (hinge loss) | Max likelihood (log loss) |
| Output | Decision value (не probability) | Probability \(P(y=1\|x)\) |
| Outliers | Менее чувствителен (hinge = flat beyond margin) | Более чувствителен (log loss grows) |
| Feature scaling | Обязательно | Желательно |
| Kernels | Да (non-linear) | Нет (только linear) |
| Скорость | \(O(n^2)\) - \(O(n^3)\) | \(O(nd)\) |
| Большие данные | Плохо > 50K | Хорошо на миллионах |
Rules of thumb: - Нужны вероятности → Logistic Regression - Мало данных (\(n < 10K\)), non-linear → SVM с RBF - Много данных (\(n > 50K\)) → Logistic Regression (или LinearSVC) - Text classification → LinearSVC (часто лучше LR на sparse data)
Q: Multiclass SVM -- OvO vs OvR¶
A:
| One-vs-Rest (OvR) | One-vs-One (OvO) | |
|---|---|---|
| Классификаторов | \(k\) | \(\frac{k(k-1)}{2}\) |
| Training | Каждый: \(n\) samples | Каждый: \(\frac{2n}{k}\) samples |
| Prediction | Макс confidence score | Majority voting |
| Когда лучше | \(k\) большой, \(n\) большой | \(n\) маленький, kernel SVM |
Scikit-learn: SVC использует OvO по default. LinearSVC использует OvR. Для multiclass > 10 классов: OvR быстрее.
Q: SVM для регрессии (SVR)¶
A:
Идея: \(\epsilon\)-insensitive tube -- ошибки внутри \(\epsilon\) не штрафуются.
| Параметр | Эффект |
|---|---|
| \(\epsilon\) | Ширина tube (толерантность к ошибкам) |
| \(C\) | Штраф за выход из tube |
Когда SVR: Маленький dataset, non-linear зависимости, outliers (tube их игнорирует).
Q: SVM для несбалансированных данных¶
A:
Class weights: $\(C_+ = C \cdot \frac{n}{2 \cdot n_+}, \quad C_- = C \cdot \frac{n}{2 \cdot n_-}\)$
Scikit-learn:
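Presumably the snippet meant here is class_weight; a minimal sketch:
from sklearn.svm import SVC

# 'balanced' reweights C per class as n / (n_classes * n_k)
clf = SVC(kernel='rbf', class_weight='balanced')
# or explicit weights, e.g. penalize mistakes on class 1 ten times more
clf = SVC(kernel='rbf', class_weight={0: 1, 1: 10})
clf.fit(X_train, y_train)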
Q: Масштабируемость SVM -- когда НЕ использовать?¶
A:
| \(n\) (samples) | Рекомендация |
|---|---|
| < 10K | SVM с любым kernel |
| 10K-100K | LinearSVC (или SGDClassifier с hinge loss) |
| > 100K | Не SVM. Используй LogReg, GBDT, Neural Nets |
Причина: Kernel SVM строит kernel matrix \(n \times n\) -- \(O(n^2)\) memory, \(O(n^3)\) training.
Альтернативы для large-scale:
| Метод | Сложность |
|---|---|
| LinearSVC (liblinear) | \(O(nd)\) |
| SGDClassifier(loss='hinge') | \(O(nd)\), online |
| Nystrom approximation | \(O(nm^2)\), \(m \ll n\) |
| Random Fourier Features | \(O(nDd)\), \(D\) = projection dim |
Q: В каких задачах SVM всё ещё актуален в 2026?¶
A:
| Задача | Почему SVM |
|---|---|
| Text classification (small corpus) | LinearSVC на TF-IDF часто побеждает fine-tuned BERT при \(n < 5K\) |
| Bioinformatics (gene expression) | \(d \gg n\), kernel methods natural |
| Anomaly detection (One-Class SVM) | Не нужны аномальные примеры для train |
| Small dataset + non-linear | RBF SVM при \(n < 1K\) часто лучше RF/GBDT |
Где SVM проиграл: Tabular > 10K samples (GBDT лучше), Vision (CNN), NLP (Transformers), любой large-scale.
Заблуждение: SVM даёт вероятности
SVC.predict_proba() в sklearn использует Platt scaling (sigmoid calibration поверх decision values). Это НЕ native probability -- это post-hoc calibration, медленная (\(O(n^2)\) cross-validation), и может быть неточной. Если нужны вероятности -- используй Logistic Regression.
Заблуждение: RBF kernel всегда лучше Linear
При \(d > n\) (high-dimensional, sparse data) linear kernel часто лучше RBF. Причина: в high-dim пространстве данные часто линейно разделимы. RBF добавляет ненужную сложность и overfits. Правило: text/genomics → linear, tabular low-dim → RBF.
Gradient Boosting¶
Q: Gradient Boosting vs Random Forest — когда что?¶
A:
| Gradient Boosting | Random Forest |
|---|---|
| Sequential training | Parallel training |
| Low bias, higher variance | Higher bias, low variance |
| Prone to overfitting | Resistant to overfitting |
| Requires careful tuning | Easy to tune |
| Better accuracy (potential) | Good baseline |
Practical: - Start with RF for baseline - Use GBDT if need max accuracy - XGBoost/LightGBM/CatBoost > sklearn GBDT
Q: Learning rate в Gradient Boosting — как выбрать?¶
A:
Trade-off: - Low LR (0.01): More trees needed, better generalization - High LR (0.3): Fewer trees, faster, may overfit
Rule of thumb: - Start with LR=0.1, n_estimators=100 - If overfitting: decrease LR, increase n_estimators - Typical: LR=0.01-0.1
Relation: \(n\_estimators \propto 1/LR\)
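A sketch with LightGBM (assumed here; the same pattern works for XGBoost/CatBoost): low learning rate, a large tree budget, early stopping on a validation set picks n_estimators automatically.
import lightgbm as lgb

model = lgb.LGBMClassifier(learning_rate=0.05, n_estimators=2000)
model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    callbacks=[lgb.early_stopping(stopping_rounds=50)],
)
print(model.best_iteration_)  # effective number of trees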
Q: XGBoost vs LightGBM vs CatBoost¶
A:
| Feature | XGBoost | LightGBM | CatBoost |
|---|---|---|---|
| Tree growth | Level-wise | Leaf-wise | Symmetric |
| Categorical | Manual | Native | Native (best) |
| Missing values | Native | Native | Native |
| Speed | Good | Fastest | Good |
| Memory | Medium | Low | Medium |
| Tuning | Complex | Medium | Easy |
Practical: - CatBoost: Best for categorical, minimal tuning - LightGBM: Fastest for large datasets - XGBoost: Most mature, good default
Feature Engineering¶
Q: Target Encoding vs One-Hot для high-cardinality¶
A:
| One-Hot | Target Encoding |
|---|---|
| 1 column per category | 1 column total |
| No leakage risk | Leakage risk |
| Works for tree models | Works for linear models |
| O(k) dimensions | O(1) dimension |
Target Encoding risks: - Leakage если не использовать CV - Overfitting на rare categories
Solution: Leave-one-out, smoothing, CV-based encoding.
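A sketch of out-of-fold (CV-based) target encoding with smoothing; column names and the smoothing constant are illustrative:
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def target_encode_oof(df, cat_col, target_col, n_splits=5, smoothing=10):
    """Out-of-fold target encoding: each row is encoded with statistics
    computed on the other folds, so its own target never leaks in."""
    global_mean = df[target_col].mean()
    encoded = pd.Series(np.nan, index=df.index)
    for train_idx, val_idx in KFold(n_splits=n_splits, shuffle=True, random_state=42).split(df):
        train = df.iloc[train_idx]
        stats = train.groupby(cat_col)[target_col].agg(['mean', 'count'])
        # Shrink rare categories toward the global mean
        smooth = (stats['mean'] * stats['count'] + global_mean * smoothing) / (stats['count'] + smoothing)
        encoded.iloc[val_idx] = df.iloc[val_idx][cat_col].map(smooth).fillna(global_mean).values
    return encoded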
Q: Когда использовать log transform?¶
A:
Для: - Right-skewed distributions (income, prices) - Positive values only - Multiplicative relationships
Effect: - Reduces skewness - Stabilizes variance - Makes relationships more linear
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Log transform
X_log = np.log1p(X)  # log(1+X) handles zeros

# Power transform (Box-Cox requires strictly positive values;
# use method='yeo-johnson' if zeros/negatives are present)
pt = PowerTransformer(method='box-cox')
X_transformed = pt.fit_transform(X)
Feature Selection¶
Q: RFE vs Feature Importance — что лучше?¶
A:
RFE (Recursive Feature Elimination): - Trains model, removes weakest feature, repeat - Computationally expensive - More reliable ranking
Feature Importance: - Single model training - Faster - May be biased (high-cardinality)
Practical: - Quick: Feature importance from Random Forest - Important: RFE with cross-validation
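A sketch of RFE with cross-validation (RFECV) on a Random Forest:
from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestClassifier

selector = RFECV(RandomForestClassifier(n_estimators=200, random_state=42), step=1, cv=5)
selector.fit(X_train, y_train)
print(selector.n_features_)   # optimal number of features
print(selector.support_)      # boolean mask of selected features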
Q: Mutual Information vs Correlation для feature selection¶
A:
| Correlation | Mutual Information |
|---|---|
| Linear only | Any relationship |
| [-1, 1] scale | [0, ∞) scale |
| Fast to compute | Slower |
| Gaussian assumption | No assumption |
When MI > Correlation: - Non-linear relationships - Categorical features - Complex interactions
Сложные вопросы (Senior+)¶
Q: Выведите gradient для Logistic Regression с L2 regularization¶
A:
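A sketch of the standard derivation (binary case, \(\sigma\) = sigmoid, bias excluded from the penalty):

Regularized loss: $\(L(w) = -\frac{1}{n}\sum_{i=1}^{n}\big[y_i \log \sigma(w^Tx_i) + (1-y_i)\log(1-\sigma(w^Tx_i))\big] + \frac{\lambda}{2}\|w\|^2\)$

Using \(\sigma'(z) = \sigma(z)(1-\sigma(z))\), the per-sample gradient collapses to \((\sigma(w^Tx_i) - y_i)\,x_i\), so in matrix form:

$\(\nabla_w L = \frac{1}{n}X^T(\sigma(Xw) - y) + \lambda w\)$

Key point: the L2 term simply adds \(\lambda w\) to the unregularized gradient (and the bias \(b\) is usually not regularized).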
Q: Как работает Early Stopping?¶
A:
Algorithm: 1. Split train → train_sub + validation 2. Train, evaluate on validation each epoch 3. Track best validation score 4. Stop if no improvement for patience epochs 5. Return best model (not last!)
Practical:
from sklearn.base import clone
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.1)

best_score = 0
best_model = None
wait = 0
patience = 10
for epoch in range(max_epochs):
    model.fit(X_train, y_train)   # conceptual; real incremental training uses warm_start / partial_fit
    score = model.score(X_val, y_val)
    if score > best_score:
        best_score = score
        best_model = clone(model)  # keep the best model, not the last one
        wait = 0
    else:
        wait += 1
        if wait >= patience:
            break
Q: Stratified K-Fold vs K-Fold — когда что?¶
A:
| K-Fold | Stratified K-Fold |
|---|---|
| Random split | Preserve class ratios |
| Works for regression | Classification only |
| May have unbalanced folds | Balanced folds |
Когда Stratified: - Imbalanced classification - Small datasets - Rare classes
Когда K-Fold: - Regression - Large balanced datasets - Time series (use TimeSeriesSplit instead)
Model Interpretability (SHAP & LIME) — NEW 2026¶
Q: Зачем нужна интерпретируемость модели?¶
A:
Business reasons: - Regulatory compliance (GDPR "right to explanation") - Trust building with stakeholders - Debug model biases and errors - Feature leakage detection
Technical reasons: - Validate model behavior matches domain knowledge - Identify spurious correlations - Debug poor performance on specific cases
Q: SHAP vs LIME — в чём разница?¶
A:
| SHAP | LIME |
|---|---|
| Game-theoretic (Shapley values) | Local surrogate model |
| Consistent, additive | May be inconsistent |
| Global + local explanations | Local only |
| Slower (especially KernelSHAP) | Faster |
| Feature interactions visible | No interactions by default |
SHAP formula: $\(\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!(|N|-|S|-1)!}{|N|!}[f(S \cup \{i\}) - f(S)]\)$
LIME formula: $\(\xi(x) = \arg\min_{g \in G} \mathcal{L}(f, g, \pi_x) + \Omega(g)\)$
Q: Как интерпретировать SHAP values?¶
A:
Global interpretation: - Mean |SHAP| = feature importance - SHAP distribution = effect direction (positive/negative) - Dependence plots = feature interaction effects
Local interpretation: - SHAP value = contribution of feature to this prediction - Sum of all SHAP values = prediction - base_value - Positive SHAP = increases prediction - Negative SHAP = decreases prediction
import shap
# TreeExplainer для tree models (fast)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
# Визуализация
shap.summary_plot(shap_values, X)
shap.dependence_plot("feature_name", shap_values, X)
# Local explanation
shap.force_plot(explainer.expected_value, shap_values[0], X.iloc[0])
Q: Когда использовать SHAP, а когда LIME?¶
A:
Use SHAP when: - Need consistent explanations - Want global feature importance - Tree models (TreeExplainer is fast) - Budget allows for computation
Use LIME when: - Need quick local explanations - Any model type (model-agnostic) - Limited compute resources - Text or image explanations
Production tip: Pre-compute SHAP for common cases, use LIME for real-time ad-hoc explanations.
Q: Проблемы SHAP/LIME в production¶
A:
Challenges: 1. Computational cost: KernelSHAP needs many model calls 2. Stability: Explanations may vary between runs 3. Counterfactual: Doesn't tell "what if" (need different tools) 4. Human interpretation: Still requires ML knowledge to understand
Solutions: - TreeExplainer for tree models (exact, fast) - Pre-compute explanations for common inputs - Cache results - Use for debugging, not as sole explanation
Reinforcement Learning Basics¶
Q: В чём разница между value-based и policy-based методами?¶
A:
| Value-based | Policy-based |
|---|---|
| Учим Q(s,a) или V(s) | Учим π(a\|s) напрямую |
| Выбираем action через argmax | Сэмплируем из распределения |
| DQN, Q-learning | REINFORCE, A3C, PPO |
| Дискретные действия | Непрерывные действия |
| Sample efficient | Требует много эпизодов |
| Низкая variance (off-policy) | Высокая variance (Monte Carlo returns) |
Гибридный подход (Actor-Critic): Actor учит политику, Critic оценивает value function.
Q: Объясните Q-learning алгоритм¶
A:
Идея: Итеративно обновляем Q-values через Bellman equation.
Q-learning update: $\(Q(s,a) \leftarrow Q(s,a) + \alpha [r + \gamma \max_{a'} Q(s',a') - Q(s,a)]\)$
Ключевые компоненты: - \(\alpha\) — learning rate - \(\gamma\) — discount factor (0.9-0.99) - \(\epsilon\)-greedy — exploration vs exploitation
Deep Q-Network (DQN): - Q-function аппроксимируется нейросетью - Experience replay — учимся на past transitions - Target network — отдельная сеть для стабильности
# Q-learning update
q_table[state, action] += lr * (
reward + gamma * np.max(q_table[next_state]) - q_table[state, action]
)
Q: Что такое policy gradient? REINFORCE?¶
A:
Policy Gradient Theorem: $\(\nabla J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla \log \pi_\theta(a|s) \cdot Q^{\pi}(s,a)]\)$
Интуиция: Увеличиваем вероятность действий, которые привели к высокой награде.
REINFORCE algorithm: 1. Сэмплируем trajectory \(\tau\) из \(\pi_\theta\) 2. Вычисляем return \(G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k\) 3. Обновляем: \(\theta \leftarrow \theta + \alpha \nabla \log \pi_\theta(a_t|s_t) G_t\)
Проблема REINFORCE: Высокая variance (одна trajectory = noisy estimate).
Q: PPO — почему популярен?¶
A:
Proximal Policy Optimization решает проблему instability policy gradient.
Key idea: Не меняем политику слишком сильно за один шаг.
Clipped objective: $\(L^{CLIP}(\theta) = \mathbb{E}_t[\min(r_t(\theta)\hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t)]\)$
Где \(r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\) — probability ratio.
Преимущества: - Стабильный (clipping предотвращает большие обновления) - Sample efficient (reuses samples) - Простой в реализации - SOTA для многих RL задач
Q: Exploration vs Exploitation — как балансировать?¶
A:
Problem: Нужно исследовать новые действия (exploration) и использовать лучшие известные (exploitation).
Strategies:

1. \(\epsilon\)-greedy: with probability \(\epsilon\) take a random action, otherwise the best known action. Decay \(\epsilon\) from 1.0 to 0.01.
2. Upper Confidence Bound (UCB): $\(a_t = \arg\max_a [Q_t(a) + c\sqrt{\frac{\ln t}{N_t(a)}}]\)$ Balances exploitation (Q-value) and exploration (uncertainty term).
3. Thompson Sampling: Bayesian approach -- sample from the posterior over Q-values.
4. Entropy regularization: add \(-\beta H(\pi)\) to the loss to encourage diverse actions.

In practice: \(\epsilon\)-greedy for simplicity, UCB for bandits, entropy regularization for continuous control.
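A minimal \(\epsilon\)-greedy sketch with decay (tabular Q-learning setting, names assumed):
import numpy as np

epsilon, eps_min, eps_decay = 1.0, 0.01, 0.995

def select_action(q_table, state, n_actions):
    # Explore with probability epsilon, otherwise exploit the best known action
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(q_table[state]))

# After each episode: decay exploration
epsilon = max(eps_min, epsilon * eps_decay)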
NLP: Word Embeddings (Word2Vec, GloVe)¶
Q: В чём разница между CBOW и Skip-gram?¶
A:
| CBOW | Skip-gram |
|---|---|
| Предсказывает center word по context | Предсказывает context по center word |
| Быстрее на частых словах | Лучше на редких словах |
| Сглаживает context (averaging) | Точный для каждого context word |
| \(P(w_t \| w_{t-c}, ..., w_{t+c})\) | \(P(w_{t-c}, ..., w_{t+c} \| w_t)\) |
CBOW: Вход — one-hot контекстных слов → averaging → hidden → softmax для center word.
Skip-gram: Вход — one-hot center word → hidden → K независимых softmax для каждого context слова.
На практике: Skip-gram с negative sampling — стандарт (word2vec Google News).
Q: Как работает Negative Sampling?¶
A:
Проблема: Softmax над всем словарём (100K+ слов) — дорого на каждый training step.
Решение: Заменить softmax на binary classification для каждого примера.
Original softmax: $\(P(w_o | w_i) = \frac{\exp(v_{w_o}^T v_{w_i})}{\sum_{w \in V} \exp(v_w^T v_{w_i})}\)$
Negative Sampling objective: $\(L = \log \sigma(v_{w_o}^T v_{w_i}) + \sum_{k=1}^{K} \mathbb{E}_{w_k \sim P_n(w)}[\log \sigma(-v_{w_k}^T v_{w_i})]\)$
Где K = 5-20 negative samples, \(P_n(w) \propto f(w)^{3/4}\) (freq^0.75 — повышает редкие слова).
Идея: Положительная пара (center, context) + K отрицательных пар (center, random word).
import torch

# Negative sampling loss (center, context: 1D embedding tensors; negative_samples: 2D [K, dim])
def negative_sampling_loss(center, context, negative_samples):
    # Positive: center-context pair
    pos_score = torch.dot(center, context)
    pos_loss = -torch.log(torch.sigmoid(pos_score))
    # Negative: center-random pairs
    neg_scores = torch.matmul(negative_samples, center)
    neg_loss = -torch.sum(torch.log(torch.sigmoid(-neg_scores)))
    return pos_loss + neg_loss
Q: Word2Vec vs GloVe — в чём разница?¶
A:
| Word2Vec | GloVe |
|---|---|
| Predictive (local context) | Count-based (global co-occurrence) |
| Skip-gram / CBOW | Matrix factorization |
| Sliding window | Co-occurrence matrix |
| Online learning | Batch (matrix ops) |
| Нет explicit global info | Captures global statistics |
GloVe objective: $\(J = \sum_{i,j=1}^{V} f(X_{ij})(w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij})^2\)$
Где \(X_{ij}\) — co-occurrence count, \(f\) — weighting function (снижает слишком частые пары).
Практика: GloVe часто лучше на analogies, Word2Vec — на downstream tasks с fine-tuning.
Q: Почему word embeddings capture semantics?¶
A:
Distributional Hypothesis: "You shall know a word by the company it keeps" (Firth, 1957).
Механизм: 1. Similar words appear in similar contexts 2. Model learns to predict context → similar vectors for similar contexts 3. Vector space reflects distributional similarity
Аналогии: \(\vec{king} - \vec{man} + \vec{woman} \approx \vec{queen}\)
Ограничения: - Polysemy: "bank" (river vs financial) — один вектор - Antonyms могут быть близки (similar context) - No compositional meaning
Современные решения: Contextualized embeddings (BERT, ELMo) — разные векторы для разных контекстов.
Q: Что такое FastText и чем отличается от Word2Vec?¶
A:
FastText (Facebook AI Research, 2016) — extension Word2Vec с subword information.
Ключевое отличие: Представляет слово как bag of character n-grams:
- "apple" → ["
Формула: $\(\vec{w} = \sum_{g \in G_w} \vec{z}_g\)$
Где \(G_w\) — множество n-grams для слова \(w\).
Преимущества: 1. OOV handling: Может создать embedding для неизвестных слов 2. Morphology: Captures "running", "runner", "runs" share patterns 3. Rare words: Лучше для редких слов (shared subwords) 4. Multilingual: Works well для morphologically rich languages (Russian, German)
Недостатки: 1. Memory: Больше векторов (n-grams vs words) 2. Noise: Subwords могут вносить noise 3. Slower: Больше parameters
import fasttext
# Train FastText
model = fasttext.train_unsupervised(
'data.txt',
model='skipgram',
dim=300,
ws=5, # window size
minCount=5,
minn=3, # min n-gram
maxn=6 # max n-gram
)
# OOV handling — works!
embedding = model.get_word_vector('unprecedentedword')
Q: Word2Vec vs GloVe vs FastText — когда что использовать?¶
A:
| Criteria | Word2Vec | GloVe | FastText |
|---|---|---|---|
| Training | Predictive (local) | Count-based (global) | Predictive + subwords |
| OOV handling | ❌ No | ❌ No | ✅ Yes (via subwords) |
| Memory | Low | High (co-occ matrix) | Medium-High |
| Speed | Fast | Medium | Medium |
| Rare words | Poor | Medium | Good |
| Morphology | No | No | Yes |
| Best for | General NLP, speed | Analogy tasks | OOV, morphological langs |
Decision framework:
# Choose Word2Vec when:
# - Speed priority
# - Well-defined vocabulary (no OOV expected)
# - Limited compute
# Choose GloVe when:
# - Global context matters
# - Analogy tasks important
# - Clean, large corpus available
# Choose FastText when:
# - OOV words common (user-generated content)
# - Morphologically rich language (Russian, Finnish)
# - Domain-specific vocabulary
2026 Context: Static embeddings → less common with transformers, but still useful for: - Lightweight production systems - Resource-constrained environments - Baseline comparisons - Word similarity tasks
Q: Как оценить качество word embeddings?¶
A:
Intrinsic evaluation: 1. Analogy tests: a:b :: c:? (Google analogy dataset) 2. Similarity correlation: Spearman с human judgments (WordSim-353, SimLex-999) 3. Concept categorization: Clustering quality (WordNet)
Extrinsic evaluation: 1. Downstream task performance (NER, sentiment, QA) 2. Probe tasks (part-of-speech, syntactic tree depth)
Practical: Intrinsic — для разработки, Extrinsic — для production.
# Cosine similarity for word vectors
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def word_similarity(w1, w2, embeddings):
    v1 = embeddings[w1]
    v2 = embeddings[w2]
    return cosine_similarity([v1], [v2])[0][0]

# Analogy: king - man + woman = ?
def analogy(a, b, c, embeddings):
    """Return the word closest to vec(a) - vec(b) + vec(c)."""
    target = embeddings[a] - embeddings[b] + embeddings[c]
    # Find nearest word to target (embeddings assumed gensim-like:
    # .vectors matrix and .index_to_key list; exclude a, b, c in practice)
    similarities = cosine_similarity([target], embeddings.vectors)
    return embeddings.index_to_key[np.argmax(similarities)]
NLP: Named Entity Recognition (NER) & Sequence Labeling¶
Q: Что такое NER и как оценивать?¶
A:
Named Entity Recognition — задача извлечения именованных сущностей (Person, Organization, Location, Date, etc.) из текста.
Формат: BIO tagging (Begin-Inside-Outside) - B-PER, I-PER — person name - B-ORG, I-ORG — organization - O — not an entity
Метрики:
Token-level: - Precision, Recall, F1 для каждого класса - Micro vs Macro averaging
Entity-level (строже): - Exact match: границы и тип должны совпасть - Partial match: overlap > threshold
CoNLL-2003 standard: Entity-level F1.
Q: CRF vs BiLSTM vs BERT для NER?¶
A:
| CRF | BiLSTM-CRF | BERT |
|---|---|---|
| Hand-crafted features | Learned features | Contextualized embeddings |
| No deep learning | Sequence model | Pretrained transformer |
| Fast inference | Medium | Slow (but fine-tuning helps) |
| Works on small data | Needs more data | Transfer learning |
BiLSTM-CRF: - BiLSTM: contextual representations - CRF layer: learns transition constraints (I-PER after B-PER, not I-ORG)
BERT for NER: - Fine-tune BERT + linear classifier - Subword tokenization → use first subword for entity - SOTA on CoNLL-2003 (93+ F1)
Q: Как обрабатывать nested entities?¶
A:
Problem: "University of California" — ORG, но "California" внутри — LOC.
Approaches: 1. Flat NER: Игнорировать вложенность (стандартный подход) 2. Layered NER: Два прохода — сначала outer, потом inner entities 3. Hypergraph decoding: Joint prediction всех уровней 4. Seq2seq: Generate entity spans with markers
Практика: Большинство систем — flat NER, nested — отдельная post-processing или специализированные модели.
Q: POS Tagging — основные подходы?¶
A:
Part-of-Speech Tagging — присвоение грамматических категорий (NOUN, VERB, ADJ, etc.) словам.
Approaches:
1. HMM (Hidden Markov Model): $\(P(t_1^n, w_1^n) = \prod_{i=1}^{n} P(w_i | t_i) P(t_i | t_{i-1})\)$
   - Emission: \(P(w|t)\) — word given tag
   - Transition: \(P(t_i | t_{i-1})\) — tag bigram
   - Viterbi decoding
2. CRF (Conditional Random Field): $\(P(t|w) = \frac{1}{Z(w)} \exp(\sum_i \theta \cdot f(t_{i-1}, t_i, w, i))\)$
   - Features: word, suffix, prefix, neighboring tags
   - Global normalization
3. BiLSTM / BiLSTM-CRF:
   - Learned features, no manual feature engineering
4. BERT fine-tuning:
   - Contextual representations
   - 97%+ accuracy on Penn Treebank
Practical: BERT для high accuracy, HMM/CRF для скорости и interpretability.
Hyperparameter Optimization¶
Q: Parameters vs Hyperparameters — разница?¶
A:
| Parameters | Hyperparameters |
|---|---|
| Learned from data during training | Set before training |
| Internal to model (weights, biases) | Control learning process |
| Optimized by optimizer (SGD, Adam) | Set by practitioner or search |
| Examples: weights in NN, coefficients in regression | Examples: learning rate, batch size, num layers |
Hyperparameters determine HOW model learns, parameters are WHAT model learns.
Q: Grid Search vs Random Search?¶
A:
Grid Search: Exhaustive search over all combinations.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
param_grid = {
'C': [0.1, 1, 10],
'kernel': ['rbf', 'linear'],
'gamma': ['scale', 'auto']
}
grid_search = GridSearchCV(SVC(), param_grid, cv=5)
grid_search.fit(X_train, y_train)
Random Search: Sample random combinations.
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC
from scipy.stats import loguniform
param_dist = {
'C': loguniform(1e-3, 1e3),
'kernel': ['rbf', 'linear'],
'gamma': ['scale', 'auto']
}
random_search = RandomizedSearchCV(SVC(), param_dist, n_iter=50, cv=5)
When Random > Grid: - Most hyperparameters don't matter much (only a few are important) - Random search explores more values for important params - Paper: "Random Search for Hyper-Parameter Optimization" (Bergstra & Bengio, 2012)
Q: Что такое Bayesian Optimization?¶
A:
Idea: Build probabilistic model of objective function, use it to guide search.
Components: 1. Surrogate model: Gaussian Process (GP) approximates f(x) 2. Acquisition function: Decides where to sample next (balance exploration vs exploitation)
Acquisition functions: - Expected Improvement (EI): \(EI(x) = \mathbb{E}[\max(f(x) - f^*, 0)]\) - Upper Confidence Bound (UCB): \(UCB(x) = \mu(x) + \beta \sigma(x)\) - Probability of Improvement (PI): \(PI(x) = P(f(x) > f^*)\)
Optuna example:
import optuna
def objective(trial):
    lr = trial.suggest_float('lr', 1e-5, 1e-1, log=True)
    batch_size = trial.suggest_categorical('batch_size', [32, 64, 128])
    layers = trial.suggest_int('layers', 1, 5)
    model = build_model(lr, batch_size, layers)
    score = train_and_evaluate(model)
    return score

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
When Bayesian > Random: - Expensive evaluations (training takes hours) - Low-dimensional search space (<20 params) - Smooth objective function
Q: Optuna vs Ray Tune — когда что?¶
A:
| Aspect | Optuna | Ray Tune |
|---|---|---|
| Focus | Single-node optimization | Distributed at scale |
| Sampling | TPE, CMA-ES, GP | Same + population-based |
| Distributed | Via RDB/Redis | Native Ray cluster |
| Early stopping | Pruning (Median, Async) | PBT, ASHA, Hyperband |
| Integration | Sklearn, PyTorch, TF | PyTorch, TF, XGBoost |
| Ease of use | Simpler API | More complex |
Use Optuna when: - Single machine or small cluster - Need sophisticated sampling (TPE, CMA-ES) - Simpler setup
Use Ray Tune when: - Large-scale distributed training - Population-Based Training (PBT) - Already using Ray ecosystem
Q: Что такое Early Stopping в HPO?¶
A:
Problem: Many trials are bad — stop them early to save compute.
Approaches:
1. Median Pruning (Optuna):
pruner = optuna.pruners.MedianPruner(n_startup_trials=5, n_warmup_steps=10)
study = optuna.create_study(pruner=pruner)
Mechanism: At step k, if trial's intermediate value < median of previous trials, prune.
2. ASHA (Async Successive Halving): - Run many trials with minimal resources - Promote top performers to more resources - Early stop underperformers
3. Hyperband: - Multiple brackets of ASHA with different resource allocations - Better theoretical guarantees
Q: Как приоритизировать гиперпараметры для тюнинга?¶
A:
Priority order (higher = tune first):
- Learning rate — biggest impact on convergence
- Batch size — affects generalization and speed
- Optimizer — Adam vs SGD with momentum
- Architecture — layers, units per layer
- Regularization — dropout, weight decay
- Data augmentation — for vision
Coarse-to-fine strategy:
# Stage 1: Coarse search
lr_range = [1e-4, 1e-3, 1e-2, 1e-1] # Log scale
# Stage 2: Fine search around best
best_lr = 1e-3
lr_range = [5e-4, 1e-3, 2e-3, 5e-3]
Q: Nested Cross-Validation для HPO — зачем?¶
A:
Problem: Using same CV split for HPO and evaluation → overfitting to validation set.
Solution: Nested CV — inner loop for HPO, outer loop for evaluation.
Outer CV (k=5 folds):
For each fold:
Inner CV (k=5 folds):
GridSearchCV on training portion
Evaluate best model on outer test fold
from sklearn.model_selection import KFold, GridSearchCV, cross_val_score

# Inner: HPO
inner_cv = KFold(n_splits=5)
clf = GridSearchCV(estimator=model, param_grid=param_grid, cv=inner_cv)

# Outer: Evaluation
outer_cv = KFold(n_splits=5)
nested_score = cross_val_score(clf, X, y, cv=outer_cv)
Trade-off: 5×5 = 25 model fits per HPO candidate → expensive but unbiased.
Q: Sensitivity Analysis для гиперпараметров?¶
A:
Goal: Understand which hyperparameters matter most.
Methods:
1. One-at-a-time (OAT): Vary one param, fix others. - Simple but misses interactions
2. Morris Method: Measure elementary effects. $\(EE_i = \frac{f(x_1, ..., x_i + \Delta, ..., x_k) - f(x)}{\Delta}\)$
3. Sobol Indices: Variance-based decomposition. - \(S_i\) = first-order (main effect) - \(S_{Ti}\) = total effect (including interactions)
4. fANOVA (for Optuna):
import optuna
from optuna.importance import FanovaImportanceEvaluator
study = optuna.create_study()
study.optimize(objective, n_trials=100)
importance = optuna.importance.get_param_importances(
study, evaluator=FanovaImportanceEvaluator()
)
Output: {'lr': 0.45, 'batch_size': 0.30, 'layers': 0.15, 'dropout': 0.10}
Q: Multi-objective HPO — как балансировать accuracy и latency?¶
A:
Problem: Maximize accuracy, minimize latency — conflicting objectives.
Approaches:
1. Scalarization: $\(L = \alpha \cdot (1 - accuracy) + (1 - \alpha) \cdot \frac{latency}{max\_latency}\)$
2. Pareto Front: Find set of solutions where no objective can improve without worsening another.
Optuna multi-objective:
def objective(trial):
    lr = trial.suggest_float('lr', 1e-4, 1e-1, log=True)
    model = build_model(lr)                       # placeholder model constructor
    accuracy, latency = train_and_profile(model)  # placeholder training/profiling
    return accuracy, latency  # maximize accuracy, minimize latency

study = optuna.create_study(directions=['maximize', 'minimize'])
study.optimize(objective, n_trials=100)

# Get Pareto front
pareto_trials = study.best_trials
Decision: Choose from Pareto front based on business constraints.
Active Learning¶
Q: Что такое Active Learning?¶
A:
Definition: ML paradigm where algorithm strategically selects most informative samples for labeling, reducing annotation cost.
Key insight: Not all samples equally valuable — some provide more information than others.
Active learning loop: 1. Start with small labeled set \(L\), large unlabeled pool \(U\) 2. Train model on \(L\) 3. Query oracle for labels of most informative samples from \(U\) 4. Add newly labeled samples to \(L\) 5. Repeat until budget exhausted or target accuracy reached
Goal: Achieve target accuracy with minimum labeling cost.
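A minimal pool-based loop sketch (uncertainty sampling; the oracle is a placeholder for a human annotator, and in practice you would track original pool indices):
import numpy as np

def active_learning_loop(model, X_labeled, y_labeled, X_pool, oracle, budget, batch_size=10):
    for _ in range(budget // batch_size):
        model.fit(X_labeled, y_labeled)
        # Query the least-confident pool samples
        probas = model.predict_proba(X_pool)
        uncertainty = 1 - probas.max(axis=1)
        query_idx = np.argsort(uncertainty)[-batch_size:]
        # Oracle labels them; move from pool to labeled set
        X_labeled = np.vstack([X_labeled, X_pool[query_idx]])
        y_labeled = np.concatenate([y_labeled, oracle(query_idx)])
        X_pool = np.delete(X_pool, query_idx, axis=0)
    return model.fit(X_labeled, y_labeled)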
Q: Query strategies — Uncertainty Sampling?¶
A:
Core idea: Query samples where model is most uncertain.
Metrics:
1. Least Confidence: $\(x^* = \arg\max_x (1 - P(\hat{y}|x))\)$
Query samples with lowest max probability.
import numpy as np

def least_confidence(probas):
    # probas: (n_samples, n_classes)
    max_proba = probas.max(axis=1)
    return np.argmax(1 - max_proba)  # Most uncertain
2. Margin Sampling: $\(x^* = \arg\min_x (P(\hat{y}_1|x) - P(\hat{y}_2|x))\)$
Query samples where top two classes are closest.
def margin_sampling(probas):
    # Sort probabilities
    sorted_probas = np.sort(probas, axis=1)[:, ::-1]
    margins = sorted_probas[:, 0] - sorted_probas[:, 1]
    return np.argmin(margins)  # Smallest margin
3. Entropy: $\(x^* = \arg\max_x \left(-\sum_c P(y_c|x) \log P(y_c|x)\right)\)$
Query samples with highest prediction entropy.
def entropy_sampling(probas):
    # Entropy: -sum(p * log(p))
    eps = 1e-10
    entropy = -np.sum(probas * np.log(probas + eps), axis=1)
    return np.argmax(entropy)  # Highest entropy
Comparison:

| Strategy | Best for | Limitation |
|---|---|---|
| Least Confidence | Binary classification | Ignores class distribution |
| Margin | Multi-class | Only considers top 2 |
| Entropy | Multi-class | Computationally heavier |
Q: Query-by-Committee (QBC)?¶
A:
Idea: Train multiple models (committee), query samples with highest disagreement.
Disagreement measures:
1. Vote Entropy: $\(x^* = \arg\max_x \left(-\sum_c \frac{V_c}{C} \log \frac{V_c}{C}\right)\)$
Where \(V_c\) = votes for class \(c\), \(C\) = committee size.
2. Kullback-Leibler Divergence: $\(x^* = \arg\max_x \frac{1}{C} \sum_{c=1}^{C} D_{KL}(P(y|x;\theta_c) \| P(y|x))\)$
Implementation:
import numpy as np

class QueryByCommittee:
    def __init__(self, n_models=5):
        # create_model() is a placeholder factory for a base classifier
        self.models = [create_model() for _ in range(n_models)]

    def fit(self, X, y):
        for model in self.models:
            # Bootstrap sample for diversity
            idx = np.random.choice(len(X), len(X), replace=True)
            model.fit(X[idx], y[idx])

    def query(self, X_pool, n_samples=1):
        # Collect predictions
        predictions = np.array([
            model.predict_proba(X_pool) for model in self.models
        ])  # (n_models, n_pool, n_classes)
        # Vote entropy
        votes = np.argmax(predictions, axis=2)  # (n_models, n_pool)
        vote_counts = np.apply_along_axis(
            lambda x: np.bincount(x, minlength=predictions.shape[2]),
            axis=0, arr=votes
        )  # (n_classes, n_pool)
        vote_probas = vote_counts / len(self.models)
        entropy = -np.sum(vote_probas * np.log(vote_probas + 1e-10), axis=0)
        return np.argsort(entropy)[-n_samples:]
Q: Expected Model Change?¶
A:
Idea: Query samples that would cause largest change in model if labeled.
Expected Gradient Length (EGL): $\(x^* = \arg\max_x \mathbb{E}_{y \sim P(y|x)} \|\nabla L(x, y)\|\)$
Intuition: If gradient would be large regardless of label, sample is informative.
import torch

def expected_gradient_length(model, x, possible_labels):
    # compute_loss and model.predict_proba are placeholders for your model API
    total_grad_norm = 0
    for y in possible_labels:
        # Compute loss gradient for this label
        loss = compute_loss(model, x, y)
        grads = torch.autograd.grad(loss, model.parameters())
        grad_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        # Weight by probability of this label
        prob = model.predict_proba(x)[y]
        total_grad_norm += prob * grad_norm
    return total_grad_norm
Pros: Theoretically motivated, considers impact on model Cons: Computationally expensive (requires gradients for each candidate)
Q: Diversity-based sampling?¶
A:
Problem: Uncertainty sampling may select redundant samples.
Solution: Balance uncertainty with diversity.
Core-set selection: $\(\min_{S \subseteq U} \max_{x \in U} \min_{s \in S} d(x, s)\)$
Find subset \(S\) that covers unlabeled pool well.
Coreset via k-Center:
import numpy as np
from scipy.spatial.distance import cdist

def k_center_selection(X_pool, n_samples, already_labeled=None):
    """Greedy k-center for diverse selection."""
    selected = []
    if already_labeled is not None:
        # Start with distances to the already labeled points
        min_distances = cdist(X_pool, already_labeled).min(axis=1)
    else:
        # No labeled data yet: seed with a random point
        first = np.random.randint(len(X_pool))
        selected.append(first)
        min_distances = cdist(X_pool, X_pool[first:first + 1]).flatten()
    while len(selected) < n_samples:
        # Pick the point furthest from everything selected/labeled so far
        next_idx = np.argmax(min_distances)
        selected.append(next_idx)
        # Update each point's distance to its nearest selected point
        new_dists = cdist(X_pool, X_pool[next_idx:next_idx + 1]).flatten()
        min_distances = np.minimum(min_distances, new_dists)
    return selected
BADGE (Batch Active Learning by Diverse Gradient Embeddings): - Combine uncertainty + diversity - Embed samples using gradient embeddings - k-means++ selection in embedding space
Q: Когда Active Learning НЕ эффективен?¶
A:
Failure cases:
1. Very small initial labeled set:
   - Model too weak to identify informative samples
   - Random sampling may be better initially
2. Highly imbalanced data:
   - May oversample minority class unnecessarily
   - Or ignore rare but important samples
3. Clustered data structure:
   - May miss entire clusters if initial samples don't cover them
   - Solution: Combine with diversity sampling
4. Noisy labels:
   - Querying uncertain samples may amplify noise
   - Solution: Label smoothing, robust loss
5. Budget too small:
   - Active learning overhead > benefit
   - Random sampling competitive for <100 samples
Rule of thumb: Active learning shines when: - Labeling cost >> computation cost - 100+ queries budget - Model has reasonable base accuracy (>50%)
Q: Active Learning в production — best practices?¶
A:
Implementation checklist:
1. Start with diversity: the first labeled batch should cover the feature space, not only uncertain regions.

2. Combine strategies (original snippet restored; the wrapper name and the two sampling helpers are placeholders):

import numpy as np

def combined_query(model, X_pool, batch_size):
    # 70% uncertainty + 30% diversity
    n_uncertain = int(0.7 * batch_size)
    n_diverse = batch_size - n_uncertain
    uncertain = uncertainty_sampling(model, X_pool, n_uncertain)
    remaining_pool = np.setdiff1d(np.arange(len(X_pool)), uncertain)
    diverse = diversity_sampling(X_pool[remaining_pool], n_diverse)
    return np.concatenate([uncertain, diverse])

3. Cold start handling:
   - First 50-100 samples: random or stratified
   - After model shows promise: switch to active learning

4. Human-in-the-loop:
   - Show model confidence to annotator
   - Allow annotator to flag "don't know" or "bad sample"
   - Track annotator agreement

5. Stopping criteria:
   - Model accuracy plateaus
   - Budget exhausted
   - Remaining samples all low uncertainty
Tools: - Modal: Active learning platform - Label Studio: Annotation with active learning plugin - SuperAnnotate: Computer vision active learning - Prodigy: NLP active learning
Time Series: Deep Learning Methods¶
Q: DeepAR — как работает?¶
A:
Architecture: Autoregressive RNN с probabilistic output.
Key features: 1. Global model: Learns from multiple related time series 2. Autoregressive: Uses past values as input 3. Probabilistic: Outputs distribution (Gaussian with mean + std) 4. Covariates: Can include time-dependent and static features
Training: $\(p(y_{t:T} | y_{1:t}, x_{1:T}) = \prod_{t'=t}^{T} p(y_{t'} | y_{1:t'-1}, x_{1:T}, \theta)\)$
Inference: Sample from predicted distribution → prediction intervals.
import numpy as np

# DeepAR prediction (conceptual; model(context) returns a distribution object)
def predict_deepar(model, context, num_samples=100):
    samples = []
    for _ in range(num_samples):
        # Sample from the predicted distribution at each step
        pred_dist = model(context)  # Gaussian(mean, std)
        sample = pred_dist.sample()
        samples.append(sample)
    return {
        'mean': np.mean(samples, axis=0),
        'std': np.std(samples, axis=0),
        'quantiles': np.quantile(samples, [0.1, 0.5, 0.9], axis=0),
    }
Advantages over ARIMA: - Handles multiple related series (learns globally) - Works with covariates - Produces probabilistic forecasts - Can handle cold start with item features
Q: Temporal Fusion Transformer (TFT)?¶
A:
Architecture: 1. Variable Selection Network: Learns which features are important 2. Static Covariate Encoder: Processes time-invariant features 3. Gated Residual Network (GRN): Non-linear processing with skip connections 4. Multi-head Attention: Learns temporal dependencies + interpretability 5. Quantile Regression: Predicts multiple quantiles for intervals
Three input types: - Static: Product category, store location - Known future: Holidays, promotions (available at prediction time) - Historical: Past sales, weather (only available from past)
Key innovation — Interpretability: - Variable importance: Which features matter - Attention weights: Which past time steps matter - Seasonal patterns: Via attention visualization
# TFT attention interpretation
attention_weights = model.get_attention_weights(x) # (batch, heads, seq_len)
# Identify which past steps influence predictions
important_steps = attention_weights.mean(dim=(0, 1)).argsort(descending=True)[:5]
When to use TFT: - Multiple known future covariates - Need interpretability - Complex temporal patterns - Long-range dependencies
Q: Prophet vs ARIMA vs Deep Learning?¶
A:
| Method | Strengths | Weaknesses | Best For |
|---|---|---|---|
| ARIMA | Interpretable, well-understood | Single series, manual tuning | Clean univariate |
| Prophet | Multiple seasonalities, holidays | Less accurate, no covariates | Business forecasting |
| DeepAR | Global learning, covariates | Needs many series | Related series |
| TFT | Interpretability, all covariates | Complex, needs data | Complex systems |
| N-BEATS | Pure DL, no features | Black box | Pure DL forecasting |
Prophet model: $\(y(t) = g(t) + s(t) + h(t) + \varepsilon_t\)$
Where: - \(g(t)\) = trend (piecewise linear or logistic) - \(s(t)\) = seasonality (Fourier series) - \(h(t)\) = holiday effects
from prophet import Prophet
model = Prophet(
yearly_seasonality=True,
weekly_seasonality=True,
daily_seasonality=False
)
model.add_country_holidays(country_name='US')
model.fit(df) # df with 'ds' (date) and 'y' (value) columns
forecast = model.predict(future_df)
Q: Time Series Cross-Validation?¶
A:
Critical: Never use random split — temporal order must be preserved!
Rolling origin (expanding window):
Fold 1: Train [0:100], Test [100:120]
Fold 2: Train [0:120], Test [120:140]
Fold 3: Train [0:140], Test [140:160]
Sliding window:
Fold 1: Train [0:100], Test [100:120]
Fold 2: Train [20:120], Test [120:140]
Fold 3: Train [40:140], Test [140:160]
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # Train and evaluate
Metrics: - MAPE: Mean Absolute Percentage Error = \(\frac{100\%}{n}\sum|\frac{y_i - \hat{y}_i}{y_i}|\) - MASE: Mean Absolute Scaled Error = \(\frac{MAE}{MAE_{naive}}\) - RMSE: Root Mean Squared Error - WMAPE: Weighted MAPE = \(\frac{\sum|y_i - \hat{y}_i|}{\sum|y_i|}\)
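Minimal reference implementations of these metrics (numpy arrays assumed):
import numpy as np

def mape(y_true, y_pred):
    return 100 * np.mean(np.abs((y_true - y_pred) / y_true))

def wmape(y_true, y_pred):
    return np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true))

def mase(y_true, y_pred, y_train, m=1):
    # Scale by the in-sample MAE of the (seasonal) naive forecast with lag m
    naive_mae = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return np.mean(np.abs(y_true - y_pred)) / naive_mae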
Q: N-BEATS architecture?¶
A:
Key idea: Stack of fully-connected blocks with forward and backward residuals.
Architecture: 1. Each block has two outputs: - Forward: Forecast (contribution to final prediction) - Backward: Backcast (explains input, removed for next block)
2. Two configurations:
- Generic: Learns any pattern
- Interpretable: Separate trend + seasonality blocks
Formula: $\(\hat{y} = \sum_{b=1}^{B} \hat{y}_b, \quad x_{b+1} = x_b - \hat{x}_b\)$
Where \(\hat{y}_b\) = forecast from block \(b\), \(\hat{x}_b\) = backcast from block \(b\).
Advantages: - Pure deep learning (no feature engineering) - Interpretable mode separates trend/seasonality - Competitive with M4 competition winner (Smyl's ES-RNN); outperformed other neural methods on M4 benchmarks
Explainable AI (XAI): SHAP & LIME¶
Q: Зачем нужен XAI в production?¶
A: 4 ключевые причины:
- Regulatory Compliance: EU AI Act (2024), GDPR right to explanation — high-risk AI системы обязаны объяснять решения
- Trust Building: 78% enterprise AI rejected due to lack of interpretability (2025)
- Debugging: XAI помогает найти почему модель ошибается
- Bias Detection: Выявление unfair patterns в predictions
Q: SHAP vs LIME — в чём разница?¶
A:
| Критерий | SHAP | LIME |
|---|---|---|
| Теория | Game theory (Shapley values) | Local surrogate models |
| Гарантии | Consistency, Additivity, Efficiency | Local fidelity only |
| Скорость | TreeSHAP: ~65ms, KernelSHAP: ~450ms | ~85ms (tabular) |
| Stability | 95% | 82% |
| Memory | TreeSHAP: 78MB, KernelSHAP: 680MB | 92MB |
| Model-specific | TreeSHAP, DeepSHAP, LinearSHAP | Model-agnostic |
SHAP formula: $\(\phi_i(f, x) = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!(|N|-|S|-1)!}{|N|!} [f(S \cup \{i\}) - f(S)]\)$
LIME formula: $\(\xi(x) = \arg\min_{g \in G} \mathcal{L}(f, g, \pi_x) + \Omega(g)\)$
Где \(\pi_x(z) = \exp(-D(x, z)^2 / \sigma^2)\) — kernel weighting.
Q: Когда использовать SHAP, а когда LIME?¶
A:
Выбирай SHAP когда: - Tree-based модели (Random Forest, XGBoost, LightGBM) — TreeSHAP exact + fast - Нужны global explanations (summary plots, dependence plots) - Regulated industries (finance, healthcare) — theoretical rigor важен - Comparing feature importance across instances
Выбирай LIME когда: - Novel architectures без SHAP implementation - Нужно quick explanation для single prediction - Stakeholders non-technical — local linear понятнее - Ограниченные compute resources
Best practice: Hybrid approach — SHAP для production monitoring, LIME для ad-hoc investigations.
Q: Как SHAP обеспечивает consistency?¶
A: 4 математических свойства (axioms):
- Efficiency: \(\sum_{i=1}^{M} \phi_i = f(x) - E[f(X)]\) — сумма SHAP values = deviation from baseline
- Symmetry: Если features вносят одинаковый вклад во все coalitions → равные SHAP values
- Dummy: Features которые не влияют на prediction → SHAP = 0
- Additivity: Для ensemble: SHAP_total = SHAP_model1 + SHAP_model2
Эти гарантии делают SHAP единственным методом удовлетворяющим всем desiderata одновременно.
Q: LIME instability — как решить?¶
A: Problem: Small input changes → significantly different explanations (18% variance).
Solutions:
- Multiple runs + average: усреднение explanations по нескольким запускам (см. sketch ниже)
- Increase num_samples: Default 5000, increase to 15000+ for stability
- Cross-validation on explanations: Run LIME multiple times, check variance
- Use SHAP instead: 95% stability vs 82% for LIME
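Sketch усреднения LIME explanations по нескольким запускам (функция условная; ключи -- строковые описания фич из LIME, для числовых фич бины могут немного отличаться между запусками):
from collections import defaultdict
import numpy as np

def averaged_lime_weights(explainer, instance, predict_fn, n_runs=10, num_features=5):
    weights = defaultdict(list)
    for _ in range(n_runs):
        exp = explainer.explain_instance(instance, predict_fn, num_features=num_features)
        for feature, w in exp.as_list():
            weights[feature].append(w)
    # Усреднённый вес и разброс по запускам
    return {f: (np.mean(ws), np.std(ws)) for f, ws in weights.items()}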
Q: Как интерпретировать SHAP values?¶
A:
For single prediction: - \(\phi_i > 0\) → feature i pushes prediction UP - \(\phi_i < 0\) → feature i pushes prediction DOWN - \(|\phi_i|\) = magnitude of contribution
Example (Credit Approval):
Income: +0.35 (pushes toward approval)
CreditScore: +0.30 (pushes toward approval)
Debt: -0.22 (pushes toward rejection)
Age: +0.08 (small positive)
Base value: 0.50 (average approval rate)
Final: 0.50 + 0.35 + 0.30 - 0.22 + 0.08 = 1.01 → APPROVED (в этом игрушечном примере сумма вышла > 1; на практике SHAP для вероятностных моделей обычно считается в log-odds/margin space и только потом переводится в вероятность)
Global interpretation: - Summary plot: Feature importance ranking across all instances - Dependence plot: How feature value affects SHAP value - Interaction plot: Feature interactions
Q: SHAP для deep learning — какие подходы?¶
A:
- DeepSHAP: Combines SHAP with DeepLIFT backpropagation
  - Fast for neural networks
  - Uses gradient * input decomposition
- GradientSHAP: Integrates gradients with SHAP
  - Works for any differentiable model
  - More expensive but theoretically sound
- PartitionSHAP: For hierarchical models (Transformers)
  - Handles attention layers properly
import shap
# DeepSHAP for PyTorch
explainer = shap.DeepExplainer(model, background_data)
shap_values = explainer.shap_values(test_data)
# GradientSHAP
explainer = shap.GradientExplainer(model, background_data)
shap_values = explainer.shap_values(test_data)
Q: Production XAI pipeline — как построить?¶
A:
Architecture:
Client Request → API Gateway → XAI Engine → Explanation Cache → Response
↓
Model Registry
↓
Monitoring Service
Key components:
- Precomputation: Cache common explanations during training
- Adaptive sampling: Early stopping when explanation stabilizes
- Redis cache: Store precomputed SHAP values
- Fallback: LIME for cold-start, SHAP for cached
Performance optimizations: - Caching: Reduce latency from 2.1s → 120ms - Batch explanations: Compute SHAP for multiple instances together - TreeSHAP for tree models: 10x faster than KernelSHAP
Q: Common failure modes XAI?¶
A:
- Correlated features: SHAP/LIME underestimate importance when features are highly correlated
  - Solution: Group correlated features, use conditional expectations
- Out-of-distribution: Explanations unreliable for OOD samples
  - Error can exceed 40% for far-from-training instances
  - Solution: Flag OOD samples, don't trust explanations blindly
- Feature interactions: Linear explanations miss non-linear interactions
  - Solution: Use SHAP interaction values (expensive: O(n²))
- Baseline dependency: Results sensitive to background dataset
  - Solution: Use representative background, document choice
Neural Architecture Search (NAS)¶
Q: Что такое NAS и зачем он нужен?¶
A: NAS — автоматический поиск оптимальной архитектуры нейросети.
3 компонента: 1. Search Space: Какие операции/связи допустимы (conv, pooling, attention) 2. Search Strategy: Как исследовать пространство (RL, EA, gradient-based) 3. Performance Estimation: Как быстро оценить candidate (proxy tasks, weight sharing)
Зачем: - Ручной дизайн требует 120,000+ GPU hours/month (Tesla, 2025) - NAS находит architectures которые люди не придумают - Hardware-aware NAS оптимизирует под конкретное устройство
Q: Какие search strategies в NAS?¶
A:
| Strategy | Как работает | Pros | Cons |
|---|---|---|---|
| RL | RNN controller генерирует architectures, reward = accuracy | Осваивает сложные spaces | 1800 GPU-days (NASNet) |
| Evolutionary | Population, mutation, crossover, selection | Простой, parallelizable | Expensive evaluation |
| DARTS | Continuous relaxation, gradient descent on architecture params | 1 GPU-day | Discretization gap |
| Bayesian | Gaussian process models performance | Sample-efficient | Struggles with high-dim |
| Random | Uniform sampling | Baseline, simple | Slow for large spaces |
| One-Shot | Train supernet once, sample subnets | Fast evaluation | Weight sharing bias |
DARTS key insight: $\(\bar{o}(x) = \sum_{o \in \mathcal{O}} \frac{\exp(\alpha_o)}{\sum_{o'} \exp(\alpha_{o'})} o(x)\)$
Где \(\alpha\) — learnable architecture parameters. После training: argmax \(\alpha\) → discrete architecture.
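Минимальный sketch mixed operation из формулы выше (softmax по архитектурным параметрам \(\alpha\)); условный PyTorch-модуль, не реализация DARTS целиком:
import torch
import torch.nn as nn

class MixedOp(nn.Module):
    def __init__(self, ops):
        super().__init__()
        self.ops = nn.ModuleList(ops)                     # кандидаты: conv, pooling, skip...
        self.alpha = nn.Parameter(torch.zeros(len(ops)))  # architecture parameters

    def forward(self, x):
        weights = torch.softmax(self.alpha, dim=0)
        # Взвешенная сумма всех операций -- continuous relaxation
        return sum(w * op(x) for w, op in zip(weights, self.ops))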
Q: Что такое cell-based search space?¶
A: Instead of searching whole network, search small reusable cell.
NASNet cells: - Normal cell: Same spatial resolution - Reduction cell: Halves resolution (stride-2)
Cell = DAG: - Nodes = operations (3x3 conv, 5x5 conv, pooling) - Edges = connections - Stacked N times → full network
Advantages: - Transferable (CIFAR → ImageNet) - Smaller search space - Faster search
Limitations: - Low variance among found architectures - Constrained expressiveness
Q: Hardware-Aware NAS — как работает?¶
A: Incorporate hardware constraints into search.
Metrics added to objective: - Latency: \(\mathcal{L} = \text{Accuracy} - \lambda \cdot \text{Latency}\) - Memory: Peak memory usage - Energy: FLOPs × power per operation
Approaches:
- ProxylessNAS: Learn to prune paths, measure on target device
- MnasNet: Multi-objective optimization (accuracy + latency)
- Once-for-All: Train supernet, specialize for different devices
Example (Mobile optimization):
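Hypothetical sketch multi-objective reward в духе MnasNet: accuracy с мягким штрафом за latency; target и коэффициент w здесь условные:
def mobile_nas_reward(accuracy, latency_ms, target_ms=80, w=-0.07):
    # MnasNet-style: reward = acc * (latency / target)^w; w < 0 штрафует медленные модели
    return accuracy * (latency_ms / target_ms) ** w

# Пример: 76% accuracy, 100ms на целевом устройстве при target 80ms
print(mobile_nas_reward(0.76, 100))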
Q: One-Shot NAS — в чём идея?¶
A: Train one supernet containing all architectures, evaluate by sampling.
Once-for-All Network (OFA): 1. Train supernet supporting all configurations 2. At inference: sample subnet with desired constraints 3. No retraining needed
Weight sharing benefits: - 10,000x faster than training from scratch - Single training → multiple deployment targets
Challenge: Weight sharing bias — shared weights may not reflect standalone performance.
Solutions: - Progressive shrinking (OFA): Train large, gradually add smaller configs - Sandwich rule: Train min, max, random each step
Q: Когда NAS НЕ стоит использовать?¶
A:
Не используйте NAS когда: 1. Small scale (<7B params): Overhead не окупается 2. Single-domain task: Нет benefit от specialization 3. Latency-critical: Search overhead too high 4. Limited compute: Search может занять недели 5. Strong baseline exists: ResNet/EfficientNet достаточно
Rule of thumb: NAS оправдан когда: - Unique hardware constraints (edge, mobile) - Novel task без established architectures - Budget ≥ 100 GPU-days for search - Expected significant efficiency gains
Q: EfficientNet — как NAS помог?¶
A: EfficientNet = NAS + Compound Scaling.
Шаг 1: NAS (baseline) - Found EfficientNet-B0 via MnasNet - Optimized for accuracy + latency
Шаг 2: Compound Scaling - Scale all dimensions together: $\(\text{depth} = \alpha^\phi, \quad \text{width} = \beta^\phi, \quad \text{resolution} = \gamma^\phi\)$
- Constraint: \(\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2\), \(\alpha, \beta, \gamma \ge 1\)
- \(\phi\) = user-specified coefficient (B0→B7)
Result: 8.4x smaller + 6.1x faster than GPipe while achieving similar accuracy.
Cost-Sensitive Learning¶
Источники: CodeGenes: Cost-Sensitive Learning in PyTorch (2025), LinkedIn: Class Weights & Cost-Sensitive Learning (2025), Elkan (2001)
Q: Что такое Cost-Sensitive Learning?¶
A:
Definition: ML technique where different misclassification errors have different costs.
Example: Medical diagnosis - FN (sick → healthy): Missing cancer = very costly - FP (healthy → sick): Unnecessary tests = less costly
Cost matrix: $\(C = \begin{bmatrix} 0 & c_{01} \\ c_{10} & 0 \end{bmatrix}\)$
Where \(C_{ij}\) = cost of predicting class \(j\) when true class is \(i\).
Q: Как реализовать cost-sensitive learning в PyTorch?¶
A:
Method 1: Weighted Cross-Entropy
import torch
import torch.nn as nn

# Define class weights (higher for minority/costly class)
class_weights = torch.tensor([1.0, 10.0])  # [class 0, class 1]
criterion = nn.CrossEntropyLoss(weight=class_weights)
loss = criterion(predictions, targets)
Method 2: Custom Cost-Sensitive Loss
import torch
import torch.nn.functional as F

def cost_sensitive_loss(predictions, targets, cost_matrix):
    """
    predictions: (batch, n_classes) logits
    targets: (batch,) class indices
    cost_matrix: (n_classes, n_classes), cost_matrix[i, j] = cost of predicting j when true class is i
    """
    # Строка cost matrix для истинного класса каждого сэмпла: (batch, n_classes)
    costs = cost_matrix[targets]
    # Минимизируем ожидаемую стоимость ошибки под предсказанным распределением
    probs = F.softmax(predictions, dim=1)
    loss = torch.sum(costs * probs, dim=1).mean()
    return loss

# Example: FN costs 10x more than FP
cost_matrix = torch.tensor([
    [0, 1],   # True 0: cost of predicting 0, 1
    [10, 0]   # True 1: cost of predicting 0, 1
], dtype=torch.float32)
Method 3: Sample-wise weights
# Different weight for each sample
sample_weights = torch.tensor([1, 5, 1, 10, ...])
# Compute per-sample loss
losses = F.cross_entropy(predictions, targets, reduction='none')
weighted_loss = (losses * sample_weights).mean()
Q: Когда использовать cost-sensitive learning?¶
A:
| Scenario | Approach | Cost Matrix Example |
|---|---|---|
| Medical diagnosis | High FN cost | FN=10, FP=1 |
| Fraud detection | High FN cost | FN=100, FP=1 |
| Spam filter | High FP cost | FN=1, FP=10 |
| Loan approval | Asymmetric | Default=50, Rejection=1 |
Rule of thumb: - Set cost ratio = inverse of acceptable error ratio - If FN is 10x worse than FP → weight(class_1) = 10 * weight(class_0)
Q: Cost-sensitive vs class imbalance — в чём разница?¶
A:
| Aspect | Class Imbalance | Cost-Sensitive |
|---|---|---|
| Focus | Sample frequency | Error cost |
| Solution | Resampling, class weights | Cost matrix, threshold adjustment |
| When to use | Minority class underrepresented | Errors have different costs |
They're related but not identical: - Class imbalance: 99% negative, 1% positive - Cost-sensitive: Missing a positive costs 100x more
Combined approach:
import numpy as np
import torch
import torch.nn as nn
from sklearn.utils.class_weight import compute_class_weight

# Weighted loss for imbalanced + cost-sensitive
class_weights = compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
# Adjust for costs
class_weights[1] *= 10  # Further upweight positive class
criterion = nn.CrossEntropyLoss(weight=torch.tensor(class_weights, dtype=torch.float32))
Q: Как оценить cost-sensitive model?¶
A:
1. Average misclassification cost (чем ниже, тем лучше):
def average_misclassification_cost(y_true, y_pred, cost_matrix):
    total_cost = 0
    for t, p in zip(y_true, y_pred):
        total_cost += cost_matrix[t, p]
    return total_cost / len(y_true)
2. Expected cost: $\(\text{Expected Cost} = \sum_{i,j} C_{ij} \cdot P(\text{predict } j | \text{true } i) \cdot P(\text{true } i)\)$
3. Cost curves: Plot cost vs threshold for different operating points
4. Business metrics: Connect to actual business KPIs - Fraud: $ caught vs $ lost - Medical: Lives saved vs unnecessary procedures
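Небольшой sketch: средняя стоимость ошибок при разных порогах классификации (подбор operating point по cost matrix; стоимости и имена условные):
import numpy as np

def expected_cost_by_threshold(y_true, y_proba, cost_fn=10.0, cost_fp=1.0, thresholds=None):
    thresholds = thresholds if thresholds is not None else np.linspace(0.05, 0.95, 19)
    costs = []
    for t in thresholds:
        y_pred = (y_proba >= t).astype(int)
        fn = np.sum((y_true == 1) & (y_pred == 0))
        fp = np.sum((y_true == 0) & (y_pred == 1))
        costs.append((t, (cost_fn * fn + cost_fp * fp) / len(y_true)))
    return min(costs, key=lambda x: x[1])  # (лучший порог, средняя стоимость)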
16. Missing Data Handling¶
Basic¶
Q: Какие типы missing data существуют?
A: Rubin's Classification (1976):
| Type | Full Name | Definition | Example | Strategy |
|---|---|---|---|---|
| MCAR | Missing Completely At Random | P(missing) independent of all variables | Data entry error, random sensor failure | Deletion OK |
| MAR | Missing At Random | P(missing) depends on observed data | Men less likely to report depression | Imputation OK |
| MNAR | Missing Not At Random | P(missing) depends on missing value itself | High earners don't report salary | Model missingness |

Important: MCAR is the only case where deletion is unbiased. MAR/MNAR require imputation.
Q: Когда drop vs impute missing values?
A:
Drop (listwise deletion) when: - MCAR mechanism confirmed - < 5% missing per column - Large dataset, small impact
Impute when: - MAR or MNAR mechanism - > 5% missing per column - Small dataset - Missingness is informative
Code check:
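Минимальный sketch такой проверки (предполагается pandas DataFrame `df`; порог 5% -- из правила выше):
import pandas as pd

missing_share = df.isna().mean().sort_values(ascending=False)
print(missing_share[missing_share > 0])

# Эвристика: < 5% и подтверждённый MCAR -- можно дропать, иначе импутируем
cols_ok_to_drop = missing_share[(missing_share > 0) & (missing_share < 0.05)].index.tolist()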
Medium¶
Q: Какие методы imputation существуют?
A:
| Method | Description | Best For | Bias Risk |
|---|---|---|---|
| Mean/Median | Replace with central tendency | Numerical, MCAR | Underestimates variance |
| Mode | Most frequent value | Categorical | Same as mean |
| Forward/Backward fill | Use adjacent values | Time series | Temporal leakage |
| KNN Imputer | k-nearest neighbors | Numerical patterns | Computationally expensive |
| MICE | Multiple Imputation by Chained Equations | Any | Gold standard for MAR |
| Iterative | Model-based (Bayesian Ridge) | Complex patterns | Assumes MAR |

MICE implementation:
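Один из вариантов -- sklearn IterativeImputer (MICE-подобная итеративная импутация; API помечен как experimental, имена переменных условные):
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

imputer = IterativeImputer(estimator=BayesianRidge(), max_iter=10, random_state=42)
X_imputed = imputer.fit_transform(X_with_missing)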
Q: Что такое Multiple Imputation и зачем нужна?
A: Single imputation problem: Imputed values are treated as certain → underestimates variance.
Multiple Imputation (MI) solution: 1. Create m datasets with different imputed values 2. Analyze each dataset separately 3. Pool results using Rubin's Rules
Rubin's Rules for pooling: $\(\bar{Q} = \frac{1}{m}\sum_{i=1}^{m} \hat{Q}_i\)$
(pooled estimate)
\[\bar{U} = \frac{1}{m}\sum_{i=1}^{m} U_i\]
(within-imputation variance)
\[B = \frac{1}{m-1}\sum_{i=1}^{m}(\hat{Q}_i - \bar{Q})^2\]
(between-imputation variance)
\[T = \bar{U} + (1 + \frac{1}{m})B\]
(total variance)
When to use: MAR mechanism, research/analysis context, need valid confidence intervals.
Q: Как обрабатывать missing values в categorical features?
A:
Strategies: 1. New category: "Unknown" or "Missing" — simplest, preserves missingness info 2. Mode imputation: Most frequent — can distort distribution 3. Model-based: Predict category from other features 4. Weight of Evidence (WoE): For binary classification, encode as WoE value
# Strategy 1: New category
df['category'].fillna('Missing', inplace=True)

# Strategy 3: Model-based (using other features)
from sklearn.ensemble import RandomForestClassifier
mask = df['category'].isna()
if mask.sum() > 0:
    clf = RandomForestClassifier()
    clf.fit(df.loc[~mask, other_features], df.loc[~mask, 'category'])
    df.loc[mask, 'category'] = clf.predict(df.loc[mask, other_features])
Killer¶
Q: Спроектируйте missing data strategy для fraud detection pipeline.
A:
Analysis Phase:
# 1. Diagnose missingness mechanism
import pandas as pd
from scipy.stats import chi2_contingency

def diagnose_missingness(df, target_col):
    """Check if missingness predicts target"""
    df['is_missing'] = df[target_col].isna().astype(int)
    for col in df.select_dtypes(include='object').columns:
        contingency = pd.crosstab(df['is_missing'], df[col])
        chi2, p, _, _ = chi2_contingency(contingency)
        if p < 0.05:
            print(f"Missingness of {target_col} related to {col}: p={p:.4f}")

Pipeline Architecture:
Raw Data → Missing Flag Creation → Imputation Model → Feature Engineering → Model
                 ↓                        ↓                     ↓
         [is_X_missing=1]        [Predicted value]   [Original + Flag + Imputed]

Implementation:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

# Create missing indicators for important features
important_features = ['transaction_amount', 'user_age', 'device_score']
for col in important_features:
    df[f'{col}_missing'] = df[col].isna().astype(int)

# Different strategies for different columns
# NB: категориальные колонки после импутации ещё нужно закодировать (например, OneHotEncoder)
preprocessor = ColumnTransformer([
    ('num_knn', KNNImputer(n_neighbors=5), numerical_cols),
    ('cat_mode', SimpleImputer(strategy='most_frequent'), categorical_cols),
    ('cat_new', SimpleImputer(strategy='constant', fill_value='Unknown'), high_missing_cols)
])

pipeline = Pipeline([
    ('imputer', preprocessor),
    ('scaler', StandardScaler()),
    ('model', XGBClassifier())
])

Key decisions: - Flag missingness for high-value features (model can learn "missing = suspicious") - KNN for numerical with patterns - "Unknown" category for categorical with > 10% missing - Monitor: imputation quality, drift in missingness patterns
17. Model Debugging¶
Basic¶
Q: Что такое slice-based evaluation?
A: Slice-based evaluation — анализ model performance на подмножествах (slices) данных вместо одного aggregate metric.
Зачем: Aggregate metrics скрывают проблемы на underrepresented groups.
Slice types: - Demographic: gender, age, geography - Behavioral: new vs returning users, device type - Data-driven: high confidence vs low confidence, feature-based
# Slice-based evaluation
from sklearn.metrics import accuracy_score

def evaluate_slices(model, X, y, slice_cols):
    results = {}
    for col in slice_cols:
        for value in X[col].unique():
            mask = X[col] == value
            if mask.sum() >= 50:  # Minimum samples
                results[f"{col}={value}"] = {
                    'accuracy': accuracy_score(y[mask], model.predict(X[mask])),
                    'count': mask.sum()
                }
    return results
Q: Как проводить error analysis для ML модели?
A: Systematic Error Analysis Process:
- Collect errors: All misclassified samples
- Categorize: By error type (FP, FN), feature values, prediction confidence
- Pattern hunt: What do errors have in common?
- Hypothesis: Why is model making these errors?
- Fix: More data, new features, different model
Code:
# Error analysis
errors = X_test[y_test != y_pred].copy()
errors['true'] = y_test[y_test != y_pred]
errors['pred'] = y_pred[y_test != y_pred]
errors['confidence'] = y_proba[y_test != y_pred].max(axis=1)

# Look for patterns
for col in X_test.columns:
    print(f"\n{col} distribution in errors vs all:")
    print(errors[col].value_counts(normalize=True).head())
    print(X_test[col].value_counts(normalize=True).head())
Medium¶
Q: Что такое data debugging и как его делать?
A: Data debugging — поиск проблем в данных, которые вызывают model issues.
Common data bugs: - Label noise: Incorrect labels in training data - Feature leakage: Target information in features - Distribution shift: Train/test different distributions - Outliers: Extreme values affecting model - Duplicates: Same samples causing overfitting
Debugging techniques:
# 1. Check label consistency
from cleanlab.classification import CleanLearning
from xgboost import XGBClassifier

clf = CleanLearning(clf=XGBClassifier())
clf.fit(X_train, y_train)
label_issues = clf.get_label_issues()  # Potentially mislabeled samples

# 2. Check for leakage
from sklearn.feature_selection import mutual_info_classif
mi = mutual_info_classif(X_train, y_train)
suspicious = [f for f, score in zip(features, mi) if score > 0.8]  # Too predictive

# 3. Distribution check
from scipy.stats import ks_2samp
for col in X_train.columns:
    stat, p = ks_2samp(X_train[col], X_test[col])
    if p < 0.01:
        print(f"Distribution shift in {col}: p={p:.4f}")
Q: Как организовать regression testing для ML моделей?
A: ML Regression Testing — автоматическая проверка что новая модель не хуже старой на критических сценариях.
Test suite components: 1. Golden dataset: Curated examples representing key scenarios 2. Performance thresholds: Min acceptable metrics 3. Slice-specific checks: Must not degrade on important slices 4. Prediction stability: Similar inputs → similar outputs
class ModelRegressionTest:
    def __init__(self, baseline_model, golden_data, thresholds, slices=None):
        self.baseline = baseline_model
        self.golden_X, self.golden_y = golden_data
        self.thresholds = thresholds  # {'accuracy': 0.85, 'slice_degradation': 0.02}
        self.slices = slices or {}    # {'slice_name': boolean_mask}

    def test(self, new_model):
        # 1. Overall performance
        baseline_acc = accuracy_score(self.golden_y, self.baseline.predict(self.golden_X))
        new_acc = accuracy_score(self.golden_y, new_model.predict(self.golden_X))
        assert new_acc >= self.thresholds['accuracy'], f"Accuracy below threshold: {new_acc}"
        # 2. No significant regression
        assert new_acc >= baseline_acc - self.thresholds['slice_degradation'], \
            f"Regression from baseline: {baseline_acc} → {new_acc}"
        # 3. Slice-specific checks
        for slice_name, mask in self.slices.items():
            baseline_slice = accuracy_score(self.golden_y[mask], self.baseline.predict(self.golden_X[mask]))
            new_slice = accuracy_score(self.golden_y[mask], new_model.predict(self.golden_X[mask]))
            assert new_slice >= baseline_slice - 0.05, f"Regression on {slice_name}"
        return {"status": "PASSED", "baseline_acc": baseline_acc, "new_acc": new_acc}
Killer¶
Q: Спроектируйте model debugging workflow для production recommendation system.
A:
Architecture:
Production Logs → Error Collector → Pattern Analyzer → Alerting → Root Cause → Fix
      ↓                 ↓                 ↓               ↓            ↓
[predictions]    [misclassifies]      [slices]        [on-call]   [retrain]
[features]       [low confidence]     [drifts]                    [features]
[outcomes]       [edge cases]         [biases]

Implementation:
class ModelDebugger:
    def __init__(self, model, feature_store):
        self.model = model
        self.fs = feature_store
        self.error_buffer = []
        self.slice_metrics = defaultdict(list)

    def log_prediction(self, user_id, item_id, features, prediction, outcome=None):
        """Log every prediction for debugging."""
        record = {
            'timestamp': datetime.now(),
            'user_id': user_id,
            'item_id': item_id,
            'features': features,
            'prediction': prediction,
            'confidence': prediction.max(),
            'outcome': outcome  # Filled later if available
        }
        self.error_buffer.append(record)

    def analyze_errors(self):
        """Periodic error analysis."""
        # 1. Low confidence predictions
        low_conf = [r for r in self.error_buffer if r['confidence'] < 0.6]
        if len(low_conf) > 100:
            self.alert(f"High rate of low-confidence predictions: {len(low_conf)}")
        # 2. Slice-based analysis
        for slice_col in ['user_segment', 'item_category', 'device']:
            for slice_val in set(r['features'].get(slice_col) for r in self.error_buffer):
                slice_errors = [r for r in self.error_buffer
                                if r['features'].get(slice_col) == slice_val and r.get('outcome') == 'error']
                slice_total = len([r for r in self.error_buffer
                                   if r['features'].get(slice_col) == slice_val])
                error_rate = len(slice_errors) / max(1, slice_total)
                if error_rate > 0.1:
                    self.alert(f"High error rate on {slice_col}={slice_val}: {error_rate:.2%}")
        # 3. Feature drift
        recent_features = pd.DataFrame([r['features'] for r in self.error_buffer[-1000:]])
        baseline_features = self.fs.get_historical_features()
        for col in recent_features.columns:
            drift = self._compute_psi(recent_features[col], baseline_features[col])
            if drift > 0.2:
                self.alert(f"Feature drift detected in {col}: PSI={drift:.2f}")

    def _compute_psi(self, expected, actual, buckets=10):
        """Population Stability Index."""
        breakpoints = np.arange(0, buckets + 1) / buckets * 100
        if len(actual.unique()) == 1:
            return 0
        breakpoints = np.nanpercentile(actual, breakpoints)
        expected_percents = np.histogram(expected, breakpoints)[0] / len(expected)
        actual_percents = np.histogram(actual, breakpoints)[0] / len(actual)
        psi_value = np.sum((actual_percents - expected_percents) *
                           np.log(actual_percents / expected_percents + 0.0001))
        return psi_value

Key metrics to monitor: - Error rate by slice (user segment, item category) - Low confidence rate - Feature drift (PSI > 0.2) - Prediction distribution shift - Latency by model version
18. AutoML Theory¶
Basic¶
Q: Что такое AutoML и какие проблемы решает?
A: AutoML (Automated Machine Learning) автоматизирует полный ML pipeline: - Hyperparameter Optimization (HPO): Поиск оптимальных гиперпараметров - Neural Architecture Search (NAS): Автоматический поиск архитектуры - Feature Engineering: Автоматическое создание фичей - Model Selection: Выбор лучшего алгоритма - Ensembling: Автоматическое объединение моделей
Проблемы которые решает: - Эксперты тратят 60-80% времени на tuning - Человеческие ошибки и предвзятость - Непоследовательность между инженерами - Сложность для новичков
Medium¶
Q: Как работает Bayesian Optimization для HPO?
A: Bayesian Optimization — model-based подход к поиску гиперпараметров.
Формула оптимизации: $\(x^* = \text{argmax}_{x \in X} f(x)\)$
где \(x\) — конфигурация гиперпараметров, \(f(x)\) — performance metric.
Gaussian Process Prior: $\(f(x) \sim GP(\mu(x), k(x, x'))\)$
Acquisition Functions:
Expected Improvement (EI): $\(EI(x) = E[\max(f(x) - f(x^*), 0)] = \int_{-\infty}^{\infty} \max(f - f^*, 0) p(f|x) df\)$
Probability of Improvement (PI): $\(PI(x) = P(f(x) > f(x^*)) = \Phi\left(\frac{\mu(x) - f(x^*) - \xi}{\sigma(x)}\right)\)$
Upper Confidence Bound (UCB): $\(UCB(x) = \mu(x) + \beta \sigma(x)\)$
Python implementation:
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

class BayesianOptimizer:
    def __init__(self, param_bounds, n_initial=5):
        self.bounds = param_bounds  # {'lr': (0.0001, 0.1), 'batch_size': (16, 256)}
        self.n_initial = n_initial
        self.X_observed = []
        self.y_observed = []
        self.gp = GaussianProcessRegressor()

    def expected_improvement(self, X, xi=0.01):
        mu, sigma = self.gp.predict(X, return_std=True)
        sigma = np.maximum(sigma, 1e-9)  # avoid div by zero
        f_best = np.max(self.y_observed)
        with np.errstate(divide='warn'):
            imp = mu - f_best - xi
            Z = imp / sigma
            ei = imp * norm.cdf(Z) + sigma * norm.pdf(Z)
            ei[sigma == 0.0] = 0.0
        return ei

    def suggest_next(self, n_candidates=1000):
        if len(self.X_observed) < self.n_initial:
            return self._random_sample()  # helper: uniform sample внутри bounds
        self.gp.fit(np.array(self.X_observed), np.array(self.y_observed))
        candidates = self._generate_candidates(n_candidates)  # helper: случайные кандидаты
        ei = self.expected_improvement(candidates)
        return candidates[np.argmax(ei)]

    def update(self, x, y):
        self.X_observed.append(x)
        self.y_observed.append(y)

Сравнение методов HPO:
| Method | Efficiency | Parallelizable | Best For |
|---|---|---|---|
| Grid Search | 45% | Yes (embarrassingly) | Small param spaces |
| Random Search | 65% | Yes | Baseline, early exploration |
| Bayesian (GP) | 95% | Limited (sequential) | Expensive evaluations |
| TPE | 90% | Limited | High-dimensional spaces |
| Multi-fidelity | 95%+ | Yes | Large datasets, deep learning |
Q: В чём разница между Grid Search, Random Search и Bayesian?
A:
Grid Search: - Перебирает все комбинации на сетке - Экспоненциальный рост: \(O(n^d)\) где \(d\) — число параметров - Неэффективен: многие комбинации бесполезны - Пример: 3 параметра по 10 значений = 10³ = 1000 trials
Random Search: - Случайная выборка из пространства - Лучшая эффективность при том же бюджете - Не учитывает предыдущие результаты - Формула: \(P(\text{top 5\%}) = 1 - (1 - 0.05)^n\)
Bayesian Optimization: - Строит surrogate model (GP) по результатам - Баланс exploration vs exploitation - Каждая новая точка информативна - Идеален для дорогих вычислений
Когда что использовать: - < 10 trials → Random Search - 10-100 trials → Bayesian (GP/TPE) - Cheap evaluation (seconds) → Grid/Random - Expensive (hours) → Bayesian + early stopping
Killer¶
Q: Спроектируйте AutoML систему для команды из 50 DS.
A:
Requirements: 50 DS, 1000+ experiments/week, diverse workloads (tabular, CV, NLP).
Architecture:
graph TD
  subgraph CTRL["AutoML Controller"]
    HPO["HPO Engine<br/>Optuna/TPE"]
    NAS["NAS Engine<br/>DARTS"]
    FE["Feature Engine"]
    ENS["Ensemble Engine"]
    HPO --> SCHED
    NAS --> SCHED
    FE --> SCHED
    ENS --> SCHED
    SCHED["Trial Scheduler (Ray Tune)<br/>Resource allocation, ASHA, 100+ parallel"]
  end
  SCHED --> REG["Model Registry (MLflow)"]
  style HPO fill:#e8eaf6,stroke:#3f51b5
  style NAS fill:#e8eaf6,stroke:#3f51b5
  style FE fill:#e8eaf6,stroke:#3f51b5
  style ENS fill:#e8eaf6,stroke:#3f51b5
  style SCHED fill:#e8f5e9,stroke:#4caf50
  style REG fill:#fff3e0,stroke:#ef6c00

Key components:
- HPO Engine: Optuna с TPE sampler для high-dimensional spaces
- NAS: DARTS для CV, AutoML-Text для NLP
- Early Stopping: ASHA (Asynchronous Successive Halving)
- Multi-fidelity: Сначала на 10% данных, потом на 100%
Cost optimization: - Warm-starting: transfer learning между похожими задачами - Budget-aware: остановить если не превосходит baseline после N trials - Meta-learning: использовать историю команды для инициализации
Governance: - Auto-logging всех экспериментов - Comparison vs baseline required для promotion - Weekly AutoML reports: savings, best practices discovered
Q: Что такое Multi-Fidelity Optimization?
A: Multi-Fidelity использует дешевые аппроксимации для ускорения HPO.
Идея: Сначала evaluate на маленьком subset данных/short training, потом только лучшие на full fidelity.
Методы:
Successive Halving (SH):
# Start with N configs, train for r epochs
# Keep top 1/η, train for r*η epochs
# Repeat until 1 config at max epochs
def successive_halving(configs, r_min=1, eta=3):
    n = len(configs)
    r = r_min
    while n > 1:
        results = [train(c, epochs=r) for c in configs]  # train/top_k -- внешние helpers
        n_keep = max(1, n // eta)
        configs = top_k(configs, results, k=n_keep)
        n, r = n_keep, r * eta
    return configs[0]

ASHA (Asynchronous SHA):
- Параллельная версия SH
- Configs запускаются асинхронно
- Promote когда достигли milestone
Hyperband:
- Комбинирует SH с разными budget allocations
- Робаст к разным типам задач
Формула Hyperband: $\(s_{max} = \lfloor \log_\eta(R/r_{min}) \rfloor\)$
\[B = (s_{max} + 1) R\]
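Небольшой sketch расчёта brackets по формулам выше (стартовое число конфигураций и бюджет на bracket; иллюстрация, не полная реализация Hyperband):
import math

def hyperband_brackets(R, r_min=1, eta=3):
    s_max = int(math.log(R / r_min, eta) + 1e-9)
    B = (s_max + 1) * R
    for s in range(s_max, -1, -1):
        n = int(math.ceil(B / R * eta ** s / (s + 1)))  # стартовое число конфигураций
        r = R * eta ** (-s)                             # стартовый бюджет на конфигурацию
        yield s, n, r

for s, n, r in hyperband_brackets(R=81):
    print(f"bracket s={s}: n={n}, r={r:.1f}")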
19. Federated Learning¶
Basic¶
Q: Что такое Federated Learning?
A: ML парадигма, где модель обучается на распределённых данных без перемещения данных на центральный сервер.
Ключевые принципы: - Data stays on device (privacy) - Only model updates are shared - Central server aggregates updates - Model improves collaboratively
Q: Как работает FedAvg (Federated Averaging)?
A:
FedAvg Algorithm (McMahan et al.): 1. Server initializes global model \(w^0\) 2. For each round \(t\): - Server sends \(w^t\) to selected clients \(S_t\) - Each client \(k\) trains locally: \(w_k^{t+1} = w^t - \eta \nabla L_k(w^t)\) - Clients send updates back - Server aggregates: \(w^{t+1} = \sum_{k \in S_t} \frac{n_k}{n} w_k^{t+1}\)
где \(n_k\) — количество samples у клиента \(k\), \(n = \sum n_k\)
Medium¶
Q: Какие проблемы FedAvg и как их решают?
A:
| Problem | Cause | Solution |
|---|---|---|
| Client Drift | Heterogeneous data | FedProx (proximal term) |
| Communication cost | Large model updates | Compression, sparse updates |
| Stragglers | Slow clients | Async aggregation |
| Non-IID data | Different distributions | Data sharing, clustering |

FedProx: $\(\min_w L_k(w) + \frac{\mu}{2}\|w - w^t\|^2\)$
Proximal term keeps local updates close to global model.
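Sketch локального loss с proximal term (PyTorch; имена условные, global_params -- копия весов \(w^t\), base_loss -- обычный task loss клиента):
import torch

def fedprox_local_loss(model, global_params, base_loss, mu=0.01):
    # Добавляем (mu/2) * ||w - w^t||^2 к обычному loss клиента
    prox = 0.0
    for p, g in zip(model.parameters(), global_params):
        prox = prox + torch.sum((p - g.detach()) ** 2)
    return base_loss + (mu / 2.0) * prox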
Q: Local vs Global updates — в чём разница?
A:
Local updates (client-side): - Multiple SGD steps before sending to server - More computation, less communication - Formula: \(w_k \leftarrow w_k - \eta \sum_{i} \nabla \ell(x_i, y_i; w_k)\) for \(E\) epochs
Communication-efficiency trade-off: - More local epochs \(E\) → less communication, but more drift - Typical: \(E \in [1, 5]\) for stability
Q: Что такое Differential Privacy в Federated Learning?
A:
DP-FedAvg: Add noise to updates before sending to server $\(\tilde{g}_k = g_k + \mathcal{N}(0, \sigma^2 C^2)\)$
где \(C\) — clipping norm, \(\sigma\) — noise scale
Privacy guarantee: \((\epsilon, \delta)\)-DP - Lower \(\epsilon\) → stronger privacy, more noise - Trade-off: privacy vs accuracy
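Минимальный sketch обработки клиентского апдейта для DP-FedAvg (clipping по норме + гауссов шум; имена и параметры условные):
import torch

def dp_sanitize_update(update, clip_norm=1.0, noise_multiplier=1.0):
    # update: "flatten" разница весов клиента
    norm = torch.norm(update)
    scale = torch.clamp(clip_norm / (norm + 1e-12), max=1.0)
    clipped = update * scale                                       # clip to norm C
    noise = torch.randn_like(clipped) * noise_multiplier * clip_norm
    return clipped + noise                                         # g~ = g + N(0, sigma^2 C^2)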
Killer¶
Q: Спроектируйте FL систему для предсказания клавиатуры на мобильных устройствах.
A:
Architecture:
[User Devices] → [Secure Aggregation] → [FL Server] → [Global Model]
       ↑                                                    ↓
       ←——————————————— Model Distribution ←————————————————

Key decisions:
- Model: LSTM/Transformer, ~5-10M params (must fit on device)
- Participation: Sample 100-1000 users per round from millions
- Local training: 1-5 epochs on user's typing data
- Aggregation: Weighted by data size \(n_k\)
- Privacy: DP-FedAvg with \(\epsilon \approx 8\)
Python (simplified):
def fedavg_round(server_model, client_updates, client_sizes):
    total_size = sum(client_sizes)
    new_weights = {}
    for name, param in server_model.named_parameters():
        weighted_sum = sum(
            client_sizes[i] / total_size * update[name]
            for i, update in enumerate(client_updates)
        )
        new_weights[name] = weighted_sum
    return new_weights

Challenges: - Device heterogeneity (battery, compute) - Non-IID data (different users, different vocab) - Concept drift (new slang, languages)
Q: FedAvg vs FedProx vs SCAFFOLD — когда что использовать?
A:
| Algorithm | Best For | Key Innovation |
|---|---|---|
| FedAvg | IID-ish data, stable clients | Baseline, simple |
| FedProx | Heterogeneous data | Proximal term reduces drift |
| SCAFFOLD | Highly non-IID | Control variates correct drift |

SCAFFOLD insight: Client drift = \(\nabla L_k(w) - \nabla L(w)\)
- Maintains control variates \(c_k\) to estimate drift
- Updates: \(w_k \leftarrow w_k - \eta(g_k - c_k + c)\)
- Achieves 45% faster convergence on non-IID data (2025 benchmarks)
20. TabPFN — Foundation Model for Tabular Data¶
Basic¶
Q: Что такое TabPFN?
A: Tabular Prior-data Fitted Network — foundation model для tabular data, использующий in-context learning вместо gradient descent.
Ключевые характеристики: - Pre-trained на synthetic tabular datasets - Zero-shot prediction (no training on your data) - Transformer-based architecture - Outperforms XGBoost/LightGBM on small datasets (<10K samples)
Q: В чём разница TabPFN vs традиционные ML модели?
A:
| Aspect | Traditional (XGBoost) | TabPFN |
|---|---|---|
| Training | Gradient descent on data | Pre-trained, no training |
| Data requirement | More data = better | Small data specialist |
| Inference | Fast tree traversal | Forward pass through transformer |
| Hyperparameters | Many (lr, depth, etc.) | Minimal (none for basic use) |
| Max samples | Unlimited | 50K (TabPFN-2.5) |
Medium¶
Q: Как работает TabPFN?
A:
Pre-training Phase: 1. Generate synthetic tabular datasets from priors 2. Train transformer to predict labels given (X_train, y_train, x_test) 3. Model learns general tabular patterns
Inference (In-Context Learning):
from tabpfn import TabPFNClassifier

classifier = TabPFNClassifier()
classifier.fit(X_train, y_train)  # No actual training!
predictions = classifier.predict(X_test)

Architecture: - Input: Training set + test sample as sequence - Encoder: Feature embedding + positional encoding - Decoder: Transformer predicts label probabilities
Q: Какие ограничения у TabPFN?
A:
| Limitation | TabPFN v2 | TabPFN-2.5 |
|---|---|---|
| Max samples | 10,000 | 50,000 |
| Max features | 100 | 2,000 |
| Max classes | 10 | ~100 |
| GPU required | Yes | Yes |

Practical limitations: - Slow on large datasets (O(n²) attention) - Categorical features need preprocessing - No native support for missing values - Regression needs separate model
Q: Когда использовать TabPFN vs XGBoost?
A:
Use TabPFN when: - Dataset < 50K samples - Limited time for hyperparameter tuning - Quick baseline needed - Data is clean (no missing values)
Use XGBoost/LightGBM when: - Large datasets (>50K) - Need feature importance - Complex preprocessing needed - Production deployment (no GPU)
Benchmarks (2025 Nature paper): - TabPFN outperforms on 57% of datasets <10K samples - Average accuracy gain: +2.7% vs best competitor
Killer¶
Q: Как интегрировать TabPFN в production pipeline?
A:
Hybrid approach:
def smart_classifier(X_train, y_train, X_test):
    n_samples = len(X_train)
    if n_samples < 5000:
        # TabPFN for small data
        model = TabPFNClassifier()
        model.fit(X_train, y_train)
        return model.predict(X_test)
    elif n_samples < 50000:
        # Compare TabPFN vs XGBoost
        tabpfn_score = cross_val_score(TabPFNClassifier(), X_train, y_train).mean()
        xgb_score = cross_val_score(XGBClassifier(), X_train, y_train).mean()
        if tabpfn_score > xgb_score:
            return TabPFNClassifier().fit(X_train, y_train).predict(X_test)
        else:
            return XGBClassifier().fit(X_train, y_train).predict(X_test)
    else:
        # Large data: traditional methods
        return XGBClassifier().fit(X_train, y_train).predict(X_test)

Production considerations: - GPU required for inference - Batch inference for throughput - Fallback to XGBoost on timeout - Model versioning (TabPFN updates)
Q: Что нового в TabPFN-2.5?
A:
Key improvements (Nov 2025): - 20x increase in data cells (50K samples × 2K features) - Better handling of high-cardinality categorical features - Improved regression support - Faster inference (optimized attention)
When to upgrade: - Datasets near v2 limits - Need for more classes - Large feature sets
21. Production ML Deployment Patterns¶
Источники: MatterAI Deployment Strategies (Jan 2026), ML Journey Shadow vs Canary (Sept 2025), Raghu's Deployment Patterns, FICO Champion/Challenger (Dec 2025)
Basic¶
Q: Какие основные паттерны deployment для ML моделей?
A:
| Pattern | Описание | Risk Level | Use Case |
|---|---|---|---|
| Blue-Green | Two identical environments, instant switch | Low | Critical systems, zero downtime |
| Canary | Gradual traffic shift (1%→100%) | Medium | Risk mitigation with real users |
| Shadow | Parallel run, no user impact | None | Model validation, load testing |
| A/B Testing | Deterministic routing by user | Medium | Statistical comparison |
| Champion-Challenger | Continuous model competition | Low | Continuous improvement |
Q: Что такое Blue-Green deployment?
A: Поддержка двух идентичных production environments: - Blue — текущая production версия - Green — новая версия для deployment
Process: 1. Deploy new model to Green 2. Run validation tests 3. Switch traffic via load balancer: Blue (0%) → Green (100%) 4. Blue становится standby для instant rollback
Infrastructure (Kubernetes + Istio):
Medium¶
Q: Как работает Canary deployment?
A: Gradual rollout с progressive traffic shifting:
Traffic Ramp: 5% → 20% → 50% → 100%, каждый шаг с паузой для мониторинга метрик (см. конфигурацию ниже).
Kubernetes Argo Rollouts:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: model-inference
spec:
  replicas: 10
  strategy:
    canary:
      steps:
      - setWeight: 5
      - pause: {duration: 10m}
      - setWeight: 20
      - pause: {duration: 10m}
      - setWeight: 50
      - pause: {duration: 10m}
      analysis:
        templates:
        - templateName: success-rate

Automated Gates (triggers rollback if breached): - P95 latency < 200ms - Error rate < 0.1% - Prediction distribution drift (KL divergence < 0.1) - Business metrics (conversion rate stable)
Q: Что такое Shadow deployment и когда его использовать?
A: Shadow model получает те же input данные, что и production, но predictions НЕ влияют на пользователей.
Architecture:
Implementation:
class ShadowDeployment:
    def __init__(self, production_model, shadow_model):
        self.prod = production_model
        self.shadow = shadow_model
        self.logger = PredictionLogger()

    async def predict(self, features):
        # Production prediction (returned to user)
        prod_pred = await self.prod.predict(features)
        # Shadow prediction (logged, not returned)
        shadow_pred = await self.shadow.predict(features)
        self.logger.log(
            features=features,
            prod_prediction=prod_pred,
            shadow_prediction=shadow_pred,
            timestamp=datetime.now()
        )
        return prod_pred  # Only production prediction

Use Cases: - Validate new model on real traffic (risk-free) - Compare prediction distributions - Load testing new infrastructure - Data drift detection
Q: Чем A/B Testing отличается от Canary?
A:
| Aspect | Canary | A/B Testing |
|---|---|---|
| Traffic split | Random percentage | Deterministic (user ID hash) |
| Purpose | Risk mitigation | Statistical comparison |
| User consistency | May see different models | Same user sees same model |
| Duration | Until full rollout | Fixed experiment period |
| Analysis | Operational metrics | Business metrics + significance |

A/B User Segmentation:
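Минимальный sketch deterministic routing по хэшу user ID (один и тот же пользователь всегда попадает в одну и ту же группу; доля treatment условная):
import hashlib

def assign_variant(user_id: str, treatment_share: float = 0.5) -> str:
    # Стабильный хэш → число в [0, 1); не зависит от порядка запросов
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 10_000 / 10_000
    return 'treatment' if bucket < treatment_share else 'control'

print(assign_variant("user_42"))  # всегда одна и та же группа для user_42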
Killer¶
Q: Спроектируйте Champion-Challenger pipeline для recommendation system.
A:
Architecture:
graph TD
  subgraph REG["Model Registry"]
    CHAMP["Champion v2.3<br/>87.2%"]
    CH1["Challenger 1<br/>v2.4-alpha, 86.8%"]
    CH2["Challenger 2<br/>v2.4-beta, 87.5%"]
  end
  CHAMP -->|"90%"| ROUTER["Traffic Router"]
  CH1 -->|"5% shadow"| ROUTER
  CH2 -->|"5%"| ROUTER
  ROUTER --> METRICS["Metrics Collector<br/>CTR, Conversion, Revenue, Latency"]
  METRICS --> DECISION{"Promotion Decision<br/>challenger > champion by 2%+<br/>for 7 days?"}
  DECISION -->|"Yes"| PROMOTE["Promote to Champion"]
  DECISION -->|"No"| KEEP["Keep current Champion"]
  style CHAMP fill:#e8f5e9,stroke:#4caf50
  style CH1 fill:#e8eaf6,stroke:#3f51b5
  style CH2 fill:#e8eaf6,stroke:#3f51b5
  style ROUTER fill:#fff3e0,stroke:#ef6c00
  style METRICS fill:#f3e5f5,stroke:#9c27b0
  style PROMOTE fill:#e8f5e9,stroke:#4caf50
  style KEEP fill:#fce4ec,stroke:#c62828

Implementation:
class ChampionChallengerPipeline:
    def __init__(self, registry, traffic_router, metrics):
        self.registry = registry
        self.router = traffic_router
        self.metrics = metrics
        self.promotion_threshold = 0.02  # 2% improvement
        self.min_observation_days = 7

    def get_model(self, user_id, context):
        champion = self.registry.get_champion()
        challengers = self.registry.get_challengers()
        # Route traffic
        assignment = self.router.assign(user_id)
        if assignment == 'champion':
            return champion
        else:
            # Shadow: return champion prediction but log challenger
            challenger = challengers[assignment]
            return self.shadow_predict(champion, challenger, context)

    async def evaluate_promotion(self):
        champion = self.registry.get_champion()
        challengers = self.registry.get_challengers()
        for challenger in challengers:
            if challenger.observation_days < self.min_observation_days:
                continue
            # Statistical significance test
            improvement = self.metrics.compare(
                challenger, champion, metric='conversion_rate'
            )
            if (improvement > self.promotion_threshold and
                    self.metrics.is_significant(challenger, champion)):
                await self.promote(challenger)

    async def promote(self, new_champion):
        old_champion = self.registry.get_champion()
        self.registry.demote(old_champion)
        self.registry.promote(new_champion)
        self.router.update_weights(champion=1.0)

Promotion Criteria: - Metric improvement > threshold (e.g., 2%) - Statistical significance (p < 0.05) - Minimum observation period - No degradation on critical slices - Stakeholder approval (for major changes)
Q: Когда использовать каждый deployment pattern?
A:
Decision Tree:
Is zero downtime required?
├── Yes → Blue-Green (critical systems: payments, auth)
└── No → Is risk tolerance low?
          ├── Yes → Shadow → Canary → Full
          └── No → Canary (fast iteration)

Need statistical comparison?
└── Yes → A/B Testing with significance analysis

Continuous improvement culture?
└── Yes → Champion-Challenger with automation

Pattern Combinations (Best Practice): 1. Shadow + Canary: Shadow for 2 weeks → Canary 1%→100% 2. Champion-Challenger + Shadow: Multiple challengers in shadow mode 3. A/B + Canary: A/B test on canary traffic only
Cost Comparison:
| Pattern | Infra Cost | Rollback Speed | Real-User Validation |
|---|---|---|---|
| Blue-Green | High (2x) | Instant | No |
| Canary | Medium (1.2x) | Fast | Yes |
| Shadow | Medium (1.5x) | N/A | No |
| A/B Testing | Medium | Fast | Yes |
| Champion-Challenger | Medium | Fast | Yes |
Q: Как реализовать automated rollback для ML deployment?
A:
class AutomatedRollback:
    def __init__(self, thresholds, monitoring):
        # Дефолты, которые можно переопределить через аргумент thresholds
        self.thresholds = {
            'p95_latency_ms': 200,
            'error_rate': 0.001,
            'prediction_drift_kl': 0.1,
            'conversion_rate_drop': 0.05,
            **(thresholds or {}),
        }
        self.monitoring = monitoring

    async def check_and_rollback(self, deployment):
        metrics = await self.monitoring.get_metrics(deployment)
        for metric, threshold in self.thresholds.items():
            current = metrics.get(metric, 0)
            if self._breaches_threshold(metric, current, threshold):
                await self.rollback(deployment)
                await self.alert(
                    f"Rollback triggered: {metric}={current}, threshold={threshold}"
                )
                return True
        return False

    def _breaches_threshold(self, metric, current, threshold):
        # Для всех отслеживаемых метрик (latency, error rate, drift, metric drop) больше = хуже
        return current > threshold

    async def rollback(self, deployment):
        # Switch back to previous stable version
        await deployment.switch_to_previous()
        await deployment.scale_down_canary()

Rollback Triggers: 1. Latency spike > 2x baseline 2. Error rate > 0.1% 3. Prediction distribution shift (PSI > 0.2) 4. Business metric drop > 5% 5. Manual trigger from on-call
22. Data Drift Detection¶
Источники: AllDays Tech Model Drift 2026, Label Your Data Drift Detection, Towards Data Science Drift (Jan 2026)
Basic¶
Q: Что такое data drift и почему это проблема?
A: Data drift — изменение распределения входных данных с течением времени:
\[P_{t_0}(X) \neq P_t(X), \quad t > t_0\]

Типы Drift:
| Type | Definition | Example |
|---|---|---|
| Data Drift | Input distribution changes | New user demographics, seasonality |
| Concept Drift | P(y\|X) changes | Fraud patterns evolve, buying behavior shifts |
| Label Drift | P(y) changes | Class imbalance shifts, policy changes |

Formal decomposition: $\(P(X, y) = P(X) \times P(y|X)\)$
Q: Почему drift неизбежен в production?
A: 1. Real-world change: Seasonality, macro events, adversaries adapt 2. Product change: New features, UI changes, pricing changes 3. Pipeline change: Schema changes, logging changes, feature computation bugs
Medium¶
Q: Какие методы обнаружения drift существуют?
A:
| Method | Use Case | Formula/Approach |
|---|---|---|
| KS Test | Continuous features | \(D = \max_x \lvert F_1(x) - F_2(x) \rvert\) |
| Chi-Square | Categorical features | \(\chi^2 = \sum \frac{(O-E)^2}{E}\) |
| PSI | Score/bin distribution | \(\sum (Actual\% - Expected\%) \times \ln\frac{Actual\%}{Expected\%}\) |
| Wasserstein | Continuous, sensitive | Earth Mover's Distance |

PSI Thresholds: - PSI < 0.1: No significant drift - 0.1 ≤ PSI < 0.25: Moderate drift, monitor - PSI ≥ 0.25: Significant drift, investigate
Q: Как реализовать PSI (Population Stability Index)?
A:
import numpy as np

def compute_psi(expected, actual, buckets=10):
    """Compute Population Stability Index."""
    breakpoints = np.percentile(expected, np.linspace(0, 100, buckets + 1))
    expected_counts, _ = np.histogram(expected, bins=breakpoints)
    actual_counts, _ = np.histogram(actual, bins=breakpoints)
    expected_pct = expected_counts / len(expected) + 1e-10
    actual_pct = actual_counts / len(actual) + 1e-10
    psi = np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct))
    return psi
Q: Что такое Adversarial Validation?
A: Метод определения насколько train и production distribution различаются:
- Label train data as 0, production data as 1
- Train classifier to distinguish them
- If AUC ≈ 0.5 → distributions similar (good)
- If AUC > 0.7 → significant drift (problem)
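Sketch adversarial validation (обученная модель и данные условные; X_train, X_prod -- numpy-массивы одинаковой размерности по фичам):
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# 0 = train, 1 = production
X_all = np.vstack([X_train, X_prod])
y_domain = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_prod))])

auc = cross_val_score(RandomForestClassifier(n_estimators=200),
                      X_all, y_domain, cv=5, scoring='roc_auc').mean()
print(f"Adversarial AUC: {auc:.2f}")  # ~0.5 -- нет дрифта, > 0.7 -- существенный дрифт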
Killer¶
Q: Когда retraining необходим vs достаточно мониторинга?
A:
| Drift Type | Performance Impact | Action |
|---|---|---|
| Data drift only | No degradation | Monitor, no action |
| Data drift + perf drop | Model degrading | Investigate root cause |
| Concept drift | Always impacts | Retrain with recent data |
| Pipeline bug | Varies | Fix pipeline first |

Retrain Triggers: - Business metric drop > 5% - Model accuracy drops below threshold - Multiple features showing drift simultaneously
23. Hyperparameter Interactions & Learning Curves¶
Comprehensive guide to hyperparameter tuning strategies and training diagnostics.
Hyperparameter Tuning Strategies Comparison¶
| Aspect | Grid Search | Random Search | Bayesian Optimization |
|---|---|---|---|
| Strategy | Exhaustive, all combinations | Random sampling | Probabilistic modeling |
| Efficiency | Exponential growth | Efficient for large spaces | Very efficient, fewer evaluations |
| Implementation | Easy (sklearn GridSearchCV) | Easy (sklearn RandomizedSearchCV) | Complex (Optuna, Hyperopt) |
| Best For | Small spaces (<10 params) | High-dimensional spaces | Expensive evaluations |
| Scalability | Limited | Good | Excellent |
| Exploration | Thorough but wasteful | Broad coverage | Smart exploration/exploitation |
Grid Search Details¶
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [3, 5, 10, None],
'min_samples_split': [2, 5, 10]
}
# Total combinations: 3 * 4 * 3 = 36
grid_search = GridSearchCV(
RandomForestClassifier(),
param_grid,
cv=5,
scoring='f1',
n_jobs=-1
)
grid_search.fit(X_train, y_train)
Pros: Comprehensive, simple, reproducible
Cons: \(O(n^k)\) complexity, wastes resources on unimportant dimensions
Random Search Details¶
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import loguniform, randint
param_distributions = {
'n_estimators': randint(50, 500),
'max_depth': randint(3, 20),
'learning_rate': loguniform(1e-4, 1e-1),
'min_samples_split': randint(2, 20)
}
random_search = RandomizedSearchCV(
GradientBoostingClassifier(),
param_distributions,
n_iter=50, # Number of random samples
cv=5,
scoring='f1',
n_jobs=-1
)
Key Insight (Bergstra & Bengio 2012): Random search often finds better configs in fewer trials because: - Not all hyperparameters are equally important - Random sampling covers more distinct values per dimension
Bayesian Optimization¶
import optuna
def objective(trial):
params = {
'n_estimators': trial.suggest_int('n_estimators', 50, 500),
'max_depth': trial.suggest_int('max_depth', 3, 20),
'learning_rate': trial.suggest_float('learning_rate', 1e-4, 1e-1, log=True),
'min_samples_split': trial.suggest_int('min_samples_split', 2, 20)
}
model = GradientBoostingClassifier(**params)
scores = cross_val_score(model, X_train, y_train, cv=5, scoring='f1')
return scores.mean()
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
How it works: 1. Builds probabilistic model (Gaussian Process) of objective function 2. Uses acquisition function (EI, UCB) to select next hyperparameters 3. Balances exploration (new regions) vs exploitation (known good regions)
Learning Curves Interpretation¶
Learning curves plot training/validation error vs training set size or epochs.
Well-Fitted Model¶
- Small gap between train and validation
- Both curves converge to low error
- Action: Model is ready

Overfitting Model¶
Error
│
│ Train ----------
│ (approaches zero)
│ Val ----___
│ ___/‾‾‾ (increases!)
└───────────────────────── Size/Epochs
Underfitting Model¶
Error
│
│ Train ------ (high)
│
│ Val ------- (high, similar)
│
└───────────────────────── Size/Epochs
Learning Curve Analysis Code¶
from sklearn.model_selection import learning_curve
import matplotlib.pyplot as plt
import numpy as np
def plot_learning_curve(estimator, X, y, cv=5):
train_sizes, train_scores, val_scores = learning_curve(
estimator, X, y,
train_sizes=np.linspace(0.1, 1.0, 10),
cv=cv,
scoring='neg_mean_squared_error',
n_jobs=-1
)
train_mean = -np.mean(train_scores, axis=1)
val_mean = -np.mean(val_scores, axis=1)
train_std = np.std(train_scores, axis=1)
val_std = np.std(val_scores, axis=1)
plt.figure(figsize=(10, 6))
plt.plot(train_sizes, train_mean, 'o-', label='Training error')
plt.plot(train_sizes, val_mean, 'o-', label='Validation error')
plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.1)
plt.fill_between(train_sizes, val_mean - val_std, val_mean + val_std, alpha=0.1)
plt.xlabel('Training set size')
plt.ylabel('MSE')
plt.legend()
plt.grid(True)
plt.show()
Early Stopping Strategies¶
Early stopping prevents overfitting by stopping training when validation performance stops improving.
Basic Early Stopping¶
from sklearn.model_selection import train_test_split
class EarlyStopping:
def __init__(self, patience=5, min_delta=0.001):
self.patience = patience
self.min_delta = min_delta
self.counter = 0
self.best_score = None
self.should_stop = False
def __call__(self, val_score):
if self.best_score is None:
self.best_score = val_score
elif val_score < self.best_score + self.min_delta:
self.counter += 1
if self.counter >= self.patience:
self.should_stop = True
else:
self.best_score = val_score
self.counter = 0
return self.should_stop
# Usage in training loop
early_stopping = EarlyStopping(patience=10, min_delta=0.001)
for epoch in range(max_epochs):
train_loss = train_one_epoch(model, train_loader)
val_loss = validate(model, val_loader)
if early_stopping(-val_loss): # Negative because we want to maximize
print(f"Early stopping at epoch {epoch}")
break
Early Stopping in Gradient Boosting¶
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier(
n_estimators=1000,
learning_rate=0.01,
validation_fraction=0.1,
n_iter_no_change=10, # Early stopping patience
tol=1e-4 # Minimum improvement
)
model.fit(X_train, y_train)
print(f"Actual n_estimators used: {model.n_estimators_}")
PyTorch Early Stopping with Checkpointing¶
import torch
def train_with_early_stopping(model, train_loader, val_loader, epochs, patience=5):
optimizer = torch.optim.Adam(model.parameters())
criterion = torch.nn.CrossEntropyLoss()
best_val_loss = float('inf')
patience_counter = 0
best_model_state = None
for epoch in range(epochs):
# Training
model.train()
for X, y in train_loader:
optimizer.zero_grad()
loss = criterion(model(X), y)
loss.backward()
optimizer.step()
# Validation
model.eval()
val_loss = 0
with torch.no_grad():
for X, y in val_loader:
val_loss += criterion(model(X), y).item()
val_loss /= len(val_loader)
# Early stopping logic
if val_loss < best_val_loss:
best_val_loss = val_loss
patience_counter = 0
best_model_state = model.state_dict().copy()
else:
patience_counter += 1
if patience_counter >= patience:
print(f"Early stopping at epoch {epoch}")
model.load_state_dict(best_model_state)
break
return model
Decision Framework: Which Tuning Strategy to Use¶
| Scenario | Recommended Strategy | Why |
|---|---|---|
| <5 hyperparameters | Grid Search | Small space, comprehensive |
| 5-20 hyperparameters | Random Search | Efficient exploration |
| >20 hyperparameters | Bayesian (Optuna) | Smart search |
| Expensive training (>1hr) | Bayesian + Early Stopping | Minimize evaluations |
| Limited compute budget | Random (n=50) + Early Stopping | Good coverage, low cost |
| Production deployment | Bayesian + Cross-validation | Robust, reproducible |
Best Practices¶
- Start coarse, then fine: Wide search first, narrow later
- Use domain knowledge: Set sensible ranges based on experience
- Monitor learning curves: Diagnose over/underfitting early
- Apply early stopping: Save compute, prevent overfitting
- Document experiments: Track all configurations and results
- Cross-validation: Use k-fold CV for reliable estimates
- Parallelize: Use n_jobs=-1 or distributed tuning (Ray Tune)
Источники: AICompetence Grid vs Random vs Bayesian (May 2025), GeeksforGeeks Learning Curves (Jul 2025), Bergstra & Bengio (2012), Snoek et al. (2012)
Basic¶
Q: Почему Random Search часто работает лучше Grid Search?
A: Bergstra & Bengio (2012) показали: 1. Не все параметры важны: Важен только ~1-2 параметра, остальные мало влияют 2. Grid тратит ресурсы: Перебирает все комбинации неважных параметров 3. Random покрывает больше: При том же бюджете исследует больше значений важных параметров
Q: Что показывает Learning Curve?
A: График зависимости ошибки от размера обучающей выборки или эпох: - X-axis: Training set size или epochs - Y-axis: Error (MSE, loss) или accuracy - Две линии: Training error и Validation error
Medium¶
Q: Как диагностировать переобучение по Learning Curve?
A:
Symptom Training Error Validation Error Gap Overfitting Very low High, increasing Large Underfitting High High Small Good fit Low Low Small Actions for overfitting: More data, regularization, early stopping, simpler model
Q: Как работает Early Stopping?
A: Остановка обучения когда validation loss перестаёт улучшаться:
if val_loss < best_val_loss - min_delta:
    best_val_loss = val_loss
    counter = 0
else:
    counter += 1
    if counter >= patience:
        stop_training()

Параметры: patience (сколько эпох ждать), min_delta (минимальное улучшение)
Killer¶
Q: Как выбрать стратегию tuning для production системы?
A:
- Budget assessment: Сколько времени/ресурсов доступно?
- Model complexity: Deep learning → Bayesian, Classical ML → Random/Grid
- Iteration cost: Дорогое обучение → Bayesian + early stopping
- Risk tolerance: Production → k-fold CV + multiple runs
Recommended pipeline:
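Один из разумных вариантов: широкий random sampling на старте, затем TPE сужает поиск вокруг лучших точек (имена, диапазоны и бюджет условные):
import optuna
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 500),
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'learning_rate': trial.suggest_float('learning_rate', 1e-3, 1e-1, log=True),
    }
    model = GradientBoostingClassifier(**params)
    return cross_val_score(model, X_train, y_train, cv=5, scoring='f1').mean()

# Этап 1: первые 20 trials -- случайная разведка, этап 2: TPE эксплуатирует лучшие регионы
study = optuna.create_study(direction='maximize',
                            sampler=optuna.samplers.TPESampler(n_startup_trials=20))
study.optimize(objective, n_trials=60, timeout=3600)
print(study.best_params)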
Q: Как избежать overfitting на validation set при tuning?
A: 1. Nested CV: Inner loop для tuning, outer loop для оценки 2. Hold-out test set: Не использовать для tuning вообще 3. Ограничить trials: Не перебирать тысячи комбинаций 4. Early stopping: Не "подгонять" под validation
Обновлено: 2026-02-12
24. Cross-Validation Edge Cases¶
Advanced cross-validation techniques for robust model evaluation.
Nested Cross-Validation¶
Problem: When tuning hyperparameters, standard CV causes optimism bias — we use validation data both to select hyperparameters AND to report performance.
Solution: Nested CV separates model selection from evaluation: - Outer loop (evaluation): Honest test of tuned model generalization - Inner loop (selection): Hyperparameter tuning on training data only
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold
from sklearn.ensemble import RandomForestClassifier
# Inner loop: hyperparameter search
inner_cv = KFold(n_splits=3, shuffle=True, random_state=1)
param_grid = {'n_estimators': [50, 100], 'max_depth': [3, 5, 10]}
grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=inner_cv)
# Outer loop: evaluation
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)
nested_scores = cross_val_score(grid, X, y, cv=outer_cv)
print(f"Nested CV score: {nested_scores.mean():.3f} (+/- {nested_scores.std():.3f})")
Nested vs Standard CV Comparison¶
| Aspect | Standard CV (GridSearchCV) | Nested CV |
|---|---|---|
| Purpose | Tune hyperparameters | Evaluate tuning pipeline |
| Data leakage | Possible (optimistic bias) | Prevented |
| Computation | \(k \times n_{params}\) | \(k_{outer} \times k_{inner} \times n_{params}\) |
| When to use | Final model selection | Model comparison, publication |
Time Series Cross-Validation¶
Problem: Standard K-fold CV breaks temporal structure — training on future data to predict past = data leakage.
Walk-Forward Validation (Expanding Window)¶
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
print(f"Fold {fold}: Train={len(train_idx)}, Test={len(test_idx)}")
Sliding Window (Fixed Size)¶
class SlidingWindowCV:
def __init__(self, window_size, step=1):
self.window_size = window_size
self.step = step
def split(self, X):
n = len(X)
for i in range(self.window_size, n, self.step):
train_idx = np.arange(i - self.window_size, i)
test_idx = np.arange(i, min(i + self.step, n))
yield train_idx, test_idx
Blocked Time Series CV (with Embargo)¶
class BlockedTimeSeriesCV:
def __init__(self, n_splits=5, embargo=0):
self.n_splits = n_splits
self.embargo = embargo # Gap between train and test
def split(self, X):
n = len(X)
k = n // (self.n_splits + 1)
for i in range(self.n_splits):
test_start = i * k + k
test_end = test_start + k
train_end = test_start - self.embargo
yield np.arange(0, train_end), np.arange(test_start, test_end)
Time Series CV Comparison¶
| Method | Window | Memory | Best For |
|---|---|---|---|
| Expanding | Grows | All history | Stable systems |
| Sliding | Fixed | Recent only | Concept drift |
| Blocked | Fixed + gap | No leakage | Financial data |
Bootstrap .632 Estimator¶
Formula: \(\hat{Err}^{.632} = 0.368 \times \overline{err} + 0.632 \times \hat{Err}^{(1)}\)
Where \(\overline{err}\) = training error, \(\hat{Err}^{(1)}\) = OOB error
```python
import numpy as np
from sklearn.utils import resample

def bootstrap_632_score(model, X, y, n_bootstraps=100):
    n = len(y)
    oob_errors, train_errors = [], []
    for _ in range(n_bootstraps):
        # Draw n indices with replacement; ~63.2% of points end up in-bag
        indices = resample(np.arange(n), n_samples=n, replace=True)
        oob_mask = ~np.isin(np.arange(n), indices)
        if oob_mask.sum() == 0:
            continue
        model.fit(X[indices], y[indices])
        train_errors.append(np.mean(model.predict(X[indices]) != y[indices]))
        oob_errors.append(np.mean(model.predict(X[oob_mask]) != y[oob_mask]))
    # Weighted combination of (optimistic) train error and (pessimistic) OOB error
    return 0.368 * np.mean(train_errors) + 0.632 * np.mean(oob_errors)
```
Why 0.632? A bootstrap sample of size n contains, on average, ~63.2% of the unique observations (the fraction 1 - 1/e).
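The constant comes from a one-line calculation (a standard result, stated here for completeness): the probability that a given observation is never drawn in n samples with replacement is

\[
\left(1 - \frac{1}{n}\right)^{n} \;\xrightarrow[n \to \infty]{}\; e^{-1} \approx 0.368,
\qquad 1 - e^{-1} \approx 0.632 .
\]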
Repeated K-Fold CV¶
```python
from sklearn.model_selection import RepeatedKFold, cross_val_score

# 10 independent 5-fold partitions = 50 evaluations; averaging over
# repeats reduces the variance that comes from a single random split
cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=42)
scores = cross_val_score(model, X, y, cv=cv)  # 50 evaluations
```
Decision Framework: Which CV to Use¶
| Data Type | CV Method | Why |
|---|---|---|
| Standard (i.i.d.) | K-fold (k=5 or 10) | Good bias-variance trade-off |
| Small dataset | LOOCV or .632 Bootstrap | Maximize training data |
| Imbalanced | Stratified K-fold | Preserve class ratios |
| Time series | Walk-forward / Blocked | Respect temporal order |
| Grouped (clusters) | GroupKFold | Keep groups together (see the sketch below) |
| Hyperparameter tuning | Nested CV | Prevent optimism bias |
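For the grouped-data row above, a minimal GroupKFold sketch; `groups` is a hypothetical array of group ids (e.g. a patient or user id per row):

```python
from sklearn.model_selection import GroupKFold, cross_val_score

# All rows with the same group id land in the same fold, so the model
# is never evaluated on a group it has already seen during training.
gkf = GroupKFold(n_splits=5)
scores = cross_val_score(model, X, y, cv=gkf, groups=groups)
```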
Sources: Medium Nested CV (May 2025), MLMastery Time Series CV (Jan 2026)
Basic¶
Q: Why do we need Nested Cross-Validation?
A: Standard GridSearchCV gives an optimistic bias: the validation data is used twice (to select hyperparameters AND to report performance).
Nested CV: outer loop = honest evaluation, inner loop = hyperparameter selection.
Q: Why can't you use standard K-fold for time series?
A: Random shuffling breaks the temporal order: training on the future and testing on the past = data leakage.
Medium¶
Q: What is the difference between Expanding and Sliding Window?
A: An expanding window grows (uses the whole history); a sliding window has fixed size (recent data only). Expanding suits stable systems, sliding suits concept drift.
Q: What is the Bootstrap .632 estimator?
A: A weighted combination of training error and OOB error: \(0.368 \times \text{train\_err} + 0.632 \times \text{OOB\_err}\). It reduces bias on small datasets.
Killer¶
Q: How do you correctly organize a CV pipeline with preprocessing?
A: CRITICAL: the Pipeline goes inside CV, not outside it:
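A minimal sketch of the correct pattern, using StandardScaler and LogisticRegression as placeholder steps (any transformer/estimator pair works the same way):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Correct: the scaler is a Pipeline step, so it is re-fitted on the
# training folds only at every CV split -- no leakage from test folds.
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5)

# Wrong (leakage): the scaler sees the whole dataset before the split,
# so the test folds influence the scaling statistics.
# X_scaled = StandardScaler().fit_transform(X)
# cross_val_score(LogisticRegression(), X_scaled, y, cv=5)
```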
Common Misconceptions¶
Misconception: overfitting can only be detected from validation metrics
Learning curves (train vs val) are a necessary but not sufficient tool. According to Kaggle 2025 data, 34% of "overfitting" cases are actually caused by data leakage in the preprocessing pipeline (for example, fitting the scaler on all data before the split). Always check the whole pipeline via nested cross-validation.
Misconception: feature selection always improves model quality
In practice, aggressive feature selection can remove features with a weak-but-useful signal. A study (Boulesteix et al., 2024) found that on datasets with >50 features, L1 regularization (Lasso) is on average 2-4% worse in AUC than Ridge (L2) with all features when the features are highly correlated. Use Elastic Net as a compromise (see the sketch below).
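One possible way to get that compromise in the classification setting discussed above (l1_ratio=0.5 is an illustrative value to be tuned via CV):

```python
from sklearn.linear_model import LogisticRegression

# Elastic net penalty mixes L1 (sparsity) and L2 (keeps groups of
# correlated features together); only the 'saga' solver supports it.
clf = LogisticRegression(penalty='elasticnet', solver='saga',
                         l1_ratio=0.5, C=1.0, max_iter=5000)
```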
Misconception: gradient boosting is always better than Random Forest
According to the AutoML Benchmark 2025 meta-analysis on 104 tabular datasets, Random Forest beats GBDT in 38% of cases -- especially on noisy data (SNR < 2), small samples (<500 rows), and class imbalance of 1:50 or worse. RF is also much more robust to hyperparameters: a default RF loses to a tuned RF by 1-2%, while a default XGBoost loses to a tuned XGBoost by 5-8%.
See Also¶
- Metrics Cheatsheet — metric selection tree, confusion matrix, ROC/PR-AUC
- Model Selection Cheatsheet — algorithm selection tree by task type
- Hyperparameters Cheatsheet — key hyperparameters for every model
- Debugging Cheatsheet — systematic debugging of ML models
- sklearn Cheatsheet — Pipeline, GridSearchCV, ColumnTransformer
- Deep Learning Interview Q&A — neural networks, optimization, architectures
- System Design Interview Q&A — latency, A/B tests, drift detection
- Math Interview Q&A — linear algebra, probability theory, statistics