Classical ML: Study Materials¶
~18 minute read
Prerequisites: Model Selection | sklearn
Materials for 16 tasks in the Classical ML category: formulas, code, and key sources for each algorithm. Updated: 2026-02-11
1. ML Algorithms from Scratch (6 tasks)¶
KNN (impl_003_knn)¶
Best sources:
- Stanford CS229: KNN (lecture notes)
- StatQuest: KNN (YouTube visualization)
- Hands-On ML Chapter 3 (practical)
Key concepts:
| Distance | Formula | When to use |
|---|---|---|
| Euclidean | \(\sqrt{\sum(x_i-y_i)^2}\) | Default, continuous |
| Manhattan | \(\sum \mid x_i - y_i \mid\) | Outliers, high-dimensional data |
| Cosine | \(1 - \frac{x \cdot y}{\|x\|\|y\|}\) | Text, sparse |
Curse of Dimensionality: \(\text{Volume of unit ball} = \frac{\pi^{d/2}}{\Gamma(d/2+1)} \to 0 \text{ as } d \to \infty\)
In high dimensions, all points end up "far" from each other, so distance-based neighborhoods lose meaning.
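A minimal from-scratch KNN prediction sketch using the Euclidean distance from the table above (illustrative; `predict_knn` is a hypothetical helper and assumes `X_train`, `y_train` are numpy arrays):
import numpy as np
from collections import Counter

def predict_knn(X_train, y_train, x, k=5):
    # Euclidean distance from the query point to every training point
    distances = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among the k neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]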
Logistic Regression (impl_001_logistic_regression)¶
Best sources:
- StatQuest: Logistic Regression
- CS229 Notes: Logistic Regression
- Scikit-learn User Guide
Key formulas: \(\sigma(z) = \frac{1}{1+e^{-z}}\), \(\text{BCE} = -\frac{1}{n}\sum_i \left[y_i \log \hat{p}_i + (1-y_i)\log(1-\hat{p}_i)\right]\)
Decision Boundary: \(w^Tx + b = 0\)
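A gradient-descent sketch built from the sigmoid and BCE loss above (illustrative; `fit_logistic` is a hypothetical helper, assumes a numpy feature matrix X of shape (n, d) and binary labels y):
import numpy as np

def fit_logistic(X, y, lr=0.1, n_iter=1000):
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-(X @ w + b)))   # sigma(w^T x + b)
        grad_w = X.T @ (p - y) / len(y)      # gradient of BCE w.r.t. w
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b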
K-Means (impl_002_kmeans)¶
Best sources:
- StatQuest: K-Means
- CS229: K-Means
Algorithm:
1. Init centroids (random or k-means++)
2. Assign points to nearest centroid
3. Update centroids = mean of assigned points
4. Repeat until convergence
K-means++ initialization:
import numpy as np

def kmeans_plus_plus_init(X, k):
    # First centroid: a uniformly random point
    centroids = [X[np.random.randint(len(X))]]
    for _ in range(k - 1):
        # Distance from each point to its nearest existing centroid
        distances = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        # Sample the next centroid with probability proportional to squared distance
        probs = distances ** 2 / np.sum(distances ** 2)
        centroids.append(X[np.random.choice(len(X), p=probs)])
    return np.array(centroids)
Elbow Method: Plot \(J(C)\) vs \(k\), find "elbow"
Naive Bayes (impl_005_naive_bayes)¶
Best sources:
- StatQuest: Naive Bayes
- Scikit-learn: Naive Bayes
Gaussian NB: \(P(x_i|y=c) = \frac{1}{\sqrt{2\pi\sigma_c^2}} \exp\left(-\frac{(x_i-\mu_c)^2}{2\sigma_c^2}\right)\)
Prediction: \(\hat{y} = \arg\max_c P(c) \prod_i P(x_i|c)\)
Log-space (prevents underflow): \(\log P(y|x) \propto \log P(y) + \sum_i \log P(x_i|y)\)
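A minimal Gaussian NB prediction sketch in log-space following the formulas above (illustrative; assumes class priors, per-class feature means and variances are already estimated and stored in dictionaries keyed by class):
import numpy as np

def gnb_predict(x, priors, means, variances):
    # Score each class with log P(c) + sum_i log N(x_i | mu_c, sigma_c^2)
    best_class, best_score = None, -np.inf
    for c in priors:
        log_likelihood = -0.5 * np.sum(
            np.log(2 * np.pi * variances[c]) + (x - means[c]) ** 2 / variances[c]
        )
        score = np.log(priors[c]) + log_likelihood
        if score > best_score:
            best_class, best_score = c, score
    return best_class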
Decision Tree (impl_004_decision_tree)¶
Best sources:
- StatQuest: Decision Trees
- ESL Chapter 9 (Trees, CART)
Split Criteria:
| Criterion | Formula | Range |
|---|---|---|
| Gini | \(1 - \sum p_k^2\) | [0, 1-1/K] (binary: [0, 0.5]) |
| Entropy | \(-\sum p_k \log_2 p_k\) | [0, log2 K] (binary: [0, 1]) |
| MSE (regression) | \(\frac{1}{n}\sum(y_i - \bar{y})^2\) | [0, ∞) |
Recursive splitting:
def build_tree(X, y, depth=0):
    if stopping_condition(y, depth):            # e.g. max depth reached or node is pure
        return Leaf(majority_class(y))
    feature, threshold = find_best_split(X, y)  # minimize Gini / entropy / MSE
    mask = X[:, feature] <= threshold
    left = build_tree(X[mask], y[mask], depth + 1)
    right = build_tree(X[~mask], y[~mask], depth + 1)
    return Node(feature, threshold, left, right)
Neural Network from Scratch (impl_006_neural_network)¶
Best sources:
- Karpathy: micrograd (MUST DO)
- Karpathy: nn-zero-to-hero
- 3Blue1Brown: Neural Networks
Architecture:
Forward pass: \(z = Wx + b, \quad a = \text{ReLU}(z)\)
Backward pass (chain rule): \(\frac{\partial L}{\partial W} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial W} = \frac{\partial L}{\partial z} \cdot x^T\)
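A one-hidden-layer sketch in the spirit of micrograd showing the forward pass and the chain-rule gradients above (illustrative; MSE loss, numpy only, `forward_backward` is a hypothetical helper with W1/b1/W2/b2 as parameter arrays):
import numpy as np

def forward_backward(x, y, W1, b1, W2, b2):
    # Forward pass
    z1 = W1 @ x + b1
    a1 = np.maximum(0, z1)          # ReLU
    y_hat = W2 @ a1 + b2
    loss = 0.5 * np.sum((y_hat - y) ** 2)
    # Backward pass (chain rule)
    dy = y_hat - y                  # dL/dy_hat
    dW2 = np.outer(dy, a1)          # dL/dz2 * a1^T
    db2 = dy
    da1 = W2.T @ dy
    dz1 = da1 * (z1 > 0)            # ReLU gradient
    dW1 = np.outer(dz1, x)
    db1 = dz1
    return loss, (dW1, db1, dW2, db2)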
2. ML Classic Practice (10 tasks)¶
Linear Regression (classic_005_linear_regression)¶
Best sources:
- ISLR Chapter 3 (Linear Regression)
- StatQuest: Linear Regression
OLS Closed-form: \(\hat{w} = (X^TX)^{-1}X^Ty\)
R-squared: \(R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}\)
Adjusted R-squared: \(R^2_{adj} = 1 - \frac{(1-R^2)(n-1)}{n-p-1}\)
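A sketch of the closed-form OLS solution and R² above (illustrative helpers, not library functions; a pseudo-inverse is used for numerical stability):
import numpy as np

def ols_fit(X, y):
    X_b = np.c_[np.ones(len(X)), X]               # add bias column
    w = np.linalg.pinv(X_b.T @ X_b) @ X_b.T @ y   # (X^T X)^{-1} X^T y
    return w

def r_squared(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot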
Classification Metrics (classic_002_classification_metrics)¶
Best sources:
- Google ML Crash Course
- Scikit-learn: Metrics
Confusion Matrix:
| | Predicted + | Predicted - |
|---|---|---|
| Actual + | TP | FN |
| Actual - | FP | TN |
Metrics: \(\text{Precision} = \frac{TP}{TP+FP}\), \(\text{Recall} = \frac{TP}{TP+FN}\), \(F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}\)
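A small sketch computing these metrics from raw confusion-matrix counts (illustrative helper `classification_metrics`, not a library function):
def classification_metrics(tp, fp, fn, tn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}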
Class Imbalance (classic_003_class_imbalance)¶
Best sources:
- Imbalanced-learn documentation
- SMOTE Paper
Strategies:
| Method | Description | Pros/Cons |
|---|---|---|
| Undersampling | Remove majority | Fast, loses data |
| Oversampling | Duplicate minority | Simple, overfitting |
| SMOTE | Synthetic samples | Better, complex |
| Class weights | Loss modification | Easy, no data change |
| Focal Loss | Hard example mining | Deep learning |
SMOTE algorithm:
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_minority, k=5, n_samples=100):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    synthetic = []
    per_point = max(1, n_samples // len(X_minority))
    for x in X_minority:
        # k nearest neighbors of x within the minority class (skip the point itself)
        _, idx = nn.kneighbors([x])
        neighbors = X_minority[idx[0][1:]]
        for _ in range(per_point):
            neighbor = neighbors[np.random.randint(k)]
            alpha = np.random.rand()
            # Interpolate between the point and a random neighbor
            synthetic.append(x + alpha * (neighbor - x))
    return np.array(synthetic)
Ranking Metrics (classic_001_ranking_metrics)¶
Best sources:
- Ranking Metrics Explained
- NDCG Wikipedia
Metrics:
| Metric | Formula | Use Case |
|---|---|---|
| MRR | \(1/\text{rank of first relevant}\) | Search |
| MAP | Mean of AP@k | Retrieval |
| NDCG | \(DCG/IDCG\) | Ranked lists |
NDCG: \(DCG@k = \sum_{i=1}^{k} \frac{2^{rel_i}-1}{\log_2(i+1)}\), \(NDCG@k = \frac{DCG@k}{IDCG@k}\)
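An NDCG@k sketch following the DCG formula above (illustrative; `relevances` is assumed to be a list of graded relevance scores in ranked order):
import numpy as np

def dcg_at_k(relevances, k):
    relevances = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, len(relevances) + 2))  # log2(i+1) for positions 1..k
    return np.sum((2 ** relevances - 1) / discounts)

def ndcg_at_k(relevances, k):
    idcg = dcg_at_k(sorted(relevances, reverse=True), k)    # ideal ordering
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0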
Feature Selection (feat_002_feature_selection)¶
Best sources:
- Scikit-learn: Feature Selection
Methods:
| Type | Methods | When to use |
|---|---|---|
| Filter | Correlation, MI, chi2 | Fast baseline |
| Wrapper | RFE, forward/backward | Best quality |
| Embedded | L1, tree importance | Built-in |
RFE (Recursive Feature Elimination):
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=10)
X_selected = rfe.fit_transform(X, y)
Target Encoding (feat_001_target_encoding)¶
Best sources:
- Target Encoding Explained
- Category Encoders
Formula: \(\text{TE}(cat) = \frac{n \cdot \mu_{cat} + m \cdot \mu_{global}}{n + m}\)
where \(n\) = count in the category and \(m\) = smoothing parameter.
Leave-One-Out:
def target_encoding_loo(df, col, target):
global_mean = df[target].mean()
return df.groupby(col)[target].transform(
lambda x: (x.sum() - x) / (len(x) - 1)
)
Critical: Use CV to avoid leakage!
SVM (classic_007_svm)¶
Best sources:
- StatQuest: SVM
- ESL Chapter 12 (SVMs)
Hard Margin: \(\min_{w,b} \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w^Tx_i + b) \geq 1\)
Soft Margin (slack variables): \(\min_{w,b,\xi} \frac{1}{2}\|w\|^2 + C\sum\xi_i \quad \text{s.t.} \quad y_i(w^Tx_i + b) \geq 1 - \xi_i, \ \xi_i \geq 0\)
Kernel Trick (see the sketch after the kernel table): \(K(x, x') = \phi(x)^T\phi(x')\)
| Kernel | Formula |
|---|---|
| Linear | \(x^Tx'\) |
| Polynomial | \((\gamma x^Tx' + r)^d\) |
| RBF | \(\exp(-\gamma\|x-x'\|^2)\) |
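A sketch of the RBF kernel from the table and its use via sklearn's SVC (illustrative; `rbf_kernel` is a hypothetical helper, and X_train/y_train are assumed to exist):
import numpy as np
from sklearn.svm import SVC

def rbf_kernel(x, x_prime, gamma=0.5):
    # K(x, x') = exp(-gamma * ||x - x'||^2)
    return np.exp(-gamma * np.sum((x - x_prime) ** 2))

# Soft-margin SVM with an RBF kernel; C and gamma are illustrative values
svc = SVC(kernel='rbf', C=1.0, gamma='scale')
svc.fit(X_train, y_train)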
GBDT Internals (classic_008_gbdt_internals)¶
Best sources:
- Gradient Boosting from Scratch
- XGBoost Documentation
Key Idea: Fit trees to residuals (negative gradient).
Algorithm:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, n_trees, lr):
    F = np.full(len(y), y.mean())       # initial prediction = global mean
    trees = []
    for _ in range(n_trees):
        residuals = y - F               # negative gradient of MSE loss
        tree = DecisionTreeRegressor(max_depth=3)
        tree.fit(X, residuals)
        F += lr * tree.predict(X)
        trees.append(tree)
    return trees, F                     # fitted trees and final train predictions
XGBoost vs LightGBM vs CatBoost:
| Feature | XGBoost | LightGBM | CatBoost |
|---|---|---|---|
| Trees | Level-wise | Leaf-wise | Symmetric (oblivious) |
| Categorical | One-hot needed | Native | Native (best) |
| Speed | Good | Fastest | Good |
| Tuning | Medium | Medium | Easy |
3. Ensemble Theory Deep Dive¶
Bias-Variance Decomposition¶
Total Error: \(\text{MSE} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}\)
For an ensemble of \(B\) models: \(\text{Var}(\bar{f}) = \rho \sigma^2 + \frac{1-\rho}{B}\sigma^2\)
where \(\rho\) = correlation between the models' predictions and \(\sigma^2\) = variance of a single model.
Key Insight: Increasing \(B\) only shrinks the second term; diversity (\(1-\rho\)) is critical.
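A small simulation sketch of this formula: averaging B correlated predictors leaves a \(\rho\sigma^2\) variance floor (illustrative; equicorrelated Gaussian "predictions" built from a shared plus an individual component):
import numpy as np

def ensemble_variance(rho=0.6, sigma=1.0, B=50, n_trials=100_000):
    # Each predictor = sqrt(rho)*shared + sqrt(1-rho)*individual, giving pairwise correlation rho
    shared = np.sqrt(rho) * sigma * np.random.randn(n_trials, 1)
    individual = np.sqrt(1 - rho) * sigma * np.random.randn(n_trials, B)
    preds = shared + individual
    return preds.mean(axis=1).var()   # empirical variance of the ensemble average

print(ensemble_variance())  # ~ rho*sigma^2 + (1-rho)/B*sigma^2 = 0.608 for these defaults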
Bagging (Bootstrap Aggregating)¶
Mathematical Formulation: \(\hat{f}_{bag}(x) = \frac{1}{B}\sum_{b=1}^{B}\hat{f}_b(x)\)
Variance Reduction: for \(B\) uncorrelated predictors: \(\text{Var}(\hat{f}_{bag}) = \frac{\text{Var}(\hat{f})}{B}\)
For correlated predictors (\(\rho > 0\)): \(\text{Var}(\hat{f}_{bag}) = \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2\)
Random Forest Innovation: random feature selection at each split decorrelates the trees: \(\text{features\_per\_split} \approx \sqrt{p}\) (classification), \(p/3\) (regression)
Boosting¶
Key Idea: Sequential learning from mistakes (fit residuals).
Gradient Boosting as Gradient Descent in Function Space: \(F_m(x) = F_{m-1}(x) + \eta \cdot h_m(x)\)
where \(h_m\) is fit to the negative gradient: \(r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F=F_{m-1}}\)
AdaBoost Sample Reweighting: \(w_i^{(m+1)} = w_i^{(m)} \cdot \exp(\alpha_m \cdot \mathbb{I}[y_i \neq \hat{y}_i^{(m)}])\)
where \(\alpha_m = \frac{1}{2}\ln\frac{1-\epsilon_m}{\epsilon_m}\)
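A minimal AdaBoost sketch implementing the reweighting above (illustrative; decision stumps as weak learners, labels assumed to be in {-1, +1}):
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=50):
    w = np.full(len(y), 1 / len(y))
    learners, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        eps = np.sum(w[pred != y]) / np.sum(w)            # weighted error
        alpha = 0.5 * np.log((1 - eps) / (eps + 1e-10))   # learner weight alpha_m
        w *= np.exp(alpha * (pred != y))                  # upweight misclassified samples
        w /= w.sum()
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas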
XGBoost Regularization: \(\mathcal{L} = \sum_i L(y_i, \hat{y}_i) + \sum_k \Omega(f_k)\), with \(\Omega(f) = \gamma T + \frac{1}{2}\lambda\|w\|^2\)
where \(T\) = number of leaves and \(w\) = leaf weights.
Bagging vs Boosting Comparison¶
| Aspect | Bagging (RF) | Boosting (GBDT) |
|---|---|---|
| Training | Parallel | Sequential |
| Bias | Same as single tree | Reduces with iterations |
| Variance | Reduces significantly | Can increase |
| Overfitting | Resistant | Prone (need early stopping) |
| Outliers | Robust | Sensitive |
| Trees | Deep (fully grown) | Shallow (weak learners) |
| Interpretability | Feature importance | Feature importance + SHAP |
| Best for | High variance | High bias |
Why More Trees Can Hurt Boosting¶
Random Forest: \(\lim_{B \to \infty} \text{Var}(\text{RF}) = \rho\sigma^2\)
Converges to stable error. More trees = better (diminishing returns).
Gradient Boosting: \(F_M(x) = \sum_{m=1}^{M} \eta \cdot h_m(x)\)
Each iteration fits residuals. Eventually fits noise → overfitting.
Solution: Early stopping, regularization, small learning rate (\(\eta \in [0.01, 0.3]\)).
Stacking (Stacked Generalization)¶
Architecture:
Level 0: [Model1, Model2, ..., ModelK]
↓ predictions on holdout
Level 1: Meta-model (usually Linear/Logistic)
↓
Final prediction
Mathematical Formulation: \(\hat{y} = g(\hat{f}_1(x), \hat{f}_2(x), ..., \hat{f}_K(x))\)
Critical: Use out-of-fold predictions for meta-features to avoid leakage!
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge   # e.g. meta_model = Ridge()

def stacking_fit(base_models, meta_model, X, y, n_folds=5):
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)
    meta_features = np.zeros((len(X), len(base_models)))
    for i, model in enumerate(base_models):
        # Out-of-fold predictions -> meta-features without leakage
        for train_idx, val_idx in kf.split(X):
            model.fit(X[train_idx], y[train_idx])
            meta_features[val_idx, i] = model.predict(X[val_idx])
    meta_model.fit(meta_features, y)
    # Refit base models on the full training set for inference time
    for model in base_models:
        model.fit(X, y)
    return base_models, meta_model
Ensemble Diversity Metrics¶
Q-statistic (for classifiers): \(Q_{ij} = \frac{N^{11}N^{00} - N^{01}N^{10}}{N^{11}N^{00} + N^{01}N^{10}}\)
\(Q \in [-1, 1]\), where \(Q = 0\) means independent.
Double Fault Measure: \(DF_{ij} = \frac{N^{00}}{N}\)
Fraction of samples both classifiers get wrong. Lower = more diverse.
Python Implementation¶
# Random Forest with OOB evaluation
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(
n_estimators=200,
max_features='sqrt',
oob_score=True, # Out-of-bag evaluation
n_jobs=-1
)
rf.fit(X_train, y_train)
print(f"OOB Score: {rf.oob_score_:.3f}")
# Gradient Boosting with early stopping
from sklearn.ensemble import GradientBoostingClassifier
gb = GradientBoostingClassifier(
n_estimators=1000,
learning_rate=0.05,
max_depth=4,
validation_fraction=0.1,
n_iter_no_change=10, # Early stopping
tol=1e-4
)
gb.fit(X_train, y_train)
print(f"Actual iterations: {gb.n_estimators_}")
# XGBoost with full regularization
import xgboost as xgb

# Assumes X_train/y_train and X_val/y_val are already defined
dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)
params = {
'objective': 'binary:logistic',
'learning_rate': 0.05,
'max_depth': 6,
'reg_alpha': 0.1, # L1 regularization
'reg_lambda': 1.0, # L2 regularization
'subsample': 0.8,
'colsample_bytree': 0.8,
}
model = xgb.train(params, dtrain, num_boost_round=500,
evals=[(dval, 'val')], early_stopping_rounds=20)
# CatBoost with native categorical handling
from catboost import CatBoostClassifier
cat = CatBoostClassifier(
iterations=1000,
learning_rate=0.05,
depth=6,
cat_features=categorical_cols, # Native handling!
verbose=False
)
cat.fit(X_train, y_train, eval_set=(X_val, y_val),
early_stopping_rounds=50)
Interview Questions¶
Q1: Why does Random Forest not overfit with more trees but Gradient Boosting can?
A: RF trees are independent and parallel. As \(B \to \infty\), variance converges to \(\rho\sigma^2\) (law of large numbers). Boosting trees are sequential - each fits residuals from previous. Eventually fits noise, not signal. Solution: early stopping, regularization.
Q2: When would you prefer Bagging over Boosting?
A:
- High variance problem (complex models, small data) → Bagging
- High bias problem (simple models, large data) → Boosting
- Noisy data with outliers → Bagging (robust)
- Need interpretable feature importance → Both OK
- Limited compute for training → Bagging (parallelizable)
Q3: Explain the learning rate in Gradient Boosting.
A: Learning rate \(\eta \in (0, 1]\) scales each tree's contribution: \(F_m = F_{m-1} + \eta \cdot h_m\)
Small \(\eta\) requires more trees but:
- Better generalization (smoother updates)
- More robust to overfitting
- Often \(\eta \in [0.01, 0.3]\) works best
Trade-off: \(\eta \times n\_trees \approx \text{constant}\) for similar performance.
Q4: What is the role of max_features in Random Forest?
A: Controls decorrelation between trees:
- sqrt(p) for classification (default)
- p/3 for regression
- Smaller → more diverse trees, higher bias, lower variance
- Larger → more correlated trees, lower bias, higher variance
Q5: How does CatBoost handle categorical features differently?
A:
- No one-hot encoding needed (memory efficient)
- Uses ordered target statistics: each sample's encoding uses only previous samples
- Prevents target leakage
- Handles high-cardinality categories naturally
- Often outperforms manually encoded features
4. Online Learning for Streaming Data¶
Online vs Batch Learning¶
| Aspect | Batch Learning | Online Learning |
|---|---|---|
| Data Access | Full dataset available | Sequential, one at a time |
| Memory Usage | Proportional to data size | Constant (model size only) |
| Model Updates | Periodic retraining | Continuous incremental updates |
| Concept Drift | Model becomes stale | Adapts automatically |
| Latency | Hours to days for retraining | Milliseconds per update |
| Convergence | Global optimization possible | Sequential optimization with regret bounds |
Online Gradient Descent (OGD)¶
Update Rule: \(w_{t+1} = w_t - \eta_t \nabla L(w_t, x_t, y_t)\)
where \(\eta_t\) = learning rate at time \(t\) and \(L\) = loss function.
Learning Rate Schedules:
- Fixed: \(\eta_t = \eta_0\) — works for stationary distributions
- Decreasing: \(\eta_t = \eta_0 / \sqrt{t}\) — theoretical convergence guarantees
- Adaptive: increase when the model performs poorly (tracks concept drift)
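An OGD sketch for logistic loss with a \(1/\sqrt{t}\) schedule (illustrative; `ogd_logistic` is a hypothetical helper that processes one (x, y) pair at a time, y in {0, 1}):
import numpy as np

def ogd_logistic(stream, n_features, eta0=0.1):
    w = np.zeros(n_features)
    for t, (x, y) in enumerate(stream, start=1):
        p = 1 / (1 + np.exp(-w @ x))
        grad = (p - y) * x                 # gradient of log loss for one sample
        eta_t = eta0 / np.sqrt(t)          # decreasing learning rate
        w -= eta_t * grad
    return w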
The Regret Framework¶
Online learning performance is measured through regret — the difference between the algorithm's cumulative loss and that of the best fixed model in hindsight: \(\text{Regret}(T) = \sum_{t=1}^{T} \ell_t(w_t) - \min_w \sum_{t=1}^{T} \ell_t(w)\)
Goal: Sublinear regret growth (\(O(\sqrt{T})\) or \(O(\log T)\)) → average per-step regret → 0.
Follow-The-Regularized-Leader (FTRL)¶
Update Rule: \(w_{t+1} = \arg\min_w \left[ \sum_{i=1}^{t} L(w, x_i, y_i) + R(w) \right]\)
where \(R(w)\) = regularization (L1 for sparsity, L2 for smoothness).
FTRL-Proximal (Google production):
- Per-coordinate learning rates adapted to gradient history
- L1 penalty drives weights to zero (sparse models)
- Crucial for high-dimensional features (millions/billions)
Perceptron and Passive-Aggressive Algorithms¶
Perceptron Update (if misclassified): \(w_{t+1} = w_t + y_t x_t\)
Passive-Aggressive (PA): \(w_{t+1} = w_t + \tau_t y_t x_t, \quad \tau_t = \frac{\ell_t}{\|x_t\|^2}\)
Makes minimal update to achieve correct classification with margin.
Concept Drift Handling¶
Types of Concept Drift:

| Type | Description | Example |
|---|---|---|
| Sudden | Abrupt distribution change | System failure, policy change |
| Gradual | Slow evolution over time | User preference shifts |
| Incremental | Steady directional change | Market trends |
| Recurring | Cyclical patterns | Seasonal effects |
Drift Detection Methods:
- ADWIN (Adaptive Windowing): statistical test for distribution change
- DDM (Drift Detection Method): monitors error rate changes
- EDDM (Early DDM): detects drift earlier using the distance between errors
- Page-Hinkley Test: detects mean shifts in the stream (see the sketch below)
Adaptation Strategies:
1. Sliding Window: keep only the most recent \(W\) samples
2. Exponential Decay: weight recent samples more: \(w_t = \alpha^t\)
3. Ensemble Reset: replace the worst-performing model when drift is detected
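A minimal Page-Hinkley sketch for detecting an upward shift in a stream's mean (e.g. the online error rate); the `delta` tolerance and `threshold` values are illustrative:
class PageHinkley:
    def __init__(self, delta=0.005, threshold=50.0):
        self.delta, self.threshold = delta, threshold
        self.mean, self.n, self.cum, self.min_cum = 0.0, 0, 0.0, 0.0

    def update(self, value):
        # Running mean of the stream
        self.n += 1
        self.mean += (value - self.mean) / self.n
        # Cumulative deviation above the mean (minus tolerance delta)
        self.cum += value - self.mean - self.delta
        self.min_cum = min(self.min_cum, self.cum)
        # Drift if the deviation has grown far above its historical minimum
        return self.cum - self.min_cum > self.threshold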
Python Implementation¶
from sklearn.linear_model import SGDClassifier, PassiveAggressiveClassifier
from sklearn.naive_bayes import MultinomialNB
import numpy as np
# Online SGD Classifier
clf = SGDClassifier(loss='log_loss', learning_rate='optimal', eta0=0.01)
# Simulating streaming data
def online_learning_stream(clf, data_stream, classes):
for X_batch, y_batch in data_stream:
# Partial fit updates model incrementally
clf.partial_fit(X_batch, y_batch, classes=classes)
# Model is ready immediately after each batch
return clf
# Example with chunks
classes = np.array([0, 1])
for chunk_idx in range(10):
X_chunk, y_chunk = get_next_chunk(chunk_idx)
clf.partial_fit(X_chunk, y_chunk, classes=classes)
# Passive-Aggressive Classifier (good for text)
pa_clf = PassiveAggressiveClassifier(C=1.0, max_iter=1000)
pa_clf.partial_fit(X_first, y_first, classes=classes)
# River library for advanced streaming ML
from river import linear_model, optim, preprocessing
model = (
preprocessing.StandardScaler() |
linear_model.LogisticRegression(optimizer=optim.SGD(0.01))
)
for x_dict, y in data_stream:
# Learn from one sample
y_pred = model.predict_one(x_dict)
model.learn_one(x_dict, y)
Hoeffding Trees (VFDT)¶
Key Idea: Use Hoeffding bound to decide when to split with statistical confidence:
If best split - second best split > \(\epsilon\), make the split.
Advantages:
- Single pass through the data
- Constant memory
- Theoretical guarantees
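A sketch of the Hoeffding bound \(\epsilon\) behind the split decision above (illustrative helper; R is the range of the split criterion, e.g. log2(n_classes) for information gain):
import numpy as np

def hoeffding_bound(value_range, confidence, n_samples):
    # epsilon = sqrt(R^2 * ln(1/delta) / (2n)), with delta = 1 - confidence
    return np.sqrt(value_range ** 2 * np.log(1 / (1 - confidence)) / (2 * n_samples))

# Split if the observed gain difference between the two best candidates exceeds epsilon
eps = hoeffding_bound(value_range=np.log2(2), confidence=0.95, n_samples=500)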
from river import forest, tree
# Hoeffding Adaptive Tree
hat = tree.HoeffdingAdaptiveTreeClassifier()
for x_dict, y in data_stream:
y_pred = hat.predict_one(x_dict)
hat.learn_one(x_dict, y)
# Adaptive Random Forest
arf = forest.ARFClassifier(n_models=10)
for x_dict, y in data_stream:
arf.learn_one(x_dict, y)
Interview Questions¶
Q1: When would you choose online learning over batch learning?
A: Use online learning when:
- Data volume exceeds memory (can't store all)
- Real-time adaptation needed (fraud detection, recommendations)
- Concept drift expected (user preferences, market conditions)
- Latency requirements demand immediate updates
- Data arrives continuously (IoT sensors, clickstreams)
Q2: Explain the regret framework in online learning.
A: Regret = difference between the cumulative loss of the online algorithm vs. the best fixed model in hindsight: \(\text{Regret}(T) = \sum_{t=1}^{T} \ell_t(w_t) - \min_w \sum_{t=1}^{T} \ell_t(w)\)
Goal: sublinear regret (\(O(\sqrt{T})\)) → average regret → 0. Guarantees convergence despite sequential, myopic updates.
Q3: How does FTRL-Proximal achieve sparsity?
A: FTRL-Proximal adds L1 regularization to the FTRL objective: \(w_{t+1} = \arg\min_w \left[\sum L(w) + \lambda_1 \|w\|_1 + \lambda_2 \|w\|_2^2\right]\)
L1 penalty drives many weights to exactly zero. Combined with per-coordinate learning rates (features with consistent gradients get smaller rates), this produces sparse models suitable for high-dimensional problems (millions of features).
Q4: What is concept drift and how do you handle it?
A: Concept drift = change in data distribution over time. Types: sudden, gradual, incremental, recurring. Handling strategies:
- Detection: ADWIN, DDM, EDDM, Page-Hinkley tests
- Adaptation: sliding windows, exponential decay, ensemble reset
- Models: Hoeffding Adaptive Trees, Adaptive Random Forest
Q5: Compare SGD partial_fit vs standard fit.
A:
- fit(): Trains on entire dataset, overwrites previous state
- partial_fit(): Incremental update, preserves learned weights
# Batch (full retrain)
clf.fit(X_all, y_all)
# Online (incremental)
for X_chunk, y_chunk in stream:
clf.partial_fit(X_chunk, y_chunk, classes=[0,1])
5. Multi-Label Classification¶
Multi-Label vs Single-Label¶
| Aspect | Single-Label | Multi-Label |
|---|---|---|
| Output | One category | Multiple categories |
| Example | Email → Spam OR Not Spam | Article → Politics AND Economics AND Tech |
| Output Space | \(K\) classes | \(2^K\) possible combinations |
| Complexity | Linear in \(K\) | Exponential in \(K\) |
Problem Transformation Methods¶
1. Binary Relevance (BR)¶
Idea: Train \(L\) independent binary classifiers (one per label).
Pros: Simple, parallelizable, no label dependency assumptions.
Cons: Ignores label correlations.
from sklearn.multioutput import MultiOutputClassifier
from sklearn.linear_model import LogisticRegression
br = MultiOutputClassifier(LogisticRegression(max_iter=1000))
br.fit(X_train, y_train) # y_train shape: (n_samples, n_labels)
y_pred = br.predict(X_test)
2. Classifier Chains (CC)¶
Idea: Chain classifiers, each uses previous predictions as features.
Pros: Captures label dependencies.
Cons: Order-dependent, error propagation.
from sklearn.multioutput import ClassifierChain
from sklearn.ensemble import RandomForestClassifier
cc = ClassifierChain(
RandomForestClassifier(n_estimators=100),
order='random', # or specify order
random_state=42
)
cc.fit(X_train, y_train)
y_pred = cc.predict(X_test)
3. Label Powerset (LP)¶
Idea: Treat each unique label combination as a single class.
Pros: Captures all label correlations.
Cons: Exponential number of classes, severe class imbalance.
from skmultilearn.problem_transform import LabelPowerset
from sklearn.ensemble import RandomForestClassifier
lp = LabelPowerset(RandomForestClassifier())
lp.fit(X_train, y_train)
Comparison Table¶
| Method | Label Dependencies | Complexity | Parallelizable | Best For |
|---|---|---|---|---|
| Binary Relevance | None | \(O(L)\) | Yes | Independent labels |
| Classifier Chains | Sequential | \(O(L)\) | No | Sequential dependencies |
| Label Powerset | All | \(O(2^L)\) | Yes | Few label combinations |
Multi-Label Evaluation Metrics¶
Hamming Loss¶
Fraction of incorrect label predictions: \(\text{Hamming Loss} = \frac{1}{N \cdot L} \sum_{i=1}^{N} \sum_{\ell=1}^{L} \mathbb{I}[y_{i\ell} \neq \hat{y}_{i\ell}]\)
Lower is better (0 = perfect).
Jaccard Score (Intersection over Union)¶
Per-sample score \(\text{Jaccard}_i = \frac{|Y_i \cap \hat{Y}_i|}{|Y_i \cup \hat{Y}_i|}\), averaged over samples: \(\frac{1}{N} \sum_i \text{Jaccard}_i\)
Higher is better (1 = perfect).
Subset Accuracy (Exact Match)¶
\(\text{Subset Accuracy} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{I}[\hat{Y}_i = Y_i]\). Very strict — requires all labels correct.
from sklearn.metrics import hamming_loss, jaccard_score, f1_score
# Evaluation
hl = hamming_loss(y_true, y_pred) # Lower is better
js = jaccard_score(y_true, y_pred, average='samples') # Higher is better
f1_macro = f1_score(y_true, y_pred, average='macro')
f1_micro = f1_score(y_true, y_pred, average='micro')
Algorithm Adaptation Methods¶
MLkNN (Multi-Label k-Nearest Neighbors)¶
Uses label frequencies among the k nearest neighbors: \(\hat{y}_\ell = \mathbb{I}\left[\frac{\sum_{j \in N_k(x)} \mathbb{I}[y_{j\ell}=1]}{k} > \text{threshold}\right]\)
Python Implementation¶
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier, ClassifierChain
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import hamming_loss, jaccard_score, f1_score
# Generate multi-label data
X, y = make_multilabel_classification(
n_samples=1000,
n_features=20,
n_classes=5,
n_labels=2, # avg labels per instance
random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# Binary Relevance
br = MultiOutputClassifier(LogisticRegression(max_iter=1000))
br.fit(X_train, y_train)
y_pred_br = br.predict(X_test)
# Classifier Chain
cc = ClassifierChain(RandomForestClassifier(n_estimators=100))
cc.fit(X_train, y_train)
y_pred_cc = cc.predict(X_test)
# Evaluation function
def evaluate_multilabel(y_true, y_pred, name):
return {
'Method': name,
'Hamming Loss': hamming_loss(y_true, y_pred),
'Jaccard': jaccard_score(y_true, y_pred, average='samples'),
'F1 Micro': f1_score(y_true, y_pred, average='micro'),
'F1 Macro': f1_score(y_true, y_pred, average='macro')
}
print(evaluate_multilabel(y_test, y_pred_br, 'Binary Relevance'))
print(evaluate_multilabel(y_test, y_pred_cc, 'Classifier Chain'))
Interview Questions¶
Q1: When would you choose Binary Relevance vs Classifier Chains?
A:
- Binary Relevance: labels are independent, need parallelization, simple baseline
- Classifier Chains: labels are correlated (e.g., "Python" → "Programming"), sequential dependencies matter
Q2: Why is Label Powerset rarely used despite capturing all correlations?
A: Two main issues:
1. Exponential classes: with \(L\) labels, up to \(2^L\) combinations. Many are rare or unseen.
2. Class imbalance: most combinations have few samples, some have none.
Better for small \(L\) where most combinations appear in training.
Q3: Explain Hamming Loss vs Subset Accuracy.
A:
- Hamming Loss: fraction of wrong labels (per-label). Allows partial credit. Range: [0, 1], lower is better.
- Subset Accuracy: exact match required (per-instance). No partial credit. Very strict.
Example: True = [1,1,0,0], Pred = [1,0,0,0]
- Hamming Loss = 1/4 = 0.25 (one wrong label)
- Subset Accuracy = 0 (not an exact match)
Q4: Design a system for multi-label document classification with 1000 labels.
A:
1. Feature extraction: TF-IDF or embeddings (BERT, sentence-transformers)
2. Model choice:
   - If label correlations are weak: Binary Relevance with Linear SVM (fast, scalable)
   - If correlations matter: Classifier Chains (ensemble of chains for robustness)
   - For deep learning: binary cross-entropy with sigmoid outputs
3. Handling imbalance: per-label class weights, focal loss
4. Evaluation: Hamming Loss (primary), Jaccard Score, per-label F1
5. Production: model compression, label pruning (remove rare labels)
Q5: How do you handle extreme label imbalance in multi-label settings?
A:
1. Per-label weighting: higher weight for rare labels in the loss
2. Focal Loss: focus on hard examples
3. SMOTE variants: oversample minority label combinations
4. Negative sampling: for very many labels, sample negative labels
5. Label hierarchy: group similar labels, predict at the group level first
6. Semi-Supervised Learning¶
Core Assumptions¶
SSL relies on key assumptions about data structure:
| Assumption | Description | When It Breaks |
|---|---|---|
| Smoothness | Similar inputs → similar outputs | Noisy data, outliers |
| Cluster | Data forms groups, boundaries avoid clusters | High-dimensional, sparse data |
| Manifold | Data lies on low-dimensional manifold | Manifold crossing, holes |
| Low-density | Decision boundaries pass through low-density regions | Class overlap |
SSL Methods Taxonomy¶
1. Wrapper Methods (Self-Training)¶
Idea: Train on labeled data → predict unlabeled → add high-confidence predictions → repeat.
import numpy as np
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.ensemble import RandomForestClassifier

# -1 marks unlabeled samples; stack labeled and unlabeled data together
X_combined = np.vstack([X_labeled, X_unlabeled])
y_semi = np.concatenate([y_labeled, -np.ones(len(X_unlabeled), dtype=int)])
base_clf = RandomForestClassifier(n_estimators=100)
self_training = SelfTrainingClassifier(base_clf, threshold=0.9)
self_training.fit(X_combined, y_semi)
Risk: Confirmation bias — errors reinforce themselves.
2. Consistency Regularization¶
Key Idea: Model should give similar predictions for perturbed inputs.
Loss: \(\mathcal{L} = \mathcal{L}_{sup} + \lambda \cdot \mathbb{E}_{x, x'}[\|f(x) - f(x')\|^2]\)
where \(x'\) is augmented version of \(x\).
FixMatch Algorithm:
1. Generate weak and strong augmentations
2. If the weak-augmentation prediction is confident (>\(\tau\)), use it as a pseudo-label
3. Train on the strong augmentation with that pseudo-label
# FixMatch pseudo-code (CE, softmax, weak/strong_augment and lambda_u assumed defined)
def fixmatch_loss(model, x_l, y_l, x_u, tau=0.95):
    # Supervised loss on the labeled batch
    loss_sup = CE(model(x_l), y_l)
    # Unlabeled: pseudo-label from the weak augmentation
    probs_u = softmax(model(weak_augment(x_u)), dim=1)
    max_probs, pseudo_labels = probs_u.max(dim=1)        # confidence + hard pseudo-label
    mask = (max_probs > tau).float()                     # keep only confident samples
    # Train the strong augmentation to match the pseudo-label
    loss_unsup = (CE(model(strong_augment(x_u)), pseudo_labels, reduction='none') * mask).mean()
    return loss_sup + lambda_u * loss_unsup
3. Mean Teacher¶
Idea: Teacher (EMA of student) generates targets for student.
# Mean Teacher update
def update_teacher(teacher, student, alpha=0.99):
for t_param, s_param in zip(teacher.parameters(), student.parameters()):
t_param.data = alpha * t_param.data + (1 - alpha) * s_param.data
4. MixMatch¶
Combines multiple techniques:
1. Augmentation: multiple augmentations per sample
2. Pseudo-labeling: average predictions → sharpen
3. MixUp: interpolate labeled and unlabeled samples
Comparison Table¶
| Method | Key Idea | Pros | Cons |
|---|---|---|---|
| Self-Training | Pseudo-label high-confidence | Simple, any model | Confirmation bias |
| FixMatch | Strong/weak aug + threshold | SOTA on CIFAR | Needs good augmentations |
| Mean Teacher | EMA teacher targets | Stable, no threshold | Architecture-specific |
| MixMatch | Mix multiple techniques | Combines strengths | Many hyperparameters |
| S3VM | Adjust margin for unlabeled | Theoretical | Doesn't scale |
Production Considerations¶
Confirmation Bias Mitigation:
- Ensemble disagreement
- Uncertainty-aware filtering
- Teacher-student with EMA
- Human review of low-confidence predictions
Distribution Mismatch:
- Monitor drift between labeled and unlabeled distributions
- Use statistical tests (KL divergence, PSI)
- Filter out-of-distribution unlabeled data
Loss Balancing: \(\mathcal{L}_{total} = \mathcal{L}_{sup} + \lambda(t) \cdot \mathcal{L}_{unsup}\)
Ramp up \(\lambda(t)\) gradually: \(\lambda(t) = \lambda_{max} \cdot e^{-5(1-t/T)^2}\)
Python Implementation¶
import torch
import torch.nn.functional as F
from sklearn.semi_supervised import LabelSpreading
# Sklearn: Label Propagation
lp = LabelSpreading(kernel='knn', n_neighbors=7)
lp.fit(X_combined, y_semi) # -1 for unlabeled
y_pred = lp.predict(X_test)
# PyTorch: FixMatch-style training
def train_ssl_epoch(model, optimizer, labeled_loader, unlabeled_loader, tau=0.95):
model.train()
for (x_l, y_l), x_u in zip(labeled_loader, unlabeled_loader):
# Supervised loss
logits_l = model(x_l)
loss_sup = F.cross_entropy(logits_l, y_l)
# Pseudo-labeling with threshold
with torch.no_grad():
logits_u = model(x_u)
probs_u = F.softmax(logits_u, dim=1)
max_probs, pseudo_labels = probs_u.max(dim=1)
mask = (max_probs >= tau).float()
# Unsupervised loss (only confident predictions)
logits_u_strong = model(strong_augment(x_u))
loss_unsup = (F.cross_entropy(logits_u_strong, pseudo_labels, reduction='none') * mask).mean()
loss = loss_sup + 1.0 * loss_unsup
optimizer.zero_grad()
loss.backward()
optimizer.step()
Interview Questions¶
Q1: When does SSL help vs. hurt?
A:
- Helps: small labeled set (5-10%), unlabeled data matches the distribution, smooth/clustered data
- Hurts: distribution mismatch, noisy unlabeled data, class overlap, confirmation bias
Always validate on held-out labeled data.
Q2: Explain the confirmation bias problem in self-training.
A: Early errors in pseudo-labels get reinforced across epochs. A wrong high-confidence prediction becomes training data for future iterations, making the model more confident in the wrong direction.
Solutions: ensemble disagreement, uncertainty filtering, Mean Teacher (stable EMA targets), human review.
Q3: Compare FixMatch vs Mean Teacher.
A:
- FixMatch: uses a confidence threshold + strong augmentations. If the weak-aug prediction is >\(\tau\), train on the strong-aug version. Threshold-based filtering.
- Mean Teacher: teacher (EMA of student) provides soft targets. No threshold needed. Consistency loss between student and teacher predictions.
Both are SOTA, FixMatch often simpler to implement.
Q4: How much labeled data do you need for SSL to work?
A: Rule of thumb: 5-10% of the full dataset is often sufficient. But it depends on:
- Task complexity (simple → fewer labels needed)
- Data quality (clean → fewer labels needed)
- Model capacity (larger models need more data)
Start small, validate often, expand labeled set if gains plateau.
Q5: How do you detect distribution drift between labeled and unlabeled data?
A:
1. Feature statistics: compare mean/variance of features
2. KL divergence: compare predicted class distributions
3. PSI (Population Stability Index): feature-level drift
4. kNN distance: average distance to labeled neighbors
5. Domain classifier: train a classifier to distinguish labeled vs unlabeled
Flag samples with high drift scores for human review.
7. Active Learning¶
Active Learning Loop¶
Active learning is a special case of supervised ML where the model chooses which data to label.
Loop:
1. Initial Training: train on a small labeled set
2. Inference: predict on the unlabeled pool
3. Query: select the most informative samples (acquisition function)
4. Label: a human annotates the selected samples
5. Augment & Retrain: add to the training set, repeat
Query Strategies¶
1. Uncertainty Sampling¶
Select samples where model is least confident.
Least Confidence: \(x^* = \arg\max_x \left(1 - P_\theta(\hat{y}|x)\right)\)
where \(\hat{y} = \arg\max_y P_\theta(y|x)\)
Margin Sampling: \(x^* = \arg\min_x \left(P_\theta(\hat{y}_1|x) - P_\theta(\hat{y}_2|x)\right)\)
where \(\hat{y}_1, \hat{y}_2\) are the top-2 predictions.
Entropy: \(x^* = \arg\max_x H(P_\theta(y|x)) = \arg\max_x \left(-\sum_y P_\theta(y|x) \log P_\theta(y|x)\right)\)
2. Query-by-Committee (QBC)¶
Train ensemble of models, select samples with highest disagreement.
Vote Entropy: \(x^* = \arg\max_x \left(-\sum_y \frac{V(y)}{C} \log \frac{V(y)}{C}\right)\)
where \(V(y)\) = votes for class \(y\), \(C\) = committee size.
KL Divergence: \(x^* = \arg\max_x \frac{1}{C}\sum_{c=1}^{C} D_{KL}(P_{\theta_c} \| P_{avg})\)
3. Expected Model Change¶
Select samples that would cause the largest gradient change: \(x^* = \arg\max_x \|\nabla_\theta \mathcal{L}(\theta; x, \hat{y})\|\)
4. Diversity Sampling¶
Avoid selecting similar samples. Use clustering or embeddings.
import numpy as np
from sklearn.cluster import KMeans
# Diversity sampling via clustering
def diversity_sampling(X_unlabeled, n_samples, embeddings):
kmeans = KMeans(n_clusters=n_samples)
kmeans.fit(embeddings)
# Select sample closest to each centroid
selected = []
for i in range(n_samples):
cluster_samples = np.where(kmeans.labels_ == i)[0]
centroid = kmeans.cluster_centers_[i]
distances = np.linalg.norm(embeddings[cluster_samples] - centroid, axis=1)
selected.append(cluster_samples[np.argmin(distances)])
return selected
Comparison Table¶
| Strategy | Formula | Best For | Pitfalls |
|---|---|---|---|
| Least Confidence | \(\max(1-P(\hat{y}))\) | Binary/multiclass | Ignores runner-up |
| Margin | \(\min(P(\hat{y}_1)-P(\hat{y}_2))\) | Multiclass | Needs calibrated probs |
| Entropy | \(\max H(P(y\mid x))\) | Many classes | Can select outliers |
| QBC | Max vote entropy | Model ensembles | Computation heavy |
| Diversity | Clustering | Large pools | May miss rare cases |
Python Implementation¶
import numpy as np
from sklearn.base import clone
class ActiveLearner:
def __init__(self, estimator, strategy='entropy'):
self.estimator = estimator
self.strategy = strategy
def query(self, X_pool, n_samples=10):
"""Select most informative samples from pool."""
probs = self.estimator.predict_proba(X_pool)
if self.strategy == 'least_confident':
scores = 1 - np.max(probs, axis=1)
elif self.strategy == 'margin':
sorted_probs = np.sort(probs, axis=1)[:, ::-1]
scores = sorted_probs[:, 0] - sorted_probs[:, 1]
scores = -scores # Lower margin = higher score
elif self.strategy == 'entropy':
scores = -np.sum(probs * np.log(probs + 1e-10), axis=1)
return np.argsort(scores)[-n_samples:]
def teach(self, X, y):
"""Retrain with new labeled samples."""
self.estimator.fit(X, y)
# Active learning loop
def active_learning_loop(estimator, X_labeled, y_labeled, X_pool,
oracle, n_iterations=10, n_per_iter=10):
learner = ActiveLearner(estimator, strategy='entropy')
learner.teach(X_labeled, y_labeled)
for i in range(n_iterations):
# Query most informative samples
query_idx = learner.query(X_pool, n_samples=n_per_iter)
# Get labels from oracle (human annotator)
X_query = X_pool[query_idx]
y_query = oracle(X_query) # Human labels these
# Add to labeled set
X_labeled = np.vstack([X_labeled, X_query])
y_labeled = np.concatenate([y_labeled, y_query])
# Remove from pool
mask = np.ones(len(X_pool), dtype=bool)
mask[query_idx] = False
X_pool = X_pool[mask]
# Retrain
learner.teach(X_labeled, y_labeled)
print(f"Iteration {i+1}: {len(X_labeled)} labeled samples")
return learner
# Using modAL library (specialized for active learning)
from modAL.models import ActiveLearner as ModALLearner
from modAL.uncertainty import entropy_sampling
learner = ModALLearner(
estimator=RandomForestClassifier(),
query_strategy=entropy_sampling
)
learner.fit(X_initial, y_initial)
query_idx, query_inst = learner.query(X_pool, n_instances=10)
learner.teach(X_pool[query_idx], y_new_labels)
When Active Learning Fails¶
Red Flags:
- Biased queries: the model selects from a narrow region, ignores others
- Annotation drift: annotator fatigue → lower-quality labels over time
- Cold start problem: initial model too weak → bad query selection
- Outlier focus: uncertainty sampling picks noise/outliers
Mitigation:
- Mix uncertainty with diversity sampling
- Regular validation on a held-out test set
- Start with diversity-based selection for cold start
- Filter outliers before querying
Interview Questions¶
Q1: When would you use active learning vs. labeling everything?
A: Use AL when:
- Labeling is expensive (medical imaging, legal docs)
- The task has many "easy" samples the model already handles
- Budget/time is limited
- The model already has a reasonable baseline
Don't use AL when:
- The task is novel (no good initial model)
- Every sample is valuable
- You need to guarantee coverage
- Annotation is cheap
Q2: Compare uncertainty sampling vs query-by-committee.
A:
- Uncertainty Sampling: single model, uses confidence/probability. Fast, simple. Needs calibrated probabilities.
- QBC: multiple models, measures disagreement. Better captures model uncertainty, not just confidence. More compute, needs an ensemble.
QBC often better for deep learning where single-model confidence is unreliable.
Q3: How do you prevent active learning from selecting only outliers?
A:
1. Diversity constraint: mix uncertainty with clustering-based diversity
2. Typicality filter: exclude samples far from the training distribution
3. Human review: the annotator can reject "obvious error" queries
4. Hybrid strategies: 50% uncertainty + 50% random/diversity
Q4: Design an annotation system for multi-class image classification.
A:
1. Cold start: use pre-trained features + clustering to select a diverse initial batch
2. Query strategy: mix entropy (70%) + diversity (30%)
3. Batch mode: query 100-500 images per iteration (not 1)
4. Annotation UI: show model predictions, allow quick correction
5. Stopping criterion: stop when validation accuracy plateaus for 3 iterations
6. Quality control: random sample review, annotator agreement tracking
Q5: What's the difference between active learning and semi-supervised learning?
A:

| Aspect | Active Learning | Semi-Supervised |
|---|---|---|
| Labels from | Human (oracle) | Model (pseudo-labels) |
| Unlabeled data | Pool to query | Training data |
| Cost | Annotation cost | Compute cost |
| Quality | Guaranteed correct | May propagate errors |
| When to use | Labels expensive | Some labels available |
Can combine: use AL to select samples for human labeling + SSL to leverage remaining unlabeled.
8. Practical Skills¶
Feature Engineering Checklist¶
- Missing values: impute vs drop
- Categorical: one-hot, target, embedding
- Numerical: scale, log transform, bin
- Outliers: cap, remove, transform
- Interactions: polynomial, domain-specific
- Time features: cyclical encoding
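A sketch of cyclical encoding for the time features mentioned in the checklist above (illustrative helper `encode_cyclical`; maps hour-of-day onto the unit circle so 23 and 0 end up close, and assumes a dataframe `df` with an 'hour' column):
import numpy as np
import pandas as pd

def encode_cyclical(df, col, period):
    df[f"{col}_sin"] = np.sin(2 * np.pi * df[col] / period)
    df[f"{col}_cos"] = np.cos(2 * np.pi * df[col] / period)
    return df

df = encode_cyclical(df, "hour", period=24)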
Cross-Validation Strategy¶
| Data Type | CV Strategy |
|---|---|
| IID | K-Fold (5-10) |
| Imbalanced | Stratified K-Fold |
| Time Series | TimeSeriesSplit |
| Groups | GroupKFold |
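A short sketch of picking a splitter from the table above (illustrative; `model`, `X`, `y`, and `user_ids` are assumed to exist):
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit, GroupKFold, cross_val_score

# Imbalanced classes: preserve class ratios in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring='f1')

# Time series: always train on the past, validate on the future
ts_cv = TimeSeriesSplit(n_splits=5)

# Grouped data (e.g. multiple rows per user): keep each group in a single fold
group_cv = GroupKFold(n_splits=5)
scores = cross_val_score(model, X, y, cv=group_cv, groups=user_ids)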
Hyperparameter Tuning¶
| Method | When to use |
|---|---|
| Grid Search | Few params, small space |
| Random Search | Large space |
| Bayesian (Optuna) | Expensive training |
| Successive Halving | Many configs, limited time |
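A random-search sketch matching the table above (illustrative; the parameter distributions are example values for a gradient boosting model, and X_train/y_train are assumed to exist):
from scipy.stats import randint, uniform
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    'n_estimators': randint(100, 1000),
    'learning_rate': uniform(0.01, 0.3),   # samples from [0.01, 0.31]
    'max_depth': randint(2, 8),
}
search = RandomizedSearchCV(
    GradientBoostingClassifier(),
    param_distributions,
    n_iter=50,            # number of sampled configurations
    cv=5,
    scoring='roc_auc',
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)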
Connections Between Topics¶
Linear Regression → Logistic Regression → Neural Networks
↓ ↓ ↓
OLS closed-form Sigmoid + BCE Backprop
↓ ↓ ↓
Regularization Classification Optimizers
↓ Metrics ↓
L1/L2 → Feature Selection ← ← ← ← ← Gradient Descent
↓
Decision Trees → Random Forest → Gradient Boosting
↓
Gini/Entropy ← ← ← ← Information Theory
Recommended Study Order¶
- Week 1: KNN, Linear/Logistic Regression (from scratch)
- Week 2: Decision Trees, Random Forest
- Week 3: Gradient Boosting (theory + XGBoost)
- Week 4: SVM (kernel trick), Naive Bayes
- Week 5: Feature Engineering + Selection
- Week 6: Practical: Kaggle competition