
Classical ML: Study Materials

~18 minute read

Prerequisites: Model Selection | sklearn

Materials for 16 problems from the Classical ML category. Formulas, code, and key sources for each algorithm. Updated: 2026-02-11


1. ML Algorithms from Scratch (6 problems)

KNN (impl_003_knn)

Best sources: - Stanford CS229: KNN — lecture notes - StatQuest: KNN — YouTube visualization - Hands-On ML Chapter 3 — practical

Key concepts:

| Distance | Formula | When to use |
|---|---|---|
| Euclidean | \(\sqrt{\sum(x_i-y_i)^2}\) | Default, continuous |
| Manhattan | \(\sum\lvert x_i-y_i\rvert\) | High dimensions, robust to outliers |
| Cosine | \(1 - \frac{x \cdot y}{\|x\|\|y\|}\) | Text, sparse |

Curse of Dimensionality: \(\text{Volume of unit ball} = \frac{\pi^{d/2}}{\Gamma(d/2+1)} \to 0 \text{ as } d \to \infty\)

In high dimensions all points end up "far" from each other: pairwise distances concentrate, so the notion of a nearest neighbor becomes much less informative.
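
A minimal from-scratch sketch of the prediction step (brute-force neighbor search with Euclidean distance; the function name and interface are illustrative):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=5):
    # Distances from the query point to every training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among the neighbors' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]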

Logistic Regression (impl_001_logistic_regression)

Best sources: - StatQuest: Logistic Regression - CS229 Notes: Logistic Regression - Scikit-learn User Guide

Key formulas: \(\sigma(z) = \frac{1}{1+e^{-z}}\)

\[L = -[y\log(\hat{y}) + (1-y)\log(1-\hat{y})]\]
\[\frac{\partial L}{\partial w} = X^T(\hat{y} - y)\]

Decision Boundary: \(w^Tx + b = 0\)
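
A minimal gradient-descent sketch tying these formulas together (assumes a NumPy feature matrix X and binary labels y; names are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, n_iters=1000):
    # Gradient descent on binary cross-entropy
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(n_iters):
        y_hat = sigmoid(X @ w + b)
        # dL/dw = X^T (y_hat - y), averaged over the batch
        w -= lr * X.T @ (y_hat - y) / len(y)
        b -= lr * np.mean(y_hat - y)
    return w, b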

K-Means (impl_002_kmeans)

Best sources: - StatQuest: K-Means - CS229: K-Means

Algorithm:
1. Init centroids (random or k-means++)
2. Assign points to nearest centroid
3. Update centroids = mean of assigned points
4. Repeat until convergence

K-means++ initialization:

import numpy as np

def kmeans_plus_plus_init(X, k):
    # First centroid: a uniformly random data point
    centroids = [X[np.random.randint(len(X))]]
    for _ in range(k - 1):
        # Distance from each point to its nearest already-chosen centroid
        distances = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        # Next centroid sampled with probability proportional to squared distance
        probs = distances ** 2 / np.sum(distances ** 2)
        centroids.append(X[np.random.choice(len(X), p=probs)])
    return np.array(centroids)

Elbow Method: Plot \(J(C)\) vs \(k\), find "elbow"

Naive Bayes (impl_005_naive_bayes)

Best sources: - StatQuest: Naive Bayes - Scikit-learn: Naive Bayes

Gaussian NB: \(P(x_i|y=c) = \frac{1}{\sqrt{2\pi\sigma_c^2}} \exp\left(-\frac{(x_i-\mu_c)^2}{2\sigma_c^2}\right)\)

Prediction: \(\hat{y} = \arg\max_c P(c) \prod_i P(x_i|c)\)

Log-space (prevent underflow): \(\log P(y|x) = \log P(y) + \sum_i \log P(x_i|y)\)
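
A compact Gaussian NB sketch that fits per-class means/variances and predicts in log-space (the class name and the small variance floor are illustrative choices, not from a specific library):

import numpy as np

class GaussianNaiveBayes:
    def fit(self, X, y):
        self.classes = np.unique(y)
        # Per-class feature means, variances (with a small floor), and log-priors
        self.mu = np.array([X[y == c].mean(axis=0) for c in self.classes])
        self.var = np.array([X[y == c].var(axis=0) + 1e-9 for c in self.classes])
        self.log_prior = np.log([np.mean(y == c) for c in self.classes])
        return self

    def predict(self, X):
        # log P(x|c) summed over features, computed in log-space to avoid underflow
        log_lik = -0.5 * (np.log(2 * np.pi * self.var)[None]
                          + (X[:, None, :] - self.mu) ** 2 / self.var[None]).sum(axis=2)
        return self.classes[np.argmax(self.log_prior + log_lik, axis=1)]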

Decision Tree (impl_004_decision_tree)

Best sources: - StatQuest: Decision Trees - ESL Chapter 9 — Trees, CART

Split Criteria:

| Criterion | Formula | Range |
|---|---|---|
| Gini | \(1 - \sum p_k^2\) | [0, 1-1/K] (binary: [0, 0.5]) |
| Entropy | \(-\sum p_k \log_2 p_k\) | [0, \(\log_2 K\)] (binary: [0, 1]) |
| MSE (regression) | \(\frac{1}{n}\sum(y_i - \bar{y})^2\) | [0, ∞) |

Recursive splitting:

def build_tree(X, y, depth=0, max_depth=5):
    # Stop when the node is pure or the tree is deep enough
    if depth >= max_depth or len(np.unique(y)) == 1:
        return Leaf(majority_class(y))

    feature, threshold = find_best_split(X, y)
    mask = X[:, feature] <= threshold
    left = build_tree(X[mask], y[mask], depth + 1, max_depth)
    right = build_tree(X[~mask], y[~mask], depth + 1, max_depth)
    return Node(feature, threshold, left, right)

Neural Network from Scratch (impl_006_neural_network)

Best sources: - Karpathy: micrograd — MUST DO - Karpathy: nn-zero-to-hero - 3Blue1Brown: Neural Networks

Architecture:

Input → Linear → ReLU → Linear → Softmax → Output

Forward pass: \(z = Wx + b, \quad a = \text{ReLU}(z)\)

Backward pass (chain rule): \(\frac{\partial L}{\partial W} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial W} = \frac{\partial L}{\partial z} \cdot x^T\)
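
A minimal one-hidden-layer sketch of these two passes (softmax + cross-entropy gradient; shapes and names are illustrative):

import numpy as np

def forward(X, W1, b1, W2, b2):
    z1 = X @ W1 + b1
    a1 = np.maximum(z1, 0)                                  # ReLU
    z2 = a1 @ W2 + b2
    e = np.exp(z2 - z2.max(axis=1, keepdims=True))
    probs = e / e.sum(axis=1, keepdims=True)                # Softmax
    return z1, a1, probs

def backward(X, y_onehot, z1, a1, probs, W2):
    n = len(X)
    dz2 = (probs - y_onehot) / n                            # dL/dz for softmax + CE
    dW2, db2 = a1.T @ dz2, dz2.sum(axis=0)
    dz1 = (dz2 @ W2.T) * (z1 > 0)                           # chain rule through ReLU
    dW1, db1 = X.T @ dz1, dz1.sum(axis=0)
    return dW1, db1, dW2, db2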


2. ML Classic Practice (10 problems)

Linear Regression (classic_005_linear_regression)

Best sources: - ISLR Chapter 3 — Linear Regression - StatQuest: Linear Regression

OLS Closed-form: \(\hat{w} = (X^TX)^{-1}X^Ty\)

R-squared: \(R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}\)

Adjusted R-squared: \(R^2_{adj} = 1 - \frac{(1-R^2)(n-1)}{n-p-1}\)
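
A small sketch of the closed-form fit and R² (uses lstsq rather than an explicit inverse for numerical stability; names are illustrative):

import numpy as np

def ols_fit(X, y):
    # Prepend an intercept column and solve the least-squares problem
    Xb = np.hstack([np.ones((len(X), 1)), X])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def r_squared(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - ss_res / ss_tot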

Classification Metrics (classic_002_classification_metrics)

Best sources: - Google ML Crash Course - Scikit-learn: Metrics

Confusion Matrix:

|  | Predicted + | Predicted - |
|---|---|---|
| Actual + | TP | FN |
| Actual - | FP | TN |

Metrics: \(\text{Precision} = \frac{TP}{TP+FP}\)

\[\text{Recall} = \frac{TP}{TP+FN}\]
\[F1 = \frac{2 \cdot P \cdot R}{P + R}\]
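
A quick sketch computing these from a confusion matrix (assumes binary label arrays y_true and y_pred already exist; sklearn is used only to build the matrix):

from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(f"P={precision:.3f}  R={recall:.3f}  F1={f1:.3f}")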

Class Imbalance (classic_003_class_imbalance)

Best sources: - Imbalanced-learn documentation - SMOTE Paper

Strategies:

| Method | Description | Pros/Cons |
|---|---|---|
| Undersampling | Remove majority | Fast, loses data |
| Oversampling | Duplicate minority | Simple, overfitting |
| SMOTE | Synthetic samples | Better, complex |
| Class weights | Loss modification | Easy, no data change |
| Focal Loss | Hard example mining | Deep learning |

SMOTE algorithm:

import numpy as np

def smote(X_minority, k=5, n_samples=100):
    synthetic = []
    per_point = max(1, n_samples // len(X_minority))
    for x in X_minority:
        # k nearest minority-class neighbors of x (excluding x itself)
        dists = np.linalg.norm(X_minority - x, axis=1)
        neighbors = X_minority[np.argsort(dists)[1:k + 1]]
        for _ in range(per_point):
            neighbor = neighbors[np.random.randint(len(neighbors))]
            alpha = np.random.random()
            # New sample interpolated between x and a random neighbor
            synthetic.append(x + alpha * (neighbor - x))
    return np.array(synthetic)

Ranking Metrics (classic_001_ranking_metrics)

Best sources: - Ranking Metrics Explained - NDCG Wikipedia

Metrics:

| Metric | Formula | Use Case |
|---|---|---|
| MRR | \(1/\text{rank of first relevant}\) | Search |
| MAP | Mean of AP@k | Retrieval |
| NDCG | \(DCG/IDCG\) | Ranked lists |

NDCG: \(DCG@k = \sum_{i=1}^{k} \frac{2^{rel_i}-1}{\log_2(i+1)}\)

\[NDCG@k = \frac{DCG@k}{IDCG@k}\]
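
A minimal NDCG@k sketch following the gain formula above (relevances are graded labels in ranked order; function names are illustrative):

import numpy as np

def dcg_at_k(relevances, k):
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, len(rel) + 2))  # log2(i+1), i = 1..k
    return np.sum((2 ** rel - 1) / discounts)

def ndcg_at_k(relevances, k):
    idcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0

print(ndcg_at_k([3, 2, 3, 0, 1, 2], k=6))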

Feature Selection (feat_002_feature_selection)

Best sources: - Scikit-learn: Feature Selection

Methods:

| Type | Methods | When to use |
|---|---|---|
| Filter | Correlation, MI, chi2 | Fast baseline |
| Wrapper | RFE, forward/backward | Best quality |
| Embedded | L1, tree importance | Built-in |

RFE (Recursive Feature Elimination):

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rfe = RFE(estimator=LogisticRegression(), n_features_to_select=10)
X_selected = rfe.fit_transform(X, y)

Target Encoding (feat_001_target_encoding)

Best sources: - Target Encoding Explained - Category Encoders

Formula: \(\text{TE}(cat) = \frac{n \cdot \mu_{cat} + m \cdot \mu_{global}}{n + m}\)

where \(n\) = number of samples in the category, \(m\) = smoothing parameter.

Leave-One-Out:

def target_encoding_loo(df, col, target):
    global_mean = df[target].mean()
    # Each row's encoding uses the category mean computed without that row
    loo = df.groupby(col)[target].transform(
        lambda x: (x.sum() - x) / (len(x) - 1)
    )
    # Single-row categories have no other samples: fall back to the global mean
    return loo.fillna(global_mean)

Critical: Use CV to avoid leakage!

SVM (classic_007_svm)

Best sources: - StatQuest: SVM - ESL Chapter 12 — SVMs

Hard Margin: \(\min_{w,b} \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w^Tx_i + b) \geq 1\)

Soft Margin (slack variables): \(\min_{w,b,\xi} \frac{1}{2}\|w\|^2 + C\sum\xi_i\)

Kernel Trick: \(K(x, x') = \phi(x)^T\phi(x')\)

| Kernel | Formula |
|---|---|
| Linear | \(x^Tx'\) |
| Polynomial | \((\gamma x^Tx' + r)^d\) |
| RBF | \(\exp(-\gamma\|x-x'\|^2)\) |
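
A short sklearn sketch comparing kernels (assumes X_train/X_test, y_train/y_test already exist; scaling matters because RBF distances are scale-sensitive):

from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# C = soft-margin penalty, gamma = RBF kernel width
svm_rbf = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0, gamma='scale'))
svm_poly = make_pipeline(StandardScaler(), SVC(kernel='poly', degree=3, coef0=1.0))

for name, model in [('rbf', svm_rbf), ('poly', svm_poly)]:
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))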

GBDT Internals (classic_008_gbdt_internals)

Best sources: - Gradient Boosting from Scratch - XGBoost Documentation

Key Idea: Fit trees to residuals (negative gradient).

Algorithm:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, n_trees, lr):
    F = np.full(len(y), np.mean(y))  # Initial prediction
    trees = []
    for _ in range(n_trees):
        residuals = y - F  # Negative gradient for MSE
        tree = DecisionTreeRegressor(max_depth=3)
        tree.fit(X, residuals)
        F += lr * tree.predict(X)
        trees.append(tree)
    return trees, F

XGBoost vs LightGBM vs CatBoost:

| Feature | XGBoost | LightGBM | CatBoost |
|---|---|---|---|
| Tree growth | Level-wise | Leaf-wise | Symmetric |
| Categorical | One-hot needed | Native | Native (best) |
| Speed | Good | Fastest | Good |
| Tuning | Medium | Medium | Easy |

3. Ensemble Theory Deep Dive

Bias-Variance Decomposition

Total Error: \(\text{MSE} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}\)

For an ensemble of \(B\) models: \(\text{Var}(\bar{f}) = \rho \sigma^2 + \frac{1-\rho}{B}\sigma^2\)

where \(\rho\) = correlation between the models' predictions, \(\sigma^2\) = variance of an individual model.

Key Insight: Increasing \(B\) shrinks only the second term. Diversity (\(1-\rho\)) is critical!
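
A quick numerical check of this formula (simulated predictions with pairwise correlation ρ; purely illustrative):

import numpy as np

rng = np.random.default_rng(0)
B, sigma, rho = 25, 1.0, 0.6
# Covariance matrix: sigma^2 on the diagonal, rho*sigma^2 off the diagonal
cov = rho * sigma**2 * np.ones((B, B)) + (1 - rho) * sigma**2 * np.eye(B)
preds = rng.multivariate_normal(np.zeros(B), cov, size=100_000)

print(preds.mean(axis=1).var())                    # empirical variance of the ensemble mean
print(rho * sigma**2 + (1 - rho) / B * sigma**2)   # rho*sigma^2 + (1-rho)*sigma^2/B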

Bagging (Bootstrap Aggregating)

Mathematical Formulation: \(\hat{f}_{bag}(x) = \frac{1}{B}\sum_{b=1}^{B}\hat{f}_b(x)\)

Variance Reduction: For \(B\) uncorrelated predictors: \(\text{Var}(\hat{f}_{bag}) = \frac{\text{Var}(\hat{f})}{B}\)

For correlated predictors (\(\rho > 0\)): \(\text{Var}(\hat{f}_{bag}) = \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2\)

Random Forest Innovation: Random feature selection at each split decorrelates trees: \(\text{features\_per\_split} \approx \sqrt{p} \text{ (classification)}, \frac{p}{3} \text{ (regression)}\)

Boosting

Key Idea: Sequential learning from mistakes (fit residuals).

Gradient Boosting as Gradient Descent in Function Space: \(F_m(x) = F_{m-1}(x) + \eta \cdot h_m(x)\)

where \(h_m\) is fit to the negative gradient: \(r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F=F_{m-1}}\)

AdaBoost Sample Reweighting: \(w_i^{(m+1)} = w_i^{(m)} \cdot \exp(\alpha_m \cdot \mathbb{I}[y_i \neq \hat{y}_i^{(m)}])\)

where \(\alpha_m = \frac{1}{2}\ln\frac{1-\epsilon_m}{\epsilon_m}\)

XGBoost Regularization: \(\mathcal{L} = \sum_i L(y_i, \hat{y}_i) + \sum_k \Omega(f_k)\)

\[\Omega(f) = \gamma T + \frac{1}{2}\lambda\|w\|^2\]

where \(T\) = number of leaves, \(w\) = leaf weights.

Bagging vs Boosting Comparison

| Aspect | Bagging (RF) | Boosting (GBDT) |
|---|---|---|
| Training | Parallel | Sequential |
| Bias | Same as single tree | Reduces with iterations |
| Variance | Reduces significantly | Can increase |
| Overfitting | Resistant | Prone (need early stopping) |
| Outliers | Robust | Sensitive |
| Trees | Deep (fully grown) | Shallow (weak learners) |
| Interpretability | Feature importance | Feature importance + SHAP |
| Best for | High variance | High bias |

Why More Trees Can Hurt Boosting

Random Forest: \(\lim_{B \to \infty} \text{Var}(\text{RF}) = \rho\sigma^2\)

Converges to stable error. More trees = better (diminishing returns).

Gradient Boosting: \(F_M(x) = \sum_{m=1}^{M} \eta \cdot h_m(x)\)

Each iteration fits residuals. Eventually fits noise → overfitting.

Solution: Early stopping, regularization, small learning rate (\(\eta \in [0.01, 0.3]\)).

Stacking (Stacked Generalization)

Architecture:

Level 0: [Model1, Model2, ..., ModelK]
    ↓ predictions on holdout
Level 1: Meta-model (usually Linear/Logistic)
Final prediction

Mathematical Formulation: \(\hat{y} = g(\hat{f}_1(x), \hat{f}_2(x), ..., \hat{f}_K(x))\)

Critical: Use out-of-fold predictions for meta-features to avoid leakage!

import numpy as np
from sklearn.model_selection import KFold

def stacking_fit(base_models, meta_model, X, y, n_folds=5):
    kf = KFold(n_splits=n_folds, shuffle=True)
    meta_features = np.zeros((len(X), len(base_models)))

    for i, model in enumerate(base_models):
        # Out-of-fold predictions become meta-features (avoids leakage)
        for train_idx, val_idx in kf.split(X):
            model.fit(X[train_idx], y[train_idx])
            meta_features[val_idx, i] = model.predict(X[val_idx])

    meta_model.fit(meta_features, y)
    return meta_model

Ensemble Diversity Metrics

Q-statistic (for classifiers): \(Q_{ij} = \frac{N^{11}N^{00} - N^{01}N^{10}}{N^{11}N^{00} + N^{01}N^{10}}\)

\(Q \in [-1, 1]\), where \(Q = 0\) means independent.

Double Fault Measure: \(DF_{ij} = \frac{N^{00}}{N}\)

Fraction of samples both classifiers get wrong. Lower = more diverse.

Python Implementation

# Random Forest with OOB evaluation
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=200,
    max_features='sqrt',
    oob_score=True,  # Out-of-bag evaluation
    n_jobs=-1
)
rf.fit(X_train, y_train)
print(f"OOB Score: {rf.oob_score_:.3f}")

# Gradient Boosting with early stopping
from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(
    n_estimators=1000,
    learning_rate=0.05,
    max_depth=4,
    validation_fraction=0.1,
    n_iter_no_change=10,  # Early stopping
    tol=1e-4
)
gb.fit(X_train, y_train)
print(f"Actual iterations: {gb.n_estimators_}")

# XGBoost with full regularization
import xgboost as xgb

# DMatrix is XGBoost's own data format; built here from the same train/val split as above
dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)

params = {
    'objective': 'binary:logistic',
    'learning_rate': 0.05,
    'max_depth': 6,
    'reg_alpha': 0.1,   # L1 regularization
    'reg_lambda': 1.0,  # L2 regularization
    'subsample': 0.8,
    'colsample_bytree': 0.8,
}
model = xgb.train(params, dtrain, num_boost_round=500,
                  evals=[(dval, 'val')], early_stopping_rounds=20)

# CatBoost with native categorical handling
from catboost import CatBoostClassifier

cat = CatBoostClassifier(
    iterations=1000,
    learning_rate=0.05,
    depth=6,
    cat_features=categorical_cols,  # Native handling!
    verbose=False
)
cat.fit(X_train, y_train, eval_set=(X_val, y_val),
        early_stopping_rounds=50)

Interview Questions

Q1: Why does Random Forest not overfit with more trees but Gradient Boosting can?

A: RF trees are independent and parallel. As \(B \to \infty\), variance converges to \(\rho\sigma^2\) (law of large numbers). Boosting trees are sequential - each fits residuals from previous. Eventually fits noise, not signal. Solution: early stopping, regularization.

Q2: When would you prefer Bagging over Boosting?

A: - High variance problem (complex models, small data) → Bagging - High bias problem (simple models, large data) → Boosting - Noisy data with outliers → Bagging (robust) - Need interpretable feature importance → Both OK - Limited compute for training → Bagging (parallelizable)

Q3: Explain the learning rate in Gradient Boosting.

A: Learning rate \(\eta \in (0, 1]\) scales each tree's contribution: \(F_m = F_{m-1} + \eta \cdot h_m\)

Small \(\eta\) requires more trees but: - Better generalization (smoother updates) - More robust to overfitting - Often \(\eta \in [0.01, 0.3]\) works best

Trade-off: \(\eta \times n\_trees \approx \text{constant}\) for similar performance.

Q4: What is the role of max_features in Random Forest?

A: Controls decorrelation between trees: - sqrt(p) for classification (default) - p/3 for regression - Smaller → more diverse trees, higher bias, lower variance - Larger → more correlated trees, lower bias, higher variance

Q5: How does CatBoost handle categorical features differently?

A: - No one-hot encoding needed (memory efficient) - Uses ordered target statistics: each sample's encoding uses only previous samples - Prevents target leakage - Handles high-cardinality categories naturally - Often outperforms manually encoded features


4. Online Learning for Streaming Data

Online vs Batch Learning

| Aspect | Batch Learning | Online Learning |
|---|---|---|
| Data Access | Full dataset available | Sequential, one at a time |
| Memory Usage | Proportional to data size | Constant (model size only) |
| Model Updates | Periodic retraining | Continuous incremental updates |
| Concept Drift | Model becomes stale | Adapts automatically |
| Latency | Hours to days for retraining | Milliseconds per update |
| Convergence | Global optimization possible | Sequential optimization with regret bounds |

Online Gradient Descent (OGD)

Update Rule: \(w_{t+1} = w_t - \eta_t \nabla L(w_t, x_t, y_t)\)

where \(\eta_t\) = learning rate at time \(t\), \(L\) = loss function.

Learning Rate Schedules: - Fixed: \(\eta_t = \eta_0\) — works for stationary distributions - Decreasing: \(\eta_t = \eta_0 / \sqrt{t}\) — theoretical convergence guarantees - Adaptive: Increase when model performs poorly (track concept drift)
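
A minimal OGD sketch for logistic loss with a 1/√t schedule (the stream is any iterable of (x, y) pairs with y in {0, 1}; names are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ogd_logistic(stream, n_features, eta0=0.1):
    w = np.zeros(n_features)
    for t, (x, y) in enumerate(stream, start=1):
        eta = eta0 / np.sqrt(t)                 # decreasing learning rate
        grad = (sigmoid(w @ x) - y) * x         # per-sample gradient of the log loss
        w -= eta * grad
    return w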

The Regret Framework

Online learning performance measured through regret — difference between cumulative loss and best fixed model in hindsight:

\[\text{Regret}(T) = \sum_{t=1}^{T} \ell_t(w_t) - \min_w \sum_{t=1}^{T} \ell_t(w)\]

Goal: Sublinear regret growth (\(O(\sqrt{T})\) or \(O(\log T)\)) → average per-step regret → 0.

Follow-The-Regularized-Leader (FTRL)

Update Rule: \(w_{t+1} = \arg\min_w \left[ \sum_{i=1}^{t} L(w, x_i, y_i) + R(w) \right]\)

where \(R(w)\) = regularization (L1 for sparsity, L2 for smoothness).

FTRL-Proximal (Google production): - Per-coordinate learning rates adapted to gradient history - L1 penalty drives weights to zero (sparse models) - Crucial for high-dimensional features (millions/billions)

Perceptron and Passive-Aggressive Algorithms

Perceptron Update (if misclassified): \(w_{t+1} = w_t + y_t x_t\)

Passive-Aggressive (PA): \(w_{t+1} = w_t + \tau_t y_t x_t, \quad \tau_t = \frac{\ell_t}{\|x_t\|^2}\)

Makes minimal update to achieve correct classification with margin.

Concept Drift Handling

Types of Concept Drift:

| Type | Description | Example |
|------|-------------|---------|
| Sudden | Abrupt distribution change | System failure, policy change |
| Gradual | Slow evolution over time | User preference shifts |
| Incremental | Steady directional change | Market trends |
| Recurring | Cyclical patterns | Seasonal effects |

Drift Detection Methods: - ADWIN (Adaptive Windowing): Statistical test for distribution change - DDM (Drift Detection Method): Monitor error rate changes - EDDM (Early DDM): Detect drift earlier using distance between errors - Page-Hinkley Test: Detect mean shifts in stream

Adaptation Strategies:
1. Sliding Window: Keep only recent \(W\) samples
2. Exponential Decay: Weight recent samples more: \(w_t = \alpha^t\)
3. Ensemble Reset: Replace worst performing model when drift detected

Python Implementation

from sklearn.linear_model import SGDClassifier, PassiveAggressiveClassifier
from sklearn.naive_bayes import MultinomialNB
import numpy as np

# Online SGD Classifier
clf = SGDClassifier(loss='log_loss', learning_rate='optimal', eta0=0.01)

# Simulating streaming data
def online_learning_stream(clf, data_stream, classes):
    for X_batch, y_batch in data_stream:
        # Partial fit updates model incrementally
        clf.partial_fit(X_batch, y_batch, classes=classes)
        # Model is ready immediately after each batch
    return clf

# Example with chunks
classes = np.array([0, 1])
for chunk_idx in range(10):
    X_chunk, y_chunk = get_next_chunk(chunk_idx)
    clf.partial_fit(X_chunk, y_chunk, classes=classes)

# Passive-Aggressive Classifier (good for text)
pa_clf = PassiveAggressiveClassifier(C=1.0, max_iter=1000)
pa_clf.partial_fit(X_first, y_first, classes=classes)

# River library for advanced streaming ML
from river import linear_model, optim, preprocessing

model = (
    preprocessing.StandardScaler() |
    linear_model.LogisticRegression(optimizer=optim.SGD(0.01))
)

for x_dict, y in data_stream:
    # Learn from one sample
    y_pred = model.predict_one(x_dict)
    model.learn_one(x_dict, y)

Hoeffding Trees (VFDT)

Key Idea: Use Hoeffding bound to decide when to split with statistical confidence:

\[\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}\]

If best split - second best split > \(\epsilon\), make the split.

Advantages: - Single pass through data - Constant memory - Theoretical guarantees

from river import forest, tree

# Hoeffding Adaptive Tree
hat = tree.HoeffdingAdaptiveTreeClassifier()

for x_dict, y in data_stream:
    y_pred = hat.predict_one(x_dict)
    hat.learn_one(x_dict, y)

# Adaptive Random Forest
arf = forest.ARFClassifier(n_models=10)
for x_dict, y in data_stream:
    arf.learn_one(x_dict, y)

Interview Questions

Q1: When would you choose online learning over batch learning?

A: Use online learning when: - Data volume exceeds memory (can't store all) - Real-time adaptation needed (fraud detection, recommendations) - Concept drift expected (user preferences, market conditions) - Latency requirements demand immediate updates - Data arrives continuously (IoT sensors, clickstreams)

Q2: Explain the regret framework in online learning.

A: Regret = difference between cumulative loss of online algorithm vs. best fixed model in hindsight: \(\text{Regret}(T) = \sum_{t=1}^{T} \ell_t(w_t) - \min_w \sum_{t=1}^{T} \ell_t(w)\)

Goal: sublinear regret (\(O(\sqrt{T})\)) → average regret → 0. Guarantees convergence despite sequential, myopic updates.

Q3: How does FTRL-Proximal achieve sparsity?

A: FTRL-Proximal adds L1 regularization to the FTRL objective: \(w_{t+1} = \arg\min_w [\sum L(w) + \lambda_1 \|w\|_1 + \lambda_2 \|w\|_2^2]\)

L1 penalty drives many weights to exactly zero. Combined with per-coordinate learning rates (features with consistent gradients get smaller rates), this produces sparse models suitable for high-dimensional problems (millions of features).

Q4: What is concept drift and how do you handle it?

A: Concept drift = change in data distribution over time. Types: sudden, gradual, incremental, recurring. Handling strategies: - Detection: ADWIN, DDM, EDDM, Page-Hinkley tests - Adaptation: sliding windows, exponential decay, ensemble reset - Models: Hoeffding Adaptive Trees, Adaptive Random Forest

Q5: Compare SGD partial_fit vs standard fit.

A: - fit(): Trains on entire dataset, overwrites previous state - partial_fit(): Incremental update, preserves learned weights

# Batch (full retrain)
clf.fit(X_all, y_all)

# Online (incremental)
for X_chunk, y_chunk in stream:
    clf.partial_fit(X_chunk, y_chunk, classes=[0,1])

5. Multi-Label Classification

Multi-Label vs Single-Label

| Aspect | Single-Label | Multi-Label |
|---|---|---|
| Output | One category | Multiple categories |
| Example | Email → Spam OR Not Spam | Article → Politics AND Economics AND Tech |
| Output Space | \(K\) classes | \(2^K\) possible combinations |
| Complexity | Linear in \(K\) | Exponential in \(K\) |

Problem Transformation Methods

1. Binary Relevance (BR)

Idea: Train \(L\) independent binary classifiers (one per label).

\[\hat{y}_\ell = f_\ell(x) \quad \text{for } \ell = 1, ..., L\]

Pros: Simple, parallelizable, no label dependency assumptions Cons: Ignores label correlations

from sklearn.multioutput import MultiOutputClassifier
from sklearn.linear_model import LogisticRegression

br = MultiOutputClassifier(LogisticRegression(max_iter=1000))
br.fit(X_train, y_train)  # y_train shape: (n_samples, n_labels)
y_pred = br.predict(X_test)

2. Classifier Chains (CC)

Idea: Chain classifiers, each uses previous predictions as features.

\[\hat{y}_\ell = f_\ell(x, \hat{y}_1, ..., \hat{y}_{\ell-1})\]

Pros: Captures label dependencies Cons: Order-dependent, error propagation

from sklearn.multioutput import ClassifierChain
from sklearn.ensemble import RandomForestClassifier

cc = ClassifierChain(
    RandomForestClassifier(n_estimators=100),
    order='random',  # or specify order
    random_state=42
)
cc.fit(X_train, y_train)
y_pred = cc.predict(X_test)

3. Label Powerset (LP)

Idea: Treat each unique label combination as a single class.

Pros: Captures all label correlations Cons: Exponential classes, severe class imbalance

from skmultilearn.problem_transform import LabelPowerset
from sklearn.ensemble import RandomForestClassifier

lp = LabelPowerset(RandomForestClassifier())
lp.fit(X_train, y_train)

Comparison Table

| Method | Label Dependencies | Complexity | Parallelizable | Best For |
|---|---|---|---|---|
| Binary Relevance | None | \(O(L)\) | Yes | Independent labels |
| Classifier Chains | Sequential | \(O(L)\) | No | Sequential dependencies |
| Label Powerset | All | \(O(2^L)\) | Yes | Few label combinations |

Multi-Label Evaluation Metrics

Hamming Loss

Fraction of incorrect label predictions: \(\text{Hamming Loss} = \frac{1}{N \cdot L} \sum_{i=1}^{N} \sum_{\ell=1}^{L} \mathbb{I}[y_{i\ell} \neq \hat{y}_{i\ell}]\)

Lower is better (0 = perfect).

Jaccard Score (Intersection over Union)

\[\text{Jaccard}_i = \frac{|Y_i \cap \hat{Y}_i|}{|Y_i \cup \hat{Y}_i|}\]

Average over samples: \(\frac{1}{N} \sum_i \text{Jaccard}_i\)

Higher is better (1 = perfect).

Subset Accuracy (Exact Match)

\[\text{Subset Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}[Y_i = \hat{Y}_i]\]

Very strict — requires all labels correct.

from sklearn.metrics import hamming_loss, jaccard_score, f1_score

# Evaluation
hl = hamming_loss(y_true, y_pred)              # Lower is better
js = jaccard_score(y_true, y_pred, average='samples')  # Higher is better
f1_macro = f1_score(y_true, y_pred, average='macro')
f1_micro = f1_score(y_true, y_pred, average='micro')

Algorithm Adaptation Methods

MLkNN (Multi-Label k-Nearest Neighbors)

Uses label frequencies among k nearest neighbors: \(\hat{y}_\ell = \mathbb{I}\left[\frac{\sum_{j \in N_k(x)} \mathbb{I}[y_{j\ell}=1]}{k} > \text{threshold}\right]\)

from skmultilearn.adapt import MLkNN

mlknn = MLkNN(k=10)
mlknn.fit(X_train, y_train)

Python Implementation

import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier, ClassifierChain
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import hamming_loss, jaccard_score, f1_score

# Generate multi-label data
X, y = make_multilabel_classification(
    n_samples=1000,
    n_features=20,
    n_classes=5,
    n_labels=2,  # avg labels per instance
    random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Binary Relevance
br = MultiOutputClassifier(LogisticRegression(max_iter=1000))
br.fit(X_train, y_train)
y_pred_br = br.predict(X_test)

# Classifier Chain
cc = ClassifierChain(RandomForestClassifier(n_estimators=100))
cc.fit(X_train, y_train)
y_pred_cc = cc.predict(X_test)

# Evaluation function
def evaluate_multilabel(y_true, y_pred, name):
    return {
        'Method': name,
        'Hamming Loss': hamming_loss(y_true, y_pred),
        'Jaccard': jaccard_score(y_true, y_pred, average='samples'),
        'F1 Micro': f1_score(y_true, y_pred, average='micro'),
        'F1 Macro': f1_score(y_true, y_pred, average='macro')
    }

print(evaluate_multilabel(y_test, y_pred_br, 'Binary Relevance'))
print(evaluate_multilabel(y_test, y_pred_cc, 'Classifier Chain'))

Interview Questions

Q1: When would you choose Binary Relevance vs Classifier Chains?

A: - Binary Relevance: Labels are independent, need parallelization, simple baseline - Classifier Chains: Labels are correlated (e.g., "Python" → "Programming"), sequential dependencies matter

Q2: Why is Label Powerset rarely used despite capturing all correlations?

A: Two main issues: 1. Exponential classes: With \(L\) labels, up to \(2^L\) combinations. Many are rare or unseen. 2. Class imbalance: Most combinations have few samples, some have none.

Better for small \(L\) where most combinations appear in training.

Q3: Explain Hamming Loss vs Subset Accuracy.

A: - Hamming Loss: Fraction of wrong labels (per-label). Allows partial credit. Range: [0, 1], lower is better. - Subset Accuracy: Exact match required (per-instance). No partial credit. Very strict.

Example: True = [1,1,0,0], Pred = [1,0,0,0] - Hamming Loss = ¼ = 0.25 (one wrong label) - Subset Accuracy = 0 (not exact match)

Q4: Design a system for multi-label document classification with 1000 labels.

A: 1. Feature extraction: TF-IDF or embeddings (BERT, sentence-transformers) 2. Model choice: - If label correlations weak: Binary Relevance with Linear SVM (fast, scalable) - If correlations matter: Classifier Chains (ensemble of chains for robustness) - For deep learning: Binary cross-entropy with sigmoid output 3. Handling imbalance: Per-label class weights, focal loss 4. Evaluation: Hamming Loss (primary), Jaccard Score, per-label F1 5. Production: Model compression, label pruning (remove rare labels)

Q5: How do you handle extreme label imbalance in multi-label settings?

A: 1. Per-label weighting: Higher weight for rare labels in loss 2. Focal Loss: Focus on hard examples 3. SMOTE variants: Oversample minority label combinations 4. Negative sampling: For very many labels, sample negative labels 5. Label hierarchy: Group similar labels, predict at group level first


6. Semi-Supervised Learning

Core Assumptions

SSL relies on key assumptions about data structure:

| Assumption | Description | When It Breaks |
|---|---|---|
| Smoothness | Similar inputs → similar outputs | Noisy data, outliers |
| Cluster | Data forms groups, boundaries avoid clusters | High-dimensional, sparse data |
| Manifold | Data lies on low-dimensional manifold | Manifold crossing, holes |
| Low-density | Decision boundaries pass through low-density regions | Class overlap |

SSL Methods Taxonomy

1. Wrapper Methods (Self-Training)

Idea: Train on labeled data → predict unlabeled → add high-confidence predictions → repeat.

from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.ensemble import RandomForestClassifier

import numpy as np

# Stack labeled and unlabeled rows; unlabeled rows get the label -1
X_combined = np.vstack([X_labeled, X_unlabeled])
y_semi = np.concatenate([y_labeled, -np.ones(len(X_unlabeled), dtype=int)])

base_clf = RandomForestClassifier(n_estimators=100)
self_training = SelfTrainingClassifier(base_clf, threshold=0.9)
self_training.fit(X_combined, y_semi)

Risk: Confirmation bias — errors reinforce themselves.

2. Consistency Regularization

Key Idea: Model should give similar predictions for perturbed inputs.

Loss: \(\mathcal{L} = \mathcal{L}_{sup} + \lambda \cdot \mathbb{E}_{x, x'}[\|f(x) - f(x')\|^2]\)

where \(x'\) is augmented version of \(x\).

FixMatch Algorithm:
1. Generate weak and strong augmentations
2. If weak prediction confident (>\(\tau\)), use as pseudo-label
3. Train on strong augmentation with pseudo-label

# FixMatch pseudo-code
def fixmatch_loss(model, x_l, y_l, x_u, tau=0.95, lambda_u=1.0):
    # Supervised loss on the labeled batch
    loss_sup = CE(model(x_l), y_l)

    # Unlabeled: weak and strong augmentation
    x_u_weak = weak_augment(x_u)
    x_u_strong = strong_augment(x_u)

    probs = softmax(model(x_u_weak), dim=1)
    max_probs, pseudo_labels = probs.max(dim=1)   # hard pseudo-labels
    mask = (max_probs > tau).float()              # keep only confident predictions

    loss_unsup = (CE(model(x_u_strong), pseudo_labels, reduction='none') * mask).mean()

    return loss_sup + lambda_u * loss_unsup

3. Mean Teacher

Idea: Teacher (EMA of student) generates targets for student.

\[\theta_{teacher} = \alpha \theta_{teacher} + (1-\alpha) \theta_{student}\]

# Mean Teacher update
def update_teacher(teacher, student, alpha=0.99):
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.data = alpha * t_param.data + (1 - alpha) * s_param.data

4. MixMatch

Combines multiple techniques:
1. Augmentation: Multiple augmentations per sample
2. Pseudo-labeling: Average predictions → sharpen
3. MixUp: Interpolate labeled and unlabeled

\[\mathcal{L} = \mathcal{L}_{CE}(X, Y) + \lambda_u \mathcal{L}_{MSE}(U, \hat{U}) + \lambda_{mix} \mathcal{L}_{MixUp}\]

Comparison Table

Method Key Idea Pros Cons
Self-Training Pseudo-label high-confidence Simple, any model Confirmation bias
FixMatch Strong/weak aug + threshold SOTA on CIFAR Needs good augmentations
Mean Teacher EMA teacher targets Stable, no threshold Architecture-specific
MixMatch Mix multiple techniques Combines strengths Many hyperparameters
S3VM Adjust margin for unlabeled Theoretical Doesn't scale

Production Considerations

Confirmation Bias Mitigation: - Ensemble disagreement - Uncertainty-aware filtering - Teacher-student with EMA - Human review of low-confidence predictions

Distribution Mismatch: - Monitor drift between labeled and unlabeled distributions - Use statistical tests (KL divergence, PSI) - Filter out-of-distribution unlabeled data

Loss Balancing: \(\mathcal{L}_{total} = \mathcal{L}_{sup} + \lambda(t) \cdot \mathcal{L}_{unsup}\)

Ramp up \(\lambda(t)\) gradually: \(\lambda(t) = \lambda_{max} \cdot e^{-5(1-t/T)^2}\)

Python Implementation

import torch
import torch.nn.functional as F
from sklearn.semi_supervised import LabelSpreading

# Sklearn: Label Propagation
lp = LabelSpreading(kernel='knn', n_neighbors=7)
lp.fit(X_combined, y_semi)  # -1 for unlabeled
y_pred = lp.predict(X_test)

# PyTorch: FixMatch-style training
def train_ssl_epoch(model, optimizer, labeled_loader, unlabeled_loader, tau=0.95):
    model.train()
    for (x_l, y_l), x_u in zip(labeled_loader, unlabeled_loader):

        # Supervised loss
        logits_l = model(x_l)
        loss_sup = F.cross_entropy(logits_l, y_l)

        # Pseudo-labeling with threshold
        with torch.no_grad():
            logits_u = model(x_u)
            probs_u = F.softmax(logits_u, dim=1)
            max_probs, pseudo_labels = probs_u.max(dim=1)
            mask = (max_probs >= tau).float()

        # Unsupervised loss (only confident predictions)
        logits_u_strong = model(strong_augment(x_u))
        loss_unsup = (F.cross_entropy(logits_u_strong, pseudo_labels, reduction='none') * mask).mean()

        loss = loss_sup + 1.0 * loss_unsup
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Interview Questions

Q1: When does SSL help vs. hurt?

A: - Helps: Small labeled set (5-10%), unlabeled matches distribution, smooth/clustered data - Hurts: Distribution mismatch, noisy unlabeled data, class overlap, confirmation bias

Always validate on held-out labeled data.

Q2: Explain the confirmation bias problem in self-training.

A: Early errors in pseudo-labels get reinforced across epochs. A wrong high-confidence prediction becomes training data for future iterations, making the model more confident in the wrong direction.

Solutions: ensemble disagreement, uncertainty filtering, Mean Teacher (stable EMA targets), human review.

Q3: Compare FixMatch vs Mean Teacher.

A: - FixMatch: Uses confidence threshold + strong augmentations. If weak-aug prediction >\(\tau\), train on strong-aug version. Threshold-based filtering. - Mean Teacher: Teacher (EMA of student) provides soft targets. No threshold needed. Consistency loss between student and teacher predictions.

Both are SOTA, FixMatch often simpler to implement.

Q4: How much labeled data do you need for SSL to work?

A: Rule of thumb: 5-10% of full dataset often sufficient. But depends on: - Task complexity (simple → less labels needed) - Data quality (clean → less labels needed) - Model capacity (larger models need more data)

Start small, validate often, expand labeled set if gains plateau.

Q5: How do you detect distribution drift between labeled and unlabeled data?

A: 1. Feature statistics: Compare mean/variance of features 2. KL divergence: Compare predicted class distributions 3. PSI (Population Stability Index): Feature-level drift 4. kNN distance: Average distance to labeled neighbors 5. Domain classifier: Train classifier to distinguish labeled vs unlabeled

Flag samples with high drift scores for human review.


7. Active Learning

Active Learning Loop

Active learning is a special case of supervised ML where the model chooses which data to label.

Loop:
1. Initial Training: Train on small labeled set
2. Inference: Predict on unlabeled pool
3. Query: Select most informative samples (acquisition function)
4. Label: Human annotates selected samples
5. Augment & Retrain: Add to training set, repeat

Query Strategies

1. Uncertainty Sampling

Select samples where model is least confident.

Least Confidence: \(x^* = \arg\max_x \left(1 - P_\theta(\hat{y}|x)\right)\)

where \(\hat{y} = \arg\max_y P_\theta(y|x)\)

Margin Sampling: \(x^* = \arg\min_x \left(P_\theta(\hat{y}_1|x) - P_\theta(\hat{y}_2|x)\right)\)

where \(\hat{y}_1, \hat{y}_2\) are top-2 predictions.

Entropy: \(x^* = \arg\max_x H(P_\theta(y|x)) = -\sum_y P_\theta(y|x) \log P_\theta(y|x)\)

2. Query-by-Committee (QBC)

Train ensemble of models, select samples with highest disagreement.

Vote Entropy: \(x^* = \arg\max_x \left(-\sum_y \frac{V(y)}{C} \log \frac{V(y)}{C}\right)\)

where \(V(y)\) = votes for class \(y\), \(C\) = committee size.

KL Divergence: \(x^* = \arg\max_x \frac{1}{C}\sum_{c=1}^{C} D_{KL}(P_{\theta_c} \| P_{avg})\)
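
A small vote-entropy sketch for a committee of already-fitted classifiers (higher entropy = more disagreement; names are illustrative):

import numpy as np

def vote_entropy(committee, X_pool):
    votes = np.array([clf.predict(X_pool) for clf in committee])  # shape (C, n_samples)
    entropy = np.zeros(votes.shape[1])
    for label in np.unique(votes):
        p = (votes == label).mean(axis=0)        # vote fraction V(y)/C per sample
        entropy -= p * np.log(p + 1e-12)
    return entropy

# Query the most-disputed pool samples
# query_idx = np.argsort(vote_entropy(committee, X_pool))[-10:]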

3. Expected Model Change

Select samples that would cause largest gradient change: \(x^* = \arg\max_x \|\nabla_\theta \mathcal{L}(\theta; x, \hat{y})\|\)

4. Diversity Sampling

Avoid selecting similar samples. Use clustering or embeddings.

import numpy as np
from sklearn.cluster import KMeans

# Diversity sampling via clustering
def diversity_sampling(X_unlabeled, n_samples, embeddings):
    kmeans = KMeans(n_clusters=n_samples)
    kmeans.fit(embeddings)
    # Select sample closest to each centroid
    selected = []
    for i in range(n_samples):
        cluster_samples = np.where(kmeans.labels_ == i)[0]
        centroid = kmeans.cluster_centers_[i]
        distances = np.linalg.norm(embeddings[cluster_samples] - centroid, axis=1)
        selected.append(cluster_samples[np.argmin(distances)])
    return selected

Comparison Table

| Strategy | Formula | Best For | Pitfalls |
|---|---|---|---|
| Least Confidence | \(\max(1-P(\hat{y}))\) | Binary/multiclass | Ignores runner-up |
| Margin | \(\min(P(\hat{y}_1)-P(\hat{y}_2))\) | Multiclass | Needs calibrated probs |
| Entropy | \(\max H(P(y\mid x))\) | Many classes | Can select outliers |
| QBC | Max vote entropy | Model ensembles | Computation heavy |
| Diversity | Clustering | Large pools | May miss rare cases |

Python Implementation

import numpy as np
from sklearn.base import clone

class ActiveLearner:
    def __init__(self, estimator, strategy='entropy'):
        self.estimator = estimator
        self.strategy = strategy

    def query(self, X_pool, n_samples=10):
        """Select most informative samples from pool."""
        probs = self.estimator.predict_proba(X_pool)

        if self.strategy == 'least_confident':
            scores = 1 - np.max(probs, axis=1)
        elif self.strategy == 'margin':
            sorted_probs = np.sort(probs, axis=1)[:, ::-1]
            scores = sorted_probs[:, 0] - sorted_probs[:, 1]
            scores = -scores  # Lower margin = higher score
        elif self.strategy == 'entropy':
            scores = -np.sum(probs * np.log(probs + 1e-10), axis=1)

        return np.argsort(scores)[-n_samples:]

    def teach(self, X, y):
        """Retrain with new labeled samples."""
        self.estimator.fit(X, y)

# Active learning loop
def active_learning_loop(estimator, X_labeled, y_labeled, X_pool,
                          oracle, n_iterations=10, n_per_iter=10):
    learner = ActiveLearner(estimator, strategy='entropy')
    learner.teach(X_labeled, y_labeled)

    for i in range(n_iterations):
        # Query most informative samples
        query_idx = learner.query(X_pool, n_samples=n_per_iter)

        # Get labels from oracle (human annotator)
        X_query = X_pool[query_idx]
        y_query = oracle(X_query)  # Human labels these

        # Add to labeled set
        X_labeled = np.vstack([X_labeled, X_query])
        y_labeled = np.concatenate([y_labeled, y_query])

        # Remove from pool
        mask = np.ones(len(X_pool), dtype=bool)
        mask[query_idx] = False
        X_pool = X_pool[mask]

        # Retrain
        learner.teach(X_labeled, y_labeled)

        print(f"Iteration {i+1}: {len(X_labeled)} labeled samples")

    return learner

# Using modAL library (specialized for active learning)
from modAL.models import ActiveLearner as ModALLearner
from modAL.uncertainty import entropy_sampling

learner = ModALLearner(
    estimator=RandomForestClassifier(),
    query_strategy=entropy_sampling
)
learner.fit(X_initial, y_initial)

query_idx, query_inst = learner.query(X_pool, n_instances=10)
learner.teach(X_pool[query_idx], y_new_labels)

When Active Learning Fails

Red Flags: - Biased queries: Model selects from narrow region, ignores others - Annotation drift: Annotator fatigue → lower quality labels over time - Cold start problem: Initial model too weak → bad query selection - Outlier focus: Uncertainty sampling picks noise/outliers

Mitigation: - Mix uncertainty with diversity sampling - Regular validation on held-out test set - Start with diversity-based selection for cold start - Filter outliers before querying

Interview Questions

Q1: When would you use active learning vs. labeling everything?

A: Use AL when: - Labeling is expensive (medical imaging, legal docs) - Task has many "easy" samples model already handles - Budget/time is limited - Model already has reasonable baseline

Don't use AL when: - Task is novel (no good initial model) - Every sample is valuable - You need to guarantee coverage - Annotation is cheap

Q2: Compare uncertainty sampling vs query-by-committee.

A: - Uncertainty Sampling: Single model, use confidence/probability. Fast, simple. Needs calibrated probabilities. - QBC: Multiple models, measure disagreement. Better captures model uncertainty, not just confidence. More compute, needs ensemble.

QBC often better for deep learning where single-model confidence is unreliable.

Q3: How do you prevent active learning from selecting only outliers?

A: 1. Diversity constraint: Mix uncertainty with clustering-based diversity 2. Typicality filter: Exclude samples far from training distribution 3. Human review: Annotator can reject "obvious error" queries 4. Hybrid strategies: 50% uncertainty + 50% random/diversity

Q4: Design an annotation system for multi-class image classification.

A: 1. Cold start: Use pre-trained features + clustering to select diverse initial batch 2. Query strategy: Mix entropy (70%) + diversity (30%) 3. Batch mode: Query 100-500 images per iteration (not 1) 4. Annotation UI: Show model predictions, allow quick correction 5. Stopping criterion: Stop when validation accuracy plateaus for 3 iterations 6. Quality control: Random sample review, annotator agreement tracking

Q5: What's the difference between active learning and semi-supervised learning?

A:

| Aspect | Active Learning | Semi-Supervised |
|--------|----------------|-----------------|
| Labels from | Human (oracle) | Model (pseudo-labels) |
| Unlabeled data | Pool to query | Training data |
| Cost | Annotation cost | Compute cost |
| Quality | Guaranteed correct | May propagate errors |
| When to use | Labels expensive | Some labels available |

Can combine: use AL to select samples for human labeling + SSL to leverage remaining unlabeled.


8. Practical Skills

Feature Engineering Checklist

  • Missing values: impute vs drop
  • Categorical: one-hot, target, embedding
  • Numerical: scale, log transform, bin
  • Outliers: cap, remove, transform
  • Interactions: polynomial, domain-specific
  • Time features: cyclical encoding (see the sketch after this list)
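
A cyclical-encoding sketch for the last item (hypothetical 'hour' column; sin/cos keeps 23:00 and 00:00 close in feature space, unlike the raw integer):

import numpy as np
import pandas as pd

df = pd.DataFrame({'hour': np.arange(24)})
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)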

Cross-Validation Strategy

| Data Type | CV Strategy |
|---|---|
| IID | K-Fold (5-10) |
| Imbalanced | Stratified K-Fold |
| Time Series | TimeSeriesSplit |
| Groups | GroupKFold |
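
A short sketch of two of these splitters (assumes model, X, y already exist):

from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit, cross_val_score

# Imbalanced data: preserve class ratios in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring='f1')

# Time series: training folds always precede validation folds, no shuffling
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, val_idx in tscv.split(X):
    pass  # fit on train_idx, evaluate on val_idx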

Hyperparameter Tuning

| Method | When to use |
|---|---|
| Grid Search | Few params, small space |
| Random Search | Large space |
| Bayesian (Optuna) | Expensive training |
| Successive Halving | Many configs, limited time |
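
A minimal Optuna sketch for Bayesian-style search (assumes X, y exist; the search space and scoring are illustrative):

import optuna
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'max_depth': trial.suggest_int('max_depth', 2, 8),
    }
    model = GradientBoostingClassifier(**params)
    return cross_val_score(model, X, y, cv=5, scoring='roc_auc').mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
print(study.best_params)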

Connections Between Topics

Linear Regression → Logistic Regression → Neural Networks
       ↓                    ↓                    ↓
   OLS closed-form     Sigmoid + BCE      Backprop
       ↓                    ↓                    ↓
   Regularization    Classification      Optimizers
       ↓               Metrics               ↓
   L1/L2 → Feature Selection ← ← ← ← ← Gradient Descent
   Decision Trees → Random Forest → Gradient Boosting
   Gini/Entropy ← ← ← ← Information Theory

Recommended Study Order

  1. Week 1: KNN, Linear/Logistic Regression (from scratch)
  2. Week 2: Decision Trees, Random Forest
  3. Week 3: Gradient Boosting (theory + XGBoost)
  4. Week 4: SVM (kernel trick), Naive Bayes
  5. Week 5: Feature Engineering + Selection
  6. Week 6: Practical: Kaggle competition