Classical ML: Study Materials¶
~18 minute read
Prerequisites: Model Selection | sklearn
Materials for 16 tasks in the Classical ML category: formulas, code, and key sources for each algorithm. Updated: 2026-02-11
1. ML Algorithms from Scratch (6 tasks)¶
KNN (impl_003_knn)¶
Best sources:
- Stanford CS229: KNN (lecture notes)
- StatQuest: KNN (YouTube visualization)
- Hands-On ML Chapter 3 (practical)
Key concepts:
| Distance | Formula | When to use |
|---|---|---|
| Euclidean | \(\sqrt{\sum(x_i-y_i)^2}\) | Default, continuous |
| Manhattan | \(\sum \mid x_i - y_i \mid\) | Outliers, high-dimensional data |
| Cosine | \(1 - \frac{x \cdot y}{\|x\|\|y\|}\) | Text, sparse |
Curse of Dimensionality: \(\text{Volume of unit ball} = \frac{\pi^{d/2}}{\Gamma(d/2+1)} \to 0 \text{ as } d \to \infty\)
In high dimensions, all points end up "far" from each other, so distance-based neighborhoods lose meaning.
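A minimal from-scratch KNN prediction sketch using the Euclidean distance from the table above (illustrative; `predict_knn` is a hypothetical helper and assumes `X_train`, `y_train` are numpy arrays):
import numpy as np
from collections import Counter

def predict_knn(X_train, y_train, x, k=5):
    # Euclidean distance from the query point to every training point
    distances = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among the k neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]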
Logistic Regression (impl_001_logistic_regression)¶
Best sources:
- StatQuest: Logistic Regression
- CS229 Notes: Logistic Regression
- Scikit-learn User Guide
Key formulas: \(\sigma(z) = \frac{1}{1+e^{-z}}\), \(\text{BCE} = -\frac{1}{n}\sum_i \left[y_i \log \hat{p}_i + (1-y_i)\log(1-\hat{p}_i)\right]\)
Decision Boundary: \(w^Tx + b = 0\)
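A gradient-descent sketch built from the sigmoid and BCE loss above (illustrative; `fit_logistic` is a hypothetical helper, assumes a numpy feature matrix X of shape (n, d) and binary labels y):
import numpy as np

def fit_logistic(X, y, lr=0.1, n_iter=1000):
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-(X @ w + b)))   # sigma(w^T x + b)
        grad_w = X.T @ (p - y) / len(y)      # gradient of BCE w.r.t. w
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b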
K-Means (impl_002_kmeans)¶
Best sources:
- StatQuest: K-Means
- CS229: K-Means
Algorithm:
1. Init centroids (random or k-means++)
2. Assign points to nearest centroid
3. Update centroids = mean of assigned points
4. Repeat until convergence
K-means++ initialization:
import numpy as np

def kmeans_plus_plus_init(X, k):
    # First centroid: a uniformly random point
    centroids = [X[np.random.randint(len(X))]]
    for _ in range(k - 1):
        # Distance from each point to its nearest existing centroid
        distances = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        # Sample the next centroid with probability proportional to squared distance
        probs = distances ** 2 / np.sum(distances ** 2)
        centroids.append(X[np.random.choice(len(X), p=probs)])
    return np.array(centroids)
Elbow Method: Plot \(J(C)\) vs \(k\), find "elbow"
Naive Bayes (impl_005_naive_bayes)¶
Best sources:
- StatQuest: Naive Bayes
- Scikit-learn: Naive Bayes
Gaussian NB: \(P(x_i|y=c) = \frac{1}{\sqrt{2\pi\sigma_c^2}} \exp\left(-\frac{(x_i-\mu_c)^2}{2\sigma_c^2}\right)\)
Prediction: \(\hat{y} = \arg\max_c P(c) \prod_i P(x_i|c)\)
Log-space (prevents underflow): \(\log P(y|x) \propto \log P(y) + \sum_i \log P(x_i|y)\)
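A minimal Gaussian NB prediction sketch in log-space following the formulas above (illustrative; assumes class priors, per-class feature means and variances are already estimated and stored in dictionaries keyed by class):
import numpy as np

def gnb_predict(x, priors, means, variances):
    # Score each class with log P(c) + sum_i log N(x_i | mu_c, sigma_c^2)
    best_class, best_score = None, -np.inf
    for c in priors:
        log_likelihood = -0.5 * np.sum(
            np.log(2 * np.pi * variances[c]) + (x - means[c]) ** 2 / variances[c]
        )
        score = np.log(priors[c]) + log_likelihood
        if score > best_score:
            best_class, best_score = c, score
    return best_class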
Decision Tree (impl_004_decision_tree)¶
Best sources:
- StatQuest: Decision Trees
- ESL Chapter 9 (Trees, CART)
Split Criteria:
| Criterion | Formula | Range |
|---|---|---|
| Gini | \(1 - \sum p_k^2\) | [0, 1-1/K] (binary: [0, 0.5]) |
| Entropy | \(-\sum p_k \log_2 p_k\) | [0, log2 K] (binary: [0, 1]) |
| MSE (regression) | \(\frac{1}{n}\sum(y_i - \bar{y})^2\) | [0, ∞) |
Recursive splitting:
def build_tree(X, y, depth=0):
    if stopping_condition(y, depth):            # e.g. max depth reached or node is pure
        return Leaf(majority_class(y))
    feature, threshold = find_best_split(X, y)  # minimize Gini / entropy / MSE
    mask = X[:, feature] <= threshold
    left = build_tree(X[mask], y[mask], depth + 1)
    right = build_tree(X[~mask], y[~mask], depth + 1)
    return Node(feature, threshold, left, right)
Neural Network from Scratch (impl_006_neural_network)¶
Best sources:
- Karpathy: micrograd (MUST DO)
- Karpathy: nn-zero-to-hero
- 3Blue1Brown: Neural Networks
Architecture:
Forward pass: \(z = Wx + b, \quad a = \text{ReLU}(z)\)
Backward pass (chain rule): \(\frac{\partial L}{\partial W} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial W} = \frac{\partial L}{\partial z} \cdot x^T\)
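A one-hidden-layer sketch in the spirit of micrograd showing the forward pass and the chain-rule gradients above (illustrative; MSE loss, numpy only, `forward_backward` is a hypothetical helper with W1/b1/W2/b2 as parameter arrays):
import numpy as np

def forward_backward(x, y, W1, b1, W2, b2):
    # Forward pass
    z1 = W1 @ x + b1
    a1 = np.maximum(0, z1)          # ReLU
    y_hat = W2 @ a1 + b2
    loss = 0.5 * np.sum((y_hat - y) ** 2)
    # Backward pass (chain rule)
    dy = y_hat - y                  # dL/dy_hat
    dW2 = np.outer(dy, a1)          # dL/dz2 * a1^T
    db2 = dy
    da1 = W2.T @ dy
    dz1 = da1 * (z1 > 0)            # ReLU gradient
    dW1 = np.outer(dz1, x)
    db1 = dz1
    return loss, (dW1, db1, dW2, db2)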
2. ML Classic Practice (10 tasks)¶
Linear Regression (classic_005_linear_regression)¶
Best sources:
- ISLR Chapter 3 (Linear Regression)
- StatQuest: Linear Regression
OLS Closed-form: \(\hat{w} = (X^TX)^{-1}X^Ty\)
R-squared: \(R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}\)
Adjusted R-squared: \(R^2_{adj} = 1 - \frac{(1-R^2)(n-1)}{n-p-1}\)
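A sketch of the closed-form OLS solution and R² above (illustrative helpers, not library functions; a pseudo-inverse is used for numerical stability):
import numpy as np

def ols_fit(X, y):
    X_b = np.c_[np.ones(len(X)), X]               # add bias column
    w = np.linalg.pinv(X_b.T @ X_b) @ X_b.T @ y   # (X^T X)^{-1} X^T y
    return w

def r_squared(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot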
Classification Metrics (classic_002_classification_metrics)¶
Best sources:
- Google ML Crash Course
- Scikit-learn: Metrics
Confusion Matrix:
| | Predicted + | Predicted - |
|---|---|---|
| Actual + | TP | FN |
| Actual - | FP | TN |
Metrics: \(\text{Precision} = \frac{TP}{TP+FP}\), \(\text{Recall} = \frac{TP}{TP+FN}\), \(F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}\)
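A small sketch computing these metrics from raw confusion-matrix counts (illustrative helper `classification_metrics`, not a library function):
def classification_metrics(tp, fp, fn, tn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}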
Class Imbalance (classic_003_class_imbalance)¶
Best sources:
- Imbalanced-learn documentation
- SMOTE Paper
Strategies:
| Method | Description | Pros/Cons |
|---|---|---|
| Undersampling | Remove majority | Fast, loses data |
| Oversampling | Duplicate minority | Simple, overfitting |
| SMOTE | Synthetic samples | Better, complex |
| Class weights | Loss modification | Easy, no data change |
| Focal Loss | Hard example mining | Deep learning |
SMOTE algorithm:
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_minority, k=5, n_samples=100):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    synthetic = []
    per_point = max(1, n_samples // len(X_minority))
    for x in X_minority:
        # k nearest neighbors of x within the minority class (skip the point itself)
        _, idx = nn.kneighbors([x])
        neighbors = X_minority[idx[0][1:]]
        for _ in range(per_point):
            neighbor = neighbors[np.random.randint(k)]
            alpha = np.random.rand()
            # Interpolate between the point and a random neighbor
            synthetic.append(x + alpha * (neighbor - x))
    return np.array(synthetic)
Ranking Metrics (classic_001_ranking_metrics)¶
Best sources:
- Ranking Metrics Explained
- NDCG Wikipedia
Metrics:
| Metric | Formula | Use Case |
|---|---|---|
| MRR | \(1/\text{rank of first relevant}\) | Search |
| MAP | Mean of AP@k | Retrieval |
| NDCG | \(DCG/IDCG\) | Ranked lists |
NDCG: \(DCG@k = \sum_{i=1}^{k} \frac{2^{rel_i}-1}{\log_2(i+1)}\), \(NDCG@k = \frac{DCG@k}{IDCG@k}\)
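An NDCG@k sketch following the DCG formula above (illustrative; `relevances` is assumed to be a list of graded relevance scores in ranked order):
import numpy as np

def dcg_at_k(relevances, k):
    relevances = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, len(relevances) + 2))  # log2(i+1) for positions 1..k
    return np.sum((2 ** relevances - 1) / discounts)

def ndcg_at_k(relevances, k):
    idcg = dcg_at_k(sorted(relevances, reverse=True), k)    # ideal ordering
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0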
Feature Selection (feat_002_feature_selection)¶
Best sources:
- Scikit-learn: Feature Selection
Methods:
| Type | Methods | When to use |
|---|---|---|
| Filter | Correlation, MI, chi2 | Fast baseline |
| Wrapper | RFE, forward/backward | Best quality |
| Embedded | L1, tree importance | Built-in |
RFE (Recursive Feature Elimination):
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=10)
X_selected = rfe.fit_transform(X, y)
Target Encoding (feat_001_target_encoding)¶
Best sources:
- Target Encoding Explained
- Category Encoders
Formula: \(\text{TE}(cat) = \frac{n \cdot \mu_{cat} + m \cdot \mu_{global}}{n + m}\)
where \(n\) = count in the category and \(m\) = smoothing parameter.
Leave-One-Out:
def target_encoding_loo(df, col, target):
global_mean = df[target].mean()
return df.groupby(col)[target].transform(
lambda x: (x.sum() - x) / (len(x) - 1)
)
Critical: Use CV to avoid leakage!
SVM (classic_007_svm)¶
Best sources:
- StatQuest: SVM
- ESL Chapter 12 (SVMs)
Hard Margin: \(\min_{w,b} \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w^Tx_i + b) \geq 1\)
Soft Margin (slack variables): \(\min_{w,b,\xi} \frac{1}{2}\|w\|^2 + C\sum\xi_i \quad \text{s.t.} \quad y_i(w^Tx_i + b) \geq 1 - \xi_i, \ \xi_i \geq 0\)
Kernel Trick (see the sketch after the kernel table): \(K(x, x') = \phi(x)^T\phi(x')\)
| Kernel | Formula |
|---|---|
| Linear | \(x^Tx'\) |
| Polynomial | \((\gamma x^Tx' + r)^d\) |
| RBF | \(\exp(-\gamma\|x-x'\|^2)\) |
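A sketch of the RBF kernel from the table and its use via sklearn's SVC (illustrative; `rbf_kernel` is a hypothetical helper, and X_train/y_train are assumed to exist):
import numpy as np
from sklearn.svm import SVC

def rbf_kernel(x, x_prime, gamma=0.5):
    # K(x, x') = exp(-gamma * ||x - x'||^2)
    return np.exp(-gamma * np.sum((x - x_prime) ** 2))

# Soft-margin SVM with an RBF kernel; C and gamma are illustrative values
svc = SVC(kernel='rbf', C=1.0, gamma='scale')
svc.fit(X_train, y_train)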
GBDT Internals (classic_008_gbdt_internals)¶
Best sources:
- Gradient Boosting from Scratch
- XGBoost Documentation
Key Idea: Fit trees to residuals (negative gradient).
Algorithm:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, n_trees, lr):
    F = np.full(len(y), y.mean())       # initial prediction = global mean
    trees = []
    for _ in range(n_trees):
        residuals = y - F               # negative gradient of MSE loss
        tree = DecisionTreeRegressor(max_depth=3)
        tree.fit(X, residuals)
        F += lr * tree.predict(X)
        trees.append(tree)
    return trees, F                     # fitted trees and final train predictions
XGBoost vs LightGBM vs CatBoost:
| Feature | XGBoost | LightGBM | CatBoost |
|---|---|---|---|
| Trees | Level-wise | Leaf-wise | Symmetric (oblivious) |
| Categorical | One-hot needed | Native | Native (best) |
| Speed | Good | Fastest | Good |
| Tuning | Medium | Medium | Easy |
3. Ensemble Theory Deep Dive¶
Bias-Variance Decomposition¶
Total Error: \(\text{MSE} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}\)
For an ensemble of \(B\) models: \(\text{Var}(\bar{f}) = \rho \sigma^2 + \frac{1-\rho}{B}\sigma^2\)
where \(\rho\) = correlation between the models' predictions and \(\sigma^2\) = variance of a single model.
Key Insight: Increasing \(B\) only shrinks the second term; diversity (\(1-\rho\)) is critical.
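A small simulation sketch of this formula: averaging B correlated predictors leaves a \(\rho\sigma^2\) variance floor (illustrative; equicorrelated Gaussian "predictions" built from a shared plus an individual component):
import numpy as np

def ensemble_variance(rho=0.6, sigma=1.0, B=50, n_trials=100_000):
    # Each predictor = sqrt(rho)*shared + sqrt(1-rho)*individual, giving pairwise correlation rho
    shared = np.sqrt(rho) * sigma * np.random.randn(n_trials, 1)
    individual = np.sqrt(1 - rho) * sigma * np.random.randn(n_trials, B)
    preds = shared + individual
    return preds.mean(axis=1).var()   # empirical variance of the ensemble average

print(ensemble_variance())  # ~ rho*sigma^2 + (1-rho)/B*sigma^2 = 0.608 for these defaults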
Bagging (Bootstrap Aggregating)¶
Mathematical Formulation: \(\hat{f}_{bag}(x) = \frac{1}{B}\sum_{b=1}^{B}\hat{f}_b(x)\)
Variance Reduction: for \(B\) uncorrelated predictors: \(\text{Var}(\hat{f}_{bag}) = \frac{\text{Var}(\hat{f})}{B}\)
For correlated predictors (\(\rho > 0\)): \(\text{Var}(\hat{f}_{bag}) = \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2\)
Random Forest Innovation: random feature selection at each split decorrelates the trees: \(\text{features\_per\_split} \approx \sqrt{p}\) (classification), \(p/3\) (regression)
Boosting¶
Key Idea: Sequential learning from mistakes (fit residuals).
Gradient Boosting as Gradient Descent in Function Space: \(F_m(x) = F_{m-1}(x) + \eta \cdot h_m(x)\)
where \(h_m\) is fit to the negative gradient: \(r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F=F_{m-1}}\)
AdaBoost Sample Reweighting: \(w_i^{(m+1)} = w_i^{(m)} \cdot \exp(\alpha_m \cdot \mathbb{I}[y_i \neq \hat{y}_i^{(m)}])\)
where \(\alpha_m = \frac{1}{2}\ln\frac{1-\epsilon_m}{\epsilon_m}\)
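A minimal AdaBoost sketch implementing the reweighting above (illustrative; decision stumps as weak learners, labels assumed to be in {-1, +1}):
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=50):
    w = np.full(len(y), 1 / len(y))
    learners, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        eps = np.sum(w[pred != y]) / np.sum(w)            # weighted error
        alpha = 0.5 * np.log((1 - eps) / (eps + 1e-10))   # learner weight alpha_m
        w *= np.exp(alpha * (pred != y))                  # upweight misclassified samples
        w /= w.sum()
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas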
XGBoost Regularization: \(\mathcal{L} = \sum_i L(y_i, \hat{y}_i) + \sum_k \Omega(f_k)\), with \(\Omega(f) = \gamma T + \frac{1}{2}\lambda\|w\|^2\)
where \(T\) = number of leaves and \(w\) = leaf weights.
Bagging vs Boosting Comparison¶
| Aspect | Bagging (RF) | Boosting (GBDT) |
|---|---|---|
| Training | Parallel | Sequential |
| Bias | Same as single tree | Reduces with iterations |
| Variance | Reduces significantly | Can increase |
| Overfitting | Resistant | Prone (need early stopping) |
| Outliers | Robust | Sensitive |
| Trees | Deep (fully grown) | Shallow (weak learners) |
| Interpretability | Feature importance | Feature importance + SHAP |
| Best for | High variance | High bias |
Why More Trees Can Hurt Boosting¶
Random Forest: \(\lim_{B \to \infty} \text{Var}(\text{RF}) = \rho\sigma^2\)
Converges to stable error. More trees = better (diminishing returns).
Gradient Boosting: \(F_M(x) = \sum_{m=1}^{M} \eta \cdot h_m(x)\)
Each iteration fits residuals. Eventually fits noise → overfitting.
Solution: Early stopping, regularization, small learning rate (\(\eta \in [0.01, 0.3]\)).
Stacking (Stacked Generalization)¶
Architecture:
Level 0: [Model1, Model2, ..., ModelK]
↓ predictions on holdout
Level 1: Meta-model (usually Linear/Logistic)
↓
Final prediction
Mathematical Formulation: \(\hat{y} = g(\hat{f}_1(x), \hat{f}_2(x), ..., \hat{f}_K(x))\)
Critical: Use out-of-fold predictions for meta-features to avoid leakage!
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge   # e.g. meta_model = Ridge()

def stacking_fit(base_models, meta_model, X, y, n_folds=5):
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)
    meta_features = np.zeros((len(X), len(base_models)))
    for i, model in enumerate(base_models):
        # Out-of-fold predictions -> meta-features without leakage
        for train_idx, val_idx in kf.split(X):
            model.fit(X[train_idx], y[train_idx])
            meta_features[val_idx, i] = model.predict(X[val_idx])
    meta_model.fit(meta_features, y)
    # Refit base models on the full training set for inference time
    for model in base_models:
        model.fit(X, y)
    return base_models, meta_model
Ensemble Diversity Metrics¶
Q-statistic (for classifiers): \(Q_{ij} = \frac{N^{11}N^{00} - N^{01}N^{10}}{N^{11}N^{00} + N^{01}N^{10}}\)
\(Q \in [-1, 1]\), where \(Q = 0\) means independent.
Double Fault Measure: \(DF_{ij} = \frac{N^{00}}{N}\)
Fraction of samples both classifiers get wrong. Lower = more diverse.
Python Implementation¶
# Random Forest with OOB evaluation
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(
n_estimators=200,
max_features='sqrt',
oob_score=True, # Out-of-bag evaluation
n_jobs=-1
)
rf.fit(X_train, y_train)
print(f"OOB Score: {rf.oob_score_:.3f}")
# Gradient Boosting with early stopping
from sklearn.ensemble import GradientBoostingClassifier
gb = GradientBoostingClassifier(
n_estimators=1000,
learning_rate=0.05,
max_depth=4,
validation_fraction=0.1,
n_iter_no_change=10, # Early stopping
tol=1e-4
)
gb.fit(X_train, y_train)
print(f"Actual iterations: {gb.n_estimators_}")
# XGBoost with full regularization
import xgboost as xgb

# Assumes X_train/y_train and X_val/y_val are already defined
dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)
params = {
'objective': 'binary:logistic',
'learning_rate': 0.05,
'max_depth': 6,
'reg_alpha': 0.1, # L1 regularization
'reg_lambda': 1.0, # L2 regularization
'subsample': 0.8,
'colsample_bytree': 0.8,
}
model = xgb.train(params, dtrain, num_boost_round=500,
evals=[(dval, 'val')], early_stopping_rounds=20)
# CatBoost with native categorical handling
from catboost import CatBoostClassifier
cat = CatBoostClassifier(
iterations=1000,
learning_rate=0.05,
depth=6,
cat_features=categorical_cols, # Native handling!
verbose=False
)
cat.fit(X_train, y_train, eval_set=(X_val, y_val),
early_stopping_rounds=50)
Interview Questions¶
Q1: Why does Random Forest not overfit with more trees but Gradient Boosting can?
A: RF trees are independent and parallel. As \(B \to \infty\), variance converges to \(\rho\sigma^2\) (law of large numbers). Boosting trees are sequential - each fits residuals from previous. Eventually fits noise, not signal. Solution: early stopping, regularization.
Q2: When would you prefer Bagging over Boosting?
A:
- High variance problem (complex models, small data) → Bagging
- High bias problem (simple models, large data) → Boosting
- Noisy data with outliers → Bagging (robust)
- Need interpretable feature importance → Both OK
- Limited compute for training → Bagging (parallelizable)
Q3: Explain the learning rate in Gradient Boosting.
A: Learning rate \(\eta \in (0, 1]\) scales each tree's contribution: \(F_m = F_{m-1} + \eta \cdot h_m\)
Small \(\eta\) requires more trees but:
- Better generalization (smoother updates)
- More robust to overfitting
- Often \(\eta \in [0.01, 0.3]\) works best
Trade-off: \(\eta \times n\_trees \approx \text{constant}\) for similar performance.
Q4: What is the role of max_features in Random Forest?
A: Controls decorrelation between trees:
- sqrt(p) for classification (default)
- p/3 for regression
- Smaller → more diverse trees, higher bias, lower variance
- Larger → more correlated trees, lower bias, higher variance
Q5: How does CatBoost handle categorical features differently?
A:
- No one-hot encoding needed (memory efficient)
- Uses ordered target statistics: each sample's encoding uses only previous samples
- Prevents target leakage
- Handles high-cardinality categories naturally
- Often outperforms manually encoded features
4. Online Learning for Streaming Data¶
Online vs Batch Learning¶
| Aspect | Batch Learning | Online Learning |
|---|---|---|
| Data Access | Full dataset available | Sequential, one at a time |
| Memory Usage | Proportional to data size | Constant (model size only) |
| Model Updates | Periodic retraining | Continuous incremental updates |
| Concept Drift | Model becomes stale | Adapts automatically |
| Latency | Hours to days for retraining | Milliseconds per update |
| Convergence | Global optimization possible | Sequential optimization with regret bounds |
Online Gradient Descent (OGD)¶
Update Rule: \(w_{t+1} = w_t - \eta_t \nabla L(w_t, x_t, y_t)\)
where \(\eta_t\) = learning rate at time \(t\) and \(L\) = loss function.
Learning Rate Schedules:
- Fixed: \(\eta_t = \eta_0\) — works for stationary distributions
- Decreasing: \(\eta_t = \eta_0 / \sqrt{t}\) — theoretical convergence guarantees
- Adaptive: increase when the model performs poorly (tracks concept drift)
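An OGD sketch for logistic loss with a \(1/\sqrt{t}\) schedule (illustrative; `ogd_logistic` is a hypothetical helper that processes one (x, y) pair at a time, y in {0, 1}):
import numpy as np

def ogd_logistic(stream, n_features, eta0=0.1):
    w = np.zeros(n_features)
    for t, (x, y) in enumerate(stream, start=1):
        p = 1 / (1 + np.exp(-w @ x))
        grad = (p - y) * x                 # gradient of log loss for one sample
        eta_t = eta0 / np.sqrt(t)          # decreasing learning rate
        w -= eta_t * grad
    return w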
The Regret Framework¶
Online learning performance is measured through regret — the difference between the algorithm's cumulative loss and that of the best fixed model in hindsight: \(\text{Regret}(T) = \sum_{t=1}^{T} \ell_t(w_t) - \min_w \sum_{t=1}^{T} \ell_t(w)\)
Goal: Sublinear regret growth (\(O(\sqrt{T})\) or \(O(\log T)\)) → average per-step regret → 0.
Follow-The-Regularized-Leader (FTRL)¶
Update Rule: \(w_{t+1} = \arg\min_w \left[ \sum_{i=1}^{t} L(w, x_i, y_i) + R(w) \right]\)
where \(R(w)\) = regularization (L1 for sparsity, L2 for smoothness).
FTRL-Proximal (Google production):
- Per-coordinate learning rates adapted to gradient history
- L1 penalty drives weights to zero (sparse models)
- Crucial for high-dimensional features (millions/billions)
Perceptron and Passive-Aggressive Algorithms¶
Perceptron Update (if misclassified): \(w_{t+1} = w_t + y_t x_t\)
Passive-Aggressive (PA): \(w_{t+1} = w_t + \tau_t y_t x_t, \quad \tau_t = \frac{\ell_t}{\|x_t\|^2}\)
Makes minimal update to achieve correct classification with margin.
Concept Drift Handling¶
Types of Concept Drift:

| Type | Description | Example |
|---|---|---|
| Sudden | Abrupt distribution change | System failure, policy change |
| Gradual | Slow evolution over time | User preference shifts |
| Incremental | Steady directional change | Market trends |
| Recurring | Cyclical patterns | Seasonal effects |
Drift Detection Methods:
- ADWIN (Adaptive Windowing): statistical test for distribution change
- DDM (Drift Detection Method): monitors error rate changes
- EDDM (Early DDM): detects drift earlier using the distance between errors
- Page-Hinkley Test: detects mean shifts in the stream (see the sketch below)
Adaptation Strategies:
1. Sliding Window: keep only the most recent \(W\) samples
2. Exponential Decay: weight recent samples more: \(w_t = \alpha^t\)
3. Ensemble Reset: replace the worst-performing model when drift is detected
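A minimal Page-Hinkley sketch for detecting an upward shift in a stream's mean (e.g. the online error rate); the `delta` tolerance and `threshold` values are illustrative:
class PageHinkley:
    def __init__(self, delta=0.005, threshold=50.0):
        self.delta, self.threshold = delta, threshold
        self.mean, self.n, self.cum, self.min_cum = 0.0, 0, 0.0, 0.0

    def update(self, value):
        # Running mean of the stream
        self.n += 1
        self.mean += (value - self.mean) / self.n
        # Cumulative deviation above the mean (minus tolerance delta)
        self.cum += value - self.mean - self.delta
        self.min_cum = min(self.min_cum, self.cum)
        # Drift if the deviation has grown far above its historical minimum
        return self.cum - self.min_cum > self.threshold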
Python Implementation¶
from sklearn.linear_model import SGDClassifier, PassiveAggressiveClassifier
from sklearn.naive_bayes import MultinomialNB
import numpy as np
# Online SGD Classifier
clf = SGDClassifier(loss='log_loss', learning_rate='optimal', eta0=0.01)
# Simulating streaming data
def online_learning_stream(clf, data_stream, classes):
for X_batch, y_batch in data_stream:
# Partial fit updates model incrementally
clf.partial_fit(X_batch, y_batch, classes=classes)
# Model is ready immediately after each batch
return clf
# Example with chunks
classes = np.array([0, 1])
for chunk_idx in range(10):
X_chunk, y_chunk = get_next_chunk(chunk_idx)
clf.partial_fit(X_chunk, y_chunk, classes=classes)
# Passive-Aggressive Classifier (good for text)
pa_clf = PassiveAggressiveClassifier(C=1.0, max_iter=1000)
pa_clf.partial_fit(X_first, y_first, classes=classes)
# River library for advanced streaming ML
from river import linear_model, optim, preprocessing
model = (
preprocessing.StandardScaler() |
linear_model.LogisticRegression(optimizer=optim.SGD(0.01))
)
for x_dict, y in data_stream:
# Learn from one sample
y_pred = model.predict_one(x_dict)
model.learn_one(x_dict, y)
Hoeffding Trees (VFDT)¶
Key Idea: Use Hoeffding bound to decide when to split with statistical confidence:
If best split - second best split > \(\epsilon\), make the split.
Advantages:
- Single pass through the data
- Constant memory
- Theoretical guarantees
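A sketch of the Hoeffding bound \(\epsilon\) behind the split decision above (illustrative helper; R is the range of the split criterion, e.g. log2(n_classes) for information gain):
import numpy as np

def hoeffding_bound(value_range, confidence, n_samples):
    # epsilon = sqrt(R^2 * ln(1/delta) / (2n)), with delta = 1 - confidence
    return np.sqrt(value_range ** 2 * np.log(1 / (1 - confidence)) / (2 * n_samples))

# Split if the observed gain difference between the two best candidates exceeds epsilon
eps = hoeffding_bound(value_range=np.log2(2), confidence=0.95, n_samples=500)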
from river import forest, tree
# Hoeffding Adaptive Tree
hat = tree.HoeffdingAdaptiveTreeClassifier()
for x_dict, y in data_stream:
y_pred = hat.predict_one(x_dict)
hat.learn_one(x_dict, y)
# Adaptive Random Forest
arf = forest.ARFClassifier(n_models=10)
for x_dict, y in data_stream:
arf.learn_one(x_dict, y)
Interview Questions¶
Q1: When would you choose online learning over batch learning?
A: Use online learning when:
- Data volume exceeds memory (can't store all)
- Real-time adaptation needed (fraud detection, recommendations)
- Concept drift expected (user preferences, market conditions)
- Latency requirements demand immediate updates
- Data arrives continuously (IoT sensors, clickstreams)
Q2: Explain the regret framework in online learning.
A: Regret = difference between the cumulative loss of the online algorithm vs. the best fixed model in hindsight: \(\text{Regret}(T) = \sum_{t=1}^{T} \ell_t(w_t) - \min_w \sum_{t=1}^{T} \ell_t(w)\)
Goal: sublinear regret (\(O(\sqrt{T})\)) → average regret → 0. Guarantees convergence despite sequential, myopic updates.
Q3: How does FTRL-Proximal achieve sparsity?
A: FTRL-Proximal adds L1 regularization to the FTRL objective: \(w_{t+1} = \arg\min_w \left[\sum L(w) + \lambda_1 \|w\|_1 + \lambda_2 \|w\|_2^2\right]\)
L1 penalty drives many weights to exactly zero. Combined with per-coordinate learning rates (features with consistent gradients get smaller rates), this produces sparse models suitable for high-dimensional problems (millions of features).
Q4: What is concept drift and how do you handle it?
A: Concept drift = change in data distribution over time. Types: sudden, gradual, incremental, recurring. Handling strategies:
- Detection: ADWIN, DDM, EDDM, Page-Hinkley tests
- Adaptation: sliding windows, exponential decay, ensemble reset
- Models: Hoeffding Adaptive Trees, Adaptive Random Forest
Q5: Compare SGD partial_fit vs standard fit.
A:
- fit(): Trains on entire dataset, overwrites previous state
- partial_fit(): Incremental update, preserves learned weights
# Batch (full retrain)
clf.fit(X_all, y_all)
# Online (incremental)
for X_chunk, y_chunk in stream:
clf.partial_fit(X_chunk, y_chunk, classes=[0,1])
5. Multi-Label Classification¶
Multi-Label vs Single-Label¶
| Aspect | Single-Label | Multi-Label |
|---|---|---|
| Output | One category | Multiple categories |
| Example | Email → Spam OR Not Spam | Article → Politics AND Economics AND Tech |
| Output Space | \(K\) classes | \(2^K\) possible combinations |
| Complexity | Linear in \(K\) | Exponential in \(K\) |
Problem Transformation Methods¶
1. Binary Relevance (BR)¶
Idea: Train \(L\) independent binary classifiers (one per label).
Pros: Simple, parallelizable, no label dependency assumptions.
Cons: Ignores label correlations.
from sklearn.multioutput import MultiOutputClassifier
from sklearn.linear_model import LogisticRegression
br = MultiOutputClassifier(LogisticRegression(max_iter=1000))
br.fit(X_train, y_train) # y_train shape: (n_samples, n_labels)
y_pred = br.predict(X_test)
2. Classifier Chains (CC)¶
Idea: Chain classifiers, each uses previous predictions as features.
Pros: Captures label dependencies.
Cons: Order-dependent, error propagation.
from sklearn.multioutput import ClassifierChain
from sklearn.ensemble import RandomForestClassifier
cc = ClassifierChain(
RandomForestClassifier(n_estimators=100),
order='random', # or specify order
random_state=42
)
cc.fit(X_train, y_train)
y_pred = cc.predict(X_test)
3. Label Powerset (LP)¶
Idea: Treat each unique label combination as a single class.
Pros: Captures all label correlations.
Cons: Exponential number of classes, severe class imbalance.
from skmultilearn.problem_transform import LabelPowerset
from sklearn.ensemble import RandomForestClassifier
lp = LabelPowerset(RandomForestClassifier())
lp.fit(X_train, y_train)
Comparison Table¶
| Method | Label Dependencies | Complexity | Parallelizable | Best For |
|---|---|---|---|---|
| Binary Relevance | None | \(O(L)\) | Yes | Independent labels |
| Classifier Chains | Sequential | \(O(L)\) | No | Sequential dependencies |
| Label Powerset | All | \(O(2^L)\) | Yes | Few label combinations |
Multi-Label Evaluation Metrics¶
Hamming Loss¶
Fraction of incorrect label predictions: \(\text{Hamming Loss} = \frac{1}{N \cdot L} \sum_{i=1}^{N} \sum_{\ell=1}^{L} \mathbb{I}[y_{i\ell} \neq \hat{y}_{i\ell}]\)
Lower is better (0 = perfect).
Jaccard Score (Intersection over Union)¶
Per-sample score \(\text{Jaccard}_i = \frac{|Y_i \cap \hat{Y}_i|}{|Y_i \cup \hat{Y}_i|}\), averaged over samples: \(\frac{1}{N} \sum_i \text{Jaccard}_i\)
Higher is better (1 = perfect).
Subset Accuracy (Exact Match)¶
\(\text{Subset Accuracy} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{I}[\hat{Y}_i = Y_i]\). Very strict — requires all labels correct.
from sklearn.metrics import hamming_loss, jaccard_score, f1_score
# Evaluation
hl = hamming_loss(y_true, y_pred) # Lower is better
js = jaccard_score(y_true, y_pred, average='samples') # Higher is better
f1_macro = f1_score(y_true, y_pred, average='macro')
f1_micro = f1_score(y_true, y_pred, average='micro')
Algorithm Adaptation Methods¶
MLkNN (Multi-Label k-Nearest Neighbors)¶
Uses label frequencies among the k nearest neighbors: \(\hat{y}_\ell = \mathbb{I}\left[\frac{\sum_{j \in N_k(x)} \mathbb{I}[y_{j\ell}=1]}{k} > \text{threshold}\right]\)
Python Implementation¶
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier, ClassifierChain
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import hamming_loss, jaccard_score, f1_score
# Generate multi-label data
X, y = make_multilabel_classification(
n_samples=1000,
n_features=20,
n_classes=5,
n_labels=2, # avg labels per instance
random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# Binary Relevance
br = MultiOutputClassifier(LogisticRegression(max_iter=1000))
br.fit(X_train, y_train)
y_pred_br = br.predict(X_test)
# Classifier Chain
cc = ClassifierChain(RandomForestClassifier(n_estimators=100))
cc.fit(X_train, y_train)
y_pred_cc = cc.predict(X_test)
# Evaluation function
def evaluate_multilabel(y_true, y_pred, name):
return {
'Method': name,
'Hamming Loss': hamming_loss(y_true, y_pred),
'Jaccard': jaccard_score(y_true, y_pred, average='samples'),
'F1 Micro': f1_score(y_true, y_pred, average='micro'),
'F1 Macro': f1_score(y_true, y_pred, average='macro')
}
print(evaluate_multilabel(y_test, y_pred_br, 'Binary Relevance'))
print(evaluate_multilabel(y_test, y_pred_cc, 'Classifier Chain'))
Interview Questions¶
Q1: When would you choose Binary Relevance vs Classifier Chains?
A:
- Binary Relevance: labels are independent, need parallelization, simple baseline
- Classifier Chains: labels are correlated (e.g., "Python" → "Programming"), sequential dependencies matter
Q2: Why is Label Powerset rarely used despite capturing all correlations?
A: Two main issues:
1. Exponential classes: with \(L\) labels, up to \(2^L\) combinations. Many are rare or unseen.
2. Class imbalance: most combinations have few samples, some have none.
Better for small \(L\) where most combinations appear in training.
Q3: Explain Hamming Loss vs Subset Accuracy.
A:
- Hamming Loss: fraction of wrong labels (per-label). Allows partial credit. Range: [0, 1], lower is better.
- Subset Accuracy: exact match required (per-instance). No partial credit. Very strict.
Example: True = [1,1,0,0], Pred = [1,0,0,0]
- Hamming Loss = 1/4 = 0.25 (one wrong label)
- Subset Accuracy = 0 (not an exact match)
Q4: Design a system for multi-label document classification with 1000 labels.
A:
1. Feature extraction: TF-IDF or embeddings (BERT, sentence-transformers)
2. Model choice:
   - If label correlations are weak: Binary Relevance with Linear SVM (fast, scalable)
   - If correlations matter: Classifier Chains (ensemble of chains for robustness)
   - For deep learning: binary cross-entropy with sigmoid outputs
3. Handling imbalance: per-label class weights, focal loss
4. Evaluation: Hamming Loss (primary), Jaccard Score, per-label F1
5. Production: model compression, label pruning (remove rare labels)
Q5: How do you handle extreme label imbalance in multi-label settings?
A:
1. Per-label weighting: higher weight for rare labels in the loss
2. Focal Loss: focus on hard examples
3. SMOTE variants: oversample minority label combinations
4. Negative sampling: for very many labels, sample negative labels
5. Label hierarchy: group similar labels, predict at the group level first
6. Semi-Supervised Learning¶
Core Assumptions¶
SSL relies on key assumptions about data structure:
| Assumption | Description | When It Breaks |
|---|---|---|
| Smoothness | Similar inputs → similar outputs | Noisy data, outliers |
| Cluster | Data forms groups, boundaries avoid clusters | High-dimensional, sparse data |
| Manifold | Data lies on low-dimensional manifold | Manifold crossing, holes |
| Low-density | Decision boundaries pass through low-density regions | Class overlap |
SSL Methods Taxonomy¶
1. Wrapper Methods (Self-Training)¶
Idea: Train on labeled data → predict unlabeled → add high-confidence predictions → repeat.
import numpy as np
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.ensemble import RandomForestClassifier

# -1 marks unlabeled samples; stack labeled and unlabeled data together
X_combined = np.vstack([X_labeled, X_unlabeled])
y_semi = np.concatenate([y_labeled, -np.ones(len(X_unlabeled), dtype=int)])
base_clf = RandomForestClassifier(n_estimators=100)
self_training = SelfTrainingClassifier(base_clf, threshold=0.9)
self_training.fit(X_combined, y_semi)
Risk: Confirmation bias — errors reinforce themselves.
2. Consistency Regularization¶
Key Idea: Model should give similar predictions for perturbed inputs.
Loss: \(\mathcal{L} = \mathcal{L}_{sup} + \lambda \cdot \mathbb{E}_{x, x'}[\|f(x) - f(x')\|^2]\)
where \(x'\) is augmented version of \(x\).
FixMatch Algorithm:
1. Generate weak and strong augmentations
2. If the weak-augmentation prediction is confident (>\(\tau\)), use it as a pseudo-label
3. Train on the strong augmentation with that pseudo-label
# FixMatch pseudo-code (CE, softmax, weak/strong_augment and lambda_u assumed defined)
def fixmatch_loss(model, x_l, y_l, x_u, tau=0.95):
    # Supervised loss on the labeled batch
    loss_sup = CE(model(x_l), y_l)
    # Unlabeled: pseudo-label from the weak augmentation
    probs_u = softmax(model(weak_augment(x_u)), dim=1)
    max_probs, pseudo_labels = probs_u.max(dim=1)        # confidence + hard pseudo-label
    mask = (max_probs > tau).float()                     # keep only confident samples
    # Train the strong augmentation to match the pseudo-label
    loss_unsup = (CE(model(strong_augment(x_u)), pseudo_labels, reduction='none') * mask).mean()
    return loss_sup + lambda_u * loss_unsup
3. Mean Teacher¶
Idea: Teacher (EMA of student) generates targets for student.
# Mean Teacher update
def update_teacher(teacher, student, alpha=0.99):
for t_param, s_param in zip(teacher.parameters(), student.parameters()):
t_param.data = alpha * t_param.data + (1 - alpha) * s_param.data
4. MixMatch¶
Combines multiple techniques:
1. Augmentation: multiple augmentations per sample
2. Pseudo-labeling: average predictions → sharpen
3. MixUp: interpolate labeled and unlabeled samples
Comparison Table¶
| Method | Key Idea | Pros | Cons |
|---|---|---|---|
| Self-Training | Pseudo-label high-confidence | Simple, any model | Confirmation bias |
| FixMatch | Strong/weak aug + threshold | SOTA on CIFAR | Needs good augmentations |
| Mean Teacher | EMA teacher targets | Stable, no threshold | Architecture-specific |
| MixMatch | Mix multiple techniques | Combines strengths | Many hyperparameters |
| S3VM | Adjust margin for unlabeled | Theoretical | Doesn't scale |
Production Considerations¶
Confirmation Bias Mitigation:
- Ensemble disagreement
- Uncertainty-aware filtering
- Teacher-student with EMA
- Human review of low-confidence predictions
Distribution Mismatch:
- Monitor drift between labeled and unlabeled distributions
- Use statistical tests (KL divergence, PSI)
- Filter out-of-distribution unlabeled data
Loss Balancing: \(\mathcal{L}_{total} = \mathcal{L}_{sup} + \lambda(t) \cdot \mathcal{L}_{unsup}\)
Ramp up \(\lambda(t)\) gradually: \(\lambda(t) = \lambda_{max} \cdot e^{-5(1-t/T)^2}\)
Python Implementation¶
import torch
import torch.nn.functional as F
from sklearn.semi_supervised import LabelSpreading
# Sklearn: Label Propagation
lp = LabelSpreading(kernel='knn', n_neighbors=7)
lp.fit(X_combined, y_semi) # -1 for unlabeled
y_pred = lp.predict(X_test)
# PyTorch: FixMatch-style training
def train_ssl_epoch(model, optimizer, labeled_loader, unlabeled_loader, tau=0.95):
model.train()
for (x_l, y_l), x_u in zip(labeled_loader, unlabeled_loader):
# Supervised loss
logits_l = model(x_l)
loss_sup = F.cross_entropy(logits_l, y_l)
# Pseudo-labeling with threshold
with torch.no_grad():
logits_u = model(x_u)
probs_u = F.softmax(logits_u, dim=1)
max_probs, pseudo_labels = probs_u.max(dim=1)
mask = (max_probs >= tau).float()
# Unsupervised loss (only confident predictions)
logits_u_strong = model(strong_augment(x_u))
loss_unsup = (F.cross_entropy(logits_u_strong, pseudo_labels, reduction='none') * mask).mean()
loss = loss_sup + 1.0 * loss_unsup
optimizer.zero_grad()
loss.backward()
optimizer.step()
Interview Questions¶
Q1: When does SSL help vs. hurt?
A:
- Helps: small labeled set (5-10%), unlabeled data matches the distribution, smooth/clustered data
- Hurts: distribution mismatch, noisy unlabeled data, class overlap, confirmation bias
Always validate on held-out labeled data.
Q2: Explain the confirmation bias problem in self-training.
A: Early errors in pseudo-labels get reinforced across epochs. A wrong high-confidence prediction becomes training data for future iterations, making the model more confident in the wrong direction.
Solutions: ensemble disagreement, uncertainty filtering, Mean Teacher (stable EMA targets), human review.
Q3: Compare FixMatch vs Mean Teacher.
A:
- FixMatch: uses a confidence threshold + strong augmentations. If the weak-aug prediction is >\(\tau\), train on the strong-aug version. Threshold-based filtering.
- Mean Teacher: teacher (EMA of student) provides soft targets. No threshold needed. Consistency loss between student and teacher predictions.
Both are SOTA, FixMatch often simpler to implement.
Q4: How much labeled data do you need for SSL to work?
A: Rule of thumb: 5-10% of the full dataset is often sufficient. But it depends on:
- Task complexity (simple → fewer labels needed)
- Data quality (clean → fewer labels needed)
- Model capacity (larger models need more data)
Start small, validate often, expand labeled set if gains plateau.
Q5: How do you detect distribution drift between labeled and unlabeled data?
A:
1. Feature statistics: compare mean/variance of features
2. KL divergence: compare predicted class distributions
3. PSI (Population Stability Index): feature-level drift
4. kNN distance: average distance to labeled neighbors
5. Domain classifier: train a classifier to distinguish labeled vs unlabeled
Flag samples with high drift scores for human review.
7. Active Learning¶
Active Learning Loop¶
Active learning is a special case of supervised ML where the model chooses which data to label.
Loop:
1. Initial Training: train on a small labeled set
2. Inference: predict on the unlabeled pool
3. Query: select the most informative samples (acquisition function)
4. Label: a human annotates the selected samples
5. Augment & Retrain: add to the training set, repeat
Query Strategies¶
1. Uncertainty Sampling¶
Select samples where model is least confident.
Least Confidence: \(x^* = \arg\max_x \left(1 - P_\theta(\hat{y}|x)\right)\)
where \(\hat{y} = \arg\max_y P_\theta(y|x)\)
Margin Sampling: \(x^* = \arg\min_x \left(P_\theta(\hat{y}_1|x) - P_\theta(\hat{y}_2|x)\right)\)
where \(\hat{y}_1, \hat{y}_2\) are the top-2 predictions.
Entropy: \(x^* = \arg\max_x H(P_\theta(y|x)) = \arg\max_x \left(-\sum_y P_\theta(y|x) \log P_\theta(y|x)\right)\)
2. Query-by-Committee (QBC)¶
Train ensemble of models, select samples with highest disagreement.
Vote Entropy: \(x^* = \arg\max_x \left(-\sum_y \frac{V(y)}{C} \log \frac{V(y)}{C}\right)\)
where \(V(y)\) = votes for class \(y\), \(C\) = committee size.
KL Divergence: \(x^* = \arg\max_x \frac{1}{C}\sum_{c=1}^{C} D_{KL}(P_{\theta_c} \| P_{avg})\)
3. Expected Model Change¶
Select samples that would cause the largest gradient change: \(x^* = \arg\max_x \|\nabla_\theta \mathcal{L}(\theta; x, \hat{y})\|\)
4. Diversity Sampling¶
Avoid selecting similar samples. Use clustering or embeddings.
import numpy as np
from sklearn.cluster import KMeans
# Diversity sampling via clustering
def diversity_sampling(X_unlabeled, n_samples, embeddings):
kmeans = KMeans(n_clusters=n_samples)
kmeans.fit(embeddings)
# Select sample closest to each centroid
selected = []
for i in range(n_samples):
cluster_samples = np.where(kmeans.labels_ == i)[0]
centroid = kmeans.cluster_centers_[i]
distances = np.linalg.norm(embeddings[cluster_samples] - centroid, axis=1)
selected.append(cluster_samples[np.argmin(distances)])
return selected
Comparison Table¶
| Strategy | Formula | Best For | Pitfalls |
|---|---|---|---|
| Least Confidence | \(\max(1-P(\hat{y}))\) | Binary/multiclass | Ignores runner-up |
| Margin | \(\min(P(\hat{y}_1)-P(\hat{y}_2))\) | Multiclass | Needs calibrated probs |
| Entropy | \(\max H(P(y\mid x))\) | Many classes | Can select outliers |
| QBC | Max vote entropy | Model ensembles | Computation heavy |
| Diversity | Clustering | Large pools | May miss rare cases |
Python Implementation¶
import numpy as np
from sklearn.base import clone
class ActiveLearner:
def __init__(self, estimator, strategy='entropy'):
self.estimator = estimator
self.strategy = strategy
def query(self, X_pool, n_samples=10):
"""Select most informative samples from pool."""
probs = self.estimator.predict_proba(X_pool)
if self.strategy == 'least_confident':
scores = 1 - np.max(probs, axis=1)
elif self.strategy == 'margin':
sorted_probs = np.sort(probs, axis=1)[:, ::-1]
scores = sorted_probs[:, 0] - sorted_probs[:, 1]
scores = -scores # Lower margin = higher score
elif self.strategy == 'entropy':
scores = -np.sum(probs * np.log(probs + 1e-10), axis=1)
return np.argsort(scores)[-n_samples:]
def teach(self, X, y):
"""Retrain with new labeled samples."""
self.estimator.fit(X, y)
# Active learning loop
def active_learning_loop(estimator, X_labeled, y_labeled, X_pool,
oracle, n_iterations=10, n_per_iter=10):
learner = ActiveLearner(estimator, strategy='entropy')
learner.teach(X_labeled, y_labeled)
for i in range(n_iterations):
# Query most informative samples
query_idx = learner.query(X_pool, n_samples=n_per_iter)
# Get labels from oracle (human annotator)
X_query = X_pool[query_idx]
y_query = oracle(X_query) # Human labels these
# Add to labeled set
X_labeled = np.vstack([X_labeled, X_query])
y_labeled = np.concatenate([y_labeled, y_query])
# Remove from pool
mask = np.ones(len(X_pool), dtype=bool)
mask[query_idx] = False
X_pool = X_pool[mask]
# Retrain
learner.teach(X_labeled, y_labeled)
print(f"Iteration {i+1}: {len(X_labeled)} labeled samples")
return learner
# Using modAL library (specialized for active learning)
from modAL.models import ActiveLearner as ModALLearner
from modAL.uncertainty import entropy_sampling
learner = ModALLearner(
estimator=RandomForestClassifier(),
query_strategy=entropy_sampling
)
learner.fit(X_initial, y_initial)
query_idx, query_inst = learner.query(X_pool, n_instances=10)
learner.teach(X_pool[query_idx], y_new_labels)
When Active Learning Fails¶
Red Flags:
- Biased queries: the model selects from a narrow region, ignores others
- Annotation drift: annotator fatigue → lower-quality labels over time
- Cold start problem: initial model too weak → bad query selection
- Outlier focus: uncertainty sampling picks noise/outliers
Mitigation:
- Mix uncertainty with diversity sampling
- Regular validation on a held-out test set
- Start with diversity-based selection for cold start
- Filter outliers before querying
Interview Questions¶
Q1: When would you use active learning vs. labeling everything?
A: Use AL when:
- Labeling is expensive (medical imaging, legal docs)
- The task has many "easy" samples the model already handles
- Budget/time is limited
- The model already has a reasonable baseline
Don't use AL when:
- The task is novel (no good initial model)
- Every sample is valuable
- You need to guarantee coverage
- Annotation is cheap
Q2: Compare uncertainty sampling vs query-by-committee.
A:
- Uncertainty Sampling: single model, uses confidence/probability. Fast, simple. Needs calibrated probabilities.
- QBC: multiple models, measures disagreement. Better captures model uncertainty, not just confidence. More compute, needs an ensemble.
QBC often better for deep learning where single-model confidence is unreliable.
Q3: How do you prevent active learning from selecting only outliers?
A:
1. Diversity constraint: mix uncertainty with clustering-based diversity
2. Typicality filter: exclude samples far from the training distribution
3. Human review: the annotator can reject "obvious error" queries
4. Hybrid strategies: 50% uncertainty + 50% random/diversity
Q4: Design an annotation system for multi-class image classification.
A:
1. Cold start: use pre-trained features + clustering to select a diverse initial batch
2. Query strategy: mix entropy (70%) + diversity (30%)
3. Batch mode: query 100-500 images per iteration (not 1)
4. Annotation UI: show model predictions, allow quick correction
5. Stopping criterion: stop when validation accuracy plateaus for 3 iterations
6. Quality control: random sample review, annotator agreement tracking
Q5: What's the difference between active learning and semi-supervised learning?
A:

| Aspect | Active Learning | Semi-Supervised |
|---|---|---|
| Labels from | Human (oracle) | Model (pseudo-labels) |
| Unlabeled data | Pool to query | Training data |
| Cost | Annotation cost | Compute cost |
| Quality | Guaranteed correct | May propagate errors |
| When to use | Labels expensive | Some labels available |
Can combine: use AL to select samples for human labeling + SSL to leverage remaining unlabeled.
8. Practical Skills¶
Feature Engineering Checklist¶
- Missing values: impute vs drop
- Categorical: one-hot, target, embedding
- Numerical: scale, log transform, bin
- Outliers: cap, remove, transform
- Interactions: polynomial, domain-specific
- Time features: cyclical encoding
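A sketch of cyclical encoding for the time features mentioned in the checklist above (illustrative helper `encode_cyclical`; maps hour-of-day onto the unit circle so 23 and 0 end up close, and assumes a dataframe `df` with an 'hour' column):
import numpy as np
import pandas as pd

def encode_cyclical(df, col, period):
    df[f"{col}_sin"] = np.sin(2 * np.pi * df[col] / period)
    df[f"{col}_cos"] = np.cos(2 * np.pi * df[col] / period)
    return df

df = encode_cyclical(df, "hour", period=24)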
Cross-Validation Strategy¶
| Data Type | CV Strategy |
|---|---|
| IID | K-Fold (5-10) |
| Imbalanced | Stratified K-Fold |
| Time Series | TimeSeriesSplit |
| Groups | GroupKFold |
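A short sketch of picking a splitter from the table above (illustrative; `model`, `X`, `y`, and `user_ids` are assumed to exist):
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit, GroupKFold, cross_val_score

# Imbalanced classes: preserve class ratios in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring='f1')

# Time series: always train on the past, validate on the future
ts_cv = TimeSeriesSplit(n_splits=5)

# Grouped data (e.g. multiple rows per user): keep each group in a single fold
group_cv = GroupKFold(n_splits=5)
scores = cross_val_score(model, X, y, cv=group_cv, groups=user_ids)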
Hyperparameter Tuning¶
| Method | When to use |
|---|---|
| Grid Search | Few params, small space |
| Random Search | Large space |
| Bayesian (Optuna) | Expensive training |
| Successive Halving | Many configs, limited time |
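A random-search sketch matching the table above (illustrative; the parameter distributions are example values for a gradient boosting model, and X_train/y_train are assumed to exist):
from scipy.stats import randint, uniform
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    'n_estimators': randint(100, 1000),
    'learning_rate': uniform(0.01, 0.3),   # samples from [0.01, 0.31]
    'max_depth': randint(2, 8),
}
search = RandomizedSearchCV(
    GradientBoostingClassifier(),
    param_distributions,
    n_iter=50,            # number of sampled configurations
    cv=5,
    scoring='roc_auc',
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)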
Connections Between Topics¶
Linear Regression → Logistic Regression → Neural Networks
↓ ↓ ↓
OLS closed-form Sigmoid + BCE Backprop
↓ ↓ ↓
Regularization Classification Optimizers
↓ Metrics ↓
L1/L2 → Feature Selection ← ← ← ← ← Gradient Descent
↓
Decision Trees → Random Forest → Gradient Boosting
↓
Gini/Entropy ← ← ← ← Information Theory
Recommended Study Order¶
- Week 1: KNN, Linear/Logistic Regression (from scratch)
- Week 2: Decision Trees, Random Forest
- Week 3: Gradient Boosting (theory + XGBoost)
- Week 4: SVM (kernel trick), Naive Bayes
- Week 5: Feature Engineering + Selection
- Week 6: Practical: Kaggle competition