
Classical ML: Updates 2025-2026

~4 minute read

Prerequisites: Model Selection | Hyperparameters

What has changed in Classical ML over the past year: new methods, trends, and deprecated approaches. Updated: 2026-02-11


1. Gradient Boosting — New Methods

NGBoost (Natural Gradient Boosting) — 2024-2025

Key Idea: Probabilistic predictions with uncertainty quantification.

Innovation: Natural gradient instead of the ordinary gradient.

Output: Distribution parameters (mean, variance), not just a point prediction.

from ngboost import NGBRegressor

ngb = NGBRegressor().fit(X_train, y_train)
dist = ngb.pred_dist(X_test)            # predicted Normal distribution
y_pred, y_std = dist.loc, dist.scale    # mean and standard deviation

TabPFN (Tabular Prior-Data Fitted Network) — 2024-2025

Breakthrough: A neural network pretrained on synthetic tabular data.

Key insight: In-context learning for tabular data (few-shot).

Results:
- Beats XGBoost on small datasets (<1000 samples)
- Zero training required!
- 1-2 seconds inference

Limitation: Currently capped at 1000 samples and 100 features.
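
A minimal usage sketch, assuming the tabpfn package's scikit-learn-style interface (TabPFNClassifier with fit/predict is its documented entry point; constructor arguments vary by version):

from tabpfn import TabPFNClassifier

# No gradient training happens here: fit() essentially stores the data,
# and prediction is a single forward pass (in-context learning).
clf = TabPFNClassifier()
clf.fit(X_train, y_train)            # <=1000 rows, <=100 features
proba = clf.predict_proba(X_test)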

Why Trees Still Win (2025 Research)

Paper: "Why do tree-based models still outperform deep learning on tabular data?"

Findings:
1. Tabular data has irregular patterns
2. Trees handle heterogeneous features better
3. Neural networks overfit on noise
4. Feature scaling hurts trees less

Recommendation: Start with trees, try NN only on large homogeneous datasets.


2. AutoML for Classical ML — Mainstream

FLAML (Microsoft) — 2024-2025

Key Idea: Cost-effective hyperparameter optimization.

Features:
- Learns from prior trials
- Early stopping per config
- Handles a time budget

from flaml import AutoML

automl = AutoML()
# Searches learners and hyperparameters within the given time budget
automl.fit(X_train, y_train, task="classification", time_budget=60)  # 1 minute

AutoGluon (Amazon) — 2025 Updates

Key Features:
- Multi-layer stacking
- Weighted ensembles
- Zero-code training

Performance: Often beats manual tuning.
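
A short sketch of the TabularPredictor workflow (the label column name "target" and the presets value are illustrative):

from autogluon.tabular import TabularPredictor

# fit() handles preprocessing, model selection, stacking, and ensembling
predictor = TabularPredictor(label="target").fit(train_df, presets="best_quality")
preds = predictor.predict(test_df)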

PyCaret 4.0 (2025)

Improvements:
- Better model selection
- Built-in experiment tracking
- Improved interpretability
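
A minimal sketch of the low-code workflow, assuming the familiar PyCaret functional API carries over to 4.0 (df and "target" are placeholders):

from pycaret.classification import setup, compare_models

s = setup(data=df, target="target", session_id=42)  # preprocessing + CV config
best = compare_models()                              # train and rank candidates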


3. Categorical Encoding — New Standards

CatBoost Encoding Evolution (2025)

Ordered Target Statistics:
1. Randomly permute the data
2. For each row, compute the target mean over previous rows only
3. No leakage!

Why it works: Exploits ordering structure, removes target leakage.
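
An illustrative numpy sketch for a single categorical column (cats and y are assumed arrays of category ids and targets; CatBoost itself also averages over several permutations):

import numpy as np

rng = np.random.default_rng(0)
order = rng.permutation(len(cats))        # step 1: random permutation
sums, counts = {}, {}
encoded = np.empty(len(cats))
prior = y.mean()
for i in order:
    c = cats[i]
    n = counts.get(c, 0)
    # step 2: target mean over PREVIOUS rows of this category (prior-smoothed)
    encoded[i] = (sums.get(c, 0.0) + prior) / (n + 1)
    sums[c] = sums.get(c, 0.0) + y[i]     # stats updated only after encoding,
    counts[c] = n + 1                     # step 3: hence no target leakage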

Entity Embeddings Revisited (2025)

Trend: Learn categorical embeddings with neural networks, then use in tree models.

# Two-stage approach (schematic: train_embeddings / encode are placeholders
# for a small neural network with an embedding layer)
import numpy as np

# Stage 1: learn embeddings for the categorical columns
embedding_model = train_embeddings(categorical_data)
embeddings = embedding_model.encode(categories)

# Stage 2: concatenate with numerical features and fit a GBDT
X_with_embeddings = np.hstack([numerical, embeddings])
gbdt.fit(X_with_embeddings, y)

4. Imbalanced Learning — New Methods

Focal Loss for Classical ML (2025)

Origin: Object detection; now adapted for tabular data.

Formula: \(L_{focal} = -\alpha (1 - p_t)^{\gamma} \log(p_t)\)

Effect: Focuses learning on hard examples.

Adaptation: Can be used with logistic regression and neural networks on tabular data.
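
A minimal numpy implementation of the formula above (the function name and defaults are illustrative; alpha=0.25, gamma=2 are the values from the original detection paper):

import numpy as np

def focal_loss(y_true, p_pred, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss: (1 - p_t)^gamma down-weights easy examples."""
    p_pred = np.clip(p_pred, eps, 1 - eps)
    p_t = np.where(y_true == 1, p_pred, 1 - p_pred)    # prob. of the true class
    alpha_t = np.where(y_true == 1, alpha, 1 - alpha)  # class weighting
    return float(np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t)))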

Self-Paced Learning for Imbalance (2025)

Idea: Start with easy examples, gradually include hard ones.

Benefit: Better convergence, less noise sensitivity.
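
A toy sketch of the idea with scikit-learn (binary labels assumed; the 50/75/100% schedule and the logistic model are arbitrary illustrative choices):

import numpy as np
from sklearn.linear_model import LogisticRegression

model = LogisticRegression().fit(X, y)   # initial fit on all data
for frac in (0.5, 0.75, 1.0):            # curriculum: easy -> everything
    p = np.clip(model.predict_proba(X)[:, 1], 1e-7, 1 - 1e-7)
    loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))  # per-sample log loss
    easy = loss <= np.quantile(loss, frac)             # keep the easiest share
    model = LogisticRegression().fit(X[easy], y[easy])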


5. Feature Engineering Automation

Featuretools Evolution (2025)

Automated Feature Engineering:
- Deep Feature Synthesis
- Handles time series
- Entity relationships

import featuretools as ft

es = ft.EntitySet(id="data")
es.add_dataframe(dataframe_name="customers", dataframe=df, index="id")
# Deep Feature Synthesis: stacks primitives across entity relationships
feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name="customers")

OpenFE (2025)

Key Idea: Automatically generate and prune large pools of candidate features.

Process:
1. Generate candidates by applying operators to base features
2. Prune candidates with a fast two-stage evaluation
3. Select the best features


6. SVM — Current Practice

Linear SVM for Large Scale (2025)

Trend: Linear SVMs (SGD-based) for millions of samples.

Libraries:
- scikit-learn SGDClassifier(loss='hinge')
- LIBLINEAR
- ThunderSVM (GPU)

When to use: Text classification, high-dimensional sparse data.
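
For instance, with scikit-learn (hyperparameter values are illustrative):

from sklearn.linear_model import SGDClassifier

# Hinge loss makes SGDClassifier a linear SVM; per-sample updates let it
# scale to millions of rows and sparse high-dimensional inputs.
clf = SGDClassifier(loss="hinge", alpha=1e-4, max_iter=1000, tol=1e-3)
clf.fit(X_train, y_train)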

Kernel SVM — Rarely Used (2025)

Reality: Kernel SVMs are computationally expensive (\(O(n^3)\) training).

Modern alternatives:
- Random Fourier Features + Linear SVM
- Neural networks
- Gradient Boosting
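
The first alternative is available directly in scikit-learn; a minimal sketch (gamma and n_components are illustrative):

from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Approximate the RBF kernel with random Fourier features, then train a
# linear SVM on top: near-kernel quality at roughly linear cost.
model = make_pipeline(
    RBFSampler(gamma=1.0, n_components=300, random_state=0),
    SGDClassifier(loss="hinge"),
)
model.fit(X_train, y_train)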


7. Naive Bayes — Niche Applications

Gaussian NB for Anomaly Detection (2025)

Use case: When the normal data follows a clear, well-behaved distribution.

Advantage: Fast, interpretable.
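
A sketch of the idea (assumes independent, roughly Gaussian features, exactly as Gaussian NB does; X_normal, X_new, and the 1% threshold are placeholders):

import numpy as np
from scipy.stats import norm

# Fit per-feature Gaussians on normal data; flag low-likelihood points.
mu = X_normal.mean(axis=0)
sigma = X_normal.std(axis=0) + 1e-9
train_ll = norm.logpdf(X_normal, mu, sigma).sum(axis=1)
threshold = np.quantile(train_ll, 0.01)   # bottom 1% of normal data
anomalies = norm.logpdf(X_new, mu, sigma).sum(axis=1) < threshold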

Complement NB for Imbalanced Text (2025)

Paper: "Tackling the Poor Assumptions of Naive Bayes"

Improvement: Uses complement class statistics.

from sklearn.naive_bayes import ComplementNB

cnb = ComplementNB()  # uses complement-class statistics; robust to imbalance
cnb.fit(X_counts, y)  # X_counts: bag-of-words counts, e.g. from CountVectorizer

8. KNN — Approximate Methods

ANN (Approximate Nearest Neighbors) — Mainstream (2025)

Algorithms:
- HNSW (Hierarchical Navigable Small World)
- IVF (Inverted File Index)
- LSH (Locality-Sensitive Hashing)

Libraries:
- FAISS (Facebook)
- Annoy (Spotify)
- hnswlib

Speed: 10-1000x faster than brute force at 95%+ recall.

import faiss

index = faiss.IndexHNSWFlat(128, 16)  # dim=128, M=16 graph links per node
index.add(embeddings)                 # embeddings: float32 array, shape (n, 128)
D, I = index.search(query, 10)        # distances and ids of the 10 nearest

9. Decision Trees — Interpretability Focus

SHAP for Trees (2025 Standard)

Tree SHAP: Fast exact Shapley values for tree ensembles.

import shap

explainer = shap.TreeExplainer(model)   # exact Shapley values for trees
shap_values = explainer.shap_values(X)  # per-row, per-feature attributions

Decision Tree Extraction from NN (2025)

Trend: Train a neural network, then extract a decision tree for interpretability.

Use case: Regulated industries (finance, healthcare).
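
One common recipe is surrogate distillation; a sketch (nn_model stands for an already-trained network, max_depth=4 is arbitrary):

from sklearn.tree import DecisionTreeClassifier

# Fit a shallow tree to the network's predicted labels (not the raw
# targets) to get a human-readable surrogate of its decision boundary.
soft_labels = nn_model.predict(X_train)
surrogate = DecisionTreeClassifier(max_depth=4).fit(X_train, soft_labels)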


10. Hybrid Approaches (2025-2026)

Trees + Neural Networks

Methods:
1. Deep Neural Decision Forest: differentiable trees
2. NODE: Neural Oblivious Decision Ensembles
3. GrowNet: gradient boosting with neural nets as weak learners

TabNet (2025 Maturity)

Architecture: Attention-based feature selection + decision steps.

from pytorch_tabnet.tab_model import TabNetClassifier

model = TabNetClassifier()
model.fit(X_train, y_train, eval_set=[(X_val, y_val)])  # validation-based early stopping

Pros: Interpretable, learns feature interactions.
Cons: Slower training, more tuning.


Deprecated Approaches

Deprecated                        Replacement
Grid search for large spaces      Bayesian optimization (Optuna)
One-hot for high cardinality      Target encoding, embeddings
Single decision tree              Ensembles (RF, GBDT)
Brute-force KNN                   ANN (FAISS, HNSW)
SVM for large data                Linear SVM, GBDT
Manual feature engineering        AutoML, Featuretools

New interview question types:

  1. "Why do trees beat neural networks on tabular?"
  2. Answer: Heterogeneous features, irregular patterns, noise robustness

  3. "When would you use TabPFN?"

  4. Answer: Small datasets (<1000), quick prototyping

  5. "Explain CatBoost categorical handling"

  6. Answer: Ordered target statistics, no leakage

  7. "Approximate NN vs exact KNN"

  8. Answer: Speed vs accuracy tradeoff, HNSW/IVF algorithms

  9. "How does NGBoost provide uncertainty?"

  10. Answer: Predicts distribution parameters, natural gradient

Practical interview tasks:

  • Implement KNN from scratch (HNSW as a bonus)
  • Explain why XGBoost overfits and how to fix it
  • Design a feature engineering pipeline for production
  • Compare CatBoost vs LightGBM on a specific dataset