
Classical ML: Updates 2025-2026

~4 minute read

Prerequisites: Model Selection | Hyperparameters

What has changed in Classical ML over the past year: new methods, trends, and deprecated approaches. Updated: 2026-02-11


1. Gradient Boosting — New Methods

NGBoost (Natural Gradient Boosting) — 2024-2025

Key Idea: Probabilistic predictions with uncertainty quantification.

Innovation: Natural gradient instead of the ordinary gradient.

Output: Distribution parameters (mean, variance), not just a point prediction.

from ngboost import NGBRegressor

ngb = NGBRegressor().fit(X_train, y_train)
dist = ngb.pred_dist(X_test)            # predicted Normal distribution
y_pred, y_std = dist.loc, dist.scale    # mean and standard deviation

TabPFN (Tabular Prior-Data Fitted Network) — 2024-2025

Breakthrough: A neural network pretrained on synthetic tabular data.

Key insight: In-context learning for tabular data (few-shot).

Results:
- Beats XGBoost on small datasets (<1000 samples)
- Zero training required!
- 1-2 seconds inference

Limitation: Currently capped at 1000 samples and 100 features.
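
A minimal usage sketch, assuming the tabpfn package's scikit-learn-style interface (TabPFNClassifier with fit/predict is its documented entry point; constructor arguments vary by version):

from tabpfn import TabPFNClassifier

# No gradient training happens here: fit() essentially stores the data,
# and prediction is a single forward pass (in-context learning).
clf = TabPFNClassifier()
clf.fit(X_train, y_train)            # <=1000 rows, <=100 features
proba = clf.predict_proba(X_test)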

Why Trees Still Win (2025 Research)

Paper: "Why do tree-based models still outperform deep learning on tabular data?"

Findings:
1. Tabular data has irregular patterns
2. Trees handle heterogeneous features better
3. Neural networks overfit on noise
4. Feature scaling hurts trees less

Recommendation: Start with trees, try NN only on large homogeneous datasets.


2. AutoML for Classical ML — Mainstream

FLAML (Microsoft) — 2024-2025

Key Idea: Cost-effective hyperparameter optimization.

Features:
- Learns from prior trials
- Early stopping per config
- Handles a time budget

from flaml import AutoML

automl = AutoML()
# Searches learners and hyperparameters within the given time budget
automl.fit(X_train, y_train, task="classification", time_budget=60)  # 1 minute

AutoGluon (Amazon) — 2025 Updates

Key Features:
- Multi-layer stacking
- Weighted ensembles
- Zero-code training

Performance: Often beats manual tuning.
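
A short sketch of the TabularPredictor workflow (the label column name "target" and the presets value are illustrative):

from autogluon.tabular import TabularPredictor

# fit() handles preprocessing, model selection, stacking, and ensembling
predictor = TabularPredictor(label="target").fit(train_df, presets="best_quality")
preds = predictor.predict(test_df)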

PyCaret 4.0 (2025)

Improvements:
- Better model selection
- Built-in experiment tracking
- Improved interpretability
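
A minimal sketch of the low-code workflow, assuming the familiar PyCaret functional API carries over to 4.0 (df and "target" are placeholders):

from pycaret.classification import setup, compare_models

s = setup(data=df, target="target", session_id=42)  # preprocessing + CV config
best = compare_models()                              # train and rank candidates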


3. Categorical Encoding — New Standards

CatBoost Encoding Evolution (2025)

Ordered Target Statistics:
1. Randomly permute the data
2. For each row, compute the target mean over previous rows only
3. No leakage!

Why it works: Exploits ordering structure, removes target leakage.
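
An illustrative numpy sketch for a single categorical column (cats and y are assumed arrays of category ids and targets; CatBoost itself also averages over several permutations):

import numpy as np

rng = np.random.default_rng(0)
order = rng.permutation(len(cats))        # step 1: random permutation
sums, counts = {}, {}
encoded = np.empty(len(cats))
prior = y.mean()
for i in order:
    c = cats[i]
    n = counts.get(c, 0)
    # step 2: target mean over PREVIOUS rows of this category (prior-smoothed)
    encoded[i] = (sums.get(c, 0.0) + prior) / (n + 1)
    sums[c] = sums.get(c, 0.0) + y[i]     # stats updated only after encoding,
    counts[c] = n + 1                     # step 3: hence no target leakage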

Entity Embeddings Revisited (2025)

Trend: Learn categorical embeddings with neural networks, then use in tree models.

# Two-stage approach (schematic: train_embeddings / encode are placeholders
# for a small neural network with an embedding layer)
import numpy as np

# Stage 1: learn embeddings for the categorical columns
embedding_model = train_embeddings(categorical_data)
embeddings = embedding_model.encode(categories)

# Stage 2: concatenate with numerical features and fit a GBDT
X_with_embeddings = np.hstack([numerical, embeddings])
gbdt.fit(X_with_embeddings, y)

4. Imbalanced Learning — New Methods

Focal Loss for Classical ML (2025)

Origin: Object detection; now adapted for tabular data.

Formula: \(L_{focal} = -\alpha (1 - p_t)^{\gamma} \log(p_t)\)

Effect: Focuses learning on hard examples.

Adaptation: Can be used with logistic regression and neural networks on tabular data.
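
A minimal numpy implementation of the formula above (the function name and defaults are illustrative; alpha=0.25, gamma=2 are the values from the original detection paper):

import numpy as np

def focal_loss(y_true, p_pred, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss: (1 - p_t)^gamma down-weights easy examples."""
    p_pred = np.clip(p_pred, eps, 1 - eps)
    p_t = np.where(y_true == 1, p_pred, 1 - p_pred)    # prob. of the true class
    alpha_t = np.where(y_true == 1, alpha, 1 - alpha)  # class weighting
    return float(np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t)))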

Self-Paced Learning for Imbalance (2025)

Idea: Start with easy examples, gradually include hard ones.

Benefit: Better convergence, less noise sensitivity.
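
A toy sketch of the idea with scikit-learn (binary labels assumed; the 50/75/100% schedule and the logistic model are arbitrary illustrative choices):

import numpy as np
from sklearn.linear_model import LogisticRegression

model = LogisticRegression().fit(X, y)   # initial fit on all data
for frac in (0.5, 0.75, 1.0):            # curriculum: easy -> everything
    p = np.clip(model.predict_proba(X)[:, 1], 1e-7, 1 - 1e-7)
    loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))  # per-sample log loss
    easy = loss <= np.quantile(loss, frac)             # keep the easiest share
    model = LogisticRegression().fit(X[easy], y[easy])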


5. Feature Engineering Automation

Featuretools Evolution (2025)

Automated Feature Engineering:
- Deep Feature Synthesis
- Handles time series
- Entity relationships

import featuretools as ft

es = ft.EntitySet(id="data")
es.add_dataframe(dataframe_name="customers", dataframe=df, index="id")
# Deep Feature Synthesis: stacks primitives across entity relationships
feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name="customers")

OpenFE (2025)

Key Idea: Automatically generate and prune large pools of candidate features.

Process:
1. Generate candidates by applying operators to base features
2. Prune candidates with a fast two-stage evaluation
3. Select the best features


6. SVM — Current Practice

Linear SVM for Large Scale (2025)

Trend: Linear SVMs (SGD-based) for millions of samples.

Libraries:
- scikit-learn SGDClassifier(loss='hinge')
- LIBLINEAR
- ThunderSVM (GPU)

When to use: Text classification, high-dimensional sparse data.
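
For instance, with scikit-learn (hyperparameter values are illustrative):

from sklearn.linear_model import SGDClassifier

# Hinge loss makes SGDClassifier a linear SVM; per-sample updates let it
# scale to millions of rows and sparse high-dimensional inputs.
clf = SGDClassifier(loss="hinge", alpha=1e-4, max_iter=1000, tol=1e-3)
clf.fit(X_train, y_train)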

Kernel SVM — Rarely Used (2025)

Reality: Kernel SVMs are computationally expensive (\(O(n^3)\) training).

Modern alternatives:
- Random Fourier Features + Linear SVM
- Neural networks
- Gradient Boosting
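
The first alternative is available directly in scikit-learn; a minimal sketch (gamma and n_components are illustrative):

from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Approximate the RBF kernel with random Fourier features, then train a
# linear SVM on top: near-kernel quality at roughly linear cost.
model = make_pipeline(
    RBFSampler(gamma=1.0, n_components=300, random_state=0),
    SGDClassifier(loss="hinge"),
)
model.fit(X_train, y_train)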


7. Naive Bayes — Niche Applications

Gaussian NB for Anomaly Detection (2025)

Use case: When the normal data follows a clear, well-behaved distribution.

Advantage: Fast, interpretable.
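
A sketch of the idea (assumes independent, roughly Gaussian features, exactly as Gaussian NB does; X_normal, X_new, and the 1% threshold are placeholders):

import numpy as np
from scipy.stats import norm

# Fit per-feature Gaussians on normal data; flag low-likelihood points.
mu = X_normal.mean(axis=0)
sigma = X_normal.std(axis=0) + 1e-9
train_ll = norm.logpdf(X_normal, mu, sigma).sum(axis=1)
threshold = np.quantile(train_ll, 0.01)   # bottom 1% of normal data
anomalies = norm.logpdf(X_new, mu, sigma).sum(axis=1) < threshold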

Complement NB for Imbalanced Text (2025)

Paper: "Tackling the Poor Assumptions of Naive Bayes"

Improvement: Uses complement class statistics.

from sklearn.naive_bayes import ComplementNB

cnb = ComplementNB()  # uses complement-class statistics; robust to imbalance
cnb.fit(X_counts, y)  # X_counts: bag-of-words counts, e.g. from CountVectorizer

8. KNN — Approximate Methods

ANN (Approximate Nearest Neighbors) — Mainstream (2025)

Algorithms:
- HNSW (Hierarchical Navigable Small World)
- IVF (Inverted File Index)
- LSH (Locality-Sensitive Hashing)

Libraries:
- FAISS (Facebook)
- Annoy (Spotify)
- hnswlib

Speed: 10-1000x faster than brute force at 95%+ recall.

import faiss

index = faiss.IndexHNSWFlat(128, 16)  # dim=128, M=16 graph links per node
index.add(embeddings)                 # embeddings: float32 array, shape (n, 128)
D, I = index.search(query, 10)        # distances and ids of the 10 nearest

9. Decision Trees — Interpretability Focus

SHAP for Trees (2025 Standard)

Tree SHAP: Fast exact Shapley values for tree ensembles.

import shap

explainer = shap.TreeExplainer(model)   # exact Shapley values for trees
shap_values = explainer.shap_values(X)  # per-row, per-feature attributions

Decision Tree Extraction from NN (2025)

Trend: Train a neural network, then extract a decision tree for interpretability.

Use case: Regulated industries (finance, healthcare).
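
One common recipe is surrogate distillation; a sketch (nn_model stands for an already-trained network, max_depth=4 is arbitrary):

from sklearn.tree import DecisionTreeClassifier

# Fit a shallow tree to the network's predicted labels (not the raw
# targets) to get a human-readable surrogate of its decision boundary.
soft_labels = nn_model.predict(X_train)
surrogate = DecisionTreeClassifier(max_depth=4).fit(X_train, soft_labels)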


10. Hybrid Approaches (2025-2026)

Trees + Neural Networks

Methods:
1. Deep Neural Decision Forest: differentiable trees
2. NODE: Neural Oblivious Decision Ensembles
3. GrowNet: gradient boosting with neural nets as weak learners

TabNet (2025 Maturity)

Architecture: Attention-based feature selection + decision steps.

from pytorch_tabnet.tab_model import TabNetClassifier

model = TabNetClassifier()
model.fit(X_train, y_train, eval_set=[(X_val, y_val)])  # validation-based early stopping

Pros: Interpretable, learns feature interactions.
Cons: Slower training, more tuning.


Deprecated Approaches

Deprecated                        Replacement
Grid search for large spaces      Bayesian optimization (Optuna)
One-hot for high cardinality      Target encoding, embeddings
Single decision tree              Ensembles (RF, GBDT)
Brute-force KNN                   ANN (FAISS, HNSW)
SVM for large data                Linear SVM, GBDT
Manual feature engineering        AutoML, Featuretools

New interview question types:

  1. "Why do trees beat neural networks on tabular?"
  2. Answer: Heterogeneous features, irregular patterns, noise robustness

  3. "When would you use TabPFN?"

  4. Answer: Small datasets (<1000), quick prototyping

  5. "Explain CatBoost categorical handling"

  6. Answer: Ordered target statistics, no leakage

  7. "Approximate NN vs exact KNN"

  8. Answer: Speed vs accuracy tradeoff, HNSW/IVF algorithms

  9. "How does NGBoost provide uncertainty?"

  10. Answer: Predicts distribution parameters, natural gradient

Practical interview tasks:

  • Implement KNN from scratch (HNSW as a bonus)
  • Explain why XGBoost overfits and how to fix it
  • Design a feature engineering pipeline for production
  • Compare CatBoost vs LightGBM on a specific dataset