Classical ML: Updates 2025-2026¶
~4 minute read
Prerequisites: Model Selection | Hyperparameters
What has changed in Classical ML over the past year: new methods, trends, and deprecated approaches. Updated: 2026-02-11
1. Gradient Boosting — New Methods¶
NGBoost (Natural Gradient Boosting) — 2024-2025¶
Key Idea: Probabilistic predictions with uncertainty quantification.
Innovation: Natural gradient instead of the ordinary gradient.
Output: Distribution parameters (mean, variance), not just a point prediction.
from ngboost import NGBRegressor

ngb = NGBRegressor().fit(X_train, y_train)
dist = ngb.pred_dist(X_test)            # full predictive distribution (Normal by default)
y_pred, y_std = dist.loc, dist.scale    # per-sample mean and standard deviation
TabPFN (Tabular Prior-Data Fitted Network) — 2024-2025¶
Breakthrough: Neural network pretrained on synthetic tabular data.
Key insight: In-context learning for tabular data (few-shot).
Results:
- Beats XGBoost on small datasets (<1000 samples)
- Zero training required!
- 1-2 seconds inference
Limitation: Max 1000 samples, 100 features currently.
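A minimal usage sketch with the tabpfn package (sklearn-style interface; X_train, y_train, X_test are assumed small in-memory numpy arrays):

from tabpfn import TabPFNClassifier

# No gradient training: the pretrained transformer does in-context learning at predict time
clf = TabPFNClassifier(device="cpu")
clf.fit(X_train, y_train)              # essentially stores the data (must respect the size limits)
proba = clf.predict_proba(X_test)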
Why Trees Still Win (2025 Research)¶
Paper: "Why do tree-based models still outperform deep learning on tabular data?"
Findings:
1. Tabular data has irregular patterns
2. Trees handle heterogeneous features better
3. Neural networks overfit on noise
4. Feature scaling hurts trees less
Recommendation: Start with trees, try NN only on large homogeneous datasets.
2. AutoML for Classical ML — Mainstream¶
FLAML (Microsoft) — 2024-2025¶
Key Idea: Cost-effective hyperparameter optimization.
Features:
- Learns from prior trials
- Early stopping per config
- Handles time budget
from flaml import AutoML

automl = AutoML()
# Searches models and hyperparameters within a 1-minute budget (seconds)
automl.fit(X_train, y_train, task="classification", time_budget=60)
AutoGluon (Amazon) — 2025 Updates¶
Key Features:
- Multi-layer stacking
- Weighted ensemble
- Zero-code training
Performance: Often beats manual tuning.
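A minimal sketch of the AutoGluon tabular API (train_df and test_df are placeholder pandas DataFrames with a "target" column):

from autogluon.tabular import TabularPredictor

# Trains many models, then builds a stacked, weighted ensemble automatically
predictor = TabularPredictor(label="target").fit(train_df, time_limit=600)
leaderboard = predictor.leaderboard(test_df)   # compare all trained models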
PyCaret 4.0 (2025)¶
Improvements:
- Better model selection
- Built-in experiment tracking
- Improved interpretability
3. Categorical Encoding — New Standards¶
CatBoost Encoding Evolution (2025)¶
Ordered Target Statistics:
1. Random permutation of the data
2. For each row, compute the target mean of its category over the preceding rows only
3. No leakage!
Why it works: each row's encoding never uses that row's own target, so the ordering-based statistics keep target information while removing target leakage.
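A rough numpy/pandas sketch of the idea (not CatBoost's internal implementation; the single permutation and the prior value are simplifications):

import numpy as np
import pandas as pd

def ordered_target_stats(cat: pd.Series, y: pd.Series, prior: float = 0.5) -> pd.Series:
    """Encode each row with target means taken from earlier rows of the same category."""
    perm = np.random.permutation(len(cat))        # artificial random ordering
    enc = pd.Series(index=cat.index, dtype=float)
    sums, counts = {}, {}
    for i in perm:
        c = cat.iloc[i]
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        enc.iloc[i] = (s + prior) / (n + 1)       # only preceding rows contribute
        sums[c], counts[c] = s + y.iloc[i], n + 1
    return enc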
Entity Embeddings Revisited (2025)¶
Trend: Learn categorical embeddings with neural networks, then use them in tree models.
# Two-stage approach (sketch: train_embeddings / encode stand in for any neural
# embedding model, e.g. a Keras Embedding layer; gbdt is any gradient boosting model)
import numpy as np

# Stage 1: learn dense embeddings for the categorical columns
embedding_model = train_embeddings(categorical_data)
embeddings = embedding_model.encode(categories)

# Stage 2: concatenate embeddings with the numerical features and train a GBDT
X_with_embeddings = np.hstack([numerical, embeddings])
gbdt.fit(X_with_embeddings, y)
4. Imbalanced Learning — New Methods¶
Focal Loss для Classical ML (2025)¶
Origin: Originally for object detection, now adapted for tabular data.
Formula: \(L_{focal} = -\alpha (1 - p_t)^\gamma \log(p_t)\)
Effect: Focuses learning on hard examples.
Adaptation: Can be used with logistic regression and neural networks on tabular data.
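A minimal numpy sketch of binary focal loss on predicted probabilities (illustrative only; alpha=0.25 and gamma=2.0 are the usual defaults from the detection literature):

import numpy as np

def focal_loss(y_true, p, alpha=0.25, gamma=2.0, eps=1e-7):
    """Down-weights easy examples so the loss concentrates on hard ones."""
    p = np.clip(p, eps, 1 - eps)
    p_t = np.where(y_true == 1, p, 1 - p)    # probability assigned to the true class
    return np.mean(-alpha * (1 - p_t) ** gamma * np.log(p_t))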
Self-Paced Learning для Imbalance (2025)¶
Idea: Start with easy examples, gradually include hard ones.
Benefit: Better convergence, less noise sensitivity.
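A toy sketch of the schedule with scikit-learn (the quantile schedule and the LogisticRegression base model are arbitrary choices for illustration):

import numpy as np
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)                    # bootstrap fit to estimate example difficulty
for q in [0.5, 0.7, 0.9, 1.0]:                 # admit progressively harder examples
    p = model.predict_proba(X_train)[:, 1]
    loss = -np.log(np.clip(np.where(y_train == 1, p, 1 - p), 1e-7, None))
    easy = loss <= np.quantile(loss, q)        # the easiest q-fraction under the current model
    model.fit(X_train[easy], y_train[easy])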
5. Feature Engineering Automation¶
Featuretools Evolution (2025)¶
Automated Feature Engineering:
- Deep Feature Synthesis
- Handles time series
- Entity relationships
import featuretools as ft

# df: a customers DataFrame with a unique "id" column
es = ft.EntitySet(id="data")
es.add_dataframe(dataframe_name="customers", dataframe=df, index="id")
# Deep Feature Synthesis: stacks primitives across the entity set
feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name="customers")
OpenFE (2025)¶
Key Idea: LLM-guided feature engineering.
Process:
1. Generate feature candidates with LLM
2. Evaluate candidates
3. Select best features
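A generic sketch of the evaluate/select part of such a loop (generate_candidates is a hypothetical stand-in for the LLM step, and the plain CV-gain rule is a simplification rather than OpenFE's actual selection algorithm):

import numpy as np
from sklearn.model_selection import cross_val_score
from lightgbm import LGBMClassifier

baseline = cross_val_score(LGBMClassifier(), X, y, cv=3).mean()
selected = []
for name, feature in generate_candidates(X):      # hypothetical candidate generator
    X_new = np.column_stack([X, feature])
    score = cross_val_score(LGBMClassifier(), X_new, y, cv=3).mean()
    if score > baseline:                          # keep candidates that improve the CV score
        selected.append(name)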
6. SVM — Practical Trends¶
Linear SVM для Large Scale (2025)¶
Trend: Linear SVM (SGD-based) for millions of samples.
Libraries:
- scikit-learn SGDClassifier(loss='hinge')
- LIBLINEAR
- ThunderSVM (GPU)
When to use: Text classification, high-dimensional sparse data.
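A minimal sparse text-classification sketch with scikit-learn (texts and labels are placeholders for raw documents and their classes):

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**18)   # sparse, high-dimensional features
X = vectorizer.transform(texts)
clf = SGDClassifier(loss="hinge", alpha=1e-5)      # hinge loss = linear SVM objective, trained by SGD
clf.fit(X, labels)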
Kernel SVM — Rarely Used (2025)¶
Reality: Kernel SVMs are computationally expensive (\(O(n^3)\) training).
Modern alternatives:
- Random Fourier Features + Linear SVM
- Neural networks
- Gradient Boosting
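The Random Fourier Features route is a short pipeline in scikit-learn (gamma and n_components here are arbitrary; tune them in practice):

from sklearn.kernel_approximation import RBFSampler
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Approximate the RBF kernel with random Fourier features, then fit a linear SVM
model = make_pipeline(RBFSampler(gamma=0.1, n_components=500), LinearSVC())
model.fit(X_train, y_train)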
7. Naive Bayes — Niche Applications¶
Gaussian NB for Anomaly Detection (2025)¶
Use case: When normal data has clear distribution.
Advantage: Fast, interpretable.
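A sketch of the underlying recipe: the same independence-plus-Gaussian assumption as Gaussian NB, used as a density model over normal-only data (X_normal, X_test, and the 1% threshold are illustrative assumptions):

import numpy as np
from scipy.stats import norm

# Fit an independent Gaussian per feature on normal data only
mu, sigma = X_normal.mean(axis=0), X_normal.std(axis=0) + 1e-9
log_lik = norm.logpdf(X_test, loc=mu, scale=sigma).sum(axis=1)

# Flag samples whose likelihood is far below what normal data achieves
threshold = np.quantile(norm.logpdf(X_normal, loc=mu, scale=sigma).sum(axis=1), 0.01)
anomalies = log_lik < threshold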
Complement NB for Imbalanced Text (2025)¶
Paper: "Tackling the Poor Assumptions of Naive Bayes"
Improvement: Uses complement class statistics.
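Available directly in scikit-learn; a minimal sketch (X_train_counts is assumed to be bag-of-words counts, e.g. CountVectorizer output):

from sklearn.naive_bayes import ComplementNB

# Parameters are estimated from each class's complement,
# which reduces the bias toward majority classes on imbalanced text
clf = ComplementNB()
clf.fit(X_train_counts, y_train)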
8. KNN — Approximate Methods¶
ANN (Approximate Nearest Neighbors) — Mainstream (2025)¶
Algorithms:
- HNSW (Hierarchical Navigable Small World)
- IVF (Inverted File Index)
- LSH (Locality Sensitive Hashing)
Libraries:
- FAISS (Facebook)
- Annoy (Spotify)
- hnswlib
Speed: 10-1000x faster than brute force at 95%+ recall.
import faiss
import numpy as np

d, M = 128, 16                                      # vector dimension, HNSW graph connectivity
index = faiss.IndexHNSWFlat(d, M)
index.add(embeddings.astype(np.float32))            # FAISS expects float32 arrays
D, I = index.search(query.astype(np.float32), 10)   # distances and indices of 10 nearest neighbors
9. Decision Trees — Interpretability Focus¶
SHAP for Trees (2025 Standard)¶
Tree SHAP: Fast exact Shapley values for tree ensembles.
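Minimal usage with the shap library (model is assumed to be a fitted tree ensemble such as XGBoost or LightGBM):

import shap

# TreeExplainer computes exact Shapley values efficiently for tree ensembles
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)     # global view of feature importance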
Decision Tree Extraction from NN (2025)¶
Trend: Train neural network, extract decision tree for interpretability.
Use case: Regulated industries (finance, healthcare).
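One common recipe is distillation: fit a small tree on the network's predictions instead of the raw labels. A sketch (nn_model is an assumed, already-fitted network with a predict method):

from sklearn.tree import DecisionTreeClassifier, export_text

# Use the neural network's predictions as labels for a small surrogate tree
surrogate = DecisionTreeClassifier(max_depth=4)
surrogate.fit(X_train, nn_model.predict(X_train))
print(export_text(surrogate))              # human-readable rules approximating the NN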
10. Hybrid Approaches (2025-2026)¶
Trees + Neural Networks¶
Methods:
1. Deep Neural Decision Forest: differentiable trees
2. NODE: Neural Oblivious Decision Ensembles
3. GrowNet: gradient boosting with neural nets as weak learners
TabNet (2025 Maturity)¶
Architecture: Attention-based feature selection + decision steps.
from pytorch_tabnet.tab_model import TabNetClassifier

# Expects numpy arrays; the eval_set drives early stopping
model = TabNetClassifier()
model.fit(X_train, y_train, eval_set=[(X_val, y_val)])
Pros: Interpretable, learns feature interactions
Cons: Slower training, more tuning
Deprecated Approaches¶
| Deprecated | Replacement |
|---|---|
| Grid Search for large spaces | Bayesian optimization (Optuna) |
| One-hot encoding for high cardinality | Target encoding, embeddings |
| Single decision tree | Ensemble (RF, GBDT) |
| Brute-force KNN | ANN (FAISS, HNSW) |
| SVM for large data | Linear SVM, GBDT |
| Manual feature engineering | AutoML, Featuretools |
Interview Trends 2025-2026¶
New question types:¶
- "Why do trees beat neural networks on tabular?"
-
Answer: Heterogeneous features, irregular patterns, noise robustness
-
"When would you use TabPFN?"
-
Answer: Small datasets (<1000), quick prototyping
-
"Explain CatBoost categorical handling"
-
Answer: Ordered target statistics, no leakage
-
"Approximate NN vs exact KNN"
-
Answer: Speed vs accuracy tradeoff, HNSW/IVF algorithms
-
"How does NGBoost provide uncertainty?"
- Answer: Predicts distribution parameters, natural gradient
Practical interview tasks:¶
- Implement KNN from scratch (with HNSW bonus)
- Explain why XGBoost overfits and how to fix it
- Design feature engineering pipeline for production
- Compare CatBoost vs LightGBM for specific dataset