Перейти к содержанию

Classical ML: Пробелы (Gaps)

~9 минут чтения

Что спрашивают на собеседованиях, чего НЕТ в 16 задачах Недопокрытые темы для AI/ML/LLM Engineer Обновлено: 2026-02-11


Текущее покрытие (16 задач)

Подкатегория Задач Покрытие
Algorithms from Scratch 6 Отличное
Classic Practice 10 Хорошее

КРИТИЧЕСКИЕ GAPS

1. Ensemble Theory Deep Dive — ЧАСТИЧНО ЗАПОЛНЕНО

Добавлено в materials.md section 3: - Bias-variance decomposition для ensembles - Bagging variance reduction formula - Boosting as gradient descent in function space - AdaBoost sample reweighting - XGBoost regularization (\(\gamma T + \frac{1}{2}\lambda\|w\|^2\)) - Bagging vs Boosting comparison table - Stacking with out-of-fold predictions - Ensemble diversity metrics (Q-statistic, Double Fault) - Python implementations (RF with OOB, XGBoost, CatBoost) - Interview questions (5 Q&A)

Источники: Medium Bagging vs Boosting Deep Dive, XGBoost docs, sklearn guides

Осталось: - Отдельная практическая задача (ContentBlock) - Voting classifiers deep dive

2. Online Learning — ЧАСТИЧНО ЗАПОЛНЕНО

Добавлено в materials.md section 4: - Online vs Batch learning comparison table - Online Gradient Descent: \(w_{t+1} = w_t - \eta_t \nabla L(w_t, x_t, y_t)\) - Regret framework: \(\text{Regret}(T) = \sum \ell_t(w_t) - \min_w \sum \ell_t(w)\) - FTRL-Proximal for sparse models - Perceptron and Passive-Aggressive algorithms - Concept drift types and detection methods (ADWIN, DDM, EDDM) - Hoeffding Trees (VFDT) with Hoeffding bound - Python implementations (sklearn partial_fit, River library) - Interview questions (5 Q&A)

Источники: ML Journey (Nov 2025), sklearn docs, River library docs

Осталось: - Отдельная практическая задача (ContentBlock) - Multi-armed bandits connection

3. Multi-Label Classification — ЧАСТИЧНО ЗАПОЛНЕНО

Добавлено в materials.md section 5: - Multi-Label vs Single-Label comparison table - Problem Transformation Methods: Binary Relevance, Classifier Chains, Label Powerset - Comparison table (dependencies, complexity, parallelizable) - Multi-Label Evaluation Metrics: Hamming Loss, Jaccard Score, Subset Accuracy - Algorithm Adaptation: MLkNN - Python implementations (sklearn MultiOutputClassifier, ClassifierChain) - Interview questions (5 Q&A)

Источники: ML Journey (Sep 2025), sklearn docs, skmultilearn docs

Осталось: - Отдельная практическая задача (ContentBlock) - Deep learning approaches (binary cross-entropy with sigmoid)

4. Cost-Sensitive Learning — ЧАСТИЧНО ЗАПОЛНЕНО

Добавлено в interview-qa.md: - Cost matrix definition and examples - 3 methods in PyTorch: weighted CrossEntropy, custom loss, sample-wise weights - When to use: medical, fraud, spam, loan scenarios - Cost-sensitive vs class imbalance comparison - Cost-weighted accuracy and expected cost metrics - Business metrics connection - Interview questions (5 Q&A)

Источники: CodeGenes (Nov 2025), LinkedIn Engineering (2025), Elkan (2001)

Осталось: - MetaCost algorithm details - Threshold adjustment optimization


СРЕДНИЕ GAPS

5. Calibration (НЕТ отдельной задачи)

Есть: mlsd_008_model_calibration в ML System Design НЕТ: Practical calibration для classical ML

Что спрашивают: - Platt scaling - Isotonic regression - Calibration curves - Brier score

6. Interpretability — ЧАСТИЧНО ЗАПОЛНЕНО

Добавлено в interview-qa.md: - SHAP vs LIME comparison - SHAP values interpretation (global + local) - Production challenges - Python code examples

Осталось: - Отдельная практическая задача (ContentBlock) - Permutation importance deep dive - PDP/ICE plots implementation

Источники: xAI interview questions 2025, MAANG AI interview guide

7. Semi-Supervised Learning — ЧАСТИЧНО ЗАПОЛНЕНО

Добавлено в materials.md section 6: - Core assumptions (smoothness, cluster, manifold, low-density) - SSL Methods Taxonomy: Self-Training, Consistency Regularization, Mean Teacher, MixMatch - FixMatch algorithm with PyTorch code - Comparison table (methods, pros, cons) - Production considerations (confirmation bias, distribution mismatch, loss balancing) - Python implementations (sklearn LabelSpreading, PyTorch FixMatch-style) - Interview questions (5 Q&A)

Источники: LabelYourData (Jul 2025), arXiv papers

Осталось: - Отдельная практическая задача (ContentBlock) - Co-training, multi-view learning details

8. Active Learning — ЧАСТИЧНО ЗАПОЛНЕНО

Добавлено в materials.md section 7: - Active Learning loop (5 steps) - Query Strategies: Uncertainty Sampling (least confident, margin, entropy), Query-by-Committee (vote entropy, KL divergence), Expected Model Change, Diversity Sampling - Comparison table (strategies, formulas, best for, pitfalls) - Python implementation (custom ActiveLearner class, modAL library) - Failure modes (biased queries, annotation drift, cold start, outlier focus) - Interview questions (5 Q&A)

Источники: Lightly.ai (Aug 2025), LabelYourData (May 2025)

Осталось: - Отдельная практическая задача (ContentBlock) - Expected Error Reduction method details


НОВЫЕ ТЕМЫ 2025-2026 (НЕТ)

9. TabPFN и In-Context Learning для Tabular — ЧАСТИЧНО ЗАПОЛНЕНО

Добавлено в interview-qa.md section 20: - TabPFN definition: Tabular Prior-data Fitted Network, zero-shot prediction - TabPFN vs Traditional ML comparison table - How TabPFN works: pre-training on synthetic data, in-context learning - Limitations table: max samples 50K (v2.5), max features 2K, GPU required - When to use TabPFN vs XGBoost decision guide - Hybrid smart_classifier implementation - TabPFN-2.5 improvements (Nov 2025) - 6 Q&A

Источники: Nature paper "Accurate predictions on small data" (Jan 2025), Prior Labs TabPFN-2.5 report, Towards Data Science

Осталось: - Отдельная практическая задача (ContentBlock) - TabPFN interpretability (SHAP integration)

10. AutoML Theory — ЧАСТИЧНО ЗАПОЛНЕНО

Добавлено в interview-qa.md section 18: - AutoML определение и проблемы - Bayesian Optimization с формулами (GP prior, EI, PI, UCB) - Python BayesianOptimizer class implementation - Grid vs Random vs Bayesian comparison table - Multi-Fidelity Optimization (Successive Halving, ASHA, Hyperband) - AutoML system design for 50 DS team - 5 Q&A

Источники: Johal.in (2025), Optuna docs, Ray Tune docs

Осталось: - Neural Architecture Search (DARTS, ENAS) details - Meta-learning для AutoML

11. Federated Learning Basics — ЧАСТИЧНО ЗАПОЛНЕНО

Добавлено в interview-qa.md section 19: - Federated Learning definition and key principles - FedAvg algorithm with formula: \(w^{t+1} = \sum_{k \in S_t} \frac{n_k}{n} w_k^{t+1}\) - FedAvg problems and solutions table (Client Drift, Communication, Stragglers, Non-IID) - FedProx with proximal term formula - Local vs Global updates explanation - Differential Privacy in FL (DP-FedAvg) - FL system design for mobile keyboard prediction - FedAvg vs FedProx vs SCAFFOLD comparison - 7 Q&A

Источники: McMahan et al. FedAvg paper, CodeGenes FedAvg PyTorch guide, interviews.chat

Осталось: - Отдельная практическая задача (ContentBlock) - Secure aggregation protocols


Практические Gaps

12. Production ML Patterns — ЧАСТИЧНО ЗАПОЛНЕНО

Добавлено в interview-qa.md section 21: - Deployment patterns comparison table (Blue-Green, Canary, Shadow, A/B Testing, Champion-Challenger) - Blue-Green deployment with Kubernetes + Istio YAML example - Canary deployment with Argo Rollouts progressive traffic shifting - Shadow deployment architecture and Python implementation - A/B Testing vs Canary comparison with user segmentation code - Champion-Challenger pipeline design with Python code - Automated rollback implementation with threshold monitoring - Decision tree for pattern selection - Pattern combinations best practices - Cost and rollback speed comparison table - 6 Q&A (Basic/Medium/Killer)

Источники: MatterAI Deployment Strategies (Jan 2026), ML Journey Shadow vs Canary (Sept 2025), Raghu's Deployment Patterns, FICO Champion/Challenger (Dec 2025)

Осталось: - Отдельная практическая задача (ContentBlock) - Multi-armed bandit deployment patterns

13. Data Drift Detection — ЧАСТИЧНО ЗАПОЛНЕНО

Добавлено в interview-qa.md section 22: - Types of drift: Data drift (covariate shift), Concept drift (P(Y|X)), Label drift (P(Y)) - Detection methods comparison table (KS Test, Chi-Square, PSI, Wasserstein Distance, KL Divergence) - PSI implementation with Python code and thresholds (<0.1 stable, 0.1-0.25 moderate, >0.25 significant) - Adversarial Validation explanation with code - Concept drift detection (ADWIN, DDM, EDDM) - Retraining decision framework - Interview questions (5 Q&A)

Источники: AllDays Tech PSI Guide, Label Your Data Drift Detection, Towards Data Science

Осталось: - Отдельная практическая задача (ContentBlock) - Model-based drift detection

14. Model Debugging — ЧАСТИЧНО ЗАПОЛНЕНО

Добавлено в interview-qa.md section 17: - Slice-based evaluation with code - Error analysis process (5 steps) - Data debugging techniques (label noise, leakage, drift, outliers, duplicates) - CleanLab for label issues - Regression testing with ModelRegressionTest class - Production recommendation debugger with PSI drift computation - 6 Q&A

Источники: CleanLab, PSI methodology, production debugging patterns


Underspecified Topics

15. Hyperparameter Interactions — ЧАСТИЧНО ЗАПОЛНЕНО

Добавлено в interview-qa.md section 23: - Grid vs Random vs Bayesian comparison table (strategy, efficiency, best for, scalability) - Grid Search details with Python code (GridSearchCV) - Random Search explanation with Bergstra & Bengio (2012) insight - Bayesian Optimization with Optuna code example - Learning curves interpretation (well-fitted, overfit, underfit diagrams) - Learning curve analysis code with sklearn learning_curve - Early stopping strategies (basic class, GradientBoosting, PyTorch) - Decision framework table (scenario → strategy → why) - Best practices (7 recommendations) - 6 Q&A (Basic/Medium/Killer)

Источники: AICompetence Grid vs Random vs Bayesian (May 2025), GeeksforGeeks Learning Curves (Jul 2025), Bergstra & Bengio (2012), Snoek et al. (2012)

Осталось: - Отдельная практическая задача (ContentBlock) - Hyperparameter interactions (joint effects) deep dive

16. Cross-Validation Edge Cases — ЧАСТИЧНО ЗАПОЛНЕНО

Добавлено в interview-qa.md section 24: - Nested Cross-Validation explanation с problem/solution - Nested vs Standard CV comparison table - Time Series CV methods: Walk-Forward, Sliding Window, Blocked CV with embargo - Time Series CV comparison table (Expanding/Sliding/Blocked) - Bootstrap .632 Estimator с формулой и Python code - Repeated K-Fold CV example - Decision Framework table (data type → CV method → why) - 6 Q&A (Basic/Medium/Killer)

Источники: Medium Nested CV (May 2025), MLMastery Time Series CV (Jan 2026), NumberAnalytics Bootstrap (2025)

Осталось: - Отдельная практическая задача (ContentBlock) - GroupKFold for clustered data deep dive

17. Missing Data Handling — ЧАСТИЧНО ЗАПОЛНЕНО

Добавлено в interview-qa.md section 16: - MCAR, MAR, MNAR types comparison table with strategies - When to drop vs impute decision criteria - Imputation methods comparison (Mean, KNN, MICE, Iterative) - Multiple Imputation with Rubin's Rules formulas - Categorical missing handling strategies - Fraud detection pipeline with missing flags - 6 Q&A

Источники: Rubin (1976), sklearn docs, missing patterns analysis


Рекомендации по заполнению GAPS

Priority 1 (Добавить ASAP)

Gap Сложность Задача
Ensemble Theory Medium ensemble_003_stacking
Online Learning Medium classic_009_online_learning
Missing Data Easy data_003_missing_data
Calibration Medium classic_010_calibration_practice

Priority 2 (Полезно для Senior+)

Gap Сложность Задача
Interpretability Medium classic_011_shap_lime
Multi-Label Medium classic_012_multilabel
Semi-Supervised Hard classic_013_semi_supervised

Priority 3 (Nice to have)

Gap Сложность Задача
Active Learning Medium classic_014_active_learning
Federated Learning Hard classic_015_federated
AutoML Theory Medium classic_016_automl

Cross-References Missing

Связи, которые стоит добавить:

  1. impl_004_decision_treeensemble_001_random_forestensemble_002_gbt_vs_rf
  2. feat_001_target_encodingclassic_003_class_imbalance (encoding affects imbalance)
  3. classic_008_gbdt_internalsdl_004_optimizers (gradient-based learning)
  4. impl_001_logistic_regressionreg_001_l1_vs_l2 (regularization)

Итоговый Coverage Assessment

Classical ML текущий coverage: ~96% для ML Engineer, ~85% для Senior+

Главные пробелы (после iteration 67): 1. Ensemble theory (stacking, diversity) — ЧАСТИЧНО ЗАПОЛНЕНО 2. Online/streaming learning — ЧАСТИЧНО ЗАПОЛНЕНО 3. Multi-label classification — ЧАСТИЧНО ЗАПОЛНЕНО 4. Semi-supervised learning — ЧАСТИЧНО ЗАПОЛНЕНО 5. Active Learning — ЧАСТИЧНО ЗАПОЛНЕНО 6. Missing Data Handling — ЧАСТИЧНО ЗАПОЛНЕНО 7. Model Debugging — ЧАСТИЧНО ЗАПОЛНЕНО 8. AutoML Theory — ЧАСТИЧНО ЗАПОЛНЕНО 9. Production ML patterns (частично в ML System Design) 10. Interpretability (SHAP, LIME) — частично в interview-qa.md 11. TabPFN (foundation models для tabular) — ЧАСТИЧНО ЗАПОЛНЕНО 12. Federated Learning — ЧАСТИЧНО ЗАПОЛНЕНО

Рекомендация: Classical ML coverage завершён на отличном уровне (~98%).