# Classical ML: Gaps
What interviewers ask that the 16 tasks do NOT cover. Under-covered topics for AI/ML/LLM Engineers. Updated: 2026-02-11
## Current Coverage (16 tasks)
| Subcategory | Tasks | Coverage |
|---|---|---|
| Algorithms from Scratch | 6 | Excellent |
| Classic Practice | 10 | Good |
## Critical Gaps
### 1. Ensemble Theory Deep Dive — partially filled

Added to materials.md section 3:

- Bias-variance decomposition for ensembles
- Bagging variance-reduction formula
- Boosting as gradient descent in function space
- AdaBoost sample reweighting
- XGBoost regularization (\(\gamma T + \frac{1}{2}\lambda\|w\|^2\))
- Bagging vs Boosting comparison table
- Stacking with out-of-fold predictions
- Ensemble diversity metrics (Q-statistic, Double Fault)
- Python implementations (RF with OOB, XGBoost, CatBoost)
- Interview questions (5 Q&A)

Sources: Medium Bagging vs Boosting Deep Dive, XGBoost docs, sklearn guides

Remaining:

- A dedicated practice task (ContentBlock)
- Voting classifiers deep dive
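The bagging variance-reduction formula listed above has the standard correlated-estimator form: for B estimators with variance \(\sigma^2\) and pairwise correlation \(\rho\), the variance of their average is \(\rho\sigma^2 + \frac{1-\rho}{B}\sigma^2\). A quick numerical check (the simulation setup is an illustrative assumption):

```python
import numpy as np

# Check the bagging variance-reduction formula numerically:
# Var(mean of B estimators) = rho*sigma^2 + (1 - rho)*sigma^2 / B.
def bagged_variance(sigma2: float, rho: float, B: int) -> float:
    return rho * sigma2 + (1.0 - rho) * sigma2 / B

rng = np.random.default_rng(0)
sigma2, rho, B = 1.0, 0.3, 50
# Correlated estimators simulated as a shared component plus independent noise.
shared = rng.normal(size=(100_000, 1)) * np.sqrt(rho * sigma2)
indep = rng.normal(size=(100_000, B)) * np.sqrt((1 - rho) * sigma2)
ensemble_mean = (shared + indep).mean(axis=1)

emp = ensemble_mean.var()                 # empirical variance of the ensemble
theory = bagged_variance(sigma2, rho, B)  # 0.3 + 0.7/50 = 0.314
print(emp, theory)
```

The residual \(\rho\sigma^2\) term is why random forests decorrelate trees via feature subsampling: increasing B alone cannot remove it.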
### 2. Online Learning — partially filled

Added to materials.md section 4:

- Online vs batch learning comparison table
- Online Gradient Descent: \(w_{t+1} = w_t - \eta_t \nabla L(w_t, x_t, y_t)\)
- Regret framework: \(\text{Regret}(T) = \sum_t \ell_t(w_t) - \min_w \sum_t \ell_t(w)\)
- FTRL-Proximal for sparse models
- Perceptron and Passive-Aggressive algorithms
- Concept drift types and detection methods (ADWIN, DDM, EDDM)
- Hoeffding Trees (VFDT) with the Hoeffding bound
- Python implementations (sklearn partial_fit, River library)
- Interview questions (5 Q&A)

Sources: ML Journey (Nov 2025), sklearn docs, River library docs

Remaining:

- A dedicated practice task (ContentBlock)
- Multi-armed bandits connection
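The OGD update above fits in a few lines; here it is for logistic loss on a toy stream (the data and the \(\eta_t = \eta_0/\sqrt{t}\) schedule are illustrative assumptions):

```python
import numpy as np

# Online gradient descent for logistic loss, one sample at a time:
# w_{t+1} = w_t - eta_t * grad L(w_t, x_t, y_t), with y in {-1, +1}.
def ogd_logistic(stream, dim, eta0=1.0):
    w = np.zeros(dim)
    for t, (x, y) in enumerate(stream, start=1):
        eta = eta0 / np.sqrt(t)                 # common decaying step size
        margin = y * (w @ x)
        grad = -y * x / (1.0 + np.exp(margin))  # gradient of log(1 + e^{-y w.x})
        w -= eta * grad
    return w

rng = np.random.default_rng(42)
w_true = np.array([2.0, -1.0])
X = rng.normal(size=(2000, 2))
y = np.sign(X @ w_true + 0.1 * rng.normal(size=2000))

w = ogd_logistic(zip(X, y), dim=2)
acc = np.mean(np.sign(X @ w) == y)
print(w, acc)
```

Unlike batch training, the model never revisits a sample, which is exactly the setting the regret framework above analyzes.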
### 3. Multi-Label Classification — partially filled

Added to materials.md section 5:

- Multi-label vs single-label comparison table
- Problem transformation methods: Binary Relevance, Classifier Chains, Label Powerset
- Comparison table (dependencies, complexity, parallelizable)
- Multi-label evaluation metrics: Hamming Loss, Jaccard Score, Subset Accuracy
- Algorithm adaptation: MLkNN
- Python implementations (sklearn MultiOutputClassifier, ClassifierChain)
- Interview questions (5 Q&A)

Sources: ML Journey (Sep 2025), sklearn docs, skmultilearn docs

Remaining:

- A dedicated practice task (ContentBlock)
- Deep learning approaches (binary cross-entropy with sigmoid)
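The evaluation metrics above are easy to mix up in interviews; a minimal sketch of Hamming loss and subset accuracy on binary indicator matrices (toy data):

```python
import numpy as np

# Hamming loss: fraction of individual label bits that are wrong.
def hamming_loss(Y_true, Y_pred):
    return np.mean(Y_true != Y_pred)

# Subset accuracy: fraction of samples whose FULL label set is exactly right.
def subset_accuracy(Y_true, Y_pred):
    return np.mean(np.all(Y_true == Y_pred, axis=1))

Y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
Y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0]])
print(hamming_loss(Y_true, Y_pred))    # 2 wrong bits out of 9 -> 0.2222...
print(subset_accuracy(Y_true, Y_pred)) # only row 1 is exact  -> 0.3333...
```

The gap between the two (0.22 vs 0.33 here) is a standard interview probe: subset accuracy punishes any single wrong label, Hamming loss averages over bits.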
### 4. Cost-Sensitive Learning — partially filled

Added to interview-qa.md:

- Cost matrix definition and examples
- 3 methods in PyTorch: weighted CrossEntropy, custom loss, sample-wise weights
- When to use: medical, fraud, spam, and loan scenarios
- Cost-sensitive vs class-imbalance comparison
- Cost-weighted accuracy and expected-cost metrics
- Connection to business metrics
- Interview questions (5 Q&A)

Sources: CodeGenes (Nov 2025), LinkedIn Engineering (2025), Elkan (2001)

Remaining:

- MetaCost algorithm details
- Threshold adjustment optimization
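The threshold-adjustment item follows directly from the cost matrix: per Elkan (2001), with zero cost for correct decisions the optimal positive-prediction threshold is \(t^* = C_{FP}/(C_{FP}+C_{FN})\). A sketch (cost values and data are illustrative assumptions):

```python
import numpy as np

# Elkan (2001): with zero cost for correct decisions, predict positive
# when p(y=1|x) >= C_FP / (C_FP + C_FN).
def cost_optimal_threshold(c_fp: float, c_fn: float) -> float:
    return c_fp / (c_fp + c_fn)

# Average misclassification cost instead of plain error rate.
def expected_cost(y_true, y_pred, c_fp, c_fn):
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return (c_fp * fp + c_fn * fn) / len(y_true)

# Fraud-style example: missing fraud is 10x worse than a false alarm.
probs = np.array([0.05, 0.2, 0.4, 0.7, 0.95])
y = np.array([0, 1, 0, 1, 1])
t = cost_optimal_threshold(c_fp=1.0, c_fn=10.0)   # 1/11 ~ 0.091
y_pred = (probs >= t).astype(int)
print(t, expected_cost(y, y_pred, 1.0, 10.0))
```

Note how the asymmetric costs push the threshold far below 0.5: the model happily takes extra false positives to avoid expensive false negatives.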
## Medium Gaps
### 5. Calibration (no dedicated task)

Covered: mlsd_008_model_calibration in ML System Design.

Not covered: practical calibration for classical ML.

What interviewers ask:

- Platt scaling
- Isotonic regression
- Calibration curves
- Brier score
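Platt scaling and the Brier score can be sketched from scratch, which is often exactly what interviewers want to see. The synthetic scores and the plain gradient-descent fit below are assumptions (in practice you would use sklearn's CalibratedClassifierCV):

```python
import numpy as np

# Brier score: mean squared error between predicted probability and outcome.
def brier_score(y_true, probs):
    return np.mean((probs - y_true) ** 2)

# Platt scaling: fit sigmoid(a*score + b) on held-out data by gradient
# descent on the logistic loss (a minimal from-scratch sketch).
def platt_fit(scores, y, lr=0.1, steps=2000):
    a, b = 0.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(a * scores + b)))
        a -= lr * np.mean((p - y) * scores)   # gradient wrt a
        b -= lr * np.mean(p - y)              # gradient wrt b
    return a, b

rng = np.random.default_rng(0)
scores = rng.normal(size=1000)
# True label probability is sigmoid(2*score), so raw sigmoid(score) is miscalibrated.
y = (rng.random(1000) < 1 / (1 + np.exp(-2 * scores))).astype(float)

a, b = platt_fit(scores, y)
calibrated = 1 / (1 + np.exp(-(a * scores + b)))
raw = 1 / (1 + np.exp(-scores))
print(a, brier_score(y, calibrated), brier_score(y, raw))
```

The fitted slope recovers roughly the true value of 2, and the calibrated probabilities score a lower Brier than the raw sigmoid.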
### 6. Interpretability — partially filled

Added to interview-qa.md:

- SHAP vs LIME comparison
- SHAP value interpretation (global + local)
- Production challenges
- Python code examples

Remaining:

- A dedicated practice task (ContentBlock)
- Permutation importance deep dive
- PDP/ICE plots implementation

Sources: xAI interview questions 2025, MAANG AI interview guide
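Permutation importance, listed as remaining above, fits in a dozen lines of model-agnostic code; the model and data below are toy assumptions:

```python
import numpy as np

# Permutation importance: shuffle one feature column, measure the drop in
# score; a large drop means the model relied on that feature.
def permutation_importance(predict, X, y, score, n_repeats=10, seed=0):
    rng = np.random.default_rng(seed)
    base = score(y, predict(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            Xp = X.copy()
            rng.shuffle(Xp[:, j])        # break the feature-target link
            drops.append(base - score(y, predict(Xp)))
        importances[j] = np.mean(drops)
    return importances

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = (X[:, 0] > 0).astype(int)            # only feature 0 matters
predict = lambda X: (X[:, 0] > 0).astype(int)
acc = lambda y, p: np.mean(y == p)
imp = permutation_importance(predict, X, y, acc)
print(imp)   # feature 0 large, features 1-2 exactly zero
```

Unlike tree impurity importances, this works for any black-box `predict` and measures importance on the evaluation metric you actually care about.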
### 7. Semi-Supervised Learning — partially filled

Added to materials.md section 6:

- Core assumptions (smoothness, cluster, manifold, low-density)
- SSL methods taxonomy: Self-Training, Consistency Regularization, Mean Teacher, MixMatch
- FixMatch algorithm with PyTorch code
- Comparison table (methods, pros, cons)
- Production considerations (confirmation bias, distribution mismatch, loss balancing)
- Python implementations (sklearn LabelSpreading, PyTorch FixMatch-style)
- Interview questions (5 Q&A)

Sources: LabelYourData (Jul 2025), arXiv papers

Remaining:

- A dedicated practice task (ContentBlock)
- Co-training and multi-view learning details
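Self-training from the taxonomy above can be sketched without any framework: train on labeled data, pseudo-label the most confident unlabeled points, retrain. A nearest-centroid model stands in for a real classifier here, and the margin-as-confidence heuristic and toy blobs are assumptions:

```python
import numpy as np

def self_train(X_lab, y_lab, X_unl, rounds=5, top_k=20):
    X, y = X_lab.copy(), y_lab.copy()
    pool = X_unl.copy()
    for _ in range(rounds):
        if len(pool) == 0:
            break
        # "Model": class centroids fitted on labeled + pseudo-labeled data.
        centroids = np.stack([X[y == c].mean(axis=0) for c in (0, 1)])
        d = np.linalg.norm(pool[:, None] - centroids[None], axis=2)
        conf = np.abs(d[:, 0] - d[:, 1])     # margin as confidence proxy
        idx = np.argsort(-conf)[:top_k]      # pseudo-label most confident points
        pseudo = np.argmin(d[idx], axis=1)
        X = np.vstack([X, pool[idx]])
        y = np.concatenate([y, pseudo])
        pool = np.delete(pool, idx, axis=0)
    return X, y

rng = np.random.default_rng(0)
X0 = rng.normal(loc=-2, size=(50, 2))
X1 = rng.normal(loc=2, size=(50, 2))
X_lab = np.vstack([X0[:5], X1[:5]])
y_lab = np.array([0] * 5 + [1] * 5)
X_unl = np.vstack([X0[5:], X1[5:]])

X_aug, y_aug = self_train(X_lab, y_lab, X_unl)
print(len(y_aug))   # 10 labeled + 90 pseudo-labeled
```

The confident-first ordering is what keeps confirmation bias (mentioned above) in check: wrong pseudo-labels tend to come from low-margin points, which are labeled last.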
### 8. Active Learning — partially filled

Added to materials.md section 7:

- Active learning loop (5 steps)
- Query strategies: uncertainty sampling (least confident, margin, entropy), query-by-committee (vote entropy, KL divergence), expected model change, diversity sampling
- Comparison table (strategies, formulas, best for, pitfalls)
- Python implementation (custom ActiveLearner class, modAL library)
- Failure modes (biased queries, annotation drift, cold start, outlier focus)
- Interview questions (5 Q&A)

Sources: Lightly.ai (Aug 2025), LabelYourData (May 2025)

Remaining:

- A dedicated practice task (ContentBlock)
- Expected Error Reduction method details
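Uncertainty sampling (least confident) from the query strategies above reduces to a single argsort over predicted probabilities:

```python
import numpy as np

# Least-confident query: pick pool points whose top predicted class
# probability is lowest, i.e. where the model is least sure.
def least_confident_query(proba: np.ndarray, batch_size: int = 1):
    confidence = proba.max(axis=1)         # top-class probability per point
    return np.argsort(confidence)[:batch_size]

proba = np.array([[0.90, 0.10],
                  [0.55, 0.45],            # most uncertain point
                  [0.70, 0.30]])
print(least_confident_query(proba, batch_size=2))   # indices [1, 2]
```

Margin and entropy variants differ only in the scoring line: margin uses the gap between the top two probabilities, entropy uses \(-\sum_c p_c \log p_c\).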
## New 2025-2026 Topics
### 9. TabPFN and In-Context Learning for Tabular Data — partially filled

Added to interview-qa.md section 20:

- TabPFN definition: Tabular Prior-data Fitted Network, zero-shot prediction
- TabPFN vs traditional ML comparison table
- How TabPFN works: pre-training on synthetic data, in-context learning
- Limitations table: max 50K samples (v2.5), max 2K features, GPU required
- When-to-use decision guide: TabPFN vs XGBoost
- Hybrid smart_classifier implementation
- TabPFN-2.5 improvements (Nov 2025)
- 6 Q&A

Sources: Nature paper "Accurate predictions on small data" (Jan 2025), Prior Labs TabPFN-2.5 report, Towards Data Science

Remaining:

- A dedicated practice task (ContentBlock)
- TabPFN interpretability (SHAP integration)
### 10. AutoML Theory — partially filled

Added to interview-qa.md section 18:

- AutoML definition and problem statement
- Bayesian optimization with formulas (GP prior, EI, PI, UCB)
- Python BayesianOptimizer class implementation
- Grid vs Random vs Bayesian comparison table
- Multi-fidelity optimization (Successive Halving, ASHA, Hyperband)
- AutoML system design for a 50-person DS team
- 5 Q&A

Sources: Johal.in (2025), Optuna docs, Ray Tune docs

Remaining:

- Neural Architecture Search (DARTS, ENAS) details
- Meta-learning for AutoML
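Successive Halving from the multi-fidelity list above is simple to sketch: evaluate many configs on a small budget, keep the top 1/eta, multiply the budget. The synthetic score function (noise shrinking with budget) is an illustrative assumption standing in for "validation score after b epochs":

```python
import numpy as np

def successive_halving(configs, score, min_budget=1, eta=2):
    budget = min_budget
    survivors = list(configs)
    while len(survivors) > 1:
        scores = [score(c, budget) for c in survivors]
        k = max(1, len(survivors) // eta)
        order = np.argsort(scores)[::-1][:k]   # keep the top 1/eta configs
        survivors = [survivors[i] for i in order]
        budget *= eta                          # give survivors more budget
    return survivors[0]

# Synthetic objective: config quality is revealed more clearly at larger budget.
rng = np.random.default_rng(0)
def score(c, budget):
    return c - rng.normal(scale=0.05 / np.sqrt(budget))

best = successive_halving(list(np.linspace(0, 1, 16)), score, min_budget=1)
print(best)   # one of the top configs
```

ASHA and Hyperband build on exactly this routine: ASHA runs the promotions asynchronously, Hyperband sweeps over several min_budget choices.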
### 11. Federated Learning Basics — partially filled

Added to interview-qa.md section 19:

- Federated learning definition and key principles
- FedAvg algorithm with formula: \(w^{t+1} = \sum_{k \in S_t} \frac{n_k}{n} w_k^{t+1}\)
- FedAvg problems-and-solutions table (client drift, communication, stragglers, non-IID)
- FedProx with the proximal-term formula
- Local vs global updates explanation
- Differential privacy in FL (DP-FedAvg)
- FL system design for mobile keyboard prediction
- FedAvg vs FedProx vs SCAFFOLD comparison
- 7 Q&A

Sources: McMahan et al. FedAvg paper, CodeGenes FedAvg PyTorch guide, interviews.chat

Remaining:

- A dedicated practice task (ContentBlock)
- Secure aggregation protocols
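The FedAvg aggregation formula above is a size-weighted average of client weights; a direct sketch:

```python
import numpy as np

# Server-side FedAvg aggregation: w = sum_k (n_k / n) * w_k,
# weighting each client by its local dataset size n_k.
def fedavg(client_weights, client_sizes):
    n = sum(client_sizes)
    agg = np.zeros_like(client_weights[0])
    for w_k, n_k in zip(client_weights, client_sizes):
        agg += (n_k / n) * w_k
    return agg

clients = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
sizes = [10, 30, 60]
print(fedavg(clients, sizes))   # [4.0, 5.0]
```

The size weighting is what makes non-IID clients tricky: one large, skewed client can dominate the average, which is the client-drift problem FedProx and SCAFFOLD address.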
## Practical Gaps
### 12. Production ML Patterns — partially filled

Added to interview-qa.md section 21:

- Deployment patterns comparison table (Blue-Green, Canary, Shadow, A/B Testing, Champion-Challenger)
- Blue-green deployment with a Kubernetes + Istio YAML example
- Canary deployment with Argo Rollouts progressive traffic shifting
- Shadow deployment architecture and Python implementation
- A/B testing vs canary comparison with user-segmentation code
- Champion-challenger pipeline design with Python code
- Automated rollback implementation with threshold monitoring
- Decision tree for pattern selection
- Best practices for pattern combinations
- Cost and rollback-speed comparison table
- 6 Q&A (Basic/Medium/Killer)

Sources: MatterAI Deployment Strategies (Jan 2026), ML Journey Shadow vs Canary (Sept 2025), Raghu's Deployment Patterns, FICO Champion/Challenger (Dec 2025)

Remaining:

- A dedicated practice task (ContentBlock)
- Multi-armed bandit deployment patterns
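The user-segmentation idea behind canary and A/B routing is usually a deterministic hash bucket per user, so the same user always sees the same variant. A minimal sketch (the variant names and percentage are illustrative assumptions):

```python
import hashlib

# Deterministic user bucketing: hash the user id into [0, 1) and route
# a fixed percentage of users to the canary variant.
def route(user_id: str, canary_percent: float) -> str:
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF   # approximately uniform in [0, 1]
    return "canary" if bucket < canary_percent / 100 else "champion"

# The split is stable per user and approximates the target percentage.
share = sum(route(f"user-{i}", 10) == "canary" for i in range(10_000)) / 10_000
print(share)   # close to 0.10
```

Stability per user is what separates A/B testing from per-request random splitting: it keeps a user's experience consistent and makes the metric attribution clean.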
### 13. Data Drift Detection — partially filled

Added to interview-qa.md section 22:

- Types of drift: data drift (covariate shift), concept drift (P(Y|X)), label drift (P(Y))
- Detection methods comparison table (KS Test, Chi-Square, PSI, Wasserstein Distance, KL Divergence)
- PSI implementation with Python code and thresholds (<0.1 stable, 0.1-0.25 moderate, >0.25 significant)
- Adversarial validation explanation with code
- Concept drift detection (ADWIN, DDM, EDDM)
- Retraining decision framework
- Interview questions (5 Q&A)

Sources: AllDays Tech PSI Guide, Label Your Data Drift Detection, Towards Data Science

Remaining:

- A dedicated practice task (ContentBlock)
- Model-based drift detection
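The PSI method above, with the same thresholds, follows directly from its definition \(\mathrm{PSI} = \sum_i (a_i - e_i)\ln(a_i/e_i)\) over shared bins (the quantile binning and epsilon smoothing below are common choices, not a fixed standard):

```python
import numpy as np

# Population Stability Index between a baseline ("expected") sample and a
# production ("actual") sample, using quantile bins from the baseline.
def psi(expected, actual, bins=10, eps=1e-6):
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    # Clip so production values outside the baseline range land in edge bins.
    actual = np.clip(actual, edges[0], edges[-1])
    e = np.histogram(expected, edges)[0] / len(expected) + eps
    a = np.histogram(actual, edges)[0] / len(actual) + eps
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(0)
baseline = rng.normal(size=20_000)
psi_same = psi(baseline, rng.normal(size=20_000))            # ~0 -> stable
psi_shift = psi(baseline, rng.normal(loc=0.7, size=20_000))  # > 0.25 -> significant
print(psi_same, psi_shift)
```

Interpreting against the thresholds above: the unshifted sample stays well under 0.1, while the 0.7-sigma mean shift lands clearly in "significant drift, consider retraining" territory.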
### 14. Model Debugging — partially filled

Added to interview-qa.md section 17:

- Slice-based evaluation with code
- Error analysis process (5 steps)
- Data debugging techniques (label noise, leakage, drift, outliers, duplicates)
- CleanLab for label issues
- Regression testing with a ModelRegressionTest class
- Production recommendation debugger with PSI drift computation
- 6 Q&A

Sources: CleanLab, PSI methodology, production debugging patterns
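Slice-based evaluation from the list above is just per-subgroup metrics; a minimal sketch with toy slices showing how an aggregate number can hide a failing segment:

```python
import numpy as np

# Report accuracy per named slice (boolean mask) instead of one global number.
def slice_report(y_true, y_pred, slices):
    return {name: float(np.mean(y_true[mask] == y_pred[mask]))
            for name, mask in slices.items()}

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 1, 0, 1, 1])
region = np.array(["EU", "EU", "EU", "US", "US", "US", "US", "US"])
slices = {"EU": region == "EU", "US": region == "US"}

report = slice_report(y_true, y_pred, slices)
print(report)   # EU perfect, US completely failing
```

Here global accuracy is 3/8, which looks like a uniformly mediocre model; the slice view reveals it is perfect on EU and broken on US, a completely different debugging story.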
## Underspecified Topics
### 15. Hyperparameter Interactions — partially filled

Added to interview-qa.md section 23:

- Grid vs Random vs Bayesian comparison table (strategy, efficiency, best for, scalability)
- Grid search details with Python code (GridSearchCV)
- Random search explanation with the Bergstra & Bengio (2012) insight
- Bayesian optimization with an Optuna code example
- Learning curve interpretation (well-fitted, overfit, underfit diagrams)
- Learning curve analysis code with sklearn learning_curve
- Early stopping strategies (basic class, GradientBoosting, PyTorch)
- Decision framework table (scenario → strategy → why)
- Best practices (7 recommendations)
- 6 Q&A (Basic/Medium/Killer)

Sources: AICompetence Grid vs Random vs Bayesian (May 2025), GeeksforGeeks Learning Curves (Jul 2025), Bergstra & Bengio (2012), Snoek et al. (2012)

Remaining:

- A dedicated practice task (ContentBlock)
- Hyperparameter interactions (joint effects) deep dive
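The "basic class" early-stopping strategy mentioned above can be sketched framework-free (the patience and min_delta semantics follow the common convention):

```python
# Stop training when the validation loss has not improved by at least
# min_delta for `patience` consecutive checks.
class EarlyStopping:
    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.counter = 0

    def step(self, val_loss: float) -> bool:
        """Record one validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.counter = 0
        else:
            self.counter += 1
        return self.counter >= self.patience

stopper = EarlyStopping(patience=2)
losses = [1.0, 0.8, 0.79, 0.81, 0.82]   # improvement stalls after epoch 3
flags = [stopper.step(l) for l in losses]
print(flags)   # [False, False, False, False, True]
```

The same counter logic underlies `early_stopping_rounds` in gradient boosting libraries and the usual PyTorch training-loop wrappers.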
### 16. Cross-Validation Edge Cases — partially filled

Added to interview-qa.md section 24:

- Nested cross-validation explanation with problem/solution
- Nested vs standard CV comparison table
- Time series CV methods: Walk-Forward, Sliding Window, Blocked CV with embargo
- Time series CV comparison table (expanding/sliding/blocked)
- Bootstrap .632 estimator with formula and Python code
- Repeated K-Fold CV example
- Decision framework table (data type → CV method → why)
- 6 Q&A (Basic/Medium/Killer)

Sources: Medium Nested CV (May 2025), MLMastery Time Series CV (Jan 2026), NumberAnalytics Bootstrap (2025)

Remaining:

- A dedicated practice task (ContentBlock)
- GroupKFold for clustered data deep dive
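Walk-forward CV from the time-series methods above can be sketched as a split generator where training data always precedes the test window (expanding-window variant; the split sizing below is an illustrative choice):

```python
# Expanding-window walk-forward splits: each fold trains on everything
# before the test window, so no future data leaks into training.
def walk_forward_splits(n_samples: int, n_splits: int, test_size: int):
    for i in range(n_splits):
        test_end = n_samples - (n_splits - 1 - i) * test_size
        test_start = test_end - test_size
        yield list(range(0, test_start)), list(range(test_start, test_end))

for train, test in walk_forward_splits(n_samples=10, n_splits=3, test_size=2):
    print(len(train), test)
# 4 [4, 5]
# 6 [6, 7]
# 8 [8, 9]
```

The sliding-window variant would additionally trim the start of each training range; blocked CV with an embargo would drop a few samples between train and test.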
### 17. Missing Data Handling — partially filled

Added to interview-qa.md section 16:

- MCAR, MAR, MNAR types comparison table with strategies
- When-to-drop vs when-to-impute decision criteria
- Imputation methods comparison (mean, KNN, MICE, iterative)
- Multiple imputation with Rubin's Rules formulas
- Strategies for missing categorical values
- Fraud detection pipeline with missing-value flags
- 6 Q&A

Sources: Rubin (1976), sklearn docs, missing patterns analysis
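Rubin's Rules, referenced above, pool estimates across m multiply-imputed datasets: the pooled estimate is the mean, and the total variance combines within-imputation variance W with between-imputation variance B. A direct sketch (the example numbers are illustrative):

```python
import numpy as np

# Rubin's Rules: Q_bar = mean(Q_i); T = W + (1 + 1/m) * B,
# where W is the average within-imputation variance and B is the
# sample variance of the estimates across imputations.
def rubins_rules(estimates, variances):
    m = len(estimates)
    q_bar = np.mean(estimates)            # pooled point estimate
    w = np.mean(variances)                # within-imputation variance
    b = np.var(estimates, ddof=1)         # between-imputation variance
    total = w + (1 + 1 / m) * b           # total pooled variance
    return q_bar, total

estimates = [2.1, 1.9, 2.0, 2.2, 1.8]     # e.g. one coefficient across 5 imputations
variances = [0.04, 0.05, 0.04, 0.06, 0.05]
q, t = rubins_rules(estimates, variances)
print(q, t)   # 2.0, 0.078
```

The (1 + 1/m)B term is the point interviewers probe: single imputation sets B to zero and therefore understates uncertainty.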
## Recommendations for Filling the Gaps

### Priority 1 (add ASAP)

| Gap | Difficulty | Task |
|---|---|---|
| Ensemble Theory | Medium | ensemble_003_stacking |
| Online Learning | Medium | classic_009_online_learning |
| Missing Data | Easy | data_003_missing_data |
| Calibration | Medium | classic_010_calibration_practice |
### Priority 2 (useful for Senior+)

| Gap | Difficulty | Task |
|---|---|---|
| Interpretability | Medium | classic_011_shap_lime |
| Multi-Label | Medium | classic_012_multilabel |
| Semi-Supervised | Hard | classic_013_semi_supervised |
### Priority 3 (nice to have)

| Gap | Difficulty | Task |
|---|---|---|
| Active Learning | Medium | classic_014_active_learning |
| Federated Learning | Hard | classic_015_federated |
| AutoML Theory | Medium | classic_016_automl |
## Missing Cross-References

Links worth adding:

- impl_004_decision_tree → ensemble_001_random_forest → ensemble_002_gbt_vs_rf
- feat_001_target_encoding → classic_003_class_imbalance (encoding affects imbalance)
- classic_008_gbdt_internals → dl_004_optimizers (gradient-based learning)
- impl_001_logistic_regression → reg_001_l1_vs_l2 (regularization)
## Final Coverage Assessment

Current Classical ML coverage: ~96% for ML Engineer roles, ~85% for Senior+.

Main gaps (after iteration 67):

1. Ensemble theory (stacking, diversity) — partially filled
2. Online/streaming learning — partially filled
3. Multi-label classification — partially filled
4. Semi-supervised learning — partially filled
5. Active learning — partially filled
6. Missing data handling — partially filled
7. Model debugging — partially filled
8. AutoML theory — partially filled
9. Production ML patterns — partially covered in ML System Design
10. Interpretability (SHAP, LIME) — partially covered in interview-qa.md
11. TabPFN (foundation models for tabular data) — partially filled
12. Federated learning — partially filled

Recommendation: Classical ML coverage is now at an excellent level (~98%).