ML System Design: Gaps
What interviews ask about that is NOT covered by the 8 tasks: under-covered topics for AI/ML/LLM Engineers.

Updated: 2026-02-11
Current coverage (8 tasks)
| Subcategory | Tasks | Coverage |
|---|---|---|
| Model Serving | 1 | Good |
| A/B Testing | 1 | Good |
| Drift Detection | 1 | Good |
| Model Calibration | 1 | Good |
| Ranking Metrics | 1 | Good |
| Trade-offs Quiz | 1 | Excellent (15 scenarios) |
| RecSys | 1 | Basic |
| LLM Production | 1 | Good |
CRITICAL GAPS
1. Feature Stores: PARTIALLY FILLED

Added in materials.md section 9:

- Feature Store definition and why it matters
- Architecture components (Offline vs Online Store, Ingestion, Registry)
- Point-in-Time Correctness concept with SQL example
- Feature Store comparison table (Feast vs Tecton vs SageMaker)
- Python Feast example (Entity, FeatureView, get_online_features, get_historical_features)
- Key concepts table (Feature View, Entity, TTL, Materialization)
- Interview questions (4 Q&A)

Sources: Aerospike Blog (July 2025), Reintech.io Feature Store Comparison (Jan 2026)

Remaining:

- A dedicated practical task (ContentBlock)
- Hopsworks comparison
- Streaming feature pipeline details
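
Point-in-time correctness, listed above, means a training row may only see feature values that were known at the time of its label event. The following is a minimal sketch of that lookup in plain Python, not the SQL example from materials.md; the feature name `avg_purchase_7d`, the integer timestamps, and the helper `point_in_time_lookup` are hypothetical.

```python
from bisect import bisect_right


def point_in_time_lookup(feature_rows, event_ts):
    """Return the latest feature value with timestamp <= event_ts, or None.

    feature_rows is a list of (timestamp, value) pairs sorted by timestamp.
    """
    timestamps = [ts for ts, _ in feature_rows]
    idx = bisect_right(timestamps, event_ts)
    return feature_rows[idx - 1][1] if idx > 0 else None


# Hypothetical feature history for a single entity, sorted by timestamp.
avg_purchase_7d = [(10, 20.0), (20, 35.0), (30, 50.0)]

# Each label event is joined only with values that already existed at that time,
# so no future information leaks into the training set.
training_rows = [
    (ts, point_in_time_lookup(avg_purchase_7d, ts)) for ts in (5, 10, 25, 40)
]
```

The event at timestamp 5 gets no feature value because nothing had been written yet; a naive "latest value" join would have leaked the value from timestamp 30 into every row.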
2. ML Infrastructure: PARTIALLY FILLED

Added in materials.md section 10:

- Why ML Infrastructure matters (reproducibility, audit trail, compliance)
- Core components table (Experiment Tracking, Model Registry, Data Versioning, Pipeline Orchestration)
- Tool comparison (MLflow vs W&B vs DVC)
- MLflow architecture diagram (Tracking, Projects, Models, Registry)
- Python examples: experiment tracking with mlflow.start_run(), model registration, registry operations (transition_model_version_stage)
- Model Registry workflow (Experiment → Staging → Production → Archived)
- Decision framework table
- Interview questions (4 Q&A)

Sources: ML Journey "Model Versioning Strategies" (Sep 2025), Conduktor "Real-Time ML Pipelines" (Feb 2026)

Remaining:

- A dedicated practical task (ContentBlock)
- Kubeflow, ZenML deep dive
- CI/CD specifics for ML
3. Multi-Stage Recommender: PARTIALLY FILLED

Added in materials.md section 13:

- Funnel Architecture diagram (Retrieval → Pre-ranking → Ranking → Re-ranking)
- Two-Tower architecture with training code (in-batch negatives)
- ANN Indexes (FAISS, ScaNN) comparison table and implementation
- Ranking models (DIN, DCN, DeepFM, DCNv2) comparison
- Re-ranking with MMR (Maximal Marginal Relevance) code
- Production architecture (YouTube-scale) diagram
- Interview questions (6 Q&A)

Sources: Shaped.ai (May 2025), Fan Luo Blog (Oct 2025), arXiv Allegro (Jul 2025), YouTube paper

Remaining:

- A dedicated practical task (ContentBlock)
- Graph-based retrieval (PinSage, LightGCN)
- Real-time feature pipelines
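
For the MMR re-ranking step mentioned above, a minimal greedy sketch may help: each pick maximizes `lam * relevance - (1 - lam) * max_similarity_to_already_selected`. This is not the code from materials.md; the item ids, scores, and the default `lam=0.7` are assumptions for illustration.

```python
def mmr_rerank(relevance, similarity, k, lam=0.7):
    """Greedy Maximal Marginal Relevance over a small candidate set.

    relevance: dict item -> relevance score from the ranking stage.
    similarity: dict item -> dict item -> pairwise similarity in [0, 1].
    """
    selected = []
    candidates = set(relevance)
    while candidates and len(selected) < k:

        def mmr_score(item):
            # Penalize items that are close to anything already selected.
            max_sim = max((similarity[item][s] for s in selected), default=0.0)
            return lam * relevance[item] - (1 - lam) * max_sim

        # sorted() makes tie-breaking deterministic.
        best = max(sorted(candidates), key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Hypothetical scores: "a" and "b" are near-duplicates, "c" is different.
relevance = {"a": 0.9, "b": 0.85, "c": 0.6}
similarity = {
    "a": {"b": 0.95, "c": 0.1},
    "b": {"a": 0.95, "c": 0.1},
    "c": {"a": 0.1, "b": 0.1},
}
diverse_top2 = mmr_rerank(relevance, similarity, k=2, lam=0.5)
relevance_only_top2 = mmr_rerank(relevance, similarity, k=2, lam=1.0)
```

With `lam=0.5` the near-duplicate "b" is pushed out of the top two in favour of "c"; with `lam=1.0` the procedure degenerates to sorting by relevance.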
MEDIUM GAPS
4. Online Learning: PARTIALLY FILLED

Added in materials.md section 12:

- Online vs Batch Learning comparison table
- Online Gradient Descent with Python code
- Regret framework with formula
- FTRL-Proximal for sparse high-dimensional features with Python implementation
- Concept Drift Detection methods (ADWIN, DDM, EDDM, Page-Hinkley)
- River library drift detection code example
- Hoeffding Trees (VFDT) with Hoeffding bound formula
- Flink ML Pipeline architecture and Java example
- Production considerations (challenges, best practices)
- Interview questions (5 Q&A)

Sources: ML Journey (Nov 2025), Conduktor (Feb 2026), Confluent Flink (Oct 2025), River docs

Remaining:

- A dedicated practical task (ContentBlock)
- Spark Streaming ML specifics
- Deep online learning methods
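
The Hoeffding bound used by Hoeffding Trees (VFDT), listed above, is commonly written as ε = sqrt(R² · ln(1/δ) / (2n)), where R is the range of the split metric, δ the allowed failure probability, and n the number of observations at the leaf. A small sketch of how it gates a split decision follows; the function names and the example gains are hypothetical.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """epsilon such that the true mean is within epsilon of the sample mean
    with probability at least 1 - delta after n observations."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_gain, value_range, delta, n):
    """Split once the gap between the two best attributes exceeds the bound."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


# Hypothetical information gains for the two best split attributes at a leaf.
split_after_100 = should_split(0.30, 0.15, value_range=1.0, delta=1e-7, n=100)
split_after_1000 = should_split(0.30, 0.15, value_range=1.0, delta=1e-7, n=1000)
```

The same gap of 0.15 is not enough evidence after 100 examples but is after 1000, which is the mechanism that lets the tree grow incrementally from a stream.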
5. Model Compression: PARTIALLY FILLED

Added in materials.md section 21:

- Knowledge Distillation (temperature scaling, soft targets, distillation loss formula)
- Distillation Loss with Python code (hard loss + soft loss, KL divergence)
- Types of Distillation comparison (Response, Feature, Attention, Multi-Teacher)
- Neural Network Pruning (Magnitude, Structured, Global, Iterative)
- Lottery Ticket Hypothesis explanation
- Pruning implementation with torch.nn.utils.prune
- Iterative pruning pipeline with fine-tuning
- Hybrid Compression Pipeline (Pruning → Quantization → Distillation)
- Edge Deployment Optimization (TensorFlow Lite, ONNX export)
- Pruning benchmarks (ResNet-50, 90% sparsity, 4.2x speedup)
- Interview questions (4 Q&A)

Sources: LabelYourData "Knowledge Distillation" (2025), Arik Poz "PyTorch Distillation" (Apr 2025), Johal.in "Neural Network Pruning" (Nov 2025), Frontiers "Survey of Model Compression" (2025)

Remaining:

- A dedicated practical task (ContentBlock)
- Neural Architecture Search (NAS) deep dive
- Low-rank factorization
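
The distillation loss listed above combines a hard loss (cross-entropy against the true label) with a soft loss (KL divergence between temperature-softened teacher and student distributions). Below is a dependency-free sketch for a single example; the weighting convention (`alpha` on the hard loss, `T²` scaling on the soft loss) is one common choice and is an assumption here, as are the example logits.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """alpha * CE(student, label) + (1 - alpha) * T^2 * KL(teacher || student)."""
    # Hard loss: cross-entropy of the student at T=1 against the true label.
    hard = -math.log(softmax(student_logits)[label])
    # Soft loss: KL divergence between softened distributions.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student))
    return alpha * hard + (1 - alpha) * temperature ** 2 * soft


# Hypothetical logits for a 3-class problem with true class 0.
teacher = [4.0, 1.0, 0.5]
student = [2.0, 1.5, 0.2]
loss = distillation_loss(student, teacher, label=0)
```

When the student reproduces the teacher's logits exactly, the soft term vanishes and only the weighted hard loss remains.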
6. Multi-Armed Bandits: PARTIALLY FILLED

Added in materials.md section 11:

- Exploration-Exploitation Dilemma with regret formula R(T)
- Algorithm Comparison table (ε-greedy, UCB, Thompson Sampling)
- ε-Greedy implementation with Python code
- UCB formula and Python implementation
- Thompson Sampling (Bayesian) with Beta distribution Python code
- Contextual Bandits (LinUCB) formula
- When to Use Bandits vs A/B Tests comparison table
- Production Use Cases: Netflix (3-tier architecture), Spotify AI DJ
- Interview questions (4 Q&A)

Sources: Philipp Dubach "Bandits and Agents" (Jan 2026), Statsig "Thompson Sampling" (June 2025), Russo et al. tutorial

Remaining:

- A dedicated practical task (ContentBlock)
- LinUCB detailed implementation
- Non-stationary bandits
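
As a reminder of what the Thompson Sampling item above refers to, here is a minimal Bernoulli-bandit sketch with Beta(1, 1) priors using only the standard library. The conversion rates, the number of rounds, and the seed are hypothetical; this is an illustration, not the implementation in materials.md.

```python
import random


def thompson_sampling(true_rates, rounds, seed=0):
    """Simulate Beta-Bernoulli Thompson Sampling; return pulls per arm."""
    rng = random.Random(seed)
    n_arms = len(true_rates)
    successes = [0] * n_arms
    failures = [0] * n_arms
    pulls = [0] * n_arms
    for _ in range(rounds):
        # Sample a plausible conversion rate for each arm from its posterior.
        samples = [
            rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)
        ]
        arm = samples.index(max(samples))
        pulls[arm] += 1
        # Simulated environment: reward is Bernoulli(true_rates[arm]).
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return pulls


# Hypothetical arms: a weak variant (5%) and a strong one (50%).
pulls = thompson_sampling([0.05, 0.5], rounds=2000)
```

Unlike a fixed 50/50 A/B split, traffic drifts toward the better arm as evidence accumulates, which is the trade-off the "Bandits vs A/B Tests" table is about.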
7. Causal Inference: PARTIALLY FILLED

Added in materials.md section 14:

- Treatment Effect definitions (ATE, ATT, ATC, CATE) table
- Key Assumptions (SUTVA, Unconfoundedness, Overlap, Consistency)
- Method 1: Propensity Score Matching with Python code
- Method 2: Difference-in-Differences (DiD) with formula and assumptions
- Method 3: Instrumental Variables (IV) requirements
- Method 4: Uplift Modeling (S-Learner, T-Learner, X-Learner) comparison
- When to Use Which Method decision table
- Interview questions (4 Q&A)

Sources: ML Journey (July 2025), Medium GrabNGoInfo, DoWhy/EconML docs

Remaining:

- A dedicated practical task (ContentBlock)
- Regression Discontinuity Design (RDD)
- DoWhy refutation methods deep dive
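
The classic two-group, two-period Difference-in-Differences estimator referenced above is (treated_post − treated_pre) − (control_post − control_pre), which is valid only under the parallel-trends assumption. A tiny sketch with made-up outcome values:

```python
from statistics import mean


def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """2x2 Difference-in-Differences on group means.

    Assumes parallel trends: without treatment, both groups would have
    changed by the same amount between the two periods.
    """
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical outcomes (e.g. weekly orders per user).
effect = did_estimate(
    treated_pre=[10.0, 12.0],
    treated_post=[18.0, 20.0],
    control_pre=[9.0, 11.0],
    control_post=[12.0, 14.0],
)
```

Here the treated group rose by 8 and the control group by 3, so the estimated effect attributable to treatment is 5 rather than the naive before/after difference of 8.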
NEW TOPICS 2025-2026 (MISSING)
8. Foundation Models in Production: PARTIALLY FILLED

Added in materials.md section 19:

- Multi-tenant LLM serving architecture (Kubernetes, namespace isolation, ResourceQuotas)
- Prompt Caching economics (OpenAI 50% auto, Anthropic 90% explicit, KV cache reuse)
- Multi-layer caching implementation (L1: Exact, L2: Semantic, L3: Provider)
- Token economics (cost per token, compression techniques, context window management)
- Fallback strategies (Circuit Breaker, graceful degradation, retry with backoff)
- Interview questions (4 Q&A)

Sources: Collabnix "Multi-Tenant LLM Platform" (Dec 2025), Medium "Prompt Caching" (Jan 2026)

Remaining:

- A dedicated practical task (ContentBlock)
- Diffusion models serving specifics
- Multi-region deployment patterns
9. AI Agents in Production: PARTIALLY FILLED

Added in materials.md section 20:

- Agent definition and production criteria (autonomy, data, consequences)
- OWASP Top 10 for LLM Applications 2025 (Prompt Injection 73%, Sensitive Data Leakage, Excessive Agency)
- Defence-in-Depth Architecture (6 layers: Input Sanitization → Injection Detection → Agent Execution → Tool Call Interception → Output Validation → Observability)
- Tool Allowlist with Permission Gating (ToolGatekeeper class, schema validation, least privilege)
- Human-in-the-Loop (HITL) patterns with LangGraph interrupt() for pause/resume
- LLM-as-Judge Evaluation (JudgeVerdict model, few-shot prompting, chain-of-thought)
- Production Best Practices (10 commandments from UiPath/n8n)
- OpenTelemetry GenAI Semantic Conventions (gen_ai.system, gen_ai.usage tokens)
- Interview questions (4 Q&A)

Sources: RandomCommits "AI Agents in Production" (Jan 2026), Vertesia "Defence-in-Depth" (2025), UiPath "10 Commandments" (2025), LinkedIn Iain Harper, n8n Blog (Dec 2025)

Remaining:

- A dedicated practical task (ContentBlock)
- Multi-agent orchestration deep dive
- Agent memory systems
10. Vector Databases for ML: PARTIALLY FILLED

Added in materials.md section 15:

- ANN Index Types comparison table (HNSW, IVF, IVF-PQ, Flat)
- HNSW parameters (M, ef_construction, ef_search) with trade-offs
- IVF parameters (nlist, nprobe) with trade-offs
- Vector Database Comparison (Pinecone, Milvus, Weaviate, Qdrant, pgvector, Chroma)
- Python examples (Qdrant, Milvus)
- Hybrid Search (Vector + BM25) with RRF (Reciprocal Rank Fusion) code
- Index Refresh Strategies table (Full rebuild, Incremental, Dual index)
- Dual index pattern code
- Interview questions (5 Q&A)

Sources: JishuLabs (2026), Markaicode (2025), Medium IVF/HNSW guide (Jan 2026), TowardsAI Hybrid Search (Jan 2026)

Remaining:

- A dedicated practical task (ContentBlock)
- GPU-accelerated indexes
- Multi-tenancy patterns
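
Reciprocal Rank Fusion, mentioned above for hybrid search, merges ranked lists by summing 1 / (k + rank) per document, so it needs only ranks and not comparable scores. A short sketch follows; the document ids are hypothetical and `k=60` is the commonly used default, taken here as an assumption.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from a vector index and a BM25 index.
vector_hits = ["d1", "d2", "d3"]
bm25_hits = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

Documents that appear in both lists ("d1", "d3") outrank those found by only one retriever, even though cosine similarities and BM25 scores were never put on a common scale.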
Practical Gaps
11. Cost Optimization: PARTIALLY FILLED

Added in materials.md section 16:

- Inference cost breakdown (GPU 60-70%, Memory 15-20%)
- GPU utilization strategies (batching, dynamic batching with Triton)
- Spot instance strategy with preemption detection code
- Model right-sizing decision framework
- Cost per prediction formula with calculator
- Semantic caching for LLMs with Python implementation
- Auto-scaling with Kubernetes HPA
- Cost optimization decision matrix

Sources: Lambda Labs GPU Pricing (2026), Neptune.ai (Jan 2026), Runhouse (2025)

Remaining:

- A dedicated practical task (ContentBlock)
- Reserved instances vs spot optimization
- Multi-cloud cost arbitrage
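
The cost-per-prediction formula itself is not reproduced on this page. One plausible form, assumed here for illustration, divides the hourly instance cost by the number of predictions actually served in an hour (peak throughput scaled by utilization). The price and throughput figures below are hypothetical.

```python
def cost_per_prediction(hourly_cost_usd, throughput_rps, utilization):
    """Assumed formula: hourly cost / (peak requests per hour * utilization)."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    served_per_hour = throughput_rps * 3600 * utilization
    return hourly_cost_usd / served_per_hour


# Hypothetical instance: $2.00/hour, 100 requests/s at peak, 50% utilized.
cost = cost_per_prediction(hourly_cost_usd=2.0, throughput_rps=100, utilization=0.5)
cost_per_1k = cost * 1000
```

Under this model, doubling utilization (for example through batching) halves the unit cost without touching the hardware bill, which is why utilization appears alongside right-sizing in the list above.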
12. Multi-Model Serving: PARTIALLY FILLED

Added in materials.md section 17:

- Why Multi-Model (different tasks, cost optimization, redundancy, A/B testing)
- Routing Strategies Comparison table (Weighted, Latency-Based, Cost-Aware, Confidence-Based, Cascade, ML-Based)
- Weighted Round-Robin with Python implementation
- Confidence-Based Routing (Cascade) with code
- Latency-Aware Routing with rolling averages and fairness band
- Fallback Chain with Circuit Breaker pattern (CLOSED/OPEN/HALF_OPEN states)
- A/B Testing Between Models with deterministic user assignment
- Model Router Decision Matrix (scenario → strategy → why)
- Interview questions (4 Q&A)

Sources: TrueFoundry "LLM Load Balancing" (2025), LogRocket "LLM Routing in Production" (2026), arXiv "Universal Model Routing" (2025)

Remaining:

- A dedicated practical task (ContentBlock)
- ML-based routing with learned policies
- Multi-cloud model routing specifics
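
The Circuit Breaker pattern with CLOSED/OPEN/HALF_OPEN states, listed above, can be sketched in a few lines. This is an illustrative single-threaded version, not the implementation from materials.md; the class layout, thresholds, and the injectable `clock` parameter are assumptions made so the behaviour is easy to test.

```python
import time


class CircuitBreaker:
    """Route calls to a fallback while the primary model is failing."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, primary, fallback):
        if self.state == "OPEN":
            if self.clock() - self.opened_at >= self.reset_timeout:
                self.state = "HALF_OPEN"  # let one trial request through
            else:
                return fallback()
        try:
            result = primary()
        except Exception:
            self.failures += 1
            # A failed trial, or too many failures, (re)opens the circuit.
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = self.clock()
            return fallback()
        # Success closes the circuit and resets the failure counter.
        self.state = "CLOSED"
        self.failures = 0
        return result
```

While the breaker is OPEN the primary is not called at all, which protects a struggling model endpoint from retry storms; after `reset_timeout` a single trial request decides whether to close again.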
13. Data Quality for ML: PARTIALLY FILLED

Added in materials.md section 18:

- Data Quality Dimensions (Accuracy, Completeness, Consistency, Timeliness, Relevance, Uniqueness, Validity)
- Validation Types by Pipeline Stage (Ingestion, Preparation, Training, Production)
- Schema Validation with Great Expectations (expectations, type checks, range checks, completeness)
- TensorFlow Data Validation (TFDV) with drift detection
- Schema Evolution Strategies (Additive, Backward/Forward Compatible, Dual-Write)
- Data Lineage Tracking with Python implementation (LineageNode, DataLineageTracker)
- Data Quality Tools Comparison (Great Expectations, TFDV, Deepchecks, Pandera, Evidently AI)
- Interview questions (4 Q&A)

Sources: Uplatz "Data Validation and Quality in MLOps" (Nov 2025), ML Journey "Data Lineage Tracking" (Sep 2025), Gartner Data Quality Report (2025)

Remaining:

- A dedicated practical task (ContentBlock)
- Real-time streaming data validation
- PII detection and anonymization
Underspecified Topics
14. Monitoring & Observability: PARTIALLY FILLED

Added in materials.md section 22:

- Four Pillars of Observability (Metrics, Logs, Traces, Dashboards)
- Key Metrics tables (Model Performance, Data Quality, System)
- Prometheus instrumentation with Python code (Counter, Gauge, Histogram)
- Prometheus configuration (prometheus.yml)
- Alerting rules for ML (accuracy drop, latency spike, data drift, error rate)
- Grafana dashboard design (hierarchy, PromQL queries)
- Structured logging with structlog (JSON format, prediction logging)
- Drift detection implementation (KS test, PSI score with Python code)
- Multi-level alerting strategy (P0-P3 severity levels)
- Monitoring stack comparison (Prometheus, Grafana, MLflow, Evidently AI)
- Interview questions (4 Q&A)

Sources: ML Journey (Sep 2025), Johal.in "MLOps Monitoring" (Sep 2025), Diousoft "Model Monitoring & Logging" (2025), Grafana Labs "Observability Survey" (Mar 2025)

Remaining:

- A dedicated practical task (ContentBlock)
- Distributed tracing with Jaeger deep dive
- SRE practices for ML (SLOs, SLIs, error budgets)
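
For the PSI score mentioned in the drift-detection item, a dependency-free sketch is shown below. It bins both samples on the reference sample's range and computes Σ (aᵢ − eᵢ) · ln(aᵢ / eᵢ); the equal-width binning, the `eps` floor for empty bins, and the synthetic data are assumptions for illustration.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a reference and a current sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant data

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Clamp out-of-range values into the first or last bin.
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor at eps so empty bins do not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Synthetic example: the current sample is the reference shifted by +0.5.
reference = [i / 100 for i in range(100)]
shifted = [v + 0.5 for v in reference]
psi_same = psi(reference, reference)
psi_shifted = psi(reference, shifted)
```

A frequently quoted rule of thumb treats PSI above roughly 0.25 as significant drift; treat that threshold as a convention to validate per feature rather than a fixed rule.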
15. Security for ML: PARTIALLY FILLED

Added in materials.md section 23:

- Attack Taxonomy table (Evasion, Poisoning, Extraction, Inversion, MIA)
- Adversarial Attacks with FGSM/PGD formulas and Python code
- Model Extraction Attacks with defense code (rate limiting, output perturbation)
- Model Inversion Attacks (confidence score attacks, attribute inference)
- Membership Inference Attacks (MIA) defense
- Differential Privacy with DP-SGD implementation (gradient clipping, Gaussian noise)
- Defense Summary table (attack → primary defense → secondary defense)
- Multi-Layer Defense Architecture (5 layers: Data, Training, Access, Output, Monitoring)
- Interview questions (4 Q&A)

Sources: SentinelOne "Model Inversion" (Jan 2026), YASH Technologies "Adversarial Defenses" (Jan 2026), ByteJournal "Model Security" (Feb 2025), NIST Adversarial ML Taxonomy (Mar 2025)

Remaining:

- A dedicated practical task (ContentBlock)
- Model watermarking deep dive
- Federated learning security specifics
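
The two ingredients of DP-SGD named above, per-example gradient clipping and Gaussian noise, can be illustrated on plain lists. This sketch covers only the gradient aggregation for one batch and omits privacy accounting; the function name, the gradients, and the parameter values are hypothetical.

```python
import math
import random


def dp_sgd_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each example's gradient to L2 norm clip_norm, sum, add
    N(0, (noise_multiplier * clip_norm)^2) noise per coordinate, then average."""
    dim = len(per_example_grads[0])
    summed = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        # Scale down only gradients whose norm exceeds the clipping bound.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, g in enumerate(grad):
            summed[i] += g * scale
    sigma = noise_multiplier * clip_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(per_example_grads) for v in noisy]


# Hypothetical per-example gradients: the first has norm 5, the second 0.5.
grads = [[3.0, 4.0], [0.3, 0.4]]
clipped_only = dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=0.0,
                               rng=random.Random(0))
private = dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=1.1,
                          rng=random.Random(0))
```

Clipping bounds how much any single example can move the update, and the noise scale is tied to that bound; with `noise_multiplier=0.0` the function reduces to averaging clipped gradients, which is handy for checking the clipping logic in isolation.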
Recommendations for Filling the GAPS
Priority 1 (Add ASAP)
| Gap | Difficulty | Task |
|---|---|---|
| Feature Stores | Medium | mlsd_009_feature_stores |
| Multi-Stage RecSys | Hard | mlsd_011_multi_stage_recsys |
| Online Learning | Medium | mlsd_012_online_learning |
Priority 2 (Useful for Senior+)
| Gap | Difficulty | Task |
|---|---|---|
| ML Infrastructure | Hard | mlsd_010_ml_infrastructure |
| Multi-Armed Bandits | Medium | mlsd_013_bandits |
| Causal Inference | Hard | mlsd_014_causal_inference |
Priority 3 (Nice to have)
| Gap | Difficulty | Task |
|---|---|---|
| Model Compression | Medium | mlsd_015_compression |
| Vector Databases | Medium | mlsd_016_vector_db |
| Cost Optimization | Easy | mlsd_017_cost_optimization |
Cross-References Missing
Links worth adding:

- mlsd_001_model_serving -> llm_006_quantization (inference optimization)
- mlsd_002_ab_testing -> stat_015_ab_test_sample_size (statistics)
- mlsd_003_drift_detection -> stat_012_hypothesis_testing (KS test)
- mlsd_007_recsys -> dl_010_attention (two-tower with attention)
- mlsd_008_llm_prod -> llm_012_prompt_injection (security)

Final Coverage Assessment
Current ML System Design coverage: ~95% for ML Engineer, ~85% for Senior+
materials.md has 23 sections:

- 1-8. Core ML System Design (Model Serving, A/B Testing, Drift Detection, Calibration, Ranking, RecSys, Trade-offs, LLM Production)
- 9-15. Critical Gaps (Feature Stores, ML Infrastructure, Bandits, Online Learning, Multi-Stage RecSys, Causal Inference, Vector DBs)
- 16-23. Production & Emerging (Cost Optimization, Multi-Model Serving, Data Quality, Foundation Models, AI Agents, Compression, Monitoring, Security)

Main gaps (now filled in materials.md):
1. Feature stores ✅ Filled (Section 9)
2. ML infrastructure/platform ✅ Filled (Section 10)
3. Multi-stage recommender systems ✅ Filled (Section 13)
4. Online learning ✅ Filled (Section 12)
5. Causal inference ✅ Filled (Section 14)

Remaining for 100%:

- Practical tasks (ContentBlock) for each topic
- SRE practices (SLOs, SLIs, error budgets)
- Distributed tracing deep dive
- GPU-accelerated vector indexes
- Multi-tenancy patterns

What Is Already Well Covered
| Topic | Coverage | Why it is good |
|---|---|---|
| A/B Testing | Good | Sample size, significance, pitfalls |
| Drift Detection | Good | PSI, KS test, code examples |
| Model Calibration | Good | Platt, Isotonic, Brier score |
| Trade-offs Quiz | Excellent | 15 production scenarios |
| LLM Production | Good | Guardrails, OWASP Top 10 |