System Design Patterns for ML/AI¶
~6 minute read
Prerequisites: ML Fundamentals Audit, Coding Prep
System design is the highest-weight round in FAANG interviews for ML engineers: at Meta it is 2 of 5 rounds (40%), at Google 1-2 of 4-5. According to interviewing.io, candidates who structure their answer around a template (Clarify -> High-Level -> Deep Dive -> Bottlenecks -> Trade-offs) receive offers 3x more often. This article covers 5 key patterns (experiment tracking, recommendations, real-time inference, feature store, monitoring) that show up in 80%+ of ML system design interview questions in 2025-2026.
Type: system design guide. Date: 2025-2026. Based on: interview experiences, company blogs, architecture patterns.
Key Patterns¶
1. ML Experiment Tracking Platform¶
Requirements:
- Store millions of experiments
- Metrics, hyperparameters, artifacts
- Experiment comparison
- Git integration
Components:
graph TD
A["Frontend UI<br/>Experiment creation<br/>Visualization dashboards"] --> B["API Gateway<br/>Authentication<br/>Rate limiting"]
B --> C["Experiment Service<br/>CRUD operations<br/>Metadata management"]
C --> D["Metadata DB<br/>PostgreSQL"]
C --> E["Artifact Store<br/>S3"]
D --> F["Time Series DB<br/>metrics/logs"]
E --> G["Query Engine<br/>comparison"]
style A fill:#e8eaf6,stroke:#3f51b5
style B fill:#fff3e0,stroke:#ef6c00
style C fill:#e8f5e9,stroke:#4caf50
style D fill:#e8eaf6,stroke:#3f51b5
style E fill:#e8eaf6,stroke:#3f51b5
style F fill:#f3e5f5,stroke:#9c27b0
style G fill:#f3e5f5,stroke:#9c27b0
Scaling:
- Horizontal scaling of API servers
- Database sharding by experiment_id
- Async artifact upload
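The sharding scheme above can be sketched as a hash-based router. The shard count and the choice of MD5 here are illustrative assumptions, not part of the original design:

```python
import hashlib

NUM_SHARDS = 16  # hypothetical shard count for the metadata DB

def shard_for(experiment_id: str) -> int:
    """Route an experiment to a metadata-DB shard by hashing its ID.

    Hash-based routing keeps all rows for one experiment on one
    shard, so per-experiment reads and comparisons never fan out.
    """
    digest = hashlib.md5(experiment_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# The same ID always maps to the same shard.
assert shard_for("exp-42") == shard_for("exp-42")
```

Resharding (changing NUM_SHARDS) would require data movement; consistent hashing or a lookup table mitigates that, but is out of scope for this sketch.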
2. Recommendation System (Google/YouTube style)¶
Pipeline:
User -> Candidate Generation (FAISS/ANN) -> Scoring (ML model) -> Ranking (learning-to-rank) -> Re-ranking
Components:
Candidate Generation:
- Approximate nearest neighbor (ANN) search: FAISS, ScaNN, hnswlib
- Embeddings from user history
Scoring:
- Light ML model (XGBoost, neural network)
- Feature engineering: user, item, context features
Ranking:
- Learning-to-rank (LambdaMART, RankNet)
- Optimizes for ranking metrics (NDCG, MAP)
Re-ranking:
- Business logic application
- Diversity, fairness constraints
- Freshness boost
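A minimal candidate-generation sketch. It uses exact inner-product search in NumPy for clarity; at production scale this exact scan is exactly what an ANN index (FAISS, ScaNN, hnswlib) replaces:

```python
import numpy as np

def generate_candidates(user_emb: np.ndarray,
                        item_embs: np.ndarray,
                        k: int = 100) -> np.ndarray:
    """Return indices of the k items most similar to the user embedding.

    Exact inner-product search, O(n_items * dim); an ANN index trades
    a little recall for sublinear lookup at this step.
    """
    scores = item_embs @ user_emb               # similarity per item
    k = min(k, len(scores))
    top = np.argpartition(-scores, k - 1)[:k]   # unordered top-k, O(n)
    return top[np.argsort(-scores[top])]        # sort only the top-k

rng = np.random.default_rng(0)
items = rng.normal(size=(10_000, 64)).astype("float32")
user = rng.normal(size=64).astype("float32")
candidates = generate_candidates(user, items, k=100)
```

The `argpartition` + partial sort pattern avoids sorting all 10k scores; with a real ANN index the interface stays the same but the scan disappears.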
3. Real-time ML Inference System¶
Requirements:
- Low latency (<100ms p99)
- High throughput (10K+ QPS)
- Model versioning
- A/B testing
Architecture: load balancer -> inference pods (horizontal auto-scaling), backed by a model store (S3 + local cache) with versioning and an online feature store (Redis) for low-latency lookups.
Optimizations:
- Batching -- group requests for GPU utilization: $$ \text{Efficiency} = \frac{\text{Batch Size}}{\text{Batch Size} + \text{Overhead}} $$
- Model quantization -- FP32 -> INT8: 4x reduction in model size, 2-4x speedup
- Caching -- embeddings and frequent predictions: Redis for hot data, TTL-based eviction
- Feature precomputation -- offline feature calculation
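The batching trade-off above (batch size vs. added wait time) can be sketched as a collector loop. The defaults mirror the batch_size=32 / wait_time=5ms numbers used later in this article and are assumptions, not benchmarks:

```python
import time
from queue import Queue, Empty

def collect_batch(requests: Queue, max_batch: int = 32,
                  max_wait_s: float = 0.005) -> list:
    """Group incoming requests for one accelerator forward pass.

    Returns as soon as max_batch requests are collected OR max_wait_s
    has elapsed since the first request arrived -- trading a bounded
    amount of latency for much better GPU utilization.
    """
    batch = [requests.get()]  # block until at least one request exists
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(requests.get(timeout=remaining))
        except Empty:
            break  # waited out the window with a partial batch
    return batch
```

In a real server this loop runs in a dedicated thread feeding the model; the key invariant is that no request waits longer than max_wait_s beyond its arrival.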
4. Feature Store¶
Components:
graph TD
A["Feature Writing<br/>Batch + Streaming<br/>Offline & Online features"] --> B["Storage Layer<br/>Redis online<br/>Cassandra offline<br/>S3 historical"]
B --> C["Serving Layer<br/>Feature API<br/>Point lookups<br/>Batch joins"]
style A fill:#e8f5e9,stroke:#4caf50
style B fill:#e8eaf6,stroke:#3f51b5
style C fill:#fff3e0,stroke:#ef6c00
Trade-offs:
| Store | Latency | Throughput | Use Case |
|---|---|---|---|
| Redis | <1ms | High | Hot features |
| Cassandra | <10ms | Very High | Offline features |
| S3 | N/A | Low | Historical |
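A sketch of the serving-layer lookup with graceful degradation, assuming a dict-like online store; the feature names and default values (population means) are hypothetical:

```python
# Hypothetical population-mean defaults used when the store misses or is down.
FEATURE_DEFAULTS = {"user_ctr_7d": 0.031, "item_popularity": 0.0}

def get_online_features(entity_id, online_store, defaults=FEATURE_DEFAULTS):
    """Fetch hot features from the online store (e.g. Redis).

    On a miss or an outage, fall back to population-level defaults so
    inference can always serve, possibly degraded. Returns the feature
    dict plus a flag marking degraded responses for monitoring.
    """
    try:
        raw = online_store.get(entity_id)  # assumed dict-like interface
    except Exception:
        raw = None  # store unreachable: degrade rather than fail
    if raw is None:
        return dict(defaults), True        # (features, degraded=True)
    return {**defaults, **raw}, False      # defaults fill missing keys
```

Logging the degraded flag is what makes the later "alert on miss rate > 20%" rule possible.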
5. Model Monitoring & Observability¶
Metrics:
- Prediction drift
  - PSI (Population Stability Index)
  - KL divergence
- Model performance
  - Accuracy, precision, recall over time
  - Calibration metrics
- System health
  - Latency (p50, p95, p99)
  - Error rates
  - Resource utilization
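PSI, the drift metric listed above, compares binned distributions of a feature at training time vs. serving time. The bin count and epsilon smoothing are conventional choices; 0.1/0.2 are the common rule-of-thumb thresholds:

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a reference (expected) and a
    live (actual) sample of one feature.

    Bin edges come from quantiles of the expected sample; epsilon keeps
    empty bins from producing log(0). Rule of thumb: < 0.1 stable,
    0.1-0.2 drifting, > 0.2 alert.
    """
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf   # catch out-of-range values
    e_frac = np.histogram(expected, edges)[0] / len(expected)
    a_frac = np.histogram(actual, edges)[0] / len(actual)
    eps = 1e-6
    e_frac, a_frac = e_frac + eps, a_frac + eps
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))
```

Running this every few minutes per feature (as in the drift detector mentioned later) is cheap: it only needs binned counts, not raw predictions.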
Architectural pattern:
graph LR
A["Inference"] --> B["Metrics Collector"]
B --> C["Time Series DB"]
C --> D["Alerting"]
A --> E["Predictions"]
B --> F["Prometheus"]
C --> G["Grafana"]
D --> H["PagerDuty"]
style A fill:#e8eaf6,stroke:#3f51b5
style B fill:#e8f5e9,stroke:#4caf50
style C fill:#f3e5f5,stroke:#9c27b0
style D fill:#fce4ec,stroke:#c62828
style E fill:#e8eaf6,stroke:#3f51b5
style F fill:#e8f5e9,stroke:#4caf50
style G fill:#f3e5f5,stroke:#9c27b0
style H fill:#fce4ec,stroke:#c62828
Alerting rules:
- alert: HighLatency
expr: p99_latency > 100ms
for: 5m
- alert: AccuracyDrop
expr: accuracy < 0.8
for: 10m
Formulas and Calculations¶
Capacity Planning¶
Example:
- QPS: 10,000 requests/second
- Target latency: 50ms
- Concurrency: 10 requests per instance
- Instances needed: (10,000 x 0.05 s) / 10 = 50
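The estimate follows Little's law (in-flight requests = QPS x latency). A small sketch; the headroom factor for traffic spikes is an addition not stated in the example:

```python
import math

def instances_needed(qps: float, latency_s: float,
                     concurrency_per_instance: int,
                     headroom: float = 1.0) -> int:
    """Size a fleet from Little's law.

    in-flight requests = arrival rate * time in system; divide by how
    many requests one instance handles concurrently, scaled by an
    optional headroom factor for traffic spikes.
    """
    in_flight = qps * latency_s
    return math.ceil(in_flight / concurrency_per_instance * headroom)

# 10,000 QPS * 0.05 s = 500 concurrent requests; at 10 per instance -> 50.
print(instances_needed(10_000, 0.05, 10))  # -> 50
```

Quoting this one-line calculation out loud is exactly the "capacity estimation with concrete numbers" interviewers look for.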
Model Sizing¶
GPU Memory: $$ \text{Required VRAM} = \text{Model} + \text{Optimizer} + \text{Gradients} + \text{Activations} $$
Latency Budget¶
SLO example (total: 100ms):
- Network: 20ms
- Preprocessing: 10ms
- Inference: 60ms
- Postprocessing: 10ms
Preparation Questions¶
General System Design¶
- "Design a system that does X"
- "How would you scale this to Y users?"
- "What are the bottlenecks?"
- "How would you add feature X?"
ML-Specific¶
- "Design an experiment tracking system"
- "Design a real-time inference system"
- "Design a feature store"
- "Design a recommendation system"
Follow-up¶
- "How would you A/B test this?"
- "How would you monitor this?"
- "What if component X fails?"
- "How would you migrate to a new model?"
Practical Tips¶
Design Process¶
1. Clarify requirements
   - Scale (QPS, data size)
   - Latency requirements
   - Consistency needs
   - Budget constraints
2. High-level design
   - Components and their interactions
   - Data flow
   - Technology choices
3. Deep dive
   - Each component in detail
   - API design
   - Data models
4. Bottlenecks and scaling
   - Identify bottlenecks
   - Propose solutions
   - Calculate capacity
5. Trade-offs
   - Discuss alternatives
   - Justify choices
Common patterns¶
| Problem | Solution |
|---|---|
| High latency | Caching, batching, model quantization |
| Low throughput | Horizontal scaling, load balancing |
| Model staleness | Canary deployment, blue-green |
| Feature drift | Monitoring, retraining pipeline |
| Cold start | Pre-computation, hybrid models |
Misconception: system design = just an architecture diagram
Interviewers evaluate 5 aspects: (1) clarification questions, (2) capacity estimation with concrete numbers, (3) component design, (4) bottleneck analysis, (5) trade-off discussion. Candidates who jump straight to drawing boxes skip 60% of the rubric. Always start with "what QPS is expected?" and "what is the latency SLO?"
Misconception: a feature store is just Redis
A feature store solves 3 problems: (1) train-serve skew (offline features are computed differently than online ones), (2) feature reuse across teams, (3) point-in-time correctness for historical data. Redis is only the online serving layer. Without an offline store (Hive/S3) and a feature registry, it is not a feature store -- it is a cache.
Misconception: monitoring = logging metrics
Production ML monitoring spans 3 levels: (1) system health (latency p99, error rate, CPU/GPU utilization), (2) data quality (feature drift via PSI > 0.2, missing values > 5%), (3) model performance (accuracy degradation > 2%, calibration drift). Without alerting at all 3 levels, a model can degrade unnoticed for months.
Interview¶
"Design a real-time ML inference system for 10K QPS"¶
Weak answer: "We put the model behind a load balancer and add a GPU. If it's slow, we add more GPUs."
Strong answer: "Clarify: 10K QPS, p99 < 100ms, model size 500MB. Capacity: 10,000 x 0.05 / 10 = 50 instances. Architecture: (1) load balancer -> inference pods (horizontal auto-scaling); (2) model store (S3 + local cache) with versioning for A/B tests; (3) feature store (Redis online, <1ms lookup); (4) optimizations: dynamic batching (batch_size=32, wait_time=5ms), FP32 -> INT8 quantization (4x memory reduction, 2x speedup), embedding cache (Redis, TTL=1h, ~60% hit rate); (5) monitoring: Prometheus for latency/throughput plus a custom drift detector (PSI every 15 min). Bottleneck: GPU memory; mitigation: model sharding across GPUs, or distillation to shrink the model."
"How would you design an A/B testing platform for ML models?"¶
Weak answer: "Randomly split traffic 50/50 and compare metrics after a week."
Strong answer: "(1) Traffic splitting: consistent hashing on user_id (not random assignment, so each user always stays in the same group). (2) Staged rollout: shadow mode (0% traffic, compare predictions) -> 1% canary -> 5% -> 50%. (3) Metrics: primary (CTR, revenue), guardrail (latency p99, error rate), secondary (engagement depth). (4) Statistical rigor: pre-computed sample size (MDE=1%, alpha=0.05, power=0.8 -> ~16K users per group), CUPED for variance reduction (30-50% shorter experiments), sequential testing for early stopping. (5) Automation: auto-rollback if guardrail metrics degrade by more than 2 sigma. (6) Logging: feature values + predictions + outcomes for offline analysis."
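The consistent traffic splitting from step (1) can be sketched as salted hashing into 100 buckets. Salting the hash with the experiment name is an assumption here; it keeps a user's assignments independent across concurrent experiments:

```python
import hashlib

def assign_variant(user_id: str, experiment: str,
                   treatment_pct: int = 50) -> str:
    """Deterministic traffic split: hash (experiment, user) into a
    bucket 0-99 and compare against the treatment percentage.

    The same user always lands in the same arm of a given experiment,
    and the per-experiment salt decorrelates assignments across
    experiments.
    """
    key = f"{experiment}:{user_id}".encode()
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    return "treatment" if bucket < treatment_pct else "control"

# Deterministic: re-evaluating never flips a user's group.
assert assign_variant("u123", "ranker_v2") == assign_variant("u123", "ranker_v2")
```

Ramping from 1% to 50% only moves the threshold, so users already in treatment stay in treatment -- which is what makes staged rollouts analyzable.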
"What happens if the Feature Store goes down?"¶
Weak answer: "The system stops working. We need a backup."
Strong answer: "Graceful degradation strategy: (1) Online features (Redis): fall back to default values (population means), logging each fallback for monitoring; with a ~60% cache hit rate, expect roughly 5-10% quality loss on defaults. (2) Circuit breaker: if Redis latency > 10ms, switch to a local cache (stale features, TTL=5min). (3) Offline features (Cassandra): replicated across datacenters, failover < 1s. (4) Monitoring: alert on miss rate > 20%, plus a degraded-mode dashboard. (5) Recovery: Redis is rebuilt from a Cassandra snapshot (warm-up ~10 min). Key point: the inference service must NOT go down because of the feature store -- always serve, sometimes degraded."
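The circuit breaker from step (2) can be sketched as a small state machine. The failure threshold and reset window are illustrative parameters, not values from the answer above:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker for a flaky dependency (e.g. the
    online feature store).

    After `threshold` consecutive failures the circuit opens and
    calls go straight to the fallback (defaults / stale local cache)
    instead of waiting on a sick dependency; after `reset_s` seconds
    the next call is let through to probe for recovery.
    """
    def __init__(self, threshold: int = 3, reset_s: float = 30.0):
        self.threshold = threshold
        self.reset_s = reset_s
        self.failures = 0
        self.opened_at = None  # None = closed (normal operation)

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_s:
                return fallback()      # open: fail fast, skip fn entirely
            self.opened_at = None      # half-open: allow one probe call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            return fallback()
        self.failures = 0              # success closes the circuit
        return result
```

The fail-fast behavior is the point: during an outage, requests spend 0ms on the dead store instead of burning the latency budget on timeouts.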
My Notes¶
Key trends 2025-2026:
- AI reasoning -- Design questions now assume AI assistance
- LLM integration -- RAG, embeddings in feature stores
- Real-time personalization -- Lower latency expectations
- Cost optimization -- GPU resource sharing, spot instances
Critical skills:
- Back-of-the-envelope calculations
- Technology selection justification
- Bottleneck identification
- Trade-off articulation
Gaps remaining:
- [ ] LLM-specific system design (RAG systems)
- [ ] Multi-agent system design
- [ ] Edge ML deployment
- [ ] Privacy-preserving ML systems
Practice resources:
- System Design Primer
- Designing Data-Intensive Applications
- LeetCode System Design