
System Design Patterns for ML/AI

~6 minute read

Prerequisites: ML Fundamentals Audit, Coding Preparation

System design is the most heavily weighted round in FAANG interviews for ML engineers: at Meta it is 2 of 5 rounds (40%), at Google 1-2 of 4-5. According to interviewing.io, candidates who structure their answer around a template (Clarify -> High-Level -> Deep Dive -> Bottlenecks -> Trade-offs) receive offers 3x more often. This article covers 5 key patterns (experiment tracking, recommendations, real-time inference, feature store, monitoring) that appear in 80%+ of ML system design interview questions in 2025-2026.

Type: system design guide. Date: 2025-2026. Based on: interview experiences, company blogs, architecture patterns.

Key Patterns

1. ML Experiment Tracking Platform

Requirements:

  - Store millions of experiments
  - Metrics, hyperparameters, and artifacts
  - Experiment comparison
  - Git integration

Components:

graph TD
    A["Frontend UI<br/>Experiment creation<br/>Visualization dashboards"] --> B["API Gateway<br/>Authentication<br/>Rate limiting"]
    B --> C["Experiment Service<br/>CRUD operations<br/>Metadata management"]
    C --> D["Metadata DB<br/>PostgreSQL"]
    C --> E["Artifact Store<br/>S3"]
    D --> F["Time Series DB<br/>metrics/logs"]
    E --> G["Query Engine<br/>comparison"]

    style A fill:#e8eaf6,stroke:#3f51b5
    style B fill:#fff3e0,stroke:#ef6c00
    style C fill:#e8f5e9,stroke:#4caf50
    style D fill:#e8eaf6,stroke:#3f51b5
    style E fill:#e8eaf6,stroke:#3f51b5
    style F fill:#f3e5f5,stroke:#9c27b0
    style G fill:#f3e5f5,stroke:#9c27b0

Scaling:

  - Horizontal scaling for API servers
  - Database sharding by experiment_id
  - Async artifact upload
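
A minimal sketch of the write path the Experiment Service might expose to training jobs; the gateway URL, endpoint names, and payload fields are illustrative assumptions, not a real API:

```python
# Hypothetical tracking client; the URL, endpoints, and fields are illustrative.
import requests

API = "https://tracker.internal/api/v1"  # assumed API Gateway address

def create_run(project: str, params: dict, git_sha: str) -> str:
    # Metadata lands in PostgreSQL (sharded by experiment_id); returns the run id.
    resp = requests.post(
        f"{API}/runs",
        json={"project": project, "params": params, "git_sha": git_sha},
        timeout=5,
    )
    resp.raise_for_status()
    return resp.json()["run_id"]

def log_metric(run_id: str, name: str, value: float, step: int) -> None:
    # High-frequency metrics go to the time-series DB, not the metadata DB.
    requests.post(
        f"{API}/runs/{run_id}/metrics",
        json={"name": name, "value": value, "step": step},
        timeout=5,
    ).raise_for_status()
```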

2. Recommendation System (Google/YouTube style)

Pipeline:

User -> Candidate Generation -> Scoring    -> Ranking            -> Re-ranking
        (FAISS/ANN)             (ML Model)    (Learning-to-Rank)    (Business logic)

Components:

Candidate Generation:

  - Approximate Nearest Neighbor (ANN)
  - FAISS, ScaNN, Hnswlib
  - Embeddings from user history

\[ \text{Candidates} = \text{ANN}(user\_embedding, K=1000) \]
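
A minimal candidate-generation sketch using FAISS with an HNSW index; the embedding dimension, corpus size, and random vectors are placeholders:

```python
# ANN retrieval over item embeddings; dimensions and sizes are illustrative.
import numpy as np
import faiss

d, n_items = 128, 100_000
item_emb = np.random.rand(n_items, d).astype("float32")
faiss.normalize_L2(item_emb)  # unit vectors -> inner product = cosine similarity

index = faiss.IndexHNSWFlat(d, 32, faiss.METRIC_INNER_PRODUCT)  # 32 links per node
index.add(item_emb)

user_emb = np.random.rand(1, d).astype("float32")
faiss.normalize_L2(user_emb)
scores, candidate_ids = index.search(user_emb, 1000)  # K = 1000 candidates
```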

Scoring:

  - Lightweight ML model (XGBoost, neural network)
  - Feature engineering: user, item, and context features

\[ \text{Score} = f(user, item, context) \]

Ranking:

  - Learning-to-rank (LambdaMART, RankNet)
  - Optimizes for ranking metrics (NDCG, MAP)

Re-ranking:

  - Business logic application
  - Diversity and fairness constraints
  - Freshness boost
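
A minimal sketch of the ranking stage using LightGBM's LambdaMART implementation (LGBMRanker); the features, labels, and query groups are synthetic placeholders:

```python
# Learning-to-rank sketch (LambdaMART); data is a synthetic placeholder.
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))       # user/item/context features
y = rng.integers(0, 4, size=1000)     # graded relevance labels 0..3
group = [50] * 20                     # 20 queries with 50 candidates each

ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=100)
ranker.fit(X, y, group=group)         # optimizes NDCG via lambdarank gradients

# Order one query's 50 candidates by predicted relevance.
order = np.argsort(-ranker.predict(X[:50]))
```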

3. Real-time ML Inference System

Requirements:

  - Low latency (<100ms p99)
  - High throughput (10K+ QPS)
  - Model versioning
  - A/B testing

Architecture:

Request -> Load Balancer -> Inference Service -> Model Store
                                    |                 |
                              Feature Store        GPU Pool

Optimizations:

  1. Batching -- group requests to improve GPU utilization (see the sketch after this list): $$ \text{Efficiency} = \frac{\text{Batch Size}}{\text{Batch Size} + \text{Overhead}} $$

  2. Model quantization -- FP32 -> INT8
     - 4x reduction in model size
     - 2-4x speedup

  3. Caching -- embeddings, frequent predictions
     - Redis for hot data
     - TTL-based eviction

  4. Feature precomputation -- offline feature calculation
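
A sketch of the dynamic batching idea from item 1, assuming an asyncio service and a batched model_fn callable; production servers such as Triton or TorchServe provide this natively:

```python
# Dynamic micro-batching sketch; model_fn is an assumed batched-inference callable.
import asyncio

class MicroBatcher:
    def __init__(self, model_fn, max_batch=32, max_wait=0.005):
        self.model_fn, self.max_batch, self.max_wait = model_fn, max_batch, max_wait
        self.queue: asyncio.Queue = asyncio.Queue()

    async def infer(self, x):
        # Callers await a future that the batch loop resolves.
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((x, fut))
        return await fut

    async def run(self):
        while True:
            items = [await self.queue.get()]            # block for the first request
            deadline = asyncio.get_running_loop().time() + self.max_wait
            while len(items) < self.max_batch:
                timeout = deadline - asyncio.get_running_loop().time()
                if timeout <= 0:
                    break
                try:
                    items.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            inputs, futs = zip(*items)
            for fut, pred in zip(futs, self.model_fn(list(inputs))):  # one GPU call
                fut.set_result(pred)
```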

4. Feature Store

Components:

graph TD
    A["Feature Writing<br/>Batch + Streaming<br/>Offline & Online features"] --> B["Storage Layer<br/>Redis online<br/>Cassandra offline<br/>S3 historical"]
    B --> C["Serving Layer<br/>Feature API<br/>Point lookups<br/>Batch joins"]

    style A fill:#e8f5e9,stroke:#4caf50
    style B fill:#e8eaf6,stroke:#3f51b5
    style C fill:#fff3e0,stroke:#ef6c00

Trade-offs:

| Store     | Latency | Throughput | Use Case         |
|-----------|---------|------------|------------------|
| Redis     | <1ms    | High       | Hot features     |
| Cassandra | <10ms   | Very high  | Offline features |
| S3        | N/A     | Low        | Historical data  |
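
A sketch of the online serving path using redis-py; the features:{entity_id} key layout and JSON value encoding are assumptions:

```python
# Online feature read/write; the key layout is an assumed convention.
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def write_online(entity_id: str, features: dict) -> None:
    # One Redis hash per entity; values JSON-encoded to allow mixed types.
    r.hset(f"features:{entity_id}",
           mapping={k: json.dumps(v) for k, v in features.items()})

def read_online(entity_id: str, names: list) -> dict:
    # Point lookup, typically sub-millisecond for hot keys.
    raw = r.hmget(f"features:{entity_id}", names)
    return {n: (json.loads(v) if v is not None else None)
            for n, v in zip(names, raw)}
```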

5. Model Monitoring & Observability

Metrics:

  1. Prediction drift (see the PSI sketch after this list)
     - PSI (Population Stability Index)
     - KL divergence

  2. Model performance
     - Accuracy, precision, recall over time
     - Calibration metrics

  3. System health
     - Latency (p50, p95, p99)
     - Error rates
     - Resource utilization
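
A sketch of the PSI computation from item 1, binning a reference (training-time) distribution against live traffic; the 10-bin quantile scheme and the thresholds are common conventions, not a standard:

```python
# Population Stability Index between reference and live distributions.
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf        # absorb out-of-range live values
    e = np.histogram(expected, edges)[0] / len(expected)
    a = np.histogram(actual, edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)  # avoid log(0)
    return float(np.sum((a - e) * np.log(a / e)))

# Rule of thumb: PSI < 0.1 stable, 0.1-0.2 moderate shift, > 0.2 alert.
```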

Architectural pattern:

graph LR
    A["Inference"] --> B["Metrics Collector"]
    B --> C["Time Series DB"]
    C --> D["Alerting"]
    A --> E["Predictions"]
    B --> F["Prometheus"]
    C --> G["Grafana"]
    D --> H["PagerDuty"]

    style A fill:#e8eaf6,stroke:#3f51b5
    style B fill:#e8f5e9,stroke:#4caf50
    style C fill:#f3e5f5,stroke:#9c27b0
    style D fill:#fce4ec,stroke:#c62828
    style E fill:#e8eaf6,stroke:#3f51b5
    style F fill:#e8f5e9,stroke:#4caf50
    style G fill:#f3e5f5,stroke:#9c27b0
    style H fill:#fce4ec,stroke:#c62828

Alerting rules (Prometheus-style; the metric names are illustrative):

- alert: HighLatency
  expr: histogram_quantile(0.99, sum(rate(inference_latency_seconds_bucket[5m])) by (le)) > 0.1  # 0.1s = 100ms
  for: 5m

- alert: AccuracyDrop
  expr: model_accuracy < 0.8
  for: 10m

Formulas and Calculations

Capacity Planning

\[ \text{Required Instances} = \frac{\text{QPS} \times \text{Latency}}{\text{Concurrency}} \]

Example:

  - QPS: 10,000 requests/second
  - Target latency: 50ms
  - Concurrency: 10 requests per instance

\[ \text{Instances} = \frac{10000 \times 0.05}{10} = 50 \]

Model Sizing

\[ \text{Memory} = \text{Model Size} \times \text{Batch Size} \times \text{Overhead} \]

GPU Memory: $$ \text{Required VRAM} = \text{Model} + \text{Optimizer} + \text{Gradients} + \text{Activations} $$
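
A hedged worked example for training memory, assuming FP32 weights and the Adam optimizer: weights take 4 bytes/parameter, gradients another 4, and Adam's two moment buffers 8 more, so roughly

\[ \text{Required VRAM} \approx 16 \,\text{bytes} \times N_{\text{params}} + \text{Activations} \]

A 1B-parameter model therefore needs on the order of 16 GB plus activation memory (which scales with batch size and architecture); INT8 inference needs roughly 1 byte per parameter.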

Latency Budget

\[ \text{Total Latency} = \text{Network} + \text{Preprocessing} + \text{Inference} + \text{Postprocessing} \]

SLO example:

  - Total: 100ms
  - Network: 20ms
  - Preprocessing: 10ms
  - Inference: 60ms
  - Postprocessing: 10ms

Preparation Questions

General System Design

  1. "Design a system that does X"
  2. "How would you scale this to Y users?"
  3. "What are the bottlenecks?"
  4. "How would you add feature X?"

ML-Specific

  1. "Design an experiment tracking system"
  2. "Design a real-time inference system"
  3. "Design a feature store"
  4. "Design a recommendation system"

Follow-up

  1. "How would you A/B test this?"
  2. "How would you monitor this?"
  3. "What if component X fails?"
  4. "How would you migrate to a new model?"

Practical Tips

Design Process

  1. Clarify requirements
     - Scale (QPS, data size)
     - Latency requirements
     - Consistency needs
     - Budget constraints

  2. High-level design
     - Components and their interactions
     - Data flow
     - Technology choices

  3. Deep dive
     - Each component in detail
     - API design
     - Data models

  4. Bottlenecks and scaling
     - Identify bottlenecks
     - Propose solutions
     - Calculate capacity

  5. Trade-offs
     - Discuss alternatives
     - Justify choices

Common patterns

| Problem         | Solution                              |
|-----------------|---------------------------------------|
| High latency    | Caching, batching, model quantization |
| Low throughput  | Horizontal scaling, load balancing    |
| Model staleness | Canary deployment, blue-green         |
| Feature drift   | Monitoring, retraining pipeline       |
| Cold start      | Pre-computation, hybrid models        |

Misconception: system design = just an architecture diagram

Interviewers evaluate 5 aspects: (1) clarification questions, (2) capacity estimation with concrete numbers, (3) component design, (4) bottleneck analysis, (5) trade-off discussion. Candidates who jump straight to drawing boxes skip 60% of the evaluation. Always start with "what QPS is expected?" and "what is the latency SLO?"

Misconception: a feature store is just Redis

A feature store solves 3 problems: (1) train-serve skew (offline features are computed differently than online ones), (2) feature reuse across teams, (3) point-in-time correctness for historical data. Redis is only the online serving layer. Without an offline store (Hive/S3) and a feature registry, it is not a feature store, just a cache.
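
Point-in-time correctness is the subtlest of the three; a small pandas sketch of a point-in-time join, with hypothetical column names:

```python
# Each label row gets the latest feature value known at or before the label
# timestamp, so no future information leaks into training data.
import pandas as pd

labels = pd.DataFrame({"user_id": [1, 1],
                       "ts": pd.to_datetime(["2025-01-10", "2025-02-01"]),
                       "y": [0, 1]})
features = pd.DataFrame({"user_id": [1, 1],
                         "ts": pd.to_datetime(["2025-01-01", "2025-01-20"]),
                         "spend_30d": [10.0, 42.0]})

train = pd.merge_asof(
    labels.sort_values("ts"), features.sort_values("ts"),
    on="ts", by="user_id", direction="backward",  # only past feature values
)
```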

Misconception: monitoring = logging metrics

Production ML monitoring spans 3 levels: (1) system health (p99 latency, error rate, CPU/GPU utilization), (2) data quality (feature drift via PSI > 0.2, missing values > 5%), (3) model performance (accuracy degradation > 2%, calibration drift). Without alerting at all 3 levels, a model can degrade unnoticed for months.

Interview Examples

"Design a real-time ML inference system для 10K QPS"

❌ "Put the model behind a load balancer and add GPUs. If it's slow, add more GPUs."

✅ "Clarify: 10K QPS, p99 < 100ms, model size 500MB. Capacity: 10000 * 0.05 / 10 = 50 instances. Architecture: (1) Load balancer -> inference pods (horizontal auto-scaling). (2) Model store (S3 + local cache) with versioning for A/B tests. (3) Feature store (Redis online, <1ms lookup). (4) Optimizations: dynamic batching (batch_size=32, wait_time=5ms), model quantization FP32->INT8 (4x memory reduction, 2x speedup), embedding cache (Redis, TTL=1h, hit rate ~60%). (5) Monitoring: Prometheus for latency/throughput, custom drift detector (PSI every 15 min). Bottleneck: GPU memory; mitigations: model sharding across GPUs or distillation to shrink the model."

"Как бы вы спроектировали A/B testing platform для ML моделей?"

❌ "Split traffic 50/50 at random and compare metrics after a week."

✅ "(1) Traffic splitting: consistent hashing by user_id (not random, so the same user always stays in the same group). (2) Staged rollout: shadow mode (0% traffic, compare predictions) -> 1% canary -> 5% -> 50%. (3) Metrics: primary (CTR, revenue), guardrail (p99 latency, error rate), secondary (engagement depth). (4) Statistical rigor: pre-computed sample size (MDE=1%, alpha=0.05, power=0.8 -> ~16K users per group), sequential testing (CUPED for variance reduction, 30-50% shorter experiments). (5) Automation: auto-rollback if guardrail metrics degrade by > 2 sigma. (6) Logging: feature values + predictions + outcomes for offline analysis."
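
A sketch of the deterministic split from point (1); the bucket count and hash choice are assumptions:

```python
# Consistent, per-experiment bucketing: same user -> same group, every request.
import hashlib

def bucket(user_id: str, experiment: str, n_buckets: int = 100) -> int:
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % n_buckets   # uniform over buckets, stable per user

def variant(user_id: str, experiment: str, treatment_pct: int = 1) -> str:
    # Salting by experiment name decorrelates assignments across experiments.
    return "treatment" if bucket(user_id, experiment) < treatment_pct else "control"
```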

"Что произойдет, если Feature Store упадет?"

❌ "The system stops working. We need a backup."

✅ "Graceful degradation strategy: (1) Online features (Redis): fall back to default values (population means) with logging for monitoring. Cache hit rate is ~60%; quality loss on default values is ~5-10%. (2) Circuit breaker: if Redis latency > 10ms, switch to a local cache (stale features, TTL=5min). (3) Offline features (Cassandra): replicated across datacenters, failover < 1s. (4) Monitoring: alert on miss rate > 20%, degraded-mode dashboard. (5) Recovery: Redis is rebuilt from a Cassandra snapshot (warm-up ~10min). The key point: the inference service must NOT go down because of the feature store; always serve, sometimes degraded."
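
A sketch of this read path, a circuit breaker plus default-value fallback, with assumed thresholds and a redis-py client:

```python
# Graceful degradation: Redis -> stale local cache -> population defaults.
import time

class FeatureClient:
    def __init__(self, redis_client, defaults: dict, cooldown: float = 30.0):
        self.redis = redis_client
        self.defaults = defaults     # e.g. population means per feature
        self.stale = {}              # last good values, refreshed on healthy reads
        self.open_until = 0.0        # circuit-breaker state
        self.cooldown = cooldown

    def get(self, entity_id: str) -> dict:
        if time.monotonic() < self.open_until:       # breaker open: skip Redis
            return self.stale.get(entity_id, self.defaults)
        try:
            raw = self.redis.hgetall(f"features:{entity_id}")
        except Exception:
            self.open_until = time.monotonic() + self.cooldown
            return self.stale.get(entity_id, self.defaults)
        if not raw:                                  # miss: serve defaults, log it
            return self.defaults
        feats = {k.decode(): float(v) for k, v in raw.items()}
        self.stale[entity_id] = feats
        return feats
```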

My Notes

Key trends 2025-2026:

  1. AI reasoning -- Design questions now assume AI assistance
  2. LLM integration -- RAG, embeddings in feature stores
  3. Real-time personalization -- Lower latency expectations
  4. Cost optimization -- GPU resource sharing, spot instances

Critical skills:

  - Back-of-the-envelope calculations
  - Technology selection justification
  - Bottleneck identification
  - Trade-off articulation

Gaps remaining:

  - [ ] LLM-specific system design (RAG systems)
  - [ ] Multi-agent system design
  - [ ] Edge ML deployment
  - [ ] Privacy-preserving ML systems

Practice resources:

  - System Design Primer
  - Designing Data-Intensive Applications
  - LeetCode System Design