
System Design Patterns for ML/AI

~6 minute read

Prerequisites: ML Fundamentals Audit, Coding Preparation

System design is the most heavily weighted round in FAANG interviews for ML engineers: at Meta it is 2 of 5 rounds (40%), at Google 1-2 of 4-5. According to interviewing.io, candidates who structure their answer around a template (Clarify -> High-Level -> Deep Dive -> Bottlenecks -> Trade-offs) receive offers 3x more often. This article covers 5 key patterns (experiment tracking, recommendations, real-time inference, feature store, monitoring) that appear in 80%+ of ML system design interview questions in 2025-2026.

Type: system design guide. Date: 2025-2026. Based on: interview experiences, company blogs, architecture patterns.

Key Patterns

1. ML Experiment Tracking Platform

Requirements:

  - Store millions of experiments
  - Metrics, hyperparameters, and artifacts
  - Experiment comparison
  - Git integration

Components:

graph TD
    A["Frontend UI<br/>Experiment creation<br/>Visualization dashboards"] --> B["API Gateway<br/>Authentication<br/>Rate limiting"]
    B --> C["Experiment Service<br/>CRUD operations<br/>Metadata management"]
    C --> D["Metadata DB<br/>PostgreSQL"]
    C --> E["Artifact Store<br/>S3"]
    D --> F["Time Series DB<br/>metrics/logs"]
    E --> G["Query Engine<br/>comparison"]

    style A fill:#e8eaf6,stroke:#3f51b5
    style B fill:#fff3e0,stroke:#ef6c00
    style C fill:#e8f5e9,stroke:#4caf50
    style D fill:#e8eaf6,stroke:#3f51b5
    style E fill:#e8eaf6,stroke:#3f51b5
    style F fill:#f3e5f5,stroke:#9c27b0
    style G fill:#f3e5f5,stroke:#9c27b0

Scaling:

  - Horizontal scaling for API servers
  - Database sharding by experiment_id
  - Async artifact upload
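
A minimal sketch of the write path the Experiment Service might expose to training jobs; the gateway URL, endpoint names, and payload fields are illustrative assumptions, not a real API:

```python
# Hypothetical tracking client; the URL, endpoints, and fields are illustrative.
import requests

API = "https://tracker.internal/api/v1"  # assumed API Gateway address

def create_run(project: str, params: dict, git_sha: str) -> str:
    # Metadata lands in PostgreSQL (sharded by experiment_id); returns the run id.
    resp = requests.post(
        f"{API}/runs",
        json={"project": project, "params": params, "git_sha": git_sha},
        timeout=5,
    )
    resp.raise_for_status()
    return resp.json()["run_id"]

def log_metric(run_id: str, name: str, value: float, step: int) -> None:
    # High-frequency metrics go to the time-series DB, not the metadata DB.
    requests.post(
        f"{API}/runs/{run_id}/metrics",
        json={"name": name, "value": value, "step": step},
        timeout=5,
    ).raise_for_status()
```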

2. Recommendation System (Google/YouTube style)

Pipeline:

User -> Candidate Generation -> Scoring    -> Ranking            -> Re-ranking
        (FAISS/ANN)             (ML Model)    (Learning-to-Rank)    (Business logic)

Components:

Candidate Generation:

  - Approximate Nearest Neighbor (ANN)
  - FAISS, ScaNN, Hnswlib
  - Embeddings from user history

\[ \text{Candidates} = \text{ANN}(user\_embedding, K=1000) \]
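
A minimal candidate-generation sketch using FAISS with an HNSW index; the embedding dimension, corpus size, and random vectors are placeholders:

```python
# ANN retrieval over item embeddings; dimensions and sizes are illustrative.
import numpy as np
import faiss

d, n_items = 128, 100_000
item_emb = np.random.rand(n_items, d).astype("float32")
faiss.normalize_L2(item_emb)  # unit vectors -> inner product = cosine similarity

index = faiss.IndexHNSWFlat(d, 32, faiss.METRIC_INNER_PRODUCT)  # 32 links per node
index.add(item_emb)

user_emb = np.random.rand(1, d).astype("float32")
faiss.normalize_L2(user_emb)
scores, candidate_ids = index.search(user_emb, 1000)  # K = 1000 candidates
```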

Scoring:

  - Lightweight ML model (XGBoost, neural network)
  - Feature engineering: user, item, and context features

\[ \text{Score} = f(user, item, context) \]

Ranking:

  - Learning-to-rank (LambdaMART, RankNet)
  - Optimizes for ranking metrics (NDCG, MAP)

Re-ranking:

  - Business logic application
  - Diversity and fairness constraints
  - Freshness boost
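
A minimal sketch of the ranking stage using LightGBM's LambdaMART implementation (LGBMRanker); the features, labels, and query groups are synthetic placeholders:

```python
# Learning-to-rank sketch (LambdaMART); data is a synthetic placeholder.
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))       # user/item/context features
y = rng.integers(0, 4, size=1000)     # graded relevance labels 0..3
group = [50] * 20                     # 20 queries with 50 candidates each

ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=100)
ranker.fit(X, y, group=group)         # optimizes NDCG via lambdarank gradients

# Order one query's 50 candidates by predicted relevance.
order = np.argsort(-ranker.predict(X[:50]))
```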

3. Real-time ML Inference System

Requirements:

  - Low latency (<100ms p99)
  - High throughput (10K+ QPS)
  - Model versioning
  - A/B testing

Architecture:

Request -> Load Balancer -> Inference Service -> Model Store
                                    |                 |
                              Feature Store        GPU Pool

Optimizations:

  1. Batching -- group requests to improve GPU utilization (see the sketch after this list): $$ \text{Efficiency} = \frac{\text{Batch Size}}{\text{Batch Size} + \text{Overhead}} $$

  2. Model quantization -- FP32 -> INT8
     - 4x reduction in model size
     - 2-4x speedup

  3. Caching -- embeddings, frequent predictions
     - Redis for hot data
     - TTL-based eviction

  4. Feature precomputation -- offline feature calculation
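
A sketch of the dynamic batching idea from item 1, assuming an asyncio service and a batched model_fn callable; production servers such as Triton or TorchServe provide this natively:

```python
# Dynamic micro-batching sketch; model_fn is an assumed batched-inference callable.
import asyncio

class MicroBatcher:
    def __init__(self, model_fn, max_batch=32, max_wait=0.005):
        self.model_fn, self.max_batch, self.max_wait = model_fn, max_batch, max_wait
        self.queue: asyncio.Queue = asyncio.Queue()

    async def infer(self, x):
        # Callers await a future that the batch loop resolves.
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((x, fut))
        return await fut

    async def run(self):
        while True:
            items = [await self.queue.get()]            # block for the first request
            deadline = asyncio.get_running_loop().time() + self.max_wait
            while len(items) < self.max_batch:
                timeout = deadline - asyncio.get_running_loop().time()
                if timeout <= 0:
                    break
                try:
                    items.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            inputs, futs = zip(*items)
            for fut, pred in zip(futs, self.model_fn(list(inputs))):  # one GPU call
                fut.set_result(pred)
```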

4. Feature Store

Components:

graph TD
    A["Feature Writing<br/>Batch + Streaming<br/>Offline & Online features"] --> B["Storage Layer<br/>Redis online<br/>Cassandra offline<br/>S3 historical"]
    B --> C["Serving Layer<br/>Feature API<br/>Point lookups<br/>Batch joins"]

    style A fill:#e8f5e9,stroke:#4caf50
    style B fill:#e8eaf6,stroke:#3f51b5
    style C fill:#fff3e0,stroke:#ef6c00

Trade-offs:

| Store     | Latency | Throughput | Use Case         |
|-----------|---------|------------|------------------|
| Redis     | <1ms    | High       | Hot features     |
| Cassandra | <10ms   | Very high  | Offline features |
| S3        | N/A     | Low        | Historical data  |
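
A sketch of the online serving path using redis-py; the features:{entity_id} key layout and JSON value encoding are assumptions:

```python
# Online feature read/write; the key layout is an assumed convention.
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def write_online(entity_id: str, features: dict) -> None:
    # One Redis hash per entity; values JSON-encoded to allow mixed types.
    r.hset(f"features:{entity_id}",
           mapping={k: json.dumps(v) for k, v in features.items()})

def read_online(entity_id: str, names: list) -> dict:
    # Point lookup, typically sub-millisecond for hot keys.
    raw = r.hmget(f"features:{entity_id}", names)
    return {n: (json.loads(v) if v is not None else None)
            for n, v in zip(names, raw)}
```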

5. Model Monitoring & Observability

Metrics:

  1. Prediction drift (see the PSI sketch after this list)
     - PSI (Population Stability Index)
     - KL divergence

  2. Model performance
     - Accuracy, precision, recall over time
     - Calibration metrics

  3. System health
     - Latency (p50, p95, p99)
     - Error rates
     - Resource utilization
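
A sketch of the PSI computation from item 1, binning a reference (training-time) distribution against live traffic; the 10-bin quantile scheme and the thresholds are common conventions, not a standard:

```python
# Population Stability Index between reference and live distributions.
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf        # absorb out-of-range live values
    e = np.histogram(expected, edges)[0] / len(expected)
    a = np.histogram(actual, edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)  # avoid log(0)
    return float(np.sum((a - e) * np.log(a / e)))

# Rule of thumb: PSI < 0.1 stable, 0.1-0.2 moderate shift, > 0.2 alert.
```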

Architectural pattern:

graph LR
    A["Inference"] --> B["Metrics Collector"]
    B --> C["Time Series DB"]
    C --> D["Alerting"]
    A --> E["Predictions"]
    B --> F["Prometheus"]
    C --> G["Grafana"]
    D --> H["PagerDuty"]

    style A fill:#e8eaf6,stroke:#3f51b5
    style B fill:#e8f5e9,stroke:#4caf50
    style C fill:#f3e5f5,stroke:#9c27b0
    style D fill:#fce4ec,stroke:#c62828
    style E fill:#e8eaf6,stroke:#3f51b5
    style F fill:#e8f5e9,stroke:#4caf50
    style G fill:#f3e5f5,stroke:#9c27b0
    style H fill:#fce4ec,stroke:#c62828

Alerting rules (Prometheus-style; the metric names are illustrative):

- alert: HighLatency
  expr: histogram_quantile(0.99, sum(rate(inference_latency_seconds_bucket[5m])) by (le)) > 0.1  # 0.1s = 100ms
  for: 5m

- alert: AccuracyDrop
  expr: model_accuracy < 0.8
  for: 10m

Formulas and Calculations

Capacity Planning

\[ \text{Required Instances} = \frac{\text{QPS} \times \text{Latency}}{\text{Concurrency}} \]

Example:

  - QPS: 10,000 requests/second
  - Target latency: 50ms
  - Concurrency: 10 requests per instance

\[ \text{Instances} = \frac{10000 \times 0.05}{10} = 50 \]

Model Sizing

\[ \text{Memory} = \text{Model Size} \times \text{Batch Size} \times \text{Overhead} \]

GPU Memory: $$ \text{Required VRAM} = \text{Model} + \text{Optimizer} + \text{Gradients} + \text{Activations} $$
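
A hedged worked example for training memory, assuming FP32 weights and the Adam optimizer: weights take 4 bytes/parameter, gradients another 4, and Adam's two moment buffers 8 more, so roughly

\[ \text{Required VRAM} \approx 16 \,\text{bytes} \times N_{\text{params}} + \text{Activations} \]

A 1B-parameter model therefore needs on the order of 16 GB plus activation memory (which scales with batch size and architecture); INT8 inference needs roughly 1 byte per parameter.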

Latency Budget

\[ \text{Total Latency} = \text{Network} + \text{Preprocessing} + \text{Inference} + \text{Postprocessing} \]

SLO example:

  - Total: 100ms
  - Network: 20ms
  - Preprocessing: 10ms
  - Inference: 60ms
  - Postprocessing: 10ms

Preparation Questions

General System Design

  1. "Design a system that does X"
  2. "How would you scale this to Y users?"
  3. "What are the bottlenecks?"
  4. "How would you add feature X?"

ML-Specific

  1. "Design an experiment tracking system"
  2. "Design a real-time inference system"
  3. "Design a feature store"
  4. "Design a recommendation system"

Follow-up

  1. "How would you A/B test this?"
  2. "How would you monitor this?"
  3. "What if component X fails?"
  4. "How would you migrate to a new model?"

Practical Tips

Design Process

  1. Clarify requirements
     - Scale (QPS, data size)
     - Latency requirements
     - Consistency needs
     - Budget constraints

  2. High-level design
     - Components and their interactions
     - Data flow
     - Technology choices

  3. Deep dive
     - Each component in detail
     - API design
     - Data models

  4. Bottlenecks and scaling
     - Identify bottlenecks
     - Propose solutions
     - Calculate capacity

  5. Trade-offs
     - Discuss alternatives
     - Justify choices

Common patterns

| Problem         | Solution                              |
|-----------------|---------------------------------------|
| High latency    | Caching, batching, model quantization |
| Low throughput  | Horizontal scaling, load balancing    |
| Model staleness | Canary deployment, blue-green         |
| Feature drift   | Monitoring, retraining pipeline       |
| Cold start      | Pre-computation, hybrid models        |

Misconception: system design = just an architecture diagram

Interviewers evaluate 5 aspects: (1) clarification questions, (2) capacity estimation with concrete numbers, (3) component design, (4) bottleneck analysis, (5) trade-off discussion. Candidates who jump straight to drawing boxes skip 60% of the evaluation. Always start with "what QPS is expected?" and "what is the latency SLO?"

Misconception: a feature store is just Redis

A feature store solves 3 problems: (1) train-serve skew (offline features are computed differently than online ones), (2) feature reuse across teams, (3) point-in-time correctness for historical data. Redis is only the online serving layer. Without an offline store (Hive/S3) and a feature registry, it is not a feature store, just a cache.
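
Point-in-time correctness is the subtlest of the three; a small pandas sketch of a point-in-time join, with hypothetical column names:

```python
# Each label row gets the latest feature value known at or before the label
# timestamp, so no future information leaks into training data.
import pandas as pd

labels = pd.DataFrame({"user_id": [1, 1],
                       "ts": pd.to_datetime(["2025-01-10", "2025-02-01"]),
                       "y": [0, 1]})
features = pd.DataFrame({"user_id": [1, 1],
                         "ts": pd.to_datetime(["2025-01-01", "2025-01-20"]),
                         "spend_30d": [10.0, 42.0]})

train = pd.merge_asof(
    labels.sort_values("ts"), features.sort_values("ts"),
    on="ts", by="user_id", direction="backward",  # only past feature values
)
```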

Misconception: monitoring = logging metrics

Production ML monitoring spans 3 levels: (1) system health (p99 latency, error rate, CPU/GPU utilization), (2) data quality (feature drift via PSI > 0.2, missing values > 5%), (3) model performance (accuracy degradation > 2%, calibration drift). Without alerting at all 3 levels, a model can degrade unnoticed for months.

Interview Examples

"Design a real-time ML inference system для 10K QPS"

❌ "Put the model behind a load balancer and add GPUs. If it's slow, add more GPUs."

✅ "Clarify: 10K QPS, p99 < 100ms, model size 500MB. Capacity: 10000 * 0.05 / 10 = 50 instances. Architecture: (1) Load balancer -> inference pods (horizontal auto-scaling). (2) Model store (S3 + local cache) with versioning for A/B tests. (3) Feature store (Redis online, <1ms lookup). (4) Optimizations: dynamic batching (batch_size=32, wait_time=5ms), model quantization FP32->INT8 (4x memory reduction, 2x speedup), embedding cache (Redis, TTL=1h, hit rate ~60%). (5) Monitoring: Prometheus for latency/throughput, custom drift detector (PSI every 15 min). Bottleneck: GPU memory; mitigations: model sharding across GPUs or distillation to shrink the model."

"Как бы вы спроектировали A/B testing platform для ML моделей?"

❌ "Split traffic 50/50 at random and compare metrics after a week."

✅ "(1) Traffic splitting: consistent hashing by user_id (not random, so the same user always stays in the same group). (2) Staged rollout: shadow mode (0% traffic, compare predictions) -> 1% canary -> 5% -> 50%. (3) Metrics: primary (CTR, revenue), guardrail (p99 latency, error rate), secondary (engagement depth). (4) Statistical rigor: pre-computed sample size (MDE=1%, alpha=0.05, power=0.8 -> ~16K users per group), sequential testing (CUPED for variance reduction, 30-50% shorter experiments). (5) Automation: auto-rollback if guardrail metrics degrade by > 2 sigma. (6) Logging: feature values + predictions + outcomes for offline analysis."
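
A sketch of the deterministic split from point (1); the bucket count and hash choice are assumptions:

```python
# Consistent, per-experiment bucketing: same user -> same group, every request.
import hashlib

def bucket(user_id: str, experiment: str, n_buckets: int = 100) -> int:
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % n_buckets   # uniform over buckets, stable per user

def variant(user_id: str, experiment: str, treatment_pct: int = 1) -> str:
    # Salting by experiment name decorrelates assignments across experiments.
    return "treatment" if bucket(user_id, experiment) < treatment_pct else "control"
```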

"Что произойдет, если Feature Store упадет?"

❌ "The system stops working. We need a backup."

✅ "Graceful degradation strategy: (1) Online features (Redis): fall back to default values (population means) with logging for monitoring. Cache hit rate is ~60%; quality loss on default values is ~5-10%. (2) Circuit breaker: if Redis latency > 10ms, switch to a local cache (stale features, TTL=5min). (3) Offline features (Cassandra): replicated across datacenters, failover < 1s. (4) Monitoring: alert on miss rate > 20%, degraded-mode dashboard. (5) Recovery: Redis is rebuilt from a Cassandra snapshot (warm-up ~10min). The key point: the inference service must NOT go down because of the feature store; always serve, sometimes degraded."
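
A sketch of this read path, a circuit breaker plus default-value fallback, with assumed thresholds and a redis-py client:

```python
# Graceful degradation: Redis -> stale local cache -> population defaults.
import time

class FeatureClient:
    def __init__(self, redis_client, defaults: dict, cooldown: float = 30.0):
        self.redis = redis_client
        self.defaults = defaults     # e.g. population means per feature
        self.stale = {}              # last good values, refreshed on healthy reads
        self.open_until = 0.0        # circuit-breaker state
        self.cooldown = cooldown

    def get(self, entity_id: str) -> dict:
        if time.monotonic() < self.open_until:       # breaker open: skip Redis
            return self.stale.get(entity_id, self.defaults)
        try:
            raw = self.redis.hgetall(f"features:{entity_id}")
        except Exception:
            self.open_until = time.monotonic() + self.cooldown
            return self.stale.get(entity_id, self.defaults)
        if not raw:                                  # miss: serve defaults, log it
            return self.defaults
        feats = {k.decode(): float(v) for k, v in raw.items()}
        self.stale[entity_id] = feats
        return feats
```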

My Notes

Key trends 2025-2026:

  1. AI reasoning -- Design questions now assume AI assistance
  2. LLM integration -- RAG, embeddings in feature stores
  3. Real-time personalization -- Lower latency expectations
  4. Cost optimization -- GPU resource sharing, spot instances

Critical skills:

  - Back-of-the-envelope calculations
  - Technology selection justification
  - Bottleneck identification
  - Trade-off articulation

Gaps remaining:

  - [ ] LLM-specific system design (RAG systems)
  - [ ] Multi-agent system design
  - [ ] Edge ML deployment
  - [ ] Privacy-preserving ML systems

Practice resources:

  - System Design Primer
  - Designing Data-Intensive Applications
  - LeetCode System Design