
ML System Design Interview Patterns

~6 min read

Prerequisites: LLM Evaluation Metrics | Production LLM Deployment

ML System Design is the highest-weight round in FAANG ML interviews (45 min, 30-40% of the overall score). The key difference from software system design: you are evaluated not only on the architecture, but also on metric selection, the accuracy-vs-latency trade-off, and your understanding of the data pipeline. The top-5 questions of 2026: a ChatGPT-like system, RAG at scale, LLM reranking, content moderation, and model serving at 1M QPS. The RESHADED framework covers 8 phases in 40 minutes and appears in 70%+ of preparation materials.


Key Concepts

RESHADED Interview Framework

| Phase | Time | Focus |
|---|---|---|
| R: Requirements | 5 min | Functional, non-functional, scale |
| E: Evaluation Metrics | 3 min | Business + ML + system metrics |
| S: System Components | 10 min | Data pipeline, ML pipeline, API layer |
| H: High-Level Architecture | 5 min | Components, data flow, bottlenecks |
| A: Advanced Details | 7 min | Scaling, failures, cost optimization |
| D: Data Management | 5 min | Storage (SQL, NoSQL, vector DB), privacy |
| E: Edge Cases | 3 min | 10x traffic spike, model failure, data drift |
| D: Deploy & Monitor | 2 min | Deployment strategy, A/B testing, alerting |

2026 Top Questions

| Question | Difficulty | Frequency |
|---|---|---|
| Design ChatGPT-like system | Hard | Very High |
| Design RAG at scale | Medium-Hard | Very High |
| Design recommendation with LLM reranking | Hard | High |
| Design content moderation pipeline | Medium | High |
| Design model serving for 1M QPS | Very Hard | Medium |

1. Core ML Pipelines

Data Processing Pipeline

graph LR
    A["Raw Data"] --> B["Ingestion"]
    B --> C["Validation"]
    C --> D["Feature Engineering"]
    D --> E["Storage"]

    style A fill:#e8eaf6,stroke:#3f51b5
    style B fill:#e8eaf6,stroke:#3f51b5
    style C fill:#fff3e0,stroke:#ef6c00
    style D fill:#f3e5f5,stroke:#9c27b0
    style E fill:#e8f5e9,stroke:#4caf50

Batch vs Stream processing, data quality checks, feature store integration (Tecton, Feast), data versioning (DVC).
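
As a sketch of the data quality check step, assuming batches arrive as pandas DataFrames; the schema and null-rate budget below are illustrative, not taken from any particular framework:

```python
import pandas as pd

# Illustrative expected schema: column name -> pandas dtype string
EXPECTED_SCHEMA = {
    "user_id": "int64",
    "event_ts": "datetime64[ns]",
    "amount": "float64",
}
MAX_NULL_RATE = 0.01  # assumed per-column data-quality budget

def validate_batch(df: pd.DataFrame) -> list[str]:
    """Return the list of data-quality violations for one raw batch."""
    errors = []
    for col, dtype in EXPECTED_SCHEMA.items():
        if col not in df.columns:
            errors.append(f"missing column: {col}")
            continue
        if str(df[col].dtype) != dtype:
            errors.append(f"{col}: got dtype {df[col].dtype}, expected {dtype}")
        null_rate = df[col].isna().mean()
        if null_rate > MAX_NULL_RATE:
            errors.append(f"{col}: null rate {null_rate:.2%} exceeds {MAX_NULL_RATE:.0%}")
    return errors
```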

Model Training Pipeline

graph LR
    A["Features"] --> B["Algorithm Selection"]
    B --> C["Training"]
    C --> D["Evaluation"]
    D --> E["Registry"]
    E --> F["Deployment"]

    style A fill:#e8eaf6,stroke:#3f51b5
    style B fill:#e8eaf6,stroke:#3f51b5
    style C fill:#fff3e0,stroke:#ef6c00
    style D fill:#fff3e0,stroke:#ef6c00
    style E fill:#f3e5f5,stroke:#9c27b0
    style F fill:#e8f5e9,stroke:#4caf50

Experiment tracking (MLflow, W&B), hyperparameter optimization, model versioning, A/B testing framework.
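
A minimal experiment-tracking sketch using MLflow's public API; the model choice, run name, and metric are placeholders for whatever the training pipeline actually produces:

```python
import mlflow
import mlflow.sklearn
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

def train_and_track(X_train, y_train, X_val, y_val, params: dict):
    """Train one candidate model and record it for the model registry."""
    with mlflow.start_run(run_name="gbdt-baseline"):  # run name is illustrative
        mlflow.log_params(params)
        model = GradientBoostingClassifier(**params).fit(X_train, y_train)
        auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
        mlflow.log_metric("val_auc", auc)         # evaluation gate before registry
        mlflow.sklearn.log_model(model, "model")  # versioned model artifact
        return model, auc
```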

Serving Infrastructure

graph LR
    A["Model"] --> B["Preprocessing"]
    B --> C["Inference"]
    C --> D["Monitoring"]
    D --> E["Feedback Loop"]
    E -.-> A

    style A fill:#e8eaf6,stroke:#3f51b5
    style B fill:#fff3e0,stroke:#ef6c00
    style C fill:#f3e5f5,stroke:#9c27b0
    style D fill:#e8f5e9,stroke:#4caf50
    style E fill:#fce4ec,stroke:#c62828

Batch vs Online inference, caching strategies, model monitoring (data drift, performance), horizontal/vertical scaling.


2. Design ChatGPT-like System

Requirements

  • Functional: multi-turn conversations, streaming (token-by-token), various tasks (Q&A, coding, creative)
  • Non-functional: TTFT < 500ms, TBT < 50ms, 99.9% availability
  • Scale: 1M DAU, avg 10 turns/conversation, 500 tokens/request, 50K peak QPS

Architecture

graph TD
    A["Client"] --> B["API Gateway"]
    B --> C["Rate Limiter<br/>(Redis)"]
    C --> D["Load Balancer"]
    D --> E["Orchestrator<br/>(Prompt Validation, Context Manager, Safety Filter)"]
    E --> F1["LLM Pool<br/>(70B)"]
    E --> F2["LLM Pool<br/>(70B)"]
    E --> F3["LLM Pool<br/>(7B small)"]
    F1 --> G["Response Handler<br/>(Streaming SSE, Output filtering, Logging)"]
    F2 --> G
    F3 --> G
    G --> H1["Redis<br/>(Cache)"]
    G --> H2["PostgreSQL<br/>(History)"]
    G --> H3["S3<br/>(Logs)"]

    style A fill:#e8eaf6,stroke:#3f51b5
    style B fill:#e8eaf6,stroke:#3f51b5
    style C fill:#fce4ec,stroke:#c62828
    style D fill:#e8eaf6,stroke:#3f51b5
    style E fill:#fff3e0,stroke:#ef6c00
    style F1 fill:#f3e5f5,stroke:#9c27b0
    style F2 fill:#f3e5f5,stroke:#9c27b0
    style F3 fill:#f3e5f5,stroke:#9c27b0
    style G fill:#e8f5e9,stroke:#4caf50
    style H1 fill:#e8eaf6,stroke:#3f51b5
    style H2 fill:#e8eaf6,stroke:#3f51b5
    style H3 fill:#e8eaf6,stroke:#3f51b5

Key Design Decisions

| Component | Decision | Trade-off |
|---|---|---|
| Model serving | vLLM with PagedAttention | Memory efficiency vs complexity |
| Context | KV cache + conversation store | Memory vs latency |
| Scaling | Horizontal + model sharding | Complexity vs throughput |
| Streaming | Server-Sent Events | Simplicity vs WebSocket overhead |
| Safety | Input/output filtering | Latency vs safety |
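
To illustrate the SSE decision from the table, a minimal token-streaming endpoint using FastAPI's StreamingResponse; generate_tokens() is a hypothetical stand-in for the LLM pool's streaming inference call:

```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def generate_tokens(prompt: str):
    """Hypothetical stand-in for the LLM pool's streaming inference call."""
    for token in ["Hello", ",", " world", "!"]:
        yield token

@app.post("/chat")
async def chat(prompt: str):
    async def event_stream():
        async for token in generate_tokens(prompt):
            yield f"data: {token}\n\n"  # one SSE event per token (TBT target)
        yield "data: [DONE]\n\n"        # end-of-stream sentinel
    return StreamingResponse(event_stream(), media_type="text/event-stream")
```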

3. Design RAG at Scale

Indexing Pipeline (Offline)

graph LR
    A["Documents"] --> B["Chunker<br/>(512 tokens, overlap=50)"]
    B --> C["Embedder<br/>(1536-dim)"]
    C --> D["Vector DB"]

    style A fill:#e8eaf6,stroke:#3f51b5
    style B fill:#fff3e0,stroke:#ef6c00
    style C fill:#f3e5f5,stroke:#9c27b0
    style D fill:#e8f5e9,stroke:#4caf50
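
A sketch of the chunking step with the parameters from the diagram (512 tokens, overlap 50); whitespace splitting stands in for a real tokenizer such as the embedding model's own:

```python
def chunk_document(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split a document into overlapping chunks.

    Whitespace tokens approximate real tokens here; a production pipeline
    would count tokens with the embedding model's own tokenizer.
    """
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, max(len(tokens) - overlap, 1), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
    return chunks
```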

Query Pipeline (Online)

graph LR
    A["Query"] --> B["Query Rewrite<br/>(HyDE / Multi-query)"]
    B --> C["Embed"]
    C --> D["Retrieval<br/>(Top-100)"]
    D --> E["Rerank<br/>(Cross-Encoder, Top-20)"]
    E --> F["Context Assembly"]
    F --> G["LLM Generation"]

    style A fill:#e8eaf6,stroke:#3f51b5
    style B fill:#fff3e0,stroke:#ef6c00
    style C fill:#f3e5f5,stroke:#9c27b0
    style D fill:#e8f5e9,stroke:#4caf50
    style E fill:#fff3e0,stroke:#ef6c00
    style F fill:#e8eaf6,stroke:#3f51b5
    style G fill:#f3e5f5,stroke:#9c27b0

Scaling

| Challenge | Solution | Details |
|---|---|---|
| Vector search latency | HNSW + sharding | Sub-100ms at 1B vectors |
| Index updates | Async reindexing | Zero-downtime updates |
| High QPS | Query caching | 50-80% cache hit rate |
| Context length | Hierarchical retrieval | Coarse -> fine search |
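
A single-shard retrieval sketch using the hnswlib library; in the sharded setup above, a router would fan the query out to many such indexes and merge results. Vector values and index sizes are illustrative:

```python
import numpy as np
import hnswlib

DIM = 1536  # embedding size from the indexing pipeline
index = hnswlib.Index(space="cosine", dim=DIM)
index.init_index(max_elements=1_000_000, ef_construction=200, M=16)

vectors = np.random.rand(10_000, DIM).astype(np.float32)  # stand-in embeddings
index.add_items(vectors, ids=np.arange(10_000))

index.set_ef(100)  # search-time recall/latency knob
query = np.random.rand(1, DIM).astype(np.float32)
labels, distances = index.knn_query(query, k=100)  # Top-100 for the reranker
```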

Vector DB Selection

| System | Scale | Latency | Use Case |
|---|---|---|---|
| Pinecone | 1B+ | 50-100ms | Managed, production |
| Milvus | 10B+ | 30-80ms | Self-hosted, large scale |
| Weaviate | 1B+ | 40-100ms | Hybrid search |
| Qdrant | 1B+ | 20-50ms | Rust-based, fast |

4. Design Recommendation with LLM Reranking

Stage 1: Candidate Generation (Traditional ML)

graph LR
    A["User Features"] --> B["Collaborative Filter<br/>(Matrix Factorization / LightFM)"]
    A --> C["Two-Tower NN<br/>(User Emb · Item Emb)"]
    B --> D["1000 items<br/>Latency: 10-50ms"]
    C --> D

    style A fill:#e8eaf6,stroke:#3f51b5
    style B fill:#fff3e0,stroke:#ef6c00
    style C fill:#f3e5f5,stroke:#9c27b0
    style D fill:#e8f5e9,stroke:#4caf50

Stage 2: LLM Reranking

graph LR
    A["Top 100 candidates<br/>+ User context"] --> B["LLM prompt"]
    B --> C["Reranked top 20<br/>Latency: 200-500ms"]

    style A fill:#e8eaf6,stroke:#3f51b5
    style B fill:#f3e5f5,stroke:#9c27b0
    style C fill:#e8f5e9,stroke:#4caf50

LLM4Rerank (WWW 2025): multi-objective (accuracy, fairness, diversity), batch inference, user preference retrieval (UR4Rec).
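
A sketch of building the listwise reranking prompt; the template and field names are illustrative, and an LLM4Rerank-style system would add per-objective instructions (accuracy, fairness, diversity) to the same template:

```python
def build_rerank_prompt(user_context: str, candidates: list[dict], k: int = 20) -> str:
    """Pack the top-N candidates into one listwise reranking prompt.

    The prompt format and candidate fields are illustrative; the downstream
    LLM call and output parsing are not shown.
    """
    items = "\n".join(
        f"{i}. {c['title']} -- {c['description'][:100]}"
        for i, c in enumerate(candidates)
    )
    return (
        f"User context: {user_context}\n\n"
        f"Candidate items:\n{items}\n\n"
        f"Return the indices of the {k} most relevant items, "
        f"balancing relevance and diversity, as a JSON list."
    )
```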

Traditional Recommendation Architecture

\[\text{Pipeline} = \text{Candidate Generation} \to \text{Scoring} \to \text{Ranking} \to \text{Re-ranking}\]
  • Candidate Generation: collaborative filtering, content-based, hybrid
  • Scoring: ML model for relevance
  • Ranking: learning-to-rank
  • Re-ranking: heuristic rules or second ML model
  • Retrieval: ANN (HNSW, Faiss), vector DB

5. Design Content Moderation Pipeline

Multi-Stage Classification

| Stage | Latency | Coverage | Method |
|---|---|---|---|
| Rule-based | 1ms | 10% blocked | Blocked words, regex, length checks |
| Fast ML | 10ms | 80% auto-decided | Small BERT (110M), multi-label, threshold 0.8 |
| LLM judge | 500ms | 8% escalated | Large LLM (confidence < 0.8) |
| Human | Hours | 2% edge cases | Low confidence, appeals, policy updates |
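
A control-flow sketch of the cascade using the thresholds from the table; the fast classifier and LLM judge are stand-ins, and the symmetric 0.2/0.8 auto-accept/reject band is an assumption:

```python
from enum import Enum

class Verdict(Enum):
    ALLOW = "allow"
    BLOCK = "block"
    ESCALATE_LLM = "escalate_llm"
    ESCALATE_HUMAN = "escalate_human"

BLOCKED_WORDS = {"badword1", "badword2"}  # illustrative rule list

def moderate(post: str, fast_score: float, llm_confidence: float | None = None) -> Verdict:
    """Cascade from the table: rules -> fast ML -> LLM judge -> human.

    fast_score is the small classifier's violation probability;
    llm_confidence is set only after the LLM judge stage has run.
    """
    if any(w in post.lower() for w in BLOCKED_WORDS):  # stage 1, ~1ms
        return Verdict.BLOCK
    if fast_score >= 0.8:                              # stage 2: auto-reject
        return Verdict.BLOCK
    if fast_score <= 0.2:                              # stage 2: auto-accept
        return Verdict.ALLOW
    if llm_confidence is None:                         # stage 3 needed
        return Verdict.ESCALATE_LLM
    return Verdict.BLOCK if llm_confidence >= 0.8 else Verdict.ESCALATE_HUMAN
```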

6. Search Ranking System

\[\text{Pipeline} = \text{Query} \to \text{Retrieval} \to \text{Ranking} \to \text{Re-ranking}\]
  • Indexing: inverted index, Elasticsearch
  • Retrieval: BM25, TF-IDF, neural retrieval
  • Ranking: Learning to Rank (LambdaMART, RankNet)
  • Personalization: user embeddings, click history
  • Re-ranking: diversity, freshness

Scale: index size and update frequency, query latency p50 < 100ms, freshness vs relevance trade-off.
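
A sketch of the BM25 retrieval stage using the rank_bm25 package; the toy corpus and whitespace tokenization are illustrative (a production index would live in Elasticsearch, as above):

```python
from rank_bm25 import BM25Okapi

corpus = [
    "how to reset my password",
    "password reset link expired",
    "update billing address",
]
tokenized = [doc.split() for doc in corpus]  # real systems use an analyzer
bm25 = BM25Okapi(tokenized)

query = "reset password".split()
scores = bm25.get_scores(query)          # lexical retrieval scores
top = bm25.get_top_n(query, corpus, n=2) # candidates for the LTR ranking stage
```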


7. Fraud Detection System

\[\text{Pipeline} = \text{Transaction} \to \text{Feature Extraction} \to \text{Model} \to \text{Rule Engine} \to \text{Decision}\]
  • Real-time scoring: online inference (< 50ms)
  • Rule Engine: known patterns, blacklists
  • Labeling pipeline: human review for suspicious cases
  • Feature Store: user behavior history, device fingerprinting
  • Monitoring: drift detection, performance degradation

Trade-offs: precision vs recall (false positives vs fraud loss), real-time vs batch, explainability vs accuracy.
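
A sketch of the decision step that combines the model score with the rule engine; thresholds and rules are illustrative, and in production they come from a versioned rule store:

```python
BLACKLIST: set[str] = {"card_123"}  # illustrative known-bad list

def fraud_decision(txn: dict, model_score: float) -> str:
    """Combine rule engine and ML score into an allow/review/block decision.

    Thresholds are illustrative; in practice they are tuned on the
    precision-vs-recall (false positives vs fraud loss) trade-off.
    """
    if txn["card_id"] in BLACKLIST:  # rule engine: known patterns first
        return "block"
    if txn["amount"] > 10_000 and txn.get("is_new_device"):
        return "review"              # feeds the human labeling pipeline
    if model_score > 0.9:            # real-time model, < 50ms budget
        return "block"
    if model_score > 0.6:
        return "review"
    return "allow"
```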


8. Ad Click Prediction

  • Feature Engineering: user features, ad features, contextual features
  • Model: logistic regression, gradient boosting (XGBoost, LightGBM)
  • Serving: low latency prediction (online learning)
  • Exploration: multi-armed bandit (UCB, Thompson Sampling)
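
A minimal Thompson sampling sketch for the exploration bullet above, modeling per-ad CTR with a Beta-Bernoulli posterior:

```python
import numpy as np

class ThompsonSampler:
    """Beta-Bernoulli Thompson sampling over ad candidates."""

    def __init__(self, n_ads: int):
        self.clicks = np.ones(n_ads)  # Beta prior alpha = 1
        self.skips = np.ones(n_ads)   # Beta prior beta = 1

    def select(self) -> int:
        """Sample a CTR estimate per ad and show the argmax."""
        samples = np.random.beta(self.clicks, self.skips)
        return int(np.argmax(samples))

    def update(self, ad: int, clicked: bool) -> None:
        """Fold the observed impression back into the posterior."""
        if clicked:
            self.clicks[ad] += 1
        else:
            self.skips[ad] += 1
```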

9. Model Serving for 1M QPS

Architecture

graph TD
    A["DNS"] --> B["Geo-LB"]
    B --> C["Regional LB"]
    C --> D["Service Mesh"]
    D --> E1["Region US<br/>(300K QPS)"]
    D --> E2["Region EU<br/>(400K QPS)"]
    D --> E3["Region AP<br/>(300K QPS)"]

    style A fill:#e8eaf6,stroke:#3f51b5
    style B fill:#e8eaf6,stroke:#3f51b5
    style C fill:#fff3e0,stroke:#ef6c00
    style D fill:#f3e5f5,stroke:#9c27b0
    style E1 fill:#e8f5e9,stroke:#4caf50
    style E2 fill:#e8f5e9,stroke:#4caf50
    style E3 fill:#e8f5e9,stroke:#4caf50

Per-region request flow:
  • Request Router -> Semantic Cache (Redis + embeddings, ~50% hit rate)
  • Cache miss -> Model Inference (GPU pools: A100 x100, H100 x50)
  • Tiered serving: simple queries -> 7B model (10ms), complex -> 70B model (100ms)
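
A sketch of the semantic cache lookup, where a hit is any cached query whose embedding is close enough to the new one; embeddings are assumed to come from an external service, and the 0.95 threshold is illustrative:

```python
import numpy as np

class SemanticCache:
    """Embedding-keyed cache: a hit is any stored query within a cosine
    similarity threshold of the new one."""

    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.keys: list[np.ndarray] = []  # unit-norm query embeddings
        self.values: list[str] = []       # cached model responses

    def get(self, query_emb: np.ndarray) -> str | None:
        if not self.keys:
            return None
        q = query_emb / np.linalg.norm(query_emb)
        sims = np.stack(self.keys) @ q    # cosine similarity to all keys
        best = int(np.argmax(sims))
        return self.values[best] if sims[best] >= self.threshold else None

    def put(self, query_emb: np.ndarray, response: str) -> None:
        self.keys.append(query_emb / np.linalg.norm(query_emb))
        self.values.append(response)
```

In production the linear scan becomes an ANN lookup (e.g., the HNSW index from section 3), and the cache itself lives in Redis per the diagram above.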

Cost Optimization

| Strategy | Savings | Trade-off |
|---|---|---|
| Semantic caching | 50-70% | Latency for cache lookup |
| Model tiering | 40-60% | Quality variance |
| Batch inference | 30-50% | Latency increase |
| Quantization | 20-40% | Quality loss |
| Spot instances | 60-80% | Availability risk |

10. Capacity Planning

QPS Estimation

\[\text{Peak QPS} = \frac{\text{Users} \times \text{Requests/User/Day}}{\text{Peak Hours} \times 3600}\]

Example: 1M users × 10 req/user/day = 10M req/day; concentrated in a 4-hour peak window: 10M / (4 × 3600) ≈ 694 QPS

Storage Estimation

\[\text{Storage} = \text{Model Size} \times \text{Replicas} \times \text{Redundancy Factor}\]

Example: 1GB model, 10 replicas, 3x redundancy = 30GB
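
Both estimates as a runnable back-of-envelope helper; the printed values match the two examples above:

```python
def peak_qps(users: int, req_per_user_day: float, peak_hours: float) -> float:
    """Peak QPS, assuming all daily traffic falls inside the peak window."""
    return users * req_per_user_day / (peak_hours * 3600)

def storage_gb(model_gb: float, replicas: int, redundancy: int) -> float:
    """Total storage for a replicated, redundant model artifact."""
    return model_gb * replicas * redundancy

print(round(peak_qps(1_000_000, 10, 4)))  # 694
print(storage_gb(1, 10, 3))               # 30.0 GB
```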


Interview-Relevant Numbers

Latency Targets

| System | Target |
|---|---|
| Chatbot (TTFT) | < 500ms |
| Chatbot (TBT) | < 50ms |
| Search | < 200ms |
| Recommendation | < 100ms |
| Fraud detection | < 50ms |
| Content moderation | < 500ms |

Scale Benchmarks

| Scale | QPS | Infrastructure |
|---|---|---|
| Small | 1K | 10 GPUs |
| Medium | 100K | 100 GPUs + cache |
| Large | 1M | 1000 GPUs + geo-distribution |

Cost Targets

| Model | Cost/1K Tokens | Infrastructure |
|---|---|---|
| 7B (quantized) | $0.001 | Single A10 |
| 70B (quantized) | $0.01 | 4x A100 |
| 175B | $0.03 | 8x H100 |

Cache Hit Rates

| Cache Type | Hit Rate | Latency Reduction |
|---|---|---|
| Exact match | 20-30% | 100% |
| Semantic cache | 50-70% | 90% |
| KV cache (prefix) | 40-60% | 70% |

For the Interview

Q: "Design a real-time recommendation system with LLM reranking."

Two-stage: (1) Candidate generation -- collaborative filtering / two-tower neural network, 10-50ms, 1000 candidates. (2) LLM reranking -- top 100 candidates + user context passed through the LLM, 200-500ms, reranked top 20. Trade-off: the LLM adds 100-500ms of latency but improves CTR by 10-20%. Scale: ANN retrieval (HNSW, Faiss), batch LLM inference, caching of popular recommendations.

Q: "How do you handle model drift in production?"

(1) Monitoring: track prediction distribution, feature importance drift. (2) Alerting: thresholds on KL divergence, PSI (Population Stability Index). (3) Automated retraining: trigger when drift exceeds threshold. (4) A/B testing: compare new model vs champion. (5) Fallback: keep previous version for quick rollback.
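
A minimal PSI implementation over model scores, as used in step (2); the thresholds in the docstring are a common rule of thumb, not a universal standard:

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between training-time and production scores.

    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate drift,
    > 0.25 significant drift -> retraining trigger.
    """
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    # Clip production scores into the reference range so outliers
    # land in the edge bins instead of being dropped.
    actual = np.clip(actual, edges[0], edges[-1])
    e_frac = np.histogram(expected, edges)[0] / len(expected)
    a_frac = np.histogram(actual, edges)[0] / len(actual)
    e_frac = np.clip(e_frac, 1e-6, None)  # avoid division by / log of zero
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))
```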

Q: "Design content moderation for 10M posts/day."

4-stage pipeline: (1) Rule-based filter (1ms, 10% blocked). (2) Fast ML classifier (10ms, 80% auto-decided, small BERT, threshold 0.8). (3) LLM judge (500ms, 8% escalated, confidence < 0.8). (4) Human review (hours, 2% edge cases). Active learning from human decisions feeds back to ML classifier.

Common Pitfalls

  • Jumping to solution before clarifying requirements
  • Ignoring trade-offs (every decision has pros/cons)
  • Forgetting non-functional requirements (latency and cost often matter more than accuracy)
  • Over-engineering (start simple, iterate)
  • Not mentioning failure modes
  • Ignoring monitoring

Green Flags

| Behavior | Why |
|---|---|
| Asks clarifying questions | Systematic thinking |
| Discusses trade-offs | Experience |
| Starts simple, scales up | Pragmatic |
| Considers failure modes | Production mindset |
| Mentions monitoring | Operational awareness |

Misconception: ML System Design is just software system design with an ML model

The biggest candidate mistake is drawing boxes-and-arrows without the ML specifics. Interviewers expect: metric selection (offline vs online, proxy vs true), the data pipeline (feature engineering, label quality), the accuracy-vs-latency trade-off, and failure modes (data drift, model degradation). 60%+ of FAANG rejections in this round happen because ML reasoning is missing even when the architecture is correct.

Misconception: you always need the most accurate model

In production, accuracy is not the primary metric. Content moderation at 10M posts/day: a 4-stage pipeline (rules -> small BERT -> LLM -> human) handles 90% of traffic in 1-10ms, and quality is governed by the precision/recall trade-off. Fraud detection: a false positive rate above 1% blocks legitimate users and costs the business more than the missed fraud. A latency target of < 50ms rules out large models entirely.

Misconception: a semantic cache solves LLM serving at scale

A semantic cache with a 50-70% hit rate cuts load, but every cache miss still hits GPU inference. At 1M QPS, even with a 70% hit rate, 300K QPS reach the GPUs -- 1000+ A100s in a geo-distributed setup. The real optimization is model tiering (a 7B model for the 80% of simple queries, 70B for the 20% complex ones) plus quantization and batch inference, which together yield 60-80% savings.


Interview Questions

Q: How would you start designing a ChatGPT-like system in an interview?

❌ Red flag: "Take the GPT-4 API, add Redis for caching and PostgreSQL for history"

✅ Strong answer: "I start with Requirements (5 min): functional (multi-turn, streaming, varied task types), non-functional (TTFT < 500ms, TBT < 50ms, 99.9% availability), scale (1M DAU, 50K peak QPS). Then Evaluation Metrics: online -- TTFT p95, user satisfaction rate; offline -- LLM-judge quality. System Components: API Gateway -> Rate Limiter -> Orchestrator (prompt validation, context manager, safety filter) -> LLM Pool (tiered: 70B + 7B) -> Response Handler (SSE streaming). Key trade-offs: vLLM with PagedAttention for memory efficiency, KV cache for conversation history, model tiering for cost optimization."

Q: How would you scale model serving to 1M QPS?

❌ Red flag: "Add more GPUs"

✅ Strong answer: "Geo-distributed architecture: DNS -> Geo-LB -> 3 regions (US 300K, EU 400K, AP 300K). Per region: (1) Semantic cache (Redis + embeddings, 50-70% hit rate) -- immediately removes half the load. (2) Model tiering -- 7B quantized (10ms) for the 80% of simple queries, 70B (100ms) for the complex ones. (3) GPU pools: A100 x100 + H100 x50 per region. (4) Batch inference for non-streaming traffic. Cost optimization: caching 50-70%, tiering 40-60%, quantization 20-40%, spot instances 60-80%. Total infra cost: $2-5M/month at 1M QPS."

Q: Design content moderation for 10M posts/day -- how do you pick the threshold?

❌ Red flag: "Set the threshold to 0.5 and look at accuracy"

✅ Strong answer: "4-stage cascade: (1) Rule-based (1ms, 10% blocked -- known bad words, regex). (2) Fast ML classifier (10ms, small BERT 110M params, multi-label) -- threshold 0.8 for auto-accept/reject, 80% of traffic. (3) LLM judge (500ms) -- for the 8% with confidence 0.5-0.8. (4) Human review -- 2% edge cases + appeals. The threshold is chosen on the precision-recall trade-off: high precision (0.8+) for auto-block (minimize false positives), high recall (0.95+) for safety-critical content (never miss it). Active learning: human decisions feed back into the ML classifier, and thresholds are reviewed weekly."


Sources

  1. I Got An Offer -- "Generative AI System Design Interview" (Jan 2026)
  2. Exponent -- "ML System Design Interview Guide" (2026)
  3. Medium -- "ML System Design Interview Guide: Complete Framework" (Oct 2025)
  4. GeeksforGeeks -- "Top 25 ML System Design Interview Questions"
  5. Towards Data Science -- "Cracking ML System Design Interviews" (Nov 2025)
  6. InterviewNode -- "Generative AI System Design Interview Patterns" (Nov 2025)
  7. arXiv -- "LLM4Rerank: Auto-Reranking Framework" (2406.12433)
  8. Eugene Yan -- "Improving Recommendation Systems in the Age of LLMs" (Mar 2025)

See Also