# ML System Design Interview Patterns
~6 min read
Prerequisites: LLM Evaluation Metrics | Production Deploy LLM
ML System Design is the highest-weight round in FAANG ML interviews (45 min, 30-40% of the overall score). The key difference from software system design: interviewers evaluate not only the architecture but also metric selection, the accuracy-vs-latency trade-off, and understanding of the data pipeline. The top five questions of 2026: a ChatGPT-like system, RAG at scale, LLM reranking, content moderation, and model serving at 1M QPS. The RESHADED framework covers 8 phases in 40 minutes and appears in 70%+ of preparation materials.
## Key Concepts
### RESHADED Interview Framework
| Phase | Time | Focus |
|---|---|---|
| R Requirements | 5 min | Functional, non-functional, scale |
| E Evaluation Metrics | 3 min | Business + ML + system metrics |
| S System Components | 10 min | Data pipeline, ML pipeline, API layer |
| H High-Level Architecture | 5 min | Components, data flow, bottlenecks |
| A Advanced Details | 7 min | Scaling, failures, cost optimization |
| D Data Management | 5 min | Storage (SQL, NoSQL, vector DB), privacy |
| E Edge Cases | 3 min | 10x traffic spike, model failure, data drift |
| D Deploy & Monitor | 2 min | Deployment strategy, A/B testing, alerting |
### 2026 Top Questions
| Question | Difficulty | Frequency |
|---|---|---|
| Design ChatGPT-like system | Hard | Very High |
| Design RAG at scale | Medium-Hard | Very High |
| Design recommendation with LLM reranking | Hard | High |
| Design content moderation pipeline | Medium | High |
| Design model serving for 1M QPS | Very Hard | Medium |
## 1. Core ML Pipelines
### Data Processing Pipeline
graph LR
A["Raw Data"] --> B["Ingestion"]
B --> C["Validation"]
C --> D["Feature Engineering"]
D --> E["Storage"]
style A fill:#e8eaf6,stroke:#3f51b5
style B fill:#e8eaf6,stroke:#3f51b5
style C fill:#fff3e0,stroke:#ef6c00
style D fill:#f3e5f5,stroke:#9c27b0
style E fill:#e8f5e9,stroke:#4caf50
Batch vs Stream processing, data quality checks, feature store integration (Tecton, Feast), data versioning (DVC).
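A minimal sketch of the validation stage under an assumed feature schema (the `SCHEMA` dict, column names, and rules below are illustrative; in practice this role is usually played by Great Expectations or TFDV):

```python
import pandas as pd

# Illustrative feature schema; columns and rules are assumptions for the sketch.
SCHEMA = {
    "user_id": {"dtype": "int64", "nullable": False},
    "age":     {"dtype": "int64", "min": 0, "max": 120},
    "ctr_7d":  {"dtype": "float64", "min": 0.0, "max": 1.0},
}

def validate_batch(df: pd.DataFrame) -> list[str]:
    """Return data-quality violations for a batch; an empty list means it passes."""
    errors = []
    for col, rules in SCHEMA.items():
        if col not in df.columns:
            errors.append(f"missing column: {col}")
            continue
        if "dtype" in rules and str(df[col].dtype) != rules["dtype"]:
            errors.append(f"{col}: dtype {df[col].dtype}, expected {rules['dtype']}")
        if not rules.get("nullable", True) and df[col].isna().any():
            errors.append(f"nulls in non-nullable column: {col}")
        if "min" in rules and (df[col] < rules["min"]).any():
            errors.append(f"{col}: values below {rules['min']}")
        if "max" in rules and (df[col] > rules["max"]).any():
            errors.append(f"{col}: values above {rules['max']}")
    return errors
```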
### Model Training Pipeline
graph LR
A["Features"] --> B["Algorithm Selection"]
B --> C["Training"]
C --> D["Evaluation"]
D --> E["Registry"]
E --> F["Deployment"]
style A fill:#e8eaf6,stroke:#3f51b5
style B fill:#e8eaf6,stroke:#3f51b5
style C fill:#fff3e0,stroke:#ef6c00
style D fill:#fff3e0,stroke:#ef6c00
style E fill:#f3e5f5,stroke:#9c27b0
style F fill:#e8f5e9,stroke:#4caf50
Experiment tracking (MLflow, W&B), hyperparameter optimization, model versioning, A/B testing framework.
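A minimal experiment-tracking sketch with MLflow (experiment name, params, and the metric value are illustrative; the training call is elided):

```python
import mlflow

mlflow.set_experiment("ranker-v2")  # illustrative experiment name

with mlflow.start_run():
    params = {"learning_rate": 0.05, "num_trees": 500}
    mlflow.log_params(params)

    # model = train(features, labels, **params)   # training step elided
    mlflow.log_metric("val_auc", 0.91)            # offline evaluation result

    # Registering the artifact makes this version addressable in the model
    # registry for staged rollout (staging -> production), e.g.:
    # mlflow.sklearn.log_model(model, "model", registered_model_name="ranker")
```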
### Serving Infrastructure
graph LR
A["Model"] --> B["Preprocessing"]
B --> C["Inference"]
C --> D["Monitoring"]
D --> E["Feedback Loop"]
E -.-> A
style A fill:#e8eaf6,stroke:#3f51b5
style B fill:#fff3e0,stroke:#ef6c00
style C fill:#f3e5f5,stroke:#9c27b0
style D fill:#e8f5e9,stroke:#4caf50
style E fill:#fce4ec,stroke:#c62828
Batch vs Online inference, caching strategies, model monitoring (data drift, performance), horizontal/vertical scaling.
## 2. Design a ChatGPT-like System
### Requirements
- Functional: multi-turn conversations, streaming (token-by-token), various tasks (Q&A, coding, creative)
- Non-functional: TTFT (time to first token) < 500ms, TBT (time between tokens) < 50ms, 99.9% availability
- Scale: 1M DAU, avg 10 turns/conversation, 500 tokens/request, 50K peak QPS
### Architecture
graph TD
A["Client"] --> B["API Gateway"]
B --> C["Rate Limiter<br/>(Redis)"]
C --> D["Load Balancer"]
D --> E["Orchestrator<br/>(Prompt Validation, Context Manager, Safety Filter)"]
E --> F1["LLM Pool<br/>(70B)"]
E --> F2["LLM Pool<br/>(70B)"]
E --> F3["LLM Pool<br/>(7B small)"]
F1 --> G["Response Handler<br/>(Streaming SSE, Output filtering, Logging)"]
F2 --> G
F3 --> G
G --> H1["Redis<br/>(Cache)"]
G --> H2["PostgreSQL<br/>(History)"]
G --> H3["S3<br/>(Logs)"]
style A fill:#e8eaf6,stroke:#3f51b5
style B fill:#e8eaf6,stroke:#3f51b5
style C fill:#fce4ec,stroke:#c62828
style D fill:#e8eaf6,stroke:#3f51b5
style E fill:#fff3e0,stroke:#ef6c00
style F1 fill:#f3e5f5,stroke:#9c27b0
style F2 fill:#f3e5f5,stroke:#9c27b0
style F3 fill:#f3e5f5,stroke:#9c27b0
style G fill:#e8f5e9,stroke:#4caf50
style H1 fill:#e8eaf6,stroke:#3f51b5
style H2 fill:#e8eaf6,stroke:#3f51b5
style H3 fill:#e8eaf6,stroke:#3f51b5
### Key Design Decisions
| Component | Decision | Trade-off |
|---|---|---|
| Model serving | vLLM with PagedAttention | Memory efficiency vs complexity |
| Context | KV cache + conversation store | Memory vs latency |
| Scaling | Horizontal + model sharding | Complexity vs throughput |
| Streaming | Server-Sent Events | Simple vs WebSocket overhead |
| Safety | Input/output filtering | Latency vs safety |
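To make the streaming decision concrete, here is a minimal SSE endpoint sketch with FastAPI; `fake_token_stream` is a stand-in for the LLM pool:

```python
import asyncio
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def fake_token_stream(prompt: str):
    """Stand-in for the LLM pool; yields tokens as SSE events."""
    for token in ["Hello", ",", " world", "!"]:
        await asyncio.sleep(0.05)        # simulated time between tokens (TBT)
        yield f"data: {token}\n\n"       # SSE wire format: 'data: ...' + blank line
    yield "data: [DONE]\n\n"

@app.post("/v1/chat")
async def chat(prompt: str):
    # SSE keeps the transport simple (plain HTTP, proxy-friendly) at the cost
    # of being one-directional, vs. WebSocket's bidirectional channel.
    return StreamingResponse(fake_token_stream(prompt),
                             media_type="text/event-stream")
```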
## 3. Design RAG at Scale
### Indexing Pipeline (Offline)
graph LR
A["Documents"] --> B["Chunker<br/>(512 tokens, overlap=50)"]
B --> C["Embedder<br/>(1536-dim)"]
C --> D["Vector DB"]
style A fill:#e8eaf6,stroke:#3f51b5
style B fill:#fff3e0,stroke:#ef6c00
style C fill:#f3e5f5,stroke:#9c27b0
style D fill:#e8f5e9,stroke:#4caf50
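A minimal sketch of the fixed-window chunker from the diagram (pure token-level; production chunkers usually also respect sentence and section boundaries):

```python
def chunk_tokens(tokens: list[str], size: int = 512, overlap: int = 50) -> list[list[str]]:
    """Sliding-window chunking: fixed-size windows with `overlap` shared tokens."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

# Example: 1000 tokens -> chunks starting at 0, 462, 924, each overlapping by 50.
chunks = chunk_tokens([f"tok{i}" for i in range(1000)])
```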
### Query Pipeline (Online)
graph LR
A["Query"] --> B["Query Rewrite<br/>(HyDE / Multi-query)"]
B --> C["Embed"]
C --> D["Retrieval<br/>(Top-100)"]
D --> E["Rerank<br/>(Cross-Encoder, Top-20)"]
E --> F["Context Assembly"]
F --> G["LLM Generation"]
style A fill:#e8eaf6,stroke:#3f51b5
style B fill:#fff3e0,stroke:#ef6c00
style C fill:#f3e5f5,stroke:#9c27b0
style D fill:#e8f5e9,stroke:#4caf50
style E fill:#fff3e0,stroke:#ef6c00
style F fill:#e8eaf6,stroke:#3f51b5
style G fill:#f3e5f5,stroke:#9c27b0
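A sketch of the rerank stage using a cross-encoder from sentence-transformers (the checkpoint name is one common public model, not a prescription):

```python
from sentence_transformers import CrossEncoder

# Cross-encoders score (query, passage) pairs jointly: slower than bi-encoder
# retrieval, so they run only on the Top-100 shortlist.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 20) -> list[str]:
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]
```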
### Scaling
| Challenge | Solution | Details |
|---|---|---|
| Vector search latency | HNSW + sharding | Sub-100ms at 1B vectors |
| Index updates | Async reindexing | Zero-downtime updates |
| High QPS | Query caching | 50-80% cache hit rate |
| Context length | Hierarchical retrieval | Coarse -> fine search |
### Vector DB Selection
| System | Scale | Latency | Use Case |
|---|---|---|---|
| Pinecone | 1B+ | 50-100ms | Managed, production |
| Milvus | 10B+ | 30-80ms | Self-hosted, large scale |
| Weaviate | 1B+ | 40-100ms | Hybrid search |
| Qdrant | 1B+ | 20-50ms | Rust-based, fast |
## 4. Design a Recommendation System with LLM Reranking
### Stage 1: Candidate Generation (Traditional ML)
graph LR
A["User Features"] --> B["Collaborative Filter<br/>(Matrix Factorization / LightFM)"]
A --> C["Two-Tower NN<br/>(User Emb · Item Emb)"]
B --> D["1000 items<br/>Latency: 10-50ms"]
C --> D
style A fill:#e8eaf6,stroke:#3f51b5
style B fill:#fff3e0,stroke:#ef6c00
style C fill:#f3e5f5,stroke:#9c27b0
style D fill:#e8f5e9,stroke:#4caf50
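A candidate-generation sketch of the two-tower dot product, with random stand-in embeddings in place of trained tower outputs; at production scale the brute-force matmul is replaced by ANN search (Faiss, HNSW):

```python
import numpy as np

# Both towers map into a shared embedding space; candidate generation is a
# dot product followed by top-k selection.
rng = np.random.default_rng(0)
item_emb = rng.normal(size=(100_000, 64)).astype(np.float32)  # item tower output
user_emb = rng.normal(size=(64,)).astype(np.float32)          # user tower output

scores = item_emb @ user_emb                     # brute force; use ANN at scale
top_1000 = np.argpartition(-scores, 1000)[:1000]
candidates = top_1000[np.argsort(-scores[top_1000])]  # sorted candidate item ids
```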
### Stage 2: LLM Reranking
graph LR
A["Top 100 candidates<br/>+ User context"] --> B["LLM prompt"]
B --> C["Reranked top 20<br/>Latency: 200-500ms"]
style A fill:#e8eaf6,stroke:#3f51b5
style B fill:#f3e5f5,stroke:#9c27b0
style C fill:#e8f5e9,stroke:#4caf50
LLM4Rerank (WWW 2025): multi-objective (accuracy, fairness, diversity), batch inference, user preference retrieval (UR4Rec).
### Traditional Recommendation Architecture
- Candidate Generation: collaborative filtering, content-based, hybrid
- Scoring: ML model for relevance
- Ranking: learning-to-rank
- Re-ranking: heuristic rules or second ML model
- Retrieval: ANN (HNSW, Faiss), vector DB
## 5. Design a Content Moderation Pipeline
### Multi-Stage Classification
| Stage | Latency | Coverage | Method |
|---|---|---|---|
| Rule-based | 1ms | 10% blocked | Blocked words, regex, length checks |
| Fast ML | 10ms | 80% auto-decided | Small BERT (110M), multi-label, threshold 0.8 |
| LLM judge | 500ms | 8% escalated | Large LLM (confidence < 0.8) |
| Human | Hours | 2% edge cases | Low confidence, appeals, policy updates |
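A control-flow sketch of the cascade in the table above. All stage functions are stubs standing in for the regex filter, small BERT, and LLM judge; the 0.8 threshold mirrors the table, while the 0.2 auto-accept band is an assumption of this sketch:

```python
import re

BLOCKLIST = re.compile(r"\b(badword1|badword2)\b", re.I)  # illustrative patterns

def rule_filter(post: str) -> bool:              # ~1ms stage
    return bool(BLOCKLIST.search(post))

def fast_classifier(post: str) -> float:         # ~10ms stage; stub for small BERT
    return 0.1                                   # stand-in P(violation)

def llm_judge(post: str) -> tuple[str, float]:   # ~500ms stage; stub for a large LLM
    return "approved", 0.9

def moderate(post: str) -> str:
    """Cheap stages first; only low-confidence traffic escalates."""
    if rule_filter(post):
        return "blocked"
    score = fast_classifier(post)
    if score >= 0.8:                             # high-precision auto-block
        return "blocked"
    if score <= 0.2:                             # auto-accept band (design choice)
        return "approved"
    verdict, confidence = llm_judge(post)        # only the uncertain middle
    return verdict if confidence >= 0.8 else "human_review"
```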
## 6. Search Ranking System
- Indexing: inverted index, Elasticsearch
- Retrieval: BM25, TF-IDF, neural retrieval
- Ranking: Learning to Rank (LambdaMART, RankNet)
- Personalization: user embeddings, click history
- Re-ranking: diversity, freshness
Scale: index size and update frequency, query latency p50 < 100ms, freshness vs relevance trade-off.
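A sketch of the BM25 retrieval stage listed above, using the rank_bm25 package (corpus and query are toy examples; in production this layer is typically Elasticsearch feeding a learning-to-rank model):

```python
from rank_bm25 import BM25Okapi

corpus = ["cheap flights to tokyo", "tokyo ramen guide", "flight delay compensation"]
bm25 = BM25Okapi([doc.split() for doc in corpus])  # whitespace tokenization for brevity

query = "flights tokyo".split()
scores = bm25.get_scores(query)                    # one BM25 score per document
best_first = sorted(range(len(corpus)), key=lambda i: -scores[i])
```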
## 7. Fraud Detection System
- Real-time scoring: online inference (< 50ms)
- Rule Engine: known patterns, blacklists
- Labeling pipeline: human review for suspicious cases
- Feature Store: user behavior history, device fingerprinting
- Monitoring: drift detection, performance degradation
Trade-offs: precision vs recall (false positives vs fraud loss), real-time vs batch, explainability vs accuracy.
## 8. Ad Click Prediction
- Feature Engineering: user features, ad features, contextual features
- Model: logistic regression, gradient boosting (XGBoost, LightGBM)
- Serving: low latency prediction (online learning)
- Exploration: multi-armed bandit (UCB, Thompson Sampling)
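A Thompson Sampling sketch for the exploration bullet above: each ad keeps a Beta posterior over its CTR, we sample from every posterior, and show the argmax (true CTRs below are simulated):

```python
import numpy as np

rng = np.random.default_rng(42)
n_ads = 5
alpha = np.ones(n_ads)  # Beta(1,1) uniform prior; alpha-1 = observed clicks
beta = np.ones(n_ads)   # beta-1 = observed non-clicks

def choose_ad() -> int:
    return int(np.argmax(rng.beta(alpha, beta)))  # sample each posterior, pick best

def update(ad: int, clicked: bool) -> None:
    if clicked:
        alpha[ad] += 1
    else:
        beta[ad] += 1

# Simulated loop: the bandit concentrates traffic on the best ad (index 3).
true_ctr = np.array([0.02, 0.05, 0.03, 0.08, 0.01])
for _ in range(10_000):
    ad = choose_ad()
    update(ad, rng.random() < true_ctr[ad])
```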
## 9. Model Serving for 1M QPS
### Architecture
graph TD
A["DNS"] --> B["Geo-LB"]
B --> C["Regional LB"]
C --> D["Service Mesh"]
D --> E1["Region US<br/>(300K QPS)"]
D --> E2["Region EU<br/>(400K QPS)"]
D --> E3["Region AP<br/>(300K QPS)"]
style A fill:#e8eaf6,stroke:#3f51b5
style B fill:#e8eaf6,stroke:#3f51b5
style C fill:#fff3e0,stroke:#ef6c00
style D fill:#f3e5f5,stroke:#9c27b0
style E1 fill:#e8f5e9,stroke:#4caf50
style E2 fill:#e8f5e9,stroke:#4caf50
style E3 fill:#e8f5e9,stroke:#4caf50
Per region:
- Request Router -> Semantic Cache (Redis + embeddings, ~50% hit rate)
- Cache miss -> Model Inference (GPU pools: 100x A100, 50x H100)
- Tiered serving: simple queries -> 7B model (10ms), complex -> 70B model (100ms)
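A routing sketch for the per-region flow above; every component is a stub (a real version would use Redis plus an embedding index for the semantic cache, a small classifier as the router, and two model endpoints for the tiers):

```python
CACHE: dict[str, str] = {}  # stand-in for Redis + semantic (embedding) lookup

def complexity(query: str) -> float:
    """Stub router score in [0, 1]; a real router is a small trained classifier."""
    return min(len(query.split()) / 50.0, 1.0)

def small_llm(query: str) -> str:   # 7B tier, ~10ms
    return f"[7B] answer to: {query}"

def large_llm(query: str) -> str:   # 70B tier, ~100ms
    return f"[70B] answer to: {query}"

def route(query: str) -> str:
    if query in CACHE:              # cache hit: ~1ms, no GPU touched
        return CACHE[query]
    answer = small_llm(query) if complexity(query) < 0.5 else large_llm(query)
    CACHE[query] = answer
    return answer
```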
### Cost Optimization
| Strategy | Savings | Trade-off |
|---|---|---|
| Semantic caching | 50-70% | Latency for cache lookup |
| Model tiering | 40-60% | Quality variance |
| Batch inference | 30-50% | Latency increase |
| Quantization | 20-40% | Quality loss |
| Spot instances | 60-80% | Availability risk |
## 10. Capacity Planning
### QPS Estimation
Example: 1M users x 10 requests/user/day = 10M requests/day. Assuming traffic concentrates in a 4-hour peak window: 10M / (4 * 3600 s) ≈ 694 QPS peak.
### Storage Estimation
Example: 1GB model x 10 replicas x 3x redundancy = 30GB.
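Both estimates as a back-of-envelope script (assumes all daily traffic lands in the 4-hour peak window, as above):

```python
# QPS estimate
users, req_per_user = 1_000_000, 10
daily_requests = users * req_per_user          # 10M requests/day
peak_window_s = 4 * 3600                       # 4-hour peak window assumption
peak_qps = daily_requests / peak_window_s      # ~694 QPS

# Storage estimate
model_gb, replicas, redundancy = 1, 10, 3
storage_gb = model_gb * replicas * redundancy  # 30 GB

print(f"peak QPS ~ {peak_qps:.0f}, storage = {storage_gb} GB")
```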
## Interview-Relevant Numbers
### Latency Targets
| System | Target |
|---|---|
| Chatbot TTFT | < 500ms |
| Chatbot TBT | < 50ms |
| Search | < 200ms |
| Recommendation | < 100ms |
| Fraud detection | < 50ms |
| Content moderation | < 500ms |
### Scale Benchmarks
| Scale | QPS | Infrastructure |
|---|---|---|
| Small | 1K | 10 GPUs |
| Medium | 100K | 100 GPUs + cache |
| Large | 1M | 1000 GPUs + geo-distribution |
### Cost Targets
| Model | Cost/1K Tokens | Infrastructure |
|---|---|---|
| 7B (quantized) | $0.001 | Single A10 |
| 70B (quantized) | $0.01 | 4x A100 |
| 175B | $0.03 | 8x H100 |
### Cache Hit Rates
| Cache Type | Hit Rate | Latency Reduction |
|---|---|---|
| Exact match | 20-30% | 100% |
| Semantic cache | 50-70% | 90% |
| KV cache (prefix) | 40-60% | 70% |
## For the Interview
### Q: "Design a real-time recommendation system with LLM reranking."
Two-stage: (1) Candidate generation -- collaborative filtering / two-tower neural network, 10-50ms, 1000 candidates. (2) LLM reranking -- top 100 candidates + user context passed through the LLM, 200-500ms, reranked top 20. Trade-off: the LLM adds 100-500ms of latency but improves CTR by 10-20%. Scale: ANN retrieval (HNSW, Faiss), batch LLM inference, caching of popular recommendations.
Q: "How do you handle model drift in production?"¶
(1) Monitoring: track prediction distribution, feature importance drift. (2) Alerting: thresholds on KL divergence, PSI (Population Stability Index). (3) Automated retraining: trigger when drift exceeds threshold. (4) A/B testing: compare new model vs champion. (5) Fallback: keep previous version for quick rollback.
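A PSI sketch for step (2); the decile binning and thresholds follow the common rule of thumb (<0.1 stable, 0.1-0.25 moderate drift, >0.25 act):

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a baseline and a live feature sample."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))  # decile edges
    edges[0], edges[-1] = -np.inf, np.inf        # catch out-of-range live values
    e_frac = np.histogram(expected, edges)[0] / len(expected)
    a_frac = np.histogram(actual, edges)[0] / len(actual)
    e_frac = np.clip(e_frac, 1e-6, None)         # avoid log(0) on empty bins
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

# Example: a 0.3-sigma mean shift in one feature lands near the 0.1
# "moderate drift" level on the usual PSI scale.
rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 50_000)
live = rng.normal(0.3, 1.0, 50_000)
print(psi(baseline, live))
```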
Q: "Design content moderation for 10M posts/day."¶
4-stage pipeline: (1) Rule-based filter (1ms, 10% blocked). (2) Fast ML classifier (10ms, 80% auto-decided, small BERT, threshold 0.8). (3) LLM judge (500ms, 8% escalated, confidence < 0.8). (4) Human review (hours, 2% edge cases). Active learning from human decisions feeds back to ML classifier.
## Common Pitfalls
- Jumping to solution before clarifying requirements
- Ignoring trade-offs (every decision has pros/cons)
- Forgetting non-functional requirements (latency and cost often outweigh accuracy)
- Over-engineering (start simple, iterate)
- Not mentioning failure modes
- Ignoring monitoring
## Green Flags
| Behavior | Why |
|---|---|
| Asks clarifying questions | Systematic thinking |
| Discusses trade-offs | Experience |
| Starts simple, scales up | Pragmatic |
| Considers failure modes | Production mindset |
| Mentions monitoring | Operational awareness |
**Misconception: ML System Design is just software system design with an ML model**
The main candidate mistake is drawing boxes-and-arrows with no ML specifics. Interviewers expect: metric selection (offline vs online, proxy vs true), data pipeline reasoning (feature engineering, label quality), the accuracy-vs-latency trade-off, and failure modes (data drift, model degradation). At FAANG, 60%+ of rejections in this round come from missing ML reasoning despite a correct architecture.
**Misconception: you always need the most accurate model**
In production, accuracy is not the primary metric. Content moderation at 10M posts/day: the 4-stage pipeline (rules -> small BERT -> LLM -> human) settles 90% of traffic within 1-10ms, and the operating point is set by the precision/recall trade-off, not raw accuracy. Fraud detection: a false positive rate above 1% blocks legitimate users and costs the business more than the missed fraud, while a <50ms latency target rules out large models entirely.
**Misconception: a semantic cache solves LLM serving at scale**
A semantic cache with a 50-70% hit rate cuts load, but every cache miss still hits GPU inference. At 1M QPS, even a 70% hit rate leaves 300K QPS on GPUs, which means 1000+ A100s in a geo-distributed setup. The real optimization is model tiering (7B for the ~80% of simple queries, 70B for the ~20% complex ones) plus quantization and batch inference, which together yield 60-80% savings.
## Interview Questions
**Q: How would you start designing a ChatGPT-like system in an interview?**
Red flag: "Grab the GPT-4 API, add Redis for caching and PostgreSQL for history"
Strong answer: "I start with Requirements (5 min): functional (multi-turn, streaming, diverse tasks), non-functional (TTFT <500ms, TBT <50ms, 99.9% availability), scale (1M DAU, 50K peak QPS). Then Evaluation Metrics: online -- TTFT p95, user satisfaction rate; offline -- LLM-judge quality. System Components: API Gateway -> Rate Limiter -> Orchestrator (prompt validation, context manager, safety filter) -> LLM Pool (tiered: 70B + 7B) -> Response Handler (SSE streaming). Key trade-offs: vLLM with PagedAttention for memory efficiency, KV cache for conversation history, model tiering for cost optimization."
**Q: How do you scale model serving to 1M QPS?**
Red flag: "Just add more GPUs"
Strong answer: "Geo-distributed architecture: DNS -> Geo-LB -> 3 regions (US 300K, EU 400K, AP 300K). Per region: (1) Semantic cache (Redis + embeddings, 50-70% hit rate) immediately removes half the load. (2) Model tiering: quantized 7B (10ms) for ~80% of simple queries, 70B (100ms) for complex ones. (3) GPU pools: 100x A100 + 50x H100 per region. (4) Batch inference for non-streaming traffic. Cost optimization: caching 50-70%, tiering 40-60%, quantization 20-40%, spot instances 60-80%. Total infra cost: $2-5M/month at 1M QPS."
**Q: Design content moderation for 10M posts/day -- how do you choose the threshold?**
Red flag: "Set the threshold to 0.5 and look at accuracy"
Strong answer: "4-stage cascade: (1) Rule-based (1ms, 10% blocked -- known bad words, regex). (2) Fast ML classifier (10ms, small BERT with 110M params, multi-label) -- threshold 0.8 for auto-accept/reject, handles 80% of traffic. (3) LLM judge (500ms) for the 8% with confidence between 0.5 and 0.8. (4) Human review -- 2% edge cases plus appeals. The threshold is set by the precision-recall trade-off: high precision (0.8+) for auto-block (minimize false positives), high recall (0.95+) for safety-critical content (never let it slip through). Active learning: human decisions feed back into the ML classifier, and thresholds are revisited weekly."
## Sources
- I Got An Offer -- "Generative AI System Design Interview" (Jan 2026)
- Exponent -- "ML System Design Interview Guide" (2026)
- Medium -- "ML System Design Interview Guide: Complete Framework" (Oct 2025)
- GeeksforGeeks -- "Top 25 ML System Design Interview Questions"
- Towards Data Science -- "Cracking ML System Design Interviews" (Nov 2025)
- InterviewNode -- "Generative AI System Design Interview Patterns" (Nov 2025)
- arXiv -- "LLM4Rerank: Auto-Reranking Framework" (2406.12433)
- Eugene Yan -- "Improving Recommendation Systems in the Age of LLMs" (Mar 2025)
## See Also
- RAG Architectures -- Design RAG at Scale: vector DB selection, chunking strategies, reranking pipeline
- LLM Cascade Routing -- model routing for 1M QPS serving, cascade escalation, cost optimization
- LLM API Pricing -- cost targets from capacity planning: $0.001-$0.03 per 1K tokens
- Production Deploy LLM -- vLLM, PagedAttention, model serving infrastructure
- LLM Evaluation Guardrails -- safety guardrails for content moderation pipeline design