# ML System Design Interview Patterns
~6 min read
Prerequisites: LLM Evaluation Metrics | Production Deploy LLM
ML System Design is the highest-weight round in FAANG ML interviews (45 min, 30-40% of the overall score). The key difference from software system design: interviewers evaluate not only the architecture but also metric selection, the accuracy-vs-latency trade-off, and understanding of the data pipeline. The top five questions of 2026: a ChatGPT-like system, RAG at scale, LLM reranking, content moderation, and model serving at 1M QPS. The RESHADED framework covers 8 phases in 40 minutes and appears in 70%+ of preparation materials.
## Key Concepts
### RESHADED Interview Framework
| Phase | Time | Focus |
|---|---|---|
| R Requirements | 5 min | Functional, non-functional, scale |
| E Evaluation Metrics | 3 min | Business + ML + system metrics |
| S System Components | 10 min | Data pipeline, ML pipeline, API layer |
| H High-Level Architecture | 5 min | Components, data flow, bottlenecks |
| A Advanced Details | 7 min | Scaling, failures, cost optimization |
| D Data Management | 5 min | Storage (SQL, NoSQL, vector DB), privacy |
| E Edge Cases | 3 min | 10x traffic spike, model failure, data drift |
| D Deploy & Monitor | 2 min | Deployment strategy, A/B testing, alerting |
### 2026 Top Questions
| Question | Difficulty | Frequency |
|---|---|---|
| Design ChatGPT-like system | Hard | Very High |
| Design RAG at scale | Medium-Hard | Very High |
| Design recommendation with LLM reranking | Hard | High |
| Design content moderation pipeline | Medium | High |
| Design model serving for 1M QPS | Very Hard | Medium |
## 1. Core ML Pipelines
### Data Processing Pipeline
graph LR
A["Raw Data"] --> B["Ingestion"]
B --> C["Validation"]
C --> D["Feature Engineering"]
D --> E["Storage"]
style A fill:#e8eaf6,stroke:#3f51b5
style B fill:#e8eaf6,stroke:#3f51b5
style C fill:#fff3e0,stroke:#ef6c00
style D fill:#f3e5f5,stroke:#9c27b0
style E fill:#e8f5e9,stroke:#4caf50
Batch vs Stream processing, data quality checks, feature store integration (Tecton, Feast), data versioning (DVC).
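A minimal sketch of the validation stage under an assumed feature schema (the `SCHEMA` dict, column names, and rules below are illustrative; in practice this role is usually played by Great Expectations or TFDV):

```python
import pandas as pd

# Illustrative feature schema; columns and rules are assumptions for the sketch.
SCHEMA = {
    "user_id": {"dtype": "int64", "nullable": False},
    "age":     {"dtype": "int64", "min": 0, "max": 120},
    "ctr_7d":  {"dtype": "float64", "min": 0.0, "max": 1.0},
}

def validate_batch(df: pd.DataFrame) -> list[str]:
    """Return data-quality violations for a batch; an empty list means it passes."""
    errors = []
    for col, rules in SCHEMA.items():
        if col not in df.columns:
            errors.append(f"missing column: {col}")
            continue
        if "dtype" in rules and str(df[col].dtype) != rules["dtype"]:
            errors.append(f"{col}: dtype {df[col].dtype}, expected {rules['dtype']}")
        if not rules.get("nullable", True) and df[col].isna().any():
            errors.append(f"nulls in non-nullable column: {col}")
        if "min" in rules and (df[col] < rules["min"]).any():
            errors.append(f"{col}: values below {rules['min']}")
        if "max" in rules and (df[col] > rules["max"]).any():
            errors.append(f"{col}: values above {rules['max']}")
    return errors
```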
### Model Training Pipeline
graph LR
A["Features"] --> B["Algorithm Selection"]
B --> C["Training"]
C --> D["Evaluation"]
D --> E["Registry"]
E --> F["Deployment"]
style A fill:#e8eaf6,stroke:#3f51b5
style B fill:#e8eaf6,stroke:#3f51b5
style C fill:#fff3e0,stroke:#ef6c00
style D fill:#fff3e0,stroke:#ef6c00
style E fill:#f3e5f5,stroke:#9c27b0
style F fill:#e8f5e9,stroke:#4caf50
Experiment tracking (MLflow, W&B), hyperparameter optimization, model versioning, A/B testing framework.
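A minimal experiment-tracking sketch with MLflow (experiment name, params, and the metric value are illustrative; the training call is elided):

```python
import mlflow

mlflow.set_experiment("ranker-v2")  # illustrative experiment name

with mlflow.start_run():
    params = {"learning_rate": 0.05, "num_trees": 500}
    mlflow.log_params(params)

    # model = train(features, labels, **params)   # training step elided
    mlflow.log_metric("val_auc", 0.91)            # offline evaluation result

    # Registering the artifact makes this version addressable in the model
    # registry for staged rollout (staging -> production), e.g.:
    # mlflow.sklearn.log_model(model, "model", registered_model_name="ranker")
```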
### Serving Infrastructure
graph LR
A["Model"] --> B["Preprocessing"]
B --> C["Inference"]
C --> D["Monitoring"]
D --> E["Feedback Loop"]
E -.-> A
style A fill:#e8eaf6,stroke:#3f51b5
style B fill:#fff3e0,stroke:#ef6c00
style C fill:#f3e5f5,stroke:#9c27b0
style D fill:#e8f5e9,stroke:#4caf50
style E fill:#fce4ec,stroke:#c62828
Batch vs Online inference, caching strategies, model monitoring (data drift, performance), horizontal/vertical scaling.
## 2. Design a ChatGPT-like System
### Requirements
- Functional: multi-turn conversations, streaming (token-by-token), various tasks (Q&A, coding, creative)
- Non-functional: TTFT (time to first token) < 500ms, TBT (time between tokens) < 50ms, 99.9% availability
- Scale: 1M DAU, avg 10 turns/conversation, 500 tokens/request, 50K peak QPS
### Architecture
graph TD
A["Client"] --> B["API Gateway"]
B --> C["Rate Limiter<br/>(Redis)"]
C --> D["Load Balancer"]
D --> E["Orchestrator<br/>(Prompt Validation, Context Manager, Safety Filter)"]
E --> F1["LLM Pool<br/>(70B)"]
E --> F2["LLM Pool<br/>(70B)"]
E --> F3["LLM Pool<br/>(7B small)"]
F1 --> G["Response Handler<br/>(Streaming SSE, Output filtering, Logging)"]
F2 --> G
F3 --> G
G --> H1["Redis<br/>(Cache)"]
G --> H2["PostgreSQL<br/>(History)"]
G --> H3["S3<br/>(Logs)"]
style A fill:#e8eaf6,stroke:#3f51b5
style B fill:#e8eaf6,stroke:#3f51b5
style C fill:#fce4ec,stroke:#c62828
style D fill:#e8eaf6,stroke:#3f51b5
style E fill:#fff3e0,stroke:#ef6c00
style F1 fill:#f3e5f5,stroke:#9c27b0
style F2 fill:#f3e5f5,stroke:#9c27b0
style F3 fill:#f3e5f5,stroke:#9c27b0
style G fill:#e8f5e9,stroke:#4caf50
style H1 fill:#e8eaf6,stroke:#3f51b5
style H2 fill:#e8eaf6,stroke:#3f51b5
style H3 fill:#e8eaf6,stroke:#3f51b5
### Key Design Decisions
| Component | Decision | Trade-off |
|---|---|---|
| Model serving | vLLM with PagedAttention | Memory efficiency vs complexity |
| Context | KV cache + conversation store | Memory vs latency |
| Scaling | Horizontal + model sharding | Complexity vs throughput |
| Streaming | Server-Sent Events | Simple vs WebSocket overhead |
| Safety | Input/output filtering | Latency vs safety |
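To make the streaming decision concrete, here is a minimal SSE endpoint sketch with FastAPI; `fake_token_stream` is a stand-in for the LLM pool:

```python
import asyncio
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def fake_token_stream(prompt: str):
    """Stand-in for the LLM pool; yields tokens as SSE events."""
    for token in ["Hello", ",", " world", "!"]:
        await asyncio.sleep(0.05)        # simulated time between tokens (TBT)
        yield f"data: {token}\n\n"       # SSE wire format: 'data: ...' + blank line
    yield "data: [DONE]\n\n"

@app.post("/v1/chat")
async def chat(prompt: str):
    # SSE keeps the transport simple (plain HTTP, proxy-friendly) at the cost
    # of being one-directional, vs. WebSocket's bidirectional channel.
    return StreamingResponse(fake_token_stream(prompt),
                             media_type="text/event-stream")
```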
## 3. Design RAG at Scale
### Indexing Pipeline (Offline)
graph LR
A["Documents"] --> B["Chunker<br/>(512 tokens, overlap=50)"]
B --> C["Embedder<br/>(1536-dim)"]
C --> D["Vector DB"]
style A fill:#e8eaf6,stroke:#3f51b5
style B fill:#fff3e0,stroke:#ef6c00
style C fill:#f3e5f5,stroke:#9c27b0
style D fill:#e8f5e9,stroke:#4caf50
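A minimal sketch of the fixed-window chunker from the diagram (pure token-level; production chunkers usually also respect sentence and section boundaries):

```python
def chunk_tokens(tokens: list[str], size: int = 512, overlap: int = 50) -> list[list[str]]:
    """Sliding-window chunking: fixed-size windows with `overlap` shared tokens."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

# Example: 1000 tokens -> chunks starting at 0, 462, 924, each overlapping by 50.
chunks = chunk_tokens([f"tok{i}" for i in range(1000)])
```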
### Query Pipeline (Online)
graph LR
A["Query"] --> B["Query Rewrite<br/>(HyDE / Multi-query)"]
B --> C["Embed"]
C --> D["Retrieval<br/>(Top-100)"]
D --> E["Rerank<br/>(Cross-Encoder, Top-20)"]
E --> F["Context Assembly"]
F --> G["LLM Generation"]
style A fill:#e8eaf6,stroke:#3f51b5
style B fill:#fff3e0,stroke:#ef6c00
style C fill:#f3e5f5,stroke:#9c27b0
style D fill:#e8f5e9,stroke:#4caf50
style E fill:#fff3e0,stroke:#ef6c00
style F fill:#e8eaf6,stroke:#3f51b5
style G fill:#f3e5f5,stroke:#9c27b0
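A sketch of the rerank stage using a cross-encoder from sentence-transformers (the checkpoint name is one common public model, not a prescription):

```python
from sentence_transformers import CrossEncoder

# Cross-encoders score (query, passage) pairs jointly: slower than bi-encoder
# retrieval, so they run only on the Top-100 shortlist.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 20) -> list[str]:
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]
```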
### Scaling
| Challenge | Solution | Details |
|---|---|---|
| Vector search latency | HNSW + sharding | Sub-100ms at 1B vectors |
| Index updates | Async reindexing | Zero-downtime updates |
| High QPS | Query caching | 50-80% cache hit rate |
| Context length | Hierarchical retrieval | Coarse -> fine search |
### Vector DB Selection
| System | Scale | Latency | Use Case |
|---|---|---|---|
| Pinecone | 1B+ | 50-100ms | Managed, production |
| Milvus | 10B+ | 30-80ms | Self-hosted, large scale |
| Weaviate | 1B+ | 40-100ms | Hybrid search |
| Qdrant | 1B+ | 20-50ms | Rust-based, fast |
## 4. Design a Recommendation System with LLM Reranking
### Stage 1: Candidate Generation (Traditional ML)
graph LR
A["User Features"] --> B["Collaborative Filter<br/>(Matrix Factorization / LightFM)"]
A --> C["Two-Tower NN<br/>(User Emb · Item Emb)"]
B --> D["1000 items<br/>Latency: 10-50ms"]
C --> D
style A fill:#e8eaf6,stroke:#3f51b5
style B fill:#fff3e0,stroke:#ef6c00
style C fill:#f3e5f5,stroke:#9c27b0
style D fill:#e8f5e9,stroke:#4caf50
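A candidate-generation sketch of the two-tower dot product, with random stand-in embeddings in place of trained tower outputs; at production scale the brute-force matmul is replaced by ANN search (Faiss, HNSW):

```python
import numpy as np

# Both towers map into a shared embedding space; candidate generation is a
# dot product followed by top-k selection.
rng = np.random.default_rng(0)
item_emb = rng.normal(size=(100_000, 64)).astype(np.float32)  # item tower output
user_emb = rng.normal(size=(64,)).astype(np.float32)          # user tower output

scores = item_emb @ user_emb                     # brute force; use ANN at scale
top_1000 = np.argpartition(-scores, 1000)[:1000]
candidates = top_1000[np.argsort(-scores[top_1000])]  # sorted candidate item ids
```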
### Stage 2: LLM Reranking
graph LR
A["Top 100 candidates<br/>+ User context"] --> B["LLM prompt"]
B --> C["Reranked top 20<br/>Latency: 200-500ms"]
style A fill:#e8eaf6,stroke:#3f51b5
style B fill:#f3e5f5,stroke:#9c27b0
style C fill:#e8f5e9,stroke:#4caf50
LLM4Rerank (WWW 2025): multi-objective (accuracy, fairness, diversity), batch inference, user preference retrieval (UR4Rec).
### Traditional Recommendation Architecture
- Candidate Generation: collaborative filtering, content-based, hybrid
- Scoring: ML model for relevance
- Ranking: learning-to-rank
- Re-ranking: heuristic rules or second ML model
- Retrieval: ANN (HNSW, Faiss), vector DB
## 5. Design a Content Moderation Pipeline
### Multi-Stage Classification
| Stage | Latency | Coverage | Method |
|---|---|---|---|
| Rule-based | 1ms | 10% blocked | Blocked words, regex, length checks |
| Fast ML | 10ms | 80% auto-decided | Small BERT (110M), multi-label, threshold 0.8 |
| LLM judge | 500ms | 8% escalated | Large LLM (confidence < 0.8) |
| Human | Hours | 2% edge cases | Low confidence, appeals, policy updates |
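A control-flow sketch of the cascade in the table above. All stage functions are stubs standing in for the regex filter, small BERT, and LLM judge; the 0.8 threshold mirrors the table, while the 0.2 auto-accept band is an assumption of this sketch:

```python
import re

BLOCKLIST = re.compile(r"\b(badword1|badword2)\b", re.I)  # illustrative patterns

def rule_filter(post: str) -> bool:              # ~1ms stage
    return bool(BLOCKLIST.search(post))

def fast_classifier(post: str) -> float:         # ~10ms stage; stub for small BERT
    return 0.1                                   # stand-in P(violation)

def llm_judge(post: str) -> tuple[str, float]:   # ~500ms stage; stub for a large LLM
    return "approved", 0.9

def moderate(post: str) -> str:
    """Cheap stages first; only low-confidence traffic escalates."""
    if rule_filter(post):
        return "blocked"
    score = fast_classifier(post)
    if score >= 0.8:                             # high-precision auto-block
        return "blocked"
    if score <= 0.2:                             # auto-accept band (design choice)
        return "approved"
    verdict, confidence = llm_judge(post)        # only the uncertain middle
    return verdict if confidence >= 0.8 else "human_review"
```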
## 6. Search Ranking System
- Indexing: inverted index, Elasticsearch
- Retrieval: BM25, TF-IDF, neural retrieval
- Ranking: Learning to Rank (LambdaMART, RankNet)
- Personalization: user embeddings, click history
- Re-ranking: diversity, freshness
Scale: index size and update frequency, query latency p50 < 100ms, freshness vs relevance trade-off.
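A sketch of the BM25 retrieval stage listed above, using the rank_bm25 package (corpus and query are toy examples; in production this layer is typically Elasticsearch feeding a learning-to-rank model):

```python
from rank_bm25 import BM25Okapi

corpus = ["cheap flights to tokyo", "tokyo ramen guide", "flight delay compensation"]
bm25 = BM25Okapi([doc.split() for doc in corpus])  # whitespace tokenization for brevity

query = "flights tokyo".split()
scores = bm25.get_scores(query)                    # one BM25 score per document
best_first = sorted(range(len(corpus)), key=lambda i: -scores[i])
```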
## 7. Fraud Detection System
- Real-time scoring: online inference (< 50ms)
- Rule Engine: known patterns, blacklists
- Labeling pipeline: human review for suspicious cases
- Feature Store: user behavior history, device fingerprinting
- Monitoring: drift detection, performance degradation
Trade-offs: precision vs recall (false positives vs fraud loss), real-time vs batch, explainability vs accuracy.
## 8. Ad Click Prediction
- Feature Engineering: user features, ad features, contextual features
- Model: logistic regression, gradient boosting (XGBoost, LightGBM)
- Serving: low latency prediction (online learning)
- Exploration: multi-armed bandit (UCB, Thompson Sampling)
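A Thompson Sampling sketch for the exploration bullet above: each ad keeps a Beta posterior over its CTR, we sample from every posterior, and show the argmax (true CTRs below are simulated):

```python
import numpy as np

rng = np.random.default_rng(42)
n_ads = 5
alpha = np.ones(n_ads)  # Beta(1,1) uniform prior; alpha-1 = observed clicks
beta = np.ones(n_ads)   # beta-1 = observed non-clicks

def choose_ad() -> int:
    return int(np.argmax(rng.beta(alpha, beta)))  # sample each posterior, pick best

def update(ad: int, clicked: bool) -> None:
    if clicked:
        alpha[ad] += 1
    else:
        beta[ad] += 1

# Simulated loop: the bandit concentrates traffic on the best ad (index 3).
true_ctr = np.array([0.02, 0.05, 0.03, 0.08, 0.01])
for _ in range(10_000):
    ad = choose_ad()
    update(ad, rng.random() < true_ctr[ad])
```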
## 9. Model Serving for 1M QPS
### Architecture
graph TD
A["DNS"] --> B["Geo-LB"]
B --> C["Regional LB"]
C --> D["Service Mesh"]
D --> E1["Region US<br/>(300K QPS)"]
D --> E2["Region EU<br/>(400K QPS)"]
D --> E3["Region AP<br/>(300K QPS)"]
style A fill:#e8eaf6,stroke:#3f51b5
style B fill:#e8eaf6,stroke:#3f51b5
style C fill:#fff3e0,stroke:#ef6c00
style D fill:#f3e5f5,stroke:#9c27b0
style E1 fill:#e8f5e9,stroke:#4caf50
style E2 fill:#e8f5e9,stroke:#4caf50
style E3 fill:#e8f5e9,stroke:#4caf50
Per region:
- Request Router -> Semantic Cache (Redis + embeddings, ~50% hit rate)
- Cache miss -> Model Inference (GPU pools: 100x A100, 50x H100)
- Tiered serving: simple queries -> 7B model (10ms), complex -> 70B model (100ms)
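A routing sketch for the per-region flow above; every component is a stub (a real version would use Redis plus an embedding index for the semantic cache, a small classifier as the router, and two model endpoints for the tiers):

```python
CACHE: dict[str, str] = {}  # stand-in for Redis + semantic (embedding) lookup

def complexity(query: str) -> float:
    """Stub router score in [0, 1]; a real router is a small trained classifier."""
    return min(len(query.split()) / 50.0, 1.0)

def small_llm(query: str) -> str:   # 7B tier, ~10ms
    return f"[7B] answer to: {query}"

def large_llm(query: str) -> str:   # 70B tier, ~100ms
    return f"[70B] answer to: {query}"

def route(query: str) -> str:
    if query in CACHE:              # cache hit: ~1ms, no GPU touched
        return CACHE[query]
    answer = small_llm(query) if complexity(query) < 0.5 else large_llm(query)
    CACHE[query] = answer
    return answer
```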
### Cost Optimization
| Strategy | Savings | Trade-off |
|---|---|---|
| Semantic caching | 50-70% | Latency for cache lookup |
| Model tiering | 40-60% | Quality variance |
| Batch inference | 30-50% | Latency increase |
| Quantization | 20-40% | Quality loss |
| Spot instances | 60-80% | Availability risk |
## 10. Capacity Planning
### QPS Estimation
Example: 1M users x 10 requests/user/day = 10M requests/day. Assuming traffic concentrates in a 4-hour peak window: 10M / (4 * 3600 s) ≈ 694 QPS peak.
### Storage Estimation
Example: 1GB model x 10 replicas x 3x redundancy = 30GB.
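Both estimates as a back-of-envelope script (assumes all daily traffic lands in the 4-hour peak window, as above):

```python
# QPS estimate
users, req_per_user = 1_000_000, 10
daily_requests = users * req_per_user          # 10M requests/day
peak_window_s = 4 * 3600                       # 4-hour peak window assumption
peak_qps = daily_requests / peak_window_s      # ~694 QPS

# Storage estimate
model_gb, replicas, redundancy = 1, 10, 3
storage_gb = model_gb * replicas * redundancy  # 30 GB

print(f"peak QPS ~ {peak_qps:.0f}, storage = {storage_gb} GB")
```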
## Interview-Relevant Numbers
### Latency Targets
| System | Target |
|---|---|
| Chatbot TTFT | < 500ms |
| Chatbot TBT | < 50ms |
| Search | < 200ms |
| Recommendation | < 100ms |
| Fraud detection | < 50ms |
| Content moderation | < 500ms |
### Scale Benchmarks
| Scale | QPS | Infrastructure |
|---|---|---|
| Small | 1K | 10 GPUs |
| Medium | 100K | 100 GPUs + cache |
| Large | 1M | 1000 GPUs + geo-distribution |
### Cost Targets
| Model | Cost/1K Tokens | Infrastructure |
|---|---|---|
| 7B (quantized) | $0.001 | Single A10 |
| 70B (quantized) | $0.01 | 4x A100 |
| 175B | $0.03 | 8x H100 |
### Cache Hit Rates
| Cache Type | Hit Rate | Latency Reduction |
|---|---|---|
| Exact match | 20-30% | 100% |
| Semantic cache | 50-70% | 90% |
| KV cache (prefix) | 40-60% | 70% |
## For the Interview
### Q: "Design a real-time recommendation system with LLM reranking."
Two-stage: (1) Candidate generation -- collaborative filtering / two-tower neural network, 10-50ms, 1000 candidates. (2) LLM reranking -- top 100 candidates + user context passed through the LLM, 200-500ms, reranked top 20. Trade-off: the LLM adds 100-500ms of latency but improves CTR by 10-20%. Scale: ANN retrieval (HNSW, Faiss), batch LLM inference, caching of popular recommendations.
Q: "How do you handle model drift in production?"¶
(1) Monitoring: track prediction distribution, feature importance drift. (2) Alerting: thresholds on KL divergence, PSI (Population Stability Index). (3) Automated retraining: trigger when drift exceeds threshold. (4) A/B testing: compare new model vs champion. (5) Fallback: keep previous version for quick rollback.
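A PSI sketch for step (2); the decile binning and thresholds follow the common rule of thumb (<0.1 stable, 0.1-0.25 moderate drift, >0.25 act):

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a baseline and a live feature sample."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))  # decile edges
    edges[0], edges[-1] = -np.inf, np.inf        # catch out-of-range live values
    e_frac = np.histogram(expected, edges)[0] / len(expected)
    a_frac = np.histogram(actual, edges)[0] / len(actual)
    e_frac = np.clip(e_frac, 1e-6, None)         # avoid log(0) on empty bins
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

# Example: a 0.3-sigma mean shift in one feature lands near the 0.1
# "moderate drift" level on the usual PSI scale.
rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 50_000)
live = rng.normal(0.3, 1.0, 50_000)
print(psi(baseline, live))
```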
Q: "Design content moderation for 10M posts/day."¶
4-stage pipeline: (1) Rule-based filter (1ms, 10% blocked). (2) Fast ML classifier (10ms, 80% auto-decided, small BERT, threshold 0.8). (3) LLM judge (500ms, 8% escalated, confidence < 0.8). (4) Human review (hours, 2% edge cases). Active learning from human decisions feeds back to ML classifier.
## Common Pitfalls
- Jumping to solution before clarifying requirements
- Ignoring trade-offs (every decision has pros/cons)
- Forgetting non-functional requirements (latency and cost often outweigh accuracy)
- Over-engineering (start simple, iterate)
- Not mentioning failure modes
- Ignoring monitoring
## Green Flags
| Behavior | Why |
|---|---|
| Asks clarifying questions | Systematic thinking |
| Discusses trade-offs | Experience |
| Starts simple, scales up | Pragmatic |
| Considers failure modes | Production mindset |
| Mentions monitoring | Operational awareness |
**Misconception: ML System Design is just software system design with an ML model**
The main candidate mistake is drawing boxes-and-arrows with no ML specifics. Interviewers expect: metric selection (offline vs online, proxy vs true), data pipeline reasoning (feature engineering, label quality), the accuracy-vs-latency trade-off, and failure modes (data drift, model degradation). At FAANG, 60%+ of rejections in this round come from missing ML reasoning despite a correct architecture.
**Misconception: you always need the most accurate model**
In production, accuracy is not the primary metric. Content moderation at 10M posts/day: the 4-stage pipeline (rules -> small BERT -> LLM -> human) settles 90% of traffic within 1-10ms, and the operating point is set by the precision/recall trade-off, not raw accuracy. Fraud detection: a false positive rate above 1% blocks legitimate users and costs the business more than the missed fraud, while a <50ms latency target rules out large models entirely.
**Misconception: a semantic cache solves LLM serving at scale**
A semantic cache with a 50-70% hit rate cuts load, but every cache miss still hits GPU inference. At 1M QPS, even a 70% hit rate leaves 300K QPS on GPUs, which means 1000+ A100s in a geo-distributed setup. The real optimization is model tiering (7B for the ~80% of simple queries, 70B for the ~20% complex ones) plus quantization and batch inference, which together yield 60-80% savings.
## Interview Questions
**Q: How would you start designing a ChatGPT-like system in an interview?**
Red flag: "Grab the GPT-4 API, add Redis for caching and PostgreSQL for history"
Strong answer: "I start with Requirements (5 min): functional (multi-turn, streaming, diverse tasks), non-functional (TTFT <500ms, TBT <50ms, 99.9% availability), scale (1M DAU, 50K peak QPS). Then Evaluation Metrics: online -- TTFT p95, user satisfaction rate; offline -- LLM-judge quality. System Components: API Gateway -> Rate Limiter -> Orchestrator (prompt validation, context manager, safety filter) -> LLM Pool (tiered: 70B + 7B) -> Response Handler (SSE streaming). Key trade-offs: vLLM with PagedAttention for memory efficiency, KV cache for conversation history, model tiering for cost optimization."
**Q: How do you scale model serving to 1M QPS?**
Red flag: "Just add more GPUs"
Strong answer: "Geo-distributed architecture: DNS -> Geo-LB -> 3 regions (US 300K, EU 400K, AP 300K). Per region: (1) Semantic cache (Redis + embeddings, 50-70% hit rate) immediately removes half the load. (2) Model tiering: quantized 7B (10ms) for ~80% of simple queries, 70B (100ms) for complex ones. (3) GPU pools: 100x A100 + 50x H100 per region. (4) Batch inference for non-streaming traffic. Cost optimization: caching 50-70%, tiering 40-60%, quantization 20-40%, spot instances 60-80%. Total infra cost: $2-5M/month at 1M QPS."
**Q: Design content moderation for 10M posts/day -- how do you choose the threshold?**
Red flag: "Set the threshold to 0.5 and look at accuracy"
Strong answer: "4-stage cascade: (1) Rule-based (1ms, 10% blocked -- known bad words, regex). (2) Fast ML classifier (10ms, small BERT with 110M params, multi-label) -- threshold 0.8 for auto-accept/reject, handles 80% of traffic. (3) LLM judge (500ms) for the 8% with confidence between 0.5 and 0.8. (4) Human review -- 2% edge cases plus appeals. The threshold is set by the precision-recall trade-off: high precision (0.8+) for auto-block (minimize false positives), high recall (0.95+) for safety-critical content (never let it slip through). Active learning: human decisions feed back into the ML classifier, and thresholds are revisited weekly."
## Sources
- I Got An Offer -- "Generative AI System Design Interview" (Jan 2026)
- Exponent -- "ML System Design Interview Guide" (2026)
- Medium -- "ML System Design Interview Guide: Complete Framework" (Oct 2025)
- GeeksforGeeks -- "Top 25 ML System Design Interview Questions"
- Towards Data Science -- "Cracking ML System Design Interviews" (Nov 2025)
- InterviewNode -- "Generative AI System Design Interview Patterns" (Nov 2025)
- arXiv -- "LLM4Rerank: Auto-Reranking Framework" (2406.12433)
- Eugene Yan -- "Improving Recommendation Systems in the Age of LLMs" (Mar 2025)
## See Also
- RAG Architectures -- Design RAG at Scale: vector DB selection, chunking strategies, reranking pipeline
- LLM Cascade Routing -- model routing for 1M QPS serving, cascade escalation, cost optimization
- LLM API Pricing -- cost targets from capacity planning: $0.001-$0.03 per 1K tokens
- Production Deploy LLM -- vLLM, PagedAttention, model serving infrastructure
- LLM Evaluation Guardrails -- safety guardrails for content moderation pipeline design