Cheat Sheet: ML System Design¶
~9 min read
Prerequisites: Master prep guide | LLM inference cheat sheet
Type: synthesis / interview cheat sheet. Date: February 2026. Synthesis of: Meta/DeepMind interviews, RAG, Observability, LLMOps, Edge ML, Security
ML System Design is the most common interview type at Meta, Google, DeepMind, and startups (60-70% of technical rounds). You are expected to know not only ML models but also how to design production systems: capacity planning (QPS = DAU × Requests / 86400), latency budgets (200ms total = 50ms model + 30ms features + 20ms network...), database selection, caching (semantic cache: 50-80% hit rate, 70-90% cost savings), and multi-tier architectures (candidate generation -> scoring -> ranking). This cheat sheet covers the RESHADED framework, 8 production patterns, and capacity-estimation formulas.
Quick Reference: Key Numbers¶
| Metric | Value | Context |
|---|---|---|
| Model serving latency | 50-500ms | Typical LLM generation |
| Cache hit rate (semantic) | 50-80% | Production systems |
| Cost reduction (caching) | 70-90% | Semantic caching |
| QPS formula | DAU × Requests / 86400 | Capacity planning |
| Availability target | 99.9% | 8.76 hours downtime/year |
| Inference GPU memory | 2-4× model size | KV cache + activations |
1. System Design Interview Framework¶
RESHADED Framework¶
| Letter | Aspect | Questions |
|---|---|---|
| R | Requirements | Functional, non-functional, scale |
| E | Estimation | QPS, storage, latency budget |
| S | Storage | Database choice, schema, caching |
| H | High-level | Architecture diagram, components |
| A | APIs | Endpoints, contracts |
| D | Deep dive | 2-3 critical components |
| E | Edge cases | Failures, scaling, security |
| D | Discussion | Trade-offs, alternatives |
Time Allocation (45 min)¶
Requirements & Clarification: 5 min
High-Level Design: 10 min
Component Deep Dive: 20 min
Scalability & Trade-offs: 10 min
2. Capacity Planning Formulas¶
QPS Calculation¶
Example:
- 10M DAU
- 10 requests/user/day
- QPS = 10M × 10 / 86400 = 1,157 QPS
- Peak QPS = 1,157 × 3 = ~3,500 QPS
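A minimal Python sketch of the same arithmetic (the 3× peak factor, 200ms latency, and batch size of 32 are this example's assumptions, not universal constants; `capacity_plan` is an illustrative name):

```python
import math

def capacity_plan(dau, req_per_user, peak_factor=3, avg_latency_s=0.2, batch_size=32):
    """Back-of-envelope numbers from the formulas in this section."""
    avg_qps = dau * req_per_user / 86_400      # average QPS over a day
    peak_qps = avg_qps * peak_factor           # peak is typically 3-5x average
    concurrent = peak_qps * avg_latency_s      # Little's law: requests in flight
    gpus = math.ceil(concurrent / batch_size)  # GPU_count = peak_QPS x latency / batch
    return avg_qps, peak_qps, concurrent, gpus

print(capacity_plan(10_000_000, 10))  # ~ (1157, 3472, 694, 22)
```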
Storage Estimation¶
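A typical back-of-envelope estimate (assuming every request is logged):
Storage/day = DAU × requests/user × record_size
Total = Storage/day × retention_days × replication_factor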
Memory for Caching¶
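A common sizing rule (assuming you cache only the hot set): Cache memory ≈ hot_keys × avg_value_size × (1 + overhead)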
Typical overhead: 20-30% for Redis
Latency Budget¶
Total Latency Budget: 200ms
Breakdown:
- Network: 20-50ms
- Authentication: 5-10ms
- Feature fetch: 10-30ms
- Model inference: 50-100ms
- Post-processing: 5-10ms
3. ML System Design Patterns¶
Pattern 1: Recommendation System¶
graph LR
A["Candidate Generation<br/>FAISS/ANN<br/>millions -> 1000"] --> B["Scoring Model<br/>Light NN<br/>1000 -> 100"]
B --> C["Ranking Model<br/>Heavy Transformer<br/>100 -> 10"]
C --> D["Re-ranking<br/>Business rules<br/>10 -> Final"]
style A fill:#e8eaf6,stroke:#3f51b5
style B fill:#fff3e0,stroke:#ef6c00
style C fill:#fce4ec,stroke:#c62828
style D fill:#e8f5e9,stroke:#4caf50
Stages:
1. Candidate Generation: FAISS, ANN, collaborative filtering → 1M → 1000
2. Scoring: light model (logistic regression, small NN) → 1000 → 100
3. Ranking: heavy model (transformer, deep NN) → 100 → 10
4. Re-ranking: business rules, diversity → 10 → final
Latency Budget:
- Candidate Gen: 10-20ms
- Scoring: 20-30ms
- Ranking: 30-50ms
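A minimal stage-1 sketch with FAISS (random vectors stand in for trained two-tower embeddings, and the exact `IndexFlatIP` is used for brevity; at 1M+ items production systems switch to IVF/HNSW indexes):

```python
import numpy as np
import faiss  # pip install faiss-cpu

d, n_items = 64, 1_000_000
item_emb = np.random.rand(n_items, d).astype("float32")  # stand-in for item-tower output

index = faiss.IndexFlatIP(d)   # exact inner-product search
index.add(item_emb)

user_emb = np.random.rand(1, d).astype("float32")        # stand-in for user-tower output
_, candidate_ids = index.search(user_emb, 1000)          # stage 1: 1M -> 1000 candidates

# Stages 2-4 would score the 1000 candidates with a light model,
# rank the top 100 with a heavy model, then apply business rules.
```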
Pattern 2: Search Ranking¶
Query → Query Understanding → Retrieval → Ranking → Re-ranking → Results
│ │ │ │
▼ ▼ ▼ ▼
Spell check BM25/ Learning- Diversity,
Synonyms Vector DB to-Rank freshness
Retrieval Methods:

| Method | Recall | Latency | Use Case |
|---|---|---|---|
| BM25 | Medium | <5ms | Keyword matching |
| Vector Search | High | 10-30ms | Semantic similarity |
| Hybrid | Highest | 30-50ms | Best of both |
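Hybrid results are typically fused with Reciprocal Rank Fusion; a minimal sketch (k=60 is the conventional constant from the RRF paper):

```python
from collections import defaultdict

def rrf(rankings, k=60):
    """Reciprocal Rank Fusion over several ranked lists of doc ids (best first)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Fuse BM25 and vector-search results for one query:
print(rrf([["d3", "d1", "d7"], ["d1", "d7", "d9"]]))  # ['d1', 'd7', 'd3', 'd9']
```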
Pattern 3: Fraud Detection¶
graph TD
A["Fraud Detection System"] --> B["Real-time Layer<br/>Online Scoring"]
A --> C["Batch Layer<br/>Pattern Detection"]
B --> D["Rule Engine<br/>instant"]
B --> E["ML Model<br/>less than 50ms"]
B --> F["Anomaly Detection"]
C --> G["Feature Engineering"]
C --> H["Model Retraining"]
C --> I["Graph Analysis"]
D --> J["Human Review<br/>edge cases"]
E --> J
F --> J
style A fill:#f3e5f5,stroke:#9c27b0
style B fill:#fce4ec,stroke:#c62828
style C fill:#e8eaf6,stroke:#3f51b5
style D fill:#e8f5e9,stroke:#4caf50
style E fill:#e8f5e9,stroke:#4caf50
style F fill:#e8f5e9,stroke:#4caf50
style G fill:#fff3e0,stroke:#ef6c00
style H fill:#fff3e0,stroke:#ef6c00
style I fill:#fff3e0,stroke:#ef6c00
style J fill:#f3e5f5,stroke:#9c27b0
Tier-based Approach:
1. Tier 1: rule-based (instant, high precision)
2. Tier 2: ML model (50-100ms, balanced)
3. Tier 3: human review (slow, high value)
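A sketch of the tier routing, reusing the 0.3-0.7 uncertainty band from the interview answer in section 10 (the blocklist, amount rule, and model interface are placeholders):

```python
def route_transaction(txn, model, blocklist, amount_limit=10_000):
    """Tiered fraud routing; `model` is a placeholder with an assumed
    predict-probability interface returning P(fraud)."""
    # Tier 1: deterministic rules catch the obvious 60-70% in <10ms
    if txn["user_id"] in blocklist or txn["amount"] > amount_limit:
        return "block"
    # Tier 2: ML score (50-100ms) for everything the rules let through
    p_fraud = model.predict_proba(txn)
    if p_fraud < 0.3:
        return "approve"
    if p_fraud > 0.7:
        return "block"
    # Tier 3: the uncertain middle goes to async human review
    return "human_review"
```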
Pattern 4: Ad Click Prediction¶
graph LR
A["User + Context"] --> B["Feature Engineering"]
B --> C["Model"]
C --> D["Bid Decision"]
B --> E["User features"]
B --> F["Ad features"]
B --> G["Context"]
B --> H["Cross features"]
style A fill:#e8eaf6,stroke:#3f51b5
style B fill:#fff3e0,stroke:#ef6c00
style C fill:#e8f5e9,stroke:#4caf50
style D fill:#f3e5f5,stroke:#9c27b0
style E fill:#e8eaf6,stroke:#3f51b5
style F fill:#e8eaf6,stroke:#3f51b5
style G fill:#e8eaf6,stroke:#3f51b5
style H fill:#e8eaf6,stroke:#3f51b5
Exploration: Multi-armed bandit (ε-greedy, UCB)
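A minimal UCB1 sketch for this exploration step (rewards such as clicks tracked as running means; contextual features, as in LinUCB, are omitted):

```python
import math

class UCB1:
    """Classic UCB1 bandit over a fixed set of arms (e.g., ad variants)."""
    def __init__(self, n_arms):
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms   # running mean reward per arm

    def select(self):
        for arm, c in enumerate(self.counts):
            if c == 0:
                return arm             # play every arm once first
        t = sum(self.counts)
        return max(range(len(self.counts)),
                   key=lambda a: self.values[a]
                       + math.sqrt(2 * math.log(t) / self.counts[a]))

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```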
Pattern 5: Content Moderation¶
Content → Tier 1 (Keywords) → Tier 2 (ML) → Tier 3 (Human)
│ │ │
▼ ▼ ▼
Regex CNN/Transformer Review queue
Hash matching Multimodal Escalation
Latency Budget:
- Tier 1: <10ms (block obvious violations)
- Tier 2: 100-500ms (nuanced classification)
- Tier 3: hours-days (edge cases)
4. LLM-Specific Patterns¶
Pattern 6: LLM Serving at Scale¶
graph TD
LB["Load Balancer"] --> V1["vLLM/SGLang<br/>+ EAGLE-3"]
LB --> V2["vLLM/SGLang<br/>+ EAGLE-3"]
LB --> V3["vLLM/SGLang<br/>+ EAGLE-3"]
V1 --> RC["Redis Cache<br/>Semantic"]
V2 --> RC
V3 --> RC
style LB fill:#f3e5f5,stroke:#9c27b0
style V1 fill:#e8eaf6,stroke:#3f51b5
style V2 fill:#e8eaf6,stroke:#3f51b5
style V3 fill:#e8eaf6,stroke:#3f51b5
style RC fill:#e8f5e9,stroke:#4caf50
Cost Optimization Stack:

| Layer | Technique | Savings |
|---|---|---|
| Request | Semantic cache | 50-80% |
| Model | Model routing | 30-50% |
| Inference | Speculative decoding | 2-3× faster |
| Memory | Quantization (4-bit) | 75% memory |
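A minimal semantic-cache sketch with cosine similarity (the 0.95 threshold, the `SemanticCache` class, and `embed_fn` are illustrative assumptions; production setups back this with Redis plus a vector index and an eviction policy):

```python
import numpy as np

class SemanticCache:
    """Sketch only: linear scan, no eviction; `embed_fn` must return
    unit-normalized vectors so the dot product equals cosine similarity."""
    def __init__(self, embed_fn, threshold=0.95):
        self.embed, self.threshold = embed_fn, threshold
        self.keys, self.values = [], []   # query embeddings and cached responses

    def get(self, query):
        if not self.keys:
            return None                   # cold cache
        sims = np.stack(self.keys) @ self.embed(query)
        best = int(sims.argmax())
        return self.values[best] if sims[best] >= self.threshold else None

    def put(self, query, response):
        self.keys.append(self.embed(query))
        self.values.append(response)
```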
Pattern 7: RAG at Scale¶
graph LR
A["Query"] --> B["Embedding"]
B --> C["Vector DB<br/>HNSW Index<br/>10M+ docs"]
C --> D["Top-K"]
D --> E["Cross-Encoder<br/>Reranker"]
E --> F["LLM"]
F --> G["Response"]
style A fill:#e8eaf6,stroke:#3f51b5
style B fill:#e8f5e9,stroke:#4caf50
style C fill:#fff3e0,stroke:#ef6c00
style D fill:#fff3e0,stroke:#ef6c00
style E fill:#f3e5f5,stroke:#9c27b0
style F fill:#fce4ec,stroke:#c62828
style G fill:#e8f5e9,stroke:#4caf50
Latency Breakdown:

| Stage | Time | Optimization |
|---|---|---|
| Embedding | 5-10ms | Batch, smaller model |
| Vector search | 5-20ms | HNSW, smaller dimension |
| Reranking | 20-50ms | Skip for simple queries |
| LLM generation | 50-500ms | Speculative, quantization |
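A reranking sketch with a cross-encoder via sentence-transformers (the checkpoint name is one common public example, not prescribed by this pattern):

```python
from sentence_transformers import CrossEncoder

# A common public checkpoint, shown as an example; swap in your own reranker.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, docs, top_n=5):
    scores = reranker.predict([(query, d) for d in docs])  # one score per pair
    ranked = sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)
    return [d for d, _ in ranked[:top_n]]
```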
Pattern 8: Multi-Model Routing¶
graph LR
R["Router Model"] --> A["Complexity<br/>GPT-4o-mini / 8B"]
R --> B["Reasoning<br/>o1 / Claude Sonnet"]
R --> C["Code<br/>GPT-4 / Claude"]
R --> D["Simple<br/>Local 7B quantized"]
style R fill:#f3e5f5,stroke:#9c27b0
style A fill:#e8f5e9,stroke:#4caf50
style B fill:#fce4ec,stroke:#c62828
style C fill:#fff3e0,stroke:#ef6c00
style D fill:#e8eaf6,stroke:#3f51b5
Router Decision: $\text{Model} = \arg\min_m \{ \text{Cost}(m) : \text{Quality}(m, q) > \theta \}$
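The same decision rule as a sketch (model names, price points, and the quality estimator are illustrative; `quality_fn` stands in for a trained router/classifier):

```python
def route(query, models, quality_fn, theta=0.8):
    """Pick the cheapest model whose predicted quality clears theta."""
    viable = [m for m in models if quality_fn(m, query) > theta]
    if not viable:                        # nothing clears the bar: take the best
        return max(models, key=lambda m: quality_fn(m, query))
    return min(viable, key=lambda m: m["cost_per_1k_tokens"])

# Illustrative price points, not real quotes:
models = [
    {"name": "local-7b-q4", "cost_per_1k_tokens": 0.0001},
    {"name": "gpt-4o-mini", "cost_per_1k_tokens": 0.0006},
    {"name": "gpt-4",       "cost_per_1k_tokens": 0.03},
]
```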
5. Database Selection¶
Decision Matrix¶
| Data Type | Read-heavy | Write-heavy | Real-time | Choice |
|---|---|---|---|---|
| Structured | ✓ | - | - | PostgreSQL |
| Time-series | - | ✓ | ✓ | TimescaleDB |
| Key-value | ✓ | ✓ | ✓ | Redis |
| Vector | ✓ | - | - | Qdrant/Milvus |
| Graph | ✓ | - | - | Neo4j |
| Document | ✓ | ✓ | - | MongoDB |
Vector Database Comparison¶
| DB | Scale | Latency | Best For |
|---|---|---|---|
| Milvus | 1B+ | <5ms | Enterprise |
| Qdrant | 100M+ | <10ms | Self-hosted |
| Pinecone | 10M+ | 5-20ms | Managed |
| pgvector | 10M | 20-100ms | Existing Postgres |
6. Observability & Monitoring¶
Metrics to Track¶
| Category | Metrics |
|---|---|
| Performance | Latency (P50, P95, P99), throughput |
| Quality | Accuracy, hallucination rate, relevance |
| Cost | Tokens, GPU hours, API calls |
| System | CPU, memory, GPU utilization |
Alert Thresholds¶
| Metric | Warning | Critical |
|---|---|---|
| Latency P95 | >1s | >3s |
| Error rate | >1% | >5% |
| Cost spike | >2x daily | >5x daily |
| Hallucination | >5% | >10% |
Tools Comparison¶
| Tool | Type | Focus |
|---|---|---|
| Langfuse | Open source | Tracing + cost |
| Arize Phoenix | Open source | RAG evaluation |
| Helicone | Proxy-based | API monitoring |
| DeepEval | Framework | Metrics |
7. Security Patterns¶
OWASP LLM Top 10 (2025)¶
| Rank | Risk | Mitigation |
|---|---|---|
| LLM01 | Prompt Injection | Input filtering, intent classification |
| LLM02 | Sensitive Disclosure | PII detection, output filtering |
| LLM03 | Supply Chain | Verify model provenance |
| LLM04 | Model Poisoning | Data validation, monitoring |
| LLM05 | Output Handling | Sanitize, validate outputs |
Multi-Layer Defense¶
graph TD
A["Layer 1: Input Filtering<br/>regex + ML"] --> B["Layer 2: Intent Classification<br/>safe / unsafe / ambiguous"]
B --> C["Layer 3: Model Inference<br/>with guardrails"]
C --> D["Layer 4: Output Filtering<br/>PII, harmful content"]
D --> E["Layer 5: Anomaly Detection<br/>behavior monitoring"]
style A fill:#e8f5e9,stroke:#4caf50
style B fill:#e8eaf6,stroke:#3f51b5
style C fill:#fff3e0,stroke:#ef6c00
style D fill:#fce4ec,stroke:#c62828
style E fill:#f3e5f5,stroke:#9c27b0
Defense Effectiveness¶
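One common definition (an assumption here, measured against a red-team attack suite): Effectiveness = blocked_attacks / total_attack_attempts.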
Good target: >90% effectiveness
8. LLMOps vs MLOps¶
Key Differences¶
| Aspect | MLOps | LLMOps |
|---|---|---|
| Primary metric | Accuracy | Quality + latency + cost |
| Evaluation | Test set | LLM-as-judge + human |
| Deployment | Model versioning | Model + prompt + adapter |
| Monitoring | Data drift | Hallucination, relevance |
| Iteration | Weeks | Hours-days |
Prompt Management¶
Prompt Registry
├── prompt_versions/
│ ├── v1.0.0.yaml
│ ├── v1.1.0.yaml
│ └── v2.0.0.yaml
├── experiments/
│   ├── ab_test_a/
│   └── ab_test_b/
└── production/
└── current -> v1.1.0.yaml
Evaluation Pipeline¶
def evaluate_prompt(prompt_version):
    """Evaluate one prompt version against the offline test set.

    Assumes the eval harness provides `llm`, `judge`, `test_set`, and
    `aggregate`; `response` carries `.latency` and `.tokens` metadata.
    """
    results = []
    for test_case in test_set:
        response = llm.generate(prompt_version, test_case.input)
        # LLM-as-judge: score the response against the expected output
        quality = judge.evaluate(response, test_case.expected)
        # Per-case metrics for later aggregation
        results.append({
            'input': test_case.input,
            'response': response,
            'quality': quality,
            'latency': response.latency,
            'tokens': response.tokens,
        })
    return aggregate(results)
9. Edge / On-Device ML¶
Memory Constraints¶
| Device | RAM | Max Model (4-bit) |
|---|---|---|
| iPhone 15 Pro | 8GB | ~3B params |
| Pixel 9 Pro | 16GB | ~7B params |
| Jetson Orin AGX | 64GB | ~32B params |
Quantization for Edge¶
| Method | Quality (INT4) | Best For |
|---|---|---|
| GGUF Q4_K_M | ~92% | Mobile CPU |
| AWQ | ~98% | Edge GPU |
| GPTQ | ~97% | NVIDIA |
Model Size Formula¶
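Model size (GB) = params (billions) × bits_per_weight / 8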
Example: 7B @ 4-bit = 7B × 4 / 8 = 3.5 GB
Common Misconceptions¶
Misconception: 'QPS = DAU × Requests / 86400 is enough for capacity planning'
That is average QPS. Peak QPS is typically 3-5× higher (lunch hour, evening). For LLM serving you also have to account for: (1) generation latency of 50-500ms per request, (2) concurrent requests = QPS × avg_latency, (3) GPU memory per request (KV cache). The practical formula: GPU_count = peak_QPS × avg_latency / batch_size. For 3,500 peak QPS at 200ms latency and batch=32 you need ~22 GPU instances.
Misconception: 'A recommendation system is just candidate generation + ranking'
This skips critical components: (1) a feature store for consistency between training and serving, (2) cold-start handling for new users/items (content-based fallback), (3) diversity/exploration (epsilon-greedy or Thompson Sampling), (4) a feedback loop for retraining. Without them the system degrades quickly due to popularity bias.
Misconception: 'Tier-1 rules in fraud detection are primitive'
Rules handle 60-70% of the obvious cases in <10ms at near-zero compute cost. The ML model is only needed for the remaining 30-40%. Without tier 1, every request goes through ML, which is 10-50× more expensive and slower. In production the 3-tier architecture (rules -> ML -> human) is the standard, not a compromise.
10. Interview Questions¶
Question: Design a recommendation system¶
Weak answer: "We need a collaborative filtering + content-based model"
Strong answer: "Multi-stage pipeline: (1) candidate generation via FAISS/ANN -- 1M items -> 1000 in 10-20ms, (2) scoring with a light model (logistic regression) -- 1000 -> 100 in 20-30ms, (3) ranking with a heavy model (DCN/DIN) -- 100 -> 10 in 30-50ms, (4) re-ranking with business rules (diversity, freshness). Total latency budget 100ms. Feature store for train/serve consistency. Cold start: content-based fallback + exploration (epsilon-greedy)."
Question: How do you serve LLMs at scale?¶
Weak answer: "Put vLLM behind a load balancer"
Strong answer: "5-layer stack: (1) semantic cache (Redis + vector similarity, 50-80% hit rate, 70-90% cost savings), (2) model router (simple queries -> 7B quantized, complex -> GPT-4), (3) SGLang/vLLM instances + EAGLE-3 speculative decoding (2× speedup), (4) AWQ 4-bit quantization (75% memory savings), (5) monitoring: latency P95, cost per query, hallucination rate. Formula: GPU_count = peak_QPS × avg_latency / batch_size."
Question: Design a ChatGPT-like system¶
Weak answer: "Frontend, backend, LLM API -- three components"
Strong answer: "Frontend -> API Gateway (auth + rate limiter: 10 req/min free, 60 req/min paid) -> model router (complexity classifier: simple -> 7B, reasoning -> o1, code -> Claude) -> SGLang cluster (EAGLE-3, RadixAttention for multi-turn) -> streaming response (SSE). Caching: semantic cache before the router + prefix cache in SGLang. Storage: conversation history in PostgreSQL, embeddings in Qdrant. Monitoring: Langfuse for tracing, Arize for RAG eval."
Question: Design fraud detection¶
Weak answer: "An ML model that classifies transactions"
Strong answer: "3-tier architecture: tier-1 rules (<10ms, 60-70% of cases: amount >$10K, velocity checks, blocklist), tier-2 ML model (50-100ms, gradient-boosted trees on 100+ features: user history, device fingerprint, graph features), tier-3 human review (async, edge cases with uncertainty 0.3-0.7). Lambda architecture: real-time scoring + batch feature engineering. Feature store for consistency. Monitoring: concept drift detection, false-positive rate tracking."
Question: RAG vs long context -- when to use which?¶
Weak answer: "Long context is better because it is simpler"
Strong answer: "The cost ratio is 100:1 in favor of RAG. RAG: cheaper, fresh data (instant updates), but dependent on retrieval quality. Long context: simpler, but 100× more expensive, needs the full corpus re-fed for new data, and suffers from 'lost in the middle'. Production best practice: RAG for retrieval (embed -> HNSW -> top-K -> rerank) + long context for reasoning (process the retrieved chunks). The hybrid cuts cost by 90% while preserving quality."
11. Formulas Quick Reference¶
QPS¶
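QPS = DAU × requests_per_user / 86,400 (average); Peak QPS ≈ 3-5× average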
Storage¶
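Storage/day = DAU × requests/user × record_size; total = daily × retention_days × replication_factor (same back-of-envelope as section 2)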
Availability¶
99.9% = 8.76 hours downtime/year
Cache Hit Rate¶
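Hit rate = cache_hits / total_requests (semantic caches in production reach 50-80%)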
Cost per Query¶
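A typical estimate (assumed breakdown; conventions vary): self-hosted cost/query = GPU_hourly_cost / queries_per_hour; API cost/query = input_tokens × input_price + output_tokens × output_price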
UCB (Exploration)¶
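$\text{UCB}_i = \bar{x}_i + \sqrt{2 \ln t / n_i}$, where $\bar{x}_i$ is the mean reward of arm $i$, $n_i$ its pull count, and $t$ the total number of pulls (matches the UCB1 sketch in Pattern 4)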
12. Sources Synthesized¶
- meta-ml-interview-2025.md — System design patterns, capacity planning
- deepmind-interview-2025-2026.md — Research-focused system design
- openai-anthropic-interviews-2025-2026.md — Company-specific patterns
- rag-system-design-2025-2026.md — RAG architecture
- llm-observability-2025.md — Monitoring, metrics
- llmops-vs-mlops-2025-2026.md — Operational patterns
- edge-ml-on-device-2025.md — Edge deployment
- llm-security-2025.md — OWASP, defense patterns
- advanced-rag-patterns-2025.md — RAG evolution
- inference-engines-comparison-2025-2026.md — Serving patterns
Extended coverage in ml-practice-prep/ml-system-design/materials.md (sections 11-20):
- Section 11: Multi-Armed Bandits (UCB, Thompson Sampling, LinUCB)
- Section 12: Online Learning / Streaming ML (FTRL-Proximal, Concept Drift, Hoeffding Trees)
- Section 13: Multi-Stage Recommender Systems (Two-Tower, ANN, MMR, DIN/DCN)
- Section 14: Causal Inference (ATE/ATT/CATE, Propensity Matching, DiD, IV, Uplift)
- Section 15: Vector Databases (HNSW, IVF, Pinecone/Milvus/Qdrant/pgvector, Hybrid Search, RRF)
- Section 16: Cost Optimization (GPU utilization, spot instances, model right-sizing, semantic caching, auto-scaling)
- Section 17: Multi-Model Serving (weighted round-robin, cascade, confidence-based, latency-based, circuit breaker, A/B testing)
- Section 18: Data Quality for ML (Great Expectations, TFDV, schema evolution, data lineage tracking, 7 quality dimensions)
- Section 19: Foundation Models in Production (multi-tenant K8s, prompt caching economics, multi-layer caching, fallback chains)
- Section 20: AI Agents in Production (OWASP LLM Top 10, defense-in-depth 6 layers, HITL patterns, LLM-as-Judge, OTel GenAI conventions)