
Cheat Sheet: ML System Design

~9 min read

Read first: Master prep guide | LLM inference cheat sheet

Type: synthesis / interview cheat sheet. Date: February 2026. Synthesis of: Meta/DeepMind interviews, RAG, Observability, LLMOps, Edge ML, Security

ML System Design is the most common interview type at Meta, Google, DeepMind, and startups (60-70% of technical rounds). You are expected not only to know ML models but to design production systems: capacity planning (QPS = DAU × Requests / 86400), latency budgets (200ms total = 50ms model + 30ms features + 20ms network...), database selection, caching (semantic cache: 50-80% hit rate, 70-90% cost savings), and multi-tier architecture (candidate generation -> scoring -> ranking). This cheat sheet covers the RESHADED framework, 8 production patterns, and formulas for capacity estimation.


Quick Reference: Key Numbers

| Metric | Value | Context |
|--------|-------|---------|
| Model serving latency | 50-500ms | Typical LLM generation |
| Cache hit rate (semantic) | 50-80% | Production systems |
| Cost reduction (caching) | 70-90% | Semantic caching |
| QPS formula | DAU × Requests / 86400 | Capacity planning |
| Availability target | 99.9% | 8.76 hours downtime/year |
| Inference GPU memory | 2-4× model size | KV cache + activations |

1. System Design Interview Framework

RESHADED Framework

| Letter | Aspect | Questions |
|--------|--------|-----------|
| R | Requirements | Functional, non-functional, scale |
| E | Estimation | QPS, storage, latency budget |
| S | Storage | Database choice, schema, caching |
| H | High-level | Architecture diagram, components |
| A | APIs | Endpoints, contracts |
| D | Deep dive | 2-3 critical components |
| E | Edge cases | Failures, scaling, security |
| D | Discussion | Trade-offs, alternatives |

Time Allocation (45 min)

Requirements & Clarification:    5 min
High-Level Design:              10 min
Component Deep Dive:            20 min
Scalability & Trade-offs:       10 min

2. Capacity Planning Formulas

QPS Calculation

\[\text{QPS} = \frac{\text{DAU} \times \text{Requests per User}}{86400}\]

Example:
- 10M DAU
- 10 requests/user/day
- QPS = 10M × 10 / 86400 = 1,157 QPS
- Peak QPS = 1,157 × 3 = ~3,500 QPS
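
A minimal Python sketch of this arithmetic (the 3x peak factor and 200ms average latency are assumptions used throughout this cheat sheet, not universal constants); the concurrency line is just Little's law:

def capacity_estimate(dau: int, requests_per_user: float,
                      peak_factor: float = 3.0, avg_latency_s: float = 0.2):
    """Back-of-envelope capacity numbers for an interview whiteboard."""
    avg_qps = dau * requests_per_user / 86_400    # seconds per day
    peak_qps = avg_qps * peak_factor              # rush-hour multiplier
    concurrent = peak_qps * avg_latency_s         # Little's law: L = lambda * W
    return avg_qps, peak_qps, concurrent

avg_qps, peak_qps, concurrent = capacity_estimate(10_000_000, 10)
print(f"{avg_qps:.0f} avg QPS, {peak_qps:.0f} peak QPS, {concurrent:.0f} in-flight requests")
# -> 1157 avg QPS, 3472 peak QPS, 694 in-flight requests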

Storage Estimation

\[\text{Storage} = \text{Data per User} \times \text{Users} \times \text{Growth Factor}\]
\[\text{Storage}_{\text{total}} = \text{Active Data} + \text{Archive} \times \text{Replication}\]

Memory for Caching

\[\text{Cache Size} = \text{Working Set} \times (1 + \text{Overhead})\]

Typical overhead: 20-30% for Redis
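
As a rough illustration (the 10M-responses-at-~2KB working set below is an invented example, not a measured number):

def cache_size_gb(working_set_gb: float, overhead: float = 0.25) -> float:
    """Provisioned cache = working set plus metadata/fragmentation overhead."""
    return working_set_gb * (1 + overhead)

# e.g. 10M cached responses x ~2 KB each ≈ 20 GB working set
print(cache_size_gb(20.0))  # -> 25.0 GB with 25% Redis overhead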

Latency Budget

Total Latency Budget: 200ms

Breakdown:
- Network:          20-50ms
- Authentication:    5-10ms
- Feature fetch:    10-30ms
- Model inference:  50-100ms
- Post-processing:   5-10ms

3. ML System Design Patterns

Pattern 1: Recommendation System

graph LR
    A["Candidate Generation<br/>FAISS/ANN<br/>millions -> 1000"] --> B["Scoring Model<br/>Light NN<br/>1000 -> 100"]
    B --> C["Ranking Model<br/>Heavy Transformer<br/>100 -> 10"]
    C --> D["Re-ranking<br/>Business rules<br/>10 -> Final"]

    style A fill:#e8eaf6,stroke:#3f51b5
    style B fill:#fff3e0,stroke:#ef6c00
    style C fill:#fce4ec,stroke:#c62828
    style D fill:#e8f5e9,stroke:#4caf50

Stages:
1. Candidate Generation: FAISS, ANN, collaborative filtering (1M → 1000)
2. Scoring: light model (logistic regression, small NN) (1000 → 100)
3. Ranking: heavy model (transformer, deep NN) (100 → 10)
4. Re-ranking: business rules, diversity (10 → final)

Latency Budget:
- Candidate Gen: 10-20ms
- Scoring: 20-30ms
- Ranking: 30-50ms
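
A skeleton of the funnel in Python; ann_index, user_embedding, light_model, heavy_ranker and apply_business_rules are hypothetical components standing in for FAISS, the scoring model, and so on:

def recommend(user_id: str, k: int = 10) -> list[str]:
    """Each stage sees fewer items but spends more compute per item."""
    candidates = ann_index.search(user_embedding(user_id), top_k=1000)    # ~1M -> 1000, 10-20ms
    scored = light_model.score(user_id, candidates)[:100]                 # 1000 -> 100, 20-30ms
    ranked = heavy_ranker.rank(user_id, scored)[:50]                      # 100 -> top slice, 30-50ms
    final = apply_business_rules(ranked, diversity=True, freshness=True)  # re-ranking
    return final[:k]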

Pattern 2: Search Ranking

Query → Query Understanding → Retrieval → Ranking → Re-ranking → Results
              │                   │           │           │
              ▼                   ▼           ▼           ▼
         Spell check          BM25/       Learning-   Diversity,
         Synonyms            Vector DB    to-Rank     freshness

Retrieval Methods:

| Method | Recall | Latency | Use Case |
|--------|--------|---------|----------|
| BM25 | Medium | <5ms | Keyword matching |
| Vector Search | High | 10-30ms | Semantic similarity |
| Hybrid | Highest | 30-50ms | Best of both |
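
One common way to build the hybrid row is reciprocal rank fusion over the BM25 and vector result lists; a minimal sketch (bm25_results and vector_results are assumed to be ranked lists of doc ids from each retriever):

from collections import defaultdict

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked doc-id lists into one hybrid ranking."""
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)   # better rank -> larger contribution
    return sorted(scores, key=scores.get, reverse=True)

# hypothetical inputs from the two retrievers
hybrid = reciprocal_rank_fusion([bm25_results, vector_results])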

Pattern 3: Fraud Detection

graph TD
    A["Fraud Detection System"] --> B["Real-time Layer<br/>Online Scoring"]
    A --> C["Batch Layer<br/>Pattern Detection"]
    B --> D["Rule Engine<br/>instant"]
    B --> E["ML Model<br/>less than 50ms"]
    B --> F["Anomaly Detection"]
    C --> G["Feature Engineering"]
    C --> H["Model Retraining"]
    C --> I["Graph Analysis"]
    D --> J["Human Review<br/>edge cases"]
    E --> J
    F --> J

    style A fill:#f3e5f5,stroke:#9c27b0
    style B fill:#fce4ec,stroke:#c62828
    style C fill:#e8eaf6,stroke:#3f51b5
    style D fill:#e8f5e9,stroke:#4caf50
    style E fill:#e8f5e9,stroke:#4caf50
    style F fill:#e8f5e9,stroke:#4caf50
    style G fill:#fff3e0,stroke:#ef6c00
    style H fill:#fff3e0,stroke:#ef6c00
    style I fill:#fff3e0,stroke:#ef6c00
    style J fill:#f3e5f5,stroke:#9c27b0

Tier-based Approach:
1. Tier 1: Rule-based (instant, high precision)
2. Tier 2: ML model (50-100ms, balanced)
3. Tier 3: Human review (slow, high value)
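
A sketch of the tier routing; blocklist, is_trusted_pattern, build_features, fraud_model and review_queue are hypothetical, and the 0.3-0.7 uncertainty band comes from the interview answers later in this sheet:

def assess_transaction(txn: dict) -> str:
    """Rules decide the obvious cases, ML the rest, humans the ambiguous band."""
    # Tier 1: deterministic rules, <10ms, near-zero compute cost
    if txn["amount"] > 10_000 or txn["card_id"] in blocklist:
        return "BLOCK"
    if is_trusted_pattern(txn):      # e.g. recurring small payment to a known merchant
        return "APPROVE"

    # Tier 2: ML score on the remaining traffic, 50-100ms
    p_fraud = fraud_model.predict_proba(build_features(txn))
    if p_fraud > 0.7:
        return "BLOCK"
    if p_fraud < 0.3:
        return "APPROVE"

    # Tier 3: ambiguous scores go to asynchronous human review
    review_queue.enqueue(txn, score=p_fraud)
    return "PENDING_REVIEW"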

Pattern 4: Ad Click Prediction

graph LR
    A["User + Context"] --> B["Feature Engineering"]
    B --> C["Model"]
    C --> D["Bid Decision"]
    B --> E["User features"]
    B --> F["Ad features"]
    B --> G["Context"]
    B --> H["Cross features"]

    style A fill:#e8eaf6,stroke:#3f51b5
    style B fill:#fff3e0,stroke:#ef6c00
    style C fill:#e8f5e9,stroke:#4caf50
    style D fill:#f3e5f5,stroke:#9c27b0
    style E fill:#e8eaf6,stroke:#3f51b5
    style F fill:#e8eaf6,stroke:#3f51b5
    style G fill:#e8eaf6,stroke:#3f51b5
    style H fill:#e8eaf6,stroke:#3f51b5

Exploration: Multi-armed bandit (ε-greedy, UCB)

\[\text{UCB}(a) = \bar{X}_a + \sqrt{\frac{2 \ln N}{n_a}}\]
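
A self-contained UCB1 implementation matching the formula above (the running mean reward per arm, total pulls N, and per-arm pulls n_a):

import math

class UCB1:
    """Pick the arm with the highest mean reward plus exploration bonus."""
    def __init__(self, n_arms: int):
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms       # running mean reward per arm

    def select(self) -> int:
        for arm, count in enumerate(self.counts):
            if count == 0:                 # play every arm once before trusting the bonus
                return arm
        total = sum(self.counts)
        ucb = [mean + math.sqrt(2 * math.log(total) / count)
               for mean, count in zip(self.values, self.counts)]
        return ucb.index(max(ucb))

    def update(self, arm: int, reward: float) -> None:
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]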

Pattern 5: Content Moderation

Content → Tier 1 (Keywords) → Tier 2 (ML) → Tier 3 (Human)
              │                    │              │
              ▼                    ▼              ▼
          Regex              CNN/Transformer   Review queue
          Hash matching      Multimodal       Escalation

Latency Budget:
- Tier 1: <10ms (block obvious violations)
- Tier 2: 100-500ms (nuanced classification)
- Tier 3: Hours-days (edge cases)


4. LLM-Specific Patterns

Pattern 6: LLM Serving at Scale

graph TD
    LB["Load Balancer"] --> V1["vLLM/SGLang<br/>+ EAGLE-3"]
    LB --> V2["vLLM/SGLang<br/>+ EAGLE-3"]
    LB --> V3["vLLM/SGLang<br/>+ EAGLE-3"]
    V1 --> RC["Redis Cache<br/>Semantic"]
    V2 --> RC
    V3 --> RC

    style LB fill:#f3e5f5,stroke:#9c27b0
    style V1 fill:#e8eaf6,stroke:#3f51b5
    style V2 fill:#e8eaf6,stroke:#3f51b5
    style V3 fill:#e8eaf6,stroke:#3f51b5
    style RC fill:#e8f5e9,stroke:#4caf50

Cost Optimization Stack:

| Layer | Technique | Savings |
|-------|-----------|---------|
| Request | Semantic cache | 50-80% |
| Model | Model routing | 30-50% |
| Inference | Speculative decoding | 2-3× faster |
| Memory | Quantization (4-bit) | 75% memory |
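
A sketch of the request-level semantic cache; embed, vector_store and llm are hypothetical interfaces, and the 0.92 similarity threshold is an assumption you would tune against the false-hit rate:

def cached_completion(query: str, threshold: float = 0.92) -> str:
    """Serve a cached answer when a semantically similar query was already handled."""
    q_vec = embed(query)                            # small embedding model
    hit = vector_store.nearest(q_vec, top_k=1)      # ANN lookup over cached queries
    if hit is not None and hit.similarity >= threshold:
        return hit.cached_response                  # cache hit: no GPU time spent
    response = llm.generate(query)                  # cache miss: pay for generation
    vector_store.add(q_vec, cached_response=response)
    return response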

Pattern 7: RAG at Scale

graph LR
    A["Query"] --> B["Embedding"]
    B --> C["Vector DB<br/>HNSW Index<br/>10M+ docs"]
    C --> D["Top-K"]
    D --> E["Cross-Encoder<br/>Reranker"]
    E --> F["LLM"]
    F --> G["Response"]

    style A fill:#e8eaf6,stroke:#3f51b5
    style B fill:#e8f5e9,stroke:#4caf50
    style C fill:#fff3e0,stroke:#ef6c00
    style D fill:#fff3e0,stroke:#ef6c00
    style E fill:#f3e5f5,stroke:#9c27b0
    style F fill:#fce4ec,stroke:#c62828
    style G fill:#e8f5e9,stroke:#4caf50

Latency Breakdown:

| Stage | Time | Optimization |
|-------|------|--------------|
| Embedding | 5-10ms | Batch, smaller model |
| Vector search | 5-20ms | HNSW, smaller dimension |
| Reranking | 20-50ms | Skip for simple queries |
| LLM generation | 50-500ms | Speculative, quantization |
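
The stages above as a sketch; embedder, vector_db, needs_rerank, cross_encoder, build_prompt and llm are placeholders for whatever components the design uses:

def rag_answer(query: str, top_k: int = 20, final_k: int = 5) -> str:
    """Embed -> ANN search -> optional rerank -> generate, matching the budget above."""
    q_vec = embedder.encode(query)                          # 5-10ms
    docs = vector_db.search(q_vec, top_k=top_k)             # 5-20ms over an HNSW index
    if needs_rerank(query):                                 # skip the cross-encoder for simple queries
        docs = cross_encoder.rerank(query, docs)[:final_k]  # 20-50ms
    else:
        docs = docs[:final_k]
    return llm.generate(build_prompt(query, docs))          # 50-500ms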

Pattern 8: Multi-Model Routing

graph LR
    R["Router Model"] --> A["Complexity<br/>GPT-4o-mini / 8B"]
    R --> B["Reasoning<br/>o1 / Claude Sonnet"]
    R --> C["Code<br/>GPT-4 / Claude"]
    R --> D["Simple<br/>Local 7B quantized"]

    style R fill:#f3e5f5,stroke:#9c27b0
    style A fill:#e8f5e9,stroke:#4caf50
    style B fill:#fce4ec,stroke:#c62828
    style C fill:#fff3e0,stroke:#ef6c00
    style D fill:#e8eaf6,stroke:#3f51b5

Router Decision:

\[\text{Model}(q) = \arg\min_m \{\, \text{Cost}(m) : \text{Quality}(m, q) > \theta \,\}\]
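
In code, that is "the cheapest model whose expected quality clears the bar"; quality_estimate is a hypothetical router/classifier score for a query-model pair:

def route(query: str, models: list[dict], theta: float = 0.8) -> str:
    """Return the cheapest model whose predicted quality on this query exceeds theta."""
    eligible = [m for m in models if quality_estimate(m["name"], query) > theta]
    if not eligible:
        # nothing clears the bar: fall back to the highest-quality model
        return max(models, key=lambda m: quality_estimate(m["name"], query))["name"]
    return min(eligible, key=lambda m: m["cost_per_1k_tokens"])["name"]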


5. Database Selection

Decision Matrix

| Data Type | Read-heavy | Write-heavy | Real-time | Choice |
|-----------|------------|-------------|-----------|--------|
| Structured | ✓ | - | - | PostgreSQL |
| Time-series | - | ✓ | ✓ | TimescaleDB |
| Key-value | ✓ | ✓ | ✓ | Redis |
| Vector | ✓ | - | - | Qdrant/Milvus |
| Graph | ✓ | - | - | Neo4j |
| Document | ✓ | ✓ | - | MongoDB |

Vector Database Comparison

| DB | Scale | Latency | Best For |
|----|-------|---------|----------|
| Milvus | 1B+ | <5ms | Enterprise |
| Qdrant | 100M+ | <10ms | Self-hosted |
| Pinecone | 10M+ | 5-20ms | Managed |
| pgvector | 10M | 20-100ms | Existing Postgres |

6. Observability & Monitoring

Metrics to Track

| Category | Metrics |
|----------|---------|
| Performance | Latency (P50, P95, P99), throughput |
| Quality | Accuracy, hallucination rate, relevance |
| Cost | Tokens, GPU hours, API calls |
| System | CPU, memory, GPU utilization |

Alert Thresholds

| Metric | Warning | Critical |
|--------|---------|----------|
| Latency P95 | >1s | >3s |
| Error rate | >1% | >5% |
| Cost spike | >2x daily | >5x daily |
| Hallucination | >5% | >10% |

Tools Comparison

| Tool | Type | Focus |
|------|------|-------|
| Langfuse | Open source | Tracing + cost |
| Arize Phoenix | Open source | RAG evaluation |
| Helicone | Proxy-based | API monitoring |
| DeepEval | Framework | Metrics |

7. Security Patterns

OWASP LLM Top 10 (2025)

| Rank | Risk | Mitigation |
|------|------|------------|
| LLM01 | Prompt Injection | Input filtering, intent classification |
| LLM02 | Sensitive Disclosure | PII detection, output filtering |
| LLM03 | Supply Chain | Verify model provenance |
| LLM04 | Model Poisoning | Data validation, monitoring |
| LLM05 | Output Handling | Sanitize, validate outputs |

Multi-Layer Defense

graph TD
    A["Layer 1: Input Filtering<br/>regex + ML"] --> B["Layer 2: Intent Classification<br/>safe / unsafe / ambiguous"]
    B --> C["Layer 3: Model Inference<br/>with guardrails"]
    C --> D["Layer 4: Output Filtering<br/>PII, harmful content"]
    D --> E["Layer 5: Anomaly Detection<br/>behavior monitoring"]

    style A fill:#e8f5e9,stroke:#4caf50
    style B fill:#e8eaf6,stroke:#3f51b5
    style C fill:#fff3e0,stroke:#ef6c00
    style D fill:#fce4ec,stroke:#c62828
    style E fill:#f3e5f5,stroke:#9c27b0

Defense Effectiveness

\[\text{Effectiveness} = 1 - \frac{\text{ASR}_{\text{defended}}}{\text{ASR}_{\text{baseline}}}\]

Good target: >90% effectiveness


8. LLMOps vs MLOps

Key Differences

| Aspect | MLOps | LLMOps |
|--------|-------|--------|
| Primary metric | Accuracy | Quality + latency + cost |
| Evaluation | Test set | LLM-as-judge + human |
| Deployment | Model versioning | Model + prompt + adapter |
| Monitoring | Data drift | Hallucination, relevance |
| Iteration | Weeks | Hours-days |

Prompt Management

Prompt Registry
├── prompt_versions/
│   ├── v1.0.0.yaml
│   ├── v1.1.0.yaml
│   └── v2.0.0.yaml
├── experiments/
│   ├── A_test_a/
│   └── A_test_b/
└── production/
    └── current -> v1.1.0.yaml

Evaluation Pipeline

def evaluate_prompt(prompt_version, test_set, llm, judge):
    """Run one prompt version against a fixed test set and aggregate quality/cost metrics.

    llm, judge and aggregate stand in for the serving client, the LLM-as-judge
    evaluator and a metrics aggregation helper."""
    results = []
    for test_case in test_set:
        response = llm.generate(prompt_version, test_case.input)

        # LLM-as-judge: score the response against the expected answer
        quality = judge.evaluate(response, test_case.expected)

        # Per-case metrics for later aggregation (mean quality, P95 latency, token cost)
        results.append({
            'input': test_case.input,
            'response': response,
            'quality': quality,
            'latency': response.latency,
            'tokens': response.tokens,
        })

    return aggregate(results)

9. Edge / On-Device ML

Memory Constraints

| Device | RAM | Max Model (4-bit) |
|--------|-----|-------------------|
| iPhone 15 Pro | 8GB | ~3B params |
| Pixel 9 Pro | 16GB | ~7B params |
| Jetson Orin AGX | 64GB | ~32B params |

Quantization for Edge

| Method | Quality (INT4) | Best For |
|--------|----------------|----------|
| GGUF Q4_K_M | ~92% | Mobile CPU |
| AWQ | ~98% | Edge GPU |
| GPTQ | ~97% | NVIDIA |

Model Size Formula

\[\text{Size}_{GB} = \frac{\text{Params} \times \text{Bits}}{8 \times 10^9}\]

Example: 7B @ 4-bit = 7B × 4 / 8 = 3.5 GB
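
The same arithmetic against the device table above (weights only; KV cache and activations add the 2-4x headroom from the Quick Reference):

def model_size_gb(params_billion: float, bits: int) -> float:
    """Weights-only footprint in GB."""
    return params_billion * 1e9 * bits / 8 / 1e9

print(model_size_gb(7, 4))    # -> 3.5 GB: fits an 8 GB phone with room for KV cache
print(model_size_gb(32, 4))   # -> 16.0 GB: Jetson-class memory needed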


Common Misconceptions

Misconception: 'QPS = DAU x Requests / 86400 is enough for capacity planning'

This is the average QPS. Peak QPS is typically 3-5x higher (lunch hour, evening). For LLM serving you also have to account for: (1) generation latency of 50-500ms per request, (2) concurrent requests = QPS x avg_latency, (3) GPU memory per request (KV cache). The practical formula: GPU_count = peak_QPS x avg_latency / batch_size. At 3,500 peak QPS, 200ms latency and batch=32 you need ~22 GPU instances.
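
The same estimate in a couple of lines:

import math

def gpu_instances(peak_qps: float, avg_latency_s: float, batch_size: int) -> int:
    """Instances needed so that peak in-flight requests fit into per-instance batches."""
    concurrent = peak_qps * avg_latency_s          # requests in flight at peak
    return math.ceil(concurrent / batch_size)

print(gpu_instances(3500, 0.2, 32))  # -> 22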

Misconception: 'A recommendation system is just candidate generation + ranking'

This skips critical components: (1) a feature store for consistency between training and serving, (2) cold-start handling for new users/items (content-based fallback), (3) diversity/exploration (epsilon-greedy or Thompson Sampling), (4) a feedback loop for retraining. Without them the system quickly degrades due to popularity bias.

Misconception: 'Tier-1 rules in fraud detection are primitive'

Rules handle 60-70% of the obvious cases in <10ms at near-zero compute cost. The ML model is only needed for the remaining 30-40%. Without tier-1, every request goes through ML, which is 10-50x more expensive and slower. In production the 3-tier architecture (rules -> ML -> human) is the standard, not a compromise.


10. Interview Questions

Question: Design a recommendation system

❌ "We need collaborative filtering + a content-based model"

✅ "Multi-stage pipeline: (1) Candidate generation via FAISS/ANN -- 1M items -> 1000 in 10-20ms, (2) Scoring with a light model (logistic regression) -- 1000 -> 100 in 20-30ms, (3) Ranking with a heavy model (DCN/DIN) -- 100 -> 10 in 30-50ms, (4) Re-ranking with business rules (diversity, freshness). Total latency budget: 100ms. Feature store for train/serve consistency. Cold start: content-based fallback + exploration (epsilon-greedy)."

Question: How do you serve LLMs at scale?

❌ "Put vLLM behind a load balancer"

✅ "5-layer stack: (1) Semantic cache (Redis + vector similarity, 50-80% hit rate, 70-90% cost savings), (2) Model router (simple queries -> 7B quantized, complex -> GPT-4), (3) SGLang/vLLM instances + EAGLE-3 speculative decoding (2x speedup), (4) AWQ 4-bit quantization (75% memory savings), (5) monitoring: latency P95, cost per query, hallucination rate. Formula: GPU_count = peak_QPS / (batch_size / avg_latency)."

Question: Design a ChatGPT-like system

❌ "Frontend, backend, LLM API -- three components"

✅ "Frontend -> API Gateway (auth + rate limiter: 10 req/min free, 60 req/min paid) -> Model Router (complexity classifier: simple -> 7B, reasoning -> o1, code -> Claude) -> SGLang cluster (EAGLE-3, RadixAttention for multi-turn) -> Streaming response (SSE). Caching: semantic cache before the router + prefix cache in SGLang. Storage: conversation history in PostgreSQL, embeddings in Qdrant. Monitoring: Langfuse for tracing, Arize for RAG eval."

Question: Design a fraud detection system

❌ "An ML model that classifies transactions"

✅ "3-tier architecture: Tier-1 rules (<10ms, 60-70% of cases: amount > $10K, velocity checks, blocklist), Tier-2 ML model (50-100ms, gradient boosted trees on 100+ features: user history, device fingerprint, graph features), Tier-3 human review (async, edge cases with uncertainty 0.3-0.7). Lambda architecture: real-time scoring + batch feature engineering. Feature store for consistency. Monitoring: concept drift detection, false positive rate tracking."

Question: RAG vs Long Context -- when to use which?

❌ "Long context is better because it is simpler"

✅ "The cost ratio is roughly 100:1 in favor of RAG. RAG: cheaper, fresh data (instant updates), but depends on retrieval quality. Long Context: simpler, but 100x more expensive, requires reprocessing for new data, and suffers from 'lost in the middle'. Production best practice: RAG for retrieval (embed -> HNSW -> top-K -> rerank) + long context for reasoning (process the retrieved chunks). The hybrid cuts cost by ~90% while preserving quality."


11. Formulas Quick Reference

QPS

\[\text{QPS} = \frac{\text{DAU} \times \text{Requests per User}}{86400}\]

Storage

\[\text{Storage}_{\text{total}} = \text{Data} \times \text{Replication Factor} \times (1 + \text{Growth})\]

Availability

\[\text{Availability} = \frac{\text{Uptime}}{\text{Total Time}} \times 100\%\]

99.9% = 8.76 hours downtime/year

Cache Hit Rate

\[\text{Hit Rate} = \frac{\text{Cache Hits}}{\text{Total Requests}}\]

Cost per Query

\[\text{Cost} = \frac{\text{Input} \times \text{Price}_{in} + \text{Output} \times \text{Price}_{out}}{1,000,000}\]

UCB (Exploration)

\[\text{UCB}(a) = \bar{X}_a + \sqrt{\frac{2 \ln N}{n_a}}\]

12. Sources Synthesized

  1. meta-ml-interview-2025.md — System design patterns, capacity planning
  2. deepmind-interview-2025-2026.md — Research-focused system design
  3. openai-anthropic-interviews-2025-2026.md — Company-specific patterns
  4. rag-system-design-2025-2026.md — RAG architecture
  5. llm-observability-2025.md — Monitoring, metrics
  6. llmops-vs-mlops-2025-2026.md — Operational patterns
  7. edge-ml-on-device-2025.md — Edge deployment
  8. llm-security-2025.md — OWASP, defense patterns
  9. advanced-rag-patterns-2025.md — RAG evolution
  10. inference-engines-comparison-2025-2026.md — Serving patterns

Extended coverage in ml-practice-prep/ml-system-design/materials.md (sections 11-20):
- Section 11: Multi-Armed Bandits (UCB, Thompson Sampling, LinUCB)
- Section 12: Online Learning / Streaming ML (FTRL-Proximal, Concept Drift, Hoeffding Trees)
- Section 13: Multi-Stage Recommender Systems (Two-Tower, ANN, MMR, DIN/DCN)
- Section 14: Causal Inference (ATE/ATT/CATE, Propensity Matching, DiD, IV, Uplift)
- Section 15: Vector Databases (HNSW, IVF, Pinecone/Milvus/Qdrant/pgvector, Hybrid Search, RRF)
- Section 16: Cost Optimization (GPU utilization, spot instances, model right-sizing, semantic caching, auto-scaling)
- Section 17: Multi-Model Serving (weighted round-robin, cascade, confidence-based, latency-based, circuit breaker, A/B testing)
- Section 18: Data Quality for ML (Great Expectations, TFDV, schema evolution, data lineage tracking, 7 quality dimensions)
- Section 19: Foundation Models in Production (multi-tenant K8s, prompt caching economics, multi-layer caching, fallback chains)
- Section 20: AI Agents in Production (OWASP LLM Top 10, defence-in-depth 6 layers, HITL patterns, LLM-as-Judge, OTel GenAI conventions)