
Cheat Sheet: ML System Design

~9 min read

Read first: Master prep guide | LLM inference cheat sheet

Type: synthesis / interview cheat sheet. Date: February 2026. Synthesis of: Meta/DeepMind interviews, RAG, Observability, LLMOps, Edge ML, Security

ML System Design is the most common interview type at Meta, Google, DeepMind, and startups (60-70% of technical rounds). You are expected not only to know ML models but to design production systems: capacity planning (QPS = DAU × Requests / 86400), latency budgets (200ms total = 50ms model + 30ms features + 20ms network...), database selection, caching (semantic cache: 50-80% hit rate, 70-90% cost savings), and multi-tier architecture (candidate generation -> scoring -> ranking). This cheat sheet covers the RESHADED framework, 8 production patterns, and formulas for capacity estimation.


Quick Reference: Key Numbers

| Metric | Value | Context |
|--------|-------|---------|
| Model serving latency | 50-500ms | Typical LLM generation |
| Cache hit rate (semantic) | 50-80% | Production systems |
| Cost reduction (caching) | 70-90% | Semantic caching |
| QPS formula | DAU × Requests / 86400 | Capacity planning |
| Availability target | 99.9% | 8.76 hours downtime/year |
| Inference GPU memory | 2-4× model size | KV cache + activations |

1. System Design Interview Framework

RESHADED Framework

| Letter | Aspect | Questions |
|--------|--------|-----------|
| R | Requirements | Functional, non-functional, scale |
| E | Estimation | QPS, storage, latency budget |
| S | Storage | Database choice, schema, caching |
| H | High-level | Architecture diagram, components |
| A | APIs | Endpoints, contracts |
| D | Deep dive | 2-3 critical components |
| E | Edge cases | Failures, scaling, security |
| D | Discussion | Trade-offs, alternatives |

Time Allocation (45 min)

Requirements & Clarification:    5 min
High-Level Design:              10 min
Component Deep Dive:            20 min
Scalability & Trade-offs:       10 min

2. Capacity Planning Formulas

QPS Calculation

\[\text{QPS} = \frac{\text{DAU} \times \text{Requests per User}}{86400}\]

Example:
- 10M DAU
- 10 requests/user/day
- QPS = 10M × 10 / 86400 = 1,157 QPS
- Peak QPS = 1,157 × 3 = ~3,500 QPS
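
A minimal Python sketch of this arithmetic (the 3x peak factor and 200ms average latency are assumptions used throughout this cheat sheet, not universal constants); the concurrency line is just Little's law:

def capacity_estimate(dau: int, requests_per_user: float,
                      peak_factor: float = 3.0, avg_latency_s: float = 0.2):
    """Back-of-envelope capacity numbers for an interview whiteboard."""
    avg_qps = dau * requests_per_user / 86_400    # seconds per day
    peak_qps = avg_qps * peak_factor              # rush-hour multiplier
    concurrent = peak_qps * avg_latency_s         # Little's law: L = lambda * W
    return avg_qps, peak_qps, concurrent

avg_qps, peak_qps, concurrent = capacity_estimate(10_000_000, 10)
print(f"{avg_qps:.0f} avg QPS, {peak_qps:.0f} peak QPS, {concurrent:.0f} in-flight requests")
# -> 1157 avg QPS, 3472 peak QPS, 694 in-flight requests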

Storage Estimation

\[\text{Storage} = \text{Data per User} \times \text{Users} \times \text{Growth Factor}\]
\[\text{Storage}_{\text{total}} = \text{Active Data} + \text{Archive} \times \text{Replication}\]

Memory for Caching

\[\text{Cache Size} = \text{Working Set} \times (1 + \text{Overhead})\]

Typical overhead: 20-30% for Redis
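
As a rough illustration (the 10M-responses-at-~2KB working set below is an invented example, not a measured number):

def cache_size_gb(working_set_gb: float, overhead: float = 0.25) -> float:
    """Provisioned cache = working set plus metadata/fragmentation overhead."""
    return working_set_gb * (1 + overhead)

# e.g. 10M cached responses x ~2 KB each ≈ 20 GB working set
print(cache_size_gb(20.0))  # -> 25.0 GB with 25% Redis overhead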

Latency Budget

Total Latency Budget: 200ms

Breakdown:
- Network:          20-50ms
- Authentication:    5-10ms
- Feature fetch:    10-30ms
- Model inference:  50-100ms
- Post-processing:   5-10ms

3. ML System Design Patterns

Pattern 1: Recommendation System

graph LR
    A["Candidate Generation<br/>FAISS/ANN<br/>millions -> 1000"] --> B["Scoring Model<br/>Light NN<br/>1000 -> 100"]
    B --> C["Ranking Model<br/>Heavy Transformer<br/>100 -> 10"]
    C --> D["Re-ranking<br/>Business rules<br/>10 -> Final"]

    style A fill:#e8eaf6,stroke:#3f51b5
    style B fill:#fff3e0,stroke:#ef6c00
    style C fill:#fce4ec,stroke:#c62828
    style D fill:#e8f5e9,stroke:#4caf50

Stages:
1. Candidate Generation: FAISS, ANN, collaborative filtering (1M → 1000)
2. Scoring: light model (logistic regression, small NN) (1000 → 100)
3. Ranking: heavy model (transformer, deep NN) (100 → 10)
4. Re-ranking: business rules, diversity (10 → final)

Latency Budget:
- Candidate Gen: 10-20ms
- Scoring: 20-30ms
- Ranking: 30-50ms
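
A skeleton of the funnel in Python; ann_index, user_embedding, light_model, heavy_ranker and apply_business_rules are hypothetical components standing in for FAISS, the scoring model, and so on:

def recommend(user_id: str, k: int = 10) -> list[str]:
    """Each stage sees fewer items but spends more compute per item."""
    candidates = ann_index.search(user_embedding(user_id), top_k=1000)    # ~1M -> 1000, 10-20ms
    scored = light_model.score(user_id, candidates)[:100]                 # 1000 -> 100, 20-30ms
    ranked = heavy_ranker.rank(user_id, scored)[:50]                      # 100 -> top slice, 30-50ms
    final = apply_business_rules(ranked, diversity=True, freshness=True)  # re-ranking
    return final[:k]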

Pattern 2: Search Ranking

Query → Query Understanding → Retrieval → Ranking → Re-ranking → Results
              │                   │           │           │
              ▼                   ▼           ▼           ▼
         Spell check          BM25/       Learning-   Diversity,
         Synonyms            Vector DB    to-Rank     freshness

Retrieval Methods:

| Method | Recall | Latency | Use Case |
|--------|--------|---------|----------|
| BM25 | Medium | <5ms | Keyword matching |
| Vector Search | High | 10-30ms | Semantic similarity |
| Hybrid | Highest | 30-50ms | Best of both |
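
One common way to build the hybrid row is reciprocal rank fusion over the BM25 and vector result lists; a minimal sketch (bm25_results and vector_results are assumed to be ranked lists of doc ids from each retriever):

from collections import defaultdict

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked doc-id lists into one hybrid ranking."""
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)   # better rank -> larger contribution
    return sorted(scores, key=scores.get, reverse=True)

# hypothetical inputs from the two retrievers
hybrid = reciprocal_rank_fusion([bm25_results, vector_results])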

Pattern 3: Fraud Detection

graph TD
    A["Fraud Detection System"] --> B["Real-time Layer<br/>Online Scoring"]
    A --> C["Batch Layer<br/>Pattern Detection"]
    B --> D["Rule Engine<br/>instant"]
    B --> E["ML Model<br/>less than 50ms"]
    B --> F["Anomaly Detection"]
    C --> G["Feature Engineering"]
    C --> H["Model Retraining"]
    C --> I["Graph Analysis"]
    D --> J["Human Review<br/>edge cases"]
    E --> J
    F --> J

    style A fill:#f3e5f5,stroke:#9c27b0
    style B fill:#fce4ec,stroke:#c62828
    style C fill:#e8eaf6,stroke:#3f51b5
    style D fill:#e8f5e9,stroke:#4caf50
    style E fill:#e8f5e9,stroke:#4caf50
    style F fill:#e8f5e9,stroke:#4caf50
    style G fill:#fff3e0,stroke:#ef6c00
    style H fill:#fff3e0,stroke:#ef6c00
    style I fill:#fff3e0,stroke:#ef6c00
    style J fill:#f3e5f5,stroke:#9c27b0

Tier-based Approach:
1. Tier 1: Rule-based (instant, high precision)
2. Tier 2: ML model (50-100ms, balanced)
3. Tier 3: Human review (slow, high value)
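
A sketch of the tier routing; blocklist, is_trusted_pattern, build_features, fraud_model and review_queue are hypothetical, and the 0.3-0.7 uncertainty band comes from the interview answers later in this sheet:

def assess_transaction(txn: dict) -> str:
    """Rules decide the obvious cases, ML the rest, humans the ambiguous band."""
    # Tier 1: deterministic rules, <10ms, near-zero compute cost
    if txn["amount"] > 10_000 or txn["card_id"] in blocklist:
        return "BLOCK"
    if is_trusted_pattern(txn):      # e.g. recurring small payment to a known merchant
        return "APPROVE"

    # Tier 2: ML score on the remaining traffic, 50-100ms
    p_fraud = fraud_model.predict_proba(build_features(txn))
    if p_fraud > 0.7:
        return "BLOCK"
    if p_fraud < 0.3:
        return "APPROVE"

    # Tier 3: ambiguous scores go to asynchronous human review
    review_queue.enqueue(txn, score=p_fraud)
    return "PENDING_REVIEW"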

Pattern 4: Ad Click Prediction

graph LR
    A["User + Context"] --> B["Feature Engineering"]
    B --> C["Model"]
    C --> D["Bid Decision"]
    B --> E["User features"]
    B --> F["Ad features"]
    B --> G["Context"]
    B --> H["Cross features"]

    style A fill:#e8eaf6,stroke:#3f51b5
    style B fill:#fff3e0,stroke:#ef6c00
    style C fill:#e8f5e9,stroke:#4caf50
    style D fill:#f3e5f5,stroke:#9c27b0
    style E fill:#e8eaf6,stroke:#3f51b5
    style F fill:#e8eaf6,stroke:#3f51b5
    style G fill:#e8eaf6,stroke:#3f51b5
    style H fill:#e8eaf6,stroke:#3f51b5

Exploration: Multi-armed bandit (ε-greedy, UCB)

\[\text{UCB}(a) = \bar{X}_a + \sqrt{\frac{2 \ln N}{n_a}}\]
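
A self-contained UCB1 implementation matching the formula above (the running mean reward per arm, total pulls N, and per-arm pulls n_a):

import math

class UCB1:
    """Pick the arm with the highest mean reward plus exploration bonus."""
    def __init__(self, n_arms: int):
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms       # running mean reward per arm

    def select(self) -> int:
        for arm, count in enumerate(self.counts):
            if count == 0:                 # play every arm once before trusting the bonus
                return arm
        total = sum(self.counts)
        ucb = [mean + math.sqrt(2 * math.log(total) / count)
               for mean, count in zip(self.values, self.counts)]
        return ucb.index(max(ucb))

    def update(self, arm: int, reward: float) -> None:
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]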

Pattern 5: Content Moderation

Content → Tier 1 (Keywords) → Tier 2 (ML) → Tier 3 (Human)
              │                    │              │
              ▼                    ▼              ▼
          Regex              CNN/Transformer   Review queue
          Hash matching      Multimodal       Escalation

Latency Budget:
- Tier 1: <10ms (block obvious violations)
- Tier 2: 100-500ms (nuanced classification)
- Tier 3: Hours-days (edge cases)


4. LLM-Specific Patterns

Pattern 6: LLM Serving at Scale

graph TD
    LB["Load Balancer"] --> V1["vLLM/SGLang<br/>+ EAGLE-3"]
    LB --> V2["vLLM/SGLang<br/>+ EAGLE-3"]
    LB --> V3["vLLM/SGLang<br/>+ EAGLE-3"]
    V1 --> RC["Redis Cache<br/>Semantic"]
    V2 --> RC
    V3 --> RC

    style LB fill:#f3e5f5,stroke:#9c27b0
    style V1 fill:#e8eaf6,stroke:#3f51b5
    style V2 fill:#e8eaf6,stroke:#3f51b5
    style V3 fill:#e8eaf6,stroke:#3f51b5
    style RC fill:#e8f5e9,stroke:#4caf50

Cost Optimization Stack:

| Layer | Technique | Savings |
|-------|-----------|---------|
| Request | Semantic cache | 50-80% |
| Model | Model routing | 30-50% |
| Inference | Speculative decoding | 2-3× faster |
| Memory | Quantization (4-bit) | 75% memory |
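
A sketch of the request-level semantic cache; embed, vector_store and llm are hypothetical interfaces, and the 0.92 similarity threshold is an assumption you would tune against the false-hit rate:

def cached_completion(query: str, threshold: float = 0.92) -> str:
    """Serve a cached answer when a semantically similar query was already handled."""
    q_vec = embed(query)                            # small embedding model
    hit = vector_store.nearest(q_vec, top_k=1)      # ANN lookup over cached queries
    if hit is not None and hit.similarity >= threshold:
        return hit.cached_response                  # cache hit: no GPU time spent
    response = llm.generate(query)                  # cache miss: pay for generation
    vector_store.add(q_vec, cached_response=response)
    return response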

Pattern 7: RAG at Scale

graph LR
    A["Query"] --> B["Embedding"]
    B --> C["Vector DB<br/>HNSW Index<br/>10M+ docs"]
    C --> D["Top-K"]
    D --> E["Cross-Encoder<br/>Reranker"]
    E --> F["LLM"]
    F --> G["Response"]

    style A fill:#e8eaf6,stroke:#3f51b5
    style B fill:#e8f5e9,stroke:#4caf50
    style C fill:#fff3e0,stroke:#ef6c00
    style D fill:#fff3e0,stroke:#ef6c00
    style E fill:#f3e5f5,stroke:#9c27b0
    style F fill:#fce4ec,stroke:#c62828
    style G fill:#e8f5e9,stroke:#4caf50

Latency Breakdown:

| Stage | Time | Optimization |
|-------|------|--------------|
| Embedding | 5-10ms | Batch, smaller model |
| Vector search | 5-20ms | HNSW, smaller dimension |
| Reranking | 20-50ms | Skip for simple queries |
| LLM generation | 50-500ms | Speculative, quantization |
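
The stages above as a sketch; embedder, vector_db, needs_rerank, cross_encoder, build_prompt and llm are placeholders for whatever components the design uses:

def rag_answer(query: str, top_k: int = 20, final_k: int = 5) -> str:
    """Embed -> ANN search -> optional rerank -> generate, matching the budget above."""
    q_vec = embedder.encode(query)                          # 5-10ms
    docs = vector_db.search(q_vec, top_k=top_k)             # 5-20ms over an HNSW index
    if needs_rerank(query):                                 # skip the cross-encoder for simple queries
        docs = cross_encoder.rerank(query, docs)[:final_k]  # 20-50ms
    else:
        docs = docs[:final_k]
    return llm.generate(build_prompt(query, docs))          # 50-500ms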

Pattern 8: Multi-Model Routing

graph LR
    R["Router Model"] --> A["Complexity<br/>GPT-4o-mini / 8B"]
    R --> B["Reasoning<br/>o1 / Claude Sonnet"]
    R --> C["Code<br/>GPT-4 / Claude"]
    R --> D["Simple<br/>Local 7B quantized"]

    style R fill:#f3e5f5,stroke:#9c27b0
    style A fill:#e8f5e9,stroke:#4caf50
    style B fill:#fce4ec,stroke:#c62828
    style C fill:#fff3e0,stroke:#ef6c00
    style D fill:#e8eaf6,stroke:#3f51b5

Router Decision:

\[\text{Model}(q) = \arg\min_m \{\, \text{Cost}(m) : \text{Quality}(m, q) > \theta \,\}\]
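
In code, that is "the cheapest model whose expected quality clears the bar"; quality_estimate is a hypothetical router/classifier score for a query-model pair:

def route(query: str, models: list[dict], theta: float = 0.8) -> str:
    """Return the cheapest model whose predicted quality on this query exceeds theta."""
    eligible = [m for m in models if quality_estimate(m["name"], query) > theta]
    if not eligible:
        # nothing clears the bar: fall back to the highest-quality model
        return max(models, key=lambda m: quality_estimate(m["name"], query))["name"]
    return min(eligible, key=lambda m: m["cost_per_1k_tokens"])["name"]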


5. Database Selection

Decision Matrix

| Data Type | Read-heavy | Write-heavy | Real-time | Choice |
|-----------|------------|-------------|-----------|--------|
| Structured | ✓ | - | - | PostgreSQL |
| Time-series | - | ✓ | ✓ | TimescaleDB |
| Key-value | ✓ | ✓ | ✓ | Redis |
| Vector | ✓ | - | - | Qdrant/Milvus |
| Graph | ✓ | - | - | Neo4j |
| Document | ✓ | ✓ | - | MongoDB |

Vector Database Comparison

| DB | Scale | Latency | Best For |
|----|-------|---------|----------|
| Milvus | 1B+ | <5ms | Enterprise |
| Qdrant | 100M+ | <10ms | Self-hosted |
| Pinecone | 10M+ | 5-20ms | Managed |
| pgvector | 10M | 20-100ms | Existing Postgres |

6. Observability & Monitoring

Metrics to Track

| Category | Metrics |
|----------|---------|
| Performance | Latency (P50, P95, P99), throughput |
| Quality | Accuracy, hallucination rate, relevance |
| Cost | Tokens, GPU hours, API calls |
| System | CPU, memory, GPU utilization |

Alert Thresholds

| Metric | Warning | Critical |
|--------|---------|----------|
| Latency P95 | >1s | >3s |
| Error rate | >1% | >5% |
| Cost spike | >2x daily | >5x daily |
| Hallucination | >5% | >10% |

Tools Comparison

| Tool | Type | Focus |
|------|------|-------|
| Langfuse | Open source | Tracing + cost |
| Arize Phoenix | Open source | RAG evaluation |
| Helicone | Proxy-based | API monitoring |
| DeepEval | Framework | Metrics |

7. Security Patterns

OWASP LLM Top 10 (2025)

| Rank | Risk | Mitigation |
|------|------|------------|
| LLM01 | Prompt Injection | Input filtering, intent classification |
| LLM02 | Sensitive Disclosure | PII detection, output filtering |
| LLM03 | Supply Chain | Verify model provenance |
| LLM04 | Model Poisoning | Data validation, monitoring |
| LLM05 | Output Handling | Sanitize, validate outputs |

Multi-Layer Defense

graph TD
    A["Layer 1: Input Filtering<br/>regex + ML"] --> B["Layer 2: Intent Classification<br/>safe / unsafe / ambiguous"]
    B --> C["Layer 3: Model Inference<br/>with guardrails"]
    C --> D["Layer 4: Output Filtering<br/>PII, harmful content"]
    D --> E["Layer 5: Anomaly Detection<br/>behavior monitoring"]

    style A fill:#e8f5e9,stroke:#4caf50
    style B fill:#e8eaf6,stroke:#3f51b5
    style C fill:#fff3e0,stroke:#ef6c00
    style D fill:#fce4ec,stroke:#c62828
    style E fill:#f3e5f5,stroke:#9c27b0

Defense Effectiveness

\[\text{Effectiveness} = 1 - \frac{\text{ASR}_{\text{defended}}}{\text{ASR}_{\text{baseline}}}\]

Good target: >90% effectiveness


8. LLMOps vs MLOps

Key Differences

| Aspect | MLOps | LLMOps |
|--------|-------|--------|
| Primary metric | Accuracy | Quality + latency + cost |
| Evaluation | Test set | LLM-as-judge + human |
| Deployment | Model versioning | Model + prompt + adapter |
| Monitoring | Data drift | Hallucination, relevance |
| Iteration | Weeks | Hours-days |

Prompt Management

Prompt Registry
├── prompt_versions/
│   ├── v1.0.0.yaml
│   ├── v1.1.0.yaml
│   └── v2.0.0.yaml
├── experiments/
│   ├── A_test_a/
│   └── A_test_b/
└── production/
    └── current -> v1.1.0.yaml

Evaluation Pipeline

def evaluate_prompt(prompt_version, test_set, llm, judge):
    """Run one prompt version against a fixed test set and aggregate quality/cost metrics.

    llm, judge and aggregate stand in for the serving client, the LLM-as-judge
    evaluator and a metrics aggregation helper."""
    results = []
    for test_case in test_set:
        response = llm.generate(prompt_version, test_case.input)

        # LLM-as-judge: score the response against the expected answer
        quality = judge.evaluate(response, test_case.expected)

        # Per-case metrics for later aggregation (mean quality, P95 latency, token cost)
        results.append({
            'input': test_case.input,
            'response': response,
            'quality': quality,
            'latency': response.latency,
            'tokens': response.tokens,
        })

    return aggregate(results)

9. Edge / On-Device ML

Memory Constraints

| Device | RAM | Max Model (4-bit) |
|--------|-----|-------------------|
| iPhone 15 Pro | 8GB | ~3B params |
| Pixel 9 Pro | 16GB | ~7B params |
| Jetson Orin AGX | 64GB | ~32B params |

Quantization for Edge

| Method | Quality (INT4) | Best For |
|--------|----------------|----------|
| GGUF Q4_K_M | ~92% | Mobile CPU |
| AWQ | ~98% | Edge GPU |
| GPTQ | ~97% | NVIDIA |

Model Size Formula

\[\text{Size}_{GB} = \frac{\text{Params} \times \text{Bits}}{8 \times 10^9}\]

Example: 7B @ 4-bit = 7B × 4 / 8 = 3.5 GB
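
The same arithmetic against the device table above (weights only; KV cache and activations add the 2-4x headroom from the Quick Reference):

def model_size_gb(params_billion: float, bits: int) -> float:
    """Weights-only footprint in GB."""
    return params_billion * 1e9 * bits / 8 / 1e9

print(model_size_gb(7, 4))    # -> 3.5 GB: fits an 8 GB phone with room for KV cache
print(model_size_gb(32, 4))   # -> 16.0 GB: Jetson-class memory needed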


Common Misconceptions

Misconception: 'QPS = DAU x Requests / 86400 is enough for capacity planning'

This is the average QPS. Peak QPS is typically 3-5x higher (lunch hour, evening). For LLM serving you also have to account for: (1) generation latency of 50-500ms per request, (2) concurrent requests = QPS x avg_latency, (3) GPU memory per request (KV cache). The practical formula: GPU_count = peak_QPS x avg_latency / batch_size. At 3,500 peak QPS, 200ms latency and batch=32 you need ~22 GPU instances.
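
The same estimate in a couple of lines:

import math

def gpu_instances(peak_qps: float, avg_latency_s: float, batch_size: int) -> int:
    """Instances needed so that peak in-flight requests fit into per-instance batches."""
    concurrent = peak_qps * avg_latency_s          # requests in flight at peak
    return math.ceil(concurrent / batch_size)

print(gpu_instances(3500, 0.2, 32))  # -> 22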

Misconception: 'A recommendation system is just candidate generation + ranking'

This skips critical components: (1) a feature store for consistency between training and serving, (2) cold-start handling for new users/items (content-based fallback), (3) diversity/exploration (epsilon-greedy or Thompson Sampling), (4) a feedback loop for retraining. Without them the system quickly degrades due to popularity bias.

Misconception: 'Tier-1 rules in fraud detection are primitive'

Rules handle 60-70% of the obvious cases in <10ms at near-zero compute cost. The ML model is only needed for the remaining 30-40%. Without tier-1, every request goes through ML, which is 10-50x more expensive and slower. In production the 3-tier architecture (rules -> ML -> human) is the standard, not a compromise.


10. Interview Questions

Question: Design a recommendation system

❌ "We need collaborative filtering + a content-based model"

✅ "Multi-stage pipeline: (1) Candidate generation via FAISS/ANN -- 1M items -> 1000 in 10-20ms, (2) Scoring with a light model (logistic regression) -- 1000 -> 100 in 20-30ms, (3) Ranking with a heavy model (DCN/DIN) -- 100 -> 10 in 30-50ms, (4) Re-ranking with business rules (diversity, freshness). Total latency budget: 100ms. Feature store for train/serve consistency. Cold start: content-based fallback + exploration (epsilon-greedy)."

Question: How do you serve LLMs at scale?

❌ "Put vLLM behind a load balancer"

✅ "5-layer stack: (1) Semantic cache (Redis + vector similarity, 50-80% hit rate, 70-90% cost savings), (2) Model router (simple queries -> 7B quantized, complex -> GPT-4), (3) SGLang/vLLM instances + EAGLE-3 speculative decoding (2x speedup), (4) AWQ 4-bit quantization (75% memory savings), (5) monitoring: latency P95, cost per query, hallucination rate. Formula: GPU_count = peak_QPS / (batch_size / avg_latency)."

Question: Design a ChatGPT-like system

❌ "Frontend, backend, LLM API -- three components"

✅ "Frontend -> API Gateway (auth + rate limiter: 10 req/min free, 60 req/min paid) -> Model Router (complexity classifier: simple -> 7B, reasoning -> o1, code -> Claude) -> SGLang cluster (EAGLE-3, RadixAttention for multi-turn) -> Streaming response (SSE). Caching: semantic cache before the router + prefix cache in SGLang. Storage: conversation history in PostgreSQL, embeddings in Qdrant. Monitoring: Langfuse for tracing, Arize for RAG eval."

Question: Design a fraud detection system

❌ "An ML model that classifies transactions"

✅ "3-tier architecture: Tier-1 rules (<10ms, 60-70% of cases: amount > $10K, velocity checks, blocklist), Tier-2 ML model (50-100ms, gradient boosted trees on 100+ features: user history, device fingerprint, graph features), Tier-3 human review (async, edge cases with uncertainty 0.3-0.7). Lambda architecture: real-time scoring + batch feature engineering. Feature store for consistency. Monitoring: concept drift detection, false positive rate tracking."

Question: RAG vs Long Context -- when to use which?

❌ "Long context is better because it is simpler"

✅ "The cost ratio is roughly 100:1 in favor of RAG. RAG: cheaper, fresh data (instant updates), but depends on retrieval quality. Long Context: simpler, but 100x more expensive, requires reprocessing for new data, and suffers from 'lost in the middle'. Production best practice: RAG for retrieval (embed -> HNSW -> top-K -> rerank) + long context for reasoning (process the retrieved chunks). The hybrid cuts cost by ~90% while preserving quality."


11. Formulas Quick Reference

QPS

\[\text{QPS} = \frac{\text{DAU} \times \text{Requests per User}}{86400}\]

Storage

\[\text{Storage}_{\text{total}} = \text{Data} \times \text{Replication Factor} \times (1 + \text{Growth})\]

Availability

\[\text{Availability} = \frac{\text{Uptime}}{\text{Total Time}} \times 100\%\]

99.9% = 8.76 hours downtime/year

Cache Hit Rate

\[\text{Hit Rate} = \frac{\text{Cache Hits}}{\text{Total Requests}}\]

Cost per Query

\[\text{Cost} = \frac{\text{Input} \times \text{Price}_{in} + \text{Output} \times \text{Price}_{out}}{1,000,000}\]

UCB (Exploration)

\[\text{UCB}(a) = \bar{X}_a + \sqrt{\frac{2 \ln N}{n_a}}\]

12. Sources Synthesized

  1. meta-ml-interview-2025.md — System design patterns, capacity planning
  2. deepmind-interview-2025-2026.md — Research-focused system design
  3. openai-anthropic-interviews-2025-2026.md — Company-specific patterns
  4. rag-system-design-2025-2026.md — RAG architecture
  5. llm-observability-2025.md — Monitoring, metrics
  6. llmops-vs-mlops-2025-2026.md — Operational patterns
  7. edge-ml-on-device-2025.md — Edge deployment
  8. llm-security-2025.md — OWASP, defense patterns
  9. advanced-rag-patterns-2025.md — RAG evolution
  10. inference-engines-comparison-2025-2026.md — Serving patterns

Extended coverage in ml-practice-prep/ml-system-design/materials.md (sections 11-20):
- Section 11: Multi-Armed Bandits (UCB, Thompson Sampling, LinUCB)
- Section 12: Online Learning / Streaming ML (FTRL-Proximal, Concept Drift, Hoeffding Trees)
- Section 13: Multi-Stage Recommender Systems (Two-Tower, ANN, MMR, DIN/DCN)
- Section 14: Causal Inference (ATE/ATT/CATE, Propensity Matching, DiD, IV, Uplift)
- Section 15: Vector Databases (HNSW, IVF, Pinecone/Milvus/Qdrant/pgvector, Hybrid Search, RRF)
- Section 16: Cost Optimization (GPU utilization, spot instances, model right-sizing, semantic caching, auto-scaling)
- Section 17: Multi-Model Serving (weighted round-robin, cascade, confidence-based, latency-based, circuit breaker, A/B testing)
- Section 18: Data Quality for ML (Great Expectations, TFDV, schema evolution, data lineage tracking, 7 quality dimensions)
- Section 19: Foundation Models in Production (multi-tenant K8s, prompt caching economics, multi-layer caching, fallback chains)
- Section 20: AI Agents in Production (OWASP LLM Top 10, defence-in-depth 6 layers, HITL patterns, LLM-as-Judge, OTel GenAI conventions)