ML System Design: Interview Preparation¶
ML System Design is not about algorithms but about engineering trade-offs: latency vs throughput, freshness vs cost, accuracy vs interpretability. This file collects 50+ questions across the key topics, from model serving to LLM production. Each question is labeled by level: Basic (Junior), Medium (Middle), Killer (Senior+). Format: Q/A with concrete numbers and formulas, not abstract reasoning.
Interview questions for core ML System Design problem areas. Levels: Basic, Medium, Killer. Updated: 2026-02-11
1. Model Serving & Latency¶
Basic¶
Q: Что такое latency P50, P99, P99.9?
A: P50 = median (50% of requests are faster), P99 = 99th percentile (what matters for SLAs), P99.9 = worst-case outliers.
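As a quick illustration, percentiles can be computed directly from logged latencies (the numbers below are made up):

```python
import numpy as np

# Made-up request latencies in milliseconds
latencies = np.array([12, 15, 14, 18, 22, 95, 16, 13, 17, 250])

p50, p99 = np.percentile(latencies, [50, 99])
print(f"P50={p50}ms, P99={p99}ms")  # the long tail dominates P99
```

Note how two slow outliers barely move P50 but pull P99 far above the typical request.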
Q: Why dynamic batching?
A: Groups incoming requests so the GPU is used efficiently; amortizes per-request overhead.
Medium¶
Q: How do you cut inference latency by 2x?
A: (1) Quantization (FP16/INT8), (2) Batching, (3) Model distillation, (4) ONNX optimization, (5) Caching, (6) Async processing.
Q: Online vs batch inference — what's the difference?
A: Online = real-time (<100ms), for user-facing paths. Batch = deferred, for analytics/offline scoring. Batch is cheaper; online requires low-latency optimization.
Killer¶
Q: Design an inference system for 100K RPS with P99 < 50ms.
A: (1) Model quantization INT8, (2) Request batching with max_wait=5ms, (3) GPU inference with TensorRT, (4) Load balancer + auto-scaling, (5) Request coalescing, (6) Regional endpoints, (7) Cache for popular queries.
2. A/B Testing¶
Basic¶
Q: Why do you need a minimum detectable effect (MDE)?
A: The smallest change you want to be able to detect. Drives sample size: smaller MDE → larger sample.
Q: What is a p-value?
A: The probability of results at least as extreme as observed under H0 (no difference). p < 0.05 is the usual threshold for statistical significance.
Medium¶
Q: Sample size formula for an A/B test?
A: n ≈ 16σ²/δ² per variant for 95% confidence, 80% power, where σ² = p(1−p) for a proportion and δ = MDE.
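A minimal sketch of this rule of thumb (the `ab_sample_size` helper is mine, not a library function):

```python
def ab_sample_size(baseline_rate, mde):
    """n per variant ≈ 16·σ²/δ² (α=0.05, power=0.80 rule of thumb)."""
    sigma_sq = baseline_rate * (1 - baseline_rate)
    return round(16 * sigma_sq / mde ** 2)

# 10% baseline conversion, detect an absolute +1pp lift
print(ab_sample_size(0.10, 0.01))  # → 14400 users per variant
```

Halving the MDE quadruples the required sample, which is why tiny effects are expensive to detect.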
Q: What is the multiple comparisons problem?
A: More tests → higher chance of a false positive. Fixes: Bonferroni correction (α/m), False Discovery Rate control.
Killer¶
Q: How do you run an A/B test for an ML model with network effects?
A: Network effects = user A's behavior influences user B. Options: (1) Cluster-based randomization, (2) Geo-based split, (3) Time-based A/B, (4) Counterfactual evaluation.
3. Drift Detection¶
Basic¶
Q: Types of drift?
A: Data drift (the distribution of X changes), concept drift (P(Y|X) changes), label drift (the distribution of Y changes).
Q: What is PSI?
A: Population Stability Index = a measure of distribution shift. PSI < 0.1 = OK, 0.1-0.25 = moderate, > 0.25 = significant.
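A sketch of a PSI computation over quantile bins (my own helper; the bin count and the 1e-6 clip are arbitrary choices):

```python
import numpy as np

def psi(baseline, current, n_bins=10):
    """Population Stability Index over quantile bins of the baseline."""
    edges = np.quantile(baseline, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf              # catch out-of-range values
    b = np.histogram(baseline, bins=edges)[0] / len(baseline)
    c = np.histogram(current, bins=edges)[0] / len(current)
    b = np.clip(b, 1e-6, None)                         # avoid log(0)
    c = np.clip(c, 1e-6, None)
    return float(np.sum((c - b) * np.log(c / b)))

rng = np.random.default_rng(0)
psi_same = psi(rng.normal(0, 1, 10_000), rng.normal(0, 1, 10_000))
psi_shift = psi(rng.normal(0, 1, 10_000), rng.normal(0.75, 1, 10_000))
print(psi_same, psi_shift)  # near 0 vs well above 0.25
```

A mean shift of 0.75σ lands firmly in the "significant shift" band, while two samples of the same distribution stay near zero.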
Medium¶
Q: PSI vs KS-test vs Wasserstein?
A: PSI = for binned distributions, easy to interpret. KS-test = for continuous features, gives statistical significance. Wasserstein = robust, has a geometric interpretation.
Q: How do you set up drift alerting?
A: (1) Define a baseline window, (2) Compute metrics hourly/daily, (3) Set thresholds (e.g., PSI > 0.25), (4) Monitor multiple features, (5) Correlate with business metrics.
Killer¶
Q: Drift detected. What do you do?
A: (1) Investigate the root cause (data pipeline, feature, business change), (2) Determine label drift vs feature drift, (3) Evaluate the model on new data, (4) Retrain if the accuracy drop exceeds the threshold, (5) Consider incremental learning, (6) Update monitoring thresholds.
4. Model Calibration¶
Basic¶
Q: What is a calibrated model?
A: Predicted probability = empirical frequency. If the model predicts 0.7, the event happens in 70% of such cases.
Q: When is calibration needed?
A: Whenever you need true probabilities: medical diagnosis, risk scoring, cost-sensitive decisions.
Medium¶
Q: Platt Scaling vs Isotonic Regression?
A: Platt = parametric (logistic), good for sigmoid-shaped miscalibration, 2 parameters. Isotonic = non-parametric, more flexible, but needs more data and risks overfitting.
Q: How do you evaluate calibration?
A: (1) Calibration curve (reliability diagram), (2) Brier score (lower = better), (3) Expected Calibration Error (ECE).
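A sketch of (2) and (3) on synthetic data, using sklearn's `calibration_curve` and `brier_score_loss`. Note the ECE here is a simple unweighted variant; the standard definition weights bins by sample count:

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(42)
p_true = rng.uniform(0.05, 0.95, 5_000)            # true event probabilities
y = (rng.uniform(size=5_000) < p_true).astype(int)

p_good = p_true                                    # perfectly calibrated scores
p_bad = np.clip(p_true ** 3, 0.01, 0.99)           # systematically too low

brier_good = brier_score_loss(y, p_good)
brier_bad = brier_score_loss(y, p_bad)

# Reliability curve + a simple (unweighted) ECE for the miscalibrated scores
frac_pos, mean_pred = calibration_curve(y, p_bad, n_bins=10)
ece = float(np.mean(np.abs(frac_pos - mean_pred)))
print(brier_good, brier_bad, ece)
```

The miscalibrated scores get a worse Brier score and a clearly non-zero ECE, even though both score sets rank examples identically.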
Killer¶
Q: The model is well calibrated but accuracy is low. What's going on?
A: Calibration != discrimination. A model can be calibrated yet useless (it just predicts baseline probabilities). Check both: calibration + ROC-AUC/accuracy.
5. Ranking Metrics¶
Basic¶
Q: What is NDCG?
A: Normalized Discounted Cumulative Gain. Accounts for both position and relevance: DCG sums relevance_i / log2(i + 1) over positions i. NDCG = DCG / ideal DCG.
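A minimal NDCG computation for a single ranked list (using rel_i / log2(i + 1), positions starting at 1):

```python
import numpy as np

def dcg(relevances):
    positions = np.arange(1, len(relevances) + 1)
    return float(np.sum(np.asarray(relevances) / np.log2(positions + 1)))

def ndcg(relevances):
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

print(ndcg([3, 2, 3, 0, 1]))  # imperfect order → below 1.0
print(ndcg([3, 3, 2, 1, 0]))  # ideal order → 1.0
```

Swapping a highly relevant item into a lower position reduces the score, but by less at deeper ranks thanks to the log discount.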
Q: Precision@k vs Recall@k?
A: P@k = relevant items in the top-k / k. R@k = relevant items in the top-k / total relevant items.
Medium¶
Q: MRR vs MAP?
A: MRR = Mean Reciprocal Rank, looks only at the first relevant item. MAP = Mean Average Precision, accounts for all relevant positions.
Q: When NDCG vs MAP?
A: NDCG = when you have graded relevance (0, 1, 2, 3, ...). MAP = binary relevance (0 or 1).
Killer¶
Q: How do you optimize ranking metrics during training?
A: (1) Listwise loss (LambdaLoss), (2) Pairwise loss (RankNet), (3) Approximate NDCG loss, (4) Learning to Rank frameworks (XGBoost ranker, TF-Ranking).
6. Recommendation Systems¶
Basic¶
Q: Two-Tower model?
A: Two networks: a user tower and an item tower, each producing an embedding. Similarity = dot product. Efficient for large catalogs.
Q: Cold start problem?
A: New users/items with no history. Solutions: content-based, popularity, exploration (bandits), cross-domain transfer.
Medium¶
Q: Collaborative Filtering vs Content-Based?
A: CF = based on the behavior of similar users/items. Content-based = based on item features. Hybrid = a combination.
Q: Matrix Factorization vs Two-Tower?
A: MF = linear decomposition, suffers from cold start. Two-Tower = non-linear, handles side features, scalable.
Killer¶
Q: Design a RecSys for 100M users, 10M items.
A: (1) Two-stage: retrieval (ANN, FAISS) + ranking (neural), (2) Real-time features via a feature store, (3) User/item embeddings refreshed in batch, (4) Cold start: content features + exploration, (5) A/B testing framework, (6) Real-time personalization via session features.
7. ML Trade-offs¶
Basic¶
Q: Accuracy vs latency trade-off?
A: Bigger models = higher accuracy but slower. Mitigations: model distillation, quantization, caching.
Q: Precision vs Recall?
A: Precision = TP/(TP+FP), Recall = TP/(TP+FN). The trade-off depends on the business cost of false positives vs false negatives.
Medium¶
Q: Online vs batch learning trade-offs?
A: Online = fresh model, but harder to debug and potentially unstable. Batch = stability, but a stale model.
Q: Interpretability vs performance?
A: Complex models (deep learning) vs interpretable ones (decision trees). Mitigation: SHAP, LIME to explain complex models.
Killer¶
Q: 15 trade-off scenarios — pick the right approach:
1. Medical diagnosis model: Interpretability > accuracy (regulatory)
2. Real-time bidding ad system: Latency > accuracy (budget constraints)
3. Fraud detection with rare events: Recall > precision (missing fraud is costly)
4. Content moderation: Precision > recall (false positives hurt UX)
5. Recommendations for new users: Exploration > exploitation
6. Feature selection for production: Simplicity > marginal gains
7. Model retraining frequency: Cost vs freshness
8. Ensemble vs single model: Maintenance vs accuracy
9. Custom loss vs standard: Complexity vs business alignment
10. GPU vs CPU inference: Cost vs latency
11. Real-time vs batch features: Freshness vs stability
12. Deep learning vs GBM: Data vs interpretability
13. Multi-task vs single-task: Shared knowledge vs task conflict
14. Online A/B vs offline eval: Confidence vs cost
15. Feature store vs direct queries: Latency vs freshness
8. LLM Production¶
Basic¶
Q: What is prompt injection?
A: An attack via user input that changes the LLM's behavior: "Ignore previous instructions and..."
Q: How do you defend against prompt injection?
A: (1) Input sanitization, (2) System prompt separation, (3) Output validation, (4) Guardrails, (5) Rate limiting.
Medium¶
Q: OWASP Top 10 for LLMs?
A: Prompt injection, Insecure output handling, Training data poisoning, Model DoS, Supply chain, Sensitive info disclosure, Insecure plugins, Excessive agency, Overreliance, Model theft.
Q: How do you set up guardrails?
A: Input guardrails (sanitization, PII detection), Output guardrails (format validation, content policy), Tool guardrails (permission checks).
Killer¶
Q: Design an LLM system with enterprise security.
A: (1) Input/output guardrails pipeline, (2) PII detection and redaction, (3) Audit logging, (4) Rate limiting per user, (5) Content policy enforcement, (6) Model access control, (7) Fallback models, (8) Human-in-the-loop for risky operations, (9) Red team testing schedule, (10) Incident response plan.
9. Feature Stores¶
Basic¶
Q: What is a feature store?
A: A centralized repository for ML features: (1) Storage — batch and real-time, (2) Serving — low-latency retrieval, (3) Registry — metadata, lineage, versioning, (4) Computation — transformation pipelines. Examples: Feast (OSS), Tecton (managed), Databricks Feature Store.
Q: Why do you need a feature store?
A: (1) Training-serving skew prevention — identical features at training and inference time, (2) Feature reuse — no recomputation for each model, (3) Point-in-time correctness — historical features without leakage, (4) Real-time serving — low latency for online inference.
Medium¶
Q: Feast vs Tecton — when to use which?
A:
| Feature | Feast (OSS) | Tecton (Managed) |
|---|---|---|
| Cost | Free | $$$ |
| Setup | Self-managed | Managed |
| Real-time | Redis integration | Built-in |
| Transformations | Limited | Rich (Spark, Pandas) |
| Monitoring | Basic | Advanced |
| Enterprise | DIY | Full support |
Feast: Startups, learning, budget constraints. Tecton: Enterprise, scale, team velocity > cost.
Q: What is a point-in-time join?
A: Problem: during training you must not use features from the future. A point-in-time join guarantees that each training example uses the feature values that existed at event time.
# Without PIT join: LEAKAGE!
features = feature_store.get_features(user_id)  # current features

# With PIT join: CORRECT
features = feature_store.get_features(
    entity=user_id,
    timestamp=event_timestamp,  # features as of event time
)
Q: Online vs Offline feature store?
A:
| Offline | Online |
|---|---|
| Batch computation | Real-time updates |
| S3, BigQuery, Delta | Redis, DynamoDB |
| Training | Inference |
| Low cost | Low latency |
| Historical data | Latest values only |
Architecture:
- Offline: Spark jobs → Parquet/Delta → Training
- Online: Stream → Redis → Inference (<10ms)
Killer¶
Q: Design a feature store for fraud detection.
A:
Requirements:
- 10M predictions/day
- <50ms latency
- Real-time features (last 5 min of transactions)
- Historical features (30-day aggregates)
Architecture:
Layer 1: Batch pipeline (daily)
- Spark ETL → aggregated features (30-day stats)
- Store in Delta Lake + sync to Redis
- Examples: avg_transaction_amount, distinct_merchants_30d
Layer 2: Stream pipeline (real-time)
- Kafka → Flink → Redis
- Windowed aggregations (5 min tumbling)
- Examples: tx_count_5m, velocity_score
Layer 3: Feature registry
- Metadata: name, type, owner, freshness SLA
- Lineage: source → transformation → feature
- Monitoring: staleness alerts, distribution drift
Layer 4: Serving API
- Feature server: gRPC endpoint
- Request: (user_id, merchant_id, timestamp)
- Response: feature vector (<10ms)
- Fallback: cached features if upstream fails
Cost: ~$15K/month (Spark cluster + Redis cluster + storage)
10. Recommendation Systems¶
Basic¶
Q: Collaborative Filtering vs Content-Based?
A:
- Collaborative Filtering: recommendations based on similar users/items. "People who bought X also bought Y". Matrix: User × Item interactions.
- Content-Based: recommendations based on item attributes. "Similar to what you watched". Features: genre, tags, description.
CF is better for discovery, CB for explainability. Hybrid = best of both.
Q: User-based vs Item-based CF?
A:
- User-based: find similar users → recommend their items. Problems: users change over time, scaling to millions.
- Item-based: find similar items → recommend them. More stable, and item similarity can be pre-computed.
Production systems are usually item-based (Amazon, Netflix).
Medium¶
Q: Matrix Factorization for RecSys?
A: Factorize the User-Item matrix R into user factors U and item factors V:
\[R \approx U V^T\]
SGD update for user vector \(u_u\) and item vector \(v_i\):
\[e_{ui} = r_{ui} - u_u \cdot v_i\]
\[u_u \leftarrow u_u + \eta (e_{ui} v_i - \lambda u_u)\]
Advantages: handles sparsity; latent factors capture preferences. Libraries: Implicit, LightFM, Surprise.
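A toy end-to-end sketch of these updates on a synthetic, fully observed low-rank matrix (illustrative only; real recommenders train on sparse observed cells, and the sizes and learning rate here are arbitrary):

```python
import numpy as np

# Toy MF-SGD: recover a synthetic fully observed rank-4 "ratings" matrix
rng = np.random.default_rng(0)
n_users, n_items, k = 30, 20, 4
R = rng.normal(size=(n_users, k)) @ rng.normal(size=(k, n_items))

U = rng.normal(scale=0.1, size=(n_users, k))
V = rng.normal(scale=0.1, size=(n_items, k))
eta, lam = 0.01, 0.01

def mse():
    return float(np.mean((R - U @ V.T) ** 2))

before = mse()
for _ in range(50):                       # epochs over all cells
    for u in range(n_users):
        for i in range(n_items):
            e = R[u, i] - U[u] @ V[i]     # e_ui = r_ui - u_u · v_i
            U[u] += eta * (e * V[i] - lam * U[u])
            V[i] += eta * (e * U[u] - lam * V[i])
after = mse()
print(f"MSE {before:.2f} -> {after:.4f}")  # reconstruction error drops
```

With regularization λ > 0 the factors stay bounded while the reconstruction error shrinks toward the noise floor.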
Q: Cold start problem — solutions?
A:
1. New user: content-based first, ask for preferences, popularity baseline
2. New item: content features, exploitation after K interactions
3. Hybrid: combine CF + CB, smooth transition
4. Bandits: explore-exploit for new items
5. Side information: demographics, context, metadata
Q: Two-Tower architecture for RecSys?
A:
Architecture:
- User tower: user features → MLP → user embedding
- Item tower: item features → MLP → item embedding
- Score: dot product or cosine similarity
Training:
- Loss: cross-entropy or BPR
- Negatives: in-batch or sampled
Advantages:
- Decoupled inference (pre-compute item embeddings)
- Scalable to millions of items
- ANN search for retrieval (FAISS, ScaNN)
Production: YouTube, Pinterest, and Amazon use two-tower models.
Killer¶
Q: Design a RecSys for e-commerce with 10M users, 1M items.
A:
Requirements: Real-time personalization, 100ms latency, 10% CTR improvement.
Architecture:
Stage 1: Retrieval (candidates)
- Two-tower model → ANN search (FAISS)
- Input: user features + context
- Output: 1000 candidates
- Latency: ~20ms
Stage 2: Ranking (scoring)
- Gradient Boosted Trees (LightGBM) or a deep ranking model
- Features: user, item, cross, context
- Output: top 100 scored items
- Latency: ~50ms
Stage 3: Re-ranking (business logic)
- Diversity (MMR)
- Freshness boost
- Business rules (promotions, inventory)
- Output: final 20 items
- Latency: ~5ms
Infrastructure:
- Feature store: Redis (online) + BigQuery (offline)
- ANN index: FAISS IVF-PQ, refreshed hourly
- Model serving: TensorFlow Serving + gRPC
Training pipeline:
- Data: click logs, purchases, impressions
- Features: user history, item embeddings, context
- Labels: click/purchase (weighted)
- Frequency: daily retraining
Metrics:
- Offline: NDCG@20, Recall@100, AUC
- Online: CTR, CVR, revenue per user
Q: How do you evaluate a RecSys offline vs online?
A:
Offline metrics:
- Ranking: NDCG, MAP, MRR
- Classification: AUC, Precision@K, Recall@K
- Coverage: % of items ever recommended
- Diversity: intra-list similarity
Problem: Offline ≠ Online performance (top-K mismatch, position bias)
Online metrics:
- CTR, conversion rate
- Revenue per session
- Engagement time
- A/B test vs baseline
Best practice: Offline screening → Online A/B → Ship if +business metric.
11. MLOps & CI/CD¶
Basic¶
Q: What is MLOps?
A: MLOps = DevOps principles applied to ML. Automation of the full lifecycle: data prep → training → deployment → monitoring. Goals: reproducibility, reliability, scalability.
Q: MLOps vs DevOps?
A: DevOps = code-centric; MLOps = data- and model-centric. MLOps adds: experiment tracking, model versioning, data validation, drift detection, continuous training.
Medium¶
Q: Key components of an MLOps pipeline?
A: (1) Data Collection & Validation, (2) Feature Engineering, (3) Model Training & Evaluation, (4) Model Registry & Versioning, (5) Model Deployment (batch/real-time), (6) Monitoring & Alerting, (7) Retraining triggers.
Q: What is CI/CD for ML?
A: CI (Continuous Integration): validate data, test code, validate models. CD (Continuous Deployment): deploy the model + infrastructure. CT (Continuous Training): automatically retrain on new data.
Killer¶
Q: Design an end-to-end MLOps pipeline for fraud detection.
A: (1) Feature store: Redis (real-time) + BigQuery (batch), (2) Training: daily Airflow job + drift-triggered retraining, (3) Model registry: MLflow with staging/production stages, (4) Deployment: Kubernetes + canary rollout, (5) Monitoring: Evidently for drift, Prometheus for latency, (6) Alerting: PagerDuty for accuracy drop > 2%, (7) Rollback: blue-green deployment.
12. Experiment Tracking¶
Basic¶
Q: Why experiment tracking?
A: (1) Reproducibility — any experiment can be repeated, (2) Comparison — metrics can be compared across runs, (3) Collaboration — the team sees all experiments, (4) Debugging — find what went wrong.
Q: MLflow vs W&B (Weights & Biases)?
A: MLflow = open-source, self-hosted, basic UI. W&B = SaaS, rich visualizations, team collaboration, but paid. MLflow for on-prem/privacy, W&B for development speed.
Medium¶
Q: What should you log in experiment tracking?
A:
- Parameters: learning_rate, batch_size, architecture
- Metrics: loss, accuracy (train/val), per epoch
- Artifacts: model weights, checkpoints, config files
- Code: git commit, branch, diff
- Data: dataset version (DVC), feature stats
- Environment: requirements.txt, Docker image
import mlflow

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.001)
    mlflow.log_param("batch_size", 64)
    for epoch in range(epochs):
        train_loss, val_loss = train_epoch(model, train_loader, val_loader)
        mlflow.log_metric("train_loss", train_loss, step=epoch)
        mlflow.log_metric("val_loss", val_loss, step=epoch)
    mlflow.log_artifact("model.pth")
    mlflow.sklearn.log_model(model, "model")
Q: Model Registry — why and how?
A: A central repository for models with lifecycle management:
- Versioning: model v1, v2, v3...
- Stages: None → Staging → Production → Archived
- Metadata: metrics, tags, lineage
- Transition approvals for production
Killer¶
Q: How do you run MLflow in production with high availability?
A: (1) Backend store: PostgreSQL (not SQLite!), (2) Artifact store: S3/MinIO with versioning, (3) Tracking server: load-balanced, (4) Authentication: basic auth or OIDC, (5) Model serving: MLflow Models + Docker/Kubernetes, (6) Backup: daily DB dump + artifact sync, (7) Monitoring: health checks on the tracking server.
13. Data Validation & Quality¶
Basic¶
Q: What is data validation in ML?
A: Checking that incoming data matches expectations: schema, ranges, completeness, uniqueness. Prevents "garbage in, garbage out".
Q: What is Great Expectations?
A: An open-source library for data validation. You define "expectations" (rules) and validate data batches against them. It generates docs with validation results.
Medium¶
Q: Types of data quality checks?
A:
- Schema validation: correct columns, types
- Completeness: null % < threshold
- Uniqueness: no duplicate IDs
- Range checks: age 0-120, price > 0
- Distribution checks: mean/std within bounds
- Referential integrity: foreign keys exist
import great_expectations as gx

# Define expectation suite
expectation_suite = gx.ExpectationSuite("data_quality")
expectation_suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeNotNull(column="user_id")
)
expectation_suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeBetween(
        column="age", min_value=0, max_value=120
    )
)
expectation_suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeUnique(column="email")
)

# Validate (`validator` is obtained from a gx data context; setup omitted)
results = validator.validate(expectation_suite)
Q: How do you validate data in a production pipeline?
A: (1) Define expectations on the training data, (2) Apply them at inference, (3) Alert on failures, (4) Quarantine bad data, (5) Trigger retraining if the distribution shifts.
Killer¶
Q: Data contract between Data Engineering and ML teams?
A: A formal agreement:
- Schema: column names, types, nullability
- Freshness: SLA on data arrival (e.g., data by 6am)
- Quality: max null % = 1%, uniqueness = 100%
- Volume: expected rows per day ± 10%
- Semantics: what each column means
- Ownership: who to contact about issues
- Change process: 7-day notice for schema changes
14. Model Monitoring & Observability¶
Basic¶
Q: What should you monitor in an ML model?
A: (1) Prediction metrics: accuracy, F1 (when ground truth is available), (2) Data drift: feature distributions, (3) Prediction drift: output distributions, (4) System metrics: latency, throughput, errors, (5) Business metrics: revenue, CTR.
Q: What is training-serving skew?
A: A mismatch between training and inference pipelines: different preprocessing, feature engineering, or library versions. Leads to silent accuracy degradation.
Medium¶
Q: Monitoring tools comparison?
A:
| Tool | Type | Best for |
|---|---|---|
| Evidently | OSS | Data drift, visual reports |
| NannyML | OSS | CBPE (no ground truth) |
| WhyLabs | SaaS | Enterprise observability |
| Prometheus + Grafana | OSS | System metrics |
| Datadog | SaaS | Full-stack observability |
Q: How do you monitor without ground truth?
A:
1. Confidence-based: monitor the prediction confidence distribution
2. CBPE (Confidence-Based Performance Estimation): the NannyML approach
3. Drift-only: alert on significant distribution shift
4. Proxy metrics: business KPIs (revenue, engagement)
5. Sampling: label a small % for delayed validation
Killer¶
Q: Design a monitoring system for 100K predictions/day.
A:
Architecture:
1. Log layer: async logging → Kafka → ClickHouse (predictions + features)
2. Compute layer: hourly batch jobs (drift metrics, aggregation)
3. Alert layer: Prometheus alerts → PagerDuty
4. Visualization: Grafana dashboards
Metrics tracked:
- Data drift: PSI per feature (alert > 0.25)
- Prediction distribution: KS-test vs baseline
- Confidence: mean, std, % low confidence
- Latency: P50, P99
- Volume: predictions/hour
Alerting strategy:
- Critical: accuracy drop > 5%, service down → immediate page
- Warning: drift detected, high latency → Slack
- Info: daily summary → email
Cost: ~$500/month (ClickHouse + Grafana + storage)
15. Distributed Training¶
Basic¶
Q: Data Parallel vs Model Parallel — what's the difference?
A:
- Data Parallel: the model is replicated on every GPU and the data is split. Simple, but each GPU holds the full model.
- Model Parallel: the model is split across layers or tensors; each GPU holds part of the model. More complex, but allows training models larger than one GPU's memory.
Q: What is DDP (DistributedDataParallel)?
A: PyTorch's data parallelism: each GPU holds a copy of the model, and gradients are synchronized via all-reduce. Efficient for models that fit in memory.
Medium¶
Q: ZeRO (Zero Redundancy Optimizer) — stages and memory savings?
A:
| Stage | What is sharded | Memory savings |
|---|---|---|
| ZeRO-1 | Optimizer states | ~4x |
| ZeRO-2 | Optimizer + gradients | ~8x |
| ZeRO-3 | Optimizer + gradients + parameters | ~N× (N = GPUs) |
Q: FSDP vs DeepSpeed — when to use which?
A:
| Criterion | FSDP | DeepSpeed |
|---|---|---|
| Memory | Excellent (90%) | Outstanding (95%) |
| Setup | Low | High |
| Ecosystem | PyTorch native | Microsoft |
| Features | Basic sharding | Pipeline parallel, MoE |
| Best for | Most transformers | >10B params |
Use FSDP when: PyTorch-only stack, balanced simplicity/performance.
Use DeepSpeed when: maximum memory efficiency, >10B params, dedicated ML infra team.
Q: How does gradient accumulation work?
A: It simulates a larger batch without increasing memory: run several forward/backward passes, accumulate the gradients, then take a single optimizer step.
Effective batch = actual_batch × accum_steps
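A numpy sketch of why this works: summing per-micro-batch gradients and normalizing once reproduces the full-batch gradient exactly (linear model, squared loss, toy data):

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(64, 3)), rng.normal(size=64)
w = np.zeros(3)

def grad_sum(Xb, yb, w):
    # summed (not averaged) squared-error gradient for a linear model
    return Xb.T @ (Xb @ w - yb)

g_full = grad_sum(X, y, w) / len(X)        # full-batch gradient (batch = 64)

accum = np.zeros(3)                        # 4 micro-batches of 16, one step
for Xb, yb in zip(np.split(X, 4), np.split(y, 4)):
    accum += grad_sum(Xb, yb, w)
g_accum = accum / len(X)

print(np.allclose(g_full, g_accum))  # → True
```

The key detail is accumulating *summed* gradients and dividing by the effective batch size once, so the step matches the large-batch update.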
Killer¶
Q: Design distributed training for a 70B LLM on 8×A100 80GB.
A:
Memory analysis:
- 70B params × 2 bytes (FP16) = 140GB parameters
- 70B params × 2 bytes = 140GB gradients
- 70B params × 12 bytes = 840GB optimizer states (Adam: m, v, FP32 master weights)
- Total: 1120GB → 140GB per GPU minimum
Solution stack:
1. ZeRO-3 + CPU offload: shards everything, offloads the optimizer to CPU
2. Activation checkpointing: 50% activation memory reduction
3. Mixed precision (BF16): 2x memory savings
4. Flash Attention: O(N) vs O(N²) attention memory
Configuration (DeepSpeed config):
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {"device": "cpu"},
    "offload_param": {"device": "cpu"},
    "overlap_comm": true
  },
  "activation_checkpointing": {
    "partition_activations": true
  }
}
Expected per-GPU memory: ~40GB (fits in an 80GB A100).
Q: Pipeline Parallelism — how does it work and when do you use it?
A:
Concept: split the model into sequential stages, one per GPU.
Micro-batching: several micro-batches flow through the pipeline at once, filling the "bubbles".
When to use:
- Models too large for data parallelism alone
- Combined with ZeRO (3D parallelism)
- Very deep models (depth > 100 layers)
Trade-offs:
- Pipeline bubbles reduce efficiency
- Complex to implement
- Best combined with tensor parallelism
16. Feature Stores¶
Basic¶
Q: What is a Feature Store?
A: A centralized store for ML features that provides:
- Consistency: the same features for training and serving
- Reusability: features shared across models
- Time-travel: point-in-time correct joins
- Freshness: real-time feature serving
Q: Offline vs Online Feature Store?
A:
| Store | Use case | Latency | Storage |
|---|---|---|---|
| Offline | Training | Minutes-hours | Parquet, Delta Lake |
| Online | Real-time inference | <10ms | Redis, DynamoDB |
Features materialize offline → online (batch or streaming).
Medium¶
Q: Point-in-time join — why is it needed?
A: It prevents data leakage during training.
Problem: a feature's value at serving time may differ from its value at training time.
Solution: join features by entity_id + timestamp, retrieving the value that existed at event time.
-- Point-in-time join
SELECT e.entity_id, e.event_time, f.feature_value
FROM events e
JOIN feature_history f
  ON e.entity_id = f.entity_id
 AND f.feature_time <= e.event_time
 AND f.feature_time > e.event_time - INTERVAL '1 day'
Q: Feast vs Tecton vs Hopsworks — comparison?
A:
| Criterion | Feast | Tecton | Hopsworks |
|---|---|---|---|
| Type | OSS | Managed SaaS | Hybrid |
| Real-time | Limited | Excellent | Excellent |
| Setup | Medium | Low | Medium |
| Cost | Free | $$$ | $$ |
| Best for | Startups | Enterprise | Mid-market |
Killer¶
Q: Design a feature store for 1000 features, 1M users, <10ms latency.
A:
Architecture:
Sources → Stream (Kafka) → Feature Computation → Online Store (Redis)
        ↘ Batch (Spark) → Offline Store (Delta Lake)
Key decisions:
1. Storage: Redis Cluster for online (sub-ms), Delta Lake for offline
2. Materialization: every 5 min for hot features, hourly for others
3. Feature groups: grouped by freshness requirements
4. Monitoring: feature freshness alerts, latency P99
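A sketch of the serving-side read path (a plain dict stands in for Redis; `get_online_features` is a hypothetical helper, not any specific library's API):

```python
# In-memory stand-in for the Redis online store: (entity, feature) → value
online_store = {
    ("user:42", "tx_count_5m"): 3,
    ("user:42", "avg_amount_30d"): 57.2,
}

def get_online_features(entity_id, feature_names, defaults=None):
    """Fetch a feature vector, falling back to defaults for missing values."""
    defaults = defaults or {}
    return {
        name: online_store.get((entity_id, name), defaults.get(name, 0.0))
        for name in feature_names
    }

features = get_online_features(
    "user:42",
    ["tx_count_5m", "avg_amount_30d", "distinct_merchants_30d"],
    defaults={"distinct_merchants_30d": 1},
)
print(features)  # the missing feature falls back to its default
```

The per-feature fallback mirrors the "serve cached/default features if upstream fails" decision above, so one stale pipeline never blocks inference.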
17. Causal Inference & Uplift Modeling¶
Basic¶
Q: Correlation vs causation — what's the difference?
A:
- Correlation: X and Y move together, but not necessarily as cause and effect.
- Causation: X causes Y.
Example: ice cream sales correlate with drownings. The cause? Heat increases both. Ice cream does not cause drowning.
Q: What is uplift modeling?
A: A technique for estimating the incremental effect of a treatment on a specific user:
\[\text{Uplift} = P(Y|T=1) - P(Y|T=0)\]
where T = treatment and Y = outcome.
Medium¶
Q: Segment users by treatment response?
A:
| Segment | Treatment response | Strategy |
|---|---|---|
| Persuadables | Positive uplift | Target! |
| Sure Things | Buy anyway | Don't waste treatment |
| Lost Causes | Won't buy anyway | Don't target |
| Sleeping Dogs | Negative uplift (treatment hurts) | Avoid! |
Q: T-Learner vs S-Learner vs X-Learner?
A:
T-Learner (two models):
- Train separate models on the treatment and control groups
- Uplift = Model_T(x) - Model_C(x)
S-Learner (single model):
- One model with treatment as a feature
- Uplift = Model(x, T=1) - Model(x, T=0)
X-Learner:
- Step 1: T-Learner base models
- Step 2: compute individual treatment effects
- Step 3: propensity-weighted combination
- Better when treatment/control group sizes differ
# T-Learner implementation
from sklearn.ensemble import GradientBoostingClassifier

def t_learner(X_train, y_train, treatment_train):
    # Split by treatment
    X_treat = X_train[treatment_train == 1]
    y_treat = y_train[treatment_train == 1]
    X_ctrl = X_train[treatment_train == 0]
    y_ctrl = y_train[treatment_train == 0]
    # Train separate models
    model_t = GradientBoostingClassifier().fit(X_treat, y_treat)
    model_c = GradientBoostingClassifier().fit(X_ctrl, y_ctrl)
    return lambda X: model_t.predict_proba(X)[:, 1] - model_c.predict_proba(X)[:, 1]
Killer¶
Q: How do you evaluate an uplift model without ground truth?
A:
Problem: for any single user we observe the outcome under T=1 OR T=0, never both.
Solutions:
1. AUUC (Area Under Uplift Curve): rank users by predicted uplift and compare the cumulative treatment effect against random targeting.
2. Qini coefficient: \(Q = \sum_i (Y_{T,i} - Y_{C,i} \cdot \frac{n_T}{n_C})\)
3. Uplift-at-k: evaluate the treatment effect within the top-k by predicted uplift; requires held-out A/B test data.
4. Counterfactual estimation: causal inference methods (IPW, Doubly Robust).
Q: Propensity Score Matching — why and how?
A:
Goal: compare treatment and control groups with similar characteristics.
Method:
1. Estimate \(P(T=1|X)\) = propensity score (logistic regression)
2. Match treated and control units with similar scores
3. Compare outcomes within matched pairs
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

# 1. Estimate propensity scores
ps_model = LogisticRegression().fit(X, treatment)
propensity_scores = ps_model.predict_proba(X)[:, 1]

# 2. Match nearest neighbors
nn = NearestNeighbors(n_neighbors=1).fit(
    propensity_scores[treatment == 0].reshape(-1, 1)
)
distances, indices = nn.kneighbors(
    propensity_scores[treatment == 1].reshape(-1, 1)
)

# 3. Compare matched pairs
ate = y[treatment == 1].mean() - y[treatment == 0][indices.flatten()].mean()
18. Multi-Armed Bandits¶
Sources: GeeksforGeeks A/B Testing vs MAB, Analytics Vidhya MLOps Questions
Basic¶
Q: What is a Multi-Armed Bandit (MAB)?
A: A reinforcement learning algorithm that balances exploration (trying new options) against exploitation (playing the best known option).
The name comes from the "one-armed bandit" (slot machine): K arms = K options.
Q: MAB vs A/B Testing — what's the difference?
A:
| Aspect | A/B Testing | Multi-Armed Bandit |
|---|---|---|
| Allocation | Fixed (50/50) | Dynamic (adapts to winners) |
| Goal | Statistical significance | Maximize cumulative reward |
| Duration | Fixed period | Continuous |
| Regret | Wastes traffic on losers | Minimizes regret |
| Speed | Slow convergence | Fast adaptation |
Medium¶
Q: Epsilon-Greedy vs UCB vs Thompson Sampling?
A:
Epsilon-Greedy:
- With probability ε: explore a random arm
- With probability 1-ε: exploit the best arm
- Simple, but a fixed exploration rate
UCB (Upper Confidence Bound):
- Select the arm with the highest upper confidence bound
- \(UCB_i = \bar{r}_i + \sqrt{\frac{2 \ln n}{n_i}}\)
- Balances exploration/exploitation automatically
Thompson Sampling:
- Bayesian: sample from the posterior distribution of each arm
- Select the arm with the highest sample
- Works well for Bernoulli and Gaussian rewards
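A minimal Thompson Sampling loop for Bernoulli arms with Beta posteriors (synthetic click-through rates; the 5000-round budget is arbitrary):

```python
import numpy as np

# Thompson Sampling for Bernoulli arms with Beta posteriors (synthetic CTRs)
rng = np.random.default_rng(0)
true_ctr = np.array([0.02, 0.08, 0.04])   # hidden per-arm click rates
alpha = np.ones(3)                        # Beta(1, 1) priors
beta = np.ones(3)

for _ in range(5_000):
    samples = rng.beta(alpha, beta)       # one posterior draw per arm
    arm = int(np.argmax(samples))         # play the arm that looks best
    reward = int(rng.uniform() < true_ctr[arm])
    alpha[arm] += reward                  # Bayesian update
    beta[arm] += 1 - reward

pulls = alpha + beta - 2                  # pulls per arm
print(pulls)                              # the 0.08 arm should dominate
```

Exploration emerges naturally: uncertain arms have wide posteriors and occasionally produce the highest draw, while clearly inferior arms get sampled less and less.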
Q: When should you use a MAB instead of an A/B test?
A:
Use MAB when:
- You need to maximize reward during the experiment (ads, recommendations)
- The environment is non-stationary (preferences change)
- There are many variants to test
- A short experiment duration is acceptable
Use A/B when:
- You need statistical rigor (scientific conclusions)
- Regulation requires definitive proof
- You are learning about user behavior (not just optimizing)
- Exploration carries a risk of negative impact
Killer¶
Q: Design a MAB system for ad selection with 1000+ creatives.
A:
Challenges:
- 1000+ arms → slow convergence
- New creatives added constantly (cold start)
- Non-stationary CTR (seasonality, fatigue)
Solution:
# Hierarchical approach:
# 1. Contextual bandit with features (not just arm ID)
# 2. Clustering: similar creatives share learning
# 3. Thompson Sampling with warm start for new creatives

class ContextualBandit:
    def __init__(self, n_arms, context_dim):
        # BayesianLinear: a per-arm Bayesian linear model (implementation elided)
        self.models = [BayesianLinear(context_dim) for _ in range(n_arms)]

    def select(self, context):
        samples = [m.sample(context) for m in self.models]
        return np.argmax(samples)

Key components:
1. Contextual features: user, placement, time
2. Thompson Sampling: natural exploration
3. Cold start: use creative metadata for initial priors
4. Non-stationarity: decay old observations
5. Fallback: guaranteed exploration for new creatives
Q: How do you evaluate a MAB algorithm offline?
A:
Problem: you can't run multiple bandits on the same traffic.
Solutions:
1. Counterfactual evaluation: use logged data with propensity scores:
\(V = \frac{1}{N} \sum_{i} \frac{r_i \cdot \mathbb{1}(a_i = a)}{p(a_i | x_i)}\)
2. Replay method: simulate the bandit on historical data; only count reward when the bandit's action matches the logged action.
3. Off-policy evaluation (OPE): IPS (Inverse Propensity Scoring), Doubly Robust estimator.
# IPS Estimator def ips_estimate(logs, policy): total = 0 for log in logs: if policy.select(log.context) == log.action: total += log.reward / log.logging_prob return total / len(logs)Best practice: Online A/B test final candidates after offline filtering.
15. Model Drift Detection (Advanced)¶
Источник: Model Drift in Production (2026)
Basic¶
Q: Data drift vs Concept drift vs Label drift?
A:
| Type | What changes | Example |
|---|---|---|
| Data drift | P(X) | New user demographics, new devices |
| Concept drift | P(Y\|X) | Same features, new input–output relationship (e.g. fraud tactics evolve) |
| Label drift | P(Y) | Class balance changes, policy changes |
Q: What is PSI (Population Stability Index)?
A: Measure of distribution shift between baseline and current.
\[PSI = \sum (Current\% - Baseline\%) \times \ln\frac{Current\%}{Baseline\%}\]

Interpretation:
- PSI < 0.1: No significant shift
- 0.1 ≤ PSI < 0.25: Moderate shift, investigate
- PSI ≥ 0.25: Significant shift, action needed
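A minimal PSI implementation over equal-width bins; clipping empty buckets to 1e-4 is a common convention (not a standard) to avoid log(0):

```python
import math

def psi(baseline, current, bins=10):
    """Population Stability Index between two samples of a numeric feature."""
    lo, hi = min(baseline), max(baseline)
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def bucket_fracs(values):
        counts = [0] * bins
        for v in values:
            i = sum(v > e for e in edges)  # bin index; out-of-range values land in the last bin
            counts[i] += 1
        # Clip zero buckets so the log term stays finite
        return [max(c / len(values), 1e-4) for c in counts]

    b, c = bucket_fracs(baseline), bucket_fracs(current)
    return sum((ci - bi) * math.log(ci / bi) for bi, ci in zip(b, c))

baseline = [i / 100 for i in range(1000)]  # uniform on [0, 10)
shifted = [v + 3 for v in baseline]        # mean shift
same_psi = psi(baseline, list(baseline))
drift_psi = psi(baseline, shifted)
```

An identical distribution gives PSI = 0; the mean-shifted sample lands far above the 0.25 action threshold.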
Medium¶
Q: Drift metrics comparison — PSI vs KS-test vs Wasserstein?
A:
| Metric | Best for | Pros | Cons |
|---|---|---|---|
| PSI | Binned distributions | Interpretable, industry standard | Requires binning |
| KS-test | Continuous, 1D | Statistical significance, no binning | Only 1D, sensitive to sample size |
| Wasserstein | Continuous, geometric | Robust, captures shape | Less interpretable |
| JS/KL Divergence | Probability distributions | Information-theoretic | Requires density estimation |
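For contrast with PSI, the two-sample KS statistic needs no binning — a pure-Python sketch that walks both sorted samples and tracks the max gap between empirical CDFs:

```python
def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max |ECDF_a - ECDF_b|."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance through ties in both samples together
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

identical = ks_statistic(list(range(100)), list(range(100)))
shifted = ks_statistic(list(range(100)), list(range(50, 150)))
```

Identical samples give 0; a half-overlapping shift gives exactly 0.5. In production `scipy.stats.ks_2samp` also returns the p-value.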
Q: How to set up drift monitoring in production?
A:
- Baselines:
  - Training distribution
  - Healthy production window
  - Seasonal baselines (last year same period)
- Windows:
  - Short (1h/1d): sudden shifts, pipeline bugs
  - Medium (7d): noise smoothing
  - Long (30d): slow drift
- Slicing:
  - Country/locale
  - Device/OS
  - User segment (new/returning)
- Alerting:
  - Warning: investigate (PSI > 0.1)
  - Critical: mitigate (PSI > 0.25, performance drop)
  - Persistence: alert only if N consecutive windows
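The persistence rule above can be sketched as a tiny stateful check; the threshold and patience values here are illustrative:

```python
class PersistentAlert:
    """Fire only after `patience` consecutive windows above threshold,
    suppressing one-off noise spikes."""
    def __init__(self, threshold=0.1, patience=3):
        self.threshold = threshold
        self.patience = patience
        self.streak = 0

    def update(self, psi_value):
        # Reset the streak on any healthy window
        self.streak = self.streak + 1 if psi_value > self.threshold else 0
        return self.streak >= self.patience

alerter = PersistentAlert(threshold=0.1, patience=3)
signals = [alerter.update(v) for v in [0.05, 0.15, 0.04, 0.12, 0.13, 0.14]]
```

A single spike (0.15) is ignored; the alert fires only on the third consecutive drifted window.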
Killer¶
Q: Drift detected at 3am. Your response playbook?
A:
Phase 1: Triage (15 min)
1. Check data integrity: null spikes, schema changes, pipeline failures
2. Check recent changes: deployments, upstream API changes
3. Localize: which slice(s) affected?

Phase 2: Immediate Mitigation (1h)

if data_pipeline_broken:
    fix_pipeline()  # Highest priority
elif model_degraded:
    if new_model_recently_deployed:
        rollback()
    else:
        increase_fallback_threshold()
        route_to_human_review()

Phase 3: Investigation (4h)
- Compare failure patterns to baseline
- Check feature-level drift
- Review label drift (if labels available)

Phase 4: Resolution (1d)
- Targeted labeling for drifted slices
- Retrain with refreshed data
- Calibration refresh if score distribution shifted

Phase 5: Prevention (1w)
- Add data validation gates in CI/CD
- Improve dashboard/alerting
- Document incident in runbook
Q: Drift in LLM/RAG systems — what's different?
A:
LLM-specific drift sources:
1. Prompt drift: System prompt changes, template updates
2. Retrieval drift: Knowledge base updates, embedding model changes
3. Tool drift: API schemas change, latency changes

Monitoring signals:
- Retrieval hit rate
- Top-k similarity scores
- Citation coverage
- Answer without retrieval rate
- Tool call success rate
Key insight: "The model" in LLM systems = weights + prompts + retrieval + tools. Version all components.
16. Online Learning (Streaming ML)¶
Basic¶
Q: Online vs Batch Learning — в чём разница?
A:
| Aspect | Batch | Online |
|---|---|---|
| Data | Fixed dataset | Continuous stream |
| Updates | Retrain periodically | Update after each sample |
| Memory | Store all data | Recent window only |
| Latency | Hours/days | Milliseconds |
| Use case | Stable distributions | Non-stationary data |
Q: Когда использовать online learning?
A: (1) Real-time bidding (ad tech), (2) Fraud detection, (3) Recommendation systems, (4) High-velocity data streams, (5) Concept drift environments.
Medium¶
Q: FTRL-Proximal — как работает?
A: Follow-The-Regularized-Leader with L1 regularization. Для sparse high-dimensional features (ads, recommendations).
# FTRL-Proximal per-coordinate update (simplified)
sigma_i = (sqrt(n_i + grad^2) - sqrt(n_i)) / alpha
z_i += grad - sigma_i * w_i
n_i += grad^2
w_i = 0 if |z_i| <= lambda1 else -(z_i - sign(z_i) * lambda1) / ((beta + sqrt(n_i)) / alpha + lambda2)  # L1 sparsity
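A runnable sketch of the same update rule as a sparse logistic regression (dict-based weights; hyperparameter values are illustrative, loosely following the McMahan et al. FTRL-Proximal formulation):

```python
import math

class FTRLProximal:
    """Sparse per-coordinate FTRL-Proximal logistic regression (sketch)."""
    def __init__(self, alpha=0.5, beta=1.0, l1=0.1, l2=1.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z, self.n = {}, {}

    def _weight(self, i):
        z = self.z.get(i, 0.0)
        if abs(z) <= self.l1:
            return 0.0  # L1 keeps the weight exactly zero (sparsity)
        sign = 1.0 if z > 0 else -1.0
        return -(z - sign * self.l1) / (
            (self.beta + math.sqrt(self.n.get(i, 0.0))) / self.alpha + self.l2)

    def predict(self, x):
        """x: sparse dict of feature -> value."""
        s = sum(self._weight(i) * v for i, v in x.items())
        return 1.0 / (1.0 + math.exp(-max(min(s, 35.0), -35.0)))

    def update(self, x, y):
        p = self.predict(x)
        for i, v in x.items():
            g = (p - y) * v  # logistic loss gradient
            n = self.n.get(i, 0.0)
            sigma = (math.sqrt(n + g * g) - math.sqrt(n)) / self.alpha
            self.z[i] = self.z.get(i, 0.0) + g - sigma * self._weight(i)
            self.n[i] = n + g * g

model = FTRLProximal()
for _ in range(200):
    model.update({"spammy_word": 1.0}, 1)  # positive examples
    model.update({"normal_word": 1.0}, 0)  # negative examples
p_spam = model.predict({"spammy_word": 1.0})
p_norm = model.predict({"normal_word": 1.0})
```

Weights are materialized lazily from `z`/`n`, so unseen features cost nothing — the property that makes FTRL practical for millions of sparse features.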
Q: Как детектировать concept drift в online learning?
A:
- ADWIN (Adaptive Windowing): detects change when window variance exceeds threshold
- DDM (Drift Detection Method): monitors error rate, alerts on significant increase
- Page-Hinkley Test: cumulative sum of deviations from mean
from river import drift

detector = drift.ADWIN()
for x, y in stream:
    y_pred = model.predict_one(x)
    error = int(y_pred != y)
    detector.update(error)
    if detector.drift_detected:
        model = reset_model()  # Retrain from scratch
Killer¶
Q: Спроектируйте online ML pipeline для fraud detection.
A:
Architecture:
[Kafka Stream] → [Flink ML] → [Model] → [Decision Engine]
      ↓              ↓            ↓             ↓
 Transactions    Features    Prediction     Action
  (100K/sec)   (aggregates) (fraud prob) (block/allow)

Feature Pipeline (Flink):
- 5-min tumbling windows: tx_count, tx_amount_sum
- Sliding windows: velocity_1h, velocity_24h
- Real-time aggregations: merchant_tx_count, user_distinct_merchants
Model: Online Logistic Regression with FTRL
- Features: ~1M sparse features (user, merchant, device embeddings)
- Update: per-transaction gradient step
- Latency: <10ms including feature computation

Drift Handling:
- ADWIN for performance monitoring
- Automatic model reset on significant drift
- Shadow model for A/B comparison

Fallback:
- Rule-based fallback if ML latency > 50ms
- Feature freshness monitoring
17. Multi-Stage Recommender Systems¶
Basic¶
Q: Что такое multi-stage recommender?
A: Funnel architecture из нескольких stages:
1. Retrieval: Millions → Thousands (coarse filtering)
2. Pre-ranking: Thousands → Hundreds (light model)
3. Ranking: Hundreds → Tens (heavy model)
4. Re-ranking: Final diversity/freshness adjustments
Q: Почему нужна multi-stage архитектура?
A: Accuracy vs Latency tradeoff. Нельзя пропустить миллион items через тяжёлую модель за 50ms. Retrieval дешёвый, ranking дорогой.
Medium¶
Q: Two-Tower model для retrieval — как работает?
A:
User Features → [User Tower] → User Embedding (64-256d)
                                     ↓
                                dot product
                                     ↑
Item Features → [Item Tower] → Item Embedding (64-256d)

Training: In-batch negatives (other items in batch as negatives)
Inference: Pre-computed item embeddings + ANN search (FAISS, ScaNN)
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTower(nn.Module):
    def __init__(self, user_dim, item_dim, embed_dim):
        super().__init__()
        self.user_tower = nn.Sequential(
            nn.Linear(user_dim, 256), nn.ReLU(),
            nn.Linear(256, embed_dim)
        )
        self.item_tower = nn.Sequential(
            nn.Linear(item_dim, 256), nn.ReLU(),
            nn.Linear(256, embed_dim)
        )

    def forward(self, user_features, item_features):
        user_emb = F.normalize(self.user_tower(user_features), dim=-1)
        item_emb = F.normalize(self.item_tower(item_features), dim=-1)
        return (user_emb * item_emb).sum(dim=-1)  # Cosine similarity
Q: ANN indexes — когда какой?
A:
| Index | Build Time | Query Time | Recall | Use Case |
|---|---|---|---|---|
| Flat | O(1) | O(n) | 100% | <100K items |
| IVF | O(n) | O(sqrt(n)) | 90-95% | 100K-10M |
| HNSW | O(n log n) | O(log n) | 95-99% | Real-time, high recall |
| IVF-PQ | O(n) | O(sqrt(n)/m) | 80-90% | Memory-constrained |
Killer¶
Q: Спроектируйте YouTube-scale recommender (2B users, 1B videos).
A:
Stage 1: Retrieval (Candidates: 1B → 100)
- Two-Tower with user watch history + video embeddings
- ANN index: HNSW on 256-dim embeddings
- Multiple retrieval sources: collaborative, content-based, trending
- Latency: ~10ms

Stage 2: Pre-ranking (100 → 20)
- Light GBDT (50 trees, depth 4)
- Features: user-video affinity, video popularity, recency
- Latency: ~5ms

Stage 3: Ranking (20 → 5)
- Deep ranking model (DCNv2 or DeepFM)
- Features: rich cross-features, user context, video quality scores
- Target: watch time prediction (weighted logistic)
- Latency: ~20ms

Stage 4: Re-ranking
- Diversity: MMR (Maximal Marginal Relevance)
- Freshness: boost new content
- Business rules: remove watched, age restrictions
- Latency: ~2ms
Total latency: ~40ms P99
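The MMR step in re-ranking is compact enough to sketch directly. This assumes L2-normalized embeddings (so dot product = cosine); `lam` trades off relevance vs redundancy:

```python
def mmr(query_vec, candidates, k=5, lam=0.7):
    """Maximal Marginal Relevance: greedily pick items relevant to the
    query but not redundant with already-selected items.
    candidates: list of (item_id, embedding), embeddings normalized."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    selected, remaining = [], list(candidates)
    while remaining and len(selected) < k:
        def score(cand):
            relevance = dot(query_vec, cand[1])
            redundancy = max((dot(cand[1], s[1]) for s in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [item_id for item_id, _ in selected]

# "b" is an exact duplicate of "a": with lam=0.5 MMR demotes it below
# the diverse "c" and even the weakly relevant "d"
ranked = mmr(
    (1.0, 0.0, 0.0),
    [("a", (0.9, 0.436, 0.0)), ("b", (0.9, 0.436, 0.0)),
     ("c", (0.9, -0.436, 0.0)), ("d", (0.1, 0.0, 0.995))],
    k=3, lam=0.5,
)
```

With lam=1.0 this degenerates to plain relevance ranking; lower lam pushes diversity harder.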
18. Vector Databases for ML¶
Basic¶
Q: Что такое vector database?
A: База данных для хранения и поиска по vector embeddings. Оптимизирована для Approximate Nearest Neighbor (ANN) search.
Q: Когда нужен vector DB vs обычный DB?
A:
- Vector DB: semantic search, RAG, recommendation retrieval, duplicate detection
- Regular DB: exact match, range queries, aggregations, ACID transactions
Medium¶
Q: HNSW vs IVF — когда что?
A:
| HNSW | IVF |
|---|---|
| Graph-based | Cluster-based |
| Higher recall, more memory | Lower memory, tunable recall |
| Better for real-time updates | Better for batch rebuilds |
| O(log n) query | O(sqrt(n)) query |
| Complex params (M, ef) | Simpler params (nlist, nprobe) |
Q: Что такое hybrid search?
A: Комбинация vector search + keyword search (BM25). RRF (Reciprocal Rank Fusion) для объединения:
def reciprocal_rank_fusion(vector_results, keyword_results, k=60):
    scores = {}
    for rank, doc in enumerate(vector_results):
        scores[doc.id] = scores.get(doc.id, 0) + 1 / (k + rank)
    for rank, doc in enumerate(keyword_results):
        scores[doc.id] = scores.get(doc.id, 0) + 1 / (k + rank)
    return sorted(scores.items(), key=lambda x: -x[1])
Killer¶
Q: Выбор vector DB для production — критерии?
A:
| DB | Strengths | Weaknesses | Use Case |
|---|---|---|---|
| Pinecone | Managed, scalable | Expensive, vendor lock-in | Enterprise, no ops |
| Milvus | Open-source, feature-rich | Complex setup | Large-scale, self-hosted |
| Weaviate | GraphQL, modules | Younger ecosystem | RAG, multimodal |
| Qdrant | Rust, filtering | Smaller community | Performance-critical |
| pgvector | Postgres extension | Limited scale | Existing Postgres infra |
| Chroma | Simple, embedded | Not for scale | Prototyping, small apps |
19. Cost Optimization for ML Inference¶
Basic¶
Q: Основные статьи расходов ML inference?
A:
- GPU compute: 60-70%
- Memory: 15-20%
- Network: 5-10%
- Storage: 5-10%
Q: Как снизить cost per prediction?
A: (1) Model quantization, (2) Batching, (3) GPU sharing, (4) Spot instances, (5) Model right-sizing, (6) Caching.
Medium¶
Q: Spot instances для ML — стратегия?
A:
- Use для batch inference, training
- NOT для latency-critical online inference
- Preemption detection: cloud metadata API
- Graceful shutdown: checkpoint every N batches
- Fallback: on-demand pool ready
# Spot instance preemption handling
import requests

def check_preemption():
    try:
        # GCP metadata endpoint reports whether this VM was preempted
        resp = requests.get(
            'http://metadata.google.internal/computeMetadata/v1/instance/preempted',
            headers={'Metadata-Flavor': 'Google'},
            timeout=1,
        )
        return resp.text == 'TRUE'
    except requests.RequestException:
        return False

# In inference loop
for batch in data:
    if check_preemption():
        save_checkpoint(model, batch_position)
        notify_fallback_pool()
        break
    predictions = model(batch)
Q: Semantic caching для LLM — как работает?
A:
1. Embed query with sentence transformer
2. Search for similar queries in cache (cosine similarity > 0.95)
3. If found: return cached response
4. If not: call LLM, cache response with embedding
Savings: 20-40% of LLM calls for customer support, FAQ use cases.
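The four steps above can be sketched as a minimal cache. The bag-of-words `toy_embed` is a stand-in for a real sentence encoder, and `llm_call` is a hypothetical client — both injected so they are easy to swap:

```python
def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = (sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5)
    return num / den if den else 0.0

class SemanticCache:
    """Return a cached LLM response when a semantically similar query was seen."""
    def __init__(self, embed, llm_call, threshold=0.95):
        self.embed, self.llm_call, self.threshold = embed, llm_call, threshold
        self.entries = []  # list of (embedding, response)
        self.hits = 0

    def query(self, text):
        emb = self.embed(text)
        for cached_emb, response in self.entries:
            if cosine(emb, cached_emb) >= self.threshold:
                self.hits += 1
                return response  # cache hit: no LLM call
        response = self.llm_call(text)
        self.entries.append((emb, response))
        return response

# Toy embed: word counts over a tiny vocabulary (stand-in for a real encoder)
VOCAB = ["reset", "password", "refund", "order", "how"]
def toy_embed(text):
    words = text.lower().split()
    return [words.count(w) for w in VOCAB]

cache = SemanticCache(toy_embed, llm_call=lambda q: f"answer:{q}")
r1 = cache.query("how reset password")
r2 = cache.query("reset password how")  # same bag-of-words -> cache hit
r3 = cache.query("refund order")        # different intent -> miss
```

A production version would keep the embeddings in an ANN index instead of a linear scan, and add TTL/invalidation when the underlying knowledge changes.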
Killer¶
Q: Cost optimization strategy для inference platform (100 models, 1B predictions/day)?
A:
Tier 1: Model Optimization (40% savings)
- Quantize all models to INT8 (2-4x throughput)
- Distill ensemble models where possible
- Prune unused features/neurons

Tier 2: Infrastructure (30% savings)
- Spot instances for 70% of batch traffic
- GPU sharing with MIG (Multi-Instance GPU)
- Right-size: A10G for small models, H100 for large

Tier 3: Traffic Optimization (20% savings)
- Semantic caching for LLM endpoints (30% hit rate)
- Request batching with max_wait=10ms
- Model routing: simple queries → small models

Tier 4: Monitoring & Governance (10% savings)
- Cost per prediction dashboards
- Budget alerts per team
- Unused model deprecation policy
20. Multi-Model Serving and Model Routing¶
Basic¶
Q: Зачем нужна multi-model serving?
A: (1) Different tasks (classification, NER, QA), (2) Cost optimization (route to cheaper models), (3) Redundancy, (4) A/B testing, (5) Graceful degradation.
Q: Routing strategies — какие бывают?
A:
- Weighted round-robin
- Latency-based
- Cost-aware
- Confidence-based (cascade)
- Content-based (route by input features)
Medium¶
Q: Cascade routing — как работает?
A:
1. Try small/fast model first
2. If confidence > threshold: return prediction
3. If confidence < threshold: route to larger model
4. Optionally: third tier for edge cases
class CascadeRouter:
    def __init__(self, small_model, large_model, threshold=0.8):
        self.small = small_model
        self.large = large_model
        self.threshold = threshold

    def predict(self, x):
        pred, conf = self.small.predict_with_confidence(x)
        if conf > self.threshold:
            return pred
        return self.large.predict(x)
Q: Circuit Breaker pattern для model fallback?
A: Отслеживаем ошибки модели. После N последовательных failures circuit "открывается" (OPEN) — трафик сразу идёт на fallback без вызова основной модели. Через cooldown пропускаем один пробный запрос (HALF-OPEN): успех закрывает circuit, ошибка снова открывает. Это защищает latency budget от деградировавшего endpoint'а.
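A sketch of the circuit breaker state machine with an injectable clock (the thresholds and cooldown are illustrative):

```python
import time

class CircuitBreaker:
    """CLOSED -> OPEN after `max_failures` consecutive errors; while OPEN,
    calls go straight to the fallback; after `cooldown` seconds one trial
    call is allowed (HALF-OPEN), and success closes the circuit again."""
    def __init__(self, max_failures=3, cooldown=30.0, clock=time.monotonic):
        self.max_failures, self.cooldown, self.clock = max_failures, cooldown, clock
        self.failures = 0
        self.opened_at = None

    def call(self, primary, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                return fallback()      # OPEN: skip the failing model entirely
            self.opened_at = None      # HALF-OPEN: let one trial call through
        try:
            result = primary()
            self.failures = 0          # success closes the circuit
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()

# Toy demo with a fake clock and a failing primary model
t = [0.0]
primary_calls = {"n": 0}
def failing_model():
    primary_calls["n"] += 1
    raise RuntimeError("model down")

breaker = CircuitBreaker(max_failures=2, cooldown=30.0, clock=lambda: t[0])
results = [breaker.call(failing_model, lambda: "fallback") for _ in range(3)]
t[0] = 31.0  # past the cooldown: one trial call goes through
recovered = breaker.call(lambda: "ok", lambda: "fallback")
```

Note that the third call never reaches the primary model — that is the point: a degraded endpoint stops burning latency budget.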
Killer¶
Q: Спроектируйте model router для LLM API (GPT-4, Claude, Gemini, local Llama).
A:
Routing Decision Matrix:
| Query Type | Route To | Why |
|---|---|---|
| Code generation | Claude/GPT-4 | Best code quality |
| Simple Q&A | Llama 70B | 100x cheaper |
| Long context (>32K) | Claude 200K | Context window |
| Real-time chat | Llama 70B | Lowest latency |
| Complex reasoning | GPT-4 o1 | Chain-of-thought |
| Image input | GPT-4V/Claude | Multimodal |

Implementation:

class LLMRouter:
    def route(self, query, context):
        # Content-based routing
        if len(context) > 32000:
            return "claude-200k"
        if "code" in query or "implement" in query:
            return self.circuit_breaker.call("gpt-4", fallback="llama-70b")
        if self.is_simple_query(query):
            return "llama-70b"
        if self.needs_reasoning(query):
            return "o1"
        return self.cost_aware_select(query)  # Balance cost/quality

Circuit Breaker Integration:
- Track per-model error rates
- Fallback chain: primary → backup → local model
- Automatic recovery after 30s cooldown
21. AI Agents in Production¶
Basic¶
Q: Что такое AI agent?
A: Автономная система, которая: (1) Воспринимает environment, (2) Принимает решения, (3) Выполняет actions через tools, (4) Имеет memory/goals.
Q: Agent vs обычный LLM chat?
A:
- LLM chat: single response, no tools, no memory
- Agent: multi-step reasoning, tool use, persistent memory, goal-directed
Medium¶
Q: ReAct pattern — как работает?
A: Reasoning + Acting loop: модель чередует Thought (рассуждение), Action (вызов tool) и Observation (результат tool), пока не выдаст Final Answer.

Thought → Action → Observation → Thought → ... → Final Answer
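A skeleton of the loop with a scripted stand-in LLM; `parse_action` assumes the conventional `Action: tool[argument]` format, and real frameworks add prompt templates, retries and guardrails around this:

```python
def react_loop(llm, tools, question, max_steps=5):
    """Minimal ReAct skeleton: the LLM alternates Thought/Action steps,
    tool observations are appended to the scratchpad, and the loop stops
    at a Final Answer."""
    scratchpad = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(scratchpad)  # e.g. "Thought: ... Action: search[x]"
        scratchpad += step + "\n"
        if step.startswith("Final Answer:"):
            return step.removeprefix("Final Answer:").strip()
        if "Action:" in step:
            tool_name, arg = parse_action(step)
            scratchpad += f"Observation: {tools[tool_name](arg)}\n"
    return "max steps exceeded"

def parse_action(step):
    # Expects the conventional "Action: tool[argument]" format
    action = step.split("Action:")[1].strip()
    name, _, rest = action.partition("[")
    return name.strip(), rest.rstrip("]")

# Scripted stand-in LLM: first requests a lookup, then answers
script = iter([
    "Thought: need the capital. Action: lookup[France]",
    "Final Answer: Paris",
])
tools = {"lookup": lambda arg: {"France": "Paris"}[arg]}
answer = react_loop(lambda scratchpad: next(script), tools, "Capital of France?")
```

The `max_steps` cap is essential in production — it bounds cost when the model loops without converging.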
Q: Human-in-the-Loop (HITL) patterns?
A:
1. Approval gates: Critical actions require human approval
2. Review queues: Batch agent outputs for human review
3. Escalation: Agent requests help when uncertain
4. Correction feedback: Human corrections improve agent
from langgraph.types import interrupt

def agent_with_hitl(state):
    result = agent_step(state)
    if is_critical_action(result):
        human_response = interrupt("Approve action?")
        if not human_response.approved:
            result = revise_plan(result)
    return result
Killer¶
Q: Defence-in-Depth для AI agents — архитектура?
A:
Layer 1: Input Sanitization
- PII detection and redaction
- Prompt injection detection
- Length/rate limits

Layer 2: Agent Execution
- Sandbox environment
- Resource limits (time, tokens, API calls)
- State isolation

Layer 3: Tool Gatekeeping
- Allowlist of approved tools
- Permission levels per tool
- Schema validation on inputs

Layer 4: Output Validation
- Content policy checks
- Format validation
- Sensitive data filter

Layer 5: Observability
- Full execution traces
- Decision audit log
- Anomaly detection
22. Security for ML¶
Basic¶
Q: Основные типы атак на ML модели?
A:
- Evasion: Adversarial inputs at inference (FGSM, PGD)
- Poisoning: Malicious training data
- Extraction: Steal model via queries
- Inversion: Reconstruct training data
- Membership Inference: Determine if sample was in training
Q: Что такое adversarial example?
A: Input with imperceptible perturbation that causes misclassification. Example: image + noise → wrong class with high confidence.
Medium¶
Q: FGSM attack — формула?
A: Fast Gradient Sign Method:

\[x_{adv} = x + \epsilon \cdot \text{sign}(\nabla_x J(\theta, x, y))\]

Where:
- \(x\) = original input
- \(\epsilon\) = perturbation magnitude (e.g., 0.01)
- \(J\) = loss function
- \(\nabla_x\) = gradient w.r.t. input
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon):
    x_adv = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    # Detach so the perturbed input carries no gradient history
    return (x_adv + epsilon * x_adv.grad.sign()).detach()
Q: Как защититься от model extraction?
A:
- Rate limiting per API key
- Output perturbation (add noise, round predictions)
- Watermarking model outputs
- Query pattern detection
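Output perturbation from the list above, sketched as a post-processing step (the noise scale and rounding precision are illustrative; for non-borderline predictions the argmax is preserved):

```python
import random

def harden_prediction(probs, decimals=2, noise_scale=0.01, rng=random):
    """Round and add small noise to output probabilities: degrades the
    per-query signal a model-extraction attacker receives while barely
    affecting legitimate callers."""
    noisy = [max(p + rng.uniform(-noise_scale, noise_scale), 0.0) for p in probs]
    total = sum(noisy) or 1.0
    return [round(p / total, decimals) for p in noisy]

random.seed(0)
hardened = harden_prediction([0.91, 0.06, 0.03])
```

Calibrated (e.g. Laplace) noise gives formal guarantees; uniform noise here keeps the sketch dependency-free.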
Killer¶
Q: Security architecture для production ML API?
A:
Defense Layers:
- Input Layer:
- Schema validation
- Anomaly detection on inputs
Rate limiting (100 req/min/user)
Model Layer:
- Adversarial training (PGD)
- Input preprocessing (randomization)
Confidence thresholding
Output Layer:
- Prediction rounding (2-3 decimals)
- Add calibrated noise
Watermark embedding
Monitoring Layer:
- Query distribution drift
- Suspicious user patterns
Model extraction detection
Access Layer:
- Authentication required
- API key rotation
- IP allowlisting for enterprise
Типичные заблуждения¶
Заблуждение: на MLSD-интервью главное -- правильно выбрать модель
Model choice -- 10-15% оценки. Интервьюеры оценивают: (1) Правильные clarifying questions (scope, scale, latency), (2) System architecture (data flow, components), (3) Feature engineering (что и почему), (4) Trade-offs discussion (precision vs recall, latency vs accuracy), (5) Monitoring и feedback loops. Кандидат, который сразу говорит "BERT" без обсуждения requirements -- red flag.
Заблуждение: нужно запоминать точные архитектуры (YouTube, Instagram)
Запоминание конкретных архитектур бесполезно -- интервьюер меняет constraints. Нужно понимать ПРИНЦИПЫ: multi-stage funnel (retrieval -> ranking -> re-ranking), cascade routing (fast model -> heavy model), confidence-based human-in-the-loop, feedback loops. С этими принципами можно спроектировать любую систему.
Заблуждение: если не знаешь ответ на вопрос -- нужно что-то придумать
Честное 'я не уверен, но вот моё рассуждение...' оценивается выше, чем уверенный неправильный ответ. MLSD-интервью проверяет мышление, а не память. Подход: (1) Назови что знаешь, (2) Рассуждай от первых принципов, (3) Предложи как бы ты это исследовал. Это показывает инженерное мышление.
Вопросы с оценкой ответов¶
Как вы подойдёте к MLSD-вопросу, который вы раньше не решали?
"Начну с выбора модели и опишу training pipeline" -- skip requirements gathering
"Стандартный framework: (1) Clarifying questions: scope, scale, latency SLA, data availability -- 5 min. (2) High-level architecture: data flow, основные components -- 10 min. (3) Deep dive: features, model choice с обоснованием, training pipeline -- 15 min. (4) Trade-offs и operations: monitoring, A/B testing, failure modes -- 10 min. (5) Extensions: scaling, edge cases. Этот framework работает для ЛЮБОЙ MLSD задачи, потому что фокусируется на системном дизайне, а не на конкретной модели."
Precision 95% vs Recall 95% -- что выбрать для fraud detection?
"Precision, чтобы не блокировать легитимные транзакции" -- не учитывает asymmetric cost
"Recall > Precision для fraud detection: пропущенный fraud ($1000-100K потеря) стоит в 10-100x дороже, чем ложное срабатывание (задержка транзакции на 30 секунд для verification). Но не бинарный выбор -- использую tiered approach: (1) High recall (99%+) для flagging, (2) Human review для flagged transactions, (3) Auto-block только при очень высокой confidence (>99.5%). Business metric: $ saved from fraud / $ lost from false blocks."
See Also¶
- Recommendation System Case — полная архитектура RecSys
- Search Ranking Case — BM25, semantic search, LambdaMART
- Ad Click Prediction Case — CTR, DCN, auction
- Fraud Detection Case — rules + ML + graph analysis
- Spam Detection Case — multi-model ensemble, network analysis
- News Feed Ranking Case — multi-task ranking
- Content Moderation Case — text/image/video pipelines
- Metrics Cheatsheet — NDCG, MAP, PR-AUC, F1