
ML System Design: Interview Preparation


Prerequisites: MLSD Materials | MLSD Updates | Case Studies

ML System Design is not about algorithms but about engineering trade-offs: latency vs throughput, freshness vs cost, accuracy vs interpretability. This file contains 100+ questions across 26 key topics, from model serving to LLM production. Each question is labeled by level: Basic (Junior), Medium (Middle), Killer (Senior+). Format: Q/A with concrete numbers and formulas, not abstract hand-waving.

Interview questions for 26 ML System Design topics. Levels: Basic, Medium, Killer. Updated: 2026-02-11


1. Model Serving & Latency

Basic

Q: What are latency P50, P99, P99.9?

A: P50 = the median (50% of requests are faster), P99 = the 99th percentile (what matters for SLAs), P99.9 = the worst-case outliers.
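
A quick way to compute these from raw measurements (a minimal sketch with NumPy; `latencies_ms` stands in for your own data):

import numpy as np

latencies_ms = np.array([12, 15, 14, 18, 22, 35, 80, 16, 17, 250])  # per-request latencies
p50, p99, p999 = np.percentile(latencies_ms, [50, 99, 99.9])
print(f"P50={p50:.1f}ms  P99={p99:.1f}ms  P99.9={p999:.1f}ms")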

Q: Why dynamic batching?

A: It groups incoming requests together for efficient GPU utilization and reduces per-request overhead.

Medium

Q: How would you cut inference latency in half?

A: (1) Quantization (FP16/INT8), (2) Batching, (3) Model distillation, (4) ONNX optimization, (5) Caching, (6) Async processing.

Q: Online vs batch inference — what's the difference?

A: Online = real-time (<100ms), for user-facing paths. Batch = deferred, for analytics/offline scoring. Batch is cheaper; online requires low-latency optimization.

Killer

Q: Design an inference system for 100K RPS with P99 < 50ms.

A: (1) Model quantization to INT8, (2) Request batching with max_wait=5ms, (3) GPU inference with TensorRT, (4) Load balancer + auto-scaling, (5) Request coalescing, (6) Regional endpoints, (7) Cache for popular queries.


2. A/B Testing

Basic

Q: Why do we need the minimum detectable effect (MDE)?

A: It defines the smallest change we want to be able to detect. It drives sample size: smaller MDE → larger sample.

Q: What is a p-value?

A: The probability of observing results as extreme or more extreme than ours under H0 (no difference). p < 0.05 = statistically significant.

Medium

Q: Sample size formula for an A/B test?

A: n = 16 * sigma^2 / delta^2 per variant for 95% confidence and 80% power, where sigma^2 = p(1-p) for a proportion metric and delta = MDE.
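
A worked example of this rule of thumb (a sketch; the baseline conversion and MDE are illustrative numbers):

p = 0.10          # baseline conversion rate
mde = 0.01        # absolute MDE: detect 10% -> 11%
sigma2 = p * (1 - p)
n_per_variant = 16 * sigma2 / mde**2
print(int(n_per_variant))  # ~14,400 users per variant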

Q: What is the multiple comparisons problem?

A: More tests → a higher chance of a false positive. Fixes: Bonferroni correction (alpha/m), False Discovery Rate control.

Killer

Q: How do you run an A/B test for an ML model with network effects?

A: Network effects = user A's behavior influences user B. Options: (1) Cluster-based randomization, (2) Geo-based split, (3) Time-based (switchback) A/B, (4) Counterfactual evaluation.


3. Drift Detection

Basic

Q: Types of drift?

A: Data drift (the distribution of X changes), concept drift (P(Y|X) changes), label drift (the distribution of Y changes).

Q: What is PSI?

A: Population Stability Index = a measure of distribution shift. PSI < 0.1 = OK, 0.1-0.25 = moderate, > 0.25 = significant.
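
A minimal PSI implementation over fixed bins (a sketch; in practice the bin edges are frozen on the baseline window):

import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline sample and a current sample."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    e_pct, a_pct = np.clip(e_pct, 1e-6, None), np.clip(a_pct, 1e-6, None)  # avoid log(0)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))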

Medium

Q: PSI vs KS-test vs Wasserstein?

A: PSI = for binned distributions, easy to interpret. KS-test = for continuous features, gives statistical significance. Wasserstein = robust, has a geometric interpretation.

Q: How do you set up drift alerting?

A: (1) Define baseline window, (2) Calculate metrics hourly/daily, (3) Set thresholds (PSI > 0.25), (4) Multi-feature monitoring, (5) Business metric correlation.

Killer

Q: Drift detected. What do you do?

A: (1) Investigate the root cause (data pipeline, feature, business change), (2) Check whether it is label drift or feature drift, (3) Evaluate the model on new data, (4) Retrain if the accuracy drop exceeds a threshold, (5) Consider incremental learning, (6) Update monitoring thresholds.


4. Model Calibration

Basic

Q: What is a calibrated model?

A: Predicted probability = empirical frequency. If the model predicts 0.7, the event occurs in 70% of such cases.

Q: When is calibration needed?

A: Whenever you need accurate probabilities: medical diagnosis, risk scoring, cost-sensitive decisions.

Medium

Q: Platt Scaling vs Isotonic Regression?

A: Platt = parametric (logistic), good for sigmoid-shaped miscalibration, only 2 parameters. Isotonic = non-parametric, more flexible, but needs more data and risks overfitting.

Q: How do you evaluate calibration?

A: (1) Calibration curve (reliability diagram), (2) Brier score (lower = better), (3) Expected Calibration Error (ECE).
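
A sketch of these checks with scikit-learn (`y_true` and `y_prob` are assumed to come from a held-out set):

from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=10)  # reliability diagram points
brier = brier_score_loss(y_true, y_prob)                            # lower is better
ece = abs(frac_pos - mean_pred).mean()                              # crude, unweighted ECE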

Killer

Q: The model is well calibrated but accuracy is low. What's wrong?

A: Calibration != discrimination. A model can be calibrated yet useless (it only predicts the baseline probabilities). Check both: calibration and ROC-AUC/accuracy.


5. Ranking Metrics

Basic

Q: What is NDCG?

A: Normalized Discounted Cumulative Gain. It accounts for both position and relevance. DCG sums relevance / log2(position + 1); NDCG = DCG / ideal DCG.
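
A minimal NDCG@k computation from graded relevance (a sketch; `rels` holds the relevance of items in ranked order):

import numpy as np

def dcg(rels, k):
    rels = np.asarray(rels, dtype=float)[:k]
    return float(np.sum(rels / np.log2(np.arange(2, len(rels) + 2))))

def ndcg(rels, k):
    ideal = dcg(sorted(rels, reverse=True), k)
    return dcg(rels, k) / ideal if ideal > 0 else 0.0

print(ndcg([3, 2, 0, 1], k=4))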

Q: Precision@k vs Recall@k?

A: P@k = relevant items in the top-k / k. R@k = relevant items in the top-k / total number of relevant items.

Medium

Q: MRR vs MAP?

A: MRR = Mean Reciprocal Rank, only the first relevant item counts. MAP = Mean Average Precision, accounts for all relevant positions.

Q: When do you use NDCG vs MAP?

A: NDCG = when relevance is graded (0,1,2,3...). MAP = when relevance is binary (0 or 1).

Killer

Q: How do you optimize ranking metrics during training?

A: (1) Listwise losses (LambdaLoss), (2) Pairwise losses (RankNet), (3) Approximate NDCG losses, (4) Learning-to-Rank frameworks (XGBoost ranker, TF-Ranking).
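
A sketch of the Learning-to-Rank route with XGBoost (assumes query-grouped data; `group_train` holds the number of rows per query, and the X/y variables are placeholders):

from xgboost import XGBRanker

ranker = XGBRanker(objective="rank:ndcg", n_estimators=200, learning_rate=0.1)
# X_train: features sorted by query, y_train: graded relevance, group_train: rows per query, e.g. [10, 8, 12]
ranker.fit(X_train, y_train, group=group_train)
scores = ranker.predict(X_test)  # sort items within each query by score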


6. Recommendation Systems

Basic

Q: The Two-Tower model?

A: Two neural networks: a user tower and an item tower. Each produces an embedding; similarity = dot product. Efficient for large catalogs.

Q: The cold start problem?

A: New users/items with no interaction history. Fixes: content-based recommendations, popularity, exploration (bandits), cross-domain transfer.

Medium

Q: Collaborative Filtering vs Content-Based?

A: CF = based on the behavior of similar users/items. Content-based = based on item features. Hybrid = a combination of both.

Q: Matrix Factorization vs Two-Tower?

A: MF = linear decomposition, suffers from cold start. Two-Tower = non-linear, handles arbitrary features, scalable.

Killer

Q: Design a RecSys for 100M users and 10M items.

A: (1) Two-stage: retrieval (ANN, FAISS) + ranking (neural), (2) Real-time features via a feature store, (3) User/item embeddings refreshed in batch, (4) Cold start: content features + exploration, (5) A/B testing framework, (6) Real-time personalization via session features.


7. ML Trade-offs

Basic

Q: The accuracy vs latency trade-off?

A: More complex models = higher accuracy but slower. Mitigations: model distillation, quantization, caching.

Q: Precision vs Recall?

A: Precision = TP/(TP+FP), Recall = TP/(TP+FN). The trade-off depends on the business cost of false positives vs false negatives.

Medium

Q: Online vs batch learning trade-offs?

A: Online = a fresh model, but harder debugging and potential instability. Batch = stability, but a stale model.

Q: Interpretability vs performance?

A: Complex models (deep learning) vs interpretable ones (decision trees). Mitigation: SHAP or LIME to explain complex models.

Killer

Q: 15 trade-off scenarios — pick the right approach:

1. Medical diagnosis model: interpretability > accuracy (regulatory)
2. Real-time bidding ad system: latency > accuracy (budget constraints)
3. Fraud detection with rare events: recall > precision (missed fraud is costly)
4. Content moderation: precision > recall (false positives hurt UX)
5. Recommendations for new users: exploration > exploitation
6. Feature selection for production: simplicity > marginal gains
7. Model retraining frequency: cost vs freshness
8. Ensemble vs single model: maintenance vs accuracy
9. Custom loss vs standard loss: complexity vs business alignment
10. GPU vs CPU inference: cost vs latency
11. Real-time vs batch features: freshness vs stability
12. Deep learning vs GBM: data vs interpretability
13. Multi-task vs single-task: shared knowledge vs task conflict
14. Online A/B vs offline eval: confidence vs cost
15. Feature store vs direct queries: latency vs freshness


8. LLM Production

Basic

Q: What is prompt injection?

A: An attack via user input that changes the LLM's behavior, e.g. "Ignore previous instructions and..."

Q: How do you defend against prompt injection?

A: (1) Input sanitization, (2) System prompt separation, (3) Output validation, (4) Guardrails, (5) Rate limiting.

Medium

Q: The OWASP Top 10 for LLMs?

A: Prompt injection, Insecure output handling, Training data poisoning, Model DoS, Supply chain, Sensitive info disclosure, Insecure plugins, Excessive agency, Overreliance, Model theft.

Q: How do you organize guardrails?

A: Input guardrails (sanitization, PII detection), Output guardrails (format validation, content policy), Tool guardrails (permission checks).

Killer

Q: Design an LLM system with enterprise security.

A: (1) Input/output guardrails pipeline, (2) PII detection and redaction, (3) Audit logging, (4) Rate limiting per user, (5) Content policy enforcement, (6) Model access control, (7) Fallback models, (8) Human-in-the-loop for risky operations, (9) Red team testing schedule, (10) Incident response plan.


9. Feature Stores

Basic

Q: What is a feature store?

A: A centralized repository for ML features: (1) Storage — batch and real-time, (2) Serving — low-latency retrieval, (3) Registry — metadata, lineage, versioning, (4) Computation — transformation pipelines. Examples: Feast (OSS), Tecton (managed), Databricks Feature Store.

Q: Why do you need a feature store?

A: (1) Training-serving skew prevention — the same features at train and inference time, (2) Feature reuse — no recomputation for every model, (3) Point-in-time correctness — historical features without leakage, (4) Real-time serving — low latency for online inference.

Medium

Q: Feast vs Tecton — when to use which?

A:

| Feature | Feast (OSS) | Tecton (Managed) |
|---|---|---|
| Cost | Free | $$$ |
| Setup | Self-managed | Managed |
| Real-time | Redis integration | Built-in |
| Transformations | Limited | Rich (Spark, Pandas) |
| Monitoring | Basic | Advanced |
| Enterprise | DIY | Full support |

Feast: startups, learning, budget constraints. Tecton: enterprise, scale, team velocity > cost.

Q: What is a point-in-time join?

A: Problem: during training you must not use features from the future. A point-in-time join guarantees that each training example uses the feature values that existed at the time of the event.

# Without PIT join: LEAKAGE!
features = feature_store.get_features(user_id)  # Current features

# With PIT join: CORRECT
features = feature_store.get_features(
    entity=user_id,
    timestamp=event_timestamp  # Features as of event time
)

Q: Online vs Offline feature store?

A:

| Offline | Online |
|---|---|
| Batch computation | Real-time updates |
| S3, BigQuery, Delta | Redis, DynamoDB |
| Training | Inference |
| Low cost | Low latency |
| Historical data | Latest values only |

Architecture:
- Offline: Spark jobs → Parquet/Delta → Training
- Online: Stream → Redis → Inference (<10ms)

Killer

Q: Design a feature store for fraud detection.

A:

Requirements:
- 10M predictions/day
- <50ms latency
- Real-time features (last 5 min of transactions)
- Historical features (30-day aggregates)

Architecture:

Layer 1: Batch Pipeline (daily)
- Spark ETL → aggregated features (30-day stats)
- Store in Delta Lake + sync to Redis
- Examples: avg_transaction_amount, distinct_merchants_30d

Layer 2: Stream Pipeline (real-time)
- Kafka → Flink → Redis
- Windowed aggregations (5 min tumbling)
- Examples: tx_count_5m, velocity_score

Layer 3: Feature Registry
- Metadata: name, type, owner, freshness SLA
- Lineage: source → transformation → feature
- Monitoring: staleness alerts, distribution drift

Layer 4: Serving API
- Feature server: gRPC endpoint
- Request: (user_id, merchant_id, timestamp)
- Response: feature vector (<10ms)
- Fallback: cached features if upstream fails

Cost: ~$15K/month (Spark cluster + Redis cluster + storage)


10. Recommendation Systems (Deep Dive)

Basic

Q: Collaborative Filtering vs Content-Based?

A:
- Collaborative Filtering: recommendations based on similar users/items. "People who bought X also bought Y." Matrix: user × item interactions.
- Content-Based: recommendations based on item features. "Similar to what you watched." Features: genre, tags, description.

CF is better for discovery, CB for explainability. Hybrid = the best of both.

Q: User-based vs Item-based CF?

A:
- User-based: find similar users → recommend their items. Problems: user tastes change, and it scales poorly to millions of users.
- Item-based: find similar items → recommend them. More stable, and item similarity can be pre-computed.

In production it is usually item-based (Amazon, Netflix).

Medium

Q: Matrix Factorization for RecSys?

A: Decompose the user-item matrix R into user factors U and item factors V: \(R \approx U V^T\)

SGD update for an observed rating \(r_{ui}\):

\[e_{ui} = r_{ui} - u_u \cdot v_i\]

\[u_u \leftarrow u_u + \eta (e_{ui} v_i - \lambda u_u), \qquad v_i \leftarrow v_i + \eta (e_{ui} u_u - \lambda v_i)\]

Advantages: handles sparsity, latent factors capture preferences. Libraries: Implicit, LightFM, Surprise.
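
A toy SGD loop for the update above (a sketch over explicit rating triplets; production libraries such as Implicit or LightFM do this far more efficiently):

import numpy as np

def mf_sgd(ratings, n_users, n_items, k=32, lr=0.01, reg=0.1, epochs=10):
    """ratings: list of (user, item, value) triplets."""
    U = np.random.normal(scale=0.1, size=(n_users, k))
    V = np.random.normal(scale=0.1, size=(n_items, k))
    for _ in range(epochs):
        for u, i, r in ratings:
            e = r - U[u] @ V[i]                   # prediction error e_ui
            u_old = U[u].copy()
            U[u] += lr * (e * V[i] - reg * U[u])  # user factor update
            V[i] += lr * (e * u_old - reg * V[i]) # item factor update
    return U, V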

Q: The cold start problem — solutions?

A:
1. New user: content-based first, ask for preferences, popularity baseline
2. New item: content features, switch to exploitation after K interactions
3. Hybrid: combine CF + CB with a smooth transition
4. Bandits: explore-exploit for new items
5. Side information: demographics, context, metadata

Q: The Two-Tower architecture for RecSys?

A:

Architecture:
- User Tower: user features → MLP → user embedding
- Item Tower: item features → MLP → item embedding
- Score: dot product or cosine similarity

Training:
- Loss: cross-entropy or BPR
- Negatives: in-batch or sampled

Advantages:
- Decoupled inference (pre-compute item embeddings)
- Scalable to millions of items
- ANN search for retrieval (FAISS, ScaNN)

Production: YouTube, Pinterest, and Amazon use two-tower retrieval.

Killer

Q: Design a RecSys for e-commerce with 10M users and 1M items.

A:

Requirements: real-time personalization, 100ms latency, 10% CTR improvement.

Architecture:

Stage 1: Retrieval (candidates)
- Two-tower model → ANN search (FAISS)
- Input: user features + context
- Output: 1000 candidates
- Latency: ~20ms

Stage 2: Ranking (scoring)
- Gradient Boosted Trees (LightGBM) or a deep ranking model
- Features: user, item, cross, context
- Output: top 100 scored items
- Latency: ~50ms

Stage 3: Re-ranking (business logic)
- Diversity (MMR)
- Freshness boost
- Business rules (promotions, inventory)
- Output: final 20 items
- Latency: ~5ms

Infrastructure:
- Feature store: Redis (online) + BigQuery (offline)
- ANN index: FAISS IVF-PQ, refreshed hourly
- Model serving: TensorFlow Serving + gRPC

Training pipeline:
- Data: click logs, purchases, impressions
- Features: user history, item embeddings, context
- Labels: click/purchase (weighted)
- Frequency: daily retraining

Metrics:
- Offline: NDCG@20, Recall@100, AUC
- Online: CTR, CVR, revenue per user

Q: How do you evaluate a RecSys offline vs online?

A:

Offline metrics:
- Ranking: NDCG, MAP, MRR
- Classification: AUC, Precision@K, Recall@K
- Coverage: % of items ever recommended
- Diversity: intra-list similarity

Problem: offline ≠ online performance (top-K mismatch, position bias).

Online metrics:
- CTR, conversion rate
- Revenue per session
- Engagement time
- A/B test vs baseline

Best practice: offline screening → online A/B → ship if the business metric improves.


11. MLOps & CI/CD

Basic

Q: What is MLOps?

A: MLOps = DevOps principles applied to ML. Automation of the full lifecycle: data prep → training → deployment → monitoring. Goals: reproducibility, reliability, scalability.

Q: MLOps vs DevOps?

A: DevOps = code-centric, MLOps = data + model-centric. MLOps adds: experiment tracking, model versioning, data validation, drift detection, continuous training.

Medium

Q: Key components of an MLOps pipeline?

A: (1) Data Collection & Validation, (2) Feature Engineering, (3) Model Training & Evaluation, (4) Model Registry & Versioning, (5) Model Deployment (batch/real-time), (6) Monitoring & Alerting, (7) Retraining triggers.

Q: What is CI/CD for ML?

A: CI (Continuous Integration): validate data, test code, validate models. CD (Continuous Deployment): deploy model + infrastructure. CT (Continuous Training): auto-retrain on new data.

Killer

Q: Design an end-to-end MLOps pipeline for fraud detection.

A: (1) Feature Store: Redis (real-time) + BigQuery (batch), (2) Training: daily Airflow job + drift-triggered retraining, (3) Model Registry: MLflow with staging/production stages, (4) Deployment: Kubernetes + canary rollout, (5) Monitoring: Evidently for drift, Prometheus for latency, (6) Alerting: PagerDuty for accuracy drop > 2%, (7) Rollback: Blue-green deployment.


12. Experiment Tracking

Basic

Q: Why do you need experiment tracking?

A: (1) Reproducibility — any experiment can be repeated, (2) Comparison — compare metrics across runs, (3) Collaboration — the team sees all experiments, (4) Debugging — find out what went wrong.

Q: MLflow vs W&B (Weights & Biases)?

A: MLflow = open-source, self-hosted, basic UI. W&B = SaaS, rich visualizations, team collaboration, but paid. MLflow for on-prem/privacy, W&B for development speed.

Medium

Q: What should you log in experiment tracking?

A:
- Parameters: learning_rate, batch_size, architecture
- Metrics: loss, accuracy (train/val), per epoch
- Artifacts: model weights, checkpoints, config files
- Code: git commit, branch, diff
- Data: dataset version (DVC), feature stats
- Environment: requirements.txt, Docker image

import mlflow

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.001)
    mlflow.log_param("batch_size", 64)

    for epoch in range(epochs):
        train_loss, val_loss = train_epoch(model, train_loader, val_loader)
        mlflow.log_metric("train_loss", train_loss, step=epoch)
        mlflow.log_metric("val_loss", val_loss, step=epoch)

    mlflow.log_artifact("model.pth")
    mlflow.sklearn.log_model(model, "model")

Q: Model Registry — why and how?

A: A central repository for models with lifecycle management:
- Versioning: model v1, v2, v3...
- Stages: None → Staging → Production → Archived
- Metadata: metrics, tags, lineage
- Transition approvals for production
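
A sketch of registry usage with MLflow (the model name and run URI are placeholders; the stage-transition call is the classic MLflow registry flow):

import mlflow
from mlflow.tracking import MlflowClient

result = mlflow.register_model("runs:/<run_id>/model", "fraud_model")  # creates a new version

client = MlflowClient()
client.transition_model_version_stage(
    name="fraud_model", version=result.version, stage="Staging"
)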

Killer

Q: How do you run MLflow in production with high availability?

A: (1) Backend store: PostgreSQL (not SQLite!), (2) Artifact store: S3/MinIO with versioning, (3) Tracking server: load-balanced, (4) Authentication: basic auth or OIDC, (5) Model serving: MLflow Models + Docker/Kubernetes, (6) Backups: daily DB dump + artifact sync, (7) Monitoring: health checks on the tracking server.


13. Data Validation & Quality

Basic

Q: What is data validation in ML?

A: Checking that incoming data matches expectations: schema, ranges, completeness, uniqueness. Prevents "garbage in, garbage out".

Q: Great Expectations — what is it?

A: An open-source library for data validation. You define "expectations" (rules) and validate data batches against them. It generates docs with the validation results.

Medium

Q: Types of data quality checks?

A:
- Schema validation: correct columns, types
- Completeness: null % < threshold
- Uniqueness: no duplicate IDs
- Range checks: age 0-120, price > 0
- Distribution checks: mean/std within bounds
- Referential integrity: foreign keys exist

import great_expectations as gx

# Define expectation suite
expectation_suite = gx.ExpectationSuite("data_quality")

expectation_suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeNotNull(column="user_id")
)
expectation_suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeBetween(
        column="age", min_value=0, max_value=120
    )
)
expectation_suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeUnique(column="email")
)

# Validate against a batch (here `validator` comes from a GX data context/batch; setup omitted)
results = validator.validate(expectation_suite)

Q: How do you validate data in a production pipeline?

A: (1) Define expectations on the training data, (2) Apply them at inference time, (3) Alert on failures, (4) Quarantine bad data, (5) Trigger re-training if the distribution shifts.

Killer

Q: A data contract between the Data Engineering and ML teams?

A: A formal agreement:
- Schema: column names, types, nullability
- Freshness: SLA on data arrival (e.g., data by 6am)
- Quality: max null % = 1%, uniqueness = 100%
- Volume: expected rows per day ± 10%
- Semantics: what each column means
- Ownership: who to contact for issues
- Change process: 7-day notice for schema changes


14. Model Monitoring & Observability

Basic

Q: What should you monitor in an ML model?

A: (1) Prediction metrics: accuracy, F1 (if ground truth is available), (2) Data drift: feature distributions, (3) Prediction drift: output distributions, (4) System metrics: latency, throughput, errors, (5) Business metrics: revenue, CTR.

Q: What is training-serving skew?

A: A mismatch between the training and inference pipelines: different preprocessing, feature engineering, or library versions. Leads to silent accuracy degradation.

Medium

Q: Monitoring tools comparison?

A: | Tool | Type | Best For | |------|------|----------| | Evidently | OSS | Data drift, visual reports | | NannyML | OSS | CBPE (no ground truth) | | WhyLabs | SaaS | Enterprise observability | | Prometheus + Grafana | OSS | System metrics | | Datadog | SaaS | Full-stack observability |

Q: How do you monitor without ground truth?

A:
1. Confidence-based: monitor the prediction confidence distribution
2. CBPE (Confidence-Based Performance Estimation): the NannyML approach
3. Drift-only: alert on significant distribution shift
4. Proxy metrics: business KPIs (revenue, engagement)
5. Sampling: label a small % for delayed validation

Killer

Q: Design a monitoring system for 100K predictions/day.

A:

Architecture:
1. Log layer: async logging → Kafka → ClickHouse (predictions + features)
2. Compute layer: hourly batch jobs (drift metrics, aggregation)
3. Alert layer: Prometheus alerts → PagerDuty
4. Visualization: Grafana dashboards

Metrics tracked:
- Data drift: PSI per feature (alert > 0.25)
- Prediction distribution: KS-test vs baseline
- Confidence: mean, std, % low confidence
- Latency: P50, P99
- Volume: predictions/hour

Alerting strategy:
- Critical: accuracy drop > 5%, service down → immediate
- Warning: drift detected, high latency → Slack
- Info: daily summary → email

Cost: ~$500/month (ClickHouse + Grafana + storage)


15. Distributed Training

Basic

Q: What's the difference between Data Parallel and Model Parallel?

A:
- Data Parallel: the model is replicated on every GPU and the data is split. Simple, but each GPU stores the full model.
- Model Parallel: the model is split across layers or tensors; each GPU stores part of the model. More complex, but allows training models larger than a single GPU's memory.

Q: What is DDP (DistributedDataParallel)?

A: PyTorch's data parallelism — each GPU holds a copy of the model, and gradients are synchronized via all-reduce. Efficient for models that fit in memory.
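
A minimal DDP skeleton (a sketch; assumes the script is launched with `torchrun`, which sets LOCAL_RANK, and `MyModel` stands in for your own module):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = MyModel().cuda(local_rank)               # placeholder for your model
ddp_model = DDP(model, device_ids=[local_rank])  # gradients are all-reduced on backward()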

Medium

Q: ZeRO (Zero Redundancy Optimizer) — stages and memory savings?

A:

| Stage | What is sharded | Memory savings |
|---|---|---|
| ZeRO-1 | Optimizer states | ~4x |
| ZeRO-2 | Optimizer + gradients | ~8x |
| ZeRO-3 | Optimizer + gradients + parameters | ~N× (N = GPUs) |

Memory for a 7B model (mixed-precision Adam, ~16 bytes/param):

Standard DDP: ~112GB (14GB FP16 params + 14GB FP16 grads + 84GB FP32 optimizer states)
ZeRO-3 (8 GPUs): ~14GB per GPU

Q: FSDP vs DeepSpeed — when to use which?

A:

| Criterion | FSDP | DeepSpeed |
|---|---|---|
| Memory | Excellent (90%) | Outstanding (95%) |
| Setup | Low | High |
| Ecosystem | PyTorch native | Microsoft |
| Features | Basic sharding | Pipeline parallel, MoE |
| Best for | Most transformers | >10B params |

Use FSDP when: PyTorch-only stack, balanced simplicity/performance. Use DeepSpeed when: maximum memory efficiency, >10B params, a dedicated ML infra team.

Q: How does gradient accumulation work?

A: It simulates a larger batch without increasing memory:

accum_steps = 4
for i, (x, y) in enumerate(dataloader):
    loss = model(x, y) / accum_steps  # Scale down
    loss.backward()
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
Effective batch = actual_batch × accum_steps

Killer

Q: Design distributed training for a 70B LLM on 8×A100 80GB.

A:

Memory analysis:

70B params × 2 bytes (FP16) = 140GB params
70B params × 2 bytes = 140GB gradients
70B params × 12 bytes = 840GB optimizer (Adam: m, v, master)
Total: 1120GB → 140GB per GPU minimum

Solution stack:
1. ZeRO-3 + CPU offload: shards everything, offloads the optimizer to CPU
2. Activation checkpointing: ~50% activation memory reduction
3. Mixed precision (BF16): 2x memory savings
4. Flash Attention: O(N) vs O(N²) attention memory

Configuration:

# DeepSpeed config
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {"device": "cpu"},
    "offload_param": {"device": "cpu"},
    "overlap_comm": true
  },
  "activation_checkpointing": {
    "partition_activations": true
  }
}

Expected per-GPU memory: ~40GB (fits in 80GB A100)

Q: Pipeline Parallelism — how does it work and when do you use it?

A:

Concept: split the model into sequential stages, each on its own GPU.

GPU0: Embedding + Layers 0-5
GPU1: Layers 6-11
GPU2: Layers 12-17
GPU3: Layers 18-23 + LM Head

Micro-batching: several micro-batches flow through the pipeline, keeping all stages busy and filling the "bubbles".

When to use:
- Models too large for data parallelism alone
- Combined with ZeRO (3D parallelism)
- Very deep models (depth > 100 layers)

Trade-offs:
- Pipeline bubbles reduce efficiency
- Complex to implement
- Best combined with tensor parallelism


16. Feature Stores (Deep Dive)

Basic

Q: What is a Feature Store?

A: A centralized store for ML features that provides:
- Consistency: the same features for training and serving
- Reusability: features are reused across models
- Time travel: point-in-time correct joins
- Freshness: real-time feature serving

Q: Offline vs Online Feature Store?

A:

| Store | Use case | Latency | Storage |
|---|---|---|---|
| Offline | Training | Minutes to hours | Parquet, Delta Lake |
| Online | Real-time inference | <10ms | Redis, DynamoDB |

Features are materialized: offline → online (batch or streaming).

Medium

Q: Point-in-time join — why is it needed?

A: It prevents data leakage during training.

Problem: the feature value at serving time may differ from its value at the time of the training event.

Solution: join features by entity_id + timestamp, taking the value that existed at event time.

-- Point-in-time join
SELECT e.entity_id, e.event_time, f.feature_value
FROM events e
JOIN feature_history f
  ON e.entity_id = f.entity_id
  AND f.feature_time <= e.event_time
  AND f.feature_time > e.event_time - INTERVAL '1 day'

Q: Feast vs Tecton vs Hopsworks — how do they compare?

A:

| Criterion | Feast | Tecton | Hopsworks |
|---|---|---|---|
| Type | OSS | Managed SaaS | Hybrid |
| Real-time | Limited | Excellent | Excellent |
| Setup | Medium | Low | Medium |
| Cost | Free | $$$ | $$ |
| Best for | Startups | Enterprise | Mid-market |

Killer

Q: Design a Feature Store for 1000 features, 1M users, <10ms latency.

A:

Architecture:

Sources → Stream (Kafka) → Feature Computation → Online Store (Redis)
        Batch (Spark) → Offline Store (Delta Lake)

Key decisions:
1. Storage: Redis Cluster for online (sub-ms), Delta Lake for offline
2. Materialization: every 5 min for hot features, hourly for the rest
3. Feature groups: group by freshness requirements
4. Monitoring: feature freshness alerts, latency P99

Code pattern:

# Feature serving
def get_features(entity_id, feature_names):
    keys = [f"{name}:{entity_id}" for name in feature_names]
    values = redis.mget(keys)  # Multi-get for <10ms
    return dict(zip(feature_names, values))


17. Causal Inference & Uplift Modeling

Basic

Q: Correlation vs causation — what's the difference?

A:
- Correlation: X and Y move together, but not necessarily as cause and effect.
- Causation: X causes Y.

Example: ice cream sales correlate with drownings. The cause is heat, which increases both; ice cream does not cause drowning.

Q: What is Uplift Modeling?

A: A technique for estimating the incremental effect of a treatment on a specific user.

\[\text{Uplift} = P(Y|T=1) - P(Y|T=0)\]

where T = treatment and Y = outcome.

Medium

Q: How do you segment users by treatment response?

A:

| Segment | Treatment response | Strategy |
|---|---|---|
| Persuadables | Positive uplift | Target! |
| Sure Things | Buy anyway | Don't waste treatment |
| Lost Causes | Won't buy anyway | Don't target |
| Sleeping Dogs | Negative uplift (treatment hurts) | Avoid! |

Q: T-Learner vs S-Learner vs X-Learner?

A:

T-Learner (two-model):
- Train separate models for treatment and control
- Uplift = Model_T(x) - Model_C(x)

S-Learner (single-model):
- One model with the treatment as a feature
- Uplift = Model(x, T=1) - Model(x, T=0)

X-Learner:
- Step 1: T-Learner base models
- Step 2: compute individual treatment effects
- Step 3: propensity-weighted combination
- Better when treatment/control sizes differ

# T-Learner implementation
from sklearn.ensemble import GradientBoostingClassifier

def t_learner(X_train, y_train, treatment_train):
    # Split by treatment
    X_treat = X_train[treatment_train == 1]
    y_treat = y_train[treatment_train == 1]
    X_ctrl = X_train[treatment_train == 0]
    y_ctrl = y_train[treatment_train == 0]

    # Train separate models
    model_t = GradientBoostingClassifier().fit(X_treat, y_treat)
    model_c = GradientBoostingClassifier().fit(X_ctrl, y_ctrl)

    return lambda X: model_t.predict_proba(X)[:,1] - model_c.predict_proba(X)[:,1]

Killer

Q: How do you evaluate an uplift model without ground truth?

A:

Problem: for any single user we observe the outcome under T=1 OR T=0, never both.

Solutions:

1. AUUC (Area Under the Uplift Curve): rank users by predicted uplift and compare the cumulative treatment effect against random targeting.
2. Qini coefficient: \(Q = \sum_i (Y_{T,i} - Y_{C,i} \cdot \frac{n_T}{n_C})\)
3. Uplift-at-k: evaluate the treatment effect in the top-k by predicted uplift; requires held-out A/B test data.
4. Counterfactual estimation: causal inference methods (IPW, Doubly Robust).

Q: Propensity Score Matching — why and how?

A:

Goal: compare treatment and control groups with similar characteristics.

Method:
1. Estimate \(P(T=1|X)\) = the propensity score (logistic regression)
2. Match treated and control units with similar scores
3. Compare outcomes within matched pairs

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

# 1. Estimate propensity scores
ps_model = LogisticRegression().fit(X, treatment)
propensity_scores = ps_model.predict_proba(X)[:, 1]

# 2. Match nearest neighbors
nn = NearestNeighbors(n_neighbors=1).fit(
    propensity_scores[treatment == 0].reshape(-1, 1)
)
distances, indices = nn.kneighbors(
    propensity_scores[treatment == 1].reshape(-1, 1)
)

# 3. Compare matched pairs
ate = y[treatment == 1].mean() - y[treatment == 0][indices.flatten()].mean()

18. Multi-Armed Bandits

Sources: GeeksforGeeks A/B Testing vs MAB, Analytics Vidhya MLOps Questions

Basic

Q: What is a Multi-Armed Bandit (MAB)?

A: A reinforcement learning algorithm that balances exploration (trying new options) and exploitation (using the best known option).

Name: from the "one-armed bandit" (slot machine). K arms = K options.

Q: MAB vs A/B testing — what's the difference?

A:

| Aspect | A/B Testing | Multi-Armed Bandit |
|---|---|---|
| Allocation | Fixed (50/50) | Dynamic (adapts to winners) |
| Goal | Statistical significance | Maximize cumulative reward |
| Duration | Fixed period | Continuous |
| Regret | Wastes traffic on losers | Minimizes regret |
| Speed | Slow convergence | Fast adaptation |

Medium

Q: Epsilon-Greedy vs UCB vs Thompson Sampling?

A:

Epsilon-Greedy:
- With probability ε: explore a random arm
- With probability 1-ε: exploit the best arm
- Simple, but the exploration rate is fixed

UCB (Upper Confidence Bound):
- Select the arm with the highest upper confidence bound
- \(UCB_i = \bar{r}_i + \sqrt{\frac{2 \ln n}{n_i}}\)
- Balances exploration/exploitation automatically

Thompson Sampling:
- Bayesian: sample from the posterior distribution of each arm
- Select the arm with the highest sample
- Works well for Bernoulli and Gaussian rewards
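
A minimal Thompson Sampling loop for Bernoulli rewards (a sketch with Beta(1,1) priors; `pull(arm)` is a placeholder for observing a click/no-click):

import numpy as np

n_arms = 5
alpha, beta = np.ones(n_arms), np.ones(n_arms)   # Beta(1,1) priors

for _ in range(10_000):
    samples = np.random.beta(alpha, beta)        # one draw from each arm's posterior
    arm = int(np.argmax(samples))
    reward = pull(arm)                           # 0 or 1, returned by the environment (placeholder)
    alpha[arm] += reward
    beta[arm] += 1 - reward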

Q: When should you use MAB instead of A/B testing?

A:

Use MAB when:
- You need to maximize reward during the experiment (ads, recommendations)
- The environment is non-stationary (preferences change)
- There are many variants to test
- A short experiment duration is acceptable

Use A/B when:
- You need statistical rigor (scientific conclusions)
- Regulation requires definitive proof
- You are learning about user behavior (not just optimizing)
- Exploration could cause real negative impact

Killer

Q: Design a MAB system for ad selection with 1000+ creatives.

A:

Challenges:
- 1000+ arms → slow convergence
- New creatives added constantly (cold start)
- Non-stationary CTR (seasonality, fatigue)

Solution:

# Hierarchical approach
# 1. Contextual bandit with features (not just the arm ID)
# 2. Clustering: similar creatives share learning
# 3. Thompson Sampling with a warm start for new creatives

import numpy as np

class ContextualBandit:
    def __init__(self, n_arms, context_dim):
        # BayesianLinear is a placeholder for a per-arm Bayesian linear reward model
        self.models = [BayesianLinear(context_dim) for _ in range(n_arms)]

    def select(self, context):
        samples = [m.sample(context) for m in self.models]  # Thompson sample per arm
        return int(np.argmax(samples))

Key components:
1. Contextual features: user, placement, time
2. Thompson Sampling: natural exploration
3. Cold start: use creative metadata for initial priors
4. Non-stationarity: decay old observations
5. Fallback: guaranteed exploration for new creatives

Q: How do you evaluate a MAB algorithm offline?

A:

Problem: you can't run several bandits on the same live traffic.

Solutions:

1. Counterfactual evaluation: use logged data with propensity scores, \(V = \frac{1}{N} \sum_{i} \frac{r_i \cdot \mathbb{1}(a_i = a)}{p(a_i | x_i)}\)
2. Replay method: simulate the bandit on historical data; only count the reward when the bandit's action matches the logged action.
3. Off-policy evaluation (OPE): IPS (Inverse Propensity Scoring), Doubly Robust estimators.
# IPS Estimator
def ips_estimate(logs, policy):
    total = 0
    for log in logs:
        if policy.select(log.context) == log.action:
            total += log.reward / log.logging_prob
    return total / len(logs)

Best practice: Online A/B test final candidates after offline filtering.


19. Model Drift Detection (Advanced)

Source: Model Drift in Production (2026)

Basic

Q: Data drift vs Concept drift vs Label drift?

A:

| Type | What changes | Example |
|---|---|---|
| Data drift | P(X) | New user demographics, new devices |
| Concept drift | P(Y\|X) | Same features, but the feature-outcome relationship changes |
| Label drift | P(Y) | Class balance changes, policy changes |

Q: What is PSI (Population Stability Index)?

A: Measure of distribution shift between baseline and current.

\[PSI = \sum (Current\% - Baseline\%) \times \ln\frac{Current\%}{Baseline\%}\]

Interpretation: - PSI < 0.1: No significant shift - 0.1 ≤ PSI < 0.25: Moderate shift, investigate - PSI ≥ 0.25: Significant shift, action needed

Medium

Q: Drift metrics comparison — PSI vs KS-test vs Wasserstein?

A:

| Metric | Best for | Pros | Cons |
|---|---|---|---|
| PSI | Binned distributions | Interpretable, industry standard | Requires binning |
| KS-test | Continuous, 1D | Statistical significance, no binning | Only 1D, sensitive to sample size |
| Wasserstein | Continuous, geometric | Robust, captures shape | Less interpretable |
| JS/KL divergence | Probability distributions | Information-theoretic | Requires density estimation |

Q: How to set up drift monitoring in production?

A:

1. Baselines:
   - Training distribution
   - A healthy production window
   - Seasonal baselines (same period last year)

2. Windows:
   - Short (1h/1d): sudden shifts, pipeline bugs
   - Medium (7d): noise smoothing
   - Long (30d): slow drift

3. Slicing:
   - Country/locale
   - Device/OS
   - User segment (new/returning)

4. Alerting:
   - Warning: investigate (PSI > 0.1)
   - Critical: mitigate (PSI > 0.25, performance drop)
   - Persistence: alert only after N consecutive windows

Killer

Q: Drift detected at 3am. Your response playbook?

A:

Phase 1: Triage (15 min)
1. Check data integrity: null spikes, schema changes, pipeline failures
2. Check recent changes: deployments, upstream API changes
3. Localize: which slice(s) are affected?

Phase 2: Immediate Mitigation (1h)

if data_pipeline_broken:
    fix_pipeline()  # Highest priority
elif model_degraded:
    if new_model_recently_deployed:
        rollback()
    else:
        increase_fallback_threshold()
        route_to_human_review()

Phase 3: Investigation (4h)
- Compare failure patterns to the baseline
- Check feature-level drift
- Review label drift (if labels are available)

Phase 4: Resolution (1d)
- Targeted labeling for drifted slices
- Retrain with refreshed data
- Calibration refresh if the score distribution shifted

Phase 5: Prevention (1w)
- Add data validation gates in CI/CD
- Improve dashboards/alerting
- Document the incident in the runbook

Q: Drift in LLM/RAG systems — what's different?

A:

LLM-specific drift sources:
1. Prompt drift: system prompt changes, template updates
2. Retrieval drift: knowledge base updates, embedding model changes
3. Tool drift: API schemas change, latency changes

Monitoring signals:
- Retrieval hit rate
- Top-k similarity scores
- Citation coverage
- Rate of answers without retrieval
- Tool call success rate

Key insight: "The model" in LLM systems = weights + prompts + retrieval + tools. Version all components.


20. Online Learning (Streaming ML)

Basic

Q: Online vs batch learning — what's the difference?

A:

| Aspect | Batch | Online |
|---|---|---|
| Data | Fixed dataset | Continuous stream |
| Updates | Retrain periodically | Update after each sample |
| Memory | Store all data | Recent window only |
| Latency | Hours/days | Milliseconds |
| Use case | Stable distributions | Non-stationary data |

Q: When should you use online learning?

A: (1) Real-time bidding (ad tech), (2) Fraud detection, (3) Recommendation systems, (4) High-velocity data streams, (5) Concept drift environments.

Medium

Q: FTRL-Proximal — how does it work?

A: Follow-The-Regularized-Leader with L1 regularization. Designed for sparse, high-dimensional features (ads, recommendations).

# Per-coordinate FTRL-Proximal update (schematic)
sigma = (sqrt(n_i + grad**2) - sqrt(n_i)) / alpha
z_i += grad - sigma * w_i
n_i += grad**2
# Closed form with L1 (sparsity) and L2 regularization
w_i = 0 if abs(z_i) <= lambda1 else -(z_i - sign(z_i) * lambda1) / ((beta + sqrt(n_i)) / alpha + lambda2)

Q: How do you detect concept drift in online learning?

A:
- ADWIN: Adaptive Windowing — detects change when the window variance exceeds a threshold
- DDM: Drift Detection Method — monitors the error rate and alerts on a significant increase
- Page-Hinkley test: cumulative sum of deviations from the mean

from river import drift

detector = drift.ADWIN()
for x, y in stream:
    y_pred = model.predict_one(x)
    error = int(y_pred != y)
    detector.update(error)
    if detector.drift_detected:
        model = reset_model()  # Retrain from scratch

Killer

Q: Design an online ML pipeline for fraud detection.

A:

Architecture:

[Kafka Stream] → [Flink ML] → [Model] → [Decision Engine]
      ↓              ↓           ↓            ↓
  Transactions   Features    Prediction    Action
  (100K/sec)    (aggregates)  (fraud prob)  (block/allow)

Feature pipeline (Flink):
- 5-min tumbling windows: tx_count, tx_amount_sum
- Sliding windows: velocity_1h, velocity_24h
- Real-time aggregations: merchant_tx_count, user_distinct_merchants

Model: online logistic regression with FTRL
- Features: ~1M sparse features (user, merchant, device embeddings)
- Update: a gradient step per transaction
- Latency: <10ms including feature computation

Drift handling:
- ADWIN for performance monitoring
- Automatic model reset on significant drift
- Shadow model for A/B comparison

Fallback:
- Rule-based fallback if ML latency > 50ms
- Feature freshness monitoring


21. Multi-Stage Recommender Systems

Basic

Q: What is a multi-stage recommender?

A: A funnel architecture of several stages:
1. Retrieval: millions → thousands (coarse filtering)
2. Pre-ranking: thousands → hundreds (light model)
3. Ranking: hundreds → tens (heavy model)
4. Re-ranking: final diversity/freshness adjustments

Q: Why is a multi-stage architecture needed?

A: The accuracy vs latency trade-off. You cannot push millions of items through a heavy model within 50ms. Retrieval is cheap, ranking is expensive.

Medium

Q: The Two-Tower model for retrieval — how does it work?

A:

User Features → [User Tower] → User Embedding (64-256d)
                                  dot product
Item Features → [Item Tower] → Item Embedding (64-256d)

Training: In-batch negatives (other items in batch as negatives) Inference: Pre-computed item embeddings + ANN search (FAISS, ScaNN)

import torch.nn as nn
import torch.nn.functional as F

class TwoTower(nn.Module):
    def __init__(self, user_dim, item_dim, embed_dim):
        super().__init__()
        self.user_tower = nn.Sequential(
            nn.Linear(user_dim, 256), nn.ReLU(),
            nn.Linear(256, embed_dim)
        )
        self.item_tower = nn.Sequential(
            nn.Linear(item_dim, 256), nn.ReLU(),
            nn.Linear(256, embed_dim)
        )

    def forward(self, user_features, item_features):
        user_emb = F.normalize(self.user_tower(user_features), dim=-1)
        item_emb = F.normalize(self.item_tower(item_features), dim=-1)
        return (user_emb * item_emb).sum(dim=-1)  # Cosine similarity

Q: ANN indexes — which one and when?

A:

| Index | Build time | Query time | Recall | Use case |
|---|---|---|---|---|
| Flat | O(1) | O(n) | 100% | <100K items |
| IVF | O(n) | O(sqrt(n)) | 90-95% | 100K-10M |
| HNSW | O(n log n) | O(log n) | 95-99% | Real-time, high recall |
| IVF-PQ | O(n) | O(sqrt(n)/m) | 80-90% | Memory-constrained |
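
A sketch of an IVF index with FAISS (`xb` = item embeddings, `xq` = query embeddings, both float32 and L2-normalized for inner-product search):

import faiss

d, nlist = 128, 1024
quantizer = faiss.IndexFlatIP(d)                   # coarse quantizer over the nlist clusters
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
index.train(xb)                                    # learn the cluster centroids
index.add(xb)
index.nprobe = 16                                  # clusters scanned per query: recall vs latency knob
scores, ids = index.search(xq, 100)                # top-100 candidates per query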

Killer

Q: Design a YouTube-scale recommender (2B users, 1B videos).

A:

Stage 1: Retrieval (candidates: 1B → 100)
- Two-Tower with user watch history + video embeddings
- ANN index: HNSW on 256-dim embeddings
- Multiple retrieval sources: collaborative, content-based, trending
- Latency: ~10ms

Stage 2: Pre-ranking (100 → 20)
- Light GBDT (50 trees, depth 4)
- Features: user-video affinity, video popularity, recency
- Latency: ~5ms

Stage 3: Ranking (20 → 5)
- Deep ranking model (DCNv2 or DeepFM)
- Features: rich cross-features, user context, video quality scores
- Target: watch time prediction (weighted logistic)
- Latency: ~20ms

Stage 4: Re-ranking
- Diversity: MMR (Maximal Marginal Relevance)
- Freshness: boost new content
- Business rules: remove watched items, age restrictions
- Latency: ~2ms

Total latency: ~40ms P99


22. Vector Databases for ML

Basic

Q: What is a vector database?

A: A database for storing and searching vector embeddings, optimized for Approximate Nearest Neighbor (ANN) search.

Q: When do you need a vector DB vs a regular DB?

A:
- Vector DB: semantic search, RAG, recommendation retrieval, duplicate detection
- Regular DB: exact match, range queries, aggregations, ACID transactions

Medium

Q: HNSW vs IVF — when to use which?

A:

| HNSW | IVF |
|---|---|
| Graph-based | Cluster-based |
| Higher recall, more memory | Lower memory, tunable recall |
| Better for real-time updates | Better for batch rebuilds |
| O(log n) query | O(sqrt(n)) query |
| Complex params (M, ef) | Simpler params (nlist, nprobe) |

Q: What is hybrid search?

A: A combination of vector search and keyword search (BM25), merged with RRF (Reciprocal Rank Fusion):

def reciprocal_rank_fusion(vector_results, keyword_results, k=60):
    scores = {}
    for rank, doc in enumerate(vector_results):
        scores[doc.id] = scores.get(doc.id, 0) + 1 / (k + rank)
    for rank, doc in enumerate(keyword_results):
        scores[doc.id] = scores.get(doc.id, 0) + 1 / (k + rank)
    return sorted(scores.items(), key=lambda x: -x[1])

Killer

Q: Choosing a vector DB for production — what are the criteria?

A:

| DB | Strengths | Weaknesses | Use case |
|---|---|---|---|
| Pinecone | Managed, scalable | Expensive, vendor lock-in | Enterprise, no ops |
| Milvus | Open-source, feature-rich | Complex setup | Large-scale, self-hosted |
| Weaviate | GraphQL, modules | Younger ecosystem | RAG, multimodal |
| Qdrant | Rust, filtering | Smaller community | Performance-critical |
| pgvector | Postgres extension | Limited scale | Existing Postgres infra |
| Chroma | Simple, embedded | Not for scale | Prototyping, small apps |

23. Cost Optimization for ML Inference

Basic

Q: What are the main cost drivers of ML inference?

A:
- GPU compute: 60-70%
- Memory: 15-20%
- Network: 5-10%
- Storage: 5-10%

Q: How do you reduce the cost per prediction?

A: (1) Model quantization, (2) Batching, (3) GPU sharing, (4) Spot instances, (5) Model right-sizing, (6) Caching.

Medium

Q: Spot instances for ML — what's the strategy?

A:
- Use them for batch inference and training
- NOT for latency-critical online inference
- Preemption detection: cloud metadata API
- Graceful shutdown: checkpoint every N batches
- Fallback: keep an on-demand pool ready

# Spot instance preemption handling
import requests

def check_preemption():
    try:
        # GCP metadata
        resp = requests.get('http://metadata.google.internal/computeMetadata/v1/instance/preempted',
                          headers={'Metadata-Flavor': 'Google'})
        return resp.text == 'TRUE'
    except requests.RequestException:
        return False

# In inference loop
for batch in data:
    if check_preemption():
        save_checkpoint(model, batch_position)
        notify_fallback_pool()
        break
    predictions = model(batch)

Q: Semantic caching for LLMs — how does it work?

A:
1. Embed the query with a sentence transformer
2. Search for similar queries in the cache (cosine similarity > 0.95)
3. If found: return the cached response
4. If not: call the LLM and cache the response together with its embedding

Savings: 20-40% of LLM calls for customer support, FAQ use cases.
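
A minimal semantic cache sketch (assumes `sentence-transformers` for embeddings and an in-memory list as the cache; `call_llm` is a placeholder):

import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")
cache = []  # list of (embedding, response) pairs

def cached_answer(query, threshold=0.95):
    q = encoder.encode(query, normalize_embeddings=True)
    for emb, response in cache:
        if float(np.dot(q, emb)) > threshold:    # cosine similarity of normalized vectors
            return response                       # cache hit: skip the LLM call
    response = call_llm(query)                    # placeholder for the real LLM call
    cache.append((q, response))
    return response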

Killer

Q: A cost optimization strategy for an inference platform (100 models, 1B predictions/day)?

A:

Tier 1: Model optimization (40% savings)
- Quantize all models to INT8 (2-4x throughput)
- Distill ensemble models where possible
- Prune unused features/neurons

Tier 2: Infrastructure (30% savings)
- Spot instances for 70% of batch traffic
- GPU sharing with MIG (Multi-Instance GPU)
- Right-size: A10G for small models, H100 for large

Tier 3: Traffic optimization (20% savings)
- Semantic caching for LLM endpoints (30% hit rate)
- Request batching with max_wait=10ms
- Model routing: simple queries → small models

Tier 4: Monitoring & governance (10% savings)
- Cost-per-prediction dashboards
- Budget alerts per team
- Unused model deprecation policy


24. Multi-Model Serving and Model Routing

Basic

Q: Why multi-model serving?

A: (1) Different tasks (classification, NER, QA), (2) Cost optimization (route to cheaper models), (3) Redundancy, (4) A/B testing, (5) Graceful degradation.

Q: Routing strategies — what kinds exist?

A:
- Weighted round-robin
- Latency-based
- Cost-aware
- Confidence-based (cascade)
- Content-based (route by input features)

Medium

Q: Cascade routing — how does it work?

A:
1. Try the small/fast model first
2. If confidence > threshold: return the prediction
3. If confidence < threshold: route to the larger model
4. Optionally: a third tier for edge cases

class CascadeRouter:
    def __init__(self, small_model, large_model, threshold=0.8):
        self.small = small_model
        self.large = large_model
        self.threshold = threshold

    def predict(self, x):
        pred, conf = self.small.predict_with_confidence(x)
        if conf > self.threshold:
            return pred
        return self.large.predict(x)

Q: The Circuit Breaker pattern for model fallback?

A:

States: CLOSED → OPEN → HALF_OPEN → CLOSED

CLOSED: Normal operation, track failures
OPEN: All requests go to fallback, wait for timeout
HALF_OPEN: Test with single request, decide state
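
A compact circuit-breaker sketch around a model call (the thresholds and timeout are illustrative):

import time

class CircuitBreaker:
    def __init__(self, max_failures=5, reset_timeout=30):
        self.max_failures, self.reset_timeout = max_failures, reset_timeout
        self.failures, self.opened_at = 0, None       # CLOSED state

    def call(self, primary, fallback, *args):
        if self.opened_at and time.time() - self.opened_at < self.reset_timeout:
            return fallback(*args)                    # OPEN: everything goes to the fallback
        try:                                          # CLOSED / HALF_OPEN: try the primary model
            result = primary(*args)
            self.failures, self.opened_at = 0, None   # success closes the breaker
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()          # trip to OPEN
            return fallback(*args)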

Killer

Q: Design a model router for an LLM API (GPT-4, Claude, Gemini, local Llama).

A:

Routing Decision Matrix:

| Query type | Route to | Why |
|---|---|---|
| Code generation | Claude/GPT-4 | Best code quality |
| Simple Q&A | Llama 70B | 100x cheaper |
| Long context (>32K) | Claude 200K | Context window |
| Real-time chat | Llama 70B | Lowest latency |
| Complex reasoning | GPT-4 o1 | Chain-of-thought |
| Image input | GPT-4V/Claude | Multimodal |

Implementation:

class LLMRouter:
    def route(self, query, context):
        # Content-based routing
        if len(context) > 32000:
            return "claude-200k"
        if "code" in query or "implement" in query:
            return self.circuit_breaker.call("gpt-4", fallback="llama-70b")
        if self.is_simple_query(query):
            return "llama-70b"
        if self.needs_reasoning(query):
            return "o1"
        return self.cost_aware_select(query)  # Balance cost/quality

Circuit Breaker Integration: - Track per-model error rates - Fallback chain: primary → backup → local model - Automatic recovery after 30s cooldown


25. AI Agents in Production

Basic

Q: What is an AI agent?

A: An autonomous system that: (1) perceives its environment, (2) makes decisions, (3) executes actions via tools, (4) has memory/goals.

Q: Agent vs a plain LLM chat?

A:
- LLM chat: a single response, no tools, no memory
- Agent: multi-step reasoning, tool use, persistent memory, goal-directed behavior

Medium

Q: The ReAct pattern — how does it work?

A: Reasoning + Acting loop:

Thought: What should I do next?
Action: [tool_name, tool_args]
Observation: [tool output]
Thought: Based on observation...
Action: [next action or Final Answer]
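
A schematic ReAct loop (a sketch; `llm`, `tools`, and `parse_action` are placeholders for your model call, tool implementations, and Action-line parser):

def react_agent(question, llm, tools, max_steps=5):
    history = f"Question: {question}"
    for _ in range(max_steps):
        step = llm(history)                        # "Thought: ... Action: tool[args]" or a final answer
        if "Final Answer:" in step:
            return step.split("Final Answer:")[-1].strip()
        tool_name, tool_args = parse_action(step)  # extract the Action line (placeholder parser)
        observation = tools[tool_name](tool_args)  # run the tool
        history += f"\n{step}\nObservation: {observation}"
    return "Stopped: step limit reached"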

Q: Human-in-the-Loop (HITL) patterns?

A: 1. Approval gates: Critical actions require human approval 2. Review queues: Batch agent outputs for human review 3. Escalation: Agent requests help when uncertain 4. Correction feedback: Human corrections improve agent

from langgraph.types import interrupt

def agent_with_hitl(state):
    result = agent_step(state)
    if is_critical_action(result):
        human_response = interrupt("Approve action?")
        if not human_response.approved:
            result = revise_plan(result)
    return result

Killer

Q: Defence-in-depth for AI agents — what's the architecture?

A:

Layer 1: Input sanitization
- PII detection and redaction
- Prompt injection detection
- Length/rate limits

Layer 2: Agent execution
- Sandbox environment
- Resource limits (time, tokens, API calls)
- State isolation

Layer 3: Tool gatekeeping
- Allowlist of approved tools
- Permission levels per tool
- Schema validation on inputs

Layer 4: Output validation
- Content policy checks
- Format validation
- Sensitive data filter

Layer 5: Observability
- Full execution traces
- Decision audit log
- Anomaly detection


26. Security for ML

Basic

Q: What are the main types of attacks on ML models?

A:
- Evasion: adversarial inputs at inference time (FGSM, PGD)
- Poisoning: malicious training data
- Extraction: stealing the model via queries
- Inversion: reconstructing training data
- Membership inference: determining whether a sample was in the training set

Q: What is an adversarial example?

A: Input with imperceptible perturbation that causes misclassification. Example: image + noise → wrong class with high confidence.

Medium

Q: The FGSM attack — what's the formula?

A: Fast Gradient Sign Method: \(x_{adv} = x + \epsilon \cdot \text{sign}(\nabla_x J(\theta, x, y))\)

where \(x\) = original input, \(\epsilon\) = perturbation magnitude (e.g., 0.01), \(J\) = loss function, \(\nabla_x\) = gradient w.r.t. the input.

import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon):
    x_adv = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    return (x_adv + epsilon * x_adv.grad.sign()).detach()

Q: How do you defend against model extraction?

A:
- Rate limiting per API key
- Output perturbation (add noise, round predictions)
- Watermarking model outputs
- Query pattern detection

Killer

Q: A security architecture for a production ML API?

A:

Defense layers:

1. Input layer:
   - Schema validation
   - Anomaly detection on inputs
   - Rate limiting (100 req/min/user)

2. Model layer:
   - Adversarial training (PGD)
   - Input preprocessing (randomization)
   - Confidence thresholding

3. Output layer:
   - Prediction rounding (2-3 decimals)
   - Calibrated noise
   - Watermark embedding

4. Monitoring layer:
   - Query distribution drift
   - Suspicious user patterns
   - Model extraction detection

5. Access layer:
   - Authentication required
   - API key rotation
   - IP allowlisting for enterprise

Common Misconceptions

Misconception: the main thing in an MLSD interview is picking the right model

Model choice is 10-15% of the score. Interviewers evaluate: (1) the right clarifying questions (scope, scale, latency), (2) the system architecture (data flow, components), (3) feature engineering (what and why), (4) the trade-off discussion (precision vs recall, latency vs accuracy), (5) monitoring and feedback loops. A candidate who immediately says "BERT" without discussing requirements is a red flag.

Misconception: you need to memorize exact architectures (YouTube, Instagram)

Memorizing specific architectures is pointless -- the interviewer will change the constraints. You need to understand the PRINCIPLES: the multi-stage funnel (retrieval -> ranking -> re-ranking), cascade routing (fast model -> heavy model), confidence-based human-in-the-loop, feedback loops. With these principles you can design any system.

Misconception: if you don't know the answer, you should make something up

An honest "I'm not sure, but here is my reasoning..." scores higher than a confident wrong answer. An MLSD interview tests thinking, not memory. Approach: (1) state what you know, (2) reason from first principles, (3) propose how you would investigate it. That demonstrates engineering thinking.

Questions with Graded Answers

How would you approach an MLSD question you have never solved before?

❌ "I'd start by picking a model and describing the training pipeline" -- skips requirements gathering

✅ "A standard framework: (1) Clarifying questions: scope, scale, latency SLA, data availability -- 5 min. (2) High-level architecture: data flow, main components -- 10 min. (3) Deep dive: features, model choice with justification, training pipeline -- 15 min. (4) Trade-offs and operations: monitoring, A/B testing, failure modes -- 10 min. (5) Extensions: scaling, edge cases. This framework works for ANY MLSD problem because it focuses on system design rather than a specific model."

Precision 95% vs Recall 95% -- which do you pick for fraud detection?

❌ "Precision, so we don't block legitimate transactions" -- ignores the asymmetric cost

✅ "Recall > precision for fraud detection: a missed fraud ($1,000-100K loss) costs 10-100x more than a false positive (a 30-second transaction delay for verification). But it's not a binary choice -- use a tiered approach: (1) high recall (99%+) for flagging, (2) human review for flagged transactions, (3) auto-block only at very high confidence (>99.5%). Business metric: $ saved from fraud / $ lost to false blocks."


See Also