
ML System Design: Interview Preparation


Prerequisites: MLSD Materials | MLSD Updates | Case Studies

ML System Design is not about algorithms but about engineering trade-offs: latency vs throughput, freshness vs cost, accuracy vs interpretability. This file contains 100+ questions across 26 key topics, from model serving to LLM production. Each question is labeled by level: Basic (Junior), Medium (Middle), Killer (Senior+). Format: Q/A with concrete numbers and formulas, not abstract hand-waving.

Interview questions for 26 ML System Design topics. Levels: Basic, Medium, Killer. Updated: 2026-02-11


1. Model Serving & Latency

Basic

Q: What are latency P50, P99, P99.9?

A: P50 = the median (50% of requests are faster), P99 = the 99th percentile (what matters for SLAs), P99.9 = the worst-case outliers.
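
A quick way to compute these from raw measurements (a minimal sketch with NumPy; `latencies_ms` stands in for your own data):

import numpy as np

latencies_ms = np.array([12, 15, 14, 18, 22, 35, 80, 16, 17, 250])  # per-request latencies
p50, p99, p999 = np.percentile(latencies_ms, [50, 99, 99.9])
print(f"P50={p50:.1f}ms  P99={p99:.1f}ms  P99.9={p999:.1f}ms")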

Q: Why dynamic batching?

A: It groups incoming requests together for efficient GPU utilization and reduces per-request overhead.

Medium

Q: How would you cut inference latency in half?

A: (1) Quantization (FP16/INT8), (2) Batching, (3) Model distillation, (4) ONNX optimization, (5) Caching, (6) Async processing.

Q: Online vs batch inference — what's the difference?

A: Online = real-time (<100ms), for user-facing paths. Batch = deferred, for analytics/offline scoring. Batch is cheaper; online requires low-latency optimization.

Killer

Q: Design an inference system for 100K RPS with P99 < 50ms.

A: (1) Model quantization to INT8, (2) Request batching with max_wait=5ms, (3) GPU inference with TensorRT, (4) Load balancer + auto-scaling, (5) Request coalescing, (6) Regional endpoints, (7) Cache for popular queries.


2. A/B Testing

Basic

Q: Why do we need the minimum detectable effect (MDE)?

A: It defines the smallest change we want to be able to detect. It drives sample size: smaller MDE → larger sample.

Q: What is a p-value?

A: The probability of observing results as extreme or more extreme than ours under H0 (no difference). p < 0.05 = statistically significant.

Medium

Q: Sample size formula for an A/B test?

A: n = 16 * sigma^2 / delta^2 per variant for 95% confidence and 80% power, where sigma^2 = p(1-p) for a proportion metric and delta = MDE.
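
A worked example of this rule of thumb (a sketch; the baseline conversion and MDE are illustrative numbers):

p = 0.10          # baseline conversion rate
mde = 0.01        # absolute MDE: detect 10% -> 11%
sigma2 = p * (1 - p)
n_per_variant = 16 * sigma2 / mde**2
print(int(n_per_variant))  # ~14,400 users per variant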

Q: What is the multiple comparisons problem?

A: More tests → a higher chance of a false positive. Fixes: Bonferroni correction (alpha/m), False Discovery Rate control.

Killer

Q: How do you run an A/B test for an ML model with network effects?

A: Network effects = user A's behavior influences user B. Options: (1) Cluster-based randomization, (2) Geo-based split, (3) Time-based (switchback) A/B, (4) Counterfactual evaluation.


3. Drift Detection

Basic

Q: Types of drift?

A: Data drift (the distribution of X changes), concept drift (P(Y|X) changes), label drift (the distribution of Y changes).

Q: What is PSI?

A: Population Stability Index = a measure of distribution shift. PSI < 0.1 = OK, 0.1-0.25 = moderate, > 0.25 = significant.
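
A minimal PSI implementation over fixed bins (a sketch; in practice the bin edges are frozen on the baseline window):

import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline sample and a current sample."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    e_pct, a_pct = np.clip(e_pct, 1e-6, None), np.clip(a_pct, 1e-6, None)  # avoid log(0)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))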

Medium

Q: PSI vs KS-test vs Wasserstein?

A: PSI = for binned distributions, easy to interpret. KS-test = for continuous features, gives statistical significance. Wasserstein = robust, has a geometric interpretation.

Q: How do you set up drift alerting?

A: (1) Define baseline window, (2) Calculate metrics hourly/daily, (3) Set thresholds (PSI > 0.25), (4) Multi-feature monitoring, (5) Business metric correlation.

Killer

Q: Drift detected. What do you do?

A: (1) Investigate the root cause (data pipeline, feature, business change), (2) Check whether it is label drift or feature drift, (3) Evaluate the model on new data, (4) Retrain if the accuracy drop exceeds a threshold, (5) Consider incremental learning, (6) Update monitoring thresholds.


4. Model Calibration

Basic

Q: What is a calibrated model?

A: Predicted probability = empirical frequency. If the model predicts 0.7, the event occurs in 70% of such cases.

Q: When is calibration needed?

A: Whenever you need accurate probabilities: medical diagnosis, risk scoring, cost-sensitive decisions.

Medium

Q: Platt Scaling vs Isotonic Regression?

A: Platt = parametric (logistic), good for sigmoid-shaped miscalibration, only 2 parameters. Isotonic = non-parametric, more flexible, but needs more data and risks overfitting.

Q: How do you evaluate calibration?

A: (1) Calibration curve (reliability diagram), (2) Brier score (lower = better), (3) Expected Calibration Error (ECE).
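
A sketch of these checks with scikit-learn (`y_true` and `y_prob` are assumed to come from a held-out set):

from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=10)  # reliability diagram points
brier = brier_score_loss(y_true, y_prob)                            # lower is better
ece = abs(frac_pos - mean_pred).mean()                              # crude, unweighted ECE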

Killer

Q: The model is well calibrated but accuracy is low. What's wrong?

A: Calibration != discrimination. A model can be calibrated yet useless (it only predicts the baseline probabilities). Check both: calibration and ROC-AUC/accuracy.


5. Ranking Metrics

Basic

Q: What is NDCG?

A: Normalized Discounted Cumulative Gain. It accounts for both position and relevance. DCG sums relevance / log2(position + 1); NDCG = DCG / ideal DCG.
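
A minimal NDCG@k computation from graded relevance (a sketch; `rels` holds the relevance of items in ranked order):

import numpy as np

def dcg(rels, k):
    rels = np.asarray(rels, dtype=float)[:k]
    return float(np.sum(rels / np.log2(np.arange(2, len(rels) + 2))))

def ndcg(rels, k):
    ideal = dcg(sorted(rels, reverse=True), k)
    return dcg(rels, k) / ideal if ideal > 0 else 0.0

print(ndcg([3, 2, 0, 1], k=4))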

Q: Precision@k vs Recall@k?

A: P@k = relevant items in the top-k / k. R@k = relevant items in the top-k / total number of relevant items.

Medium

Q: MRR vs MAP?

A: MRR = Mean Reciprocal Rank, only the first relevant item counts. MAP = Mean Average Precision, accounts for all relevant positions.

Q: When do you use NDCG vs MAP?

A: NDCG = when relevance is graded (0,1,2,3...). MAP = when relevance is binary (0 or 1).

Killer

Q: How do you optimize ranking metrics during training?

A: (1) Listwise losses (LambdaLoss), (2) Pairwise losses (RankNet), (3) Approximate NDCG losses, (4) Learning-to-Rank frameworks (XGBoost ranker, TF-Ranking).
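
A sketch of the Learning-to-Rank route with XGBoost (assumes query-grouped data; `group_train` holds the number of rows per query, and the X/y variables are placeholders):

from xgboost import XGBRanker

ranker = XGBRanker(objective="rank:ndcg", n_estimators=200, learning_rate=0.1)
# X_train: features sorted by query, y_train: graded relevance, group_train: rows per query, e.g. [10, 8, 12]
ranker.fit(X_train, y_train, group=group_train)
scores = ranker.predict(X_test)  # sort items within each query by score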


6. Recommendation Systems

Basic

Q: The Two-Tower model?

A: Two neural networks: a user tower and an item tower. Each produces an embedding; similarity = dot product. Efficient for large catalogs.

Q: The cold start problem?

A: New users/items with no interaction history. Fixes: content-based recommendations, popularity, exploration (bandits), cross-domain transfer.

Medium

Q: Collaborative Filtering vs Content-Based?

A: CF = based on the behavior of similar users/items. Content-based = based on item features. Hybrid = a combination of both.

Q: Matrix Factorization vs Two-Tower?

A: MF = linear decomposition, suffers from cold start. Two-Tower = non-linear, handles arbitrary features, scalable.

Killer

Q: Design a RecSys for 100M users and 10M items.

A: (1) Two-stage: retrieval (ANN, FAISS) + ranking (neural), (2) Real-time features via a feature store, (3) User/item embeddings refreshed in batch, (4) Cold start: content features + exploration, (5) A/B testing framework, (6) Real-time personalization via session features.


7. ML Trade-offs

Basic

Q: The accuracy vs latency trade-off?

A: More complex models = higher accuracy but slower. Mitigations: model distillation, quantization, caching.

Q: Precision vs Recall?

A: Precision = TP/(TP+FP), Recall = TP/(TP+FN). The trade-off depends on the business cost of false positives vs false negatives.

Medium

Q: Online vs batch learning trade-offs?

A: Online = a fresh model, but harder debugging and potential instability. Batch = stability, but a stale model.

Q: Interpretability vs performance?

A: Complex models (deep learning) vs interpretable ones (decision trees). Mitigation: SHAP or LIME to explain complex models.

Killer

Q: 15 trade-off scenarios — pick the right approach:

1. Medical diagnosis model: interpretability > accuracy (regulatory)
2. Real-time bidding ad system: latency > accuracy (budget constraints)
3. Fraud detection with rare events: recall > precision (missed fraud is costly)
4. Content moderation: precision > recall (false positives hurt UX)
5. Recommendations for new users: exploration > exploitation
6. Feature selection for production: simplicity > marginal gains
7. Model retraining frequency: cost vs freshness
8. Ensemble vs single model: maintenance vs accuracy
9. Custom loss vs standard loss: complexity vs business alignment
10. GPU vs CPU inference: cost vs latency
11. Real-time vs batch features: freshness vs stability
12. Deep learning vs GBM: data vs interpretability
13. Multi-task vs single-task: shared knowledge vs task conflict
14. Online A/B vs offline eval: confidence vs cost
15. Feature store vs direct queries: latency vs freshness


8. LLM Production

Basic

Q: What is prompt injection?

A: An attack via user input that changes the LLM's behavior, e.g. "Ignore previous instructions and..."

Q: How do you defend against prompt injection?

A: (1) Input sanitization, (2) System prompt separation, (3) Output validation, (4) Guardrails, (5) Rate limiting.

Medium

Q: The OWASP Top 10 for LLMs?

A: Prompt injection, Insecure output handling, Training data poisoning, Model DoS, Supply chain, Sensitive info disclosure, Insecure plugins, Excessive agency, Overreliance, Model theft.

Q: How do you organize guardrails?

A: Input guardrails (sanitization, PII detection), Output guardrails (format validation, content policy), Tool guardrails (permission checks).

Killer

Q: Design an LLM system with enterprise security.

A: (1) Input/output guardrails pipeline, (2) PII detection and redaction, (3) Audit logging, (4) Rate limiting per user, (5) Content policy enforcement, (6) Model access control, (7) Fallback models, (8) Human-in-the-loop for risky operations, (9) Red team testing schedule, (10) Incident response plan.


9. Feature Stores

Basic

Q: What is a feature store?

A: A centralized repository for ML features: (1) Storage — batch and real-time, (2) Serving — low-latency retrieval, (3) Registry — metadata, lineage, versioning, (4) Computation — transformation pipelines. Examples: Feast (OSS), Tecton (managed), Databricks Feature Store.

Q: Why do you need a feature store?

A: (1) Training-serving skew prevention — the same features at train and inference time, (2) Feature reuse — no recomputation for every model, (3) Point-in-time correctness — historical features without leakage, (4) Real-time serving — low latency for online inference.

Medium

Q: Feast vs Tecton — when to use which?

A:

| Feature | Feast (OSS) | Tecton (Managed) |
|---|---|---|
| Cost | Free | $$$ |
| Setup | Self-managed | Managed |
| Real-time | Redis integration | Built-in |
| Transformations | Limited | Rich (Spark, Pandas) |
| Monitoring | Basic | Advanced |
| Enterprise | DIY | Full support |

Feast: startups, learning, budget constraints. Tecton: enterprise, scale, team velocity > cost.

Q: What is a point-in-time join?

A: Problem: during training you must not use features from the future. A point-in-time join guarantees that each training example uses the feature values that existed at the time of the event.

# Without PIT join: LEAKAGE!
features = feature_store.get_features(user_id)  # Current features

# With PIT join: CORRECT
features = feature_store.get_features(
    entity=user_id,
    timestamp=event_timestamp  # Features as of event time
)

Q: Online vs Offline feature store?

A:

| Offline | Online |
|---|---|
| Batch computation | Real-time updates |
| S3, BigQuery, Delta | Redis, DynamoDB |
| Training | Inference |
| Low cost | Low latency |
| Historical data | Latest values only |

Architecture:
- Offline: Spark jobs → Parquet/Delta → Training
- Online: Stream → Redis → Inference (<10ms)

Killer

Q: Design a feature store for fraud detection.

A:

Requirements:
- 10M predictions/day
- <50ms latency
- Real-time features (last 5 min of transactions)
- Historical features (30-day aggregates)

Architecture:

Layer 1: Batch Pipeline (daily)
- Spark ETL → aggregated features (30-day stats)
- Store in Delta Lake + sync to Redis
- Examples: avg_transaction_amount, distinct_merchants_30d

Layer 2: Stream Pipeline (real-time)
- Kafka → Flink → Redis
- Windowed aggregations (5 min tumbling)
- Examples: tx_count_5m, velocity_score

Layer 3: Feature Registry
- Metadata: name, type, owner, freshness SLA
- Lineage: source → transformation → feature
- Monitoring: staleness alerts, distribution drift

Layer 4: Serving API
- Feature server: gRPC endpoint
- Request: (user_id, merchant_id, timestamp)
- Response: feature vector (<10ms)
- Fallback: cached features if upstream fails

Cost: ~$15K/month (Spark cluster + Redis cluster + storage)


10. Recommendation Systems (Deep Dive)

Basic

Q: Collaborative Filtering vs Content-Based?

A:
- Collaborative Filtering: recommendations based on similar users/items. "People who bought X also bought Y." Matrix: user × item interactions.
- Content-Based: recommendations based on item features. "Similar to what you watched." Features: genre, tags, description.

CF is better for discovery, CB for explainability. Hybrid = the best of both.

Q: User-based vs Item-based CF?

A:
- User-based: find similar users → recommend their items. Problems: user tastes change, and it scales poorly to millions of users.
- Item-based: find similar items → recommend them. More stable, and item similarity can be pre-computed.

In production it is usually item-based (Amazon, Netflix).

Medium

Q: Matrix Factorization for RecSys?

A: Decompose the user-item matrix R into user factors U and item factors V: \(R \approx U V^T\)

SGD update for an observed rating \(r_{ui}\):

\[e_{ui} = r_{ui} - u_u \cdot v_i\]

\[u_u \leftarrow u_u + \eta (e_{ui} v_i - \lambda u_u), \qquad v_i \leftarrow v_i + \eta (e_{ui} u_u - \lambda v_i)\]

Advantages: handles sparsity, latent factors capture preferences. Libraries: Implicit, LightFM, Surprise.
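
A toy SGD loop for the update above (a sketch over explicit rating triplets; production libraries such as Implicit or LightFM do this far more efficiently):

import numpy as np

def mf_sgd(ratings, n_users, n_items, k=32, lr=0.01, reg=0.1, epochs=10):
    """ratings: list of (user, item, value) triplets."""
    U = np.random.normal(scale=0.1, size=(n_users, k))
    V = np.random.normal(scale=0.1, size=(n_items, k))
    for _ in range(epochs):
        for u, i, r in ratings:
            e = r - U[u] @ V[i]                   # prediction error e_ui
            u_old = U[u].copy()
            U[u] += lr * (e * V[i] - reg * U[u])  # user factor update
            V[i] += lr * (e * u_old - reg * V[i]) # item factor update
    return U, V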

Q: The cold start problem — solutions?

A:
1. New user: content-based first, ask for preferences, popularity baseline
2. New item: content features, switch to exploitation after K interactions
3. Hybrid: combine CF + CB with a smooth transition
4. Bandits: explore-exploit for new items
5. Side information: demographics, context, metadata

Q: The Two-Tower architecture for RecSys?

A:

Architecture:
- User Tower: user features → MLP → user embedding
- Item Tower: item features → MLP → item embedding
- Score: dot product or cosine similarity

Training:
- Loss: cross-entropy or BPR
- Negatives: in-batch or sampled

Advantages:
- Decoupled inference (pre-compute item embeddings)
- Scalable to millions of items
- ANN search for retrieval (FAISS, ScaNN)

Production: YouTube, Pinterest, and Amazon use two-tower retrieval.

Killer

Q: Design a RecSys for e-commerce with 10M users and 1M items.

A:

Requirements: real-time personalization, 100ms latency, 10% CTR improvement.

Architecture:

Stage 1: Retrieval (candidates)
- Two-tower model → ANN search (FAISS)
- Input: user features + context
- Output: 1000 candidates
- Latency: ~20ms

Stage 2: Ranking (scoring)
- Gradient Boosted Trees (LightGBM) or a deep ranking model
- Features: user, item, cross, context
- Output: top 100 scored items
- Latency: ~50ms

Stage 3: Re-ranking (business logic)
- Diversity (MMR)
- Freshness boost
- Business rules (promotions, inventory)
- Output: final 20 items
- Latency: ~5ms

Infrastructure:
- Feature store: Redis (online) + BigQuery (offline)
- ANN index: FAISS IVF-PQ, refreshed hourly
- Model serving: TensorFlow Serving + gRPC

Training pipeline:
- Data: click logs, purchases, impressions
- Features: user history, item embeddings, context
- Labels: click/purchase (weighted)
- Frequency: daily retraining

Metrics:
- Offline: NDCG@20, Recall@100, AUC
- Online: CTR, CVR, revenue per user

Q: How do you evaluate a RecSys offline vs online?

A:

Offline metrics:
- Ranking: NDCG, MAP, MRR
- Classification: AUC, Precision@K, Recall@K
- Coverage: % of items ever recommended
- Diversity: intra-list similarity

Problem: offline ≠ online performance (top-K mismatch, position bias).

Online metrics:
- CTR, conversion rate
- Revenue per session
- Engagement time
- A/B test vs baseline

Best practice: offline screening → online A/B → ship if the business metric improves.


11. MLOps & CI/CD

Basic

Q: What is MLOps?

A: MLOps = DevOps principles applied to ML. Automation of the full lifecycle: data prep → training → deployment → monitoring. Goals: reproducibility, reliability, scalability.

Q: MLOps vs DevOps?

A: DevOps = code-centric, MLOps = data + model-centric. MLOps adds: experiment tracking, model versioning, data validation, drift detection, continuous training.

Medium

Q: Key components of an MLOps pipeline?

A: (1) Data Collection & Validation, (2) Feature Engineering, (3) Model Training & Evaluation, (4) Model Registry & Versioning, (5) Model Deployment (batch/real-time), (6) Monitoring & Alerting, (7) Retraining triggers.

Q: What is CI/CD for ML?

A: CI (Continuous Integration): validate data, test code, validate models. CD (Continuous Deployment): deploy model + infrastructure. CT (Continuous Training): auto-retrain on new data.

Killer

Q: Design an end-to-end MLOps pipeline for fraud detection.

A: (1) Feature Store: Redis (real-time) + BigQuery (batch), (2) Training: daily Airflow job + drift-triggered retraining, (3) Model Registry: MLflow with staging/production stages, (4) Deployment: Kubernetes + canary rollout, (5) Monitoring: Evidently for drift, Prometheus for latency, (6) Alerting: PagerDuty for accuracy drop > 2%, (7) Rollback: Blue-green deployment.


12. Experiment Tracking

Basic

Q: Why do you need experiment tracking?

A: (1) Reproducibility — any experiment can be repeated, (2) Comparison — compare metrics across runs, (3) Collaboration — the team sees all experiments, (4) Debugging — find out what went wrong.

Q: MLflow vs W&B (Weights & Biases)?

A: MLflow = open-source, self-hosted, basic UI. W&B = SaaS, rich visualizations, team collaboration, but paid. MLflow for on-prem/privacy, W&B for development speed.

Medium

Q: What should you log in experiment tracking?

A:
- Parameters: learning_rate, batch_size, architecture
- Metrics: loss, accuracy (train/val), per epoch
- Artifacts: model weights, checkpoints, config files
- Code: git commit, branch, diff
- Data: dataset version (DVC), feature stats
- Environment: requirements.txt, Docker image

import mlflow

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.001)
    mlflow.log_param("batch_size", 64)

    for epoch in range(epochs):
        train_loss, val_loss = train_epoch(model, train_loader, val_loader)
        mlflow.log_metric("train_loss", train_loss, step=epoch)
        mlflow.log_metric("val_loss", val_loss, step=epoch)

    mlflow.log_artifact("model.pth")
    mlflow.sklearn.log_model(model, "model")

Q: Model Registry — why and how?

A: A central repository for models with lifecycle management:
- Versioning: model v1, v2, v3...
- Stages: None → Staging → Production → Archived
- Metadata: metrics, tags, lineage
- Transition approvals for production
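
A sketch of registry usage with MLflow (the model name and run URI are placeholders; the stage-transition call is the classic MLflow registry flow):

import mlflow
from mlflow.tracking import MlflowClient

result = mlflow.register_model("runs:/<run_id>/model", "fraud_model")  # creates a new version

client = MlflowClient()
client.transition_model_version_stage(
    name="fraud_model", version=result.version, stage="Staging"
)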

Killer

Q: How do you run MLflow in production with high availability?

A: (1) Backend store: PostgreSQL (not SQLite!), (2) Artifact store: S3/MinIO with versioning, (3) Tracking server: load-balanced, (4) Authentication: basic auth or OIDC, (5) Model serving: MLflow Models + Docker/Kubernetes, (6) Backups: daily DB dump + artifact sync, (7) Monitoring: health checks on the tracking server.


13. Data Validation & Quality

Basic

Q: What is data validation in ML?

A: Checking that incoming data matches expectations: schema, ranges, completeness, uniqueness. Prevents "garbage in, garbage out".

Q: Great Expectations — what is it?

A: An open-source library for data validation. You define "expectations" (rules) and validate data batches against them. It generates docs with the validation results.

Medium

Q: Types of data quality checks?

A:
- Schema validation: correct columns, types
- Completeness: null % < threshold
- Uniqueness: no duplicate IDs
- Range checks: age 0-120, price > 0
- Distribution checks: mean/std within bounds
- Referential integrity: foreign keys exist

import great_expectations as gx

# Define expectation suite
expectation_suite = gx.ExpectationSuite("data_quality")

expectation_suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeNotNull(column="user_id")
)
expectation_suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeBetween(
        column="age", min_value=0, max_value=120
    )
)
expectation_suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeUnique(column="email")
)

# Validate against a batch (here `validator` comes from a GX data context/batch; setup omitted)
results = validator.validate(expectation_suite)

Q: How do you validate data in a production pipeline?

A: (1) Define expectations on the training data, (2) Apply them at inference time, (3) Alert on failures, (4) Quarantine bad data, (5) Trigger re-training if the distribution shifts.

Killer

Q: A data contract between the Data Engineering and ML teams?

A: A formal agreement:
- Schema: column names, types, nullability
- Freshness: SLA on data arrival (e.g., data by 6am)
- Quality: max null % = 1%, uniqueness = 100%
- Volume: expected rows per day ± 10%
- Semantics: what each column means
- Ownership: who to contact for issues
- Change process: 7-day notice for schema changes


14. Model Monitoring & Observability

Basic

Q: What should you monitor in an ML model?

A: (1) Prediction metrics: accuracy, F1 (if ground truth is available), (2) Data drift: feature distributions, (3) Prediction drift: output distributions, (4) System metrics: latency, throughput, errors, (5) Business metrics: revenue, CTR.

Q: What is training-serving skew?

A: A mismatch between the training and inference pipelines: different preprocessing, feature engineering, or library versions. Leads to silent accuracy degradation.

Medium

Q: Monitoring tools comparison?

A: | Tool | Type | Best For | |------|------|----------| | Evidently | OSS | Data drift, visual reports | | NannyML | OSS | CBPE (no ground truth) | | WhyLabs | SaaS | Enterprise observability | | Prometheus + Grafana | OSS | System metrics | | Datadog | SaaS | Full-stack observability |

Q: How do you monitor without ground truth?

A:
1. Confidence-based: monitor the prediction confidence distribution
2. CBPE (Confidence-Based Performance Estimation): the NannyML approach
3. Drift-only: alert on significant distribution shift
4. Proxy metrics: business KPIs (revenue, engagement)
5. Sampling: label a small % for delayed validation

Killer

Q: Design a monitoring system for 100K predictions/day.

A:

Architecture:
1. Log layer: async logging → Kafka → ClickHouse (predictions + features)
2. Compute layer: hourly batch jobs (drift metrics, aggregation)
3. Alert layer: Prometheus alerts → PagerDuty
4. Visualization: Grafana dashboards

Metrics tracked:
- Data drift: PSI per feature (alert > 0.25)
- Prediction distribution: KS-test vs baseline
- Confidence: mean, std, % low confidence
- Latency: P50, P99
- Volume: predictions/hour

Alerting strategy:
- Critical: accuracy drop > 5%, service down → immediate
- Warning: drift detected, high latency → Slack
- Info: daily summary → email

Cost: ~$500/month (ClickHouse + Grafana + storage)


15. Distributed Training

Basic

Q: What's the difference between Data Parallel and Model Parallel?

A:
- Data Parallel: the model is replicated on every GPU and the data is split. Simple, but each GPU stores the full model.
- Model Parallel: the model is split across layers or tensors; each GPU stores part of the model. More complex, but allows training models larger than a single GPU's memory.

Q: What is DDP (DistributedDataParallel)?

A: PyTorch's data parallelism — each GPU holds a copy of the model, and gradients are synchronized via all-reduce. Efficient for models that fit in memory.
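
A minimal DDP skeleton (a sketch; assumes the script is launched with `torchrun`, which sets LOCAL_RANK, and `MyModel` stands in for your own module):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = MyModel().cuda(local_rank)               # placeholder for your model
ddp_model = DDP(model, device_ids=[local_rank])  # gradients are all-reduced on backward()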

Medium

Q: ZeRO (Zero Redundancy Optimizer) — stages and memory savings?

A:

| Stage | What is sharded | Memory savings |
|---|---|---|
| ZeRO-1 | Optimizer states | ~4x |
| ZeRO-2 | Optimizer + gradients | ~8x |
| ZeRO-3 | Optimizer + gradients + parameters | ~N× (N = GPUs) |

Memory for a 7B model (mixed-precision Adam, ~16 bytes/param):

Standard DDP: ~112GB (14GB FP16 params + 14GB FP16 grads + 84GB FP32 optimizer states)
ZeRO-3 (8 GPUs): ~14GB per GPU

Q: FSDP vs DeepSpeed — when to use which?

A:

| Criterion | FSDP | DeepSpeed |
|---|---|---|
| Memory | Excellent (90%) | Outstanding (95%) |
| Setup | Low | High |
| Ecosystem | PyTorch native | Microsoft |
| Features | Basic sharding | Pipeline parallel, MoE |
| Best for | Most transformers | >10B params |

Use FSDP when: PyTorch-only stack, balanced simplicity/performance. Use DeepSpeed when: maximum memory efficiency, >10B params, a dedicated ML infra team.

Q: How does gradient accumulation work?

A: It simulates a larger batch without increasing memory:

accum_steps = 4
for i, (x, y) in enumerate(dataloader):
    loss = model(x, y) / accum_steps  # Scale down
    loss.backward()
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
Effective batch = actual_batch × accum_steps

Killer

Q: Design distributed training for a 70B LLM on 8×A100 80GB.

A:

Memory analysis:

70B params × 2 bytes (FP16) = 140GB params
70B params × 2 bytes = 140GB gradients
70B params × 12 bytes = 840GB optimizer (Adam: m, v, master)
Total: 1120GB → 140GB per GPU minimum

Solution stack:
1. ZeRO-3 + CPU offload: shards everything, offloads the optimizer to CPU
2. Activation checkpointing: ~50% activation memory reduction
3. Mixed precision (BF16): 2x memory savings
4. Flash Attention: O(N) vs O(N²) attention memory

Configuration:

# DeepSpeed config
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {"device": "cpu"},
    "offload_param": {"device": "cpu"},
    "overlap_comm": true
  },
  "activation_checkpointing": {
    "partition_activations": true
  }
}

Expected per-GPU memory: ~40GB (fits in 80GB A100)

Q: Pipeline Parallelism — how does it work and when do you use it?

A:

Concept: split the model into sequential stages, each on its own GPU.

GPU0: Embedding + Layers 0-5
GPU1: Layers 6-11
GPU2: Layers 12-17
GPU3: Layers 18-23 + LM Head

Micro-batching: several micro-batches flow through the pipeline, keeping all stages busy and filling the "bubbles".

When to use:
- Models too large for data parallelism alone
- Combined with ZeRO (3D parallelism)
- Very deep models (depth > 100 layers)

Trade-offs:
- Pipeline bubbles reduce efficiency
- Complex to implement
- Best combined with tensor parallelism


16. Feature Stores (Deep Dive)

Basic

Q: What is a Feature Store?

A: A centralized store for ML features that provides:
- Consistency: the same features for training and serving
- Reusability: features are reused across models
- Time travel: point-in-time correct joins
- Freshness: real-time feature serving

Q: Offline vs Online Feature Store?

A:

| Store | Use case | Latency | Storage |
|---|---|---|---|
| Offline | Training | Minutes to hours | Parquet, Delta Lake |
| Online | Real-time inference | <10ms | Redis, DynamoDB |

Features are materialized: offline → online (batch or streaming).

Medium

Q: Point-in-time join — why is it needed?

A: It prevents data leakage during training.

Problem: the feature value at serving time may differ from its value at the time of the training event.

Solution: join features by entity_id + timestamp, taking the value that existed at event time.

-- Point-in-time join
SELECT e.entity_id, e.event_time, f.feature_value
FROM events e
JOIN feature_history f
  ON e.entity_id = f.entity_id
  AND f.feature_time <= e.event_time
  AND f.feature_time > e.event_time - INTERVAL '1 day'

Q: Feast vs Tecton vs Hopsworks — how do they compare?

A:

| Criterion | Feast | Tecton | Hopsworks |
|---|---|---|---|
| Type | OSS | Managed SaaS | Hybrid |
| Real-time | Limited | Excellent | Excellent |
| Setup | Medium | Low | Medium |
| Cost | Free | $$$ | $$ |
| Best for | Startups | Enterprise | Mid-market |

Killer

Q: Design a Feature Store for 1000 features, 1M users, <10ms latency.

A:

Architecture:

Sources → Stream (Kafka) → Feature Computation → Online Store (Redis)
        Batch (Spark) → Offline Store (Delta Lake)

Key decisions:
1. Storage: Redis Cluster for online (sub-ms), Delta Lake for offline
2. Materialization: every 5 min for hot features, hourly for the rest
3. Feature groups: group by freshness requirements
4. Monitoring: feature freshness alerts, latency P99

Code pattern:

# Feature serving
def get_features(entity_id, feature_names):
    keys = [f"{name}:{entity_id}" for name in feature_names]
    values = redis.mget(keys)  # Multi-get for <10ms
    return dict(zip(feature_names, values))


17. Causal Inference & Uplift Modeling

Basic

Q: Correlation vs causation — what's the difference?

A:
- Correlation: X and Y move together, but not necessarily as cause and effect.
- Causation: X causes Y.

Example: ice cream sales correlate with drownings. The cause is heat, which increases both; ice cream does not cause drowning.

Q: What is Uplift Modeling?

A: A technique for estimating the incremental effect of a treatment on a specific user.

\[\text{Uplift} = P(Y|T=1) - P(Y|T=0)\]

where T = treatment and Y = outcome.

Medium

Q: How do you segment users by treatment response?

A:

| Segment | Treatment response | Strategy |
|---|---|---|
| Persuadables | Positive uplift | Target! |
| Sure Things | Buy anyway | Don't waste treatment |
| Lost Causes | Won't buy anyway | Don't target |
| Sleeping Dogs | Negative uplift (treatment hurts) | Avoid! |

Q: T-Learner vs S-Learner vs X-Learner?

A:

T-Learner (two-model):
- Train separate models for treatment and control
- Uplift = Model_T(x) - Model_C(x)

S-Learner (single-model):
- One model with the treatment as a feature
- Uplift = Model(x, T=1) - Model(x, T=0)

X-Learner:
- Step 1: T-Learner base models
- Step 2: compute individual treatment effects
- Step 3: propensity-weighted combination
- Better when treatment/control sizes differ

# T-Learner implementation
from sklearn.ensemble import GradientBoostingClassifier

def t_learner(X_train, y_train, treatment_train):
    # Split by treatment
    X_treat = X_train[treatment_train == 1]
    y_treat = y_train[treatment_train == 1]
    X_ctrl = X_train[treatment_train == 0]
    y_ctrl = y_train[treatment_train == 0]

    # Train separate models
    model_t = GradientBoostingClassifier().fit(X_treat, y_treat)
    model_c = GradientBoostingClassifier().fit(X_ctrl, y_ctrl)

    return lambda X: model_t.predict_proba(X)[:,1] - model_c.predict_proba(X)[:,1]

Killer

Q: How do you evaluate an uplift model without ground truth?

A:

Problem: for any single user we observe the outcome under T=1 OR T=0, never both.

Solutions:

1. AUUC (Area Under the Uplift Curve): rank users by predicted uplift and compare the cumulative treatment effect against random targeting.
2. Qini coefficient: \(Q = \sum_i (Y_{T,i} - Y_{C,i} \cdot \frac{n_T}{n_C})\)
3. Uplift-at-k: evaluate the treatment effect in the top-k by predicted uplift; requires held-out A/B test data.
4. Counterfactual estimation: causal inference methods (IPW, Doubly Robust).

Q: Propensity Score Matching — why and how?

A:

Goal: compare treatment and control groups with similar characteristics.

Method:
1. Estimate \(P(T=1|X)\) = the propensity score (logistic regression)
2. Match treated and control units with similar scores
3. Compare outcomes within matched pairs

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

# 1. Estimate propensity scores
ps_model = LogisticRegression().fit(X, treatment)
propensity_scores = ps_model.predict_proba(X)[:, 1]

# 2. Match nearest neighbors
nn = NearestNeighbors(n_neighbors=1).fit(
    propensity_scores[treatment == 0].reshape(-1, 1)
)
distances, indices = nn.kneighbors(
    propensity_scores[treatment == 1].reshape(-1, 1)
)

# 3. Compare matched pairs
ate = y[treatment == 1].mean() - y[treatment == 0][indices.flatten()].mean()

18. Multi-Armed Bandits

Sources: GeeksforGeeks A/B Testing vs MAB, Analytics Vidhya MLOps Questions

Basic

Q: What is a Multi-Armed Bandit (MAB)?

A: A reinforcement learning algorithm that balances exploration (trying new options) and exploitation (using the best known option).

Name: from the "one-armed bandit" (slot machine). K arms = K options.

Q: MAB vs A/B testing — what's the difference?

A:

| Aspect | A/B Testing | Multi-Armed Bandit |
|---|---|---|
| Allocation | Fixed (50/50) | Dynamic (adapts to winners) |
| Goal | Statistical significance | Maximize cumulative reward |
| Duration | Fixed period | Continuous |
| Regret | Wastes traffic on losers | Minimizes regret |
| Speed | Slow convergence | Fast adaptation |

Medium

Q: Epsilon-Greedy vs UCB vs Thompson Sampling?

A:

Epsilon-Greedy:
- With probability ε: explore a random arm
- With probability 1-ε: exploit the best arm
- Simple, but the exploration rate is fixed

UCB (Upper Confidence Bound):
- Select the arm with the highest upper confidence bound
- \(UCB_i = \bar{r}_i + \sqrt{\frac{2 \ln n}{n_i}}\)
- Balances exploration/exploitation automatically

Thompson Sampling:
- Bayesian: sample from the posterior distribution of each arm
- Select the arm with the highest sample
- Works well for Bernoulli and Gaussian rewards
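
A minimal Thompson Sampling loop for Bernoulli rewards (a sketch with Beta(1,1) priors; `pull(arm)` is a placeholder for observing a click/no-click):

import numpy as np

n_arms = 5
alpha, beta = np.ones(n_arms), np.ones(n_arms)   # Beta(1,1) priors

for _ in range(10_000):
    samples = np.random.beta(alpha, beta)        # one draw from each arm's posterior
    arm = int(np.argmax(samples))
    reward = pull(arm)                           # 0 or 1, returned by the environment (placeholder)
    alpha[arm] += reward
    beta[arm] += 1 - reward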

Q: When should you use MAB instead of A/B testing?

A:

Use MAB when:
- You need to maximize reward during the experiment (ads, recommendations)
- The environment is non-stationary (preferences change)
- There are many variants to test
- A short experiment duration is acceptable

Use A/B when:
- You need statistical rigor (scientific conclusions)
- Regulation requires definitive proof
- You are learning about user behavior (not just optimizing)
- Exploration could cause real negative impact

Killer

Q: Design a MAB system for ad selection with 1000+ creatives.

A:

Challenges:
- 1000+ arms → slow convergence
- New creatives added constantly (cold start)
- Non-stationary CTR (seasonality, fatigue)

Solution:

# Hierarchical approach
# 1. Contextual bandit with features (not just the arm ID)
# 2. Clustering: similar creatives share learning
# 3. Thompson Sampling with a warm start for new creatives

import numpy as np

class ContextualBandit:
    def __init__(self, n_arms, context_dim):
        # BayesianLinear is a placeholder for a per-arm Bayesian linear reward model
        self.models = [BayesianLinear(context_dim) for _ in range(n_arms)]

    def select(self, context):
        samples = [m.sample(context) for m in self.models]  # Thompson sample per arm
        return int(np.argmax(samples))

Key components:
1. Contextual features: user, placement, time
2. Thompson Sampling: natural exploration
3. Cold start: use creative metadata for initial priors
4. Non-stationarity: decay old observations
5. Fallback: guaranteed exploration for new creatives

Q: How do you evaluate a MAB algorithm offline?

A:

Problem: you can't run several bandits on the same live traffic.

Solutions:

1. Counterfactual evaluation: use logged data with propensity scores, \(V = \frac{1}{N} \sum_{i} \frac{r_i \cdot \mathbb{1}(a_i = a)}{p(a_i | x_i)}\)
2. Replay method: simulate the bandit on historical data; only count the reward when the bandit's action matches the logged action.
3. Off-policy evaluation (OPE): IPS (Inverse Propensity Scoring), Doubly Robust estimators.
# IPS Estimator
def ips_estimate(logs, policy):
    total = 0
    for log in logs:
        if policy.select(log.context) == log.action:
            total += log.reward / log.logging_prob
    return total / len(logs)

Best practice: Online A/B test final candidates after offline filtering.


19. Model Drift Detection (Advanced)

Source: Model Drift in Production (2026)

Basic

Q: Data drift vs Concept drift vs Label drift?

A:

| Type | What changes | Example |
|---|---|---|
| Data drift | P(X) | New user demographics, new devices |
| Concept drift | P(Y\|X) | Same features, but the feature-outcome relationship changes |
| Label drift | P(Y) | Class balance changes, policy changes |

Q: What is PSI (Population Stability Index)?

A: Measure of distribution shift between baseline and current.

\[PSI = \sum (Current\% - Baseline\%) \times \ln\frac{Current\%}{Baseline\%}\]

Interpretation: - PSI < 0.1: No significant shift - 0.1 ≤ PSI < 0.25: Moderate shift, investigate - PSI ≥ 0.25: Significant shift, action needed

Medium

Q: Drift metrics comparison — PSI vs KS-test vs Wasserstein?

A:

| Metric | Best for | Pros | Cons |
|---|---|---|---|
| PSI | Binned distributions | Interpretable, industry standard | Requires binning |
| KS-test | Continuous, 1D | Statistical significance, no binning | Only 1D, sensitive to sample size |
| Wasserstein | Continuous, geometric | Robust, captures shape | Less interpretable |
| JS/KL divergence | Probability distributions | Information-theoretic | Requires density estimation |

Q: How to set up drift monitoring in production?

A:

1. Baselines:
   - Training distribution
   - A healthy production window
   - Seasonal baselines (same period last year)

2. Windows:
   - Short (1h/1d): sudden shifts, pipeline bugs
   - Medium (7d): noise smoothing
   - Long (30d): slow drift

3. Slicing:
   - Country/locale
   - Device/OS
   - User segment (new/returning)

4. Alerting:
   - Warning: investigate (PSI > 0.1)
   - Critical: mitigate (PSI > 0.25, performance drop)
   - Persistence: alert only after N consecutive windows

Killer

Q: Drift detected at 3am. Your response playbook?

A:

Phase 1: Triage (15 min)
1. Check data integrity: null spikes, schema changes, pipeline failures
2. Check recent changes: deployments, upstream API changes
3. Localize: which slice(s) are affected?

Phase 2: Immediate Mitigation (1h)

if data_pipeline_broken:
    fix_pipeline()  # Highest priority
elif model_degraded:
    if new_model_recently_deployed:
        rollback()
    else:
        increase_fallback_threshold()
        route_to_human_review()

Phase 3: Investigation (4h)
- Compare failure patterns to the baseline
- Check feature-level drift
- Review label drift (if labels are available)

Phase 4: Resolution (1d)
- Targeted labeling for drifted slices
- Retrain with refreshed data
- Calibration refresh if the score distribution shifted

Phase 5: Prevention (1w)
- Add data validation gates in CI/CD
- Improve dashboards/alerting
- Document the incident in the runbook

Q: Drift in LLM/RAG systems — what's different?

A:

LLM-specific drift sources:
1. Prompt drift: system prompt changes, template updates
2. Retrieval drift: knowledge base updates, embedding model changes
3. Tool drift: API schemas change, latency changes

Monitoring signals:
- Retrieval hit rate
- Top-k similarity scores
- Citation coverage
- Rate of answers without retrieval
- Tool call success rate

Key insight: "The model" in LLM systems = weights + prompts + retrieval + tools. Version all components.


20. Online Learning (Streaming ML)

Basic

Q: Online vs batch learning — what's the difference?

A:

| Aspect | Batch | Online |
|---|---|---|
| Data | Fixed dataset | Continuous stream |
| Updates | Retrain periodically | Update after each sample |
| Memory | Store all data | Recent window only |
| Latency | Hours/days | Milliseconds |
| Use case | Stable distributions | Non-stationary data |

Q: When should you use online learning?

A: (1) Real-time bidding (ad tech), (2) Fraud detection, (3) Recommendation systems, (4) High-velocity data streams, (5) Concept drift environments.

Medium

Q: FTRL-Proximal — how does it work?

A: Follow-The-Regularized-Leader with L1 regularization. Designed for sparse, high-dimensional features (ads, recommendations).

# Per-coordinate FTRL-Proximal update (schematic)
sigma = (sqrt(n_i + grad**2) - sqrt(n_i)) / alpha
z_i += grad - sigma * w_i
n_i += grad**2
# Closed form with L1 (sparsity) and L2 regularization
w_i = 0 if abs(z_i) <= lambda1 else -(z_i - sign(z_i) * lambda1) / ((beta + sqrt(n_i)) / alpha + lambda2)

Q: How do you detect concept drift in online learning?

A:
- ADWIN: Adaptive Windowing — detects change when the window variance exceeds a threshold
- DDM: Drift Detection Method — monitors the error rate and alerts on a significant increase
- Page-Hinkley test: cumulative sum of deviations from the mean

from river import drift

detector = drift.ADWIN()
for x, y in stream:
    y_pred = model.predict_one(x)
    error = int(y_pred != y)
    detector.update(error)
    if detector.drift_detected:
        model = reset_model()  # Retrain from scratch

Killer

Q: Design an online ML pipeline for fraud detection.

A:

Architecture:

[Kafka Stream] → [Flink ML] → [Model] → [Decision Engine]
      ↓              ↓           ↓            ↓
  Transactions   Features    Prediction    Action
  (100K/sec)    (aggregates)  (fraud prob)  (block/allow)

Feature pipeline (Flink):
- 5-min tumbling windows: tx_count, tx_amount_sum
- Sliding windows: velocity_1h, velocity_24h
- Real-time aggregations: merchant_tx_count, user_distinct_merchants

Model: online logistic regression with FTRL
- Features: ~1M sparse features (user, merchant, device embeddings)
- Update: a gradient step per transaction
- Latency: <10ms including feature computation

Drift handling:
- ADWIN for performance monitoring
- Automatic model reset on significant drift
- Shadow model for A/B comparison

Fallback:
- Rule-based fallback if ML latency > 50ms
- Feature freshness monitoring


21. Multi-Stage Recommender Systems

Basic

Q: What is a multi-stage recommender?

A: A funnel architecture of several stages:
1. Retrieval: millions → thousands (coarse filtering)
2. Pre-ranking: thousands → hundreds (light model)
3. Ranking: hundreds → tens (heavy model)
4. Re-ranking: final diversity/freshness adjustments

Q: Why is a multi-stage architecture needed?

A: The accuracy vs latency trade-off. You cannot push millions of items through a heavy model within 50ms. Retrieval is cheap, ranking is expensive.

Medium

Q: The Two-Tower model for retrieval — how does it work?

A:

User Features → [User Tower] → User Embedding (64-256d)
                                  dot product
Item Features → [Item Tower] → Item Embedding (64-256d)

Training: In-batch negatives (other items in batch as negatives) Inference: Pre-computed item embeddings + ANN search (FAISS, ScaNN)

import torch.nn as nn
import torch.nn.functional as F

class TwoTower(nn.Module):
    def __init__(self, user_dim, item_dim, embed_dim):
        super().__init__()
        self.user_tower = nn.Sequential(
            nn.Linear(user_dim, 256), nn.ReLU(),
            nn.Linear(256, embed_dim)
        )
        self.item_tower = nn.Sequential(
            nn.Linear(item_dim, 256), nn.ReLU(),
            nn.Linear(256, embed_dim)
        )

    def forward(self, user_features, item_features):
        user_emb = F.normalize(self.user_tower(user_features), dim=-1)
        item_emb = F.normalize(self.item_tower(item_features), dim=-1)
        return (user_emb * item_emb).sum(dim=-1)  # Cosine similarity

Q: ANN indexes — which one and when?

A:

| Index | Build time | Query time | Recall | Use case |
|---|---|---|---|---|
| Flat | O(1) | O(n) | 100% | <100K items |
| IVF | O(n) | O(sqrt(n)) | 90-95% | 100K-10M |
| HNSW | O(n log n) | O(log n) | 95-99% | Real-time, high recall |
| IVF-PQ | O(n) | O(sqrt(n)/m) | 80-90% | Memory-constrained |
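
A sketch of an IVF index with FAISS (`xb` = item embeddings, `xq` = query embeddings, both float32 and L2-normalized for inner-product search):

import faiss

d, nlist = 128, 1024
quantizer = faiss.IndexFlatIP(d)                   # coarse quantizer over the nlist clusters
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
index.train(xb)                                    # learn the cluster centroids
index.add(xb)
index.nprobe = 16                                  # clusters scanned per query: recall vs latency knob
scores, ids = index.search(xq, 100)                # top-100 candidates per query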

Killer

Q: Design a YouTube-scale recommender (2B users, 1B videos).

A:

Stage 1: Retrieval (candidates: 1B → 100)
- Two-Tower with user watch history + video embeddings
- ANN index: HNSW on 256-dim embeddings
- Multiple retrieval sources: collaborative, content-based, trending
- Latency: ~10ms

Stage 2: Pre-ranking (100 → 20)
- Light GBDT (50 trees, depth 4)
- Features: user-video affinity, video popularity, recency
- Latency: ~5ms

Stage 3: Ranking (20 → 5)
- Deep ranking model (DCNv2 or DeepFM)
- Features: rich cross-features, user context, video quality scores
- Target: watch time prediction (weighted logistic)
- Latency: ~20ms

Stage 4: Re-ranking
- Diversity: MMR (Maximal Marginal Relevance)
- Freshness: boost new content
- Business rules: remove watched items, age restrictions
- Latency: ~2ms

Total latency: ~40ms P99


22. Vector Databases for ML

Basic

Q: What is a vector database?

A: A database for storing and searching vector embeddings, optimized for Approximate Nearest Neighbor (ANN) search.

Q: When do you need a vector DB vs a regular DB?

A:
- Vector DB: semantic search, RAG, recommendation retrieval, duplicate detection
- Regular DB: exact match, range queries, aggregations, ACID transactions

Medium

Q: HNSW vs IVF — when to use which?

A:

| HNSW | IVF |
|---|---|
| Graph-based | Cluster-based |
| Higher recall, more memory | Lower memory, tunable recall |
| Better for real-time updates | Better for batch rebuilds |
| O(log n) query | O(sqrt(n)) query |
| Complex params (M, ef) | Simpler params (nlist, nprobe) |

Q: What is hybrid search?

A: A combination of vector search and keyword search (BM25), merged with RRF (Reciprocal Rank Fusion):

def reciprocal_rank_fusion(vector_results, keyword_results, k=60):
    scores = {}
    for rank, doc in enumerate(vector_results):
        scores[doc.id] = scores.get(doc.id, 0) + 1 / (k + rank)
    for rank, doc in enumerate(keyword_results):
        scores[doc.id] = scores.get(doc.id, 0) + 1 / (k + rank)
    return sorted(scores.items(), key=lambda x: -x[1])

Killer

Q: Choosing a vector DB for production — what are the criteria?

A:

| DB | Strengths | Weaknesses | Use case |
|---|---|---|---|
| Pinecone | Managed, scalable | Expensive, vendor lock-in | Enterprise, no ops |
| Milvus | Open-source, feature-rich | Complex setup | Large-scale, self-hosted |
| Weaviate | GraphQL, modules | Younger ecosystem | RAG, multimodal |
| Qdrant | Rust, filtering | Smaller community | Performance-critical |
| pgvector | Postgres extension | Limited scale | Existing Postgres infra |
| Chroma | Simple, embedded | Not for scale | Prototyping, small apps |

23. Cost Optimization for ML Inference

Basic

Q: What are the main cost drivers of ML inference?

A:
- GPU compute: 60-70%
- Memory: 15-20%
- Network: 5-10%
- Storage: 5-10%

Q: How do you reduce the cost per prediction?

A: (1) Model quantization, (2) Batching, (3) GPU sharing, (4) Spot instances, (5) Model right-sizing, (6) Caching.

Medium

Q: Spot instances for ML — what's the strategy?

A:
- Use them for batch inference and training
- NOT for latency-critical online inference
- Preemption detection: cloud metadata API
- Graceful shutdown: checkpoint every N batches
- Fallback: keep an on-demand pool ready

# Spot instance preemption handling
import requests

def check_preemption():
    try:
        # GCP metadata
        resp = requests.get('http://metadata.google.internal/computeMetadata/v1/instance/preempted',
                          headers={'Metadata-Flavor': 'Google'})
        return resp.text == 'TRUE'
    except requests.RequestException:
        return False

# In inference loop
for batch in data:
    if check_preemption():
        save_checkpoint(model, batch_position)
        notify_fallback_pool()
        break
    predictions = model(batch)

Q: Semantic caching for LLMs — how does it work?

A:
1. Embed the query with a sentence transformer
2. Search for similar queries in the cache (cosine similarity > 0.95)
3. If found: return the cached response
4. If not: call the LLM and cache the response together with its embedding

Savings: 20-40% of LLM calls for customer support, FAQ use cases.
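
A minimal semantic cache sketch (assumes `sentence-transformers` for embeddings and an in-memory list as the cache; `call_llm` is a placeholder):

import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")
cache = []  # list of (embedding, response) pairs

def cached_answer(query, threshold=0.95):
    q = encoder.encode(query, normalize_embeddings=True)
    for emb, response in cache:
        if float(np.dot(q, emb)) > threshold:    # cosine similarity of normalized vectors
            return response                       # cache hit: skip the LLM call
    response = call_llm(query)                    # placeholder for the real LLM call
    cache.append((q, response))
    return response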

Killer

Q: A cost optimization strategy for an inference platform (100 models, 1B predictions/day)?

A:

Tier 1: Model optimization (40% savings)
- Quantize all models to INT8 (2-4x throughput)
- Distill ensemble models where possible
- Prune unused features/neurons

Tier 2: Infrastructure (30% savings)
- Spot instances for 70% of batch traffic
- GPU sharing with MIG (Multi-Instance GPU)
- Right-size: A10G for small models, H100 for large

Tier 3: Traffic optimization (20% savings)
- Semantic caching for LLM endpoints (30% hit rate)
- Request batching with max_wait=10ms
- Model routing: simple queries → small models

Tier 4: Monitoring & governance (10% savings)
- Cost-per-prediction dashboards
- Budget alerts per team
- Unused model deprecation policy


24. Multi-Model Serving and Model Routing

Basic

Q: Why multi-model serving?

A: (1) Different tasks (classification, NER, QA), (2) Cost optimization (route to cheaper models), (3) Redundancy, (4) A/B testing, (5) Graceful degradation.

Q: Routing strategies — what kinds exist?

A:
- Weighted round-robin
- Latency-based
- Cost-aware
- Confidence-based (cascade)
- Content-based (route by input features)

Medium

Q: Cascade routing — how does it work?

A:
1. Try the small/fast model first
2. If confidence > threshold: return the prediction
3. If confidence < threshold: route to the larger model
4. Optionally: a third tier for edge cases

class CascadeRouter:
    def __init__(self, small_model, large_model, threshold=0.8):
        self.small = small_model
        self.large = large_model
        self.threshold = threshold

    def predict(self, x):
        pred, conf = self.small.predict_with_confidence(x)
        if conf > self.threshold:
            return pred
        return self.large.predict(x)

Q: The Circuit Breaker pattern for model fallback?

A:

States: CLOSED → OPEN → HALF_OPEN → CLOSED

CLOSED: Normal operation, track failures
OPEN: All requests go to fallback, wait for timeout
HALF_OPEN: Test with single request, decide state
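
A compact circuit-breaker sketch around a model call (the thresholds and timeout are illustrative):

import time

class CircuitBreaker:
    def __init__(self, max_failures=5, reset_timeout=30):
        self.max_failures, self.reset_timeout = max_failures, reset_timeout
        self.failures, self.opened_at = 0, None       # CLOSED state

    def call(self, primary, fallback, *args):
        if self.opened_at and time.time() - self.opened_at < self.reset_timeout:
            return fallback(*args)                    # OPEN: everything goes to the fallback
        try:                                          # CLOSED / HALF_OPEN: try the primary model
            result = primary(*args)
            self.failures, self.opened_at = 0, None   # success closes the breaker
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()          # trip to OPEN
            return fallback(*args)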

Killer

Q: Design a model router for an LLM API (GPT-4, Claude, Gemini, local Llama).

A:

Routing Decision Matrix:

| Query type | Route to | Why |
|---|---|---|
| Code generation | Claude/GPT-4 | Best code quality |
| Simple Q&A | Llama 70B | 100x cheaper |
| Long context (>32K) | Claude 200K | Context window |
| Real-time chat | Llama 70B | Lowest latency |
| Complex reasoning | GPT-4 o1 | Chain-of-thought |
| Image input | GPT-4V/Claude | Multimodal |

Implementation:

class LLMRouter:
    def route(self, query, context):
        # Content-based routing
        if len(context) > 32000:
            return "claude-200k"
        if "code" in query or "implement" in query:
            return self.circuit_breaker.call("gpt-4", fallback="llama-70b")
        if self.is_simple_query(query):
            return "llama-70b"
        if self.needs_reasoning(query):
            return "o1"
        return self.cost_aware_select(query)  # Balance cost/quality

Circuit Breaker Integration: - Track per-model error rates - Fallback chain: primary → backup → local model - Automatic recovery after 30s cooldown


25. AI Agents in Production

Basic

Q: What is an AI agent?

A: An autonomous system that: (1) perceives its environment, (2) makes decisions, (3) executes actions via tools, (4) has memory/goals.

Q: Agent vs a plain LLM chat?

A:
- LLM chat: a single response, no tools, no memory
- Agent: multi-step reasoning, tool use, persistent memory, goal-directed behavior

Medium

Q: The ReAct pattern — how does it work?

A: Reasoning + Acting loop:

Thought: What should I do next?
Action: [tool_name, tool_args]
Observation: [tool output]
Thought: Based on observation...
Action: [next action or Final Answer]
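
A schematic ReAct loop (a sketch; `llm`, `tools`, and `parse_action` are placeholders for your model call, tool implementations, and Action-line parser):

def react_agent(question, llm, tools, max_steps=5):
    history = f"Question: {question}"
    for _ in range(max_steps):
        step = llm(history)                        # "Thought: ... Action: tool[args]" or a final answer
        if "Final Answer:" in step:
            return step.split("Final Answer:")[-1].strip()
        tool_name, tool_args = parse_action(step)  # extract the Action line (placeholder parser)
        observation = tools[tool_name](tool_args)  # run the tool
        history += f"\n{step}\nObservation: {observation}"
    return "Stopped: step limit reached"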

Q: Human-in-the-Loop (HITL) patterns?

A: 1. Approval gates: Critical actions require human approval 2. Review queues: Batch agent outputs for human review 3. Escalation: Agent requests help when uncertain 4. Correction feedback: Human corrections improve agent

from langgraph.types import interrupt

def agent_with_hitl(state):
    result = agent_step(state)
    if is_critical_action(result):
        human_response = interrupt("Approve action?")
        if not human_response.approved:
            result = revise_plan(result)
    return result

Killer

Q: Defence-in-depth for AI agents — what's the architecture?

A:

Layer 1: Input sanitization
- PII detection and redaction
- Prompt injection detection
- Length/rate limits

Layer 2: Agent execution
- Sandbox environment
- Resource limits (time, tokens, API calls)
- State isolation

Layer 3: Tool gatekeeping
- Allowlist of approved tools
- Permission levels per tool
- Schema validation on inputs

Layer 4: Output validation
- Content policy checks
- Format validation
- Sensitive data filter

Layer 5: Observability
- Full execution traces
- Decision audit log
- Anomaly detection


26. Security for ML

Basic

Q: What are the main types of attacks on ML models?

A:
- Evasion: adversarial inputs at inference time (FGSM, PGD)
- Poisoning: malicious training data
- Extraction: stealing the model via queries
- Inversion: reconstructing training data
- Membership inference: determining whether a sample was in the training set

Q: What is an adversarial example?

A: Input with imperceptible perturbation that causes misclassification. Example: image + noise → wrong class with high confidence.

Medium

Q: The FGSM attack — what's the formula?

A: Fast Gradient Sign Method: \(x_{adv} = x + \epsilon \cdot \text{sign}(\nabla_x J(\theta, x, y))\)

where \(x\) = original input, \(\epsilon\) = perturbation magnitude (e.g., 0.01), \(J\) = loss function, \(\nabla_x\) = gradient w.r.t. the input.

import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon):
    x_adv = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    return (x_adv + epsilon * x_adv.grad.sign()).detach()

Q: How do you defend against model extraction?

A:
- Rate limiting per API key
- Output perturbation (add noise, round predictions)
- Watermarking model outputs
- Query pattern detection

Killer

Q: A security architecture for a production ML API?

A:

Defense layers:

1. Input layer:
   - Schema validation
   - Anomaly detection on inputs
   - Rate limiting (100 req/min/user)

2. Model layer:
   - Adversarial training (PGD)
   - Input preprocessing (randomization)
   - Confidence thresholding

3. Output layer:
   - Prediction rounding (2-3 decimals)
   - Calibrated noise
   - Watermark embedding

4. Monitoring layer:
   - Query distribution drift
   - Suspicious user patterns
   - Model extraction detection

5. Access layer:
   - Authentication required
   - API key rotation
   - IP allowlisting for enterprise

Common Misconceptions

Misconception: the main thing in an MLSD interview is picking the right model

Model choice is 10-15% of the score. Interviewers evaluate: (1) the right clarifying questions (scope, scale, latency), (2) the system architecture (data flow, components), (3) feature engineering (what and why), (4) the trade-off discussion (precision vs recall, latency vs accuracy), (5) monitoring and feedback loops. A candidate who immediately says "BERT" without discussing requirements is a red flag.

Misconception: you need to memorize exact architectures (YouTube, Instagram)

Memorizing specific architectures is pointless -- the interviewer will change the constraints. You need to understand the PRINCIPLES: the multi-stage funnel (retrieval -> ranking -> re-ranking), cascade routing (fast model -> heavy model), confidence-based human-in-the-loop, feedback loops. With these principles you can design any system.

Misconception: if you don't know the answer, you should make something up

An honest "I'm not sure, but here is my reasoning..." scores higher than a confident wrong answer. An MLSD interview tests thinking, not memory. Approach: (1) state what you know, (2) reason from first principles, (3) propose how you would investigate it. That demonstrates engineering thinking.

Questions with Graded Answers

How would you approach an MLSD question you have never solved before?

❌ "I'd start by picking a model and describing the training pipeline" -- skips requirements gathering

✅ "A standard framework: (1) Clarifying questions: scope, scale, latency SLA, data availability -- 5 min. (2) High-level architecture: data flow, main components -- 10 min. (3) Deep dive: features, model choice with justification, training pipeline -- 15 min. (4) Trade-offs and operations: monitoring, A/B testing, failure modes -- 10 min. (5) Extensions: scaling, edge cases. This framework works for ANY MLSD problem because it focuses on system design rather than a specific model."

Precision 95% vs Recall 95% -- which do you pick for fraud detection?

❌ "Precision, so we don't block legitimate transactions" -- ignores the asymmetric cost

✅ "Recall > precision for fraud detection: a missed fraud ($1,000-100K loss) costs 10-100x more than a false positive (a 30-second transaction delay for verification). But it's not a binary choice -- use a tiered approach: (1) high recall (99%+) for flagging, (2) human review for flagged transactions, (3) auto-block only at very high confidence (>99.5%). Business metric: $ saved from fraud / $ lost to false blocks."


See Also