Data Engineering: Interview Preparation¶
~13 min read
Prerequisites: Data Engineering Study Materials | Classical ML
Data Engineering is a must-have competency for ML positions: according to Indeed (2025), 89% of ML job postings require SQL/Spark skills, and 67% expect an understanding of data leakage and feature stores. This section covers 7 major topics: Data Leakage, Spark (RDD/DataFrame), Feature Stores, Data Lakehouse (Delta/Iceberg/Hudi), dbt, Airflow, and workflow orchestration. Questions come at three levels: Basic, Medium, Killer.
Updated: 2026-02-11
1. Data Leakage¶
Basic¶
Q: What is data leakage in ML?
A: The model receives information during training that will not be available in production. This inflates metrics on train but causes failure in production.
Q: Name the types of data leakage.
A: (1) Target leakage: a feature correlates with the target but is unavailable at prediction time, (2) Train-test contamination: information from test leaks into train, (3) Temporal leakage: future data in train, (4) Group leakage: duplicate entities across splits.
Medium¶
Q: How do you prevent data leakage during feature engineering?
A: (1) Split before any preprocessing, (2) Use a Pipeline, (3) Fit only on train, (4) Temporal split for time series, (5) Cross-validation that respects temporal ordering.
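Point (3) in a minimal pure-Python sketch: scaling statistics are computed on the train split only and then reused, never refit, on test (the numbers are hypothetical, chosen for illustration):

```python
# Leakage-safe scaling: statistics come from the train split only.
train = [10.0, 12.0, 14.0]
test = [100.0]  # an outlier that must NOT influence the scaler

mean = sum(train) / len(train)  # fit on train only: 12.0
std = (sum((x - mean) ** 2 for x in train) / len(train)) ** 0.5

def scale(x):
    # The same train-fitted statistics are applied to both splits.
    return (x - mean) / std

train_scaled = [scale(x) for x in train]
test_scaled = [scale(x) for x in test]

# Fitting on train + test would shift the mean toward the outlier and
# shrink its apparent distance: that is train-test contamination.
```

sklearn's `Pipeline` enforces exactly this discipline automatically when you call `fit` on the train split only.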
Q: An example of target leakage?
A: In fraud detection: the feature "transaction_reversed" is True for fraud and False for normal transactions. In production, at prediction time it is not yet known whether a reversal will happen.
Q: How do you detect data leakage?
A: (1) Check feature-target correlations (> 0.95 is suspicious), (2) Check train/test ID overlap, (3) Compare train vs test distributions, (4) Validate on a hold-out set after all experiments.
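Checks (1) and (2) are mechanical; a sketch with hypothetical IDs and feature values:

```python
# Check (2): train/test entity overlap (group leakage) via set intersection.
train_ids = {"u1", "u2", "u3", "u4"}
test_ids = {"u3", "u5"}

overlap = train_ids & test_ids
overlap_rate = len(overlap) / len(test_ids)
assert overlap == {"u3"}  # any overlap means the same entity is in both splits

# Check (1): a feature that almost duplicates the target is suspicious.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

target = [0, 0, 1, 1, 1]
feature = [0.1, 0.0, 0.9, 1.0, 0.95]
suspicious = abs(pearson(feature, target)) > 0.95  # flag for manual review
```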
Killer¶
Q: The model shows 99% accuracy, but 60% in production. What do you do?
A: (1) Check for data leakage: was the scaler fit on the whole dataset? (2) Check for temporal leakage: is the split correct? (3) Check for distribution shift, (4) Check feature availability in production, (5) A/B test against a baseline.
2. Spark (RDD/DataFrame)¶
Basic¶
Q: RDD vs DataFrame?
A: RDD: low-level, type-safe, manual optimization. DataFrame: high-level SQL-like API, Catalyst optimizer, lazy evaluation. DataFrame is preferable for ML.
Q: What is lazy evaluation in Spark?
A: Transformations are not executed immediately. Spark builds a DAG (directed acyclic graph) and executes it only when an action runs (collect, count, save).
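The idea can be sketched in plain Python: transformations only record operations, and an action triggers the whole recorded plan (a toy model of the concept, not the Spark API):

```python
class LazyDataset:
    """Records transformations; executes them only on an action."""

    def __init__(self, data, ops=()):
        self._data = data
        self._ops = list(ops)  # the "DAG": an ordered list of stages

    def map(self, fn):     # transformation: nothing runs yet
        return LazyDataset(self._data, self._ops + [("map", fn)])

    def filter(self, fn):  # transformation: nothing runs yet
        return LazyDataset(self._data, self._ops + [("filter", fn)])

    def collect(self):     # action: executes the recorded plan
        out = self._data
        for kind, fn in self._ops:
            if kind == "map":
                out = [fn(x) for x in out]
            else:
                out = [x for x in out if fn(x)]
        return out

ds = LazyDataset([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
# No work has happened yet; collect() runs both stages at once.
result = ds.collect()  # [20, 30, 40]
```

Spark exploits this deferral to optimize the whole plan (e.g. pushing filters down) before anything executes.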
Medium¶
Q: How does a Spark ML Pipeline work?
A: A sequence of stages: (1) Transformers (StringIndexer, OneHotEncoder), (2) Estimators (models), (3) Pipeline.fit() trains all stages, (4) Pipeline.transform() applies them to data.
Q: How do you optimize a Spark job?
A: (1) Cache frequently used DataFrames, (2) Broadcast join for small tables, (3) Repartition for skewed data, (4) Coalesce to reduce the number of partitions, (5) Use the DataFrame API instead of RDDs.
Q: What is a shuffle in Spark?
A: Redistribution of data across partitions; an expensive operation. Triggered by groupBy, join, distinct, sortBy. Minimize it for performance.
Killer¶
Q: Design an ML pipeline on Spark for 1TB of data.
A: (1) Read from S3/HDFS in parallel, (2) Feature engineering via the DataFrame API, (3) ML Pipeline with VectorAssembler + Model, (4) Cache intermediate results, (5) Partition by key for join efficiency, (6) Write predictions partitioned by date.
3. Combined Questions¶
Killer¶
Q: How do you prevent data leakage in a distributed ML pipeline?
A: (1) Deterministic split by key (hash partitioning), (2) Preserve temporal ordering across partitions, (3) Fit transformers on train data only, (4) Avoid computing global statistics (e.g. for imputation) before the split, (5) Validate leakage detection on a sample.
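Point (1) in a minimal sketch: a deterministic hash of the entity key assigns every record of the same user to the same split on every machine (md5 is an arbitrary stable choice here; Python's built-in `hash()` is salted per process and would not be reproducible across workers):

```python
import hashlib

def split_of(key: str, train_pct: int = 80) -> str:
    """Deterministic train/test assignment by entity key."""
    bucket = int(hashlib.md5(key.encode()).hexdigest(), 16) % 100
    return "train" if bucket < train_pct else "test"

# The same key always lands in the same split, on any worker,
# so one customer can never appear in both train and test.
assert split_of("user_42") == split_of("user_42")

splits = {k: split_of(k) for k in ("user_1", "user_2", "user_3")}
```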
Q: Spark ML vs sklearn Pipeline: what are the differences?
A: Spark ML: distributed, lazy evaluation, operates on DataFrames, not limited by single-machine memory. sklearn: single-machine, eager, numpy/pandas. Spark ML is preferable for >1TB of data.
4. Additional Questions¶
Medium¶
Q: What is a feature store and why is it needed?
A: A feature store is a centralized repository of ML features. It solves: (1) consistency between train and serve (the same transformations), (2) reusability (different models share common features), (3) time travel (historical reconstruction of features as of a given date), (4) real-time features (streaming features with low latency). Examples: Feast (open source), Tecton, Databricks Feature Store.
Q: How do you prevent train-serving skew?
A: Train-serving skew is when features at training time differ from serving time. Causes: (1) different code for offline/online transformations, (2) stale features at inference, (3) data pipeline bugs. Solutions: (1) a single pipeline for train and serve (feature store), (2) monitoring distribution drift, (3) shadow mode comparing online vs offline predictions, (4) data contracts for input validation.
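Solution (1) in one line of design: the offline and online paths import the same transformation function instead of re-implementing it twice. A sketch with hypothetical feature names:

```python
# One transformation, imported by BOTH the training job and the
# online service: re-implementing it twice is how skew creeps in.
def make_features(raw: dict) -> dict:
    return {
        "amount_log_bucket": min(int(raw["amount"]).bit_length(), 20),
        "is_weekend": raw["day_of_week"] in (5, 6),
    }

# Offline (batch training) and online (request time) call the same code:
train_row = make_features({"amount": 1024, "day_of_week": 6})
serve_row = make_features({"amount": 1024, "day_of_week": 6})
assert train_row == serve_row  # identical by construction
```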
Q: What partitioning strategies exist in PySpark for skewed data?
A: (1) Salting: add a random prefix to the key to spread it out, (2) Adaptive Query Execution (AQE) in Spark 3+: automatic coalesce/split of partitions, (3) Broadcast join for the small table (<10MB by default), (4) Custom partitioner with repartition by a computed key, (5) Two-stage aggregation: partial aggregate per partition, then a global merge.
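Salting plus two-stage aggregation, sketched without Spark: a hot key is split into N salted sub-keys, partially aggregated, then merged back (the event counts are hypothetical):

```python
import random
from collections import Counter

events = ["hot_key"] * 1000 + ["rare_key"] * 3  # heavily skewed key

# Stage 1: salt the key into 4 sub-keys so no single "partition"
# receives all 1000 records of the hot key.
SALTS = 4
salted = Counter(f"{k}#{random.randrange(SALTS)}" for k in events)

# Stage 2: strip the salt and merge the partial aggregates globally.
merged = Counter()
for salted_key, count in salted.items():
    merged[salted_key.split("#")[0]] += count

assert merged["hot_key"] == 1000 and merged["rare_key"] == 3
```

In Spark the same shape appears as `concat(key, lit("#"), rand-bucket)`, a first `groupBy` on the salted key, then a second `groupBy` on the original key.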
Killer¶
Q: Design a feature computation system for a real-time ML model.
A: Architecture: (1) Batch features: ETL pipeline -> Feature Store (daily/hourly), (2) Streaming features: Kafka -> Flink/Spark Streaming -> Feature Store (seconds), (3) Serving: online feature store (Redis/DynamoDB) with low-latency reads, (4) Point-in-time correctness: at training time, join features by timestamp, not by latest value, (5) Monitoring: feature freshness, missing rate, distribution drift. Key decisions: event time vs processing time, late-arrival handling, exactly-once semantics.
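Point (4), the point-in-time join, in a pure-Python sketch: for each training label we take the newest feature value whose timestamp is not later than the label's timestamp (timestamps and values are hypothetical):

```python
import bisect

# Feature history for one entity: (timestamp, value), sorted by time.
history = [(100, 0.1), (200, 0.5), (300, 0.9)]
timestamps = [ts for ts, _ in history]

def feature_as_of(ts: int):
    """Latest feature value with timestamp <= ts (point-in-time join)."""
    i = bisect.bisect_right(timestamps, ts)
    return history[i - 1][1] if i else None

# A label observed at t=250 must see the t=200 value; joining on the
# "latest" t=300 value would be temporal leakage.
assert feature_as_of(250) == 0.5
assert feature_as_of(99) is None  # no feature existed yet
```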
5. Data Lakehouse (Delta Lake, Iceberg, Hudi)¶
Basic¶
Q: What is a Data Lakehouse?
A: Data Lakehouse = Data Lake (cheap storage, any format) + Data Warehouse (ACID, schema, queries). A unified architecture combining the best of both worlds.
Key features:
- ACID transactions on object storage (S3, GCS, ADLS)
- Schema enforcement and evolution
- Time travel (query historical data versions)
- Incremental processing (upserts, deletes)
- BI + ML workloads on the same data
Q: What are the main table formats for a lakehouse?
A:
| Format | Creator | Ecosystem | Key Strength |
|--------|---------|-----------|--------------|
| Delta Lake | Databricks | Spark-native | Best Spark integration, mature |
| Apache Iceberg | Netflix | Multi-engine | Vendor-neutral, widest support |
| Apache Hudi | Uber | Incremental | Best for CDC, streaming upserts |
Medium¶
Q: Delta Lake vs Iceberg vs Hudi: when to use which?
A:
Delta Lake:
- Best for: Databricks environments, Spark-heavy workloads
- Pros: Mature, great Spark integration, Unity Catalog
- Cons: Databricks-centric (though open-sourced)
Apache Iceberg:
- Best for: Multi-engine environments, vendor-neutral architecture
- Pros: Works with Spark, Flink, Trino, Snowflake; excellent schema evolution
- Cons: Newer ecosystem than Delta
Apache Hudi:
- Best for: CDC (Change Data Capture), incremental processing
- Pros: Built for streaming upserts, record-level updates
- Cons: Higher operational complexity
2025-2026 trend: Iceberg is gaining momentum due to multi-engine flexibility and broad vendor adoption (Snowflake, Databricks).
Q: How does ACID work in a data lakehouse?
A:
Transaction log approach:
1. All changes are written to a transaction log (JSON/Avro manifests)
2. Readers use snapshot isolation and see a consistent view
3. Writers use optimistic concurrency: conflicting commits are detected at commit time and retried
4. Atomic commit: the log entry and data files land together
Delta Lake:
- `_delta_log/00000000000000000000.json`: transaction log entries
- Each commit = add/remove file operations
Iceberg:
- `metadata/snapshots/`: snapshot references
- `metadata/manifests/`: file lists with stats
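The transaction-log idea in a toy pure-Python model: an append-only log of add/remove file operations, where a reader pins a snapshot (a log length) and keeps a consistent view while writers keep committing:

```python
class ToyTable:
    """Append-only commit log; readers pin a snapshot version."""

    def __init__(self):
        self.log = []  # each commit: list of ("add" | "remove", filename)

    def commit(self, ops):
        self.log.append(ops)  # atomic: one log entry per commit
        return len(self.log)  # new snapshot version

    def files(self, version=None):
        """Replay the log up to `version` to get the visible files."""
        live = set()
        upto = version if version is not None else len(self.log)
        for entry in self.log[:upto]:
            for op, f in entry:
                live.add(f) if op == "add" else live.discard(f)
        return live

t = ToyTable()
v1 = t.commit([("add", "part-0.parquet")])
reader_snapshot = v1  # reader pins version 1
t.commit([("remove", "part-0.parquet"), ("add", "part-1.parquet")])  # compaction

# The pinned reader still sees the old, consistent view:
assert t.files(reader_snapshot) == {"part-0.parquet"}
assert t.files() == {"part-1.parquet"}
```

Delta's `_delta_log` and Iceberg's snapshot/manifest files play exactly the role of `self.log` here, just persisted on object storage.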
Q: What is schema evolution in a lakehouse?
A:
The ability to modify a table's schema without rewriting data:
- Add columns (backward compatible)
- Rename columns (with column mapping)
- Change column types (with validation)
- Drop columns (soft delete)
Python example (Delta Lake):
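A toy pure-Python model of the mechanics (not the Delta API; illustrative only): evolution changes only table metadata, and files written under an old schema are read with NULL for columns added later:

```python
# Toy model of metadata-only schema evolution (NOT the Delta API).
schema_versions = [["id", "name"]]  # v0
data_files = [{"schema": 0, "rows": [{"id": 1, "name": "a"}]}]

# ADD COLUMN: append a new schema version; no data files are rewritten.
schema_versions.append(schema_versions[-1] + ["country"])  # v1

def read_all():
    """Read every file under the LATEST schema; missing columns -> None."""
    latest = schema_versions[-1]
    out = []
    for f in data_files:
        for row in f["rows"]:
            out.append({col: row.get(col) for col in latest})
    return out

# Old files are still readable: the new column is simply NULL.
assert read_all() == [{"id": 1, "name": "a", "country": None}]
```

In Delta Lake the equivalent is `ALTER TABLE ... ADD COLUMNS` (or `mergeSchema` on write); renames without rewrite additionally require column mapping.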
Killer¶
Q: Design an ML data pipeline on a lakehouse architecture.
A:
Architecture:
```
Raw Zone (Bronze)  →  Curated Zone (Silver)  →  Feature Store (Gold)
      ↓                      ↓                        ↓
Iceberg tables         Delta tables             Iceberg/Delta
Full history           Cleaned + joined         ML features
```
Key decisions:
1. Format choice: Iceberg for raw (multi-engine), Delta for curated (Spark-native)
2. Partitioning: by date + event_type for query performance
3. Compaction: auto-optimize small files (Z-order for Delta)
4. Time travel: train ML models on point-in-time data
ML-specific features:
```python
# Point-in-time join for training
features_df = spark.read.format("iceberg") \
    .option("snapshot-id", snapshot_id) \
    .load("catalog.db.features")

# Incremental feature computation
new_events = spark.read.format("delta") \
    .option("startingVersion", last_processed_version) \
    .table("events")
```
Considerations:
- VACUUM for cost optimization (old file cleanup)
- Compaction for query performance (the small-files problem)
- Schema registry integration for data contracts
6. dbt & Data Transformation Tools¶
Basic¶
Q: What is dbt and why is it needed?
A:
dbt (data build tool) transforms data via SQL/Python code with version control, testing, and documentation.
Key advantages:
- SQL-first: analysts write transformations in SQL
- Version control: Git for all transformations
- Testing: automated data tests
- Documentation: auto-generated docs
- Lineage: a DAG of dependencies between models
ELT vs ETL:
| Approach | Order | Tool Example |
|----------|-------|--------------|
| ETL | Extract → Transform → Load | Informatica, Talend |
| ELT | Extract → Load → Transform | dbt, Spark SQL |
dbt focus: Transform (the last step of ELT)
Q: What is a dbt model?
A:
A model is a SQL/Python file that defines a data transformation.
```sql
-- models/stg_customers.sql
{{ config(materialized='table') }}

SELECT
    customer_id,
    UPPER(name) AS name,
    email,
    created_at
FROM {{ source('raw', 'customers') }}
WHERE deleted_at IS NULL
```
Materialization types:
| Type | Description | When to Use |
|------|-------------|-------------|
| view | CREATE VIEW | Simple transformations, always fresh |
| table | CREATE TABLE | Heavy transformations, queried often |
| incremental | INSERT/UPDATE | Large data, append-only |
| ephemeral | CTE (no object) | Reusable logic |
Medium¶
Q: How does incremental materialization work?
A:
Incremental materialization updates only new/changed rows instead of a full rebuild.
```sql
-- models/fct_orders.sql
{{ config(
    materialized='incremental',
    unique_key='order_id',
    incremental_strategy='merge'
) }}

SELECT * FROM {{ source('raw', 'orders') }}
{% if is_incremental() %}
WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}
```
Incremental strategies:
| Strategy | Description | Supported DB |
|----------|-------------|--------------|
| append | INSERT only | All |
| merge | UPSERT | Snowflake, BigQuery, Databricks |
| delete+insert | DELETE then INSERT | Postgres, Snowflake |
| insert_overwrite | Partition replacement | Spark, BigQuery |
When to use incremental:
- Large dataset (>10M rows)
- High-frequency updates
- Expensive transformations
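The `is_incremental()` pattern above is just high-watermark filtering plus an upsert on the unique key; the same logic in a pure-Python sketch (rows are hypothetical):

```python
# Existing target table, keyed by order_id.
target = {1: {"order_id": 1, "status": "paid", "updated_at": 10}}
watermark = max(r["updated_at"] for r in target.values())  # 10

source = [
    {"order_id": 1, "status": "refunded", "updated_at": 12},  # changed
    {"order_id": 2, "status": "new", "updated_at": 15},       # new
    {"order_id": 3, "status": "old", "updated_at": 5},        # already loaded
]

# WHERE updated_at > watermark ... then MERGE on the unique_key.
for row in source:
    if row["updated_at"] > watermark:
        target[row["order_id"]] = row  # upsert

assert target[1]["status"] == "refunded" and 2 in target and 3 not in target
```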
Q: What types of tests exist in dbt?
A:
Built-in tests:
| Test | Description |
|------|-------------|
| unique | No duplicates |
| not_null | No NULL values |
| accepted_values | Value in list |
| relationships | Foreign key valid |
Schema YAML:
```yaml
version: 2
models:
  - name: customers
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['active', 'inactive', 'pending']
      - name: email
        tests:
          - dbt_utils.is_email
```
Custom tests: SQL files in `tests/` whose query returns the failing rows; the test passes when the query returns zero rows.
Q: dbt macros: what are they and why?
A:
A macro is a reusable Jinja2 function that generates SQL.
```sql
-- macros/date_spine.sql
{% macro date_spine(start_date, end_date) %}
WITH dates AS (
    SELECT {{ start_date }} AS date
    UNION ALL
    SELECT DATEADD(day, 1, date) FROM dates
    WHERE date < {{ end_date }}
)
SELECT * FROM dates
{% endmacro %}

-- Usage
{{ date_spine("'2024-01-01'", "'2024-12-31'") }}
```
Common use cases:
- Reusable transformations
- Database-specific SQL (portability)
- Encapsulation of complex logic
- Package functions (dbt_utils, dbt_expectations)
Killer¶
Q: Design a dbt pipeline for an ML feature store.
A:
Architecture:
```
Raw Layer (Bronze)
  stg_events, stg_users, stg_transactions
        ↓
Intermediate Layer (Silver)
  int_user_features, int_product_features, int_session_features
        ↓
Features Layer (Gold)
  fct_user_features, fct_product_features, dim_users
        ↓
ML Feature Store
  Feature ingestion API → Online Store (Redis)
```
Project structure:
```
models/
├── staging/
│   ├── stg_events.sql
│   └── stg_users.sql
├── intermediate/
│   ├── int_user_features.sql
│   └── int_session_features.sql
├── marts/
│   ├── fct_user_features.sql
│   └── dim_users.sql
└── ml/
    └── ml_feature_table.sql  -- Export for ML
```
ML feature model:
```sql
-- models/ml/ml_feature_table.sql
{{ config(
    materialized='incremental',
    unique_key='user_id',
    incremental_strategy='merge'
) }}

WITH user_features AS (
    SELECT * FROM {{ ref('fct_user_features') }}
    {% if is_incremental() %}
    WHERE feature_timestamp > (SELECT MAX(feature_timestamp) FROM {{ this }})
    {% endif %}
),

aggregated AS (
    SELECT
        user_id,
        feature_timestamp,
        -- Rolling features
        SUM(order_value_7d) AS total_value_7d,
        AVG(session_duration_30d) AS avg_session_duration,
        COUNT(DISTINCT product_id_7d) AS unique_products,
        -- Feature metadata
        CURRENT_TIMESTAMP AS _loaded_at
    FROM user_features
    GROUP BY user_id, feature_timestamp
)

SELECT * FROM aggregated
```
Tests for ML features: `not_null`/`unique` on `user_id` and `feature_timestamp`, range checks on numeric features, and freshness checks on `_loaded_at`.
Q: dbt vs Spark for data transformations: when to use which?
A:
| Criterion | dbt | Spark |
|-----------|-----|-------|
| Primary language | SQL | Python/Scala/SQL |
| Target users | Analysts, Analytics Engineers | Data Engineers, Data Scientists |
| Scale | GB to low TB | TB to PB |
| Transformation complexity | SQL-based | Complex custom logic |
| Compute | Data warehouse | Distributed cluster |
| Cost model | Pay per query | Cluster + storage |
When dbt:
- SQL transformations only
- Data warehouse native (Snowflake, BigQuery)
- Analyst-driven development
- Version control and testing needed
When Spark:
- Complex Python logic (ML, UDFs)
- Very large data (>10TB)
- Multi-source joins with complex logic
- Streaming data
Hybrid approach (2025-2026):
- dbt for SQL transforms → export to Parquet
- Spark for ML features → write to the Feature Store
- dbt for final aggregations and reporting
7. Airflow & Workflow Orchestration¶
Basic¶
Q: What is Apache Airflow and why is it needed?
A:
Apache Airflow is a platform for programmatically authoring, scheduling, and monitoring workflows.
Key concepts:
- DAG (Directed Acyclic Graph): the workflow definition with tasks and dependencies
- Task: a unit of work (Operator, Sensor)
- Operator: a task type (PythonOperator, BashOperator, etc.)
- Scheduler: the daemon that launches tasks
- Executor: how tasks run (LocalExecutor, CeleryExecutor, KubernetesExecutor)
Airflow principles:
- DAGs as code (Python)
- Idempotent tasks
- Built-in retry logic
- Rich monitoring UI
Q: What types of Operators exist in Airflow?
A:
| Operator | Description | Use Case |
|----------|-------------|----------|
| PythonOperator | Execute a Python function | Custom logic, ML training |
| BashOperator | Execute a bash command | Shell scripts, CLI tools |
| SqlOperator | Execute a SQL query | Database operations |
| DockerOperator | Run a Docker container | Isolated environments |
| KubernetesPodOperator | Run a K8s pod | Scalable ML workloads |
| Sensor | Wait for a condition | File arrival, external trigger |
```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG('ml_training',
         start_date=datetime(2025, 1, 1),
         schedule='@daily') as dag:

    def train_model(**context):
        # ML training logic
        pass

    # Note: provide_context is gone in Airflow 2; context is passed automatically.
    train_task = PythonOperator(
        task_id='train_model',
        python_callable=train_model,
    )
```
Medium¶
Q: How do you structure an ML pipeline in Airflow?
A:
Typical ML DAG:
```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.external_task import ExternalTaskSensor

default_args = {
    'owner': 'ml-team',
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
    'execution_timeout': timedelta(hours=2),
}

with DAG(
    'ml_training_pipeline',
    default_args=default_args,
    start_date=datetime(2025, 1, 1),
    schedule='0 2 * * *',   # 2 AM daily
    max_active_runs=1,      # Prevent parallel runs
    catchup=False,
) as dag:

    # Wait for the data pipeline to complete
    wait_for_data = ExternalTaskSensor(
        task_id='wait_for_data',
        external_dag_id='data_pipeline',
        external_task_id='publish_features',
        timeout=3600,
        mode='reschedule',  # Free the worker slot while waiting
    )

    def validate_data(**context):
        ds = context['ds']
        # Data quality checks
        pass

    def train_model(**context):
        run_id = context['run_id']
        # Model training
        pass

    def evaluate_model(**context):
        # Model evaluation
        pass

    def deploy_model(**context):
        # Model deployment
        pass

    validate = PythonOperator(task_id='validate_data', python_callable=validate_data)
    train = PythonOperator(task_id='train_model', python_callable=train_model)
    evaluate = PythonOperator(task_id='evaluate_model', python_callable=evaluate_model)
    deploy = PythonOperator(task_id='deploy_model', python_callable=deploy_model)

    wait_for_data >> validate >> train >> evaluate >> deploy
```
Q: Airflow vs Dagster vs Prefect: when to use which?
A:
| Feature | Airflow | Dagster | Prefect |
|---------|---------|---------|---------|
| Paradigm | DAG-centric | Asset-centric | Flow-centric |
| Data awareness | Low | High (data lineage) | Medium |
| Type safety | None | Strong (ops/assets) | Python typing |
| Testing | Manual | Built-in fixtures | Built-in |
| Local dev | Requires DB | First-class | First-class |
| Best for | Batch pipelines | Data platforms | Modern ETL |
When Airflow:
- Existing Airflow infrastructure
- Large community and plugin ecosystem
- Simple batch scheduling
When Dagster:
- Data-aware pipelines
- Strong testing requirements
- Asset-based architecture
When Prefect:
- Python-native workflows
- Hybrid execution (local/cloud)
- Event-driven pipelines
Q: How do you handle backfill and catchup?
A:
Backfill scenarios:
1. New DAG deployed: historical data needs to be processed
2. Bug fixed: rerun the affected periods
3. Late-arriving data: reprocess
Approaches:
```python
from datetime import datetime

from airflow import DAG
from airflow.models.param import Param

# Option 1: catchup=True in the DAG
with DAG('my_dag', catchup=True, start_date=datetime(2024, 1, 1)):
    ...

# Option 2: Manual backfill via CLI
# airflow dags backfill my_dag -s 2024-01-01 -e 2024-12-31

# Option 3: Backfill DAG pattern
with DAG(
    'backfill_dag',
    params={
        'start_date': Param(type='string', default=''),
        'end_date': Param(type='string', default=''),
    },
    schedule=None,  # Manual trigger only
) as dag:

    def backfill_task(**context):
        start = context['params']['start_date']
        end = context['params']['end_date']
        # Process the date range
        pass
```
Best practices:
- Set `catchup=False` for production DAGs
- Use dedicated backfill DAGs for large backfills
- Monitor resource usage during backfill
- Consider parallelism limits
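The date range a backfill walks over is easy to generate; a stdlib sketch of the daily logical dates (`ds`) between two bounds, each then processed by one idempotent run:

```python
from datetime import date, timedelta

def daily_partitions(start: date, end: date) -> list[str]:
    """All logical dates (ds) a backfill must cover, inclusive."""
    days = (end - start).days
    return [(start + timedelta(days=i)).isoformat() for i in range(days + 1)]

runs = daily_partitions(date(2024, 1, 1), date(2024, 1, 5))
# Each ds maps to one idempotent DAG run, so a partial backfill
# can be safely re-triggered for the dates that failed.
```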
Killer¶
Q: Design a production ML platform on Airflow.
A:
Architecture:
```mermaid
graph TD
    subgraph "Airflow Cluster"
        WEB[Web UI]
        SCHED[Scheduler]
        TRIG[Triggerer]
        subgraph "KubernetesExecutor"
            P1[Pod 1: Train]
            P2[Pod 2: Evaluate]
            P3[Pod 3: Deploy]
        end
        SCHED --> P1
        SCHED --> P2
        SCHED --> P3
    end
    P1 --> FS[Feature Store]
    P2 --> MR[Model Registry]
    P3 --> AS[Artifact Storage]
    style WEB fill:#e8eaf6,stroke:#3f51b5
    style SCHED fill:#e8eaf6,stroke:#3f51b5
    style TRIG fill:#e8eaf6,stroke:#3f51b5
    style P1 fill:#e8f5e9,stroke:#4caf50
    style P2 fill:#e8f5e9,stroke:#4caf50
    style P3 fill:#e8f5e9,stroke:#4caf50
    style FS fill:#fff3e0,stroke:#ef6c00
    style MR fill:#fff3e0,stroke:#ef6c00
    style AS fill:#fff3e0,stroke:#ef6c00
```
Key configurations:
```ini
# airflow.cfg for production
[core]
executor = KubernetesExecutor
parallelism = 100
max_active_runs_per_dag = 3

[kubernetes]
worker_service_account = airflow-worker
image_pull_secrets = docker-registry-secret
namespace = airflow

[scheduler]
scheduler_heartbeat_sec = 5
min_file_process_interval = 30
catchup_by_default = False
```
DAG patterns:
```python
# ML training task with resource limits
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator
from kubernetes.client import models as k8s

train_task = KubernetesPodOperator(
    task_id='train_model',
    name='ml-training-{{ ds_nodash }}',
    image='ml-training:latest',
    cmds=['python', 'train.py'],
    arguments=['--date', '{{ ds }}'],
    container_resources=k8s.V1ResourceRequirements(
        requests={'memory': '8Gi', 'cpu': '4'},
        limits={'memory': '16Gi', 'cpu': '8', 'nvidia.com/gpu': '1'},
    ),
    in_cluster=True,  # no kubeconfig file needed when running in-cluster
    get_logs=True,
)
```
Monitoring & Alerting:
```python
# A failure callback is a plain callable; Airflow passes the task context.
import mlflow

def ml_failure_callback(context):
    task_id = context['task_instance'].task_id
    dag_id = context['dag'].dag_id
    execution_date = context['execution_date']
    exception = context['exception']

    # Send an alert to Slack/PagerDuty (send_alert is user-defined)
    send_alert(
        f"ML Pipeline Failed: {dag_id}.{task_id}",
        f"Time: {execution_date}\nError: {exception}"
    )

    # Log to MLflow
    mlflow.log_param("status", "failed")
    mlflow.log_param("error", str(exception))

default_args = {
    'on_failure_callback': ml_failure_callback,
    'on_retry_callback': ml_failure_callback,
}
```
Q: How do you scale Airflow to 1000+ DAGs?
A:
Scaling strategies:
| Component | Scaling Approach |
|-----------|------------------|
| Scheduler | Multiple schedulers (Airflow 2.0+) |
| Executor | CeleryExecutor or KubernetesExecutor |
| Database | Connection pooling, read replicas |
| Web UI | Separate webserver, caching |
Best practices for large deployments:
DAG parsing optimization:
- Keep top-level DAG code cheap: no network calls or heavy imports at parse time
- Raise min_file_process_interval and use .airflowignore for non-DAG files
Resource isolation:
- Separate DAGs by priority (queue parameter)
- Use pool limits for expensive tasks
- KubernetesExecutor for per-task isolation
Monitoring:
- Scheduler lag metrics
- DAG parse time metrics
- Task duration outliers
- Database connection pool usage
Multi-team governance:
- Per-team queues/namespaces and RBAC roles
- DAG ownership via owner and tags, with CI checks on DAG repositories
Misconception: 99% accuracy on test = a great model
99% accuracy with 60% in production = data leakage. Check: (1) scaler/imputer fit before the split? (2) random split for a time series? (3) entity overlap between train and test? (4) features unavailable in production? 95% of cases come down to one of these four causes.
Misconception: a Spark shuffle is just a slow operation
A shuffle redistributes data between executors over the network with disk I/O. With skewed data, one executor receives 90% of the data -> OOM. Solutions: salting (random prefix on the key), AQE (Spark 3+), broadcast join (<10MB). Interviewers expect concrete strategies, not "avoid shuffles."
Misconception: Delta Lake and Iceberg are the same thing
Delta Lake: Databricks-centric, best Spark integration. Iceberg: vendor-neutral, works with Spark/Flink/Trino/Snowflake. Hudi: streaming upserts, CDC. 2025-2026 trend: Iceberg is gaining momentum thanks to multi-engine flexibility and broad vendor adoption.
Interview: Answer Format¶
Data Leakage¶
Red flag: "Data leakage is when test data ends up in train"
Strong answer: "There are 4 types of leakage: target (a feature correlates with the target but is unavailable in production), train-test contamination (scaler fit on the whole dataset), temporal (future data in train), group (the same customer in train and test). Detection: correlation > 0.95 with the target, ID overlap checks, temporal ordering. Prevention: Pipeline, temporal split, GroupKFold."
Spark¶
Red flag: "RDD and DataFrame are the same thing, just different APIs"
Strong answer: "A DataFrame with the Catalyst optimizer is 10-100x faster than RDDs for structured data. Lazy evaluation builds a DAG and optimizes the execution plan. Optimization: cache() for frequently used DataFrames, broadcast join for small tables, repartition for skewed data, coalesce to reduce partitions. ML Pipeline: StringIndexer -> VectorAssembler -> Model."
Data Lakehouse¶
Red flag: "A lakehouse is just a data lake with SQL"
Strong answer: "Lakehouse = Data Lake (cheap storage, any format) + Data Warehouse (ACID, schema, SQL). Delta Lake/Iceberg/Hudi provide ACID via a transaction log. Time travel enables point-in-time ML training. Schema evolution without rewriting data. Medallion architecture: Bronze (raw) -> Silver (cleaned) -> Gold (features)."
See Also¶
- MLOps Interview Q&A -- model serving, CI/CD, monitoring, experiment tracking
- AI Agents Interview Q&A -- ReAct, multi-agent, memory systems
- Data Engineering Materials -- preparation resources
- Feature Stores Comparison -- Feast vs Hopsworks vs Redis
- Classical ML Interview Q&A -- связанные вопросы по ML