Data Engineering: Interview Preparation

~13 min read

Prerequisites: Data engineering study materials | Classical ML

Data engineering is a must-have competency for ML roles: per Indeed (2025), 89% of ML job postings require SQL/Spark and 67% require an understanding of data leakage and feature stores. This section covers 7 major topics: Data Leakage, Spark (RDD/DataFrame), Feature Stores, Data Lakehouse (Delta/Iceberg/Hudi), dbt, Airflow and workflow orchestration. Questions come at three levels: Basic, Medium, Killer.

Updated: 2026-02-11


1. Data Leakage

Basic

Q: What is data leakage in ML?

A: The model receives information during training that will not be available in production. This inflates train metrics but causes failure in production.

Q: Name the types of data leakage.

A: (1) Target leakage: a feature correlates with the target but is unknown at prediction time, (2) Train-test contamination: information from test leaks into train, (3) Temporal leakage: future data ends up in train, (4) Group leakage: duplicate entities appear in both train and test.

Medium

Q: How do you prevent data leakage during feature engineering?

A: (1) Split before any preprocessing, (2) use a Pipeline, (3) fit only on train, (4) use a temporal split for time series, (5) use cross-validation that respects temporal ordering.
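
Python example (a minimal leak-free sketch; assumes a feature matrix X and target y are already loaded, and the model choice is illustrative):

from sklearn.model_selection import train_test_split, TimeSeriesSplit, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Split FIRST, before any statistics are computed
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipe = Pipeline([
    ("scaler", StandardScaler()),   # fit inside each CV fold only, never on test
    ("clf", LogisticRegression()),
])

# For time series, swap the default KFold for a temporal splitter
scores = cross_val_score(pipe, X_train, y_train, cv=TimeSeriesSplit(n_splits=5))
pipe.fit(X_train, y_train)          # final fit on train only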

Q: An example of target leakage?

A: In fraud detection: the feature "transaction_reversed" is True for fraud and False for normal transactions. In production, at prediction time, it is not yet known whether a reversal will happen.

Q: How do you detect data leakage?

A: (1) Check feature-target correlations (> 0.95 is suspicious), (2) check train/test ID overlap, (3) compare train vs test distributions, (4) validate on a hold-out set after all experiments.
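
Python example (a quick audit sketch; the train/test DataFrames and the "id", "target", "amount" columns are assumptions):

# (1) Suspiciously high feature-target correlation
corr = train.corr(numeric_only=True)["target"].abs().drop("target").sort_values(ascending=False)
print(corr[corr > 0.95])

# (2) Entity overlap between train and test
overlap = set(train["id"]) & set(test["id"])
print(f"Overlapping ids: {len(overlap)}")

# (3) Crude distribution comparison for a single feature
print(train["amount"].describe())
print(test["amount"].describe())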

Killer

Q: A model shows 99% accuracy, but only 60% in production. What do you do?

A: (1) Check for data leakage: was anything fit on the full dataset? (2) Check for temporal leakage: was the split correct? (3) Check for distribution shift, (4) check feature availability in production, (5) run an A/B test against a baseline.


2. Spark (RDD/DataFrame)

Basic

Q: RDD vs DataFrame?

A: RDD is low-level, type-safe, with manual optimization. DataFrame is high-level and SQL-like, with the Catalyst optimizer and lazy evaluation. DataFrame is preferred for ML.

Q: What is lazy evaluation in Spark?

A: Transformations are not executed immediately. Spark builds a DAG (directed acyclic graph) and executes it only on an action (collect, count, save).
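
Python example (the path is a placeholder):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lazy-demo").getOrCreate()

df = spark.read.parquet("s3://bucket/events/")   # defines the source; no full scan yet
filtered = df.filter(F.col("amount") > 0)        # transformation: nothing runs
counts = filtered.groupBy("user_id").count()     # the DAG keeps growing
counts.show(5)                                   # action: the plan is optimized and executed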

Medium

Q: How does a Spark ML Pipeline work?

A: A sequence of stages: (1) Transformers (StringIndexer, OneHotEncoder), (2) Estimators (models), (3) Pipeline.fit() trains all stages, (4) the fitted model's transform() applies them to data.
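
Python example (column names are illustrative; train_df/test_df are assumed to exist):

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.classification import LogisticRegression

indexer = StringIndexer(inputCol="country", outputCol="country_idx")
encoder = OneHotEncoder(inputCols=["country_idx"], outputCols=["country_ohe"])
assembler = VectorAssembler(inputCols=["country_ohe", "amount"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[indexer, encoder, assembler, lr])
model = pipeline.fit(train_df)            # fits every stage in order
predictions = model.transform(test_df)    # applies the fitted stages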

Q: How do you optimize a Spark job?

A: (1) Cache frequently used DataFrames, (2) broadcast join for small tables, (3) repartition for skewed data, (4) coalesce to reduce the number of partitions, (5) use the DataFrame API instead of RDD.
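
Python example (paths are placeholders; an active spark session is assumed):

from pyspark.sql import functions as F

events = spark.read.parquet("s3://bucket/events/").cache()   # reused below, keep in memory
countries = spark.read.parquet("s3://bucket/countries/")     # small lookup table

joined = events.join(F.broadcast(countries), "country_id")   # map-side join, avoids a shuffle
rebalanced = joined.repartition(200, "user_id")              # spread out a skewed key space
result = rebalanced.groupBy("user_id").agg(F.sum("amount").alias("total"))
result.coalesce(16).write.mode("overwrite").parquet("s3://bucket/out/")  # fewer output files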

Q: What is a shuffle in Spark?

A: Redistribution of data across partitions. An expensive operation, triggered by groupBy, join, distinct, sortBy. Minimize it for performance.

Killer

Q: Design a Spark ML pipeline for 1TB of data.

A: (1) Read from S3/HDFS in parallel, (2) feature engineering via the DataFrame API, (3) an ML Pipeline with VectorAssembler + model, (4) cache intermediate results, (5) partition by key for join efficiency, (6) write predictions partitioned by date, as sketched below.
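
Python example (sketch of step 6; the partition column and path are placeholders):

predictions.write.mode("overwrite") \
    .partitionBy("ds") \
    .parquet("s3://bucket/predictions/")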


3. Combined Questions

Killer

Q: How do you prevent data leakage in a distributed ML pipeline?

A: (1) Deterministic split by key (hash partitioning), (2) temporal ordering preserved across partitions, (3) fit transformers on train data only, (4) avoid computing global statistics (e.g. for imputation) before the split, (5) validate leakage detection on a sample.
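
Python example (a deterministic split sketch: hashing a stable entity key keeps each user on one side of the split regardless of partitioning or rerun order):

from pyspark.sql import functions as F

df = df.withColumn("bucket", F.abs(F.xxhash64("user_id")) % 10)
train = df.filter(F.col("bucket") < 8)    # ~80%
test = df.filter(F.col("bucket") >= 8)    # ~20%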

Q: Spark ML vs sklearn Pipeline: what are the differences?

A: Spark ML: distributed, lazy evaluation, works on DataFrames, not limited to single-machine memory. sklearn: single-machine, eager, numpy/pandas. Spark ML is preferred for data beyond ~1TB.


4. Additional Questions

Medium

Q: What is a feature store and why is it needed?

A: A feature store is a centralized repository of ML features. It solves: (1) consistency between train and serve (the same transformations everywhere), (2) reusability (different models share common features), (3) time travel (reconstructing feature values as of a given date), (4) real-time features (streaming features with low latency). Examples: Feast (open-source), Tecton, Databricks Feature Store.
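
Python example (a Feast sketch; the feature view and column names are assumptions):

from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Training: point-in-time correct join against historical feature values
training_df = store.get_historical_features(
    entity_df=entity_df,  # entity keys + an event_timestamp column
    features=["user_stats:total_value_7d", "user_stats:avg_session_duration"],
).to_df()

# Serving: low-latency read of the latest values from the online store
online = store.get_online_features(
    features=["user_stats:total_value_7d"],
    entity_rows=[{"user_id": 42}],
).to_dict()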

Q: How do you prevent train-serving skew?

A: Train-serving skew is when features at training time differ from features at serving time. Causes: (1) different code for offline/online transformations, (2) stale features at inference, (3) data pipeline bugs. Fixes: (1) a single pipeline for train and serve (feature store), (2) distribution drift monitoring, (3) shadow mode comparing online vs offline predictions, (4) data contracts for input validation.

Q: What partitioning strategies exist in PySpark for skewed data?

A: (1) Salting: add a random prefix to the key to spread it out (see the sketch below), (2) Adaptive Query Execution (AQE) in Spark 3+: automatic coalesce/split, (3) broadcast join for the small table (<10MB by default), (4) a custom partitioner via repartition on a computed key, (5) two-stage aggregation: partial aggregates per partition, then a global merge.
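
Python example (two-stage aggregation with salting; N and the column names are illustrative):

from pyspark.sql import functions as F

N = 16
salted = df.withColumn("salt", (F.rand() * N).cast("int"))  # spread each hot key over N buckets

partial = salted.groupBy("hot_key", "salt").agg(F.sum("amount").alias("partial_sum"))
final = partial.groupBy("hot_key").agg(F.sum("partial_sum").alias("total"))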

Killer

Q: Design a feature computation system for a real-time ML model.

A: Architecture: (1) Batch features: ETL pipeline -> Feature Store (daily/hourly), (2) streaming features: Kafka -> Flink/Spark Streaming -> Feature Store (seconds), (3) serving: online feature store (Redis/DynamoDB) with low-latency reads, (4) point-in-time correctness: at training time, join features by timestamp, not by latest value, (5) monitoring: feature freshness, missing rate, distribution drift. Key decisions: event time vs processing time, late-arrival handling, exactly-once semantics.
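
Python example (the streaming leg as a Structured Streaming sketch; the broker, topic, event_schema and the write_to_online_store sink helper are assumptions):

from pyspark.sql import functions as F

events = (spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
    .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*"))

features = (events
    .withWatermark("event_time", "10 minutes")               # bound late arrivals
    .groupBy("user_id", F.window("event_time", "5 minutes"))
    .agg(F.count("*").alias("events_5m")))

query = (features.writeStream
    .outputMode("update")
    .foreachBatch(write_to_online_store)                     # e.g. upsert into Redis
    .start())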


5. Data Lakehouse (Delta Lake, Iceberg, Hudi)

Basic

Q: What is a Data Lakehouse?

A: Data Lakehouse = Data Lake (cheap storage, any format) + Data Warehouse (ACID, schema, queries). Unified architecture combining best of both worlds.

Key features:
- ACID transactions on object storage (S3, GCS, ADLS)
- Schema enforcement and evolution
- Time travel (query historical data versions)
- Incremental processing (upserts, deletes)
- BI + ML workloads on same data

Q: What are the main table formats for a lakehouse?

A:

| Format | Creator | Ecosystem | Key Strength |
|--------|---------|-----------|--------------|
| Delta Lake | Databricks | Spark-native | Best Spark integration, mature |
| Apache Iceberg | Netflix | Multi-engine | Vendor-neutral, widest support |
| Apache Hudi | Uber | Incremental | Best for CDC, streaming upserts |

Medium

Q: Delta Lake vs Iceberg vs Hudi: when to use which?

A:

Delta Lake:
- Best for: Databricks environment, Spark-heavy workloads
- Pros: Mature, great Spark integration, Unity Catalog
- Cons: Databricks-centric (though open-sourced)

Apache Iceberg:
- Best for: Multi-engine environments, vendor-neutral architecture
- Pros: Works with Spark, Flink, Trino, Snowflake; excellent schema evolution
- Cons: Newer ecosystem than Delta

Apache Hudi:
- Best for: CDC (Change Data Capture), incremental processing
- Pros: Built for streaming upserts, record-level updates
- Cons: Higher operational complexity

2025-2026 trend: Iceberg gaining momentum due to multi-engine flexibility, Snowflake's native Iceberg support, and Databricks' acquisition of Tabular.

Q: How does ACID work in a data lakehouse?

A:

Transaction log approach:
1. All changes are written to a transaction log (JSON/Avro manifests)
2. Readers use snapshot isolation: they see a consistent view
3. Writers acquire locks on the log (optimistic concurrency)
4. Atomic commit: log entry + data files together

Delta Lake:
- _delta_log/00000000000000000000.json: transaction log entries
- Each commit = add/remove file operations

Iceberg:
- metadata/snapshots/: snapshot references
- metadata/manifests/: file lists with stats

Q: What is schema evolution in a lakehouse?

A:

The ability to modify a table schema without rewriting data:
- Add columns (backward compatible)
- Rename columns (with mapping)
- Change column types (with validation)
- Drop columns (soft delete)

Python example (Delta Lake):

# Add column
spark.sql("ALTER TABLE events ADD COLUMN new_field STRING")

# Schema evolution on write
df.write.format("delta").mode("append") \
    .option("mergeSchema", "true") \
    .save("/delta/events")

Killer

Q: Design an ML data pipeline on a lakehouse architecture.

A:

Architecture:

Raw Zone (Bronze) → Curated Zone (Silver) → Feature Store (Gold)
      ↓                    ↓                      ↓
  Iceberg tables      Delta tables          Iceberg/Delta
  Full history        Cleaned + joined      ML features

Key decisions:
1. Format choice: Iceberg for raw (multi-engine), Delta for curated (Spark-native)
2. Partitioning: by date + event_type for query performance
3. Compaction: auto-optimize small files (Z-order for Delta)
4. Time travel: train ML models on point-in-time data

ML-specific features:

# Point-in-time join for training
features_df = spark.read.format("iceberg") \
    .option("snapshot-id", snapshot_id) \
    .load("catalog.db.features")

# Incremental feature computation (Change Data Feed must be enabled on the table)
new_events = spark.read.format("delta") \
    .option("readChangeFeed", "true") \
    .option("startingVersion", last_processed_version) \
    .table("events")

Considerations:
- VACUUM for cost optimization (old file cleanup)
- Compaction for query performance (small files problem)
- Schema registry integration for data contracts


6. dbt & Data Transformation Tools

Basic

Q: What is dbt and why is it needed?

A:

dbt (data build tool) transforms data via SQL/Python code with version control, testing and documentation.

Key advantages:
- SQL-first: analysts write transformations in SQL
- Version control: Git for all transformations
- Testing: automated data tests
- Documentation: auto-generated docs
- Lineage: a DAG of dependencies between models

ELT vs ETL:

| Approach | Order | Tool Example |
|----------|-------|--------------|
| ETL | Extract → Transform → Load | Informatica, Talend |
| ELT | Extract → Load → Transform | dbt, Spark SQL |

dbt focus: Transform (the last step of ELT)

Q: What is a dbt model?

A:

A model is a SQL/Python file defining a data transformation.

-- models/stg_customers.sql
{{ config(materialized='table') }}

SELECT
    customer_id,
    UPPER(name) as name,
    email,
    created_at
FROM {{ source('raw', 'customers') }}
WHERE deleted_at IS NULL

Materialization types:

| Type | Description | When to Use |
|------|-------------|-------------|
| view | CREATE VIEW | Simple transformations, always fresh |
| table | CREATE TABLE | Heavy transformations, queried often |
| incremental | INSERT/UPDATE | Large data, append-only |
| ephemeral | CTE (no object) | Reusable logic |

Medium

Q: How does incremental materialization work?

A:

Incremental materialization updates only new/changed data instead of recomputing everything.

-- models/fct_orders.sql
{{ config(
    materialized='incremental',
    unique_key='order_id',
    incremental_strategy='merge'
) }}

SELECT * FROM {{ source('raw', 'orders') }}

{% if is_incremental() %}
    WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}

Incremental strategies:

| Strategy | Description | Supported DB |
|----------|-------------|--------------|
| append | INSERT only | All |
| merge | UPSERT | Snowflake, BigQuery, Databricks |
| delete+insert | DELETE then INSERT | Postgres, Snowflake |
| insert_overwrite | Partition replacement | Spark, BigQuery |

When to use incremental:
- Large dataset (>10M rows)
- High-frequency updates
- Expensive transformations

Q: What types of tests exist in dbt?

A:

Built-in tests:

| Test | Description |
|------|-------------|
| unique | No duplicates |
| not_null | No NULL values |
| accepted_values | Value in list |
| relationships | Foreign key valid |

Schema YAML:

version: 2

models:
  - name: customers
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['active', 'inactive', 'pending']
      - name: email
        tests:
          - not_null
          # format check: dbt_expectations.expect_column_values_to_match_regex

Custom tests:

-- tests/assert_customer_count.sql
SELECT COUNT(*) FROM {{ ref('customers') }}
WHERE created_at >= CURRENT_DATE - INTERVAL '30 days'
HAVING COUNT(*) < 100  -- Fail if less than 100 new customers

Q: dbt macros: what are they and why?

A:

A macro is a reusable Jinja2 function for SQL code.

-- macros/date_spine.sql
{% macro date_spine(start_date, end_date) %}
    WITH dates AS (
        SELECT {{ start_date }} as date
        UNION ALL
        SELECT DATEADD(day, 1, date)
        FROM dates
        WHERE date < {{ end_date }}
    )
    SELECT * FROM dates
{% endmacro %}

-- Usage
{{ date_spine("'2024-01-01'", "'2024-12-31'") }}

Common use cases:
- Reusable transformations
- Database-specific SQL (portability)
- Complex logic encapsulation
- Package functions (dbt_utils, dbt_expectations)

Killer

Q: Design a dbt pipeline for an ML feature store.

A:

Architecture:

Raw Layer (Bronze)
  stg_events, stg_users, stg_transactions
Intermediate Layer (Silver)
  int_user_features, int_product_features, int_session_features
Features Layer (Gold)
  fct_user_features, fct_product_features, dim_users
ML Feature Store
  Feature ingestion API → Online Store (Redis)

Project structure:

models/
├── staging/
│   ├── stg_events.sql
│   └── stg_users.sql
├── intermediate/
│   ├── int_user_features.sql
│   └── int_session_features.sql
├── marts/
│   ├── fct_user_features.sql
│   └── dim_users.sql
└── ml/
    └── ml_feature_table.sql  -- Export for ML

ML feature model:

-- models/ml/ml_feature_table.sql
{{ config(
    materialized='incremental',
    unique_key='user_id',
    incremental_strategy='merge'
) }}

WITH user_features AS (
    SELECT * FROM {{ ref('fct_user_features') }}
    {% if is_incremental() %}
        WHERE feature_timestamp > (SELECT MAX(feature_timestamp) FROM {{ this }})
    {% endif %}
),

aggregated AS (
    SELECT
        user_id,
        feature_timestamp,
        -- Rolling features
        SUM(order_value_7d) as total_value_7d,
        AVG(session_duration_30d) as avg_session_duration,
        COUNT(DISTINCT product_id_7d) as unique_products,

        -- Feature metadata
        CURRENT_TIMESTAMP as _loaded_at
    FROM user_features
    GROUP BY user_id, feature_timestamp
)

SELECT * FROM aggregated

Tests for ML features:

models:
  - name: ml_feature_table
    tests:
      - dbt_expectations.expect_table_row_count_to_be_between:
          min_value: 1000  # At least 1K users
    columns:
      - name: user_id
        tests:
          - unique
          - not_null
      - name: total_value_7d
        tests:
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0

Q: dbt vs Spark for data transformations: when to use which?

A:

| Criterion | dbt | Spark |
|-----------|-----|-------|
| Primary language | SQL | Python/Scala/SQL |
| Target users | Analysts, Analytics Engineers | Data Engineers, Data Scientists |
| Scale | GB to low TB | TB to PB |
| Transformation complexity | SQL-based | Complex custom logic |
| Compute | Data warehouse | Distributed cluster |
| Cost model | Pay per query | Cluster + storage |

When dbt:
- SQL transformations only
- Data warehouse native (Snowflake, BigQuery)
- Analyst-driven development
- Version control and testing needed

When Spark:
- Complex Python logic (ML, UDFs)
- Very large data (>10TB)
- Multi-source joins with complex logic
- Streaming data

Hybrid approach (2025-2026):
- dbt for SQL transforms → export to Parquet
- Spark for ML features → write to Feature Store
- dbt for final aggregations and reporting


7. Airflow & Workflow Orchestration

Basic

Q: What is Apache Airflow and why is it needed?

A:

Apache Airflow is a platform for programmatically authoring, scheduling and monitoring workflows.

Key concepts:
- DAG (Directed Acyclic Graph): a workflow description with tasks and dependencies
- Task: a unit of work (Operator, Sensor)
- Operator: a task type (PythonOperator, BashOperator, etc.)
- Scheduler: the daemon that launches tasks
- Executor: how tasks run (LocalExecutor, CeleryExecutor, KubernetesExecutor)

Airflow principles:
- DAGs as code (Python)
- Idempotent tasks
- Built-in retry logic
- Rich monitoring UI

Q: What types of Operators exist in Airflow?

A:

| Operator | Description | Use Case |
|----------|-------------|----------|
| PythonOperator | Execute Python function | Custom logic, ML training |
| BashOperator | Execute bash command | Shell scripts, CLI tools |
| SqlOperator | Execute SQL query | Database operations |
| DockerOperator | Run Docker container | Isolated environments |
| KubernetesPodOperator | Run K8s pod | Scalable ML workloads |
| Sensor | Wait for condition | File arrival, external trigger |

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

with DAG('ml_training', start_date=datetime(2025, 1, 1), schedule='@daily') as dag:

    def train_model(**context):
        # ML training logic
        pass

    train_task = PythonOperator(
        task_id='train_model',
        python_callable=train_model,  # the context is injected automatically in Airflow 2+
    )

Medium

Q: How do you organize an ML pipeline in Airflow?

A:

Typical ML DAG:

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.external_task import ExternalTaskSensor
from datetime import datetime, timedelta

default_args = {
    'owner': 'ml-team',
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
    'execution_timeout': timedelta(hours=2),
}

with DAG(
    'ml_training_pipeline',
    default_args=default_args,
    start_date=datetime(2025, 1, 1),
    schedule='0 2 * * *',  # 2 AM daily
    max_active_runs=1,  # Prevent parallel runs
    catchup=False,
) as dag:

    # Wait for data pipeline to complete
    wait_for_data = ExternalTaskSensor(
        task_id='wait_for_data',
        external_dag_id='data_pipeline',
        external_task_id='publish_features',
        timeout=3600,
        mode='reschedule',  # Free worker while waiting
    )

    def validate_data(**context):
        ds = context['ds']
        # Data quality checks
        pass

    def train_model(**context):
        # Model training
        run_id = context['run_id']
        pass

    def evaluate_model(**context):
        # Model evaluation
        pass

    def deploy_model(**context):
        # Model deployment
        pass

    validate = PythonOperator(task_id='validate_data', python_callable=validate_data)
    train = PythonOperator(task_id='train_model', python_callable=train_model)
    evaluate = PythonOperator(task_id='evaluate_model', python_callable=evaluate_model)
    deploy = PythonOperator(task_id='deploy_model', python_callable=deploy_model)

    wait_for_data >> validate >> train >> evaluate >> deploy

Q: Airflow vs Dagster vs Prefect: when to use which?

A:

| Feature | Airflow | Dagster | Prefect |
|---------|---------|---------|---------|
| Paradigm | DAG-centric | Asset-centric | Flow-centric |
| Data awareness | Low | High (data lineage) | Medium |
| Type safety | None | Strong (ops/assets) | Python typing |
| Testing | Manual | Built-in fixtures | Built-in |
| Local dev | Requires DB | First-class | First-class |
| Best for | Batch pipelines | Data platforms | Modern ETL |

When Airflow:
- Existing Airflow infrastructure
- Large community and plugins
- Simple batch scheduling

When Dagster:
- Data-aware pipelines
- Strong testing requirements
- Asset-based architecture

When Prefect:
- Python-native workflows
- Hybrid execution (local/cloud)
- Event-driven pipelines

Q: How do you handle backfill and catchup?

A:

Backfill scenarios:
1. New DAG deployed: need to run over historical data
2. Bug fixed: rerun affected periods
3. Late-arriving data: reprocess

Approaches:

# Option 1: catchup=True in DAG
with DAG('my_dag', catchup=True, start_date=datetime(2024, 1, 1)):
    ...

# Option 2: Manual backfill CLI
# airflow dags backfill my_dag -s 2024-01-01 -e 2024-12-31

# Option 3: Backfill DAG pattern
from airflow.models.param import Param

with DAG(
    'backfill_dag',
    params={
        'start_date': Param(type='string', default=''),
        'end_date': Param(type='string', default=''),
    },
    schedule=None,  # Manual trigger only
) as dag:

    def backfill_task(**context):
        start = context['params']['start_date']
        end = context['params']['end_date']
        # Process date range
        pass

Best practices:
- Set catchup=False for production DAGs
- Use dedicated backfill DAGs for large backfills
- Monitor resource usage during backfill
- Consider parallelism limits

Killer

Q: Design a production ML platform on Airflow.

A:

Architecture:

graph TD
    subgraph "Airflow Cluster"
        WEB[Web UI]
        SCHED[Scheduler]
        TRIG[Triggerer]
        subgraph "KubernetesExecutor"
            P1[Pod 1: Train]
            P2[Pod 2: Evaluate]
            P3[Pod 3: Deploy]
        end
        SCHED --> P1
        SCHED --> P2
        SCHED --> P3
    end
    P1 --> FS[Feature Store]
    P2 --> MR[Model Registry]
    P3 --> AS[Artifact Storage]

    style WEB fill:#e8eaf6,stroke:#3f51b5
    style SCHED fill:#e8eaf6,stroke:#3f51b5
    style TRIG fill:#e8eaf6,stroke:#3f51b5
    style P1 fill:#e8f5e9,stroke:#4caf50
    style P2 fill:#e8f5e9,stroke:#4caf50
    style P3 fill:#e8f5e9,stroke:#4caf50
    style FS fill:#fff3e0,stroke:#ef6c00
    style MR fill:#fff3e0,stroke:#ef6c00
    style AS fill:#fff3e0,stroke:#ef6c00

Key configurations:

# airflow.cfg for production
[core]
executor = KubernetesExecutor
parallelism = 100
max_active_runs_per_dag = 3

[kubernetes]
worker_service_account = airflow-worker
image_pull_secrets = docker-registry-secret
namespace = airflow

[scheduler]
scheduler_heartbeat_sec = 5
min_file_process_interval = 30
catchup_by_default = False

DAG patterns:

# ML Training DAG with resource limits
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator
from kubernetes.client import models as k8s

train_task = KubernetesPodOperator(
    task_id='train_model',
    name='ml-training-{{ ds_nodash }}',
    image='ml-training:latest',
    cmds=['python', 'train.py'],
    arguments=['--date', '{{ ds }}'],
    container_resources=k8s.V1ResourceRequirements(
        requests={'memory': '8Gi', 'cpu': '4'},
        limits={'memory': '16Gi', 'cpu': '8', 'nvidia.com/gpu': '1'},
    ),
    config_file='/path/to/kube/config',
    in_cluster=True,
    get_logs=True,
)

Monitoring & Alerting:

import mlflow  # used in the callback below

# A failure callback is a plain Python function wired in via default_args.

def ml_failure_callback(context):
    task_id = context['task_instance'].task_id
    dag_id = context['dag'].dag_id
    execution_date = context['execution_date']
    exception = context['exception']

    # Send alert to Slack/PagerDuty (send_alert is a project-specific helper)
    send_alert(
        f"ML Pipeline Failed: {dag_id}.{task_id}",
        f"Time: {execution_date}\nError: {exception}"
    )

    # Log to MLflow
    mlflow.log_param("status", "failed")
    mlflow.log_param("error", str(exception))

default_args = {
    'on_failure_callback': ml_failure_callback,
    'on_retry_callback': ml_failure_callback,
}

Q: How do you scale Airflow to 1000+ DAGs?

A:

Scaling strategies:

| Component | Scaling Approach |
|-----------|------------------|
| Scheduler | Multiple schedulers (Airflow 2.0+) |
| Executor | CeleryExecutor or KubernetesExecutor |
| Database | Connection pooling, read replicas |
| Web UI | Separate webserver, caching |

Best practices for large deployments:

1. DAG parsing optimization:

    # Use .airflowignore to skip unused files
    # Avoid top-level database queries in DAG files
    # Use lazy imports

    # Good: dynamic DAG generation
    for table in get_table_list():  # called at parse time, keep it cheap
        dag = create_dag_for_table(table)
        globals()[f'etl_{table}'] = dag

2. Resource isolation:
   - Separate DAGs by priority (queue parameter)
   - Use pool limits for expensive tasks
   - KubernetesExecutor for per-task isolation

3. Monitoring:
   - Scheduler lag metrics
   - DAG parse time metrics
   - Task duration outliers
   - Database connection pool usage

4. Multi-team governance:

    # DAG file structure
    dags/
    ├── team_a/
    │   ├── data_pipeline.py
    │   └── ml_training.py
    ├── team_b/
    │   └── etl_pipeline.py
    └── shared/
        └── utility_dags.py


Misconception: 99% test accuracy = a great model

99% accuracy with 60% in production = data leakage. Check: (1) was the scaler/imputer fit before the split? (2) a random split for time series? (3) entity overlap between train/test? (4) features unavailable in production? 95% of cases come down to one of these four causes.

Misconception: a Spark shuffle is just a slow operation

A shuffle redistributes data between executors over the network plus disk I/O. With skewed data one executor receives 90% of the data -> OOM. Fixes: salting (a random prefix on the key), AQE (Spark 3+), broadcast join (<10MB). Interviewers expect concrete strategies, not "avoid shuffle."

Misconception: Delta Lake and Iceberg are the same thing

Delta Lake = Databricks-centric, best Spark integration. Iceberg = vendor-neutral, works with Spark/Flink/Trino/Snowflake. Hudi = streaming upserts, CDC. 2025-2026 trend: Iceberg is gaining momentum thanks to multi-engine flexibility (and ecosystem moves such as the Tabular acquisition).


Interview: answer formats

Data Leakage

❌ Red flag: "Data leakage is when test data gets into train"

✅ Strong answer: "4 types of leakage: target (a feature correlates with the target but is unavailable in production), train-test contamination (fitting a scaler on the full dataset), temporal (future data in train), group (the same customer in train and test). Detection: correlation > 0.95 with the target, ID overlap checks, temporal ordering. Prevention: Pipeline, temporal split, GroupKFold."

Spark

❌ Red flag: "RDD and DataFrame are the same thing, just different APIs"

✅ Strong answer: "DataFrame with the Catalyst optimizer is 10-100x faster than RDD for structured data. Lazy evaluation builds a DAG and optimizes the execution plan. Optimization: cache() for frequently used DataFrames, broadcast join for small tables, repartition for skewed data, coalesce to reduce partitions. ML Pipeline: StringIndexer -> VectorAssembler -> Model."

Data Lakehouse

❌ Red flag: "A lakehouse is just a data lake with SQL"

✅ Strong answer: "Lakehouse = Data Lake (cheap storage, any format) + Data Warehouse (ACID, schema, SQL). Delta Lake/Iceberg/Hudi provide ACID via a transaction log. Time travel enables point-in-time ML training. Schema evolution without rewriting data. Medallion architecture: Bronze (raw) -> Silver (cleaned) -> Gold (features)."


See Also