Data Requirements: Fraud Detection¶

~4 min read

Prerequisites: Problem Definition, Feature Engineering

Fraud detection is one of the most data-intensive ML systems. Scoring a single transaction in under 100 ms means assembling 50-200 features from five or more sources: transaction history (velocity over 1h/24h/7d), the user profile (account age, average amount), the device fingerprint (30+ browser parameters), IP reputation (VPN/proxy/Tor), and the link graph (users sharing a device or IP). At the same time, labels arrive 30-90 days late (chargebacks), and 40% of fraud is never reported at all. The data pipeline has to process 50K+ events per second in real time (Kafka + Flink) while building batch features in Spark for model retraining.

Источники данных¶

1. Transaction Data¶

CREATE TABLE transactions (
    transaction_id UUID PRIMARY KEY,
    user_id BIGINT NOT NULL,
    merchant_id BIGINT NOT NULL,
    card_id BIGINT,
    amount DECIMAL(15,2),
    currency VARCHAR(3),
    timestamp TIMESTAMP,
    transaction_type VARCHAR(20),  -- purchase, refund, transfer
    channel VARCHAR(20),           -- online, pos, atm, mobile
    status VARCHAR(20),            -- approved, declined, pending
    decline_reason VARCHAR(50),
    mcc_code VARCHAR(4),           -- Merchant Category Code
    country VARCHAR(2),
    city VARCHAR(100),
    ip_address INET,
    device_id VARCHAR(100),
    session_id VARCHAR(100)
);

-- Labels (arrive with a delay)
CREATE TABLE fraud_labels (
    transaction_id UUID PRIMARY KEY,
    is_fraud BOOLEAN,
    fraud_type VARCHAR(50),
    reported_at TIMESTAMP,
    source VARCHAR(20),  -- chargeback, customer_report, internal
    confirmed_by VARCHAR(50)
);

2. User/Account Data¶

CREATE TABLE users (
    user_id BIGINT PRIMARY KEY,
    created_at TIMESTAMP,
    kyc_status VARCHAR(20),
    kyc_verified_at TIMESTAMP,
    email_verified BOOLEAN,
    phone_verified BOOLEAN,
    address_verified BOOLEAN,
    country VARCHAR(2),
    risk_score FLOAT,
    account_type VARCHAR(20),
    lifetime_value DECIMAL(15,2)
);

CREATE TABLE user_devices (
    user_id BIGINT,
    device_id VARCHAR(100),
    device_type VARCHAR(20),
    os VARCHAR(20),
    first_seen TIMESTAMP,
    last_seen TIMESTAMP,
    is_trusted BOOLEAN,
    PRIMARY KEY (user_id, device_id)
);

3. Merchant Data¶

CREATE TABLE merchants (
    merchant_id BIGINT PRIMARY KEY,
    merchant_name VARCHAR(200),
    mcc_code VARCHAR(4),
    category VARCHAR(50),
    country VARCHAR(2),
    risk_category VARCHAR(20),
    onboarded_at TIMESTAMP,
    avg_transaction_amount DECIMAL(15,2),
    chargeback_rate FLOAT,
    fraud_rate FLOAT
);

4. Device & Session Data¶

device_fingerprint = {
    "device_id": "d123",
    "user_agent": "Mozilla/5.0...",
    "screen_resolution": "1920x1080",
    "timezone": "Europe/Moscow",
    "language": "ru-RU",
    "plugins": ["pdf", "flash"],
    "canvas_hash": "abc123",
    "webgl_hash": "def456",
    "audio_hash": "ghi789",
    "fonts": ["Arial", "Helvetica"],
    "touch_support": False,
    "cookies_enabled": True,
}

session_data = {
    "session_id": "s123",
    "ip_address": "1.2.3.4",
    "ip_geo": {"country": "RU", "city": "Moscow", "isp": "MTS"},
    "is_vpn": False,
    "is_proxy": False,
    "is_tor": False,
    "referrer": "https://google.com",
    "pages_viewed": 5,
    "time_on_site_sec": 120,
    "mouse_movements": 1500,
    "keystrokes": 50,
}

Feature Engineering¶

1. Transaction Features¶

transaction_features = {
    # Basic
    "amount": 150.00,
    "amount_log": 2.18,  # log10(amount)
    "currency": "USD",
    "hour_of_day": 14,
    "day_of_week": 3,
    "is_weekend": False,
    "is_night": False,  # 22:00-06:00

    # Merchant
    "mcc_code": "5411",  # Grocery stores
    "merchant_risk_score": 0.1,
    "is_high_risk_mcc": False,

    # Velocity (user-level)
    "txn_count_1h": 2,
    "txn_count_24h": 5,
    "txn_count_7d": 15,
    "txn_amount_1h": 250.00,
    "txn_amount_24h": 500.00,
    "txn_amount_7d": 1500.00,
    "unique_merchants_24h": 3,
    "unique_countries_24h": 1,

    # Deviation from pattern
    "amount_zscore": 1.2,  # vs user's average
    "is_new_merchant": False,
    "is_new_country": False,
    "is_new_device": False,
    "time_since_last_txn_min": 30,
}
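The velocity features above can be computed point-in-time from a user's history. The following is a minimal sketch, assuming the history is available as a list of `(timestamp, amount)` pairs; the function name `velocity_features` and the sample data are hypothetical and only illustrate the windowing logic.

```python
from datetime import datetime, timedelta

# Trailing windows matching the feature names in the dictionary above.
WINDOWS = {
    "1h": timedelta(hours=1),
    "24h": timedelta(hours=24),
    "7d": timedelta(days=7),
}

def velocity_features(history, now):
    """Count and sum a user's transactions over trailing windows.

    history: list of (timestamp, amount) pairs for one user.
    now: timestamp of the transaction being scored.
    """
    features = {}
    for name, width in WINDOWS.items():
        # Only events in [now - width, now) count; anything at or after
        # `now` would leak information from the future into the feature.
        in_window = [amt for ts, amt in history if now - width <= ts < now]
        features[f"txn_count_{name}"] = len(in_window)
        features[f"txn_amount_{name}"] = sum(in_window)
    return features

now = datetime(2024, 1, 10, 14, 0)
history = [
    (now - timedelta(minutes=30), 100.0),
    (now - timedelta(hours=5), 50.0),
    (now - timedelta(days=3), 200.0),
    (now - timedelta(days=10), 999.0),  # older than every window
]
features = velocity_features(history, now)
```

The same window boundaries should be used when building training data, so that offline and online values of a feature agree.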

2. User Behavior Features¶

user_behavior_features = {
    # Historical patterns
    "avg_txn_amount_30d": 75.00,
    "std_txn_amount_30d": 25.00,
    "max_txn_amount_30d": 200.00,
    "typical_hour_range": [9, 22],
    "typical_mcc_codes": ["5411", "5812", "5912"],

    # Account age & activity
    "account_age_days": 365,
    "days_since_first_txn": 350,
    "total_txn_count": 500,
    "total_txn_amount": 25000.00,

    # Risk indicators
    "failed_txn_count_7d": 1,
    "declined_txn_count_7d": 0,
    "password_changes_30d": 0,
    "email_changes_30d": 0,
    "device_changes_30d": 1,

    # Fraud history
    "prev_fraud_count": 0,
    "prev_chargeback_count": 1,
    "days_since_last_fraud": None,
}
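As an illustration of how the historical profile feeds a deviation feature such as `amount_zscore`, here is a small sketch. The helper names and the sample amounts are hypothetical; the sample standard deviation and the zero-variance fallback are assumptions, not something the section prescribes.

```python
import statistics

def amount_profile(amounts_30d):
    """Summarize a user's amounts over the last 30 days (needs >= 2 values)."""
    return {
        "avg_txn_amount_30d": statistics.mean(amounts_30d),
        "std_txn_amount_30d": statistics.stdev(amounts_30d),  # sample std dev
        "max_txn_amount_30d": max(amounts_30d),
    }

def amount_zscore(amount, profile):
    """Number of standard deviations between the amount and the user's mean."""
    std = profile["std_txn_amount_30d"]
    if std == 0:
        # Assumption: a constant history carries no deviation signal.
        return 0.0
    return (amount - profile["avg_txn_amount_30d"]) / std

profile = amount_profile([50.0, 75.0, 100.0])  # mean 75, sample std 25
zscore = amount_zscore(105.0, profile)         # (105 - 75) / 25
```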

3. Network/Graph Features¶

graph_features = {
    # Device graph
    "users_sharing_device": 1,
    "devices_per_user": 2,
    "is_device_linked_to_fraud": False,

    # IP graph
    "users_sharing_ip": 3,
    "ips_per_user_24h": 1,
    "is_ip_linked_to_fraud": False,

    # Card graph
    "cards_per_user": 2,
    "users_per_card": 1,

    # Email/phone graph
    "accounts_with_same_email_domain": 10,
    "accounts_with_same_phone_prefix": 5,

    # Community detection
    "user_community_id": 123,
    "community_fraud_rate": 0.02,
}
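The device-graph features can be derived from plain `(user_id, device_id)` pairs, such as the rows of `user_devices`. The sketch below is one possible reading, with assumptions stated in the comments: `users_sharing_device` counts every distinct user on the device including the current one, and a device counts as linked to fraud only if another user on it is a confirmed fraudster. The function name is hypothetical.

```python
from collections import defaultdict

def device_graph_features(user_devices, fraud_users, user_id, device_id):
    """Derive device-graph features for one (user, device) pair.

    user_devices: iterable of (user_id, device_id) pairs.
    fraud_users: set of user ids with confirmed fraud.
    """
    users_by_device = defaultdict(set)
    devices_by_user = defaultdict(set)
    for u, d in user_devices:
        users_by_device[d].add(u)
        devices_by_user[u].add(d)

    # Assumption: the current user is counted as one of the device's users.
    sharing = users_by_device[device_id] | {user_id}
    others = sharing - {user_id}
    return {
        "users_sharing_device": len(sharing),
        "devices_per_user": len(devices_by_user[user_id]),
        # Assumption: only *other* users' fraud history links the device.
        "is_device_linked_to_fraud": bool(others & fraud_users),
    }

pairs = [("u1", "d1"), ("u1", "d2"), ("u2", "d1"), ("u3", "d3")]
shared = device_graph_features(pairs, {"u2"}, "u1", "d1")
isolated = device_graph_features(pairs, {"u2"}, "u3", "d3")
```

The same pattern applies to the IP and card graphs by swapping the key.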

4. Real-time Session Features¶

session_features = {
    # Behavioral biometrics
    "typing_speed_wpm": 45,
    "mouse_movement_pattern": "human",  # vs bot
    "scroll_pattern": "normal",
    "time_to_complete_form_sec": 30,

    # Navigation
    "pages_before_checkout": 5,
    "time_on_checkout_page_sec": 60,
    "cart_changes_count": 2,

    # Anomalies
    "is_unusual_time": False,
    "is_unusual_location": False,
    "is_unusual_amount": False,
    "is_unusual_merchant": False,
}

Data Pipeline¶

graph TD
    subgraph RT["REAL-TIME PIPELINE"]
        TXN_RT["Transaction Event<br/>(JSON)"] --> KAFKA["Kafka<br/>Topic"]
        KAFKA --> FLINK["Flink<br/>Stream Processing"]
        FLINK --> REDIS["Redis<br/>Features"]
        FLINK --> SCORING["Fraud Scoring<br/>Service"]
    end

    subgraph BATCH["BATCH PIPELINE"]
        TXN_BATCH["Transaction Logs<br/>(Daily)"] --> S3["S3<br/>Raw Data"]
        S3 --> SPARK["Spark<br/>ETL"]
        SPARK --> FS["Feature<br/>Store"]
        SPARK --> TRAIN["Training<br/>Dataset"]
    end

    style TXN_RT fill:#e8eaf6,stroke:#3f51b5
    style KAFKA fill:#fff3e0,stroke:#ef6c00
    style FLINK fill:#f3e5f5,stroke:#9c27b0
    style REDIS fill:#e8f5e9,stroke:#4caf50
    style SCORING fill:#e8f5e9,stroke:#4caf50
    style TXN_BATCH fill:#e8eaf6,stroke:#3f51b5
    style S3 fill:#fff3e0,stroke:#ef6c00
    style SPARK fill:#f3e5f5,stroke:#9c27b0
    style FS fill:#e8f5e9,stroke:#4caf50
    style TRAIN fill:#e8f5e9,stroke:#4caf50
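At scoring time the service has to combine features from both pipelines. The sketch below is a rough illustration only: plain dictionaries stand in for Redis and the feature store, and `FEATURE_ORDER`, `DEFAULTS`, and `assemble_features` are hypothetical names, not part of the architecture described here.

```python
# Hypothetical feature order expected by the model.
FEATURE_ORDER = [
    "amount",
    "txn_count_1h",
    "txn_amount_24h",
    "avg_txn_amount_30d",
    "account_age_days",
]

# Assumed fallbacks for features that neither store returned.
DEFAULTS = {
    "txn_count_1h": 0,
    "txn_amount_24h": 0.0,
    "avg_txn_amount_30d": 0.0,
    "account_age_days": 0,
}

def assemble_features(event, online_store, batch_store):
    """Merge batch and streaming features into one model input vector.

    Returns (vector, missing), where `missing` lists features that had to
    fall back to a default so the caller can log or alert on them.
    """
    user_id = event["user_id"]
    found = {}
    found.update(batch_store.get(user_id, {}))   # historical profile
    found.update(online_store.get(user_id, {}))  # fresh velocity wins on conflict
    found["amount"] = event["amount"]            # always taken from the event
    missing = [name for name in FEATURE_ORDER if name not in found]
    vector = [found.get(name, DEFAULTS.get(name)) for name in FEATURE_ORDER]
    return vector, missing

online_store = {42: {"txn_count_1h": 2, "txn_amount_24h": 500.0}}
batch_store = {42: {"avg_txn_amount_30d": 75.0}}
vector, missing = assemble_features(
    {"user_id": 42, "amount": 150.0}, online_store, batch_store
)
```

Returning the list of defaulted features makes silent feature-store gaps visible, which matters when the latency budget rules out retries.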

Label Management¶

Label Sources¶

label_sources = {
    "chargeback": {
        "delay": "30-90 days",
        "reliability": "high",
        "coverage": "60%"  # not every fraud case is disputed
    },
    "customer_report": {
        "delay": "1-7 days",
        "reliability": "medium",
        "coverage": "30%"
    },
    "internal_investigation": {
        "delay": "1-30 days",
        "reliability": "high",
        "coverage": "10%"
    },
    "rule_based_detection": {
        "delay": "real-time",
        "reliability": "low",
        "coverage": "varies"
    }
}

Handling Label Delay¶

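import pandas as pd

# Option 1: wait for mature labels
def get_training_data(df, cutoff_days=90, today=None):
    """Keep only transactions older than cutoff_days, whose labels have matured.

    Assumes df['transaction_date'] is a datetime64 column.
    """
    today = pd.Timestamp.today().normalize() if today is None else pd.Timestamp(today)
    cutoff = today - pd.Timedelta(days=cutoff_days)
    return df[df['transaction_date'] < cutoff]

# Option 2: proxy labels
def create_proxy_labels(df):
    """Use early signals as a proxy for fraud (expects boolean columns)."""
    df = df.copy()  # do not mutate the caller's frame
    df['proxy_fraud'] = (
        df['chargeback_initiated'] |
        df['customer_dispute'] |
        df['failed_delivery_with_claim'] |
        df['multiple_failed_auth_attempts']
    )
    return df

# Option 3: semi-supervised learning
def pseudo_labeling(model, unlabeled_data, threshold=0.95):
    """Return high-confidence rows and their predicted labels as extra training data."""
    fraud_proba = model.predict_proba(unlabeled_data)[:, 1]
    high_conf_fraud = fraud_proba > threshold
    high_conf_legit = fraud_proba < (1 - threshold)
    keep = high_conf_fraud | high_conf_legit
    # Label is 1 for confident fraud and 0 for confident legitimate rows.
    return unlabeled_data[keep], high_conf_fraud[keep].astype(int)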

Data Quality Checks¶

fraud_data_quality_checks = {
    "transactions": [
        "transaction_id is unique",
        "amount > 0",
        "timestamp is not null and not in future",
        "user_id exists in users table",
        "no duplicate transactions within 1 second",
    ],
    "features": [
        "no null values in required features",
        "velocity features are non-negative",
        "amounts are in expected currency range",
        "device_id format is valid",
    ],
    "labels": [
        "is_fraud is boolean",
        "reported_at >= transaction timestamp",
        "no contradicting labels for same transaction",
    ],
    "monitoring": [
        "feature distribution drift < threshold",
        "label rate within expected range",
        "data freshness < 1 hour",
    ]
}
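Some of the transaction-level checks listed above can be expressed as a simple validation pass. This is a sketch under assumptions: rows are dictionaries with the column names from the `transactions` table, and `check_transactions` is a hypothetical helper, not an existing framework API.

```python
from datetime import datetime, timedelta

def check_transactions(rows, known_user_ids, now):
    """Return (transaction_id, problem) pairs for rows that fail a check."""
    problems = []
    seen_ids = set()
    for row in rows:
        tid = row["transaction_id"]
        if tid in seen_ids:
            problems.append((tid, "duplicate transaction_id"))
        seen_ids.add(tid)
        if row["amount"] is None or row["amount"] <= 0:
            problems.append((tid, "amount must be > 0"))
        if row["timestamp"] is None or row["timestamp"] > now:
            problems.append((tid, "timestamp missing or in the future"))
        if row["user_id"] not in known_user_ids:
            problems.append((tid, "unknown user_id"))
    return problems

now = datetime(2024, 1, 10, 12, 0)
rows = [
    {"transaction_id": "t1", "user_id": 1, "amount": 150.0,
     "timestamp": now - timedelta(hours=1)},
    {"transaction_id": "t1", "user_id": 1, "amount": 150.0,
     "timestamp": now - timedelta(hours=1)},  # same id again
    {"transaction_id": "t2", "user_id": 99, "amount": -5.0,
     "timestamp": now + timedelta(days=1)},   # fails three checks
]
problems = check_transactions(rows, known_user_ids={1}, now=now)
```

In a real pipeline the failing rows would typically be quarantined rather than silently dropped, so that a spike in failures can be investigated.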

Privacy & Compliance¶

PCI DSS Requirements¶

Card numbers must be tokenized
Sensitive data encrypted at rest and in transit
Access logging and auditing
Data retention policies

GDPR Requirements¶

Right to explanation (why blocked?)
Right to erasure (delete user data)
Data minimization (only needed features)
Consent for profiling

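# Data masking example
import hashlib
import hmac

def pseudonymize(value, key):
    """Keyed SHA-256 digest. The built-in hash() is salted per process, so it
    is neither stable across runs nor suitable for protecting card data."""
    return hmac.new(key, value.encode(), hashlib.sha256).hexdigest()

def mask_sensitive_data(transaction, key):
    """key: secret bytes kept outside the dataset (e.g. in a secrets manager)."""
    return {
        **transaction,
        "card_number": pseudonymize(transaction["card_number"], key),
        "ip_address": anonymize_ip(transaction["ip_address"]),  # helper defined separately
        "email": pseudonymize(transaction["email"], key),
    }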
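The masking example calls an `anonymize_ip` helper that the section does not define. One common approach is to zero the host part of the address. The sketch below assumes truncation to /24 for IPv4 and /48 for IPv6; these widths are an illustrative choice, not a requirement stated in this document.

```python
import ipaddress

def anonymize_ip(ip_string):
    """Zero the host bits: keep /24 for IPv4 and /48 for IPv6 (assumed widths)."""
    ip = ipaddress.ip_address(ip_string)
    prefix = 24 if ip.version == 4 else 48
    # strict=False allows host bits to be set in the input address.
    network = ipaddress.ip_network(f"{ip_string}/{prefix}", strict=False)
    return str(network.network_address)

masked_v4 = anonymize_ip("1.2.3.4")
masked_v6 = anonymize_ip("2001:db8:abcd:12::1")
```

Truncation keeps coarse network information usable for features such as IP reputation while dropping the part that points at a single host.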

Misconception: you can train the model on all available data

Transactions younger than 90 days cannot be used for training as they are: their labels are not mature yet. Out of 1M transactions from the last 30 days, 40% of the fraudulent ones are still unmarked because the chargeback has not arrived. Training on such data introduces label noise: the model learns that these transactions are legitimate when they are in fact fraudulent. The fix is a mature window (90 days) plus proxy labels for fresh data.
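The mature-window rule combined with proxy labels can be sketched as a per-transaction labeling decision. The function `assign_training_label`, the dictionary keys, and the `"confirmed_early"` case (using a confirmed fraud label before the window closes) are assumptions made for illustration.

```python
from datetime import date

MATURE_DAYS = 90  # maturity window from the text

def assign_training_label(txn, today):
    """Return (label, label_kind), or None if the label is still unknown.

    txn: dict with 'transaction_date' (date), 'is_fraud' (confirmed label
    or None) and 'proxy_fraud' (bool early signal).
    """
    age_days = (today - txn["transaction_date"]).days
    if age_days > MATURE_DAYS:
        # Window closed: no fraud label by now is treated as legitimate.
        return (bool(txn["is_fraud"]), "mature")
    if txn["is_fraud"]:
        # Assumption: a confirmed fraud label is usable before maturity.
        return (True, "confirmed_early")
    if txn["proxy_fraud"]:
        return (True, "proxy")
    # Fresh and unflagged means unknown, not legitimate: skip it.
    return None

today = date(2024, 6, 1)
old_legit = {"transaction_date": date(2024, 1, 1), "is_fraud": None, "proxy_fraud": False}
fresh_flagged = {"transaction_date": date(2024, 5, 20), "is_fraud": None, "proxy_fraud": True}
fresh_unknown = {"transaction_date": date(2024, 5, 20), "is_fraud": None, "proxy_fraud": False}

labels = [assign_training_label(t, today) for t in (old_legit, fresh_flagged, fresh_unknown)]
```

Keeping `label_kind` alongside the label lets training weight proxy labels lower than mature ones.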

Misconception: a device fingerprint uniquely identifies a device

Canvas, WebGL, and audio hashes change when the browser or OS is updated (roughly every 2-4 weeks). A VPN hides the real IP address, and incognito mode discards cookies. Fraudsters use antidetect browsers (Multilogin, GoLogin) that generate unique fingerprints for $100/month. A device fingerprint is one signal among several, not a sufficient one on its own. Reliable identification requires combining 5+ factors.

Misconception: computing velocity features by user_id is enough

Fraudsters create new accounts, so velocity has to be computed over four entities: user_id, device_id, ip_address, and card_token. One device_id with 10 different user_ids within an hour is a very strong signal. One IP with 50 transactions from different users points to a datacenter or botnet. Missing cross-entity velocity is one of the main reasons fraud rings go undetected.
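A cross-entity velocity counter, such as "distinct users per device in the last hour", can be maintained incrementally over a stream. The class below is a simplified in-memory sketch of the idea, not how Flink or Redis implement it; the name `DistinctUsersWindow` and the key format are hypothetical, and events are assumed to arrive in timestamp order.

```python
from collections import defaultdict, deque

class DistinctUsersWindow:
    """Distinct user_ids seen per entity key within a sliding time window."""

    def __init__(self, window_sec=3600):
        self.window_sec = window_sec
        self.events = defaultdict(deque)  # key -> deque of (timestamp, user_id)

    def add(self, key, user_id, ts):
        """Record an event and return the distinct-user count for this key."""
        queue = self.events[key]
        queue.append((ts, user_id))
        # Evict events that fell out of the window (in-order arrival assumed).
        while queue and queue[0][0] <= ts - self.window_sec:
            queue.popleft()
        return len({user for _, user in queue})

window = DistinctUsersWindow(window_sec=3600)
counts = [
    window.add("device:d1", "u1", 0),
    window.add("device:d1", "u2", 600),
    window.add("device:d1", "u1", 1200),  # repeat user, count unchanged
    window.add("device:d1", "u3", 4000),  # event at t=0 has expired
    window.add("device:d1", "u4", 5000),  # events at t=600, 1200 expired
]
```

The same counter can be keyed by `ip:...` or `card:...` to cover the other entities, with one instance per window length.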

Interview Section¶

Question: "How do you handle a label delay of 30-90 days?"

Weak answer: "We train the model on whatever labels we have."

Strong answer: "Three strategies: (1) Mature window: train only on transactions older than 90 days, where the chargeback has already arrived or the dispute window has closed. Downside: the model lags new patterns by three months. (2) Proxy labels: use early signals (customer complaints after 1-7 days, failed delivery, suspicious login) as an approximation of the true labels. Coverage is lower, but the delay is 1-7 days instead of 90. (3) Semi-supervised: train on mature data, then use high-confidence predictions (score > 0.95) on fresh data as pseudo-labels for further training. Combining the three approaches gives the best result."

Question: "Which features matter most for fraud detection?"

Weak answer: "Transaction amount and time."

Strong answer: "By feature importance (SHAP) in production systems: (1) velocity features (txn_count_1h, txn_amount_24h) account for 15-20% of predictive power; (2) deviation from pattern (amount_zscore, which compares the current amount with the user's average, plus is_new_device and is_new_country) adds another 15%; (3) graph features (users_sharing_device, hops_to_known_fraudster) contribute 10-15%; (4) device/session features: is_vpn, is_datacenter_ip, typing_speed. Important: raw features (amount, hour_of_day) are less informative than derived ones (amount_zscore, time_since_last_txn). Velocity features have to be computed in real time (Flink -> Redis), and historical profiles in batch (Spark)."