
Search Ranking: Data Requirements

~3 min read

Prerequisite: System Components

Search quality is determined by data: training a LambdaMART model requires millions of (query, document, relevance_label) examples, and the feature store must serve 100+ features for 1000 candidates in under 20ms. In practice, 80% of development time goes into the data pipeline: collecting click logs, position debiasing, and feature engineering. A model without the right data is useless: Google found that adding 10 query-document features yielded a larger NDCG gain (+3%) than replacing a logistic-regression model with a neural network (+1.5%).
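
To make the <20ms budget concrete, here is a minimal sketch of fetching precomputed features for all candidates in a single round trip from a Redis-backed feature store. The key layout doc:{product_id}:features and the host are assumptions, not a prescribed schema.

import json
import redis

r = redis.Redis(host="localhost", port=6379)  # placeholder host

def fetch_doc_features(product_ids):
    # One pipelined round trip for all candidates keeps latency close
    # to a single GET instead of N sequential network calls.
    pipe = r.pipeline()
    for pid in product_ids:
        pipe.get(f"doc:{pid}:features")  # assumed key layout
    raw = pipe.execute()
    return {pid: json.loads(blob) if blob else None
            for pid, blob in zip(product_ids, raw)}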

Data Sources

1. Document/Item Data

-- Product catalog (e-commerce example)
CREATE TABLE products (
    product_id BIGINT PRIMARY KEY,
    title VARCHAR(500),
    description TEXT,
    category_id INT,
    brand VARCHAR(100),
    price DECIMAL(10,2),
    currency VARCHAR(3),
    in_stock BOOLEAN,
    stock_quantity INT,
    rating FLOAT,
    review_count INT,
    created_at TIMESTAMP,
    updated_at TIMESTAMP,
    seller_id BIGINT,
    image_urls TEXT[],
    attributes JSONB
);

-- Full-text search index
CREATE INDEX products_fts_idx ON products
USING gin(to_tsvector('russian', title || ' ' || description));
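
For context, a hedged sketch of how this index is typically queried for candidate recall from application code; the connection string is a placeholder.

import psycopg2

conn = psycopg2.connect("dbname=shop user=app")  # placeholder DSN

def fts_candidates(query_text, limit=1000):
    # Rank matches with ts_rank over the same tsvector expression the
    # GIN index above is built on, so the index is used for filtering.
    sql = """
        SELECT product_id,
               ts_rank(to_tsvector('russian', title || ' ' || description),
                       plainto_tsquery('russian', %s)) AS score
        FROM products
        WHERE to_tsvector('russian', title || ' ' || description)
              @@ plainto_tsquery('russian', %s)
        ORDER BY score DESC
        LIMIT %s
    """
    with conn.cursor() as cur:
        cur.execute(sql, (query_text, query_text, limit))
        return cur.fetchall()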

2. Query Logs

CREATE TABLE search_logs (
    search_id UUID PRIMARY KEY,
    user_id BIGINT,
    query_text VARCHAR(500),
    query_normalized VARCHAR(500),
    session_id VARCHAR(100),
    timestamp TIMESTAMP,
    device_type VARCHAR(20),
    platform VARCHAR(20),
    result_count INT,
    page_number INT,
    filters_applied JSONB,
    latency_ms INT
);

CREATE TABLE click_logs (
    click_id UUID PRIMARY KEY,
    search_id UUID REFERENCES search_logs,
    user_id BIGINT,
    product_id BIGINT,
    position INT,
    timestamp TIMESTAMP,
    dwell_time_sec INT,
    action VARCHAR(20)  -- click, add_to_cart, purchase
);

3. User Data

user_profile = {
    "user_id": "123",
    "preferences": {
        "preferred_brands": ["Nike", "Adidas"],
        "price_range": [1000, 5000],
        "preferred_categories": ["shoes", "sportswear"],
    },
    "search_history": [
        {"query": "кроссовки nike", "timestamp": "2024-01-10"},
        {"query": "спортивные штаны", "timestamp": "2024-01-09"},
    ],
    "purchase_history": [...],
    "view_history": [...],
}

Feature Engineering

1. Query Features

query_features = {
    # Lexical features
    "query_length": 3,
    "num_words": 3,
    "has_brand": True,
    "has_size": True,
    "has_color": True,
    "has_price_modifier": False,  # "дешевые", "премиум"

    # Historical features
    "query_frequency": 1500,  # searches/day
    "avg_clicks_per_search": 2.3,
    "avg_conversion_rate": 0.05,
    "zero_result_rate": 0.01,

    # Semantic features
    "query_embedding": [0.1, 0.2, ...],  # 768-dim BERT
    "predicted_category": "footwear",
    "predicted_intent": "purchase",

    # Session context
    "is_refinement": False,  # refined from previous query
    "previous_query_similarity": 0.0,
}
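
A minimal sketch of computing the lexical features above; KNOWN_BRANDS and the size pattern are illustrative stand-ins for real dictionaries.

import re

KNOWN_BRANDS = {"nike", "adidas", "puma"}           # assumed brand lexicon
SIZE_RE = re.compile(r"\b(x{0,2}[sl]|m|\d{2})\b")   # assumed size pattern

def lexical_query_features(query: str) -> dict:
    tokens = query.lower().split()
    return {
        "query_length": len(query),   # characters; num_words counts tokens
        "num_words": len(tokens),
        "has_brand": any(t in KNOWN_BRANDS for t in tokens),
        "has_size": bool(SIZE_RE.search(query.lower())),
    }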

2. Document Features

document_features = {
    # Text features
    "title_length": 45,
    "description_length": 500,
    "title_embedding": [0.3, 0.1, ...],
    "description_embedding": [0.2, 0.4, ...],

    # Quality features
    "rating": 4.5,
    "review_count": 1000,
    "return_rate": 0.02,
    "seller_rating": 4.8,

    # Popularity features
    "views_7d": 10000,
    "clicks_7d": 500,
    "purchases_7d": 50,
    "add_to_cart_7d": 200,

    # Business features
    "price": 5000,
    "discount_percent": 20,
    "in_stock": True,
    "days_since_listed": 30,
    "is_promoted": False,
}

3. Query-Document Features

query_document_features = {
    # Text matching
    "bm25_score": 15.3,
    "title_match_ratio": 0.8,  # % query terms in title
    "exact_match": False,
    "phrase_match": True,

    # Semantic matching
    "embedding_similarity": 0.85,
    "cross_encoder_score": 0.78,

    # Field-specific matching
    "brand_match": True,
    "category_match": True,
    "size_match": True,
    "color_match": True,

    # Historical interaction
    "user_prev_click": False,
    "user_prev_purchase": False,
    "category_affinity": 0.7,
    "brand_affinity": 0.9,
}
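
Two of these match features in code: cosine similarity between precomputed embeddings, and the share of query terms appearing in the title. A sketch, assuming the embeddings are already available as vectors.

import numpy as np

def embedding_similarity(query_emb, doc_emb) -> float:
    q, d = np.asarray(query_emb), np.asarray(doc_emb)
    return float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d)))

def title_match_ratio(query: str, title: str) -> float:
    # Share of distinct query terms that occur in the title.
    q_terms = set(query.lower().split())
    t_terms = set(title.lower().split())
    return len(q_terms & t_terms) / max(len(q_terms), 1)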

4. Context Features

context_features = {
    # Session
    "session_query_count": 3,
    "session_click_count": 5,
    "time_since_session_start_min": 10,

    # Time
    "hour_of_day": 14,
    "day_of_week": 3,
    "is_weekend": False,

    # Device
    "device_type": "mobile",
    "screen_size": "small",

    # Location
    "user_country": "RU",
    "user_city": "Moscow",
}

Data Pipeline

graph TD
    subgraph INDEXING ["Indexing Pipeline"]
        direction LR
        PDB["Product DB<br/>(Postgres)"] --> CDC["Change Capture<br/>(CDC)"]
        CDC --> IB["Index Builder<br/>(Spark)"]
        IB --> SI["Search Index<br/>(ES/Solr)"]
    end

    subgraph TRAINING ["Training Pipeline"]
        direction LR
        CL["Click Logs<br/>(Kafka)"] --> JL["Join & Label"]
        SL["Search Logs<br/>(Kafka)"] --> JL
        JL --> FE["Feature Extract<br/>(Spark)"]
        FE --> TD["Training Dataset<br/>(Parquet)"]
    end

    style INDEXING fill:#e8eaf6,stroke:#3f51b5
    style TRAINING fill:#e8f5e9,stroke:#4caf50
    style PDB fill:#e8eaf6,stroke:#3f51b5
    style CDC fill:#fff3e0,stroke:#ef6c00
    style IB fill:#fff3e0,stroke:#ef6c00
    style SI fill:#e8f5e9,stroke:#4caf50
    style CL fill:#e8eaf6,stroke:#3f51b5
    style SL fill:#e8eaf6,stroke:#3f51b5
    style JL fill:#fff3e0,stroke:#ef6c00
    style FE fill:#fff3e0,stroke:#ef6c00
    style TD fill:#e8f5e9,stroke:#4caf50
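
A hedged PySpark sketch of the "Join & Label" step from the training pipeline: left-join an impressions log (one row per search_id, product_id, position; an assumption about the logging setup) with click logs. Paths are placeholders.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("join_and_label").getOrCreate()

impressions = spark.read.parquet("s3://logs/impressions/")  # placeholder path
clicks = spark.read.parquet("s3://logs/click_logs/")

labeled = (
    impressions.join(clicks, on=["search_id", "product_id"], how="left")
    .withColumn("clicked", F.col("click_id").isNotNull())
    .withColumn("dwell_time_sec", F.coalesce(F.col("dwell_time_sec"), F.lit(0)))
)
labeled.write.mode("overwrite").parquet("s3://datasets/labeled/")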

Label Generation

Click-based Labels

def generate_click_labels(search_logs, click_logs):
    """
    Generate per-impression labels by joining impressions with clicks.
    `get_impressions` / `get_clicks` are assumed helpers; `get_clicks`
    returns a dict keyed by product_id.
    """
    labels = []

    for search in search_logs:
        impressions = get_impressions(search.search_id)
        clicks = get_clicks(search.search_id)  # {product_id: click_info}

        for impression in impressions:
            click = clicks.get(impression.product_id, {})
            label = {
                "search_id": search.search_id,
                "product_id": impression.product_id,
                "position": impression.position,
                "clicked": impression.product_id in clicks,
                "dwell_time": click.get("dwell_time", 0),
                "action": click.get("action"),
            }
            labels.append(label)

    return labels

Graded Relevance Labels

def compute_graded_relevance(click_data):
    """
    Convert clicks to graded relevance (0-4 scale)
    """
    if click_data.action == "purchase":
        return 4  # Highly relevant
    elif click_data.action == "add_to_cart":
        return 3  # Very relevant
    elif click_data.dwell_time > 60:
        return 2  # Relevant (long dwell)
    elif click_data.clicked:
        return 1  # Somewhat relevant
    else:
        return 0  # Not relevant

Position Debiasing

def apply_position_debiasing(labels, method="inverse_propensity"):
    """
    Account for position bias in click data
    """
    if method == "inverse_propensity":
        # Propensity model: P(click | position), estimated from
        # randomized traffic; positions beyond 5 fall back to a floor.
        propensity = {1: 0.5, 2: 0.3, 3: 0.2, 4: 0.15, 5: 0.1}

        for label in labels:
            if label.clicked:
                # Weight inversely proportional to propensity
                label.weight = 1 / propensity.get(label.position, 0.1)
            else:
                label.weight = 1.0

    elif method == "randomization":
        # Only use data from randomized experiments
        labels = [l for l in labels if l.was_randomized]

    return labels
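
The propensity table itself can be estimated from the randomized slice of traffic: when ranking is randomized, per-position CTR is an unbiased estimate of P(click | position). A sketch, assuming randomized_labels is the labeled data filtered to randomized sessions.

from collections import defaultdict

def estimate_propensities(randomized_labels):
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for lab in randomized_labels:
        shown[lab.position] += 1
        clicked[lab.position] += int(lab.clicked)
    # Per-position CTR on randomized traffic ~ P(click | position)
    return {pos: clicked[pos] / shown[pos] for pos in shown}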

Training Data Format

Pointwise

# Each example: (query, document, label)
{
    "query_id": "q123",
    "doc_id": "d456",
    "query_features": {...},
    "doc_features": {...},
    "query_doc_features": {...},
    "label": 1,  # clicked
    "weight": 1.5,  # position debiased
}

Pairwise

# Each example: (query, doc_positive, doc_negative)
{
    "query_id": "q123",
    "positive_doc_id": "d456",  # clicked
    "negative_doc_id": "d789",  # not clicked, shown higher
    "query_features": {...},
    "positive_doc_features": {...},
    "negative_doc_features": {...},
}

Listwise

# Each example: (query, [documents], [relevance_labels])
{
    "query_id": "q123",
    "doc_ids": ["d1", "d2", "d3", "d4", "d5"],
    "query_features": {...},
    "doc_features": [{...}, {...}, {...}, {...}, {...}],
    "labels": [2, 0, 1, 3, 0],  # graded relevance
}
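
Converting between formats is mechanical: group pointwise rows by query_id to get listwise examples. A minimal sketch, assuming rows follow the pointwise format above.

from collections import defaultdict

def to_listwise(rows):
    groups = defaultdict(list)
    for row in rows:
        groups[row["query_id"]].append(row)
    return [
        {
            "query_id": qid,
            "doc_ids": [d["doc_id"] for d in docs],
            "doc_features": [d["doc_features"] for d in docs],
            "labels": [d["label"] for d in docs],
        }
        for qid, docs in groups.items()
    ]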

Search Index Schema

Elasticsearch Mapping

{
  "mappings": {
    "properties": {
      "product_id": {"type": "keyword"},
      "title": {
        "type": "text",
        "analyzer": "russian",
        "fields": {
          "keyword": {"type": "keyword"},
          "autocomplete": {
            "type": "text",
            "analyzer": "autocomplete"
          }
        }
      },
      "description": {
        "type": "text",
        "analyzer": "russian"
      },
      "brand": {"type": "keyword"},
      "category": {"type": "keyword"},
      "price": {"type": "float"},
      "rating": {"type": "float"},
      "in_stock": {"type": "boolean"},
      "popularity_score": {"type": "float"},
      "title_embedding": {
        "type": "dense_vector",
        "dims": 768,
        "index": true,
        "similarity": "cosine"
      }
    }
  }
}
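
Creating the index with this mapping via the official Python client (elasticsearch-py 8.x style); host, file name, and index name are placeholders.

import json
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder host

# `product_mapping.json` is assumed to contain the mapping shown above.
with open("product_mapping.json") as f:
    product_mappings = json.load(f)["mappings"]

es.indices.create(index="products", mappings=product_mappings)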

Data Quality Checks

search_data_quality_checks = {
    "query_logs": [
        "query_text is not null and length > 0",
        "timestamp is valid and not in future",
        "result_count >= 0",
    ],
    "click_logs": [
        "search_id exists in search_logs",
        "position > 0 and position <= result_count",
        "dwell_time >= 0",
    ],
    "products": [
        "title is not null",
        "price > 0",
        "category_id is valid",
        "embeddings have correct dimension",
    ],
    "training_data": [
        "no null features",
        "labels in valid range",
        "positive examples exist for each query",
    ],
}
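
A minimal sketch of enforcing a few of these checks on pandas frames; column names follow the schemas earlier in this section.

import pandas as pd

def check_click_logs(clicks: pd.DataFrame, searches: pd.DataFrame) -> list:
    errors = []
    if not clicks["search_id"].isin(searches["search_id"]).all():
        errors.append("click_logs: search_id missing from search_logs")
    if (clicks["position"] <= 0).any():
        errors.append("click_logs: non-positive position")
    if (clicks["dwell_time_sec"] < 0).any():
        errors.append("click_logs: negative dwell_time")
    return errors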

Misconception: clicks are reliable relevance labels

CTR at position 1 can be 30% while at position 10 it is only 3%, regardless of document relevance. Without position debiasing, the model learns to predict "what sits at the top" rather than "what is relevant". Always use inverse propensity weighting, or randomize 1-5% of traffic to collect unbiased data.

Misconception: pointwise loss is enough for ranking

Pointwise training (binary cross-entropy on click/no-click) ignores the relative order of documents. Listwise losses (LambdaLoss, ListMLE) optimize NDCG directly and yield 5-8% better results. In an interview, be sure to discuss the pointwise vs pairwise vs listwise distinction.
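
For reference, a hedged sketch of training with a LambdaRank objective via LightGBM's LGBMRanker (a common LambdaMART implementation); the data here is synthetic purely to show the group-aware API.

import lightgbm as lgb
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))     # synthetic: 10 features per row
y = rng.integers(0, 5, size=100)   # synthetic graded relevance 0-4
group = [10] * 10                  # 10 queries, 10 candidates each

ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=100)
ranker.fit(X, y, group=group)      # `group` marks per-query row counts
scores = ranker.predict(X)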

Misconception: embeddings can be computed on the fly

A BERT embedding for a single document takes ~10ms on a GPU. For 10M documents, that is 28 hours on every index rebuild. Embeddings are computed offline (a batch Spark job) and stored in the index. The query embedding is computed online (~2ms with a distilled model).
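
A sketch of the offline batch pass with sentence-transformers; the model name is illustrative, and its output dimension must match the dense_vector dims in the index mapping.

from sentence_transformers import SentenceTransformer

# Illustrative multilingual model (384-dim); swap in whatever matches
# the 768-dim mapping above in a real pipeline.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def embed_titles(titles, batch_size=256):
    # encode() batches internally and uses GPU when available.
    return model.encode(titles, batch_size=batch_size,
                        normalize_embeddings=True)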

In an Interview

Common mistakes:

❌ "We use clicks as binary 0/1 labels": ignores position bias and graded relevance (purchase > cart > click > impression)

❌ "We compute features on the fly for all candidates": 1000 candidates x 100 features = 100K operations, which will not fit in 20ms without precomputation

❌ "Query logs are unnecessary, item features suffice": query features (frequency, avg CTR, intent) and query-document features (BM25 score, title match) are usually the most important by feature importance

Strong answers:

✅ "For labels I use graded relevance: purchase=4, add_to_cart=3, dwell>60s=2, click=1, impression=0. I apply inverse propensity weighting by position for debiasing."

✅ "Data pipeline: CDC from the Product DB -> Spark job for indexing + embedding computation (batch, hourly). Click/search logs via Kafka -> join + label generation -> feature extraction -> Parquet training dataset."

✅ "Three feature groups by importance: (1) query-document match (BM25, title match, brand match), (2) document quality (rating, reviews, conversion rate), (3) personalization (category affinity, brand preference). Precomputed in a Feature Store (Redis), latency <5ms."