Search Ranking: Data Requirements¶
~3 min read
Prerequisite: System Components
Search quality is determined by data: training a LambdaMART model takes millions of (query, document, relevance_label) pairs, and the feature store has to serve 100+ features for 1000 candidates in under 20 ms. In practice, 80% of development time goes into the data pipeline: collecting click logs, position debiasing, feature engineering. A model without the right data is useless: Google found that adding 10 query-document features gave a larger NDCG gain (+3%) than swapping the model from logistic regression to a neural network (+1.5%).
Data Sources¶
1. Document/Item Data¶
-- Product catalog (e-commerce example)
CREATE TABLE products (
    product_id BIGINT PRIMARY KEY,
    title VARCHAR(500),
    description TEXT,
    category_id INT,
    brand VARCHAR(100),
    price DECIMAL(10,2),
    currency VARCHAR(3),
    in_stock BOOLEAN,
    stock_quantity INT,
    rating FLOAT,
    review_count INT,
    created_at TIMESTAMP,
    updated_at TIMESTAMP,
    seller_id BIGINT,
    image_urls TEXT[],
    attributes JSONB
);
-- Full-text search index
CREATE INDEX products_fts_idx ON products
    USING gin(to_tsvector('russian', title || ' ' || description));
2. Query Logs¶
CREATE TABLE search_logs (
    search_id UUID PRIMARY KEY,
    user_id BIGINT,
    query_text VARCHAR(500),
    query_normalized VARCHAR(500),
    session_id VARCHAR(100),
    timestamp TIMESTAMP,
    device_type VARCHAR(20),
    platform VARCHAR(20),
    result_count INT,
    page_number INT,
    filters_applied JSONB,
    latency_ms INT
);
CREATE TABLE click_logs (
    click_id UUID PRIMARY KEY,
    search_id UUID REFERENCES search_logs,
    user_id BIGINT,
    product_id BIGINT,
    position INT,
    timestamp TIMESTAMP,
    dwell_time_sec INT,
    action VARCHAR(20) -- click, add_to_cart, purchase
);
3. User Data¶
user_profile = {
    "user_id": "123",
    "preferences": {
        "preferred_brands": ["Nike", "Adidas"],
        "price_range": [1000, 5000],
        "preferred_categories": ["shoes", "sportswear"],
    },
    "search_history": [
        {"query": "кроссовки nike", "timestamp": "2024-01-10"},
        {"query": "спортивные штаны", "timestamp": "2024-01-09"},
    ],
    "purchase_history": [...],
    "view_history": [...],
}
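The brand_affinity and category_affinity features that appear later in query-document features can be derived from this profile as a smoothed share of purchases. A toy sketch (the smoothing constant and assumed brand count are arbitrary choices for illustration):

```python
from collections import Counter

def brand_affinity(purchase_history, brand, alpha=1.0, n_brands=10):
    """Share of purchases in a brand, with additive smoothing so that
    unseen brands get a small nonzero affinity."""
    counts = Counter(p["brand"] for p in purchase_history)
    total = sum(counts.values())
    return (counts[brand] + alpha) / (total + alpha * n_brands)

history = [{"brand": "Nike"}, {"brand": "Nike"}, {"brand": "Adidas"}]
aff = brand_affinity(history, "Nike")  # (2 + 1) / (3 + 10)
```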
Feature Engineering¶
1. Query Features¶
query_features = {
    # Lexical features
    "query_length": 3,
    "num_words": 3,
    "has_brand": True,
    "has_size": True,
    "has_color": True,
    "has_price_modifier": False,  # e.g. "cheap", "premium"
    # Historical features
    "query_frequency": 1500,  # searches/day
    "avg_clicks_per_search": 2.3,
    "avg_conversion_rate": 0.05,
    "zero_result_rate": 0.01,
    # Semantic features
    "query_embedding": [0.1, 0.2, ...],  # 768-dim BERT
    "predicted_category": "footwear",
    "predicted_intent": "purchase",
    # Session context
    "is_refinement": False,  # refined from previous query
    "previous_query_similarity": 0.0,
}
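The lexical features are cheap enough to compute at request time. A sketch, assuming a brand dictionary is available (the brand and modifier lists here are illustrative):

```python
KNOWN_BRANDS = {"nike", "adidas", "puma"}          # illustrative dictionary
PRICE_MODIFIERS = {"cheap", "premium", "budget"}   # illustrative

def lexical_query_features(query: str) -> dict:
    """Extract simple lexical features from a raw query string."""
    tokens = query.lower().split()
    return {
        "query_length": len(query),
        "num_words": len(tokens),
        "has_brand": any(t in KNOWN_BRANDS for t in tokens),
        "has_price_modifier": any(t in PRICE_MODIFIERS for t in tokens),
    }

feats = lexical_query_features("nike running shoes")
# 3 words, contains a known brand, no price modifier
```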
2. Document Features¶
document_features = {
    # Text features
    "title_length": 45,
    "description_length": 500,
    "title_embedding": [0.3, 0.1, ...],
    "description_embedding": [0.2, 0.4, ...],
    # Quality features
    "rating": 4.5,
    "review_count": 1000,
    "return_rate": 0.02,
    "seller_rating": 4.8,
    # Popularity features
    "views_7d": 10000,
    "clicks_7d": 500,
    "purchases_7d": 50,
    "add_to_cart_7d": 200,
    # Business features
    "price": 5000,
    "discount_percent": 20,
    "in_stock": True,
    "days_since_listed": 30,
    "is_promoted": False,
}
3. Query-Document Features¶
query_document_features = {
    # Text matching
    "bm25_score": 15.3,
    "title_match_ratio": 0.8,  # % query terms in title
    "exact_match": False,
    "phrase_match": True,
    # Semantic matching
    "embedding_similarity": 0.85,
    "cross_encoder_score": 0.78,
    # Field-specific matching
    "brand_match": True,
    "category_match": True,
    "size_match": True,
    "color_match": True,
    # Historical interaction
    "user_prev_click": False,
    "user_prev_purchase": False,
    "category_affinity": 0.7,
    "brand_affinity": 0.9,
}
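title_match_ratio, exact_match, and phrase_match reduce to set and substring operations over normalized tokens. A sketch (real systems would normalize more aggressively: stemming, synonym expansion):

```python
def text_match_features(query: str, title: str) -> dict:
    """Simple lexical match features between a query and a document title."""
    q_terms = query.lower().split()
    t_tokens = set(title.lower().split())
    matched = sum(1 for term in q_terms if term in t_tokens)
    return {
        "title_match_ratio": matched / len(q_terms) if q_terms else 0.0,
        "exact_match": query.lower() == title.lower(),
        "phrase_match": query.lower() in title.lower(),
    }

f = text_match_features("nike air max", "Nike Air Max 270 sneakers")
# all three query terms occur in the title, and so does the full phrase
```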
4. Context Features¶
context_features = {
    # Session
    "session_query_count": 3,
    "session_click_count": 5,
    "time_since_session_start_min": 10,
    # Time
    "hour_of_day": 14,
    "day_of_week": 3,
    "is_weekend": False,
    # Device
    "device_type": "mobile",
    "screen_size": "small",
    # Location
    "user_country": "RU",
    "user_city": "Moscow",
}
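The time features fall directly out of the request timestamp. A stdlib sketch:

```python
from datetime import datetime

def time_features(ts: datetime) -> dict:
    """Derive hour/day-of-week context features from a request timestamp."""
    return {
        "hour_of_day": ts.hour,
        "day_of_week": ts.isoweekday(),  # 1 = Monday .. 7 = Sunday
        "is_weekend": ts.isoweekday() >= 6,
    }

f = time_features(datetime(2024, 1, 10, 14, 30))  # a Wednesday afternoon
```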
Data Pipeline¶
graph TD
    subgraph INDEXING ["Indexing Pipeline"]
        direction LR
        PDB["Product DB<br/>(Postgres)"] --> CDC["Change Capture<br/>(CDC)"]
        CDC --> IB["Index Builder<br/>(Spark)"]
        IB --> SI["Search Index<br/>(ES/Solr)"]
    end
    subgraph TRAINING ["Training Pipeline"]
        direction LR
        CL["Click Logs<br/>(Kafka)"] --> JL["Join & Label"]
        SL["Search Logs<br/>(Kafka)"] --> JL
        JL --> FE["Feature Extract<br/>(Spark)"]
        FE --> TD["Training Dataset<br/>(Parquet)"]
    end
    style INDEXING fill:#e8eaf6,stroke:#3f51b5
    style TRAINING fill:#e8f5e9,stroke:#4caf50
    style PDB fill:#e8eaf6,stroke:#3f51b5
    style CDC fill:#fff3e0,stroke:#ef6c00
    style IB fill:#fff3e0,stroke:#ef6c00
    style SI fill:#e8f5e9,stroke:#4caf50
    style CL fill:#e8eaf6,stroke:#3f51b5
    style SL fill:#e8eaf6,stroke:#3f51b5
    style JL fill:#fff3e0,stroke:#ef6c00
    style FE fill:#fff3e0,stroke:#ef6c00
    style TD fill:#e8f5e9,stroke:#4caf50
Label Generation¶
Click-based Labels¶
def generate_click_labels(search_logs, click_logs):
    """Generate labels from click data."""
    labels = []
    for search in search_logs:
        impressions = get_impressions(search.search_id)
        clicks = get_clicks(search.search_id)  # dict: product_id -> click info
        for impression in impressions:
            label = {
                "search_id": search.search_id,
                "product_id": impression.product_id,
                "position": impression.position,
                "clicked": impression.product_id in clicks,
                "dwell_time": clicks.get(impression.product_id, {}).get("dwell_time", 0),
                "action": clicks.get(impression.product_id, {}).get("action", None),
            }
            labels.append(label)
    return labels
Graded Relevance Labels¶
def compute_graded_relevance(click_data):
    """Convert clicks to graded relevance (0-4 scale)."""
    if click_data.action == "purchase":
        return 4  # Highly relevant
    elif click_data.action == "add_to_cart":
        return 3  # Very relevant
    elif click_data.dwell_time > 60:
        return 2  # Relevant (long dwell)
    elif click_data.clicked:
        return 1  # Somewhat relevant
    else:
        return 0  # Not relevant
Position Debiasing¶
def apply_position_debiasing(labels, method="inverse_propensity"):
    """Account for position bias in click data."""
    if method == "inverse_propensity":
        # Propensity model: P(click | position), estimated offline;
        # deeper positions are elided here, so fall back to the smallest value
        propensity = {1: 0.5, 2: 0.3, 3: 0.2, 4: 0.15, 5: 0.1}
        floor = min(propensity.values())
        for label in labels:
            if label.clicked:
                # Weight inversely proportional to propensity
                label.weight = 1 / propensity.get(label.position, floor)
            else:
                label.weight = 1.0
    elif method == "randomization":
        # Only use data from randomized experiments
        labels = [l for l in labels if l.was_randomized]
    return labels
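The propensity table itself is typically estimated from the randomized slice of traffic: when results are shuffled, differences in per-position CTR reflect position bias alone, not relevance. A counting sketch (the sample data is made up):

```python
from collections import defaultdict

def estimate_propensities(randomized_impressions):
    """Estimate P(click | position) from randomized-ranking traffic.

    randomized_impressions: iterable of (position, clicked) pairs.
    """
    shown = defaultdict(int)
    clicks = defaultdict(int)
    for position, clicked in randomized_impressions:
        shown[position] += 1
        clicks[position] += int(clicked)
    return {pos: clicks[pos] / shown[pos] for pos in shown}

data = [(1, True), (1, True), (1, False), (1, False),
        (2, True), (2, False), (2, False), (2, False)]
prop = estimate_propensities(data)
# position 1: 2 clicks / 4 shows; position 2: 1 click / 4 shows
```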
Training Data Format¶
Pointwise¶
# Each example: (query, document, label)
{
    "query_id": "q123",
    "doc_id": "d456",
    "query_features": {...},
    "doc_features": {...},
    "query_doc_features": {...},
    "label": 1,     # clicked
    "weight": 1.5,  # position debiased
}
Pairwise¶
# Each example: (query, doc_positive, doc_negative)
{
    "query_id": "q123",
    "positive_doc_id": "d456",  # clicked
    "negative_doc_id": "d789",  # not clicked, shown higher
    "query_features": {...},
    "positive_doc_features": {...},
    "negative_doc_features": {...},
}
Listwise¶
# Each example: (query, [documents], [relevance_labels])
{
    "query_id": "q123",
    "doc_ids": ["d1", "d2", "d3", "d4", "d5"],
    "query_features": {...},
    "doc_features": [{...}, {...}, {...}, {...}, {...}],
    "labels": [2, 0, 1, 3, 0],  # graded relevance
}
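Listwise examples are produced by grouping pointwise rows on query_id; learning-to-rank tooling (e.g. LightGBM's lambdarank objective) consumes exactly this grouped layout. A grouping sketch:

```python
from collections import defaultdict

def to_listwise(pointwise_rows):
    """Group (query_id, doc_id, label) rows into listwise examples."""
    groups = defaultdict(list)
    for row in pointwise_rows:
        groups[row["query_id"]].append(row)
    return [
        {
            "query_id": qid,
            "doc_ids": [r["doc_id"] for r in rows],
            "labels": [r["label"] for r in rows],
        }
        for qid, rows in groups.items()
    ]

rows = [
    {"query_id": "q1", "doc_id": "d1", "label": 2},
    {"query_id": "q1", "doc_id": "d2", "label": 0},
    {"query_id": "q2", "doc_id": "d3", "label": 1},
]
examples = to_listwise(rows)  # two groups: q1 with 2 docs, q2 with 1
```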
Search Index Schema¶
Elasticsearch Mapping¶
{
  "mappings": {
    "properties": {
      "product_id": {"type": "keyword"},
      "title": {
        "type": "text",
        "analyzer": "russian",
        "fields": {
          "keyword": {"type": "keyword"},
          "autocomplete": {
            "type": "text",
            "analyzer": "autocomplete"
          }
        }
      },
      "description": {
        "type": "text",
        "analyzer": "russian"
      },
      "brand": {"type": "keyword"},
      "category": {"type": "keyword"},
      "price": {"type": "float"},
      "rating": {"type": "float"},
      "in_stock": {"type": "boolean"},
      "popularity_score": {"type": "float"},
      "title_embedding": {
        "type": "dense_vector",
        "dims": 768,
        "index": true,
        "similarity": "cosine"
      }
    }
  }
}
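A retrieval request against this mapping can combine BM25 on title with kNN over title_embedding. A sketch building the request body as a plain dict; the top-level knn section assumes Elasticsearch 8.x, and the query vector here is a zero-filled placeholder for a real query embedding:

```python
def hybrid_query(text: str, query_vector: list, k: int = 10) -> dict:
    """Build a request body mixing BM25 match with kNN vector retrieval.

    Field names follow the mapping above; assumes the Elasticsearch 8.x
    top-level `knn` search option.
    """
    return {
        "query": {
            "bool": {
                "must": [{"match": {"title": text}}],
                "filter": [{"term": {"in_stock": True}}],
            }
        },
        "knn": {
            "field": "title_embedding",
            "query_vector": query_vector,
            "k": k,
            "num_candidates": 100,
        },
        "size": k,
    }

body = hybrid_query("кроссовки nike", [0.0] * 768)  # placeholder vector
```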
Data Quality Checks¶
search_data_quality_checks = {
    "query_logs": [
        "query_text is not null and length > 0",
        "timestamp is valid and not in future",
        "result_count >= 0",
    ],
    "click_logs": [
        "search_id exists in search_logs",
        "position > 0 and position <= result_count",
        "dwell_time >= 0",
    ],
    "products": [
        "title is not null",
        "price > 0",
        "category_id is valid",
        "embeddings have correct dimension",
    ],
    "training_data": [
        "no null features",
        "labels in valid range",
        "positive examples exist for each query",
    ],
}
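A few of the click_logs rules implemented over in-memory rows, as a minimal sketch (production pipelines would run these as batch jobs with a framework such as Great Expectations or dbt tests):

```python
def check_click_logs(click_rows, search_ids):
    """Return violation messages for the click_logs rules listed above."""
    errors = []
    for i, row in enumerate(click_rows):
        if row["search_id"] not in search_ids:
            errors.append(f"row {i}: orphan search_id {row['search_id']}")
        if not row["position"] > 0:
            errors.append(f"row {i}: non-positive position")
        if row["dwell_time_sec"] < 0:
            errors.append(f"row {i}: negative dwell_time")
    return errors

rows = [
    {"search_id": "s1", "position": 1, "dwell_time_sec": 30},
    {"search_id": "s9", "position": 0, "dwell_time_sec": -5},
]
errs = check_click_logs(rows, search_ids={"s1"})
# second row violates all three rules
```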
Misconception: clicks are reliable relevance labels
CTR at position 1 can be 30% while at position 10 it is only 3%, regardless of document relevance. Without position debiasing, the model learns to predict "what sits at the top" rather than "what is relevant." Always apply inverse propensity weighting, or randomize 1-5% of traffic to collect unbiased data.
Misconception: a pointwise loss is enough for ranking
Pointwise training (binary cross-entropy on click/no-click) ignores the relative order of documents. Listwise losses (LambdaLoss, ListMLE) optimize NDCG directly and give 5-8% better results. In an interview, be sure to discuss the pointwise vs pairwise vs listwise trade-offs.
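Since listwise losses target NDCG directly, it helps to be able to state the metric precisely. A sketch using the standard 2^rel - 1 gain and log2 position discount:

```python
import math

def dcg(labels):
    """Discounted cumulative gain with gain 2^rel - 1 and log2 discount."""
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(labels))

def ndcg(labels):
    """DCG normalized by the DCG of the ideal (sorted) ordering."""
    ideal = dcg(sorted(labels, reverse=True))
    return dcg(labels) / ideal if ideal > 0 else 0.0

perfect = ndcg([3, 2, 1, 0])  # already in ideal order
swapped = ndcg([0, 2, 1, 3])  # best document at the bottom
```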
Misconception: embeddings can be computed on the fly
A BERT embedding for one document takes ~10 ms on a GPU. For 10M documents, that is 28 hours per index refresh. Embeddings are computed offline (a batch Spark job) and stored in the index; only the query embedding is computed online (~2 ms with a distilled model).
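The arithmetic behind the 28-hour figure:

```python
docs = 10_000_000
ms_per_doc = 10  # ~10 ms per BERT embedding on one GPU
hours = docs * ms_per_doc / 1000 / 3600
# roughly 28 hours on a single GPU, hence offline batch computation
```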
In the Interview¶
Common mistakes:
"We use clicks as binary 0/1 labels" -- ignores position bias and graded relevance (purchase > cart > click > impression)
"We compute features on the fly for all candidates" -- 1000 candidates x 100 features = 100K operations; won't fit in 20 ms without precomputation
"Query logs aren't needed, item features are enough" -- query features (frequency, avg CTR, intent) and query-document features (BM25 score, title match) are usually the most important by feature importance
Strong answers:
"For labels I use graded relevance: purchase=4, add_to_cart=3, dwell>60s=2, click=1, impression=0, applying inverse propensity weighting by position for debiasing."
"Data pipeline: CDC from the Product DB -> Spark job for indexing + embedding computation (batch, hourly). Click/Search logs via Kafka -> join + label generation -> feature extraction -> Parquet training dataset."
"Three feature groups by importance: (1) query-document match (BM25, title match, brand match), (2) document quality (rating, reviews, conversion rate), (3) personalization (category affinity, brand preference). Precomputed in a Feature Store (Redis), latency <5 ms."