Search Ranking: Passing the Interview¶
~3 min read
Prerequisites: Problem Definition | System Components | Data Requirements
The "Design Search Ranking" case is one of the top 5 most frequent in ML System Design interviews (Google, Amazon, LinkedIn, Spotify). The interviewer evaluates your ability to decompose the problem into stages (retrieval -> ranking -> re-ranking), pick a model that fits the latency budget (200ms), and justify the choice of features. Below is a step-by-step plan for a 45-60 minute interview, with example dialogues and typical follow-up questions.
Interview Framework (45-60 min)¶
Timeline¶
0-5 min: Clarifying questions
5-15 min: High-level design
15-30 min: Deep dive (ranking model, features)
30-45 min: Scaling & trade-offs
45-60 min: Extensions & Q&A
Step 1: Clarifying Questions (5 min)¶
Essential Questions¶
**Scope:**
- What type of search? (e-commerce, web, people)
- What are we searching? (products, documents, users)
**Scale:**
- Index size? (1M vs 1B documents)
- QPS? (1K vs 100K)
- Latency requirement? (100ms vs 500ms)
**Features:**
- Personalization needed?
- Real-time indexing or batch?
- Autocomplete/suggestions?
- Filters and facets?
**Data:**
- What engagement data? (clicks, conversions)
- Historical search logs available?
Example Dialogue¶
You: "What type of content are we searching?"
Interviewer: "Products on an e-commerce platform like Amazon"
You: "How many products and what's the QPS?"
Interviewer: "10 million products, 50K QPS at peak"
You: "What's the latency budget?"
Interviewer: "Under 200ms end-to-end"
You: "Is personalization required?"
Interviewer: "Yes, results should be personalized to user preferences"
Step 2: High-Level Design (10 min)¶
API Design¶
# Request
GET /search?q=красные+кроссовки&category=shoes&price_max=5000&page=1
# Response
{
"query": "красные кроссовки",
"corrected_query": null,
"total_results": 1523,
"results": [
{
"product_id": "p123",
"title": "Nike Air Max Red",
"price": 4500,
"rating": 4.5,
"position": 1,
"score": 0.95
},
...
],
"facets": {
"brand": [{"name": "Nike", "count": 50}, ...],
"size": [{"name": "42", "count": 30}, ...]
},
"suggestions": ["красные кроссовки nike", "красные кроссовки adidas"],
"latency_ms": 85
}
Architecture Diagram¶
graph TD
REQ["SEARCH REQUEST"] --> SERV["SEARCH SERVICE"]
subgraph SERV_SUB["Pipeline Stages"]
QP["Query Processing"] --> RET["Retrieval<br/>(BM25+ANN)"]
RET --> RANK["Ranking<br/>(LTR)"]
RANK --> RERANK["Re-ranking<br/>(Diversity)"]
end
SERV --> QP
SERV --> IDX["Search Index<br/>(Elasticsearch)"]
SERV --> FS["Feature Store<br/>(Redis)"]
SERV --> MODEL["Ranking Model<br/>(LightGBM)"]
style REQ fill:#e8eaf6,stroke:#3f51b5
style SERV fill:#fff3e0,stroke:#ef6c00
style IDX fill:#e8f5e9,stroke:#4caf50
style FS fill:#e8f5e9,stroke:#4caf50
style MODEL fill:#e8f5e9,stroke:#4caf50
Flow Explanation¶
"The search pipeline has 4 main stages:
1. **Query Processing** (10ms)
- Normalize and tokenize query
- Spell check and correction
- Entity extraction (brand, size, color)
- Query expansion with synonyms
2. **Retrieval** (50ms)
- Hybrid: BM25 (lexical) + semantic embeddings
- From 10M products → 1000 candidates
- Apply hard filters (price, availability)
3. **Ranking** (80ms)
- Learning-to-rank model (LambdaMART)
- Rich features: query-doc, user, context
- 1000 → 100 ranked results
4. **Re-ranking** (20ms)
- Diversity (don't show all Nike)
- Freshness boost
- Promotional slots
Total: ~160ms, well within the 200ms budget"
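A minimal sketch of how these stages could be wired together with per-stage timing; the stage functions (`process_query`, `retrieve`, `rank`, `rerank`) are placeholders for the components described above, not a specific framework:

```python
import time

def timed(stage_fn, *args):
    """Run one pipeline stage and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = stage_fn(*args)
    return result, (time.perf_counter() - start) * 1000

def search(raw_query, user_id, process_query, retrieve, rank, rerank):
    """Sketch of the 4-stage pipeline; target budget ~10 + 50 + 80 + 20 = 160ms."""
    parsed, t1 = timed(process_query, raw_query)           # normalize, spell check, NER
    candidates, t2 = timed(retrieve, parsed)               # 10M -> ~1000 via BM25 + ANN
    ranked, t3 = timed(rank, parsed, candidates, user_id)  # LTR: 1000 -> ~100
    final, t4 = timed(rerank, ranked)                      # diversity, freshness, promo slots
    return {"results": final, "latency_ms": round(t1 + t2 + t3 + t4)}
```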
Step 3: Deep Dive (15 min)¶
Retrieval Strategy¶
"Let me explain the retrieval stage in detail..."
"The challenge is: we have 10M products, can't score all with ML.
So we need fast retrieval to get ~1000 candidates.
I'd use **hybrid retrieval**:
1. **Lexical (BM25)** - 60% of candidates
- Fast, handles exact matches well
- Elasticsearch with proper analyzers
- Query: title^3, description, brand^2
2. **Semantic (Vector Search)** - 40% of candidates
- BERT embeddings for query and products
- Catches semantic similarity
- 'sneakers' matches 'кроссовки'
Merging strategy:
- Reciprocal Rank Fusion (RRF)
- Or weighted linear combination
Why hybrid?
- Lexical alone misses semantic matches
- Semantic alone misses exact matches
- Hybrid gets best of both"
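A sketch of the Reciprocal Rank Fusion merge mentioned above; k=60 is the commonly used constant, and the document IDs in the usage example are made up:

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60, top_n=1000):
    """RRF: score(doc) = sum over lists of 1 / (k + rank of doc in that list)."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Usage: merge BM25 and vector-search candidate lists.
bm25_hits = ["p123", "p456", "p789"]
ann_hits = ["p456", "p999", "p123"]
candidates = reciprocal_rank_fusion([bm25_hits, ann_hits])
```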
Ranking Model¶
"For the ranking model, I'd use LambdaMART..."
"Why LambdaMART?
- Optimizes for NDCG directly
- Fast inference (gradient boosted trees)
- Interpretable feature importance
Features I'd include:
**Query-Document Matching (most important)**
- BM25 score
- Title match ratio
- Exact match in title
- Brand match
- Category match
**Document Quality**
- Rating (4.5 stars)
- Review count
- Conversion rate
- Return rate
**Personalization**
- User's category affinity
- User's brand preference
- Previous purchases in category
- Click history with similar items
**Context**
- Position bias correction
- Device type
- Time of day
Training approach:
- Listwise loss (LambdaLoss)
- Position debiasing with inverse propensity
- Graded relevance: purchase=4, cart=3, click=2, impression=0"
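A minimal training sketch, assuming the features, graded labels, and per-query group sizes have already been built from search logs; it uses LightGBM's LGBMRanker with the lambdarank objective (LightGBM's LambdaMART-style implementation), and the random data here is just a stand-in:

```python
import numpy as np
import lightgbm as lgb

# Stand-ins for prepared training data:
# X      - features (BM25 score, title match, rating, user affinity, ...)
# y      - graded relevance per (query, product): purchase=4, cart=3, click=2, impression=0
# groups - number of candidates per query, in the same row order as X
X = np.random.rand(1000, 20)
y = np.random.choice([0, 2, 3, 4], size=1000)
groups = [50] * 20  # 20 queries x 50 candidates each

ranker = lgb.LGBMRanker(
    objective="lambdarank",  # listwise, LambdaMART-style
    metric="ndcg",
    n_estimators=300,
    num_leaves=63,
)
ranker.fit(X, y, group=groups)
scores = ranker.predict(X[:50])  # scores for one query's candidates
```

Position-debiasing weights (see the follow-up question on position bias below) can be passed to `fit` via `sample_weight`.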
Two-Stage Ranking¶
"Actually, I'd use two-stage ranking for latency..."
"Stage 1: Coarse ranking (fast)
- Lightweight model (logistic regression)
- 20 features only
- 1000 → 200 candidates
- 20ms
Stage 2: Fine ranking (accurate)
- Full LambdaMART model
- 100+ features
- 200 → 50 final results
- 50ms
Why two-stage?
- Can't run heavy model on 1000 items in time
- Coarse filter removes obviously bad results
- Fine ranking on quality candidates"
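A sketch of the two-stage cascade, assuming both models are already trained and the feature matrices are row-aligned with `candidate_ids`:

```python
import numpy as np

def two_stage_rank(candidate_ids, coarse_features, full_features,
                   coarse_model, fine_model, coarse_top=200, final_top=50):
    """Cascade: a cheap model prunes ~1000 -> 200, the full model ranks 200 -> 50."""
    # Stage 1: lightweight model (e.g. logistic regression) over ~20 features.
    keep = np.argsort(-coarse_model.predict(coarse_features))[:coarse_top]

    # Stage 2: full LambdaMART over 100+ features, but only for the survivors.
    order = np.argsort(-fine_model.predict(full_features[keep]))[:final_top]
    return [candidate_ids[i] for i in keep[order]]
```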
Step 4: Scaling & Trade-offs (15 min)¶
Scaling for 50K QPS¶
"For 50K QPS, here's the architecture..."
"1. Search Service
- Stateless, horizontal scaling
- 100 pods, each handling 500 QPS
- Auto-scaling on CPU/latency
2. Elasticsearch Cluster
- 10 data nodes, 5 shards per index
- Read replicas for hot queries
- SSD storage for low latency
3. Feature Store
- Redis cluster for user features
- Precomputed document features in ES
- < 5ms feature retrieval
4. Model Serving
- LightGBM is fast (CPU)
- Batch scoring (32 items at once)
- Local caching of model
5. Caching
- Popular queries cached (~30% hit rate)
- CDN for static assets
- 5-minute TTL"
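A sketch of the popular-query cache with a 5-minute TTL, assuming a redis-py client and JSON-serializable results; the key scheme is illustrative:

```python
import json
import redis  # redis-py; assumes a reachable Redis instance

cache = redis.Redis(host="localhost", port=6379)
CACHE_TTL_SECONDS = 300  # 5-minute TTL

def cached_search(query, filters, run_search):
    """Serve repeated queries from Redis; fall through to the pipeline on a miss."""
    key = "search:" + json.dumps({"q": query, "f": filters}, sort_keys=True)
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)
    results = run_search(query, filters)
    cache.setex(key, CACHE_TTL_SECONDS, json.dumps(results))
    return results
```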
Latency Breakdown¶
"Latency budget breakdown..."
Total: 200ms
Query Processing: 10ms
├── Tokenization: 2ms
├── Spell check: 3ms
├── NER: 3ms
└── Embedding: 2ms
Retrieval: 50ms
├── BM25 query: 30ms
├── Vector search: 15ms
└── Merge: 5ms
Ranking: 80ms
├── Feature fetch: 20ms
├── Coarse rank: 20ms
└── Fine rank: 40ms
Re-ranking: 20ms
├── Diversity: 10ms
├── Business rules: 5ms
└── Response build: 5ms
Buffer: 40ms
Key Trade-offs¶
"Several important trade-offs..."
"1. **Recall vs Latency**
- More candidates = better recall, slower
- Solution: Fast retrieval, heavy ranking
- Balance: 1000 candidates is sweet spot
2. **Personalization vs Freshness**
- User features can be stale
- Solution: Blend personalized + popular
3. **Relevance vs Diversity**
- Pure relevance → all same brand
- Solution: MMR re-ranking
- Balance: diversity_weight = 0.2
4. **Semantic vs Lexical**
- Semantic catches meaning but can be wrong
- Lexical is precise but misses synonyms
- Solution: Hybrid with tunable weight
5. **Online vs Offline Features**
- Real-time features fresher but expensive
- Solution: Precompute what you can"
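A sketch of the MMR re-ranking mentioned in trade-off 3; `relevance` maps item to ranking score, `similarity` is any pairwise measure (brand overlap, embedding cosine), and `diversity_weight=0.2` matches the balance above:

```python
def mmr_rerank(items, relevance, similarity, diversity_weight=0.2, top_n=50):
    """Maximal Marginal Relevance: penalize items similar to ones already selected."""
    selected, remaining = [], list(items)
    while remaining and len(selected) < top_n:
        def mmr_score(item):
            max_sim = max((similarity(item, s) for s in selected), default=0.0)
            return (1 - diversity_weight) * relevance[item] - diversity_weight * max_sim
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected
```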
Step 5: Extensions & Q&A (10 min)¶
Common Follow-up Questions¶
Q: How do you handle typos?
"Multi-layered approach:
1. Edit distance to known vocabulary
2. Query log mining for common corrections
3. Phonetic matching (Soundex, Metaphone)
4. Character n-gram matching
For 'ккроссовки' → 'кроссовки':
- Edit distance: 1 deletion
- High frequency in query logs
- Confidence: high → auto-correct"
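A sketch of the edit-distance layer of that approach; the vocabulary frequencies would come from query-log mining, and the numbers here are made up:

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution
        prev = cur
    return prev[-1]

def correct(token, vocab_freq, max_dist=1):
    """Pick the most frequent vocabulary term within the edit-distance threshold."""
    candidates = [(freq, term) for term, freq in vocab_freq.items()
                  if edit_distance(token, term) <= max_dist]
    return max(candidates)[1] if candidates else token

# 'ккроссовки' is one deletion away from the frequent term 'кроссовки'.
print(correct("ккроссовки", {"кроссовки": 120_000, "кросс": 8_000}))
```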
Q: How do you handle zero results?
"Zero results are terrible UX. Solutions:
1. Query relaxation
- Remove least important terms
- Expand synonyms more aggressively
2. Fallback strategies
- Show related categories
- Show popular items in category
3. Prevention
- Better spell correction
- Suggest similar queries
Monitor zero-result rate as key metric"
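A sketch of the query-relaxation fallback, assuming a per-term importance score (e.g. IDF) and a `run_search` callable:

```python
def relax_query(tokens, term_importance, run_search, min_tokens=1):
    """Drop the least important term until the search returns something."""
    tokens = list(tokens)
    results = run_search(tokens)
    while not results and len(tokens) > min_tokens:
        tokens.remove(min(tokens, key=lambda t: term_importance.get(t, 0.0)))
        results = run_search(tokens)
    return results, tokens
```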
Q: How do you handle position bias?
"Position bias: users click top results more regardless of relevance.
Problem: Training data is biased.
Solutions:
1. **Inverse Propensity Weighting**
- Weight = 1 / P(click | position)
- Down-weight clicks on top positions
2. **Randomization**
- 1% traffic: shuffle top results
- Use unbiased data for training
3. **Propensity Model**
- Train model to predict P(click | position)
- Use as feature or for debiasing"
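A sketch of inverse propensity weighting as described in option 1; the per-position propensities are illustrative and would in practice be estimated, e.g. from the 1% randomized traffic:

```python
# Estimated click propensity per position (illustrative numbers).
click_propensity = {1: 0.40, 2: 0.25, 3: 0.15, 4: 0.10, 5: 0.07}

def ipw_weight(position, clicked):
    """Clicks at high-propensity (top) positions get smaller training weight."""
    if not clicked:
        return 1.0
    return 1.0 / click_propensity.get(position, 0.05)

# Weights like these can be passed as sample_weight when training the ranker.
weights = [ipw_weight(pos, clk) for pos, clk in [(1, True), (5, True), (2, False)]]
```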
Q: How do you A/B test ranking changes?
"Interleaving is best for ranking:
1. Team Draft Interleaving
- Alternate between control/treatment
- Count which got more clicks
2. Delta NDCG
- Compare ranking quality offline first
- Online only if offline wins
3. Guardrail metrics
- Revenue per search
- Zero result rate
- Query abandonment
Ship if: statistically significant improvement
AND no guardrail regressions"
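A sketch of Team Draft Interleaving: in each round a coin flip decides which ranker picks first, each ranker contributes its best not-yet-shown result, and clicks are credited to the team that contributed the clicked document:

```python
import random

def team_draft_interleave(ranking_a, ranking_b):
    """Return the interleaved list plus a doc -> team map used to credit clicks."""
    shown, team_of = [], {}
    ia = ib = 0
    while ia < len(ranking_a) and ib < len(ranking_b):
        order = ("A", "B") if random.random() < 0.5 else ("B", "A")
        for team in order:
            ranking = ranking_a if team == "A" else ranking_b
            idx = ia if team == "A" else ib
            while idx < len(ranking) and ranking[idx] in team_of:
                idx += 1  # skip docs the other team already placed
            if idx < len(ranking):
                team_of[ranking[idx]] = team
                shown.append(ranking[idx])
            if team == "A":
                ia = idx + 1
            else:
                ib = idx + 1
    return shown, team_of
```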
Interview Checklist¶
Must Cover:¶
- Clarifying questions
- Two-stage pipeline (retrieval + ranking)
- Hybrid retrieval (lexical + semantic)
- Ranking model choice and features
- Latency budget breakdown
- Scaling for QPS
Good to Cover:¶
- Query understanding (NER, intent)
- Position bias handling
- Personalization approach
- Diversity in results
- A/B testing strategy
Red Flags:¶
- Not discussing two-stage (retrieval vs ranking)
- Ignoring latency constraints
- Only lexical OR only semantic (should be hybrid)
- Not mentioning position bias
- Forgetting about cold start
Sample Script¶
Interviewer: "Design search for Amazon"
You: "Great! Let me clarify a few things first.
What's the scale - how many products and QPS?"
Interviewer: "10 million products, 50K QPS"
You: "And the latency requirement?"
Interviewer: "Under 200ms"
You: "Perfect. Let me outline my approach.
[Draw architecture]
The key insight is we need two stages:
fast retrieval to get candidates,
then ML ranking for precision.
For retrieval, I'd use hybrid search:
- BM25 for lexical matching
- BERT embeddings for semantic
- Merge with RRF
For ranking, LambdaMART with features:
- Query-doc matching (BM25, title match)
- Quality signals (rating, reviews)
- Personalization (user preferences)
For 50K QPS, scale horizontally with
100 pods and Redis for user features.
Shall I dive deeper into any component?"
Misconception: drawing the architecture and listing the components is enough
The interviewer expects concrete numbers: a latency budget per stage (10ms + 50ms + 80ms + 20ms = 160ms), the number of candidates at each step (10M -> 1000 -> 200 -> 50), and a justification for the model choice. An abstract answer like "an ML model for ranking" with no details is a red flag.
Misconception: clarifying questions can be skipped
The difference between web search and e-commerce search determines the entire architecture: different metrics (NDCG vs conversion rate), different features (PageRank vs inventory), different latency budgets (200ms vs 500ms). Five minutes of clarification save twenty minutes of rework.
Misconception: interleaving and A/B testing are the same thing
An A/B test splits users into two groups and compares metrics. Interleaving shows a single user the results of both systems mixed together and counts which one collected more clicks. Interleaving identifies the winner with 10-100x less traffic and is the standard for search ranking.
In the Interview¶
Common mistakes:
"I would use a neural network for ranking" -- without naming a concrete architecture (LambdaMART vs cross-encoder vs DIN) or justifying why it fits the latency budget
Drawing the architecture without numbers -- no QPS, no latency breakdown, no candidate counts per stage
"A/B test: 50% of users see the new model" -- for search ranking, interleaving (Team Draft) gives reliable results with 10-100x less traffic
Strong answers:
"I'll start with clarifications: type of search, scale (index size, QPS), latency budget, available signals. Then I'll draw a 4-stage pipeline with concrete numbers for latency and candidate filtering at each step."
"For zero results: query relaxation (drop the least important tokens), fall back to popular items in the category, and monitor the zero-result rate as a guardrail metric (<1%)."
"Recall vs latency trade-off: at 50K QPS with a 200ms budget, the sweet spot is ~1000 candidates from retrieval. Above 2000 we blow the latency budget; below 500 we lose recall."