# Scaling a Click Prediction System
Prerequisites: Components, Metrics
Google Ads handles 1M+ QPS (queries per second) with a 10ms latency budget for ML inference. Each query triggers an auction over 100+ candidates, and a CTR prediction is needed for every candidate, which adds up to 100M+ predictions/sec. On top of that, features must stay fresh (< 1 min for user activity), the model is retrained daily, and downtime means direct revenue loss ($30K+ per second for Google).
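As a sanity check on these figures, here is a back-of-envelope calculation; the constants mirror the numbers quoted in this section and are illustrative, not measured:

```python
# Back-of-envelope sizing; constants mirror the figures in this section.
QPS = 1_000_000               # queries per second
CANDIDATES_PER_QUERY = 100    # lower bound of the 100-500 range

predictions_per_sec = QPS * CANDIDATES_PER_QUERY      # 100M predictions/sec
predictions_per_day = predictions_per_sec * 86_400    # ~8.6T predictions/day

print(f"{predictions_per_sec:,} pred/sec, {predictions_per_day:.2e} pred/day")
```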
## Traffic Patterns

### Volume
| Metric | Value |
|---|---|
| Queries per second | 1M+ |
| Candidates per query | 100-500 |
| Predictions per second | 100M+ |
| Daily predictions | 8.5T+ |
| Feature vector size | 1000+ features |
| Model size | 10-100GB (embedding tables) |
### Latency Budget

Total ad serving budget: 100ms
- Query understanding: 10ms
- Candidate retrieval: 20ms (inverted index + ANN)
- Feature assembly: 15ms (feature store lookups)
- CTR prediction: 10ms (ML inference, critical)
- Auction + ranking: 5ms
- Ad rendering: 40ms (client-side)

ML inference breakdown (10ms budget):
- Feature lookup: 3ms (pre-computed in feature store)
- Embedding lookup: 2ms (in-memory embedding table)
- Forward pass: 3ms (optimized model)
- Post-processing: 2ms (calibration, business rules)
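One way to keep each stage honest about its share of the 10ms budget is deadline propagation, a pattern not named above but commonly used for this. A minimal sketch; the `Deadline` class and the stage names are hypothetical:

```python
import time

class Deadline:
    """Carries the remaining inference budget across pipeline stages."""

    def __init__(self, budget_ms: float):
        self._expires = time.monotonic() + budget_ms / 1000.0

    def remaining_ms(self) -> float:
        return max(0.0, (self._expires - time.monotonic()) * 1000.0)

    def check(self, stage: str) -> None:
        # Fail fast: never start a stage whose budget is already gone.
        if self.remaining_ms() <= 0.0:
            raise TimeoutError(f"budget exhausted before {stage!r}")

# 10ms total, spent as: feature lookup 3ms, embedding lookup 2ms,
# forward pass 3ms, post-processing 2ms (the breakdown above).
deadline = Deadline(budget_ms=10.0)
deadline.check("feature_lookup")
```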
## Model Architecture for Scale

### Two-Tower + Cross Network
Serving pattern:

Offline (batch):
- User tower: compute user embeddings every 15 min
- Ad tower: compute ad embeddings on ad update
- Store in: feature store (Redis/Memcached)

Online (per request, 10ms budget):
- Fetch: user_emb + ad_emb (pre-computed)
- Cross: interaction features (real-time)
- Predict: lightweight cross-network head

This reduces online compute from O(full_model) to O(cross_head).
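A minimal sketch of the online path under this pattern. `emb_store` and `cross_head` stand in for a real embedding-store client and the trained head, so their interfaces are assumptions:

```python
import numpy as np

def predict_ctr(user_id: str, ad_id: str, cross_feats: np.ndarray,
                emb_store, cross_head) -> float:
    """Online inference: neither tower runs here. Both embeddings were
    precomputed offline; only the small cross-network head executes."""
    user_emb = emb_store.get(f"user:{user_id}")  # refreshed every 15 min
    ad_emb = emb_store.get(f"ad:{ad_id}")        # refreshed on ad update
    x = np.concatenate([user_emb, ad_emb, cross_feats])
    return float(cross_head.predict(x))          # lightweight head, ~3ms
```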
### Embedding Table Sharding
Problem: Embedding tables = 80% of model size (10-100GB)
- User IDs: 2B users x 128 dim x 4 bytes = 1TB
- Ad IDs: 100M ads x 128 dim x 4 bytes = 50GB
- Feature crosses: Billions of entries
Solution: Distributed embedding lookup
- Shard by hash(entity_id) % num_shards
- Each shard: 10-20GB, fits in single machine RAM
- Parallel lookup: all shards in parallel, < 2ms
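A sketch of the sharded lookup described above. `shards` is assumed to be a list of client objects exposing an `mget`-style batch read (as Redis clients do):

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 50

def shard_for(entity_id: str) -> int:
    # Stable hash: the same entity always lands on the same shard.
    digest = hashlib.md5(entity_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def batch_lookup(entity_ids: list[str], shards: list) -> dict:
    """Group ids by shard, then fan out to all shards in parallel."""
    by_shard: dict[int, list[str]] = {}
    for eid in entity_ids:
        by_shard.setdefault(shard_for(eid), []).append(eid)
    results = {}
    with ThreadPoolExecutor(max_workers=NUM_SHARDS) as pool:
        futures = {pool.submit(shards[s].mget, ids): ids
                   for s, ids in by_shard.items()}
        for fut, ids in futures.items():
            results.update(zip(ids, fut.result()))
    return results
```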
## Horizontal Scaling

### Service Architecture
| Component | Instances | Capacity | Scaling |
|---|---|---|---|
| Query server | 500 | 2K QPS each | HPA by latency |
| Feature store (Redis) | 100 shards | 100M ops/sec total | Shard by key |
| Embedding store | 50 shards | 50GB per shard | Shard by entity ID |
| ML inference | 1000 pods | 100K pred/sec each | HPA by CPU + latency |
| Ranking server | 200 | 5K auctions/sec each | HPA by QPS |
| Kafka (logging) | 50 partitions | 10M events/sec | Partition by user |
### Feature Store Design

Feature freshness requirements:

Real-time (< 1 min):
- Last N clicks, queries (sliding window in Redis)
- Session features (device, location)

Near-real-time (< 15 min):
- User embeddings (recomputed by Flink)
- Click-through rates (smoothed)

Batch (< 24 hours):
- User profile features
- Ad quality scores
- Historical aggregates (7d, 30d CTR)

Storage:
- Redis: real-time + near-real-time features (100 shards, 500GB)
- Memcached: batch features (hot partition, 200GB)
- Bigtable: full feature history (cold storage, 10TB+)
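A sketch of the tiered read path, assuming redis-py and pymemcache-style clients with (de)serialization omitted; `bigtable_read` is a hypothetical cold-tier helper, stubbed here:

```python
def bigtable_read(bigtable, user_id: str) -> dict:
    """Hypothetical cold-tier reader; a real implementation would use
    the google-cloud-bigtable client. Stubbed for this sketch."""
    return {}

def get_user_features(user_id: str, redis, memcached, bigtable) -> dict:
    """Overlay tiers from coldest to freshest so that, for any
    overlapping key, the freshest value wins."""
    feats: dict = {}
    batch = memcached.get(f"batch:{user_id}")     # < 24h freshness
    if batch is None:
        batch = bigtable_read(bigtable, user_id)  # cold fallback (slow path)
    feats.update(batch or {})
    feats.update(redis.hgetall(f"rt:{user_id}"))  # < 1 min freshness
    return feats
```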
## Training Pipeline

### Daily Retrain

Timeline:
- 00:00 Start data collection (yesterday's clicks)
- 02:00 Feature engineering (Spark, 2000 cores)
- 04:00 Model training (100 GPUs, 2 hours)
- 06:00 Validation + calibration check
- 06:30 Shadow mode (1% traffic)
- 07:00 Gradual rollout (10% -> 50% -> 100%)
- 08:00 Full deployment
Safeguards:
- Calibration ratio must be in [0.98, 1.02]
- AUC regression < 0.001
- Revenue impact (shadow): no decrease > 0.1%
- Automatic rollback if guardrails violated
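These safeguards amount to a pre-deployment gate. A minimal sketch with the thresholds listed above; the function name and argument layout are ours, not a real API:

```python
# Thresholds copied from the safeguards above.
CALIBRATION_RANGE = (0.98, 1.02)
MAX_AUC_REGRESSION = 0.001
MAX_SHADOW_REVENUE_DROP = 0.001   # 0.1% in shadow mode

def passes_guardrails(calibration_ratio: float,
                      auc_new: float, auc_prod: float,
                      shadow_revenue_delta: float) -> bool:
    """Any violation blocks the rollout; if the model is already
    partially rolled out, the same check triggers automatic rollback."""
    lo, hi = CALIBRATION_RANGE
    if not lo <= calibration_ratio <= hi:
        return False
    if auc_prod - auc_new > MAX_AUC_REGRESSION:
        return False
    if shadow_revenue_delta < -MAX_SHADOW_REVENUE_DROP:
        return False
    return True
```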
## High Availability

### Degradation Strategy
| Failure | Fallback | Revenue impact |
|---|---|---|
| ML model timeout | Cached predictions (last 1h) | -5% revenue |
| Feature store down | Default features + simple model | -15% revenue |
| Embedding store down | Global avg embeddings | -25% revenue |
| Full ML pipeline down | Rule-based ranking (bid only) | -40% revenue |
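The table maps onto a fallback chain at serve time. A sketch under stated assumptions: `model`, `pred_cache`, `simple_model`, and `rank_by_bid` are all placeholder interfaces, not a real library:

```python
def predict_with_degradation(request, model, pred_cache,
                             simple_model, rank_by_bid):
    """Walk the fallback tiers from most accurate to most available,
    mirroring the degradation table above."""
    try:
        return model.predict(request)               # full ML path (10ms budget)
    except TimeoutError:
        cached = pred_cache.get(request.cache_key)  # last-hour cached prediction
        if cached is not None:
            return cached
    try:
        return simple_model.predict(request.default_features())
    except Exception:
        return rank_by_bid(request)                 # rule-based last resort
```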
### Zero-Downtime Deployment
Blue-green deployment for model updates:
1. Load new model into "green" pods (warm-up)
2. Shift 1% traffic to green (canary)
3. Monitor: latency, calibration, revenue
4. If OK: shift 10% -> 50% -> 100% (over 1 hour)
5. Keep blue pods warm for 2 hours (rollback ready)
Rollback trigger (automatic):
- p99 latency > 15ms (budget = 10ms)
- Calibration ratio outside [0.95, 1.05]
- Revenue drop > 0.5% vs control
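The rollback triggers can live in a small canary monitor evaluated on every metrics tick. A sketch with the thresholds from the list above; the metric names are illustrative:

```python
ROLLBACK_TRIGGERS = {
    "p99_latency_ms":    lambda v: v > 15.0,                # budget is 10ms
    "calibration_ratio": lambda v: not 0.95 <= v <= 1.05,
    "revenue_delta":     lambda v: v < -0.005,              # -0.5% vs control
}

def should_rollback(metrics: dict) -> bool:
    """Checked continuously while traffic shifts 1% -> 10% -> 50% -> 100%;
    any single trigger flips traffic back to the warm blue pods."""
    return any(check(metrics[name])
               for name, check in ROLLBACK_TRIGGERS.items()
               if name in metrics)
```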
## Cost
| Component | Monthly cost |
|---|---|
| ML inference (1000 pods) | $500K |
| Feature store (Redis 100 shards) | $300K |
| Embedding store (50 shards) | $150K |
| Training (100 GPUs daily) | $200K |
| Kafka + logging | $100K |
| Data storage (Bigtable) | $150K |
| Total | ~$1.4M |
ROI: Google Ads revenue is ~$280B/year; infra cost is ~$17M/year; ROI > 16,000x.
Misconception: just increase the batch size for throughput

In ad serving the latency budget is 10ms. Batching raises throughput but adds latency (the wait for the batch to fill). At 1M QPS and a batch size of 32, batch fill time is 32µs, which is fine. But at 100K QPS, batch fill takes 320µs, and under uneven load batching can violate the p99 latency target. The fix is adaptive batching: the batch size adapts to the current load, as in the sketch below.
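A minimal sketch of adaptive batching with a wait deadline: the batch closes on whichever comes first, `max_batch` requests or `max_wait_ms`, so a traffic lull never stalls the tail. Names and defaults are illustrative:

```python
import queue
import time

def collect_batch(q: queue.Queue, max_batch: int = 32,
                  max_wait_ms: float = 0.5) -> list:
    """Block for the first request, then fill until the size cap or
    the wait deadline, whichever is hit first."""
    batch = [q.get()]  # block until the first request arrives
    deadline = time.monotonic() + max_wait_ms / 1000.0
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break  # deadline hit with a partial batch: ship it
    return batch
```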
Misconception: the model can be updated once a week

The CTR distribution drifts fast: new ad campaigns, seasonality, one-off events. Without a daily retrain the model loses 0.1-0.3% AUC per week, which translates to $28-84M of lost revenue for Google. A daily retrain plus continuous feature updates is the industry standard; Facebook trains its ad models continuously (streaming).
## Interview Section
Question: "How do you sustain 1M QPS with 10ms latency?"

Weak answer: "Lots of servers and caching."

Strong answer: "Four key decisions. (1) Two-tower architecture: user and ad embeddings are precomputed offline (every 15 min), so online inference is just the lightweight cross-network head, 3ms instead of 30ms. (2) Embedding table sharding: 100GB+ tables are sharded by hash(entity_id), with parallel lookup in < 2ms. (3) Feature store tiering: real-time features in Redis (< 1ms), batch features in Memcached, cold data in Bigtable. (4) 1000 stateless inference pods with HPA on p99 latency. Deployment: blue-green with canary (1% -> 10% -> 100%) and automatic rollback on calibration drift > 2% or a revenue drop > 0.5%. Total infra: $1.4M/month, ROI > 16,000x."