
Scaling a Click Prediction System


Prerequisites: Components, Metrics

Google Ads handles 1M+ QPS (queries per second) with a 10ms latency budget for ML inference. Each query triggers an auction over 100+ candidates, and a CTR prediction is needed for every candidate -- that is 100M+ predictions/sec. Meanwhile, features must stay fresh (< 1 min for user activity), the model is retrained daily, and downtime means direct revenue loss ($30K+ per second for Google).

Traffic Patterns

Volume

Metric                    Value
Queries per second        1M+
Candidates per query      100-500
Predictions per second    100M+
Daily predictions         8.5T+
Feature vector size       1000+ features
Model size                10-100GB (embedding tables)

Latency Budget

Total ad serving: 100ms

Query understanding:     10ms
Candidate retrieval:     20ms  (inverted index + ANN)
Feature assembly:        15ms  (feature store lookups)
CTR prediction:          10ms  (ML inference, critical)
Auction + ranking:        5ms
Ad rendering:            40ms  (client-side)

ML inference breakdown (10ms budget):
  Feature lookup:     3ms  (pre-computed in feature store)
  Embedding lookup:   2ms  (in-memory embedding table)
  Forward pass:       3ms  (optimized model)
  Post-processing:    2ms  (calibration, business rules)
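
One common way to enforce a hard budget like this is to thread a single deadline through the stages, so a slow feature lookup cannot silently eat into the forward pass's time. A minimal sketch; feature_lookup, embedding_lookup, and forward_pass are stubbed-out placeholders, not the real services:

  import time

  class Deadline:
      """One end-to-end budget shared by all serving stages."""
      def __init__(self, budget_ms: float):
          self._expires = time.monotonic() + budget_ms / 1000.0

      def remaining_ms(self) -> float:
          return max(0.0, (self._expires - time.monotonic()) * 1000.0)

  def feature_lookup(keys, timeout_ms):    # stub for the feature store call
      return {k: 0.0 for k in keys}

  def embedding_lookup(ids, timeout_ms):   # stub for the embedding shards
      return {i: [0.0] * 128 for i in ids}

  def forward_pass(feats, embs, timeout_ms):  # stub for the model server
      return [0.01] * len(embs)

  d = Deadline(10.0)  # the 10ms ML inference budget from above
  feats = feature_lookup(["u:42"], timeout_ms=d.remaining_ms())
  embs = embedding_lookup(["ad:7"], timeout_ms=d.remaining_ms())
  scores = forward_pass(feats, embs, timeout_ms=d.remaining_ms())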

Model Architecture for Scale

Two-Tower + Cross Network

Serving pattern:

Offline (batch):
  - User tower:    compute user embeddings every 15 min
  - Ad tower:      compute ad embeddings on ad update
  - Store in:      feature store (Redis/Memcached)

Online (per request, 10ms budget):
  - Fetch:         user_emb + ad_emb (pre-computed)
  - Cross:         interaction features (real-time)
  - Predict:       lightweight cross-network head

This reduces online compute from O(full_model) to O(cross_head)
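
A minimal sketch of that online step, assuming 128-dim embeddings (matching the table-sizing math below) and a plain linear head standing in for the real cross network:

  import numpy as np

  EMB_DIM = 128  # matches the embedding-table math in the next section

  def predict_ctr(user_emb, ad_emb, cross_features, w, b):
      """Online path only: embeddings arrive pre-computed from the feature
      store; this head is the only per-request model compute."""
      interaction = user_emb * ad_emb                    # element-wise cross
      x = np.concatenate([interaction, cross_features])
      logit = float(x @ w + b)
      return 1.0 / (1.0 + np.exp(-logit))                # sigmoid -> pCTR

  # Usage with random stand-ins for the pre-computed pieces:
  rng = np.random.default_rng(0)
  u, a = rng.normal(size=EMB_DIM), rng.normal(size=EMB_DIM)
  cross = rng.normal(size=8)                             # real-time features
  w, b = rng.normal(size=EMB_DIM + 8) * 0.01, -4.0
  print(predict_ctr(u, a, cross, w, b))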

Embedding Table Sharding

Problem: Embedding tables = 80% of model size (10-100GB)
  - User IDs: 2B users x 128 dim x 4 bytes = 1TB
  - Ad IDs: 100M ads x 128 dim x 4 bytes = 50GB
  - Feature crosses: Billions of entries

Solution: Distributed embedding lookup
  - Shard by hash(entity_id) % num_shards
  - Each shard: 10-20GB, fits in single machine RAM
  - Parallel lookup: all shards in parallel, < 2ms
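
A sketch of the parallel sharded lookup under these assumptions; FakeShard stands in for a real RPC client, and crc32 is just one possible stable hash:

  import concurrent.futures
  import zlib

  NUM_SHARDS = 50

  def shard_of(entity_id: str) -> int:
      # Stable hash -> shard index; must match the scheme used at write time.
      return zlib.crc32(entity_id.encode()) % NUM_SHARDS

  class FakeShard:                        # stand-in for a real RPC client
      def get_many(self, ids):
          return {i: [0.0] * 128 for i in ids}

  def batch_lookup(entity_ids, shards):
      """Group IDs by shard, then hit all shards in parallel, so latency is
      max(one shard) rather than the sum -- this is what keeps it < 2ms."""
      by_shard = {}
      for eid in entity_ids:
          by_shard.setdefault(shard_of(eid), []).append(eid)
      result = {}
      with concurrent.futures.ThreadPoolExecutor() as pool:
          futures = [pool.submit(shards[s].get_many, ids)
                     for s, ids in by_shard.items()]
          for fut in concurrent.futures.as_completed(futures):
              result.update(fut.result())
      return result

  shards = [FakeShard() for _ in range(NUM_SHARDS)]
  embs = batch_lookup([f"user:{i}" for i in range(1000)], shards)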

Horizontal Scaling

Service Architecture

Component              Instances      Capacity               Scaling
Query server           500            2K QPS each            HPA by latency
Feature store (Redis)  100 shards     100M ops/sec total     Shard by key
Embedding store        50 shards      50GB per shard         Shard by entity ID
ML inference           1000 pods      100K pred/sec each     HPA by CPU + latency
Ranking server         200            5K auctions/sec each   HPA by QPS
Kafka (logging)        50 partitions  10M events/sec         Partition by user

Feature Store Design

Feature freshness requirements:

Real-time (< 1 min):
  - Last N clicks, queries (sliding window in Redis)
  - Session features (device, location)

Near-real-time (< 15 min):
  - User embeddings (recomputed by Flink)
  - Click-through rates (smoothed)

Batch (< 24 hours):
  - User profile features
  - Ad quality scores
  - Historical aggregates (7d, 30d CTR)

Storage:
  Redis:     Real-time + near-real-time features (100 shards, 500GB)
  Memcached: Batch features (hot partition, 200GB)
  Bigtable:  Full feature history (cold storage, 10TB+)
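
A sketch of the tiered read path implied by this layout, with plain dicts standing in for the Redis/Memcached/Bigtable clients:

  def fetch_features(user_id, rt_tier, batch_tier, cold_tier):
      """Tiered read: freshest tier first, colder tiers fill the rest.
      Tiers here are dicts standing in for the real stores."""
      features = dict(rt_tier.get(f"rt:{user_id}", {}))   # < 1 min fresh
      batch = batch_tier.get(f"batch:{user_id}")
      if batch is None:
          batch = cold_tier.get(user_id, {})              # slow cold path
          if batch:
              batch_tier[f"batch:{user_id}"] = batch      # backfill hot tier
      features.update(batch)
      return features

  # Stand-ins for the three stores:
  redis_like = {"rt:u42": {"clicks_5m": 3}}
  memcached_like = {}
  bigtable_like = {"u42": {"ctr_30d": 0.021, "profile_age_days": 410}}
  print(fetch_features("u42", redis_like, memcached_like, bigtable_like))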

Training Pipeline

Daily Retrain

Timeline:
  00:00  Start data collection (yesterday's clicks)
  02:00  Feature engineering (Spark, 2000 cores)
  04:00  Model training (100 GPUs, 2 hours)
  06:00  Validation + calibration check
  06:30  Shadow mode (1% traffic)
  07:00  Gradual rollout (10% -> 50% -> 100%)
  08:00  Full deployment

Safeguards:
  - Calibration ratio must be in [0.98, 1.02]
  - AUC regression < 0.001
  - Revenue impact (shadow): no decrease > 0.1%
  - Automatic rollback if guardrails violated
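
These safeguards reduce to a boolean gate between validation and rollout; a sketch with illustrative metric names:

  def passes_guardrails(m: dict) -> bool:
      """Deployment gate run after validation + shadow mode; thresholds
      mirror the safeguards above."""
      return (
          0.98 <= m["calibration_ratio"] <= 1.02
          and m["auc_baseline"] - m["auc_candidate"] < 0.001
          and m["shadow_revenue_delta"] > -0.001        # no drop > 0.1%
      )

  # Example: this candidate is rejected on calibration alone.
  print(passes_guardrails({"calibration_ratio": 1.04,
                           "auc_baseline": 0.803, "auc_candidate": 0.8028,
                           "shadow_revenue_delta": 0.0002}))  # False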

High Availability

Degradation Strategy

Failure                 Fallback                          Revenue impact
ML model timeout        Cached predictions (last 1h)      -5%
Feature store down      Default features + simple model   -15%
Embedding store down    Global avg embeddings             -25%
Full ML pipeline down   Rule-based ranking (bid only)     -40%
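
The ladder reads as a try/except cascade; a sketch where ctx bundles illustrative clients and models (none of these names are from a real API):

  def score_with_fallback(req, ctx):
      """Degradation ladder from the table above: each rung trades
      accuracy (and revenue) for availability."""
      try:
          feats = ctx.feature_store.fetch(req)
      except ConnectionError:
          # Feature store down: default features + simple model (-15%)
          return ctx.simple_model.predict(ctx.default_features(req))
      try:
          embs = ctx.embedding_store.lookup(req.entity_ids)
      except ConnectionError:
          embs = ctx.global_avg_embeddings       # embedding store down (-25%)
      try:
          return ctx.full_model.predict(feats, embs, timeout_ms=10)
      except TimeoutError:
          cached = ctx.prediction_cache.get(req.key)   # last 1h (-5%)
          if cached is not None:
              return cached
          return req.bid                         # rule-based: bid only (-40%)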

Zero-Downtime Deployment

Blue-green deployment for model updates:

1. Load new model into "green" pods (warm-up)
2. Shift 1% traffic to green (canary)
3. Monitor: latency, calibration, revenue
4. If OK: shift 10% -> 50% -> 100% (over 1 hour)
5. Keep blue pods warm for 2 hours (rollback ready)

Rollback trigger (automatic):
  - p99 latency > 15ms (budget = 10ms)
  - Calibration ratio outside [0.95, 1.05]
  - Revenue drop > 0.5% vs control
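
A sketch of the rollout loop with these triggers wired in; get_metrics, shift_traffic, and rollback are illustrative callbacks, and ~15 min per step gives the one-hour rollout described above:

  import time

  def canary_rollout(get_metrics, shift_traffic, rollback,
                     steps=(1, 10, 50, 100), watch_s=900, poll_s=30):
      """Blue-green rollout; rollback triggers match the list above."""
      for pct in steps:
          shift_traffic(green_pct=pct)
          deadline = time.monotonic() + watch_s
          while time.monotonic() < deadline:
              m = get_metrics()
              if (m["p99_latency_ms"] > 15.0
                      or not 0.95 <= m["calibration_ratio"] <= 1.05
                      or m["revenue_delta_vs_control"] < -0.005):
                  rollback()            # shift 100% back to blue pods
                  return False
              time.sleep(poll_s)
      return True                       # green is fully deployed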

Cost

Component                          Monthly cost
ML inference (1000 pods)           $500K
Feature store (Redis, 100 shards)  $300K
Embedding store (50 shards)        $150K
Training (100 GPUs daily)          $200K
Kafka + logging                    $100K
Data storage (Bigtable)            $150K
Total                              ~$1.4M

ROI: Google Ads revenue is ~$280B/year against ~$17M/year of infra cost. ROI > 16,000x.

Misconception: just increase the batch size for throughput

In ad serving the latency budget is 10ms. Batching raises throughput but adds latency (waiting for the batch to fill). At 1M QPS with batch size 32, the batch fill time is 32µs -- fine. But at 100K QPS the fill time grows to 320µs, and under uneven load batching can blow the p99 latency budget. The solution is adaptive batching: the batch size adapts to the current load (sketched below).
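
A sketch of such a batcher: it closes a batch on whichever comes first, the size cap or a fill deadline, so p99 stays bounded even at low QPS:

  import queue
  import time

  MAX_BATCH = 32
  MAX_WAIT_US = 200      # fill deadline that bounds tail latency

  def next_batch(q: queue.Queue) -> list:
      """At 1M QPS the size cap triggers almost instantly; at low QPS the
      deadline fires and a small batch ships instead of waiting to fill."""
      batch = [q.get()]                          # block for the first request
      deadline = time.monotonic() + MAX_WAIT_US / 1e6
      while len(batch) < MAX_BATCH:
          remaining = deadline - time.monotonic()
          if remaining <= 0:
              break
          try:
              batch.append(q.get(timeout=remaining))
          except queue.Empty:
              break
      return batch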

Misconception: the model can be updated once a week

The CTR distribution drifts fast: new ad campaigns, seasonality, one-off events. Without a daily retrain the model loses 0.1-0.3% AUC per week, which translates to $28-84M of lost revenue for Google. Daily retraining plus continuous feature updates is the industry standard; Facebook trains its ad models continuously (streaming).

Interview section

Question: "How do you sustain 1M QPS within a 10ms latency budget?"

❌ Weak answer: "Lots of servers and caching."

✅ Strong answer: "Four key decisions. (1) Two-tower architecture: user and ad embeddings are pre-computed offline (every 15 min), so online inference is only the lightweight cross-network head -- 3ms instead of 30ms. (2) Embedding table sharding: 100GB+ tables are sharded by hash(entity_id) with parallel lookup in < 2ms. (3) Feature store tiering: real-time features in Redis (< 1ms), batch features in Memcached, cold history in Bigtable. (4) 1000 stateless inference pods with HPA on p99 latency. Deployment: blue-green with canary (1% -> 10% -> 100%) and automatic rollback on calibration drift > 2% or a revenue drop > 0.5%. Total infra: $1.4M/month, ROI > 16,000x."