
Scaling a Click Prediction System


Prerequisites: Components, Metrics

Google Ads handles 1M+ QPS (queries per second) with a 10ms latency budget for ML inference. Each query triggers an auction over 100+ candidates, and a CTR prediction is needed for every candidate -- that is 100M+ predictions/sec. Meanwhile, features must stay fresh (< 1 min for user activity), the model is retrained daily, and downtime means direct revenue loss ($30K+ per second for Google).

Traffic Patterns

Volume

Metric                    Value
Queries per second        1M+
Candidates per query      100-500
Predictions per second    100M+
Daily predictions         8.5T+
Feature vector size       1000+ features
Model size                10-100GB (embedding tables)

Latency Budget

Total ad serving: 100ms

Query understanding:     10ms
Candidate retrieval:     20ms  (inverted index + ANN)
Feature assembly:        15ms  (feature store lookups)
CTR prediction:          10ms  (ML inference, critical)
Auction + ranking:        5ms
Ad rendering:            40ms  (client-side)

ML inference breakdown (10ms budget):
  Feature lookup:     3ms  (pre-computed in feature store)
  Embedding lookup:   2ms  (in-memory embedding table)
  Forward pass:       3ms  (optimized model)
  Post-processing:    2ms  (calibration, business rules)
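
One common way to enforce a hard budget like this is to thread a single deadline through the stages, so a slow feature lookup cannot silently eat into the forward pass's time. A minimal sketch; feature_lookup, embedding_lookup, and forward_pass are stubbed-out placeholders, not the real services:

  import time

  class Deadline:
      """One end-to-end budget shared by all serving stages."""
      def __init__(self, budget_ms: float):
          self._expires = time.monotonic() + budget_ms / 1000.0

      def remaining_ms(self) -> float:
          return max(0.0, (self._expires - time.monotonic()) * 1000.0)

  def feature_lookup(keys, timeout_ms):    # stub for the feature store call
      return {k: 0.0 for k in keys}

  def embedding_lookup(ids, timeout_ms):   # stub for the embedding shards
      return {i: [0.0] * 128 for i in ids}

  def forward_pass(feats, embs, timeout_ms):  # stub for the model server
      return [0.01] * len(embs)

  d = Deadline(10.0)  # the 10ms ML inference budget from above
  feats = feature_lookup(["u:42"], timeout_ms=d.remaining_ms())
  embs = embedding_lookup(["ad:7"], timeout_ms=d.remaining_ms())
  scores = forward_pass(feats, embs, timeout_ms=d.remaining_ms())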

Model Architecture for Scale

Two-Tower + Cross Network

Serving pattern:

Offline (batch):
  - User tower:    compute user embeddings every 15 min
  - Ad tower:      compute ad embeddings on ad update
  - Store in:      feature store (Redis/Memcached)

Online (per request, 10ms budget):
  - Fetch:         user_emb + ad_emb (pre-computed)
  - Cross:         interaction features (real-time)
  - Predict:       lightweight cross-network head

This reduces online compute from O(full_model) to O(cross_head)
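
A minimal sketch of that online step, assuming 128-dim embeddings (matching the table-sizing math below) and a plain linear head standing in for the real cross network:

  import numpy as np

  EMB_DIM = 128  # matches the embedding-table math in the next section

  def predict_ctr(user_emb, ad_emb, cross_features, w, b):
      """Online path only: embeddings arrive pre-computed from the feature
      store; this head is the only per-request model compute."""
      interaction = user_emb * ad_emb                    # element-wise cross
      x = np.concatenate([interaction, cross_features])
      logit = float(x @ w + b)
      return 1.0 / (1.0 + np.exp(-logit))                # sigmoid -> pCTR

  # Usage with random stand-ins for the pre-computed pieces:
  rng = np.random.default_rng(0)
  u, a = rng.normal(size=EMB_DIM), rng.normal(size=EMB_DIM)
  cross = rng.normal(size=8)                             # real-time features
  w, b = rng.normal(size=EMB_DIM + 8) * 0.01, -4.0
  print(predict_ctr(u, a, cross, w, b))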

Embedding Table Sharding

Problem: Embedding tables = 80% of model size (10-100GB)
  - User IDs: 2B users x 128 dim x 4 bytes = 1TB
  - Ad IDs: 100M ads x 128 dim x 4 bytes = 50GB
  - Feature crosses: Billions of entries

Solution: Distributed embedding lookup
  - Shard by hash(entity_id) % num_shards
  - Each shard: 10-20GB, fits in single machine RAM
  - Parallel lookup: all shards in parallel, < 2ms
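
A sketch of the parallel sharded lookup under these assumptions; FakeShard stands in for a real RPC client, and crc32 is just one possible stable hash:

  import concurrent.futures
  import zlib

  NUM_SHARDS = 50

  def shard_of(entity_id: str) -> int:
      # Stable hash -> shard index; must match the scheme used at write time.
      return zlib.crc32(entity_id.encode()) % NUM_SHARDS

  class FakeShard:                        # stand-in for a real RPC client
      def get_many(self, ids):
          return {i: [0.0] * 128 for i in ids}

  def batch_lookup(entity_ids, shards):
      """Group IDs by shard, then hit all shards in parallel, so latency is
      max(one shard) rather than the sum -- this is what keeps it < 2ms."""
      by_shard = {}
      for eid in entity_ids:
          by_shard.setdefault(shard_of(eid), []).append(eid)
      result = {}
      with concurrent.futures.ThreadPoolExecutor() as pool:
          futures = [pool.submit(shards[s].get_many, ids)
                     for s, ids in by_shard.items()]
          for fut in concurrent.futures.as_completed(futures):
              result.update(fut.result())
      return result

  shards = [FakeShard() for _ in range(NUM_SHARDS)]
  embs = batch_lookup([f"user:{i}" for i in range(1000)], shards)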

Horizontal Scaling

Service Architecture

Component              Instances      Capacity               Scaling
Query server           500            2K QPS each            HPA by latency
Feature store (Redis)  100 shards     100M ops/sec total     Shard by key
Embedding store        50 shards      50GB per shard         Shard by entity ID
ML inference           1000 pods      100K pred/sec each     HPA by CPU + latency
Ranking server         200            5K auctions/sec each   HPA by QPS
Kafka (logging)        50 partitions  10M events/sec         Partition by user

Feature Store Design

Feature freshness requirements:

Real-time (< 1 min):
  - Last N clicks, queries (sliding window in Redis)
  - Session features (device, location)

Near-real-time (< 15 min):
  - User embeddings (recomputed by Flink)
  - Click-through rates (smoothed)

Batch (< 24 hours):
  - User profile features
  - Ad quality scores
  - Historical aggregates (7d, 30d CTR)

Storage:
  Redis:     Real-time + near-real-time features (100 shards, 500GB)
  Memcached: Batch features (hot partition, 200GB)
  Bigtable:  Full feature history (cold storage, 10TB+)
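
A sketch of the tiered read path implied by this layout, with plain dicts standing in for the Redis/Memcached/Bigtable clients:

  def fetch_features(user_id, rt_tier, batch_tier, cold_tier):
      """Tiered read: freshest tier first, colder tiers fill the rest.
      Tiers here are dicts standing in for the real stores."""
      features = dict(rt_tier.get(f"rt:{user_id}", {}))   # < 1 min fresh
      batch = batch_tier.get(f"batch:{user_id}")
      if batch is None:
          batch = cold_tier.get(user_id, {})              # slow cold path
          if batch:
              batch_tier[f"batch:{user_id}"] = batch      # backfill hot tier
      features.update(batch)
      return features

  # Stand-ins for the three stores:
  redis_like = {"rt:u42": {"clicks_5m": 3}}
  memcached_like = {}
  bigtable_like = {"u42": {"ctr_30d": 0.021, "profile_age_days": 410}}
  print(fetch_features("u42", redis_like, memcached_like, bigtable_like))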

Training Pipeline

Daily Retrain

Timeline:
  00:00  Start data collection (yesterday's clicks)
  02:00  Feature engineering (Spark, 2000 cores)
  04:00  Model training (100 GPUs, 2 hours)
  06:00  Validation + calibration check
  06:30  Shadow mode (1% traffic)
  07:00  Gradual rollout (10% -> 50% -> 100%)
  08:00  Full deployment

Safeguards:
  - Calibration ratio must be in [0.98, 1.02]
  - AUC regression < 0.001
  - Revenue impact (shadow): no decrease > 0.1%
  - Automatic rollback if guardrails violated
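
These safeguards reduce to a boolean gate between validation and rollout; a sketch with illustrative metric names:

  def passes_guardrails(m: dict) -> bool:
      """Deployment gate run after validation + shadow mode; thresholds
      mirror the safeguards above."""
      return (
          0.98 <= m["calibration_ratio"] <= 1.02
          and m["auc_baseline"] - m["auc_candidate"] < 0.001
          and m["shadow_revenue_delta"] > -0.001        # no drop > 0.1%
      )

  # Example: this candidate is rejected on calibration alone.
  print(passes_guardrails({"calibration_ratio": 1.04,
                           "auc_baseline": 0.803, "auc_candidate": 0.8028,
                           "shadow_revenue_delta": 0.0002}))  # False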

High Availability

Degradation Strategy

Failure                 Fallback                          Revenue impact
ML model timeout        Cached predictions (last 1h)      -5%
Feature store down      Default features + simple model   -15%
Embedding store down    Global avg embeddings             -25%
Full ML pipeline down   Rule-based ranking (bid only)     -40%
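
The ladder reads as a try/except cascade; a sketch where ctx bundles illustrative clients and models (none of these names are from a real API):

  def score_with_fallback(req, ctx):
      """Degradation ladder from the table above: each rung trades
      accuracy (and revenue) for availability."""
      try:
          feats = ctx.feature_store.fetch(req)
      except ConnectionError:
          # Feature store down: default features + simple model (-15%)
          return ctx.simple_model.predict(ctx.default_features(req))
      try:
          embs = ctx.embedding_store.lookup(req.entity_ids)
      except ConnectionError:
          embs = ctx.global_avg_embeddings       # embedding store down (-25%)
      try:
          return ctx.full_model.predict(feats, embs, timeout_ms=10)
      except TimeoutError:
          cached = ctx.prediction_cache.get(req.key)   # last 1h (-5%)
          if cached is not None:
              return cached
          return req.bid                         # rule-based: bid only (-40%)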

Zero-Downtime Deployment

Blue-green deployment for model updates:

1. Load new model into "green" pods (warm-up)
2. Shift 1% traffic to green (canary)
3. Monitor: latency, calibration, revenue
4. If OK: shift 10% -> 50% -> 100% (over 1 hour)
5. Keep blue pods warm for 2 hours (rollback ready)

Rollback trigger (automatic):
  - p99 latency > 15ms (budget = 10ms)
  - Calibration ratio outside [0.95, 1.05]
  - Revenue drop > 0.5% vs control
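
A sketch of the rollout loop with these triggers wired in; get_metrics, shift_traffic, and rollback are illustrative callbacks, and ~15 min per step gives the one-hour rollout described above:

  import time

  def canary_rollout(get_metrics, shift_traffic, rollback,
                     steps=(1, 10, 50, 100), watch_s=900, poll_s=30):
      """Blue-green rollout; rollback triggers match the list above."""
      for pct in steps:
          shift_traffic(green_pct=pct)
          deadline = time.monotonic() + watch_s
          while time.monotonic() < deadline:
              m = get_metrics()
              if (m["p99_latency_ms"] > 15.0
                      or not 0.95 <= m["calibration_ratio"] <= 1.05
                      or m["revenue_delta_vs_control"] < -0.005):
                  rollback()            # shift 100% back to blue pods
                  return False
              time.sleep(poll_s)
      return True                       # green is fully deployed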

Cost

Component                          Monthly cost
ML inference (1000 pods)           $500K
Feature store (Redis, 100 shards)  $300K
Embedding store (50 shards)        $150K
Training (100 GPUs daily)          $200K
Kafka + logging                    $100K
Data storage (Bigtable)            $150K
Total                              ~$1.4M

ROI: Google Ads revenue is ~$280B/year against ~$17M/year of infra cost. ROI > 16,000x.

Misconception: just increase the batch size for throughput

In ad serving the latency budget is 10ms. Batching raises throughput but adds latency (waiting for the batch to fill). At 1M QPS with batch size 32, the batch fill time is 32µs -- fine. But at 100K QPS the fill time grows to 320µs, and under uneven load batching can blow the p99 latency budget. The solution is adaptive batching: the batch size adapts to the current load (sketched below).
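
A sketch of such a batcher: it closes a batch on whichever comes first, the size cap or a fill deadline, so p99 stays bounded even at low QPS:

  import queue
  import time

  MAX_BATCH = 32
  MAX_WAIT_US = 200      # fill deadline that bounds tail latency

  def next_batch(q: queue.Queue) -> list:
      """At 1M QPS the size cap triggers almost instantly; at low QPS the
      deadline fires and a small batch ships instead of waiting to fill."""
      batch = [q.get()]                          # block for the first request
      deadline = time.monotonic() + MAX_WAIT_US / 1e6
      while len(batch) < MAX_BATCH:
          remaining = deadline - time.monotonic()
          if remaining <= 0:
              break
          try:
              batch.append(q.get(timeout=remaining))
          except queue.Empty:
              break
      return batch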

Misconception: the model can be updated once a week

The CTR distribution drifts fast: new ad campaigns, seasonality, one-off events. Without a daily retrain the model loses 0.1-0.3% AUC per week, which translates to $28-84M of lost revenue for Google. Daily retraining plus continuous feature updates is the industry standard; Facebook trains its ad models continuously (streaming).

Interview section

Question: "How do you sustain 1M QPS within a 10ms latency budget?"

❌ Weak answer: "Lots of servers and caching."

✅ Strong answer: "Four key decisions. (1) Two-tower architecture: user and ad embeddings are pre-computed offline (every 15 min), so online inference is only the lightweight cross-network head -- 3ms instead of 30ms. (2) Embedding table sharding: 100GB+ tables are sharded by hash(entity_id) with parallel lookup in < 2ms. (3) Feature store tiering: real-time features in Redis (< 1ms), batch features in Memcached, cold history in Bigtable. (4) 1000 stateless inference pods with HPA on p99 latency. Deployment: blue-green with canary (1% -> 10% -> 100%) and automatic rollback on calibration drift > 2% or a revenue drop > 0.5%. Total infra: $1.4M/month, ROI > 16,000x."