Scaling a Spam Detection System¶
~2 minute read
Prerequisites: Components, Metrics
Gmail processes 300B+ emails/day (~3.5M/sec on average), of which ~45% is spam. Every email passes through a 5-stage pipeline in under 50ms. The infrastructure challenge: adversarial spammers adapt within hours, so models must be updated continuously (online learning or hourly retrains). Meanwhile, infrastructure costs run $500K+/month, and a single hour of missed spam means millions of user complaints.
Traffic Patterns¶
Volume¶
| Метрика | Значение |
|---|---|
| Daily emails | 300B+ |
| Peak TPS | 5M emails/sec |
| Spam ratio | ~45% |
| Avg email size | 75KB (text), 2MB (with attachments) |
Latency Budget¶
Total: 50ms (p99)

Connection-level (pre-content):
- IP reputation: 5ms (Redis lookup)
- Rate limiting: 1ms (local counter)

Content-level:
- Header parsing: 3ms
- Body extraction: 5ms
- Feature compute: 12ms (NLP + heuristics)
- ML inference: 10ms (ensemble)
- URL scanning: 8ms (reputation DB)
- Decision: 2ms

Async (post-decision):
- Feedback logging: async
- Model update: async (hourly batch)
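To make the budget concrete: the synchronous stages sum to 46ms, leaving ~4ms of headroom under the 50ms p99 target. Below is a minimal sketch of deadline-aware stage execution, assuming hypothetical stage functions; only the early-exit logic is the point.

```python
import time

# Per-stage p99 budgets (ms), mirroring the table above.
# Synchronous stages sum to 46ms, leaving ~4ms headroom under 50ms.
STAGE_BUDGETS_MS = {
    "ip_reputation": 5, "rate_limiting": 1,      # connection-level
    "header_parsing": 3, "body_extraction": 5,   # content-level
    "feature_compute": 12, "ml_inference": 10,
    "url_scanning": 8, "decision": 2,
}
TOTAL_BUDGET_MS = 50

def run_pipeline(email, stages):
    """Run (name, fn) stages in order; if the cumulative latency
    blows the 50ms budget, return the best verdict so far instead
    of delaying delivery. Feedback logging stays async."""
    start = time.monotonic()
    verdict = None
    for name, fn in stages:
        verdict = fn(email)  # hypothetical stage function
        elapsed_ms = (time.monotonic() - start) * 1000
        if elapsed_ms > TOTAL_BUDGET_MS:
            return verdict, "degraded"  # budget exceeded: early exit
    return verdict, "full"
```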
Tiered Architecture¶
Tier 1: Connection-level (IP reputation)
- Blocks 30% of spam at network layer
- Latency: 5ms
- Cost: $0.000001/email
Tier 2: Header + lightweight features
- Catches 40% of remaining spam
- Latency: 10ms
- Cost: $0.00001/email
Tier 3: Full content analysis (NLP + ML)
- Catches 95%+ of remaining spam
- Latency: 30ms
- Cost: $0.0001/email
Tier 4: Deep analysis (sandbox, link following)
- For suspicious attachments/URLs only (~5% of traffic)
- Latency: 500ms-5s (async)
- Cost: $0.01/email
Impact: tiered filtering reduces Tier 3 load by 70%, saving ~$350K/mo.
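A minimal sketch of the early-exit dispatch behind this tiering, with hypothetical per-tier classifiers; each tier either returns a confident verdict or escalates, so most traffic never reaches the expensive tiers.

```python
from enum import Enum

class Verdict(Enum):
    SPAM = "spam"
    HAM = "ham"
    UNSURE = "unsure"   # escalate to the next (more expensive) tier

def classify(email, tiers):
    """tiers: (name, fn) pairs ordered cheapest-first, e.g.
    ip_reputation -> headers -> full NLP+ML. Stops at the first
    confident verdict; Tier 4 deep analysis runs async elsewhere."""
    for name, fn in tiers:
        verdict = fn(email)          # hypothetical tier classifier
        if verdict is not Verdict.UNSURE:
            return verdict, name     # early exit: a cheap tier decided
    return Verdict.HAM, "default"    # all tiers unsure: deliver, audit async
```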
Horizontal Scaling¶
Service Architecture¶
| Component | Instances | Capacity | Scaling |
|---|---|---|---|
| SMTP receivers | 500 | 10K connections each | HPA by connections |
| Feature extractors | 200 | 20K emails/sec each | HPA by CPU |
| ML inference | 100 | 50K predictions/sec each | HPA by latency p99 |
| Reputation DB (Redis) | 30 shards | 500M entries, 5M ops/sec | Shard by IP/domain |
| URL scanner | 50 | Async, 100K URLs/sec total | Queue-based |
| Kafka (feedback) | 20 partitions | 1M events/sec | Partition by user |
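One way to shard reputation keys across the 30 Redis shards is a stable hash of the IP or domain. A sketch, using md5 rather than Python's built-in hash(), which is randomized per process:

```python
import hashlib

NUM_SHARDS = 30  # matches the reputation DB sizing above

def shard_for(key: str) -> int:
    """Stable shard index for an IP or domain key. Note that plain
    modulo sharding remaps most keys when NUM_SHARDS changes;
    consistent hashing avoids that at the cost of complexity."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# A sender's IP and its domain may land on different shards:
print(shard_for("203.0.113.7"), shard_for("example.com"))
```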
Auto-scaling Policy¶
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: spam-ml-inference
spec:
  scaleTargetRef:              # required field; Deployment name assumed
    apiVersion: apps/v1
    kind: Deployment
    name: spam-ml-inference
  minReplicas: 100
  maxReplicas: 500
  metrics:
    # Scale on CPU saturation...
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65
    # ...and on p99 inference latency (exposed via a custom-metrics adapter)
    - type: Pods
      pods:
        metric:
          name: inference_latency_p99_ms
        target:
          type: AverageValue
          averageValue: "30"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 50                     # add up to 50% more pods per minute
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300   # conservative: avoid flapping
```
Adversarial Adaptation¶
Continuous Model Update Pipeline¶
Feedback loop (critical for spam):
- User reports spam -> labeled data -> hourly retrain
- User marks "not spam" -> FP correction -> immediate rule update
- New spam pattern -> anomaly detection -> emergency model push
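A sketch of how such feedback events might be routed to the three update paths; the event schema and path names are illustrative assumptions.

```python
def route_feedback(event: dict) -> str:
    """Map a feedback event to an update path. 'Not spam' corrections
    bypass the hourly cycle: a false positive is a lost legitimate
    email, so it gets fixed immediately."""
    kind = event.get("type")
    if kind == "user_report_spam":
        return "retrain_hourly"      # labeled positive for the next retrain
    if kind == "user_not_spam":
        return "rules_immediate"     # FP correction, pushed right away
    if kind == "anomaly_alert":
        return "emergency_push"      # new campaign: emergency model/rules
    return "ignore"
```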
Timeline:
- T+0: new spam campaign starts
- T+15min: anomaly detection triggers (volume spike)
- T+1h: hourly retrain incorporates new samples
- T+4h: model deployed, 95%+ detection of the new pattern
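The T+15min trigger can be as simple as a z-score over volume buckets. A sketch, with the window size and threshold as assumptions:

```python
from collections import deque

class SpikeDetector:
    """Flag a volume spike when the current bucket exceeds the
    recent mean by k standard deviations (assumed 15-min buckets)."""
    def __init__(self, window: int = 96, k: float = 4.0):
        self.history = deque(maxlen=window)  # ~24h of 15-min buckets
        self.k = k

    def observe(self, count: int) -> bool:
        spike = False
        if len(self.history) >= 8:  # wait for a little history first
            mean = sum(self.history) / len(self.history)
            var = sum((x - mean) ** 2 for x in self.history) / len(self.history)
            std = max(var ** 0.5, 1.0)   # floor to avoid zero-variance noise
            spike = count > mean + self.k * std
        self.history.append(count)
        return spike  # True -> fire the emergency retrain path
```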
Feature Store Architecture¶
| Feature type | Storage | Update frequency | Latency |
|---|---|---|---|
| IP reputation | Redis | Real-time | 1ms |
| Domain reputation | Redis | Hourly | 1ms |
| Sender history | Redis | Real-time | 2ms |
| Content embeddings | Computed inline (ML tier) | Per-email | 10ms |
| URL reputation | Redis + async scan | Hourly + on-demand | 3ms |
| Global spam patterns | Redis (bloom filter) | Every 15min | 0.5ms |
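A sketch of assembling these features at request time with redis-py; key names are assumptions, and the bloom-filter check uses the RedisBloom BF.EXISTS command. The cheapest lookup goes first:

```python
import redis  # assumes redis-py and the RedisBloom module

def get_features(r: redis.Redis, email: dict) -> dict:
    feats = {}
    # ~0.5ms: bloom filter of known global spam patterns.
    feats["known_pattern"] = bool(
        r.execute_command("BF.EXISTS", "spam:patterns", email["content_hash"]))
    # ~1-3ms each: reputation and history lookups (key names assumed).
    feats["ip_rep"] = r.get(f"rep:ip:{email['src_ip']}")
    feats["domain_rep"] = r.get(f"rep:domain:{email['from_domain']}")
    feats["sender_hist"] = r.hgetall(f"hist:sender:{email['from']}")
    # Content embeddings (~10ms) are computed per-email in the ML tier.
    return feats
```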
High Availability & Failover¶
Degradation Strategy¶
| Component failure | Fallback | Spam catch rate |
|---|---|---|
| None (baseline) | Full pipeline | 99.5% |
| ML model down | Rules + reputation | 85% |
| Feature store down | Cached features + rules | 75% |
| URL scanner down | Skip URL analysis | 90% |
| Everything down | IP blocklist only | 30% |
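The table maps directly onto a fallback chain: try the richest scorer available and degrade downward on failure. A sketch with hypothetical scorer functions:

```python
def score_with_fallback(email, scorers):
    """scorers: (name, fn) pairs ordered best-first, ending with the
    IP-blocklist-only scorer, which must not depend on other services."""
    for name, fn in scorers:
        try:
            return fn(email), name   # first healthy component wins
        except Exception:
            continue                 # component down: degrade one level
    raise RuntimeError("all fallbacks failed, including the IP blocklist")
```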
Multi-Region¶
- Primary: US-East (handles 40% of traffic)
- Secondary: EU-West (30%), AP-East (30%)
- Each region runs independent ML inference with a shared reputation DB
- Reputation sync: cross-region replication with < 5 sec lag
- Model sync: the same model version in every region, deployed via rolling update
Cost¶
| Component | Monthly cost |
|---|---|
| SMTP + feature extraction (500 pods) | $150K |
| ML inference (100 GPU pods) | $120K |
| Reputation DB (Redis 30 shards) | $80K |
| Kafka + storage | $50K |
| URL scanning infra | $60K |
| Data transfer | $40K |
| Total | ~$500K |
Misconception: training the model once is enough
Spammers adapt within hours. A model without retraining loses 5-10% of recall per week. Production systems use: (1) hourly retrains on fresh data (user reports + honeypots), (2) online learning for fast adaptation, (3) rule-based emergency filters for zero-day attacks. Gmail updates its models several times a day.
Misconception: a single pipeline can process all emails
Tiered architecture is critical: 30% of spam is caught at the IP level in 5ms ($0.000001/email), another 40% at the header level in 10ms. Only 30% of traffic reaches the expensive NLP+ML stage (30ms, $0.0001/email). Without tiers, infrastructure costs grow 3-5x.
Interview Section¶
Question: "How would you scale a spam filter to 300B emails/day?"
Weak answer: "Add more servers for ML."
Strong answer: "Four key decisions. (1) Tiered filtering: 30% is cut at the IP level (Redis, 5ms), another 40% on headers (10ms); only 30% of traffic goes through full NLP+ML (30ms). This reduces ML load by 70%. (2) Stateless ML inference on K8s: 100 pods, HPA on p99 latency, 50%/min scale-up for spike handling. (3) Reputation DB on a Redis Cluster (30 shards, 500M entries): IP, domain, and sender scores updated in real time from the feedback loop. (4) Adversarial adaptation: hourly retrains plus emergency rules for zero-day campaigns, and anomaly detection on volume spikes. Total cost ~$500K/mo, ROI > 100x from prevented damage."