
Scaling a Spam Detection System

~2 minute read

Prerequisites: Components, Metrics

Gmail processes 300B+ emails/day (~3.5M/sec on average, 5M/sec at peak), of which ~45% is spam. Each email passes through a multi-tier pipeline in under 50ms. The infrastructure challenge: adversarial spammers adapt within hours, so models must be updated continuously (online learning or hourly retrains). Meanwhile, infrastructure costs run $500K+/month, and a single hour of missed spam means millions of user complaints.

Traffic Patterns

Volume

Metric          Value
Daily emails    300B+
Peak TPS        5M emails/sec
Spam ratio      ~45%
Avg email size  75KB (text), 2MB (with attachments)

Latency Budget

Total: 50ms (p99)

Connection-level (pre-content):
  IP reputation     5ms  (Redis lookup)
  Rate limiting     1ms  (local counter)

Content-level:
  Header parsing    3ms
  Body extraction   5ms
  Feature compute  12ms  (NLP + heuristics)
  ML inference     10ms  (ensemble)
  URL scanning      8ms  (reputation DB)
  Decision          2ms

Async (post-decision):
  Feedback logging  async
  Model update      async (hourly batch)
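The budget above can be enforced at runtime: if the cumulative deadline is blown mid-pipeline, the remaining expensive stages are skipped and a cheap rule-based verdict is served instead. A minimal sketch (stage names and the fallback behavior are illustrative assumptions, not Gmail's actual implementation):

```python
import time

# Hypothetical per-stage budgets (ms), taken from the table above.
STAGE_BUDGET_MS = {
    "ip_reputation": 5, "rate_limit": 1, "header_parse": 3,
    "body_extract": 5, "features": 12, "ml_inference": 10,
    "url_scan": 8, "decision": 2,
}
TOTAL_BUDGET_MS = 50  # p99 target

def run_pipeline(email, stages):
    """Run stages in order; if the total deadline is exceeded,
    stop and serve a degraded (rules-only) verdict."""
    deadline = time.monotonic() + TOTAL_BUDGET_MS / 1000
    verdict = {"spam": False, "degraded": False}
    for name, stage_fn in stages:
        if time.monotonic() > deadline:
            verdict["degraded"] = True  # budget blown: skip remaining stages
            break
        verdict.update(stage_fn(email))
    return verdict
```

Note that the per-stage budgets sum to 46ms, leaving ~4ms of slack under the 50ms p99 target.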

Tiered Architecture

Tier 1: Connection-level (IP reputation)
  - Filters ~30% of traffic at the network layer
  - Latency: 5ms
  - Cost: $0.000001/email

Tier 2: Header + lightweight features
  - Filters another ~40% of traffic (cumulative 70% before Tier 3)
  - Latency: 10ms
  - Cost: $0.00001/email

Tier 3: Full content analysis (NLP + ML)
  - Catches 95%+ of remaining spam
  - Latency: 30ms
  - Cost: $0.0001/email

Tier 4: Deep analysis (sandbox, link following)
  - For suspicious attachments/URLs only (~5% of traffic)
  - Latency: 500ms-5s (async)
  - Cost: $0.01/email

Impact: tiered filtering cuts Tier 3 load by 70%, saving ~$350K/month.
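The tiered flow is an early-exit chain: cheap checks run first, and an email only reaches the expensive stage if no earlier tier decided. A sketch with stubbed-out tier logic (the actual checks inside each tier are hypothetical placeholders):

```python
# Each tier returns "spam", "ham", or None (undecided -> escalate).
def tier1_ip(email):        # ~5ms: IP reputation lookup (stubbed)
    return "spam" if email.get("ip") in {"203.0.113.9"} else None

def tier2_headers(email):   # ~10ms: lightweight header heuristics (stubbed)
    return "spam" if email.get("subject", "").isupper() else None

def tier3_content(email):   # ~30ms: full NLP + ML analysis (stubbed)
    return "spam" if "win a prize" in email.get("body", "") else "ham"

def classify(email):
    """Early exit: run tiers in cost order, stop at the first verdict."""
    for tier in (tier1_ip, tier2_headers, tier3_content):
        verdict = tier(email)
        if verdict is not None:
            return verdict
    return "ham"
```

Because most traffic is decided by Tier 1 or 2, the per-email cost is dominated by the cheap tiers rather than the expensive content analysis.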

Horizontal Scaling

Service Architecture

Component              Instances      Capacity                     Scaling
SMTP receivers         500            10K connections each         HPA by connections
Feature extractors     200            20K emails/sec each          HPA by CPU
ML inference           100            50K predictions/sec each     HPA by latency p99
Reputation DB (Redis)  30 shards      500M entries, 5M ops/sec     Shard by IP/domain
URL scanner            50             100K URLs/sec total (async)  Queue-based
Kafka (feedback)       20 partitions  1M events/sec                Partition by user
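Sharding the reputation DB by IP/domain means every lookup for a given key must land on the same shard, so counters never split across nodes. A simplified sketch of stable shard assignment (note: real Redis Cluster uses CRC16 over 16384 hash slots rather than this hash-mod scheme):

```python
import hashlib

NUM_SHARDS = 30  # from the table above

def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Stable shard assignment: the same IP or domain always maps
    to the same Redis shard, keeping its reputation counters local."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_shards
```

The important property is determinism across all clients, not the particular hash function.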

Auto-scaling Policy

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: spam-ml-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: spam-ml-inference
  minReplicas: 100
  maxReplicas: 500
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65
    - type: Pods
      pods:
        metric:
          name: inference_latency_p99_ms
        target:
          type: AverageValue
          averageValue: "30"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
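The HPA bounds can be sanity-checked against the capacity table. A quick back-of-envelope check (the 30% ML-traffic share comes from the tiered-filtering numbers above):

```python
# Figures from the capacity table and the HPA spec above.
PEAK_TPS = 5_000_000   # peak emails/sec across the fleet
PER_POD = 50_000       # predictions/sec one ML pod sustains
MIN_REPLICAS = 100
MAX_REPLICAS = 500

# Even if tiered filtering failed open and ALL traffic hit ML,
# minReplicas alone covers the peak: 5M / 50K = 100 pods.
assert PEAK_TPS // PER_POD <= MIN_REPLICAS

# With tiering intact (~30% of traffic reaches ML), steady-state
# need is far lower, leaving generous headroom under minReplicas.
steady_pods = (PEAK_TPS * 30 // 100) // PER_POD  # = 30 pods
```

In other words, minReplicas=100 is sized for the worst case where the cheap tiers degrade, not for the steady state.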

Adversarial Adaptation

Continuous Model Update Pipeline

Feedback loop (critical for spam):

User reports spam      -> labeled data    -> hourly retrain
User marks "not spam"  -> FP correction   -> immediate rule update
New spam pattern       -> anomaly detect  -> emergency model push

Timeline:
  T+0:     New spam campaign starts
  T+15min: Anomaly detection triggers (volume spike)
  T+1h:    Hourly retrain incorporates new samples
  T+4h:    Model deployed, 95%+ detection of new pattern
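The T+15min anomaly trigger boils down to comparing the current traffic window against a recent baseline. A minimal z-score sketch (the threshold and window choice are illustrative assumptions):

```python
from statistics import mean, stdev

def volume_spike(history, current, threshold=3.0):
    """Flag the current window if it sits `threshold` standard
    deviations above the recent baseline -- a rough stand-in for
    the volume-spike anomaly detector in the timeline above."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current > mu
    return (current - mu) / sigma > threshold
```

Production detectors would account for diurnal seasonality and per-sender baselines, but the principle is the same: a statistically improbable jump in volume pages the pipeline before any model has been retrained.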

Feature Store Architecture

Feature type          Storage               Update frequency     Latency
IP reputation         Redis                 Real-time            1ms
Domain reputation     Redis                 Hourly               1ms
Sender history        Redis                 Real-time            2ms
Content embeddings    ML inference          Per-email            10ms
URL reputation        Redis + async scan    Hourly + on-demand   3ms
Global spam patterns  Redis (bloom filter)  Every 15min          0.5ms
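The bloom filter behind the "global spam patterns" row trades a small false-positive rate for sub-millisecond membership checks with no false negatives. A self-contained sketch of the data structure (sizes and hash count are illustrative, not the production values):

```python
import hashlib

class BloomFilter:
    """Minimal bloom filter: O(k) membership test, no false
    negatives, small tunable false-positive rate."""

    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        # Derive k independent bit positions by salting the hash.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item: str):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))
```

A "maybe present" answer escalates to an exact check; a "definitely absent" answer skips it, which is what keeps the 0.5ms latency.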

High Availability & Failover

Degradation Strategy

Component failure   Fallback                 Spam catch rate
None                Full pipeline            99.5%
ML model down       Rules + reputation       85%
Feature store down  Cached features + rules  75%
URL scanner down    Skip URL analysis        90%
Everything down     IP blocklist only        30%
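The fallback table maps directly to a strategy selector driven by component health checks. A sketch (the function and strategy names are hypothetical; in production the health flags would come from live probes):

```python
def choose_strategy(ml_up: bool, feature_store_up: bool,
                    url_scanner_up: bool) -> str:
    """Pick the best available degraded mode, mirroring the
    fallback table above (worst outage checked first)."""
    if ml_up and feature_store_up and url_scanner_up:
        return "full_pipeline"               # 99.5% catch rate
    if not feature_store_up:
        return "cached_features_plus_rules"  # ~75%
    if not ml_up:
        return "rules_plus_reputation"       # ~85%
    return "skip_url_analysis"               # URL scanner down: ~90%
```

The key design point is that every failure mode still produces a verdict; the system degrades in catch rate, never in availability.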

Multi-Region

Primary: US-East (handles 40% traffic)
Secondary: EU-West (30%), AP-East (30%)

Each region: independent ML inference + shared reputation DB
Reputation sync: cross-region replication < 5 sec lag
Model sync: same model version across regions, deployed via rolling update

Cost

Component                             Monthly cost
SMTP + feature extraction (500 pods)  $150K
ML inference (100 GPU pods)           $120K
Reputation DB (Redis, 30 shards)      $80K
Kafka + storage                       $50K
URL scanning infra                    $60K
Data transfer                         $40K
Total                                 ~$500K
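Dividing the total by the traffic figures above gives the unit economics:

```python
# Back-of-envelope unit cost from the cost and traffic tables above.
MONTHLY_COST_USD = 500_000
EMAILS_PER_DAY = 300_000_000_000  # 300B

emails_per_month = EMAILS_PER_DAY * 30          # 9 trillion
cost_per_million = MONTHLY_COST_USD / emails_per_month * 1_000_000
# roughly $0.056 per million emails processed
```

That blended figure is far below the $0.0001/email Tier 3 cost, which is exactly what the tiered architecture buys: most emails never touch the expensive path.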

Misconception: training the model once is enough

Spammers adapt within hours. A model without retraining loses 5-10% recall per week. Production systems use: (1) hourly retrains on fresh data (user reports + honeypots), (2) online learning for fast adaptation, (3) rule-based emergency filters for zero-day attacks. Gmail updates its models several times a day.

Misconception: a single pipeline can handle all emails

A tiered architecture is critical: 30% of traffic is cut at the IP level in 5ms ($0.000001/email), another 40% at the header level in 10ms. Only ~30% of traffic reaches the expensive NLP+ML stage (30ms, $0.0001/email). Without tiers, infrastructure costs would be 3-5x higher.

Interview Section

Question: "How would you scale a spam filter to 300B emails/day?"

❌ Weak answer: "Add more servers for ML."

✅ Strong answer: "Four key decisions. (1) Tiered filtering: 30% is cut at the IP level (Redis, 5ms), another 40% on headers (10ms), and only 30% of traffic goes to full NLP+ML (30ms). This reduces ML load by 70%. (2) Stateless ML inference on K8s: 100 pods, HPA on p99 latency, scale-up of 50%/min to handle spikes. (3) Reputation DB on a Redis Cluster (30 shards, 500M entries): IP, domain, and sender scores updated in real time from the feedback loop. (4) Adversarial adaptation: hourly retrains plus emergency rules for zero-day campaigns, and anomaly detection on volume spikes. Total cost is ~$500K/month, with ROI > 100x from prevented damage."