
Scaling a Spam Detection System

~2 minute read

Prerequisites: Components, Metrics

Gmail processes 300B+ emails/day (~3.5M/sec on average, 5M/sec at peak), of which ~45% is spam. Each email passes through a multi-tier pipeline in under 50ms. The infrastructure challenge: adversarial spammers adapt within hours, so models must be updated continuously (online learning or hourly retrains). Meanwhile, infrastructure costs run $500K+/month, and a single hour of missed spam means millions of user complaints.

Traffic Patterns

Volume

Metric          Value
Daily emails    300B+
Peak TPS        5M emails/sec
Spam ratio      ~45%
Avg email size  75KB (text), 2MB (with attachments)

Latency Budget

Total: 50ms (p99)

Connection-level (pre-content):
  IP reputation     5ms  (Redis lookup)
  Rate limiting     1ms  (local counter)

Content-level:
  Header parsing    3ms
  Body extraction   5ms
  Feature compute  12ms  (NLP + heuristics)
  ML inference     10ms  (ensemble)
  URL scanning      8ms  (reputation DB)
  Decision          2ms

Async (post-decision):
  Feedback logging  async
  Model update      async (hourly batch)
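The budget above can be enforced at runtime: if the cumulative deadline is blown mid-pipeline, the remaining expensive stages are skipped and a cheap rule-based verdict is served instead. A minimal sketch (stage names and the fallback behavior are illustrative assumptions, not Gmail's actual implementation):

```python
import time

# Hypothetical per-stage budgets (ms), taken from the table above.
STAGE_BUDGET_MS = {
    "ip_reputation": 5, "rate_limit": 1, "header_parse": 3,
    "body_extract": 5, "features": 12, "ml_inference": 10,
    "url_scan": 8, "decision": 2,
}
TOTAL_BUDGET_MS = 50  # p99 target

def run_pipeline(email, stages):
    """Run stages in order; if the total deadline is exceeded,
    stop and serve a degraded (rules-only) verdict."""
    deadline = time.monotonic() + TOTAL_BUDGET_MS / 1000
    verdict = {"spam": False, "degraded": False}
    for name, stage_fn in stages:
        if time.monotonic() > deadline:
            verdict["degraded"] = True  # budget blown: skip remaining stages
            break
        verdict.update(stage_fn(email))
    return verdict
```

Note that the per-stage budgets sum to 46ms, leaving ~4ms of slack under the 50ms p99 target.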

Tiered Architecture

Tier 1: Connection-level (IP reputation)
  - Filters ~30% of traffic at the network layer
  - Latency: 5ms
  - Cost: $0.000001/email

Tier 2: Header + lightweight features
  - Filters another ~40% of traffic (cumulative 70% before Tier 3)
  - Latency: 10ms
  - Cost: $0.00001/email

Tier 3: Full content analysis (NLP + ML)
  - Catches 95%+ of remaining spam
  - Latency: 30ms
  - Cost: $0.0001/email

Tier 4: Deep analysis (sandbox, link following)
  - For suspicious attachments/URLs only (~5% of traffic)
  - Latency: 500ms-5s (async)
  - Cost: $0.01/email

Impact: tiered filtering cuts Tier 3 load by 70%, saving ~$350K/month.
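The tiered flow is an early-exit chain: cheap checks run first, and an email only reaches the expensive stage if no earlier tier decided. A sketch with stubbed-out tier logic (the actual checks inside each tier are hypothetical placeholders):

```python
# Each tier returns "spam", "ham", or None (undecided -> escalate).
def tier1_ip(email):        # ~5ms: IP reputation lookup (stubbed)
    return "spam" if email.get("ip") in {"203.0.113.9"} else None

def tier2_headers(email):   # ~10ms: lightweight header heuristics (stubbed)
    return "spam" if email.get("subject", "").isupper() else None

def tier3_content(email):   # ~30ms: full NLP + ML analysis (stubbed)
    return "spam" if "win a prize" in email.get("body", "") else "ham"

def classify(email):
    """Early exit: run tiers in cost order, stop at the first verdict."""
    for tier in (tier1_ip, tier2_headers, tier3_content):
        verdict = tier(email)
        if verdict is not None:
            return verdict
    return "ham"
```

Because most traffic is decided by Tier 1 or 2, the per-email cost is dominated by the cheap tiers rather than the expensive content analysis.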

Horizontal Scaling

Service Architecture

Component              Instances      Capacity                     Scaling
SMTP receivers         500            10K connections each         HPA by connections
Feature extractors     200            20K emails/sec each          HPA by CPU
ML inference           100            50K predictions/sec each     HPA by latency p99
Reputation DB (Redis)  30 shards      500M entries, 5M ops/sec     Shard by IP/domain
URL scanner            50             100K URLs/sec total (async)  Queue-based
Kafka (feedback)       20 partitions  1M events/sec                Partition by user
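Sharding the reputation DB by IP/domain means every lookup for a given key must land on the same shard, so counters never split across nodes. A simplified sketch of stable shard assignment (note: real Redis Cluster uses CRC16 over 16384 hash slots rather than this hash-mod scheme):

```python
import hashlib

NUM_SHARDS = 30  # from the table above

def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Stable shard assignment: the same IP or domain always maps
    to the same Redis shard, keeping its reputation counters local."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_shards
```

The important property is determinism across all clients, not the particular hash function.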

Auto-scaling Policy

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: spam-ml-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: spam-ml-inference
  minReplicas: 100
  maxReplicas: 500
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65
    - type: Pods
      pods:
        metric:
          name: inference_latency_p99_ms
        target:
          type: AverageValue
          averageValue: "30"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
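The HPA bounds can be sanity-checked against the capacity table. A quick back-of-envelope check (the 30% ML-traffic share comes from the tiered-filtering numbers above):

```python
# Figures from the capacity table and the HPA spec above.
PEAK_TPS = 5_000_000   # peak emails/sec across the fleet
PER_POD = 50_000       # predictions/sec one ML pod sustains
MIN_REPLICAS = 100
MAX_REPLICAS = 500

# Even if tiered filtering failed open and ALL traffic hit ML,
# minReplicas alone covers the peak: 5M / 50K = 100 pods.
assert PEAK_TPS // PER_POD <= MIN_REPLICAS

# With tiering intact (~30% of traffic reaches ML), steady-state
# need is far lower, leaving generous headroom under minReplicas.
steady_pods = (PEAK_TPS * 30 // 100) // PER_POD  # = 30 pods
```

In other words, minReplicas=100 is sized for the worst case where the cheap tiers degrade, not for the steady state.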

Adversarial Adaptation

Continuous Model Update Pipeline

Feedback loop (critical for spam):

User reports spam      -> labeled data    -> hourly retrain
User marks "not spam"  -> FP correction   -> immediate rule update
New spam pattern       -> anomaly detect  -> emergency model push

Timeline:
  T+0:     New spam campaign starts
  T+15min: Anomaly detection triggers (volume spike)
  T+1h:    Hourly retrain incorporates new samples
  T+4h:    Model deployed, 95%+ detection of new pattern
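The T+15min anomaly trigger boils down to comparing the current traffic window against a recent baseline. A minimal z-score sketch (the threshold and window choice are illustrative assumptions):

```python
from statistics import mean, stdev

def volume_spike(history, current, threshold=3.0):
    """Flag the current window if it sits `threshold` standard
    deviations above the recent baseline -- a rough stand-in for
    the volume-spike anomaly detector in the timeline above."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current > mu
    return (current - mu) / sigma > threshold
```

Production detectors would account for diurnal seasonality and per-sender baselines, but the principle is the same: a statistically improbable jump in volume pages the pipeline before any model has been retrained.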

Feature Store Architecture

Feature type          Storage               Update frequency     Latency
IP reputation         Redis                 Real-time            1ms
Domain reputation     Redis                 Hourly               1ms
Sender history        Redis                 Real-time            2ms
Content embeddings    ML inference          Per-email            10ms
URL reputation        Redis + async scan    Hourly + on-demand   3ms
Global spam patterns  Redis (bloom filter)  Every 15min          0.5ms
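The bloom filter behind the "global spam patterns" row trades a small false-positive rate for sub-millisecond membership checks with no false negatives. A self-contained sketch of the data structure (sizes and hash count are illustrative, not the production values):

```python
import hashlib

class BloomFilter:
    """Minimal bloom filter: O(k) membership test, no false
    negatives, small tunable false-positive rate."""

    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        # Derive k independent bit positions by salting the hash.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item: str):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))
```

A "maybe present" answer escalates to an exact check; a "definitely absent" answer skips it, which is what keeps the 0.5ms latency.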

High Availability & Failover

Degradation Strategy

Component failure   Fallback                 Spam catch rate
None                Full pipeline            99.5%
ML model down       Rules + reputation       85%
Feature store down  Cached features + rules  75%
URL scanner down    Skip URL analysis        90%
Everything down     IP blocklist only        30%
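The fallback table maps directly to a strategy selector driven by component health checks. A sketch (the function and strategy names are hypothetical; in production the health flags would come from live probes):

```python
def choose_strategy(ml_up: bool, feature_store_up: bool,
                    url_scanner_up: bool) -> str:
    """Pick the best available degraded mode, mirroring the
    fallback table above (worst outage checked first)."""
    if ml_up and feature_store_up and url_scanner_up:
        return "full_pipeline"               # 99.5% catch rate
    if not feature_store_up:
        return "cached_features_plus_rules"  # ~75%
    if not ml_up:
        return "rules_plus_reputation"       # ~85%
    return "skip_url_analysis"               # URL scanner down: ~90%
```

The key design point is that every failure mode still produces a verdict; the system degrades in catch rate, never in availability.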

Multi-Region

Primary: US-East (handles 40% traffic)
Secondary: EU-West (30%), AP-East (30%)

Each region: independent ML inference + shared reputation DB
Reputation sync: cross-region replication < 5 sec lag
Model sync: same model version across regions, deployed via rolling update

Cost

Component                             Monthly cost
SMTP + feature extraction (500 pods)  $150K
ML inference (100 GPU pods)           $120K
Reputation DB (Redis, 30 shards)      $80K
Kafka + storage                       $50K
URL scanning infra                    $60K
Data transfer                         $40K
Total                                 ~$500K
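Dividing the total by the traffic figures above gives the unit economics:

```python
# Back-of-envelope unit cost from the cost and traffic tables above.
MONTHLY_COST_USD = 500_000
EMAILS_PER_DAY = 300_000_000_000  # 300B

emails_per_month = EMAILS_PER_DAY * 30          # 9 trillion
cost_per_million = MONTHLY_COST_USD / emails_per_month * 1_000_000
# roughly $0.056 per million emails processed
```

That blended figure is far below the $0.0001/email Tier 3 cost, which is exactly what the tiered architecture buys: most emails never touch the expensive path.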

Misconception: training the model once is enough

Spammers adapt within hours. A model without retraining loses 5-10% recall per week. Production systems use: (1) hourly retrains on fresh data (user reports + honeypots), (2) online learning for fast adaptation, (3) rule-based emergency filters for zero-day attacks. Gmail updates its models several times a day.

Misconception: a single pipeline can handle all emails

A tiered architecture is critical: 30% of traffic is cut at the IP level in 5ms ($0.000001/email), another 40% at the header level in 10ms. Only ~30% of traffic reaches the expensive NLP+ML stage (30ms, $0.0001/email). Without tiers, infrastructure costs would be 3-5x higher.

Interview Section

Question: "How would you scale a spam filter to 300B emails/day?"

❌ Weak answer: "Add more servers for ML."

✅ Strong answer: "Four key decisions. (1) Tiered filtering: 30% is cut at the IP level (Redis, 5ms), another 40% on headers (10ms), and only 30% of traffic goes to full NLP+ML (30ms). This reduces ML load by 70%. (2) Stateless ML inference on K8s: 100 pods, HPA on p99 latency, scale-up of 50%/min to handle spikes. (3) Reputation DB on a Redis Cluster (30 shards, 500M entries): IP, domain, and sender scores updated in real time from the feedback loop. (4) Adversarial adaptation: hourly retrains plus emergency rules for zero-day campaigns, and anomaly detection on volume spikes. Total cost is ~$500K/month, with ROI > 100x from prevented damage."