
Scaling a Content Moderation System


Prerequisites: Components, Metrics

YouTube receives 500 hours of video per minute. Each video passes through text (title, description, captions), image (thumbnails, keyframes), audio (speech-to-text, music detection), and video (scene detection, object recognition) pipelines. Total compute: ~100K GPU-hours/day. CSAM and terrorism content must be blocked BEFORE the first view (pre-publish screening), while hate speech can be moderated post-publish with an SLA of < 24h.

Traffic Patterns

Volume by Modality

| Modality | Volume | Avg processing time | Peak |
|---|---|---|---|
| Text (posts, comments) | 1B+/day | 50ms | 50K/sec |
| Images | 500M+/day | 200ms | 20K/sec |
| Short video (< 60s) | 100M+/day | 2-5s | 5K/sec |
| Long video (> 60s) | 10M+/day | 30-120s | 500/sec |
| Audio (extracted) | 100M+/day | 1-3s | 5K/sec |
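
The table can be turned into a rough concurrency estimate with Little's law (items in flight = arrival rate x processing time). The sketch below is a back-of-the-envelope calculation, not a sizing tool: it assumes the peak rate and the average processing time from the table hold at the same moment, and the names `PEAK_LOAD` and `in_flight` are illustrative.

```python
# (modality, peak items/sec, min processing ms, max processing ms), taken from the table
PEAK_LOAD = [
    ("text", 50_000, 50, 50),
    ("images", 20_000, 200, 200),
    ("short_video", 5_000, 2_000, 5_000),
    ("long_video", 500, 30_000, 120_000),
    ("audio", 5_000, 1_000, 3_000),
]


def in_flight(rate_per_sec: int, processing_ms: int) -> int:
    """Little's law: average items in the system = arrival rate x time in system."""
    return rate_per_sec * processing_ms // 1000


# Lower and upper bound of concurrently processed items per modality at peak.
CONCURRENCY = {
    name: (in_flight(rate, lo_ms), in_flight(rate, hi_ms))
    for name, rate, lo_ms, hi_ms in PEAK_LOAD
}

for name, (lo, hi) in CONCURRENCY.items():
    print(f"{name:12s} {lo:>7,} to {hi:>7,} items in flight")
```

Under these assumptions, long video dominates concurrency despite having the lowest arrival rate, because each item stays in the system far longer.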

Latency Requirements

Pre-publish (blocking upload):
  CSAM hash matching:    < 100ms (PhotoDNA/CSAM hash DB)
  Terrorism content:     < 500ms (GIFCT shared hash DB)

Post-publish, priority:
  Violence/gore:        < 5 min to first scan
  Nudity:               < 5 min
  Hate speech:          < 30 min (NLP more expensive)

Post-publish, batch:
  Misinformation:       < 4 hours
  Copyright:            < 24 hours (ContentID)
  Spam:                 < 1 hour
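
One way to make these budgets machine-readable is a small lookup that returns the scan deadline for a given category. This is a minimal sketch: the category keys and the helpers `blocks_upload` and `scan_deadline` are hypothetical names, and only the time budgets come from the list above.

```python
from datetime import datetime, timedelta, timezone

# Pre-publish checks block the upload until they complete.
PRE_PUBLISH = {
    "csam": timedelta(milliseconds=100),
    "terrorism": timedelta(milliseconds=500),
}

# Post-publish checks run after the content is already visible.
POST_PUBLISH = {
    "violence": timedelta(minutes=5),
    "nudity": timedelta(minutes=5),
    "hate_speech": timedelta(minutes=30),
    "spam": timedelta(hours=1),
    "misinformation": timedelta(hours=4),
    "copyright": timedelta(hours=24),
}

ALL_BUDGETS = {**PRE_PUBLISH, **POST_PUBLISH}


def blocks_upload(category: str) -> bool:
    """True if the check must finish before the content is published."""
    return category in PRE_PUBLISH


def scan_deadline(category: str, uploaded_at: datetime) -> datetime:
    """Latest time by which the first scan for this category should finish."""
    return uploaded_at + ALL_BUDGETS[category]


uploaded = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(scan_deadline("hate_speech", uploaded).isoformat())
```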

Tiered Processing

Fast Path (< 100ms)

Hash matching (perceptual hashes):
  - PhotoDNA for CSAM (Microsoft, industry standard)
  - GIFCT hash DB for terrorism
  - Internal known-bad hash DB

Volume: 100% of uploads
Cost: $0.00001/item
Catch rate: 90%+ of known violations
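
The idea behind perceptual hash matching can be illustrated with a generic 64-bit hash compared by Hamming distance. This is only a sketch: PhotoDNA and the GIFCT database use their own hash formats and matching rules, which are not reproduced here. The threshold of 4 bits, the sample hashes, and the linear scan are all assumptions made for illustration.

```python
from typing import Iterable, Optional


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def match_known_bad(
    content_hash: int, known_bad: Iterable[int], max_distance: int = 4
) -> Optional[int]:
    """Return the first known-bad hash within max_distance bits, else None.

    A linear scan is used for clarity; a system at 100% of uploads would
    need an index structure instead.
    """
    for bad in known_bad:
        if hamming_distance(content_hash, bad) <= max_distance:
            return bad
    return None


# Made-up 64-bit hashes, not real database entries.
KNOWN_BAD = [0xF0F0F0F0F0F0F0F0, 0x123456789ABCDEF0]
near_duplicate = 0xF0F0F0F0F0F0F0F1  # differs from the first entry by one bit
unrelated = 0x0F0F0F0F0F0F0F0F

print(match_known_bad(near_duplicate, KNOWN_BAD) is not None)
print(match_known_bad(unrelated, KNOWN_BAD) is None)
```

A tolerance of a few bits is what lets a perceptual hash catch re-encoded or slightly cropped copies, which an exact cryptographic hash would miss.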

Standard Path (< 5 min)

ML classification:
  Text:  BERT-based classifier, multi-label
  Image: ResNet/ViT classifier + OCR
  Audio: Whisper STT -> text classifier

Volume: 100% of uploads (async)
Cost: $0.001/item (text), $0.01/item (image)
Catch rate: 85-95% depending on category

Deep Path (minutes-hours)

Complex analysis:
  Video: scene-by-scene analysis, temporal patterns
  Cross-modal: text + image combined understanding
  Context: user history, community norms, cultural context
  LLM-based: GPT-4V for nuanced policy interpretation

Volume: 5-10% of uploads (borderline + high-risk)
Cost: $0.10-1.00/item
Catch rate: +5-10% on top of standard
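
The three paths can be read as a routing decision made per item. The sketch below shows one possible shape of that decision. The thresholds (0.2 and 0.9), the `Signals` fields, and the returned labels are hypothetical and would be tuned per policy category in practice.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    hash_match: bool          # fast path: known-bad hash hit
    ml_score: float           # standard path: classifier violation probability
    high_risk_uploader: bool  # context signal, e.g. from user history


def route(s: Signals, low: float = 0.2, high: float = 0.9) -> str:
    """Pick the next step for an item based on fast- and standard-path results."""
    if s.hash_match:
        return "block"        # known violation, no further analysis needed
    if s.ml_score >= high:
        return "auto_action"  # classifier is confident enough to act
    if s.ml_score >= low or s.high_risk_uploader:
        return "deep_path"    # borderline or high-risk: escalate
    return "allow"


print(route(Signals(hash_match=False, ml_score=0.5, high_risk_uploader=False)))
```

Only the middle band of scores and the high-risk uploads reach the expensive deep path, which is how its share stays in the 5-10% range.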

Horizontal Scaling

Service Architecture

| Component | Instances | GPU/CPU | Scaling trigger |
|---|---|---|---|
| Hash matching | 50 pods | CPU | Queue depth |
| Text classifier | 100 pods | CPU (ONNX) | Latency p99 |
| Image classifier | 200 pods | GPU (A10G) | Queue depth |
| Video processor | 300 pods | GPU (A100) | Queue depth |
| Audio STT | 100 pods | GPU (T4) | Queue depth |
| Review queue API | 20 pods | CPU | Request rate |
| Human review UI | 10 pods | CPU | Active moderators |

Queue-Based Architecture

Upload -> Kafka (partitioned by content_type)

Consumers:
  hash-matching-group:     50 consumers, < 100ms
  text-classification:    100 consumers, < 500ms
  image-classification:   200 consumers, < 2s
  video-processing:       300 consumers, < 120s

Priority queues:
  P0 (CSAM, terrorism):   Dedicated consumers, immediate
  P1 (violence, nudity):  High priority, < 5 min
  P2 (hate speech, spam): Normal priority, < 30 min
  P3 (misinformation):    Low priority, < 4 hours
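
The priority levels above can be sketched with an in-memory heap. This is a simplification: the layout above uses Kafka with dedicated consumers for P0, while this example collapses everything into a single queue to show only the ordering. The class name `ModerationQueue` and the category keys are hypothetical.

```python
import heapq
import itertools

# Category -> priority level, following the P0-P3 list.
PRIORITY = {
    "csam": 0,
    "terrorism": 0,
    "violence": 1,
    "nudity": 1,
    "hate_speech": 2,
    "spam": 2,
    "misinformation": 3,
}


class ModerationQueue:
    """Min-heap ordered by priority, then by arrival order within a priority."""

    def __init__(self) -> None:
        self._heap = []
        self._arrival = itertools.count()  # tie-breaker keeps FIFO order

    def push(self, item_id: str, category: str) -> None:
        heapq.heappush(self._heap, (PRIORITY[category], next(self._arrival), item_id))

    def pop(self) -> str:
        _, _, item_id = heapq.heappop(self._heap)
        return item_id

    def __len__(self) -> int:
        return len(self._heap)


q = ModerationQueue()
q.push("post-1", "spam")
q.push("post-2", "misinformation")
q.push("img-3", "csam")
q.push("vid-4", "violence")
order = [q.pop() for _ in range(len(q))]
print(order)
```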

Human-in-the-Loop Scaling

Moderator Workforce

| Tier | Role | Volume/day | Cost/review |
|---|---|---|---|
| T1: ML auto-action | None (auto) | ~90% of content | $0.001 |
| T2: Queue review | Junior moderator | ~8% of content | $0.05 |
| T3: Appeal review | Senior moderator | ~1.5% of content | $0.20 |
| T4: Policy edge case | Policy specialist | ~0.5% of content | $1.00 |
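
The tier shares and per-review costs imply a blended cost per item, worked through below. This is plain arithmetic on the table values. It treats each item as handled by exactly one tier, which is an assumption, since an appealed item would in practice also have passed through an earlier tier.

```python
# tier -> (share of content, cost per review in USD), from the table
TIERS = {
    "T1 ML auto-action": (0.90, 0.001),
    "T2 queue review": (0.08, 0.05),
    "T3 appeal review": (0.015, 0.20),
    "T4 policy edge case": (0.005, 1.00),
}

total_share = sum(share for share, _ in TIERS.values())
blended_cost = sum(share * cost for share, cost in TIERS.values())

# Cost contributed by the human tiers (T2 to T4) only.
human_cost = sum(
    share * cost for name, (share, cost) in TIERS.items() if not name.startswith("T1")
)

print(f"blended cost per item: ${blended_cost:.4f}")
print(f"human tiers: {human_cost / blended_cost:.0%} of cost for {1 - 0.90:.0%} of volume")
```

Under that assumption the human tiers handle about 10% of the volume but account for roughly 93% of the per-item cost.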

Moderator Wellbeing

Critical for scaling:
  - Rotation: max 4 hours on CSAM/violence before mandatory break
  - Blurring: hash-match content shown blurred, moderator confirms
  - Counseling: weekly access, mandatory quarterly screening
  - AI assist: pre-classification reduces exposure to worst content

Cost

| Component | Monthly cost |
|---|---|
| GPU inference (video + image) | $800K |
| CPU inference (text + hash) | $150K |
| Human moderation (15K moderators) | $30M |
| Storage (content + logs) | $200K |
| Kafka + queues | $50K |
| Total (infra only) | ~$1.2M |
| Total (with humans) | ~$31.2M |
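
The totals can be checked with a few lines of arithmetic. The per-moderator figure below is simply the table's human moderation cost divided by its headcount; it is a derived number, not one stated in the source.

```python
# Monthly cost in thousands of USD, from the table
INFRA_K = {
    "gpu_inference": 800,
    "cpu_inference": 150,
    "storage": 200,
    "kafka_queues": 50,
}
HUMAN_K = 30_000
MODERATORS = 15_000

infra_total_k = sum(INFRA_K.values())
grand_total_k = infra_total_k + HUMAN_K
human_share_pct = HUMAN_K * 100 / grand_total_k
cost_per_moderator = HUMAN_K * 1000 // MODERATORS  # USD per moderator per month

print(f"infra: ${infra_total_k / 1000:.1f}M, total: ${grand_total_k / 1000:.1f}M")
print(f"human share: {human_share_pct:.1f}%, per moderator: ${cost_per_moderator:,}/month")
```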

High Availability

Degradation Strategy

| Failure | Fallback | Risk |
|---|---|---|
| Image classifier down | Text-only + hash matching | Miss 10-15% image violations |
| Video processor down | Thumbnail + audio only | Miss scene-level violations |
| LLM deep analysis down | Standard ML only | Miss nuanced violations |
| All ML down | Hash matching + manual queue | 30-40% miss rate, queue explodes |
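
The degradation table maps naturally onto a lookup keyed by the failed component. The sketch below is one hypothetical encoding: the component identifiers and the `degraded_modes` helper are illustrative, and the health signal that would feed the `down` set is out of scope.

```python
from typing import List, Set

# Failed component -> fallback mode, following the table.
FALLBACKS = {
    "image_classifier": "text-only + hash matching",
    "video_processor": "thumbnail + audio only",
    "llm_deep_analysis": "standard ML only",
}

# Assumed set of all ML components; hash matching is not ML and is excluded.
ALL_ML = {
    "text_classifier",
    "image_classifier",
    "video_processor",
    "audio_stt",
    "llm_deep_analysis",
}


def degraded_modes(down: Set[str]) -> List[str]:
    """Return the fallback modes to activate for the given failed components."""
    if ALL_ML <= down:
        # Every ML component is unavailable: only hashes and humans remain.
        return ["hash matching + manual queue"]
    return [FALLBACKS[c] for c in sorted(down) if c in FALLBACKS]


print(degraded_modes({"video_processor"}))
print(degraded_modes(set(ALL_ML)))
```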

Misconception: GPU scaling solves the video moderation problem

Video processing is 100x more expensive than image processing. 500 hours of video per minute (YouTube) at a keyframe extraction rate of 1 frame/sec yields 30,000 images/sec from keyframes alone. Without smart sampling (scene change detection, the first and last 10 seconds, flagged segments), 10x more GPUs are needed. The fix: sample 10-20% of frames and run full analysis only on flagged videos. This cuts GPU cost by 80% at < 2% loss in recall.
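
The keyframe arithmetic and the sampling rule can be sketched as follows. The first part reproduces the 30,000 images/sec figure. The second part is a hypothetical `select_frames` helper that keeps the first and last 10 seconds, scene changes, and flagged segments. Timestamps are whole seconds (one candidate frame per second), and scene-change detection itself is assumed to happen elsewhere.

```python
from typing import Iterable, List, Tuple

# Part 1: keyframe volume at 1 frame/sec.
UPLOAD_HOURS_PER_MIN = 500
FULL_FRAMES_PER_SEC = UPLOAD_HOURS_PER_MIN * 3600 // 60  # seconds of video per second
SAMPLED = (FULL_FRAMES_PER_SEC * 10 // 100, FULL_FRAMES_PER_SEC * 20 // 100)

print(f"all keyframes: {FULL_FRAMES_PER_SEC:,}/sec")
print(f"10-20% sample: {SAMPLED[0]:,} to {SAMPLED[1]:,}/sec")


# Part 2: which seconds of a single video to sample.
def select_frames(
    duration_s: int,
    scene_changes: Iterable[int],
    flagged_segments: Iterable[Tuple[int, int]] = (),
    edge_s: int = 10,
) -> List[int]:
    """Return the sorted timestamps (in seconds) of the frames to analyze."""
    keep = set(range(0, min(edge_s, duration_s)))              # first seconds
    keep.update(range(max(0, duration_s - edge_s), duration_s))  # last seconds
    keep.update(t for t in scene_changes if 0 <= t < duration_s)
    for start, end in flagged_segments:
        keep.update(range(max(0, start), min(end, duration_s)))
    return sorted(keep)


frames = select_frames(600, scene_changes=[50, 120, 300], flagged_segments=[(200, 260)])
print(f"{len(frames)} of 600 frames kept")
```

In this example a 10-minute video keeps 83 of 600 candidate frames, which lands inside the 10-20% band.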

Misconception: moderation is a purely technical problem

The most expensive part is human moderators ($30M/month at Meta). The scaling bottleneck is not GPUs but people: hiring, training (2-4 weeks), burnout (average tenure < 2 years), and wellbeing programs. ML automation cuts the human review load from 100% to ~10%, but 10% of 1B posts/day is still 100M cases for humans. Without ML it would be 1B cases, which is physically impossible.

Interview Section

Question: "How would you scale moderation for YouTube (500 hours of video per minute)?"

❌ Weak answer: "More GPUs for video classification."

✅ Strong answer: "Three tiers. (1) Fast path (< 100ms): perceptual hash matching (PhotoDNA, GIFCT) catches 90%+ of known violations before publication. (2) Standard path (async, < 5 min): smart sampling of 10-20% of keyframes at scene changes, plus thumbnail and audio STT -> text classifier. This cuts GPU cost by 80%. (3) Deep path (5-10% of traffic): full video analysis plus an LLM for borderline cases. Scaling: queue-based architecture with priorities (CSAM = P0, immediate; hate speech = P2, < 30 min). Human-in-the-loop: ML auto-actions 90%, junior review 8%, senior appeal 1.5%, policy specialist 0.5%. The bottleneck is not GPUs but the moderator workforce ($30M/month at Meta, burnout, wellbeing). Total infra: $1.2M/month; total with humans: $31M/month."