Scaling a Content Moderation System

Prerequisites: Components, Metrics

YouTube receives 500 hours of video per minute. Each video passes through text (title, description, captions), image (thumbnails, keyframes), audio (speech-to-text, music detection), and video (scene detection, object recognition) pipelines. Total compute: ~100K GPU-hours/day. CSAM and terrorism content must be blocked BEFORE the first view (pre-publish screening), while hate speech can be moderated post-publish with an SLA of < 24h.

Traffic Patterns

Volume by Modality

Modality	Volume	Avg processing time	Peak rate
Text (posts, comments)	1B+/day	50ms	50K/sec
Images	500M+/day	200ms	20K/sec
Short video (< 60s)	100M+/day	2-5s	5K/sec
Long video (> 60s)	10M+/day	30-120s	500/sec
Audio (extracted)	100M+/day	1-3s	5K/sec
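
To size each pipeline, it helps to compare the flat daily average with the stated peak. The sketch below does that arithmetic, treating the "+" volumes in the table as exact lower bounds (an assumption). The names `average_rate` and `peak_to_average` are illustrative.

```python
# Daily volumes and peak rates copied from the table above.
# The "+" in the table is dropped, so these are lower bounds (assumption).
DAILY_VOLUME = {
    "text": 1_000_000_000,
    "images": 500_000_000,
    "short_video": 100_000_000,
    "long_video": 10_000_000,
    "audio": 100_000_000,
}
PEAK_PER_SEC = {
    "text": 50_000,
    "images": 20_000,
    "short_video": 5_000,
    "long_video": 500,
    "audio": 5_000,
}
SECONDS_PER_DAY = 24 * 60 * 60


def average_rate(items_per_day: int) -> float:
    """Average items per second if traffic were perfectly flat."""
    return items_per_day / SECONDS_PER_DAY


def peak_to_average(modality: str) -> float:
    """How many times the peak rate exceeds the flat daily average."""
    return PEAK_PER_SEC[modality] / average_rate(DAILY_VOLUME[modality])


for name in DAILY_VOLUME:
    print(f"{name}: avg {average_rate(DAILY_VOLUME[name]):,.0f}/sec, "
          f"peak/avg {peak_to_average(name):.1f}x")
```

Under these assumptions text averages about 11,574 items/sec, so the 50K/sec peak is roughly 4.3x the average. Capacity planned on the daily average alone would undersize the fleet by about that factor.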

Latency Requirements

Pre-publish (blocking upload):
  CSAM hash matching:    < 100ms (PhotoDNA/CSAM hash DB)
  Terrorism content:     < 500ms (GIFCT shared hash DB)

Post-publish, priority:
  Violence/gore:         < 5 min to first scan
  Nudity:               < 5 min
  Hate speech:          < 30 min (NLP more expensive)

Post-publish, batch:
  Misinformation:       < 4 hours
  Copyright:            < 24 hours (Content ID)
  Spam:                 < 1 hour
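
These targets can be kept as data so that queue workers and alerting share one source of truth. A minimal sketch, assuming the category keys and helper names below (all hypothetical); the durations are copied from the list above.

```python
from datetime import datetime, timedelta, timezone

# Time-to-first-scan targets copied from the list above.
SCAN_SLA = {
    "csam": timedelta(milliseconds=100),
    "terrorism": timedelta(milliseconds=500),
    "violence": timedelta(minutes=5),
    "nudity": timedelta(minutes=5),
    "hate_speech": timedelta(minutes=30),
    "spam": timedelta(hours=1),
    "misinformation": timedelta(hours=4),
    "copyright": timedelta(hours=24),
}

# Categories that must be cleared before the upload becomes visible.
PRE_PUBLISH = {"csam", "terrorism"}


def scan_deadline(category: str, uploaded_at: datetime) -> datetime:
    """Latest moment by which the first scan for this category must finish."""
    return uploaded_at + SCAN_SLA[category]


def is_breached(category: str, uploaded_at: datetime, now: datetime) -> bool:
    """True if the first-scan target for this category has already passed."""
    return now > scan_deadline(category, uploaded_at)


def blocks_publish(category: str) -> bool:
    """True if the check runs on the blocking upload path."""
    return category in PRE_PUBLISH


uploaded = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(scan_deadline("hate_speech", uploaded).isoformat())
```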

Tiered Processing

Fast Path (< 100ms)

Hash matching (perceptual hashes):
  - PhotoDNA for CSAM (Microsoft, industry standard)
  - GIFCT hash DB for terrorism
  - Internal known-bad hash DB

Volume: 100% of uploads
Cost: $0.00001/item
Catch rate: 90%+ of known violations
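
PhotoDNA and the GIFCT database are not public, so the sketch below swaps in a much simpler perceptual hash (an "average hash") to illustrate the general idea: similar images produce hashes that differ in only a few bits, and matching is a Hamming-distance check against known-bad hashes. The function names, the 4x4 input, and the `max_distance` threshold are assumptions for illustration only.

```python
def average_hash(pixels: list[list[int]]) -> int:
    """Pack one bit per pixel: 1 if the pixel is brighter than the image mean.

    `pixels` is a small grayscale thumbnail (rows of 0-255 values).
    """
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def matches_known_bad(candidate: int, known_bad: set[int], max_distance: int = 2) -> bool:
    """Linear scan for clarity; a real index would avoid comparing against every hash."""
    return any(hamming_distance(candidate, h) <= max_distance for h in known_bad)


original = [
    [10, 200, 10, 200],
    [200, 10, 200, 10],
    [10, 200, 10, 200],
    [200, 10, 200, 10],
]
# A uniformly brightened copy keeps the same above/below-mean pattern.
brightened = [[value + 5 for value in row] for row in original]

known_bad = {average_hash(original)}
print(hex(average_hash(original)))  # 0x5a5a
print(matches_known_bad(average_hash(brightened), known_bad))  # True
```

An exact cryptographic hash would miss the brightened copy entirely; tolerance to small edits is the reason perceptual hashes are used on this path.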

Standard Path (< 5 min)

ML classification:
  Text:  BERT-based classifier, multi-label
  Image: ResNet/ViT classifier + OCR
  Audio: Whisper STT -> text classifier

Volume: 100% of uploads (async)
Cost: $0.001/item (text), $0.01/item (image)
Catch rate: 85-95% depending on category

Deep Path (minutes-hours)

Complex analysis:
  Video: scene-by-scene analysis, temporal patterns
  Cross-modal: text + image combined understanding
  Context: user history, community norms, cultural context
  LLM-based: GPT-4V for nuanced policy interpretation

Volume: 5-10% of uploads (borderline + high-risk)
Cost: $0.10-1.00/item
Catch rate: +5-10% on top of standard
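
The three paths can be read as a routing decision made per upload. The following is a minimal sketch of that decision, not a description of any production system: the `ScanResult` fields, the action names, and both thresholds are hypothetical and would need tuning per category.

```python
from dataclasses import dataclass


@dataclass
class ScanResult:
    hash_match: bool                  # fast path: known-bad hash hit
    max_score: float                  # standard path: highest per-category probability
    high_risk_uploader: bool = False  # context signal, e.g. prior violations


def route(result: ScanResult,
          block_threshold: float = 0.95,
          review_threshold: float = 0.60) -> str:
    """Pick the next step for an upload. Thresholds are illustrative."""
    if result.hash_match:
        return "block_pre_publish"
    if result.max_score >= block_threshold:
        return "auto_action"
    # Borderline scores and high-risk uploaders go to the expensive deep path.
    if result.max_score >= review_threshold or result.high_risk_uploader:
        return "deep_path"
    return "allow"


print(route(ScanResult(hash_match=False, max_score=0.72)))
```

Keeping the deep path behind a score band is what holds its volume near the 5-10% of uploads quoted above; widening the band trades cost for recall.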

Horizontal Scaling

Service Architecture

Component	Instances	GPU/CPU	Scaling trigger
Hash matching	50 pods	CPU	Queue depth
Text classifier	100 pods	CPU (ONNX)	Latency p99
Image classifier	200 pods	GPU (A10G)	Queue depth
Video processor	300 pods	GPU (A100)	Queue depth
Audio STT	100 pods	GPU (T4)	Queue depth
Review queue API	20 pods	CPU	Request rate
Human review UI	10 pods	CPU	Active moderators
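
For the components that scale on queue depth, the scaling rule can be as simple as "enough pods that each one has at most N items waiting". A minimal sketch under that assumption; the function name, the per-pod target, and the replica bounds are hypothetical, with the 200-pod floor taken from the image classifier row above.

```python
import math


def desired_replicas(queue_depth: int,
                     items_per_pod_target: int,
                     min_replicas: int,
                     max_replicas: int) -> int:
    """Replica count that keeps the backlog per pod at or below the target."""
    if items_per_pod_target <= 0:
        raise ValueError("items_per_pod_target must be positive")
    wanted = math.ceil(queue_depth / items_per_pod_target)
    # Clamp so the service never scales to zero or past its GPU quota.
    return max(min_replicas, min(max_replicas, wanted))


# Image classifier: 200-pod baseline, hypothetical ceiling of 600 pods.
print(desired_replicas(queue_depth=45_000, items_per_pod_target=100,
                       min_replicas=200, max_replicas=600))  # 450
```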

Queue-Based Architecture

Upload -> Kafka (partitioned by content_type)

Consumers:
  hash-matching-group:     50 consumers, < 100ms
  text-classification:    100 consumers, < 500ms
  image-classification:   200 consumers, < 2s
  video-processing:       300 consumers, < 120s

Priority queues:
  P0 (CSAM, terrorism):   Dedicated consumers, immediate
  P1 (violence, nudity):  High priority, < 5 min
  P2 (hate speech, spam): Normal priority, < 30 min
  P3 (misinformation):    Low priority, < 4 hours
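
The priority levels above can be illustrated with an in-memory heap. This is only a stand-in for the idea: in the layout described here P0 has dedicated consumers, and a Kafka-based system would more likely use separate topics per priority than a single shared heap. The class and category names are hypothetical.

```python
import heapq
import itertools

# Priority level per category, following the P0-P3 list above.
PRIORITY = {
    "csam": 0, "terrorism": 0,
    "violence": 1, "nudity": 1,
    "hate_speech": 2, "spam": 2,
    "misinformation": 3,
}


class ModerationQueue:
    """Min-heap ordered by priority, then by arrival order within a level."""

    def __init__(self) -> None:
        self._heap: list[tuple[int, int, str]] = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order

    def push(self, item_id: str, category: str) -> None:
        heapq.heappush(self._heap, (PRIORITY[category], next(self._counter), item_id))

    def pop(self) -> str:
        _, _, item_id = heapq.heappop(self._heap)
        return item_id

    def __len__(self) -> int:
        return len(self._heap)


queue = ModerationQueue()
queue.push("a", "spam")
queue.push("b", "csam")
queue.push("c", "violence")
queue.push("d", "terrorism")
print([queue.pop() for _ in range(len(queue))])  # ['b', 'd', 'c', 'a']
```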

Human-in-the-Loop Scaling

Moderator Workforce

Tier	Role	Share of content	Cost/review
T1: ML auto-action	None (auto)	~90% of content	$0.001
T2: Queue review	Junior moderator	~8% of content	$0.05
T3: Appeal review	Senior moderator	~1.5% of content	$0.20
T4: Policy edge case	Policy specialist	~0.5% of content	$1.00
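
Weighting each tier's cost by its share of content gives a blended cost per item. The sketch below does that arithmetic with the figures from the table, assuming each item is handled by exactly one tier (a simplification, since an appealed item would in practice also have passed through an earlier tier).

```python
# (share of content, cost per review in USD), copied from the table above.
TIERS = {
    "T1 ML auto-action": (0.90, 0.001),
    "T2 queue review": (0.08, 0.05),
    "T3 appeal review": (0.015, 0.20),
    "T4 policy edge case": (0.005, 1.00),
}


def blended_cost_per_item(tiers: dict[str, tuple[float, float]]) -> float:
    """Expected review cost of one item, weighting each tier by its share."""
    return sum(share * cost for share, cost in tiers.values())


total = blended_cost_per_item(TIERS)
ml_part = TIERS["T1 ML auto-action"][0] * TIERS["T1 ML auto-action"][1]
print(f"blended: ${total:.4f}/item, human tiers: {(total - ml_part) / total:.0%}")
```

The blended figure comes to about $0.0129 per item, of which only $0.0009 is the automated tier: the 10% of content that reaches a person accounts for roughly 93% of the per-item cost.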

Moderator Wellbeing

Critical for scaling:
  - Rotation: max 4 hours on CSAM/violence before mandatory break
  - Blurring: hash-match content shown blurred, moderator confirms
  - Counseling: weekly access, mandatory quarterly screening
  - AI assist: pre-classification reduces exposure to worst content

Cost

Component	Monthly cost
GPU inference (video + image)	$800K
CPU inference (text + hash)	$150K
Human moderation (15K moderators)	$30M
Storage (content + logs)	$200K
Kafka + queues	$50K
Total (infra only)	~$1.2M
Total (with humans)	~$31.2M
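
The totals can be checked, and the human share made explicit, with a few lines of arithmetic. The dictionary keys are illustrative; the amounts and the 15K headcount are taken from the table above.

```python
# Monthly costs in USD, copied from the table above.
MONTHLY_COST_USD = {
    "gpu_inference": 800_000,
    "cpu_inference": 150_000,
    "storage": 200_000,
    "kafka_queues": 50_000,
    "human_moderation": 30_000_000,
}
MODERATORS = 15_000

infra_total = sum(cost for name, cost in MONTHLY_COST_USD.items()
                  if name != "human_moderation")
grand_total = sum(MONTHLY_COST_USD.values())
human_share = MONTHLY_COST_USD["human_moderation"] / grand_total
cost_per_moderator = MONTHLY_COST_USD["human_moderation"] / MODERATORS

print(f"infra: ${infra_total:,}  total: ${grand_total:,}")
print(f"human share: {human_share:.1%}  per moderator: ${cost_per_moderator:,.0f}/month")
```

Human moderation is about 96% of the monthly total, so halving GPU spend would move the overall bill by just over 1%.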

High Availability

Degradation Strategy

Failure	Fallback	Risk
Image classifier down	Text-only + hash matching	Miss 10-15% image violations
Video processor down	Thumbnail + audio only	Miss scene-level violations
LLM deep analysis down	Standard ML only	Miss nuanced violations
All ML down	Hash matching + manual queue	30-40% miss rate, queue explodes
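
The degradation table can be encoded as a lookup that the orchestrator consults when health checks fail. A minimal sketch, assuming the component identifiers below (all hypothetical); the fallbacks and risk notes mirror the table rows.

```python
from typing import NamedTuple


class Fallback(NamedTuple):
    run_instead: tuple[str, ...]
    risk: str


# One entry per single-component failure in the table above.
DEGRADATION = {
    "image_classifier": Fallback(("text_classifier", "hash_matching"),
                                 "miss 10-15% of image violations"),
    "video_processor": Fallback(("thumbnail_classifier", "audio_stt"),
                                "miss scene-level violations"),
    "llm_deep_analysis": Fallback(("standard_ml",),
                                  "miss nuanced violations"),
}
ALL_ML_DOWN = Fallback(("hash_matching", "manual_queue"),
                       "30-40% miss rate, review queue grows rapidly")

# Assumed full set of ML components; hash matching is not ML and stays up.
ML_COMPONENTS = frozenset({"text_classifier", "image_classifier",
                           "video_processor", "audio_stt", "llm_deep_analysis"})


def select_fallbacks(down: set[str]) -> list[Fallback]:
    """Return the fallbacks to activate for the given set of failed components."""
    if ML_COMPONENTS <= down:
        return [ALL_ML_DOWN]
    return [DEGRADATION[name] for name in sorted(down) if name in DEGRADATION]


for fallback in select_fallbacks({"video_processor"}):
    print(fallback.run_instead, "->", fallback.risk)
```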

Misconception: GPU scaling solves video moderation

Video processing is 100x more expensive than image processing. 500 hours of video/min (YouTube) is 30,000 seconds of video arriving every second, so keyframe extraction at 1 frame/sec yields 30,000 images/sec from keyframes alone. Without smart sampling (scene change detection, first/last 10 sec, flagged segments), you need 10x more GPUs. Solution: sample 10-20% of frames and run full analysis only on flagged videos. This cuts GPU cost by 80% with < 2% loss in recall.
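
The keyframe arithmetic can be reproduced directly. The sketch below converts the upload rate into frames per second and applies a sampling fraction; the function names are illustrative, and the extra cost of full analysis on flagged videos is not modeled.

```python
HOURS_UPLOADED_PER_MINUTE = 500  # YouTube figure quoted in the text


def keyframes_per_second(hours_per_minute: float, frames_per_video_second: float) -> float:
    """Frames produced per wall-clock second at the given extraction rate."""
    video_seconds_per_second = hours_per_minute * 3600 / 60
    return video_seconds_per_second * frames_per_video_second


def sampled_frames_per_second(hours_per_minute: float,
                              frames_per_video_second: float,
                              sample_fraction: float) -> float:
    """Frames actually sent to the classifier after smart sampling."""
    return keyframes_per_second(hours_per_minute, frames_per_video_second) * sample_fraction


full = keyframes_per_second(HOURS_UPLOADED_PER_MINUTE, 1.0)
for fraction in (0.10, 0.20):
    kept = sampled_frames_per_second(HOURS_UPLOADED_PER_MINUTE, 1.0, fraction)
    print(f"sample {fraction:.0%}: {kept:,.0f} of {full:,.0f} frames/sec")
```

Sampling 20% of frames leaves 6,000 frames/sec, an 80% reduction in classifier load before any re-analysis of flagged videos is added back.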

Misconception: moderation is a purely technical problem

The most expensive part is human moderators ($30M/month at Meta). The scaling bottleneck is not GPUs but people: hiring, training (2-4 weeks), burnout (average tenure < 2 years), and wellbeing programs. ML automation cuts the human review load from 100% to ~10%, but 10% of 1B posts/day is still 100M cases for people to review. Without ML it would be 1B cases, which is physically impossible.

Interview Section

Question: "How would you scale moderation for YouTube (500 hours of video/min)?"

Weak answer: "More GPUs for video classification."

Strong answer: "Three tiers. (1) Fast path (< 100ms): perceptual hash matching (PhotoDNA, GIFCT) catches 90%+ of known violations before publication. (2) Standard path (async, < 5 min): smart sampling of 10-20% of keyframes at scene changes, plus thumbnail + audio STT -> text classifier. This cuts GPU cost by 80%. (3) Deep path (5-10% of traffic): full video analysis + LLM for borderline cases. Scaling: queue-based architecture with priorities (CSAM = P0, immediate; hate speech = P2, < 30 min). Human-in-the-loop: ML auto-actions 90%, junior review 8%, senior appeal 1.5%, policy specialist 0.5%. The bottleneck is not GPUs but the moderator workforce ($30M/month at Meta, burnout, wellbeing). Total infra: $1.2M/month; total with humans: ~$31M/month."