Scaling a Content Moderation System
Prerequisites: Components, Metrics
YouTube receives 500 hours of video per minute. Each video passes through text (title, description, captions), image (thumbnails, keyframes), audio (speech-to-text, music detection), and video (scene detection, object recognition) pipelines. Total compute: ~100K GPU-hours/day. CSAM and terrorism content must be blocked BEFORE the first view (pre-publish screening), while hate speech can be moderated post-publish with an SLA < 24h.
Traffic Patterns
Volume by Modality
| Modality | Volume | Avg processing time | Peak |
|---|---|---|---|
| Text (posts, comments) | 1B+/day | 50ms | 50K/sec |
| Images | 500M+/day | 200ms | 20K/sec |
| Short video (< 60s) | 100M+/day | 2-5s | 5K/sec |
| Long video (≥ 60s) | 10M+/day | 30-120s | 500/sec |
| Audio (extracted) | 100M+/day | 1-3s | 5K/sec |
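To size capacity from this table, it helps to compare the stated peak with the average rate implied by the daily volume. The sketch below is only back-of-the-envelope arithmetic on the table's lower-bound figures; the function names are illustrative, and it assumes the daily volume is spread evenly over 24 hours.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# (daily volume, stated peak per second), lower bounds copied from the table
VOLUMES = {
    "text": (1_000_000_000, 50_000),
    "images": (500_000_000, 20_000),
    "short_video": (100_000_000, 5_000),
    "long_video": (10_000_000, 500),
    "audio": (100_000_000, 5_000),
}


def average_rate(daily_volume: float) -> float:
    """Items per second if the daily volume arrived at a constant rate."""
    return daily_volume / SECONDS_PER_DAY


def peak_to_average(daily_volume: float, peak_per_sec: float) -> float:
    """How many times the stated peak exceeds the even-spread average."""
    return peak_per_sec / average_rate(daily_volume)


if __name__ == "__main__":
    for name, (daily, peak) in VOLUMES.items():
        print(
            f"{name}: avg {average_rate(daily):,.0f}/sec, "
            f"peak/avg {peak_to_average(daily, peak):.1f}x"
        )
```

With these figures the peak is roughly 3.5 to 4.3 times the average, which is the headroom the consumers and autoscaling have to absorb.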
Latency Requirements
Pre-publish (blocking upload):
- CSAM hash matching: < 100ms (PhotoDNA/CSAM hash DB)
- Terrorism content: < 500ms (GIFCT shared hash DB)
Post-publish, priority:
- Violence/gore: < 5 min to first scan
- Nudity: < 5 min
- Hate speech: < 30 min (NLP more expensive)
Post-publish, batch:
- Misinformation: < 4 hours
- Copyright: < 24 hours (ContentID)
- Spam: < 1 hour
Tiered Processing
Fast Path (< 100ms)
Hash matching (perceptual hashes):
- PhotoDNA for CSAM (Microsoft, industry standard)
- GIFCT hash DB for terrorism
- Internal known-bad hash DB
Volume: 100% of uploads
Cost: $0.00001/item
Catch rate: 90%+ of known violations
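Perceptual hash matching can be pictured as a nearest-neighbour lookup under Hamming distance. The sketch below is a toy illustration, not PhotoDNA or the GIFCT format (neither is public): it assumes 64-bit integer hashes, a hypothetical `HashBlocklist` class, and an arbitrary distance threshold. A linear scan is used for clarity; a production system would need an index to stay under 100ms.

```python
from typing import Iterable, Optional


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


class HashBlocklist:
    """Toy known-bad hash database (hypothetical, for illustration only)."""

    def __init__(self, known_bad: Iterable[int], max_distance: int = 4) -> None:
        self.known_bad = list(known_bad)
        # Assumed threshold: hashes within this many bits count as a match.
        self.max_distance = max_distance

    def match(self, content_hash: int) -> Optional[int]:
        """Return the first known-bad hash within the threshold, else None."""
        for bad in self.known_bad:
            if hamming_distance(content_hash, bad) <= self.max_distance:
                return bad
        return None


if __name__ == "__main__":
    blocklist = HashBlocklist([0xFFFF0000FFFF0000])
    # One flipped bit (e.g. after re-encoding) still matches.
    print(blocklist.match(0xFFFF0000FFFF0001) is not None)
    # An unrelated hash does not.
    print(blocklist.match(0x0) is not None)
```

Matching on distance rather than equality is what lets this path catch re-encoded or slightly cropped copies of known material, while novel content still needs the ML paths.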
Standard Path (< 5 min)
ML classification:
- Text: BERT-based classifier, multi-label
- Image: ResNet/ViT classifier + OCR
- Audio: Whisper STT -> text classifier
Volume: 100% of uploads (async)
Cost: $0.001/item (text), $0.01/item (image)
Catch rate: 85-95% depending on category
Deep Path (minutes-hours)
Complex analysis:
- Video: scene-by-scene analysis, temporal patterns
- Cross-modal: text + image combined understanding
- Context: user history, community norms, cultural context
- LLM-based: GPT-4V for nuanced policy interpretation
Volume: 5-10% of uploads (borderline + high-risk)
Cost: $0.10-1.00/item
Catch rate: +5-10% on top of standard
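The three tiers can be read as a routing decision plus a blended cost per item. The sketch below is one possible reading, not a prescribed design: the `route` function, the `Verdict` type, and the score thresholds (0.2 and 0.9) are assumptions made for illustration, and only the per-item costs and the 5-10% deep-path share come from the text above.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class Verdict:
    path: Literal["fast", "standard", "deep"]
    action: Literal["block", "allow", "escalate"]


def route(
    hash_hit: bool,
    ml_score: float,
    high_risk: bool = False,
    allow_below: float = 0.2,   # assumed threshold, not from the source
    block_above: float = 0.9,   # assumed threshold, not from the source
) -> Verdict:
    """Pick the tier that decides an item's outcome."""
    if hash_hit:
        return Verdict("fast", "block")  # known-bad hash: stop here
    borderline = allow_below <= ml_score < block_above
    if high_risk or borderline:
        return Verdict("deep", "escalate")  # the 5-10% borderline/high-risk slice
    if ml_score >= block_above:
        return Verdict("standard", "block")
    return Verdict("standard", "allow")


def expected_cost_per_item(
    standard_cost: float,
    deep_share: float,
    deep_cost: float,
    fast_cost: float = 0.00001,
) -> float:
    """Blended cost: every item pays fast + standard, a share also pays deep."""
    return fast_cost + standard_cost + deep_share * deep_cost


if __name__ == "__main__":
    print(route(hash_hit=False, ml_score=0.5))
    # Image at the low end of the deep path: 5% of items at $0.10 each.
    print(f"${expected_cost_per_item(0.01, 0.05, 0.10):.5f} per image")
```

Under these assumptions the deep path, despite handling only a small share of uploads, can account for a third or more of the per-image cost, which is why the share routed to it matters.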
Horizontal Scaling
Service Architecture
| Component | Instances | GPU/CPU | Scaling trigger |
|---|---|---|---|
| Hash matching | 50 pods | CPU | Queue depth |
| Text classifier | 100 pods | CPU (ONNX) | Latency p99 |
| Image classifier | 200 pods | GPU (A10G) | Queue depth |
| Video processor | 300 pods | GPU (A100) | Queue depth |
| Audio STT | 100 pods | GPU (T4) | Queue depth |
| Review queue API | 20 pods | CPU | Request rate |
| Human review UI | 10 pods | CPU | Active moderators |
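A queue-depth scaling trigger can be approximated by asking how many workers are needed to drain the current backlog within a target time. The function below is a hypothetical sketch: the name, the drain-target parameter, and the assumption that each pod processes one item at a time are not from the source.

```python
def desired_replicas(
    queue_depth: int,
    avg_item_ms: int,
    drain_target_s: int,
    min_replicas: int = 1,
    max_replicas: int = 1000,
) -> int:
    """Pods needed to drain the backlog within drain_target_s seconds.

    Assumes each pod handles one item at a time at avg_item_ms per item.
    """
    work_ms = queue_depth * avg_item_ms      # total processing time queued
    budget_ms = drain_target_s * 1000        # time one pod can contribute
    needed = -(-work_ms // budget_ms)        # ceiling division on integers
    return max(min_replicas, min(max_replicas, needed))


if __name__ == "__main__":
    # 60,000 queued images at 200ms each, drained within one minute.
    print(desired_replicas(queue_depth=60_000, avg_item_ms=200, drain_target_s=60))
```

With a backlog of 60,000 images at the table's 200ms average, a one-minute drain target works out to 200 pods, the same order as the image classifier row above.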
Queue-Based Architecture
Upload -> Kafka (partitioned by content_type)
Consumers:
- hash-matching-group: 50 consumers, < 100ms
- text-classification: 100 consumers, < 500ms
- image-classification: 200 consumers, < 2s
- video-processing: 300 consumers, < 120s
Priority queues:
- P0 (CSAM, terrorism): Dedicated consumers, immediate
- P1 (violence, nudity): High priority, < 5 min
- P2 (hate speech, spam): Normal priority, < 30 min
- P3 (misinformation): Low priority, < 4 hours
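The priority levels above can be sketched as a single in-memory heap ordered by priority rank, with FIFO order inside each level. This is a simplification for illustration only: the `ModerationQueue` class and category keys are hypothetical, a real deployment would use separate Kafka topics or consumer groups, and strict priority like this can starve P3 under sustained load.

```python
import heapq
import itertools

# Category -> priority level, following the list above.
PRIORITY_BY_CATEGORY = {
    "csam": "P0",
    "terrorism": "P0",
    "violence": "P1",
    "nudity": "P1",
    "hate_speech": "P2",
    "spam": "P2",
    "misinformation": "P3",
}
RANK = {"P0": 0, "P1": 1, "P2": 2, "P3": 3}


class ModerationQueue:
    """Strict-priority queue: lower rank first, FIFO within a rank."""

    def __init__(self) -> None:
        self._heap = []
        self._seq = itertools.count()  # tie-breaker that preserves arrival order

    def push(self, item_id: str, category: str) -> None:
        rank = RANK[PRIORITY_BY_CATEGORY[category]]
        heapq.heappush(self._heap, (rank, next(self._seq), item_id))

    def pop(self) -> str:
        _, _, item_id = heapq.heappop(self._heap)
        return item_id

    def __len__(self) -> int:
        return len(self._heap)


if __name__ == "__main__":
    queue = ModerationQueue()
    for item_id, category in [
        ("a", "misinformation"),
        ("b", "spam"),
        ("c", "csam"),
        ("d", "spam"),
    ]:
        queue.push(item_id, category)
    print([queue.pop() for _ in range(len(queue))])
```

The sequence counter keeps items of equal priority in arrival order and avoids comparing payloads when ranks tie.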
Human-in-the-Loop Scaling
Moderator Workforce
| Tier | Role | Share of content | Cost/review |
|---|---|---|---|
| T1: ML auto-action | None (auto) | ~90% of content | $0.001 |
| T2: Queue review | Junior moderator | ~8% of content | $0.05 |
| T3: Appeal review | Senior moderator | ~1.5% of content | $0.20 |
| T4: Policy edge case | Policy specialist | ~0.5% of content | $1.00 |
Moderator Wellbeing
Critical for scaling:
- Rotation: max 4 hours on CSAM/violence before mandatory break
- Blurring: hash-match content shown blurred, moderator confirms
- Counseling: weekly access, mandatory quarterly screening
- AI assist: pre-classification reduces exposure to worst content
Cost
| Component | Monthly cost |
|---|---|
| GPU inference (video + image) | $800K |
| CPU inference (text + hash) | $150K |
| Human moderation (15K moderators) | $30M |
| Storage (content + logs) | $200K |
| Kafka + queues | $50K |
| Total (infra only) | ~$1.2M |
| Total (with humans) | ~$31.2M |
High Availability
Degradation Strategy
| Failure | Fallback | Risk |
|---|---|---|
| Image classifier down | Text-only + hash matching | Miss 10-15% image violations |
| Video processor down | Thumbnail + audio only | Miss scene-level violations |
| LLM deep analysis down | Standard ML only | Miss nuanced violations |
| All ML down | Hash matching + manual queue | 30-40% miss rate, queue explodes |
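One way to make the degradation table actionable is to encode it as a lookup that an orchestrator consults when health checks fail. The sketch below is illustrative: the component identifiers, the `fallbacks_for` function, and the definition of "all ML down" as all five listed components being unhealthy are assumptions; the fallback and risk strings are taken from the table.

```python
from typing import FrozenSet, List, Set, Tuple

# failed component -> (fallback, risk), from the degradation table
FALLBACKS = {
    "image_classifier": ("text-only + hash matching", "miss 10-15% image violations"),
    "video_processor": ("thumbnail + audio only", "miss scene-level violations"),
    "llm_deep_analysis": ("standard ML only", "miss nuanced violations"),
}
ALL_ML_DOWN = ("hash matching + manual queue", "30-40% miss rate, queue explodes")

# Assumed set of ML components; hash matching is CPU-only and not included.
ML_COMPONENTS: FrozenSet[str] = frozenset(
    {
        "text_classifier",
        "image_classifier",
        "video_processor",
        "audio_stt",
        "llm_deep_analysis",
    }
)


def fallbacks_for(down: Set[str]) -> List[Tuple[str, str]]:
    """Return the (fallback, risk) pairs that apply to the failed components."""
    if ML_COMPONENTS <= down:
        return [ALL_ML_DOWN]  # total ML outage overrides per-component fallbacks
    return [FALLBACKS[name] for name in sorted(down) if name in FALLBACKS]


if __name__ == "__main__":
    for fallback, risk in fallbacks_for({"video_processor", "llm_deep_analysis"}):
        print(f"fallback: {fallback}; risk: {risk}")
```

Keeping the table in one data structure means the documented risks and the runtime behaviour cannot silently drift apart.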
Misconception: GPU scaling solves video moderation
Video processing is about 100x more expensive than image processing. 500 hours of video per minute (YouTube) x keyframe extraction (1 frame/sec) = 30,000 images/sec from keyframes alone. Without smart sampling (scene change detection, first/last 10 seconds, flagged segments), you need 10x more GPUs. The fix: sample 10-20% of frames and run full analysis only on flagged videos. This cuts GPU cost by 80% at < 2% loss in recall.
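The 30,000 images/sec figure follows directly from the upload rate, and the same arithmetic shows what a 10-20% sampling rate does to frame volume. The sketch below only reproduces that calculation; the function name and parameters are illustrative.

```python
def keyframes_per_second(
    upload_hours_per_minute: float,
    frames_per_video_second: float = 1.0,
    sample_fraction: float = 1.0,
) -> float:
    """Frames arriving per wall-clock second for a given upload rate."""
    # Hours of video uploaded per minute -> seconds of video per minute.
    video_seconds_per_minute = upload_hours_per_minute * 3600
    # Seconds of video uploaded per wall-clock second.
    video_seconds_per_second = video_seconds_per_minute / 60
    return video_seconds_per_second * frames_per_video_second * sample_fraction


if __name__ == "__main__":
    full = keyframes_per_second(500)
    low = keyframes_per_second(500, sample_fraction=0.10)
    high = keyframes_per_second(500, sample_fraction=0.20)
    print(f"all keyframes: {full:,.0f}/sec")
    print(f"10-20% sampling: {low:,.0f} to {high:,.0f}/sec")
```

At 1 frame/sec this gives 30,000 frames/sec unsampled and roughly 3,000 to 6,000 frames/sec with 10-20% sampling, before any full analysis of flagged videos is added back.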
Misconception: moderation is a purely technical problem
The most expensive part is human moderators ($30M/month at Meta). The scaling bottleneck is not GPUs but people: hiring, training (2-4 weeks), burnout (average tenure < 2 years), and wellbeing programs. ML automation cuts the human review load from 100% to ~10%, but 10% of 1B posts/day is still 100M cases for people. Without ML it would be 1B cases, which is physically impossible.
Interview Section
Question: "How would you scale moderation for YouTube (500 hours of video per minute)?"
Weak answer: "More GPUs for video classification."
Strong answer: "Three tiers. (1) Fast path (< 100ms): perceptual hash matching (PhotoDNA, GIFCT) catches 90%+ of known violations before publication. (2) Standard path (async, < 5 min): smart sampling, meaning 10-20% of keyframes chosen by scene changes, plus thumbnail + audio STT -> text classifier. This cuts GPU cost by 80%. (3) Deep path (5-10% of traffic): full video analysis + LLM for borderline cases. Scaling: queue-based architecture with priorities (CSAM = P0, immediate; hate speech = P2, < 30 min). Human-in-the-loop: ML auto-actions 90%, junior review 8%, senior appeal 1.5%, policy specialist 0.5%. The bottleneck is not GPUs but the moderator workforce ($30M/month at Meta, burnout, wellbeing). Total infra: $1.2M/month; total with humans: $31M/month."