LLM Watermarks: KGW, SynthID, AI-Text Detection

Prerequisites: LLM Security | LLM Guardrails

Google DeepMind has already deployed SynthID in Gemini -- the first production watermarking system, which detects AI text at a 0.1% FPR with minimal quality impact (+2% perplexity). The EU AI Act (2024) requires AI-generated content to be marked, and the KGW watermark (Kirchenbauer et al., 2023) has a theoretical false positive rate of only about 3×10^-5 at z > 4. But paraphrasing cuts KGW detection to 40%, and translation cuts it to 10%. Watermarking vs detection is a fundamental trade-off between reliability (95%+ with the key) and universality (70-85% without access to the model), and understanding it is critical when designing AI systems.

Key Concepts

Watermarking vs Detection

Approach	Description	Accuracy
Watermarking	Embed a pattern during generation (requires access to the model)	95%+ (with key)
Detection	Classify text as AI/human (no model access)	70-85% (adversarial)

Why: detecting AI text (education, spam), provenance tracking, copyright protection, compliance (EU AI Act 2024).

Landscape 2026

Method	Detectability	Robustness	Quality Impact	Stealth
KGW (Green-Red)	High	Low	Low	Low
SynthID	High	Medium	Low	Medium
Multi-bit	High	High	Medium	Medium
Semantic	Medium	High	Low	High

1. KGW Watermark (Kirchenbauer et al., 2023)

Mechanism

The logit distribution is modified based on a hash of the previous token:

\[\text{logit}'(x_t) = \text{logit}(x_t) + \delta \cdot \mathbb{1}[x_t \in \mathcal{G}]\]

where \(\mathcal{G}\) is the set of "green" tokens (the watermarked subset), re-drawn at every step from an RNG seeded with \(\text{hash}(x_{t-1})\).

Algorithm:

For each generation step:
1. Hash previous token to seed RNG:  seed = hash(prev_token)
2. Partition vocabulary:  green_list = random_subset(vocab, gamma=0.5)
3. Boost green list logits:  logits[green_list] += delta  (typically delta=2.0)
4. Sample from modified distribution

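Result: for a near-uniform next-token distribution, green tokens appear with probability ~gamma*e^delta / (1 - gamma + gamma*e^delta) (about 0.88 for gamma=0.5, delta=2.0) instead of the natural ~gamma (50%); low-entropy steps shift much less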
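
The four steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the reference implementation: the SHA-256 seeding scheme, the names `green_list` and `watermarked_sample`, and the flat toy logits (a stand-in for a high-entropy model step) are all hypothetical.

```python
import hashlib
import numpy as np

def green_list(prev_token: int, key: str, vocab_size: int, gamma: float) -> np.ndarray:
    """Pseudo-random green subset for this step, seeded by (key, previous token)."""
    digest = hashlib.sha256(f"{key}:{prev_token}".encode()).digest()
    rng = np.random.default_rng(int.from_bytes(digest[:8], "big"))
    return rng.permutation(vocab_size)[: int(gamma * vocab_size)]

def watermarked_sample(logits, prev_token, key, rng, gamma=0.5, delta=2.0) -> int:
    """Add delta to the green logits, then sample from the softmax of the result."""
    biased = np.array(logits, dtype=float)
    biased[green_list(prev_token, key, len(biased), gamma)] += delta
    probs = np.exp(biased - biased.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# Toy run: flat logits, so every step is maximally "high entropy".
vocab_size, key = 1000, "secret"
rng = np.random.default_rng(0)
tokens = [0]
for _ in range(2000):
    tokens.append(watermarked_sample(np.zeros(vocab_size), tokens[-1], key, rng))

# Count how many tokens landed in the green list of their own step.
green_hits = sum(
    cur in set(green_list(prev, key, vocab_size, 0.5).tolist())
    for prev, cur in zip(tokens, tokens[1:])
)
green_fraction = green_hits / (len(tokens) - 1)
# Closed form for a uniform distribution: gamma*e^delta / (1 - gamma + gamma*e^delta).
expected = 0.5 * np.exp(2.0) / (0.5 + 0.5 * np.exp(2.0))
print(f"observed {green_fraction:.3f}, uniform-case prediction {expected:.3f}")
```

With a real model the observed green fraction is lower, because low-entropy steps (where one token dominates) are barely moved by a bias of delta.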

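Detection (z-score)

\[z = \frac{|\mathcal{G}| - \gamma T}{\sqrt{T\gamma(1-\gamma)}}\]

where \(|\mathcal{G}|\) is the number of green tokens observed in the text, \(\gamma\) is the expected green ratio, and \(T\) is the number of scored tokens. Threshold: z > 4 -> watermarked (one-sided FPR ≈ 3×10^-5).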

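import hashlib
import numpy as np
from scipy.stats import norm

# Illustrative helpers: this hashing scheme is an assumption, not the reference implementation.
def hash_token(token_id, key):
    digest = hashlib.sha256(f"{key}:{token_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def generate_green_list(vocab_size, gamma, seed):
    rng = np.random.default_rng(seed)
    return set(rng.permutation(vocab_size)[: int(gamma * vocab_size)].tolist())

def detect_kgw_watermark(tokens, key, vocab_size, gamma=0.5, threshold=4.0):
    # tokens: ids produced by the same tokenizer the generator used
    green_count = 0

    for i in range(1, len(tokens)):
        seed = hash_token(tokens[i-1], key)  # must match the seed used at generation time
        green_list = generate_green_list(vocab_size, gamma, seed)
        if tokens[i] in green_list:
            green_count += 1

    n = len(tokens) - 1  # the first token has no predecessor and is not scored
    z_score = (green_count - gamma * n) / (gamma * (1-gamma) * n) ** 0.5
    p_value = norm.sf(z_score)  # one-sided tail, 1 - norm.cdf(z_score) without cancellation

    return {"z_score": z_score, "p_value": p_value, "is_watermarked": z_score > threshold}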

KGW Properties

Property	Value
False positive rate	≈3×10^-5 at z > 4 (theoretical, one-sided)
Min tokens for detection	~200
Quality impact (delta=2)	+5% perplexity
Robustness	Low (editing breaks hash chain)

Parameter	Typical value	Trade-off
Green list ratio (gamma)	0.5	Detection vs quality
Bias strength (delta)	0.5-2.0	Stronger = easier detection, worse quality
Hash context	Previous 1-4 tokens	Longer = harder to reverse-engineer, less robust to edits, more compute
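
To see how the minimum-token figure relates to the z-score formula, the threshold condition can be solved for the text length. The sketch below is only arithmetic on that formula; `min_tokens_for_threshold` is a hypothetical helper, and the green fractions passed to it are illustrative inputs rather than measured values.

```python
import math

def z_score(green_count: int, total: int, gamma: float = 0.5) -> float:
    """z-score of observing green_count green tokens among total scored tokens."""
    return (green_count - gamma * total) / math.sqrt(total * gamma * (1 - gamma))

def min_tokens_for_threshold(green_fraction: float, gamma: float = 0.5, z: float = 4.0) -> int:
    """Smallest T whose expected z-score reaches z at the given green fraction.

    Derived from (p - gamma) * sqrt(T) / sqrt(gamma * (1 - gamma)) >= z.
    """
    if green_fraction <= gamma:
        raise ValueError("green_fraction must exceed gamma for detection to be possible")
    return math.ceil((z * math.sqrt(gamma * (1 - gamma)) / (green_fraction - gamma)) ** 2)

for p in (0.64, 0.75, 0.88):
    print(p, min_tokens_for_threshold(p))
```

A green fraction of 0.64 needs a little over 200 scored tokens to reach z = 4, while 0.75 needs 64; the weaker the effective bias, the longer the text must be.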

2. SynthID (Google DeepMind, 2024)

First production-ready watermark system. Deployed in Gemini and Google AI Studio.

Innovations:
- Tournament-based sampling: candidate tokens are compared in pairs, the winner advances, and the watermark influences the outcomes
- Tunable detectability-quality trade-off
- More robust to paraphrasing than KGW
- Minimal quality impact
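
The tournament idea can be illustrated with a simplified sketch. It assumes binary pseudo-random "g-values" per (token, layer, seed), a seed derived from the key and the previous token, and a plain mean as the detection statistic; the production system differs in its details, and the names `g_value`, `tournament_sample`, and `mean_g` are hypothetical.

```python
import hashlib
import numpy as np

def g_value(token: int, layer: int, seed: str) -> int:
    """Pseudo-random bit for (token, layer, seed); a stand-in for the keyed g-function."""
    return hashlib.sha256(f"{seed}:{layer}:{token}".encode()).digest()[0] & 1

def tournament_sample(probs, seed: str, rng, layers: int = 3) -> int:
    """Draw 2**layers candidates from probs, then run a knockout decided by g-values."""
    candidates = rng.choice(len(probs), size=2 ** layers, p=probs).tolist()
    for layer in range(layers):
        winners = []
        for a, b in zip(candidates[0::2], candidates[1::2]):
            ga, gb = g_value(a, layer, seed), g_value(b, layer, seed)
            if ga == gb:
                winners.append(a if rng.random() < 0.5 else b)  # tie: pick at random
            else:
                winners.append(a if ga > gb else b)
        candidates = winners
    return candidates[0]

def mean_g(tokens, key: str, layers: int = 3) -> float:
    """Average g-value over scored tokens and layers (about 0.5 for unmarked text)."""
    scores = [
        g_value(cur, layer, f"{key}:{prev}")
        for prev, cur in zip(tokens, tokens[1:])
        for layer in range(layers)
    ]
    return sum(scores) / len(scores)

vocab_size, key = 500, "secret"
rng = np.random.default_rng(0)
uniform = np.full(vocab_size, 1 / vocab_size)  # toy model: flat next-token distribution

marked = [0]
for _ in range(1000):
    marked.append(tournament_sample(uniform, f"{key}:{marked[-1]}", rng))
plain = rng.integers(vocab_size, size=1001).tolist()

print(f"marked: {mean_g(marked, key):.3f}, plain: {mean_g(plain, key):.3f}")
```

Because every candidate is drawn from the model's own distribution, the tournament only chooses among tokens the model was already willing to emit, which is one intuition for the small quality impact.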

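Property	KGW	SynthID
True positive rate	99%	95%
False positive rate	1%	0.1%
Min tokens	200	100
Perplexity increase	+5%	+2%
Paraphrase robustness	40% retained	70% retained

Note: these rates are indicative and depend on the operating threshold. A 1% KGW false positive rate corresponds to a threshold near z ≈ 2.3, not to z > 4, where the theoretical rate is about 3×10^-5.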

3. Multi-bit Watermarking

Instead of a binary watermark, embed useful information (model ID, timestamp, user ID):

\[\text{Payload} = \{m_1, m_2, \ldots, m_k\} \in \{0,1\}^k\]

Capacity:

Method	Bits per 1000 tokens	Robustness
KGW (binary)	1	Low
Multi-bit KGW	32	Low
Punctuation-based	100	Medium
SynthID	64	High

Target properties (design goals rather than guarantees): resistance to paraphrasing, survival under translation, tolerance to cropping.
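
One simple way to carry a payload is to let each generation step encode a single payload bit. The sketch below is a toy "hard" scheme chosen for clarity, not any of the methods in the table: the previous token selects which bit a step carries and how the vocabulary is split, and tokens are drawn uniformly from the half that matches the bit. All names and the seeding scheme are hypothetical.

```python
import hashlib
import numpy as np

def step_info(prev_token: int, key: str, vocab_size: int, n_bits: int):
    """Per-step bit index and vocabulary split, both derived from (key, previous token)."""
    digest = hashlib.sha256(f"{key}:{prev_token}".encode()).digest()
    rng = np.random.default_rng(int.from_bytes(digest[:8], "big"))
    bit_index = int(rng.integers(n_bits))
    perm = rng.permutation(vocab_size)
    half = vocab_size // 2
    return bit_index, perm[:half], perm[half:]

def embed(payload, length, key, vocab_size, rng):
    """Toy generator: each token comes from the half selected by one payload bit."""
    tokens = [0]
    for _ in range(length):
        bit_index, zeros, ones = step_info(tokens[-1], key, vocab_size, len(payload))
        pool = ones if payload[bit_index] else zeros
        tokens.append(int(rng.choice(pool)))
    return tokens

def decode(tokens, key, vocab_size, n_bits):
    """Majority vote per bit position over all scored tokens."""
    votes = np.zeros((n_bits, 2), dtype=int)
    for prev, cur in zip(tokens, tokens[1:]):
        bit_index, zeros, ones = step_info(prev, key, vocab_size, n_bits)
        votes[bit_index, int(cur in set(ones.tolist()))] += 1
    return [int(v[1] > v[0]) for v in votes]

payload = [1, 0, 1, 1, 0, 0, 1, 0]
text = embed(payload, 400, "secret", 1000, np.random.default_rng(0))
recovered = decode(text, "secret", 1000, len(payload))
print(recovered == payload)
```

A practical scheme would bias logits softly instead of restricting the vocabulary, and would add error correction so that edited or cropped text still decodes.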

4. Detection Methods

Zero-shot Detection (without access to the generating model)

Method	Principle
GPTZero	Perplexity + burstiness
DetectGPT	Probability curvature
GLTR	Token probability visualization
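
The "perplexity + burstiness" principle can be sketched as follows. This is written in the spirit of the heuristic, not as any product's actual formula: it assumes per-token natural-log probabilities have already been obtained from some scoring language model, grouped by sentence, and the function names are hypothetical.

```python
import math
from statistics import mean, pstdev

def perplexity(token_logprobs):
    """exp of the negative mean log-probability (natural log) of the tokens."""
    return math.exp(-mean(token_logprobs))

def burstiness(sentences_logprobs):
    """Spread of per-sentence perplexity across the document."""
    return pstdev([perplexity(s) for s in sentences_logprobs])

# Toy inputs: one list of token log-probabilities per sentence.
even_text = [[math.log(0.5)] * 6, [math.log(0.5)] * 8, [math.log(0.5)] * 5]
varied_text = [[math.log(0.5)] * 6, [math.log(0.05)] * 8, [math.log(0.2)] * 5]

for name, doc in (("even", even_text), ("varied", varied_text)):
    all_tokens = [lp for sentence in doc for lp in sentence]
    print(name, round(perplexity(all_tokens), 2), round(burstiness(doc), 2))
```

The heuristic treats low perplexity combined with low burstiness as a sign of machine text; any decision threshold on these two numbers has to be calibrated on data and is deliberately left out here.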

Supervised Detection

A classifier is trained on AI/human text pairs. Challenge: distribution shift degrades performance on new models (a classifier trained on GPT-3.5 output fails on GPT-4o).

Watermark Detection (with the key)

z-score test (see the formula above). Threshold z > 4 -- high confidence.

Detection Accuracy

Detector type	False Positive Rate	False Negative Rate
Statistical (KGW)	1-5%	5-15%
ML classifiers	2-10%	5-20%
Zero-shot	5-15%	10-30%

5. Attack Vectors

Attack Patterns

Attack	Effectiveness	Detection retained
Minor edits	Low	KGW 70%, SynthID 85%
Paraphrasing (light)	Medium	30-50%
Paraphrasing (heavy)	High	10-30%
Translation	Very High	KGW 10%, SynthID 30%
Token substitution	Medium	Depends on method

Defense Mechanisms

Attack	Defense	Effectiveness
Paraphrasing	Semantic watermarking	Medium
Substitution	High-entropy watermark	High
Insertion/Deletion	Local watermarking	Medium
Translation	Cross-lingual watermark	Low
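
Why token-level edits hurt a KGW-style watermark so quickly can be approximated with a back-of-the-envelope model. The assumption, which is a simplification and not a result from the sources, is that each token is replaced independently with probability `edit_rate`, and that with a one-token hash context a token keeps its green bias only if both it and its predecessor survive.

```python
import math

def expected_z_after_substitution(green_fraction, total, edit_rate, gamma=0.5):
    """Expected z-score after random token substitution (simplified model).

    A token stays biased only if token i-1 and token i are both unedited,
    which happens with probability (1 - edit_rate)**2; otherwise it is green
    with the chance probability gamma.
    """
    keep = (1 - edit_rate) ** 2
    attacked_fraction = gamma + (green_fraction - gamma) * keep
    return (attacked_fraction - gamma) * math.sqrt(total) / math.sqrt(gamma * (1 - gamma))

for rate in (0.0, 0.1, 0.3, 0.5):
    print(rate, round(expected_z_after_substitution(0.75, 200, rate), 2))
```

The quadratic factor is the point: replacing half of the tokens leaves only a quarter of the original signal, and longer hash contexts make the exponent larger.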

6. Code Watermarking

Differences from text: code must remain executable, its structure is deterministic, and semantics must be preserved under refactoring.

Approaches:
- Variable naming patterns
- Comment encoding
- AST structure modification
- Dead code insertion

Challenges: the code must still compile or run, formatting tools normalize code, and optimization passes can strip inserted constructs.

7. Zero-Knowledge Proofs

Problem: prove watermark detection without revealing the watermarking key.

Protocol:
1. Prover (model owner) commits to watermark presence
2. Verifier sends a random challenge
3. Prover responds without revealing the key
4. Verifier confirms the proof
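
Only the commitment in step 1 is simple enough to show in a few lines. The sketch below is a plain hash commitment that binds the prover to a key without disclosing it; it is not a zero-knowledge proof, and steps 2-4 require a real proof system that is out of scope here. The function names are hypothetical.

```python
import hashlib
import hmac
import secrets

def commit(secret_key: bytes):
    """Return (commitment, nonce); the commitment is published, the nonce stays private."""
    nonce = secrets.token_bytes(32)  # random blinding value so the key cannot be guessed by hashing
    commitment = hashlib.sha256(nonce + secret_key).hexdigest()
    return commitment, nonce

def opens_to(commitment: str, nonce: bytes, secret_key: bytes) -> bool:
    """Check an opening, e.g. by an auditor who is allowed to see the key."""
    expected = hashlib.sha256(nonce + secret_key).hexdigest()
    return hmac.compare_digest(commitment, expected)

commitment, nonce = commit(b"watermark-key")
print(opens_to(commitment, nonce, b"watermark-key"))
print(opens_to(commitment, nonce, b"some-other-key"))
```

The commitment fixes the key in advance, so the prover cannot later switch to a different key that happens to "detect" a watermark in someone else's text.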

Benefits: key remains secret, forgery prevented, ownership provable.

8. Tools & Frameworks

Open-Source

Tool	Type	Description
UMD Watermark Detector	Research	KGW reference implementation
GPTZero	Commercial	Perplexity + burstiness
DetectGPT	Research	Zero-shot, no training
GLTR	Research	Visualization tool
WaterBench	Benchmark	Evaluation framework

Commercial

Service	Focus
Originality.ai	Plagiarism + AI detection
Copyleaks	Enterprise, LMS integration
Turnitin AI	Academic integrity
Winston AI	Content marketing

Production Adoption (2026)

Company	Model	Watermark
Google	Gemini	SynthID (deployed)
OpenAI	GPT-4+	Not disclosed
Anthropic	Claude	Not disclosed
Meta	LLaMA	Research only

9. Regulatory Landscape

Region	Status (2026)
EU	AI Act: mandatory disclosure of AI content, watermarking/detection capabilities
US	State-level regulations emerging
China	Mandatory labeling of AI content
Global	UNESCO guidelines for AI ethics

Ethical Concerns

Concern	Description
False accusations	Human text flagged as AI harms reputations
Surveillance	Watermarking enables content tracking
Chilling effects	May discourage legitimate AI use
Bias in detection	Detectors biased against certain writing styles

Interview Prep

Q: "How does LLM watermarking work?"

KGW (Kirchenbauer et al., 2023): partition the vocabulary into green/red tokens via a hash of the previous token. Boost green token logits by delta. Detection: z-score test -- count green tokens vs expected. z > 4 = watermarked (FPR ≈ 3×10^-5). SynthID (Google): tournament-based sampling, more robust, deployed in Gemini. Trade-off: stronger watermark = easier detection, but worse quality (+5% perplexity at delta=2).

Q: "Watermarking vs detection -- what is the difference?"

Watermarking: embed a pattern during generation; requires access to the model; accuracy 95%+. Detection: classify text post hoc without access; accuracy 70-85%. Watermarking is more reliable but requires cooperation from the provider. Detection is universal but less accurate and vulnerable to adversarial attacks.

Q: "Which attacks break a watermark?"

Paraphrasing (detection drops from 95% to 30-50%), translation (10-30% retained), token substitution. KGW is especially vulnerable (the hash chain breaks). SynthID is more robust (70% retained after paraphrase, 30% after translation). Defenses: semantic watermarking, cross-lingual watermarks, but there is no perfect solution.

Q: "What is multi-bit watermarking?"

Instead of a binary "watermarked/not" signal, embed a payload (model ID, timestamp, user ID). SynthID: 64 bits per 1000 tokens. Useful for provenance tracking and copyright attribution. Formula: payload \(\in \{0,1\}^k\). Designed to be robust to paraphrasing and translation.

Q: "Design a watermarking system for an LLM API."

(1) Generation: KGW/SynthID watermark injection at the logit level. (2) Detection API: z-score test with a secret key. (3) Multi-bit payload: model version + timestamp + user ID. (4) Monitoring: false positive tracking, quality metrics (perplexity delta < 5%). (5) Compliance: EU AI Act disclosure. (6) Challenges: robustness to paraphrasing, min 100-200 tokens for reliable detection.

Key Numbers

Fact	Value
KGW FPR at z>4	≈3×10^-5 (theoretical)
KGW min tokens	~200
SynthID min tokens	~100
SynthID perplexity increase	+2%
KGW perplexity increase (delta=2)	+5%
Paraphrase: KGW detection retained	40%
Paraphrase: SynthID detection retained	70%
Translation: KGW detection retained	10%
Multi-bit SynthID capacity	64 bits / 1000 tokens
SynthID production deployment	Gemini (Google)

Formulas

KGW Watermark (logit modification)

\[\text{logit}'(x) = \text{logit}(x) + \delta \cdot \mathbb{1}[x \in \mathcal{G}]\]

Detection z-score

\[z = \frac{|\mathcal{G}| - \gamma T}{\sqrt{T\gamma(1-\gamma)}}\]

Quality Preservation

\[D_{KL}(P_{\text{watermarked}} \| P_{\text{original}}) < \epsilon\]
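
For a single generation step this divergence can be computed directly from the logits. The sketch below uses made-up toy logits and a fixed green set, and compares a flat (high-entropy) step with a peaked (low-entropy) one; `watermark_kl` is a hypothetical helper name.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax."""
    shifted = np.asarray(logits, dtype=float) - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

def watermark_kl(logits, green_ids, delta):
    """KL(P_watermarked || P_original) in nats for one generation step."""
    logits = np.asarray(logits, dtype=float)
    biased = logits.copy()
    biased[green_ids] += delta  # the KGW logit boost
    p_wm, p_orig = softmax(biased), softmax(logits)
    return float(np.sum(p_wm * np.log(p_wm / p_orig)))

green = [1, 2]
flat = [0.0, 0.0, 0.0, 0.0]      # model is undecided
peaked = [10.0, 0.0, 0.0, 0.0]   # model strongly prefers token 0 (a red token here)

for delta in (0.5, 2.0):
    print(delta, round(watermark_kl(flat, green, delta), 4), round(watermark_kl(peaked, green, delta), 4))
```

The peaked step is almost unchanged by the bias, which is why a logit-level watermark costs little quality where the model is confident and also leaves little signal there.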

Detection Metrics

\[\text{TPR} = \frac{TP}{TP+FN}, \quad \text{FPR} = \frac{FP}{FP+TN}\]
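
These two rates do not by themselves say how trustworthy a positive flag is; that also depends on how much of the screened text is AI-written. The sketch below computes both rates from confusion counts and then the precision at a given base rate. The counts and the base rate are made-up inputs for illustration.

```python
def rates(tp: int, fp: int, tn: int, fn: int) -> dict:
    """True and false positive rates from confusion-matrix counts."""
    return {"tpr": tp / (tp + fn), "fpr": fp / (fp + tn)}

def precision_at_base_rate(tpr: float, fpr: float, ai_share: float) -> float:
    """Share of flagged texts that are really AI-written when ai_share of all texts are."""
    flagged_ai = tpr * ai_share
    flagged_human = fpr * (1 - ai_share)
    return flagged_ai / (flagged_ai + flagged_human)

r = rates(tp=90, fp=5, tn=95, fn=10)
print(r)
print(round(precision_at_base_rate(tpr=0.8, fpr=0.1, ai_share=0.2), 3))
```

With a 10% false positive rate and only one text in five being AI-written, a third of all flags land on human authors, which is the arithmetic behind the false-accusation concern.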

Misconception: a watermark cannot be removed

The KGW watermark loses 60% of its detectability after paraphrasing and 90% after translation into another language. With a one-token hash context, every edited token invalidates the green-list check for itself and for the token that follows. Even SynthID (more robust) retains only 70% detection after paraphrasing and 30% after translation. Watermarking is a deterrent, not protection, much like DRM in music.

Misconception: zero-shot detectors (GPTZero, DetectGPT) are reliable

Zero-shot detectors have a false positive rate of 5-15% and a false negative rate of 10-30%. Supervised classifiers do not escape the problem either: models trained on GPT-3.5 text degrade on GPT-4o because of distribution shift. Detectors are systematically biased against certain writing styles -- non-native English speakers are flagged as AI more often. No commercial detector offers guarantees sufficient for academic sanctions.

Misconception: watermarking does not affect text quality

KGW with delta=2.0 raises perplexity by 5%; SynthID raises it by 2%. Multi-bit watermarking (embedding a payload) has a stronger effect still. On short texts (<200 tokens) KGW cannot be detected reliably at all. There is a fundamental trade-off: the stronger the watermark, the easier the detection, but the worse the generation quality.

Interview Questions

Q: How does the KGW watermark work, and what are its limitations?

Red flag: "The model adds hidden text to the output"

Strong answer: "KGW modifies the logit distribution: a hash of the previous token splits the vocabulary into green/red lists, and green tokens get a delta boost to their logits. Detection uses a z-score: z = (|G| - gamma*T) / sqrt(T*gamma*(1-gamma)). At z>4 the FPR is ≈ 3×10^-5. Limitations: (1) at least ~200 tokens are needed; (2) the hash chain breaks under editing -- paraphrasing cuts detection to 40%; (3) translation cuts it to 10%; (4) delta=2 raises perplexity by 5%."

Q: Watermarking vs post-hoc detection -- when to use which?

Red flag: "Watermarking is better at everything"

Strong answer: "Watermarking (KGW/SynthID) -- accuracy 95%+, but it requires cooperation from the model provider and access to the generation pipeline. Post-hoc detection (GPTZero/DetectGPT) -- universal (works with any model), accuracy 70-85%, but a high FPR (5-15%) and vulnerability to distribution shift. For an API provider -- watermarking (it controls generation). For education/media -- detection (no access to the model). For compliance (EU AI Act) -- the transparency obligations require generated content to be marked in a machine-readable, detectable way, and watermarking is one way to meet them."

Q: Design a watermarking system for an LLM API serving 1M requests/day.

Red flag: "We add a KGW watermark to every request"

Strong answer: "Four components: (1) Generation -- SynthID for robustness (70% retained after paraphrase vs 40% for KGW), tournament-based sampling with tunable strength; (2) Multi-bit payload -- model version + timestamp + user hash (64 bits per 1000 tokens) for provenance tracking; (3) Detection API -- keyed statistical test (a z-score test for KGW-style schemes) behind a verification endpoint; (4) Monitoring -- dashboard with FPR tracking (<0.1%), quality metrics (perplexity delta <2%), latency overhead (<5ms). For short responses (<100 tokens) -- fall back to metadata-based attribution."
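
The routing and monitoring parts of such a design can be sketched as a small policy object. Everything here is hypothetical scaffolding: the class and function names are invented for illustration, and the default budgets simply mirror the numbers quoted in the answer above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WatermarkPolicy:
    """Hypothetical budgets mirroring the numbers in the answer above."""
    min_tokens: int = 100               # below this, detection is treated as unreliable
    max_fpr: float = 0.001              # 0.1% false positive budget
    max_perplexity_delta: float = 0.02  # 2% quality budget
    max_latency_ms: float = 5.0         # per-request overhead budget

def attribution_path(n_tokens: int, policy: WatermarkPolicy) -> str:
    """Route short responses to metadata-based attribution instead of the watermark."""
    return "watermark" if n_tokens >= policy.min_tokens else "metadata"

def violated_budgets(metrics: dict, policy: WatermarkPolicy) -> list:
    """Names of monitored metrics that exceed their budget."""
    budgets = {
        "fpr": policy.max_fpr,
        "perplexity_delta": policy.max_perplexity_delta,
        "latency_ms": policy.max_latency_ms,
    }
    return [name for name, limit in budgets.items() if metrics.get(name, 0.0) > limit]

policy = WatermarkPolicy()
print(attribution_path(40, policy))
print(violated_budgets({"fpr": 0.0004, "perplexity_delta": 0.031, "latency_ms": 3.2}, policy))
```

Keeping the budgets in one immutable object makes it easy to alert on the same thresholds that the design commits to.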

Sources

Kirchenbauer et al. -- "A Watermark for Large Language Models" (2023)
Google DeepMind -- "SynthID: Scalable Watermarking for AI-Generated Content" (2024)
"Provably Robust Multi-bit Watermarking" (2024)
ICLR 2025 Workshop -- Watermarking in Generative AI
"A Survey of Text Watermarking in the Era of LLMs" (Liu et al., 2024, 179 citations)
MIT Press -- "A Survey on LLM-Generated Text Detection" (2025)
Nethermind -- "Proving AI Authorship Without Revealing the Watermark" (Jan 2026)
arXiv -- "Marking Code Without Breaking It: Code Watermarking" (2502.18851)
Hastewire -- "LLM Watermarking: The Complete Guide" (2026)