
Cheat Sheet: Alignment and RLHF

~7 minute read

Prerequisites: Master preparation guide | Deep learning cheat sheet

Type: synthesis / interview cheat sheet. Date: February 2026. Synthesis of: RLHF, DPO, GRPO, RLAIF, Constitutional AI, alignment methods

Alignment is the bridge between "the model predicts tokens" and "the model is helpful, safe, and follows instructions". Over 2023-2026, three generations of methods succeeded one another: RLHF/PPO (InstructGPT; $20-100 per 1000 annotation pairs), DPO (2-3x faster, no reward model), and GRPO (DeepSeek-R1; ~50% less memory than PPO). In parallel, RLAIF cut annotation costs by ~70%, and Constitutional AI (Anthropic) enabled alignment without any human preference pairs at all. This cheat sheet covers the formulas, code, and a decision matrix for all five methods.


Quick Reference: Key Numbers

| Metric | Value | Context |
|---|---|---|
| RLHF human annotation cost | $20-100 per 1000 pairs | Expensive, scales poorly |
| RLAIF cost reduction | ~70% vs RLHF | AI feedback cheaper |
| DPO training speedup | 2-3× vs PPO | No reward model needed |
| KL penalty coefficient | \(\beta \approx 0.01-0.1\) | Prevents policy drift |
| GRPO group size | 4-16 samples per prompt | DeepSeek-R1 default |
| DeepSeek-R1 GRPO memory saving | ~50% less vs PPO | No value function |

1. LLM Alignment Overview

Why Alignment?

Pre-trained LLMs predict the next token, but don't necessarily:
- Follow instructions
- Avoid harmful content
- Be truthful and helpful
- Match human preferences

Alignment = shaping model behavior to match human values/intent

Alignment Pipeline Stages

1. Pre-training     → Next token prediction (trillions of tokens)
2. SFT              → Supervised Fine-Tuning on demonstrations
3. Alignment        → RLHF / DPO / RLAIF / CAI
4. Deployment       → Inference with safety filters

2. RLHF (Reinforcement Learning from Human Feedback)

Core Idea

Train a reward model from human preferences, then optimize policy with RL.

RLHF Pipeline (4 phases)

graph TD
    A["Phase 1: SFT<br/>Pre-trained model -> Fine-tune<br/>on quality demonstrations"] --> B["Phase 2: Reward Model Training<br/>Generate outputs -> Humans rank<br/>pairs -> Train RM"]
    B --> C["Phase 3: RL Fine-Tuning<br/>Policy generates -> RM scores<br/>-> PPO update"]
    C --> D["Phase 4: Iteration<br/>Collect more feedback -><br/>Retrain RM -> Continue RL"]
    D -.->|optional| B

    style A fill:#e8eaf6,stroke:#3f51b5
    style B fill:#fff3e0,stroke:#ef6c00
    style C fill:#e8f5e9,stroke:#4caf50
    style D fill:#f3e5f5,stroke:#9c27b0

Bradley-Terry Model

Given preferred output \(y_w\) over \(y_l\):

\[P(y_w > y_l) = \frac{\exp(r(x, y_w))}{\exp(r(x, y_w)) + \exp(r(x, y_l))} = \sigma(r(x, y_w) - r(x, y_l))\]

Reward Model Loss:

\[\mathcal{L}_{RM} = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma(r_\phi(x, y_w) - r_\phi(x, y_l)) \right]\]
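A minimal PyTorch sketch of this pairwise loss, assuming `chosen_rewards` and `rejected_rewards` are scalar reward-model outputs for each preference pair (the function name is illustrative):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_rewards, rejected_rewards):
    """Pairwise Bradley-Terry loss: -log sigma(r_w - r_l), averaged over pairs."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

The loss shrinks as the reward model scores the preferred output above the rejected one, and grows when the ordering is reversed.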

PPO Objective with KL Penalty

\[\mathcal{L}_{PPO} = \mathbb{E}_x \mathbb{E}_{y \sim \pi_\theta} \left[ r_\phi(x, y) - \beta \cdot \text{KL}(\pi_\theta(y|x) \| \pi_{ref}(y|x)) \right]\]

Key components:
- \(r_\phi(x, y)\): reward model score
- \(\beta\): KL penalty coefficient (prevents drift)
- \(\pi_{ref}\): reference (SFT) model

PPO Clip Objective

\[\mathcal{L}_{clip} = \mathbb{E}_t \left[ \min \left( \rho_t A_t, \text{clip}(\rho_t, 1-\epsilon, 1+\epsilon) A_t \right) \right]\]

Where \(\rho_t = \frac{\pi_\theta(a_t|s_t)}{\pi_{old}(a_t|s_t)}\)

Typical values: \(\epsilon = 0.2\)
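The clip objective can be sketched as a short PyTorch function (negated so that minimizing the returned value maximizes the surrogate; per-token log-probs and advantages are assumed precomputed):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """PPO clipped surrogate objective, negated for gradient descent."""
    ratio = torch.exp(logp_new - logp_old)                    # rho_t
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # min() keeps the update pessimistic: clipping caps the incentive
    # to push the ratio far from 1
    return -torch.min(unclipped, clipped).mean()
```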


3. DPO (Direct Preference Optimization)

Core Idea

Skip the reward model! Optimize policy directly from preferences.

DPO Derivation

The optimal policy of the KL-constrained reward maximization objective satisfies:

\[r(x, y) = \beta \log \frac{\pi^*(y|x)}{\pi_{ref}(y|x)} + \beta \log Z(x)\]

Substitute into RM loss:

\[\mathcal{L}_{DPO} = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)} \right) \right]\]

DPO vs RLHF Comparison

| Aspect | RLHF (PPO) | DPO |
|---|---|---|
| Reward model | Required | Not needed |
| Training stability | Complex, unstable | Simple, stable |
| Memory | High (RM + policy + critic) | Lower (policy only) |
| Quality | Slightly better on complex tasks | Comparable on many tasks |
| Speed | Slower | 2-3× faster |
| Best for | Production models | Quick alignment, experiments |

Code: DPO Training

import torch
import torch.nn.functional as F

def dpo_loss(policy, reference, batch, beta=0.1):
    """
    DPO loss implementation.
    batch = {'prompt': [...], 'chosen': [...], 'rejected': [...]}
    policy/reference expose log_prob(prompts, responses) returning
    summed per-sequence log-probabilities.
    """
    # Sequence log-probabilities under the policy and the frozen reference
    pi_chosen = policy.log_prob(batch['prompt'], batch['chosen'])
    pi_rejected = policy.log_prob(batch['prompt'], batch['rejected'])
    with torch.no_grad():
        ref_chosen = reference.log_prob(batch['prompt'], batch['chosen'])
        ref_rejected = reference.log_prob(batch['prompt'], batch['rejected'])

    # Implicit reward log-ratios
    log_ratio_chosen = pi_chosen - ref_chosen
    log_ratio_rejected = pi_rejected - ref_rejected

    # DPO objective: -log sigma(beta * margin)
    logits = beta * (log_ratio_chosen - log_ratio_rejected)
    loss = -F.logsigmoid(logits).mean()

    return loss

4. GRPO (Group Relative Policy Optimization)

Core Idea (DeepSeek-R1, 2025)

Generate multiple completions per prompt and compute relative advantages within each group. No value function needed.

GRPO Algorithm

For each prompt x:
    1. Generate G outputs: {y_1, y_2, ..., y_G} from π_θ
    2. Score each: {r_1, r_2, ..., r_G} with reward model
    3. Compute group-relative advantages:
       A_i = (r_i - mean(r)) / std(r)
    4. Update policy to maximize relative advantage

GRPO Objective

\[\mathcal{L}_{GRPO} = -\mathbb{E} \left[ \frac{1}{G} \sum_{i=1}^{G} \left( \min(\rho_i \cdot A_i, \text{clip}(\rho_i, 1-\epsilon, 1+\epsilon) \cdot A_i) - \beta \cdot \text{KL} \right) \right]\]

Where:
- \(G\) = group size (4-16 samples)
- \(A_i = \frac{r_i - \text{mean}(r)}{\text{std}(r)}\) = normalized advantage
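Putting the pieces together, a hedged sketch of the objective (sequence-level log-probs and a k3-style KL estimator; real implementations differ in token-level details):

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, eps=0.2, beta=0.04):
    """
    rewards and logp_*: [batch, G] tensors (sequence-level log-probs).
    Group-relative advantages + clipped surrogate + KL penalty.
    """
    # Normalize rewards within each group of G samples
    adv = (rewards - rewards.mean(dim=-1, keepdim=True)) / (
        rewards.std(dim=-1, keepdim=True) + 1e-8
    )
    ratio = torch.exp(logp_new - logp_old)  # rho_i
    surrogate = torch.min(
        ratio * adv,
        torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv,
    )
    # Low-variance KL estimate against the reference policy
    log_r = logp_ref - logp_new
    kl = torch.exp(log_r) - log_r - 1.0
    return -(surrogate - beta * kl).mean()
```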

GRPO Advantages

| Aspect | PPO | GRPO |
|---|---|---|
| Value function | Required | Not needed |
| Memory | High | ~50% less |
| Variance | Lower (with critic) | Higher (group-based) |
| Stability | Sensitive to hyperparameters | More stable |
| Used by | InstructGPT, Claude | DeepSeek-R1, Qwen |

Code: GRPO Advantage Computation

import torch

def compute_grpo_advantages(rewards):
    """
    Compute group-relative advantages.
    rewards: [batch_size, group_size] tensor of reward-model scores.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True) + 1e-8  # avoid division by zero
    advantages = (rewards - mean) / std
    return advantages

5. RLAIF (RL from AI Feedback)

Core Idea

Replace human labelers with an AI judge (often a stronger model).

RLAIF Pipeline

1. Generate outputs from policy
2. Strong LLM (e.g., GPT-4) judges preferences
3. Train reward model on AI preferences
4. RL fine-tuning as in RLHF
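Step 2 above can be sketched as prompt construction plus verdict parsing; the actual judge call is omitted (any strong LLM API would stand in for it), and these helper names are illustrative:

```python
def build_judge_prompt(prompt, response_a, response_b):
    """Pairwise comparison prompt for an AI judge (RLAIF)."""
    return (
        "Which response to the user prompt is more helpful and harmless?\n"
        f"Prompt: {prompt}\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        "Answer with a single letter, A or B."
    )

def parse_preference(judge_output):
    """Map the judge's raw answer to a (chosen, rejected) label."""
    verdict = judge_output.strip().upper()
    if verdict.startswith("A"):
        return ("A", "B")
    if verdict.startswith("B"):
        return ("B", "A")
    return None  # unparseable verdict: drop the pair
```

In practice, position bias is mitigated by judging each pair twice with A/B swapped and keeping only consistent verdicts.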

Constitutional AI (Anthropic)

Critique → Revision → Preference Learning

Constitution = Set of principles (e.g., "Be helpful, harmless, honest")

1. Generate response
2. AI critiques against constitution
3. AI revises response
4. Train preference model on (original, revised) pairs
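The critique → revision loop above can be sketched generically; `generate`, `critique`, and `revise` are hypothetical stand-ins for LLM calls, not a real API:

```python
def constitutional_revision(prompt, generate, critique, revise,
                            constitution, max_rounds=2):
    """
    CAI self-improvement loop sketch: generate a response, critique it
    against each principle, revise; returns the (original, revised)
    pair used for preference training.
    """
    original = generate(prompt)
    response = original
    for _ in range(max_rounds):
        critiques = [critique(response, p) for p in constitution]
        issues = [c for c in critiques if c]  # empty critique = no issue found
        if not issues:
            break
        response = revise(response, issues)
    return original, response
```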

RLAIF vs RLHF

| Aspect | RLHF | RLAIF |
|---|---|---|
| Labeler | Human | AI (LLM) |
| Cost | $20-100 per 1K pairs | ~$1-5 per 1K pairs |
| Speed | Days-weeks | Hours |
| Scalability | Limited | Unlimited |
| Quality | Gold standard | 95-99% of RLHF |
| Bias risk | Human bias | AI bias |

2025-2026 Trend

RLAIF has become a default method in the RLHF literature due to its dramatic cost advantage. Most production systems now use a hybrid: RLAIF for scale plus RLHF for quality validation.


6. Alignment Method Comparison

Decision Matrix

| Method | Complexity | Cost | Quality | Best For |
|---|---|---|---|---|
| RLHF (PPO) | High | High | Best | Production models |
| DPO | Low | Medium | Good | Quick alignment, research |
| GRPO | Medium | Medium | Excellent | Reasoning models |
| RLAIF | Medium | Low | Good | Scale, iteration |
| Constitutional AI | Medium | Low | Good | Safety-critical apps |

2026 Recommendations

| Scenario | Recommended Method |
|---|---|
| Research / experiments | DPO |
| Production chat model | RLHF (PPO or GRPO) |
| Reasoning model (like o1) | GRPO |
| Safety-critical | Constitutional AI + RLHF |
| Cost-constrained | RLAIF + DPO |
| Rapid iteration | RLAIF |

7. Alignment Tax (Performance Trade-offs)

The Alignment Trade-off

\[\text{Capability} \leftrightarrow \text{Alignment}\]

Key findings:
- Alignment typically reduces raw capability slightly
- But improves helpfulness and safety significantly
- "Alignment tax" = ~2-5% on raw benchmarks
- Net benefit on user-facing metrics

Mitigation Strategies

  1. Better SFT data — Reduces alignment tax
  2. Iterative RLHF — Multiple rounds, smaller steps
  3. Constitutional AI — Principles-based, less human feedback
  4. Hybrid approaches — RLHF + RLAIF

8. Safety & Alignment Challenges

Known Issues

| Issue | Description | Mitigation |
|---|---|---|
| Reward hacking | Model exploits RM quirks | KL penalty, adversarial testing |
| Distribution shift | RM fails on new domains | Iterative collection |
| Sycophancy | Model agrees with user over truth | Diversity in feedback |
| Deception | Model appears aligned but isn't | Red teaming, evals |
| Catastrophic forgetting | Alignment erodes capabilities | Replay, interleaved training |

Red Teaming

Adversarial prompts → Model outputs → Safety evaluation

Types:
- Prompt injection attempts
- Jailbreak techniques
- Edge case exploration
- Harmful request variations

Common Misconceptions

Misconception: "The KL penalty is just regularization; drop it for better quality"

Without the KL penalty (\(\beta \approx 0.01-0.1\)) the model quickly finds reward hacking: it generates answers that maximize the reward model's score but are meaningless to a human. This is the main source of RLHF instability. The KL penalty keeps the policy close to the reference model and is a mandatory component.

Misconception: "RLAIF delivers 95-99% of RLHF quality, so humans can be replaced entirely"

95-99% is an average across benchmarks. On edge cases (safety, cultural sensitivity, humor) AI feedback errs systematically. The production approach: RLAIF for 90% of the data (scale), RLHF for 10% validation pairs (quality on edge cases). Fully replacing humans leads to AI bias amplification.

Misconception: "A 2-5% alignment tax is a negligible price"

2-5% on benchmarks can mean a 10-15% loss on specific downstream tasks. Mitigation: (1) better SFT data reduces the tax, (2) iterative RLHF with small steps, (3) Constitutional AI for principles-based alignment. Monitor task-specific metrics, not just general benchmarks.


9. Interview Questions

Question: What is RLHF?

❌ "It's when the model learns from human feedback"

✅ "A four-phase training paradigm: (1) SFT on quality demonstrations, (2) training a reward model on human preference pairs via the Bradley-Terry model: P(y_w > y_l) = sigma(r_w - r_l), (3) PPO fine-tuning of the policy with a KL penalty: maximize reward - beta * KL(pi || pi_ref), (4) iteration. Cost: $20-100 per 1000 annotation pairs, which is why RLAIF cuts it by ~70%."

Question: Why the KL penalty in RLHF?

❌ "So the model doesn't forget what it learned"

✅ "It prevents reward hacking: the policy finding a shortcut that maximizes the reward model score while generating meaningless answers. beta = 0.01-0.1; with beta too high the model gets stuck, too low and it drifts. The KL penalty anchors the policy to the reference (SFT) model."

Question: DPO vs RLHF: when to use which?

❌ "DPO is better because it's simpler and faster"

✅ "DPO skips the reward model and optimizes the policy directly from preferences: 2-3x faster, more stable, less memory. But it lags on complex reasoning tasks, because there is no exploration via sampling. RLHF (PPO/GRPO) is more complex but better for production reasoning models. Rule of thumb: DPO for experiments and simple alignment, GRPO for reasoning (DeepSeek-R1), PPO for maximum quality."

Question: How does GRPO differ from PPO?

❌ "GRPO is an improved PPO without a value function"

✅ "GRPO generates G=4-16 answers per prompt, computes a group-relative advantage A_i = (r_i - mean) / std, and updates the policy to maximize the clipped advantage minus beta * KL. Key points: no value function → ~50% less memory; no critic → simpler training. Trade-off: higher variance due to group-based estimation. Used by DeepSeek-R1 and Qwen; PPO is used by InstructGPT and Claude."

Question: Design an alignment pipeline for a chatbot

❌ "Collect data, train a reward model, run PPO"

✅ "(1) SFT on 10K+ quality demonstrations, (2) collect 100K+ preference pairs via RLAIF (scale) + 10K RLHF pairs (quality), (3) train a reward model with the Bradley-Terry loss, (4) GRPO fine-tuning with KL penalty beta=0.05, group_size=8, (5) red-team evaluation + automated safety evals (Llama Guard), (6) iterate with continuous data collection + A/B testing on user satisfaction."


10. Formulas Quick Reference

Bradley-Terry Preference

\[P(y_w > y_l) = \sigma(r(x, y_w) - r(x, y_l))\]

Reward Model Loss

\[\mathcal{L}_{RM} = -\mathbb{E} \left[ \log \sigma(r_w - r_l) \right]\]

PPO Objective

\[\mathcal{L}_{PPO} = \mathbb{E} \left[ \min(\rho A, \text{clip}(\rho, 1-\epsilon, 1+\epsilon) A) \right] - \beta \cdot \text{KL}\]

DPO Loss

\[\mathcal{L}_{DPO} = -\mathbb{E} \left[ \log \sigma \left( \beta \log \frac{\pi(y_w|x)}{\pi_{ref}(y_w|x)} - \beta \log \frac{\pi(y_l|x)}{\pi_{ref}(y_l|x)} \right) \right]\]

GRPO Advantage

\[A_i = \frac{r_i - \mu_G}{\sigma_G}\]

11. Sources Synthesized

  1. constitutional-ai-2025-2026.md — Anthropic CAI approach
  2. Intuition Labs RL vs RLHF comparison (Feb 2026)
  3. DataCamp GRPO guide (Jul 2025)
  4. Sebastian Raschka DeepSeek technical tour (Dec 2025)
  5. Cameron Wolfe GRPO deep dive (Nov 2025)
  6. DeepSeek-R1 paper (Jan 2025)
  7. OpenRLHF documentation
  8. HuggingFace TRL documentation
  9. llm-alignment-peft-2026.md — GRPO details, PEFT for alignment (PHASE 5)
  10. ai-safety-report-2026.md — AI safety landscape, regulatory context (PHASE 5)
  11. llm-guardrails-2026.md — NeMo Guardrails, Llama Guard, production safety (PHASE 5)
  12. llm-security-owasp-2026.md — OWASP LLM Top 10, attack vectors (PHASE 5)