Cheat Sheet: Alignment and RLHF¶
~7 minute read
Prerequisites: Preparation master guide | Deep learning cheat sheet
Type: synthesis / interview cheat sheet. Date: February 2026. Synthesis of: RLHF, DPO, GRPO, RLAIF, Constitutional AI, alignment methods
Alignment is the bridge between "the model predicts tokens" and "the model is useful, safe, and follows instructions". Over 2023-2026, three generations of methods succeeded one another: RLHF/PPO (InstructGPT, $20-100 per 1000 annotation pairs), DPO (2-3x faster, no reward model), and GRPO (DeepSeek-R1, 50% less memory than PPO). In parallel, RLAIF cut annotation cost by ~70%, and Constitutional AI (Anthropic) enabled alignment without human preference pairs at all. This cheat sheet covers the formulas, code, and a decision matrix for all 5 methods.
Quick Reference: Key Numbers¶
| Metric | Value | Context |
|---|---|---|
| RLHF human annotation cost | $20-100 per 1000 pairs | Expensive, scales poorly |
| RLAIF cost reduction | ~70% vs RLHF | AI feedback cheaper |
| DPO training speedup | 2-3× vs PPO | No reward model needed |
| KL penalty coefficient | \(\beta \approx 0.01-0.1\) | Prevents policy drift |
| GRPO group size | 4-16 samples per prompt | DeepSeek-R1 default |
| DeepSeek-R1 GRPO speedup | ~50% less memory vs PPO | No value function |
1. LLM Alignment Overview¶
Why Alignment?¶
Pre-trained LLMs predict the next token, but don't necessarily:
- Follow instructions
- Avoid harmful content
- Be truthful and helpful
- Match human preferences
Alignment = shaping model behavior to match human values/intent
Alignment Pipeline Stages¶
1. Pre-training → Next token prediction (trillions of tokens)
2. SFT → Supervised Fine-Tuning on demonstrations
3. Alignment → RLHF / DPO / RLAIF / CAI
4. Deployment → Inference with safety filters
2. RLHF (Reinforcement Learning from Human Feedback)¶
Core Idea¶
Train a reward model from human preferences, then optimize policy with RL.
RLHF Pipeline (4 phases)¶
graph TD
A["Phase 1: SFT<br/>Pre-trained model -> Fine-tune<br/>on quality demonstrations"] --> B["Phase 2: Reward Model Training<br/>Generate outputs -> Humans rank<br/>pairs -> Train RM"]
B --> C["Phase 3: RL Fine-Tuning<br/>Policy generates -> RM scores<br/>-> PPO update"]
C --> D["Phase 4: Iteration<br/>Collect more feedback -><br/>Retrain RM -> Continue RL"]
D -.->|optional| B
style A fill:#e8eaf6,stroke:#3f51b5
style B fill:#fff3e0,stroke:#ef6c00
style C fill:#e8f5e9,stroke:#4caf50
style D fill:#f3e5f5,stroke:#9c27b0
Bradley-Terry Model¶
Given preferred output \(y_w\) over dispreferred \(y_l\) for prompt \(x\):

\[ P(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big) \]

Reward Model Loss:

\[ \mathcal{L}_{RM} = -\mathbb{E}_{(x, y_w, y_l) \sim D}\left[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\right] \]
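A minimal PyTorch sketch of this pairwise loss; `r_chosen` / `r_rejected` are assumed to be scalar scores already produced by a reward-model head for the preferred and dispreferred completions of the same prompt.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen, r_rejected):
    """Bradley-Terry pairwise loss: -log sigmoid(r_w - r_l)."""
    # r_chosen, r_rejected: [batch] reward-model scores for y_w and y_l
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy check: a 0.5-point margin in favor of the chosen answer gives loss ~0.47
loss = reward_model_loss(torch.tensor([1.0]), torch.tensor([0.5]))
```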
PPO Objective with KL Penalty¶
\[ \max_\theta \; \mathbb{E}_{x \sim D,\, y \sim \pi_\theta}\left[ r_\phi(x, y) - \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{ref}(y \mid x)} \right] \]

Key components:
- \(r_\phi(x, y)\): Reward model score
- \(\beta\): KL penalty coefficient (prevents drift)
- \(\pi_{ref}\): Reference (SFT) model
PPO Clip Objective¶
\[ L^{CLIP}(\theta) = \mathbb{E}_t\left[ \min\big( \rho_t A_t,\; \text{clip}(\rho_t, 1 - \epsilon, 1 + \epsilon)\, A_t \big) \right] \]

Where \(\rho_t = \frac{\pi_\theta(a_t|s_t)}{\pi_{old}(a_t|s_t)}\) and \(A_t\) is the advantage estimate.

Typical values: \(\epsilon = 0.2\)
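A hedged sketch combining the two pieces above: a clipped-ratio policy term plus a penalty toward the frozen reference model. Tensor names and shapes are illustrative, not a specific library's API; production RLHF stacks commonly fold the KL term into the per-token reward instead of the loss.

```python
import torch

def ppo_clip_kl_loss(logp_new, logp_old, logp_ref, advantages, beta=0.05, eps=0.2):
    """Clipped PPO policy loss with a KL-style penalty toward the reference model.

    All inputs are per-token tensors of shape [batch, seq_len]; the advantage A_t
    would normally come from a learned value function (e.g. via GAE).
    """
    ratio = torch.exp(logp_new - logp_old)                       # rho_t
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()
    kl_penalty = (logp_new - logp_ref).mean()                    # approx KL(pi_theta || pi_ref)
    return policy_loss + beta * kl_penalty
```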
3. DPO (Direct Preference Optimization)¶
Core Idea¶
Skip the reward model! Optimize policy directly from preferences.
DPO Derivation¶
From Bradley-Terry, the optimal policy of the KL-constrained RL objective satisfies:

\[ r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{ref}(y \mid x)} + \beta \log Z(x) \]

Substitute into the RM loss (the \(\log Z(x)\) terms cancel):

\[ \mathcal{L}_{DPO} = -\mathbb{E}_{(x, y_w, y_l) \sim D}\left[ \log \sigma\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{ref}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{ref}(y_l \mid x)} \right) \right] \]
DPO vs RLHF Comparison¶
| Aspect | RLHF (PPO) | DPO |
|---|---|---|
| Reward model | Required | Not needed |
| Training stability | Complex, unstable | Simple, stable |
| Memory | High (RM + policy + critic) | Lower (policy only) |
| Quality | Slightly better on complex tasks | Comparable on many tasks |
| Speed | Slower | 2-3× faster |
| Best for | Production models | Quick alignment, experiments |
Code: DPO Training¶
import torch.nn.functional as F

def dpo_loss(policy, reference, batch, beta=0.1):
    """
    DPO loss implementation.
    batch = {'prompt': [...], 'chosen': [...], 'rejected': [...]}
    policy / reference expose log_prob(prompt, completion), returning the
    summed log-probability of the completion tokens given the prompt.
    """
    # Sequence log-probabilities under the trained policy and the frozen reference (SFT) model
    pi_chosen = policy.log_prob(batch['prompt'], batch['chosen'])
    pi_rejected = policy.log_prob(batch['prompt'], batch['rejected'])
    ref_chosen = reference.log_prob(batch['prompt'], batch['chosen'])
    ref_rejected = reference.log_prob(batch['prompt'], batch['rejected'])
    # Implicit rewards: log(pi / pi_ref) for chosen and rejected
    log_ratio_chosen = pi_chosen - ref_chosen
    log_ratio_rejected = pi_rejected - ref_rejected
    # DPO objective: -log sigmoid(beta * (chosen margin - rejected margin))
    logits = beta * (log_ratio_chosen - log_ratio_rejected)
    loss = -F.logsigmoid(logits).mean()
    return loss
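The `policy.log_prob(prompt, completion)` interface above is a stand-in. With a Hugging Face-style causal LM it could be approximated by summing token log-probabilities over the completion span; a hedged sketch (the model, tokenizer, and clean-concatenation assumption are all placeholders):

```python
import torch

def sequence_log_prob(model, tokenizer, prompt, completion):
    """Sum of log-probs of the completion tokens given the prompt (sketch).

    For the frozen reference model, wrap the forward pass in torch.no_grad().
    Assumes tokenizing prompt + completion concatenates cleanly, which is only
    approximately true for some tokenizers.
    """
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    logits = model(full_ids).logits                        # [1, seq, vocab]
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)  # position t predicts token t+1
    targets = full_ids[:, 1:]
    token_logp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    start = prompt_ids.shape[1] - 1   # first completion token position among targets
    return token_logp[:, start:].sum(dim=-1)
```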
4. GRPO (Group Relative Policy Optimization)¶
Core Idea (DeepSeek-R1, 2025)¶
Generate multiple completions per prompt, compute relative advantages within group. No value function needed.
GRPO Algorithm¶
For each prompt x:
1. Generate G outputs: {y_1, y_2, ..., y_G} from π_θ
2. Score each: {r_1, r_2, ..., r_G} with reward model
3. Compute group-relative advantages:
A_i = (r_i - mean(r)) / std(r)
4. Update policy to maximize relative advantage
GRPO Objective¶
\[ J_{GRPO}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \min\big( \rho_i A_i,\; \text{clip}(\rho_i, 1 - \epsilon, 1 + \epsilon)\, A_i \big) - \beta\, D_{KL}\big(\pi_\theta \,\|\, \pi_{ref}\big) \right] \]

Where:
- \(G\) = group size (4-16 samples)
- \(\rho_i = \frac{\pi_\theta(y_i \mid x)}{\pi_{old}(y_i \mid x)}\) = importance ratio
- \(A_i = \frac{r_i - \text{mean}(r)}{\text{std}(r)}\) = normalized advantage
GRPO Advantages¶
| Aspect | PPO | GRPO |
|---|---|---|
| Value function | Required | Not needed |
| Memory | High | 50% less |
| Variance | Lower (with critic) | Higher (group-based) |
| Stability | Sensitive to HP | More stable |
| Used by | InstructGPT, Claude | DeepSeek-R1, Qwen |
Code: GRPO Advantage Computation¶
def compute_grpo_advantages(rewards):
"""
Compute group-relative advantages.
rewards: [batch_size, group_size]
"""
mean = rewards.mean(dim=-1, keepdim=True)
std = rewards.std(dim=-1, keepdim=True) + 1e-8
advantages = (rewards - mean) / std
return advantages
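These advantages then feed a clipped objective analogous to PPO, averaged over the group. A minimal sketch under the same assumptions as the formula above; the per-token KL estimator shown is one common choice, not the only one:

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, advantages, beta=0.04, eps=0.2):
    """Clipped group-relative policy loss (sketch).

    logp_*: [batch * group_size, seq_len] per-token log-probs
    advantages: [batch * group_size] group-normalized rewards (one scalar per output)
    """
    adv = advantages.unsqueeze(-1)                  # broadcast the same A_i to every token
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    policy_term = torch.min(ratio * adv, clipped * adv)
    # Unbiased per-token estimate of KL toward the frozen reference model
    kl = torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1.0
    return -(policy_term - beta * kl).mean()
```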
5. RLAIF (RL from AI Feedback)¶
Core Idea¶
Replace human labelers with an AI judge (often a stronger model).
RLAIF Pipeline¶
1. Generate outputs from policy
2. Strong LLM (e.g., GPT-4) judges preferences (see the sketch after this list)
3. Train reward model on AI preferences
4. RL fine-tuning as in RLHF
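A hedged sketch of step 2: the judge model, the prompt wording, and the `generate` helper are placeholders rather than a specific API. The resulting (chosen, rejected) pairs are then used exactly like human labels in steps 3-4.

```python
def judge_preference(generate, prompt, response_a, response_b):
    """Ask a stronger LLM which of two responses better answers the prompt.

    `generate` stands for any text-completion call (an API client, a local model, ...).
    In practice both orderings are judged to reduce position bias.
    """
    judge_prompt = (
        "You are comparing two assistant responses.\n"
        f"Prompt: {prompt}\n\nResponse A: {response_a}\n\nResponse B: {response_b}\n\n"
        "Which response is more helpful, harmless and honest? Answer with 'A' or 'B'."
    )
    verdict = generate(judge_prompt).strip().upper()
    return "A" if verdict.startswith("A") else "B"
```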
Constitutional AI (Anthropic)¶
Critique → Revision → Preference Learning
Constitution = Set of principles (e.g., "Be helpful, harmless, honest")
1. Generate response
2. AI critiques against constitution
3. AI revises response (see the sketch after this list)
4. Train preference model on (original, revised) pairs
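A hedged sketch of steps 2-3: the prompts and the `generate` helper are illustrative placeholders; the returned pair feeds the preference-model training in step 4.

```python
def constitutional_revision(generate, prompt, principles):
    """One critique -> revision pass against a list of constitutional principles (sketch)."""
    original = generate(prompt)
    revised = original
    for principle in principles:
        critique = generate(
            f"Principle: {principle}\nResponse: {revised}\n"
            "Point out how the response violates this principle, if at all."
        )
        revised = generate(
            f"Response: {revised}\nCritique: {critique}\n"
            "Rewrite the response to address the critique while staying helpful."
        )
    return original, revised  # (rejected, chosen) preference pair
```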
RLAIF vs RLHF¶
| Aspect | RLHF | RLAIF |
|---|---|---|
| Labeler | Human | AI (LLM) |
| Cost | $20-100 per 1K pairs | ~$1-5 per 1K pairs |
| Speed | Days-weeks | Hours |
| Scalability | Limited | Unlimited |
| Quality | Gold standard | 95-99% of RLHF |
| Bias risk | Human bias | AI bias |
2025-2026 Trend¶
RLAIF has become a default choice in the RLHF literature thanks to its dramatic cost advantage. Most production systems now use a hybrid: RLAIF for scale + RLHF for quality validation.
6. Alignment Method Comparison¶
Decision Matrix¶
| Method | Complexity | Cost | Quality | Best For |
|---|---|---|---|---|
| RLHF (PPO) | High | High | Best | Production models |
| DPO | Low | Medium | Good | Quick alignment, research |
| GRPO | Medium | Medium | Excellent | Reasoning models |
| RLAIF | Medium | Low | Good | Scale, iteration |
| Constitutional AI | Medium | Low | Good | Safety-critical apps |
2026 Recommendations¶
| Scenario | Recommended Method |
|---|---|
| Research / experiments | DPO |
| Production chat model | RLHF (PPO or GRPO) |
| Reasoning model (like o1) | GRPO |
| Safety-critical | Constitutional AI + RLHF |
| Cost-constrained | RLAIF + DPO |
| Rapid iteration | RLAIF |
7. Alignment Tax (Performance Trade-offs)¶
The Alignment Trade-off¶
Key findings:
- Alignment typically reduces raw capability slightly
- But improves helpfulness and safety significantly
- "Alignment tax" = ~2-5% on raw benchmarks
- Net benefit on user-facing metrics
Mitigation Strategies¶
- Better SFT data — Reduces alignment tax
- Iterative RLHF — Multiple rounds, smaller steps
- Constitutional AI — Principles-based, less human feedback
- Hybrid approaches — RLHF + RLAIF
8. Safety & Alignment Challenges¶
Known Issues¶
| Issue | Description | Mitigation |
|---|---|---|
| Reward hacking | Model exploits RM quirks | KL penalty, adversarial testing |
| Distribution shift | RM fails on new domains | Iterative collection |
| Sycophancy | Model agrees with user over truth | Diversity in feedback |
| Deception | Model appears aligned but isn't | Red teaming, evals |
| Catastrophic forgetting | Alignment erodes capabilities | Replay, interleaved training |
Red Teaming¶
Adversarial prompts → Model outputs → Safety evaluation (see the sketch after the list below)
Types:
- Prompt injection attempts
- Jailbreak techniques
- Edge case exploration
- Harmful request variations
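A hedged sketch of that evaluation loop: `generate` and `safety_classifier` are placeholders for any completion call and any unsafe-content detector (e.g. a Llama Guard wrapper); the prompt set itself would be drawn from the categories above.

```python
def red_team_eval(generate, safety_classifier, adversarial_prompts):
    """Run adversarial prompts through the model and flag unsafe completions (sketch)."""
    failures = []
    for prompt in adversarial_prompts:
        output = generate(prompt)
        if safety_classifier(output):     # True -> output judged unsafe
            failures.append({"prompt": prompt, "output": output})
    failure_rate = len(failures) / max(len(adversarial_prompts), 1)
    return failure_rate, failures
```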
Common Misconceptions¶
Misconception: 'The KL penalty is just regularization; you can drop it for better quality'
Without the KL penalty (\(\beta \approx 0.01-0.1\)) the model quickly discovers reward hacking: it generates outputs that maximize the reward model's score but are meaningless to a human. This is the main source of RLHF instability. The KL penalty keeps the policy close to the reference model and is a mandatory component.
Misconception: 'RLAIF gives 95-99% of RLHF quality, so humans can be replaced entirely'
95-99% is an average across benchmarks. On edge cases (safety, cultural sensitivity, humor), AI feedback makes systematic errors. The production approach: RLAIF for 90% of the data (scale), RLHF for 10% of validation pairs (quality on edge cases). Fully replacing humans leads to AI bias amplification.
Misconception: 'A 2-5% alignment tax is a negligible price'
2-5% on benchmarks can mean a 10-15% loss on specific downstream tasks. Mitigation: (1) better SFT data reduces the tax, (2) iterative RLHF with small steps, (3) Constitutional AI for principles-based alignment. Monitor task-specific metrics, not only general benchmarks.
9. Interview Questions¶
Question: What is RLHF?¶
Weak answer: "It's when the model learns from people's feedback"
Strong answer: "A training paradigm with 4 phases: (1) SFT on quality demonstrations, (2) training a reward model from human preference pairs via the Bradley-Terry model: P(y_w > y_l) = sigma(r_w - r_l), (3) PPO fine-tuning of the policy with a KL penalty: maximize reward - beta * KL(pi || pi_ref), (4) iteration. Cost: $20-100 per 1000 annotation pairs, which is why RLAIF cuts it by ~70%."
Question: Why the KL penalty in RLHF?¶
Weak answer: "So the model doesn't forget what it has learned"
Strong answer: "It prevents reward hacking, where the policy finds a shortcut to maximize the reward model score while generating meaningless outputs. beta = 0.01-0.1; with beta too high the model gets stuck, with beta too low it drifts. The KL penalty anchors the policy to the reference (SFT) model."
Question: DPO vs RLHF: when to use which?¶
Weak answer: "DPO is better because it's simpler and faster"
Strong answer: "DPO skips the reward model and optimizes the policy directly from preferences. 2-3x faster, more stable, less memory. But it falls behind on complex reasoning tasks because there is no exploration through sampling. RLHF (PPO/GRPO) is more complex but better for production reasoning models. Rule of thumb: DPO for experiments and simple alignment, GRPO for reasoning (DeepSeek-R1), PPO for maximum quality."
Question: How does GRPO differ from PPO?¶
Weak answer: "GRPO is an improved PPO without a value function"
Strong answer: "GRPO generates G = 4-16 responses per prompt, computes a group-relative advantage A_i = (r_i - mean) / std, and updates the policy to maximize the clipped advantage - beta * KL. Key point: no value function -> 50% less memory; no critic -> simpler training. Trade-off: higher variance due to group-based estimation. Used by DeepSeek-R1 and Qwen; PPO is used by InstructGPT and Claude."
Question: Design an alignment pipeline for a chatbot¶
Weak answer: "Collect data, train a reward model, run PPO"
Strong answer: "(1) SFT on 10K+ quality demonstrations, (2) collect 100K+ preference pairs via RLAIF (scale) + 10K RLHF pairs (quality), (3) train a reward model with the Bradley-Terry loss, (4) GRPO fine-tuning with KL penalty beta = 0.05, group_size = 8, (5) red-team evaluation + automated safety evals (Llama Guard), (6) iterate with continuous data collection + A/B testing on user satisfaction."
10. Formulas Quick Reference¶
Bradley-Terry Preference¶
\[ P(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big) \]
Reward Model Loss¶
\[ \mathcal{L}_{RM} = -\mathbb{E}\left[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\right] \]
PPO Objective¶
\[ \max_\theta \; \mathbb{E}\left[ r_\phi(x, y) - \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{ref}(y \mid x)} \right] \]
DPO Loss¶
\[ \mathcal{L}_{DPO} = -\mathbb{E}\left[ \log \sigma\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{ref}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{ref}(y_l \mid x)} \right) \right] \]
GRPO Advantage¶
\[ A_i = \frac{r_i - \text{mean}(r_1, \ldots, r_G)}{\text{std}(r_1, \ldots, r_G)} \]
11. Sources Synthesized¶
- constitutional-ai-2025-2026.md: Anthropic CAI approach
- Intuition Labs RL vs RLHF comparison (Feb 2026)
- DataCamp GRPO guide (Jul 2025)
- Sebastian Raschka DeepSeek technical tour (Dec 2025)
- Cameron Wolfe GRPO deep dive (Nov 2025)
- DeepSeek-R1 paper (Jan 2025)
- OpenRLHF documentation
- HuggingFace TRL documentation
- llm-alignment-peft-2026.md: GRPO details, PEFT for alignment (PHASE 5)
- ai-safety-report-2026.md: AI safety landscape, regulatory context (PHASE 5)
- llm-guardrails-2026.md: NeMo Guardrails, Llama Guard, production safety (PHASE 5)
- llm-security-owasp-2026.md: OWASP LLM Top 10, attack vectors (PHASE 5)