
Cheat Sheet: Alignment and RLHF

~7 minute read

Prerequisites: Master preparation guide | Deep learning cheat sheet

Type: synthesis / interview cheat sheet. Date: February 2026. Synthesis of: RLHF, DPO, GRPO, RLAIF, Constitutional AI, alignment methods

Alignment is the bridge between "the model predicts tokens" and "the model is helpful, safe, and follows instructions". Over 2023-2026, three generations of methods succeeded one another: RLHF/PPO (InstructGPT; $20-100 per 1000 annotation pairs), DPO (2-3x faster, no reward model), and GRPO (DeepSeek-R1; ~50% less memory than PPO). In parallel, RLAIF cut annotation costs by ~70%, and Constitutional AI (Anthropic) enabled alignment without any human preference pairs at all. This cheat sheet covers the formulas, code, and a decision matrix for all five methods.


Quick Reference: Key Numbers

| Metric | Value | Context |
|---|---|---|
| RLHF human annotation cost | $20-100 per 1000 pairs | Expensive, scales poorly |
| RLAIF cost reduction | ~70% vs RLHF | AI feedback cheaper |
| DPO training speedup | 2-3× vs PPO | No reward model needed |
| KL penalty coefficient | \(\beta \approx 0.01-0.1\) | Prevents policy drift |
| GRPO group size | 4-16 samples per prompt | DeepSeek-R1 default |
| DeepSeek-R1 GRPO memory saving | ~50% less vs PPO | No value function |

1. LLM Alignment Overview

Why Alignment?

Pre-trained LLMs predict the next token, but don't necessarily:
- Follow instructions
- Avoid harmful content
- Be truthful and helpful
- Match human preferences

Alignment = shaping model behavior to match human values/intent

Alignment Pipeline Stages

1. Pre-training     → Next token prediction (trillions of tokens)
2. SFT              → Supervised Fine-Tuning on demonstrations
3. Alignment        → RLHF / DPO / RLAIF / CAI
4. Deployment       → Inference with safety filters

2. RLHF (Reinforcement Learning from Human Feedback)

Core Idea

Train a reward model from human preferences, then optimize policy with RL.

RLHF Pipeline (4 phases)

graph TD
    A["Phase 1: SFT<br/>Pre-trained model -> Fine-tune<br/>on quality demonstrations"] --> B["Phase 2: Reward Model Training<br/>Generate outputs -> Humans rank<br/>pairs -> Train RM"]
    B --> C["Phase 3: RL Fine-Tuning<br/>Policy generates -> RM scores<br/>-> PPO update"]
    C --> D["Phase 4: Iteration<br/>Collect more feedback -><br/>Retrain RM -> Continue RL"]
    D -.->|optional| B

    style A fill:#e8eaf6,stroke:#3f51b5
    style B fill:#fff3e0,stroke:#ef6c00
    style C fill:#e8f5e9,stroke:#4caf50
    style D fill:#f3e5f5,stroke:#9c27b0

Bradley-Terry Model

Given preferred output \(y_w\) over \(y_l\):

\[P(y_w > y_l) = \frac{\exp(r(x, y_w))}{\exp(r(x, y_w)) + \exp(r(x, y_l))} = \sigma(r(x, y_w) - r(x, y_l))\]

Reward Model Loss:

\[\mathcal{L}_{RM} = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma(r_\phi(x, y_w) - r_\phi(x, y_l)) \right]\]
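A minimal PyTorch sketch of this pairwise loss, assuming `chosen_rewards` and `rejected_rewards` are scalar reward-model outputs for each preference pair (the function name is illustrative):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_rewards, rejected_rewards):
    """Pairwise Bradley-Terry loss: -log sigma(r_w - r_l), averaged over pairs."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

The loss shrinks as the reward model scores the preferred output above the rejected one, and grows when the ordering is reversed.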

PPO Objective with KL Penalty

\[\mathcal{L}_{PPO} = \mathbb{E}_x \mathbb{E}_{y \sim \pi_\theta} \left[ r_\phi(x, y) - \beta \cdot \text{KL}(\pi_\theta(y|x) \| \pi_{ref}(y|x)) \right]\]

Key components:
- \(r_\phi(x, y)\): reward model score
- \(\beta\): KL penalty coefficient (prevents drift)
- \(\pi_{ref}\): reference (SFT) model

PPO Clip Objective

\[\mathcal{L}_{clip} = \mathbb{E}_t \left[ \min \left( \rho_t A_t, \text{clip}(\rho_t, 1-\epsilon, 1+\epsilon) A_t \right) \right]\]

Where \(\rho_t = \frac{\pi_\theta(a_t|s_t)}{\pi_{old}(a_t|s_t)}\)

Typical values: \(\epsilon = 0.2\)
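The clip objective can be sketched as a short PyTorch function (negated so that minimizing the returned value maximizes the surrogate; per-token log-probs and advantages are assumed precomputed):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """PPO clipped surrogate objective, negated for gradient descent."""
    ratio = torch.exp(logp_new - logp_old)                    # rho_t
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # min() keeps the update pessimistic: clipping caps the incentive
    # to push the ratio far from 1
    return -torch.min(unclipped, clipped).mean()
```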


3. DPO (Direct Preference Optimization)

Core Idea

Skip the reward model! Optimize policy directly from preferences.

DPO Derivation

The optimal policy of the KL-constrained reward maximization objective satisfies:

\[r(x, y) = \beta \log \frac{\pi^*(y|x)}{\pi_{ref}(y|x)} + \beta \log Z(x)\]

Substitute into RM loss:

\[\mathcal{L}_{DPO} = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)} \right) \right]\]

DPO vs RLHF Comparison

| Aspect | RLHF (PPO) | DPO |
|---|---|---|
| Reward model | Required | Not needed |
| Training stability | Complex, unstable | Simple, stable |
| Memory | High (RM + policy + critic) | Lower (policy only) |
| Quality | Slightly better on complex tasks | Comparable on many tasks |
| Speed | Slower | 2-3× faster |
| Best for | Production models | Quick alignment, experiments |

Code: DPO Training

import torch
import torch.nn.functional as F

def dpo_loss(policy, reference, batch, beta=0.1):
    """
    DPO loss implementation.
    batch = {'prompt': [...], 'chosen': [...], 'rejected': [...]}
    policy/reference expose log_prob(prompts, responses) returning
    summed per-sequence log-probabilities.
    """
    # Sequence log-probabilities under the policy and the frozen reference
    pi_chosen = policy.log_prob(batch['prompt'], batch['chosen'])
    pi_rejected = policy.log_prob(batch['prompt'], batch['rejected'])
    with torch.no_grad():
        ref_chosen = reference.log_prob(batch['prompt'], batch['chosen'])
        ref_rejected = reference.log_prob(batch['prompt'], batch['rejected'])

    # Implicit reward log-ratios
    log_ratio_chosen = pi_chosen - ref_chosen
    log_ratio_rejected = pi_rejected - ref_rejected

    # DPO objective: -log sigma(beta * margin)
    logits = beta * (log_ratio_chosen - log_ratio_rejected)
    loss = -F.logsigmoid(logits).mean()

    return loss

4. GRPO (Group Relative Policy Optimization)

Core Idea (DeepSeek-R1, 2025)

Generate multiple completions per prompt and compute relative advantages within each group. No value function needed.

GRPO Algorithm

For each prompt x:
    1. Generate G outputs: {y_1, y_2, ..., y_G} from π_θ
    2. Score each: {r_1, r_2, ..., r_G} with reward model
    3. Compute group-relative advantages:
       A_i = (r_i - mean(r)) / std(r)
    4. Update policy to maximize relative advantage

GRPO Objective

\[\mathcal{L}_{GRPO} = -\mathbb{E} \left[ \frac{1}{G} \sum_{i=1}^{G} \left( \min(\rho_i \cdot A_i, \text{clip}(\rho_i, 1-\epsilon, 1+\epsilon) \cdot A_i) - \beta \cdot \text{KL} \right) \right]\]

Where:
- \(G\) = group size (4-16 samples)
- \(A_i = \frac{r_i - \text{mean}(r)}{\text{std}(r)}\) = normalized advantage
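Putting the pieces together, a hedged sketch of the objective (sequence-level log-probs and a k3-style KL estimator; real implementations differ in token-level details):

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, eps=0.2, beta=0.04):
    """
    rewards and logp_*: [batch, G] tensors (sequence-level log-probs).
    Group-relative advantages + clipped surrogate + KL penalty.
    """
    # Normalize rewards within each group of G samples
    adv = (rewards - rewards.mean(dim=-1, keepdim=True)) / (
        rewards.std(dim=-1, keepdim=True) + 1e-8
    )
    ratio = torch.exp(logp_new - logp_old)  # rho_i
    surrogate = torch.min(
        ratio * adv,
        torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv,
    )
    # Low-variance KL estimate against the reference policy
    log_r = logp_ref - logp_new
    kl = torch.exp(log_r) - log_r - 1.0
    return -(surrogate - beta * kl).mean()
```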

GRPO Advantages

| Aspect | PPO | GRPO |
|---|---|---|
| Value function | Required | Not needed |
| Memory | High | ~50% less |
| Variance | Lower (with critic) | Higher (group-based) |
| Stability | Sensitive to hyperparameters | More stable |
| Used by | InstructGPT, Claude | DeepSeek-R1, Qwen |

Code: GRPO Advantage Computation

import torch

def compute_grpo_advantages(rewards):
    """
    Compute group-relative advantages.
    rewards: [batch_size, group_size] tensor of reward-model scores.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True) + 1e-8  # avoid division by zero
    advantages = (rewards - mean) / std
    return advantages

5. RLAIF (RL from AI Feedback)

Core Idea

Replace human labelers with an AI judge (often a stronger model).

RLAIF Pipeline

1. Generate outputs from policy
2. Strong LLM (e.g., GPT-4) judges preferences
3. Train reward model on AI preferences
4. RL fine-tuning as in RLHF
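Step 2 above can be sketched as prompt construction plus verdict parsing; the actual judge call is omitted (any strong LLM API would stand in for it), and these helper names are illustrative:

```python
def build_judge_prompt(prompt, response_a, response_b):
    """Pairwise comparison prompt for an AI judge (RLAIF)."""
    return (
        "Which response to the user prompt is more helpful and harmless?\n"
        f"Prompt: {prompt}\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        "Answer with a single letter, A or B."
    )

def parse_preference(judge_output):
    """Map the judge's raw answer to a (chosen, rejected) label."""
    verdict = judge_output.strip().upper()
    if verdict.startswith("A"):
        return ("A", "B")
    if verdict.startswith("B"):
        return ("B", "A")
    return None  # unparseable verdict: drop the pair
```

In practice, position bias is mitigated by judging each pair twice with A/B swapped and keeping only consistent verdicts.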

Constitutional AI (Anthropic)

Critique → Revision → Preference Learning

Constitution = Set of principles (e.g., "Be helpful, harmless, honest")

1. Generate response
2. AI critiques against constitution
3. AI revises response
4. Train preference model on (original, revised) pairs
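The critique → revision loop above can be sketched generically; `generate`, `critique`, and `revise` are hypothetical stand-ins for LLM calls, not a real API:

```python
def constitutional_revision(prompt, generate, critique, revise,
                            constitution, max_rounds=2):
    """
    CAI self-improvement loop sketch: generate a response, critique it
    against each principle, revise; returns the (original, revised)
    pair used for preference training.
    """
    original = generate(prompt)
    response = original
    for _ in range(max_rounds):
        critiques = [critique(response, p) for p in constitution]
        issues = [c for c in critiques if c]  # empty critique = no issue found
        if not issues:
            break
        response = revise(response, issues)
    return original, response
```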

RLAIF vs RLHF

| Aspect | RLHF | RLAIF |
|---|---|---|
| Labeler | Human | AI (LLM) |
| Cost | $20-100 per 1K pairs | ~$1-5 per 1K pairs |
| Speed | Days-weeks | Hours |
| Scalability | Limited | Unlimited |
| Quality | Gold standard | 95-99% of RLHF |
| Bias risk | Human bias | AI bias |

2025-2026 Trend

RLAIF has become a default method in the RLHF literature due to its dramatic cost advantage. Most production systems now use a hybrid: RLAIF for scale plus RLHF for quality validation.


6. Alignment Method Comparison

Decision Matrix

| Method | Complexity | Cost | Quality | Best For |
|---|---|---|---|---|
| RLHF (PPO) | High | High | Best | Production models |
| DPO | Low | Medium | Good | Quick alignment, research |
| GRPO | Medium | Medium | Excellent | Reasoning models |
| RLAIF | Medium | Low | Good | Scale, iteration |
| Constitutional AI | Medium | Low | Good | Safety-critical apps |

2026 Recommendations

| Scenario | Recommended Method |
|---|---|
| Research / experiments | DPO |
| Production chat model | RLHF (PPO or GRPO) |
| Reasoning model (like o1) | GRPO |
| Safety-critical | Constitutional AI + RLHF |
| Cost-constrained | RLAIF + DPO |
| Rapid iteration | RLAIF |

7. Alignment Tax (Performance Trade-offs)

The Alignment Trade-off

\[\text{Capability} \leftrightarrow \text{Alignment}\]

Key findings:
- Alignment typically reduces raw capability slightly
- But improves helpfulness and safety significantly
- "Alignment tax" = ~2-5% on raw benchmarks
- Net benefit on user-facing metrics

Mitigation Strategies

  1. Better SFT data — Reduces alignment tax
  2. Iterative RLHF — Multiple rounds, smaller steps
  3. Constitutional AI — Principles-based, less human feedback
  4. Hybrid approaches — RLHF + RLAIF

8. Safety & Alignment Challenges

Known Issues

| Issue | Description | Mitigation |
|---|---|---|
| Reward hacking | Model exploits RM quirks | KL penalty, adversarial testing |
| Distribution shift | RM fails on new domains | Iterative collection |
| Sycophancy | Model agrees with user over truth | Diversity in feedback |
| Deception | Model appears aligned but isn't | Red teaming, evals |
| Catastrophic forgetting | Alignment erodes capabilities | Replay, interleaved training |

Red Teaming

Adversarial prompts → Model outputs → Safety evaluation

Types:
- Prompt injection attempts
- Jailbreak techniques
- Edge case exploration
- Harmful request variations

Common Misconceptions

Misconception: "The KL penalty is just regularization; drop it for better quality"

Without the KL penalty (\(\beta \approx 0.01-0.1\)) the model quickly finds reward hacking: it generates answers that maximize the reward model's score but are meaningless to a human. This is the main source of RLHF instability. The KL penalty keeps the policy close to the reference model and is a mandatory component.

Misconception: "RLAIF delivers 95-99% of RLHF quality, so humans can be replaced entirely"

95-99% is an average across benchmarks. On edge cases (safety, cultural sensitivity, humor) AI feedback errs systematically. The production approach: RLAIF for 90% of the data (scale), RLHF for 10% validation pairs (quality on edge cases). Fully replacing humans leads to AI bias amplification.

Misconception: "A 2-5% alignment tax is a negligible price"

2-5% on benchmarks can mean a 10-15% loss on specific downstream tasks. Mitigation: (1) better SFT data reduces the tax, (2) iterative RLHF with small steps, (3) Constitutional AI for principles-based alignment. Monitor task-specific metrics, not just general benchmarks.


9. Interview Questions

Question: What is RLHF?

❌ "It's when the model learns from human feedback"

✅ "A four-phase training paradigm: (1) SFT on quality demonstrations, (2) training a reward model on human preference pairs via the Bradley-Terry model: P(y_w > y_l) = sigma(r_w - r_l), (3) PPO fine-tuning of the policy with a KL penalty: maximize reward - beta * KL(pi || pi_ref), (4) iteration. Cost: $20-100 per 1000 annotation pairs, which is why RLAIF cuts it by ~70%."

Question: Why the KL penalty in RLHF?

❌ "So the model doesn't forget what it learned"

✅ "It prevents reward hacking: the policy finding a shortcut that maximizes the reward model score while generating meaningless answers. beta = 0.01-0.1; with beta too high the model gets stuck, too low and it drifts. The KL penalty anchors the policy to the reference (SFT) model."

Question: DPO vs RLHF: when to use which?

❌ "DPO is better because it's simpler and faster"

✅ "DPO skips the reward model and optimizes the policy directly from preferences: 2-3x faster, more stable, less memory. But it lags on complex reasoning tasks, because there is no exploration via sampling. RLHF (PPO/GRPO) is more complex but better for production reasoning models. Rule of thumb: DPO for experiments and simple alignment, GRPO for reasoning (DeepSeek-R1), PPO for maximum quality."

Question: How does GRPO differ from PPO?

❌ "GRPO is an improved PPO without a value function"

✅ "GRPO generates G=4-16 answers per prompt, computes a group-relative advantage A_i = (r_i - mean) / std, and updates the policy to maximize the clipped advantage minus beta * KL. Key points: no value function → ~50% less memory; no critic → simpler training. Trade-off: higher variance due to group-based estimation. Used by DeepSeek-R1 and Qwen; PPO is used by InstructGPT and Claude."

Question: Design an alignment pipeline for a chatbot

❌ "Collect data, train a reward model, run PPO"

✅ "(1) SFT on 10K+ quality demonstrations, (2) collect 100K+ preference pairs via RLAIF (scale) + 10K RLHF pairs (quality), (3) train a reward model with the Bradley-Terry loss, (4) GRPO fine-tuning with KL penalty beta=0.05, group_size=8, (5) red-team evaluation + automated safety evals (Llama Guard), (6) iterate with continuous data collection + A/B testing on user satisfaction."


10. Formulas Quick Reference

Bradley-Terry Preference

\[P(y_w > y_l) = \sigma(r(x, y_w) - r(x, y_l))\]

Reward Model Loss

\[\mathcal{L}_{RM} = -\mathbb{E} \left[ \log \sigma(r_w - r_l) \right]\]

PPO Objective

\[\mathcal{L}_{PPO} = \mathbb{E} \left[ \min(\rho A, \text{clip}(\rho, 1-\epsilon, 1+\epsilon) A) \right] - \beta \cdot \text{KL}\]

DPO Loss

\[\mathcal{L}_{DPO} = -\mathbb{E} \left[ \log \sigma \left( \beta \log \frac{\pi(y_w|x)}{\pi_{ref}(y_w|x)} - \beta \log \frac{\pi(y_l|x)}{\pi_{ref}(y_l|x)} \right) \right]\]

GRPO Advantage

\[A_i = \frac{r_i - \mu_G}{\sigma_G}\]

11. Sources Synthesized

  1. constitutional-ai-2025-2026.md — Anthropic CAI approach
  2. Intuition Labs RL vs RLHF comparison (Feb 2026)
  3. DataCamp GRPO guide (Jul 2025)
  4. Sebastian Raschka DeepSeek technical tour (Dec 2025)
  5. Cameron Wolfe GRPO deep dive (Nov 2025)
  6. DeepSeek-R1 paper (Jan 2025)
  7. OpenRLHF documentation
  8. HuggingFace TRL documentation
  9. llm-alignment-peft-2026.md — GRPO details, PEFT for alignment (PHASE 5)
  10. ai-safety-report-2026.md — AI safety landscape, regulatory context (PHASE 5)
  11. llm-guardrails-2026.md — NeMo Guardrails, Llama Guard, production safety (PHASE 5)
  12. llm-security-owasp-2026.md — OWASP LLM Top 10, attack vectors (PHASE 5)