RLHF Progress: RLBFF, RTO, GRPO, and New Methods¶

~9 min read

Prerequisites: LLM Alignment Methods, LoRA and Fine-Tuning Variants

Classic RLHF (PPO plus a reward model) dominated in 2022-2023, but by 2025 the landscape had changed radically. RLBFF (Wang et al., 2025) took first place on JudgeBench (81.4%) and RM-Bench (86.2%) by combining human preferences with verifiable rules. RTO (Zhong et al., 2024) showed that token-level rewards give +7.5 on AlpacaEval 2 over sentence-level rewards. GRPO (DeepSeek; introduced with DeepSeekMath in 2024 and popularized by DeepSeek-R1 in 2025) removed the critic network entirely, and its 2-GRPO variant reaches 98.1% of 16-rollout performance with 12.5% of the rollouts. These methods shape how the next generation of LLMs will be trained.

URL: arXiv papers (February 2024 to September 2025) Type: academic papers Date: 2024-2025 Authors: multiple (see below)

Key Sources¶

New Methods 2025¶

- RLBFF: Binary Flexible Feedback (Wang et al., Sep 2025)
    - Result: #1 on JudgeBench (81.4%) and RM-Bench (86.2%)
    - Combines human preferences with verifiable rules
    - Open-source recipe for Qwen3-32B
- RTO: Reinforced Token Optimization (Zhong et al., Apr 2024)
    - DPO + PPO hybrid
    - Token-level reward learning
    - +7.5 on AlpacaEval 2, +4.1 on Arena-Hard

Theoretical Advances¶

- Iterative Preference Learning (Xiong et al., Dec 2023)
    - Reverse-KL regularized contextual bandit formulation
    - Offline, online, and hybrid settings
    - Iterative DPO for the online setting
- Online Iterative RLHF with General Preference Model (Ye et al., Feb 2024)
    - No reward function assumption
    - General preference oracle (not Bradley-Terry)
    - Sample-efficient algorithms

Frameworks & Benchmarks¶

- Uni-RLHF: Universal Platform (Yuan et al., Feb 2024)
    - Multi-feedback annotation platform
    - 15M+ steps across 30+ tasks
    - Modular offline RLHF baselines
- RLHF-Blender (Metz et al., Aug 2023)
    - Configurable interactive interface
    - Study of diverse feedback types
    - Human factors investigation

Recent Advances¶

- Multi-Task Reward Learning (Wu et al., Jun 2025)
    - Classification + regression reward models
    - Learnable weights for balancing
    - Human ratings in reward-free environments

Key Ideas¶

RLBFF: Binary Flexible Feedback¶

Problem:

- RLHF: opaque and prone to reward hacking
- RLVR: limited to tasks with verifiable correctness

Solution: extract binary principles from natural-language feedback.

Principles:

- "Accuracy of information: yes/no"
- "Code readability: yes/no"
- "Response safety: yes/no"

Training:

\[ \text{RM}(x, y) = P(\text{principle satisfied} \mid x, y) \]

Benefits:

- Interpretable (explicit principles)
- Verifiable (rule-based)
- Customizable (principles can be selected at inference time)
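A minimal sketch of how a principle-conditioned reward could be read off a yes/no judgment. It assumes the reward model exposes two logits, one for "yes" and one for "no", given a prompt, a response, and a principle. The function name `principle_reward` and the logit values are hypothetical and are not taken from the RLBFF recipe.

```python
import math

def principle_reward(logit_yes: float, logit_no: float) -> float:
    """P(principle satisfied | x, y) as a two-way softmax over yes/no logits."""
    # A softmax over two logits reduces to a sigmoid of their difference.
    return 1.0 / (1.0 + math.exp(-(logit_yes - logit_no)))

# Hypothetical logits for one response scored against three principles.
scores = {
    "accuracy of information": principle_reward(2.0, -1.0),
    "code readability": principle_reward(0.0, 0.0),
    "response safety": principle_reward(-1.5, 1.5),
}
```

Selecting a subset of the keys in `scores` at inference time is one way to picture the "customizable" property: the same model is queried with different principles.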

RTO: DPO + PPO Hybrid¶

Insight: DPO yields a token-wise characterization of the reward even though the preference signal is sparse and sentence-level.

Algorithm:

1. Stage 1: DPO for token-wise reward learning
2. Stage 2: PPO with the learned token-wise rewards

MDP formulation:

\[ \mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma) \]

where:

- $\mathcal{S}$: partial sequences (token states)
- $\mathcal{A}$: next-token actions
- $R(s_t, a_t)$: learned token-wise reward

Results:

| Benchmark | PPO | RTO | Δ |
|-----------|-----|-----|---|
| AlpacaEval 2 | baseline | +7.5 | +7.5 |
| Arena-Hard | baseline | +4.1 | +4.1 |
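One way to picture Stage 1 is that a DPO-trained policy induces an implicit per-token reward proportional to its log-probability ratio against the reference policy. The sketch below assumes per-token log-probabilities are already available as plain lists; `token_rewards`, `beta`, and the numbers are illustrative assumptions, and the reward used in the RTO paper may contain further terms.

```python
def token_rewards(logp_dpo, logp_ref, beta=0.1):
    """Per-token reward r_t = beta * (log pi_dpo(a_t|s_t) - log pi_ref(a_t|s_t))."""
    if len(logp_dpo) != len(logp_ref):
        raise ValueError("log-prob sequences must have the same length")
    return [beta * (d - r) for d, r in zip(logp_dpo, logp_ref)]

# Hypothetical log-probs of three generated tokens under both policies.
logp_dpo = [-1.0, -0.5, -3.0]
logp_ref = [-1.0, -1.5, -1.0]

rewards = token_rewards(logp_dpo, logp_ref)
# Summing the token rewards recovers a sequence-level score.
sequence_reward = sum(rewards)
```

Here the third token is the one that the DPO policy made less likely than the reference did, so it receives the negative reward. A single sentence-level score cannot express that kind of credit assignment.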

Reverse-KL Regularized RLHF¶

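Formulation (Iterative Preference Learning):

\[ \max_\pi \; \mathbb{E}_{x \sim d_0,\, y \sim \pi(\cdot|x)}\left[r(x, y)\right] - \beta \, \mathbb{E}_{x \sim d_0}\left[D_{KL}(\pi(\cdot|x) \| \pi_{\text{ref}}(\cdot|x))\right] \]

where $r$ is the reward, $d_0$ is the prompt distribution, and $\pi_{\text{ref}}$ is the fixed reference policy.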

Settings:

1. Offline: fixed dataset $\mathcal{D}$
2. Online: query the preference oracle during training
3. Hybrid: start offline, then switch to online

Key insight: existing PPO and DPO pipelines are limited by a lack of exploration.
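For a finite set of candidate responses, a reverse-KL regularized reward objective has a well-known closed-form maximizer: the reference policy reweighted by exponentiated reward. The sketch below illustrates that standard result on a toy distribution; the function name and all numbers are assumptions made for illustration.

```python
import math

def kl_regularized_optimum(ref_probs, rewards, beta):
    """pi*(y) proportional to pi_ref(y) * exp(r(y) / beta) over a finite response set."""
    weights = [p * math.exp(r / beta) for p, r in zip(ref_probs, rewards)]
    total = sum(weights)
    return [w / total for w in weights]

ref_probs = [0.5, 0.3, 0.2]   # hypothetical reference policy over three responses
rewards = [0.0, 1.0, 2.0]     # hypothetical rewards

# Large beta: strong KL penalty, the optimum stays close to the reference policy.
tight = kl_regularized_optimum(ref_probs, rewards, beta=100.0)
# Small beta: weak KL penalty, the optimum concentrates on the best response.
loose = kl_regularized_optimum(ref_probs, rewards, beta=0.1)
```

The optimum only reweights responses the reference policy already supports. This is one way to see why exploration becomes the bottleneck in the offline setting.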

General Preference Oracle¶

Problem: the Bradley-Terry model is restrictive.

\[ P(y \succ y') = \frac{\exp(f(y))}{\exp(f(y)) + \exp(f(y'))} \]
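The Bradley-Terry probability above depends only on the difference of two scalar scores, which is where the restriction comes from. A small sketch with hypothetical scores:

```python
import math

def bradley_terry(score_a: float, score_b: float) -> float:
    """P(a preferred over b) = exp(f(a)) / (exp(f(a)) + exp(f(b)))."""
    return 1.0 / (1.0 + math.exp(score_b - score_a))

# Scalar scores force transitive preferences: a cyclic preference
# (a beats b, b beats c, c beats a) cannot be represented by any scores.
p_ab = bradley_terry(1.0, 0.0)
p_bc = bradley_terry(0.0, -1.0)
p_ac = bradley_terry(1.0, -1.0)
```

Because `p_ab` and `p_bc` are both above 0.5, `p_ac` is forced above both of them. A general preference oracle does not have to obey this.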

General oracle: any preference relation that satisfies consistency axioms.

Algorithm:

- Natural policy gradient with a general oracle
- Sample complexity bounds

Formulas and Math¶

Standard PPO Objective¶

\[ L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right] \]

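where $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}$ is the probability ratio between the current and old policies, $\hat{A}_t$ is the advantage estimate, and $\epsilon$ is the clip range.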
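A per-sample sketch of the clipped surrogate, using plain floats rather than tensors. The function name and inputs are illustrative assumptions.

```python
def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Per-sample PPO term: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

# Positive advantage: the gain is capped once the ratio exceeds 1 + eps.
gain_capped = clipped_surrogate(ratio=1.5, advantage=2.0)
# Negative advantage: the unclipped (more pessimistic) term is kept.
loss_kept = clipped_surrogate(ratio=1.5, advantage=-2.0)
```

The `min` makes the objective a pessimistic bound. Clipping removes the incentive to push the ratio further in the direction that already helps, but never hides a penalty.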

DPO Objective (Bradley-Terry derived)¶

\[ \mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\left(\beta \left(\log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)\right)\right] \]

where $\pi_{\text{ref}}$ is the frozen reference policy (the SFT model), $y_w$ is the preferred response, and $y_l$ is the rejected one.
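The DPO loss for a single preference pair can be sketched from sequence-level log-probabilities. The function name, `beta=0.1`, and the log-prob values are assumptions made for illustration.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))) for one pair."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(m)) = log(1 + exp(-m)), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Policy identical to the reference: margin is 0 and the loss is log(2).
at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy raised the chosen response and lowered the rejected one: loss drops.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

Only the log-ratio differences enter the loss, so no separate reward model is needed at training time.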

RTO Token-wise Reward¶

\[ r_t = R_\phi(x_{1:t}, a_t) \quad \text{(learned from preferences)} \]

\[ \mathcal{L}_{\text{PPO}} = \mathbb{E}\left[\sum_{t} \left( r_t - \beta D_{KL}(\pi_\theta(\cdot|s_t) \| \pi_{\text{ref}}(\cdot|s_t)) \right)\right] \]

RLBFF Entailment Task¶

\[ \text{RM}(x, y, p) = P(p \mid x, y) \]

where $p$ is a principle (e.g., "response is accurate").

Training data: $(x, y, \text{feedback}) \rightarrow (x, y, \{p_1, p_2, \dots\})$

Related Work¶

LoRA/QLoRA — PEFT для RLHF policy models
DPO vs ORPO vs KTO — alignment methods comparison

Interview Questions¶

Q: Explain the RLHF pipeline from preference data to an aligned model.

Red flag: "We collect feedback and train the model" (no mention of the reward model and PPO as separate stages)

Strong answer: "Three stages: (1) SFT on instruction data, (2) a reward model trained with a Bradley-Terry loss on human comparisons, (3) PPO optimization with a KL penalty against the reference policy. RTO extends this to token-level rewards (+7.5 on AlpacaEval 2). RLBFF adds interpretable binary principles (81.4% on JudgeBench). The key problem is reward hacking: the model maximizes the reward model's score rather than real usefulness."

Q: DPO vs PPO: what are the trade-offs, and when should each be used?

Red flag: "DPO is better because it is simpler" (ignores the quality ceiling)

Strong answer: "PPO: online RL, needs a reward model and a critic, sample-inefficient, but has a higher quality ceiling. DPO: offline, skips the reward model, more stable (failure rate 5-10% vs 20-30%), but reaches 90-95% of RLHF quality. 2-GRPO showed that GRPO with 2 rollouts is equivalent to DPO up to an O(1/n) term, reaching 98.1% of 16-rollout performance with 12.5% of the rollouts. For production: DPO for an MVP, PPO/GRPO for frontier models."

Q: How do token-level rewards (RTO) improve alignment?

Red flag: "Token-level rewards are more precise" (no explanation of the mechanism)

Strong answer: "Standard RLHF gives a sparse sentence-level reward, so every token receives the same signal. RTO extracts token-wise rewards from DPO (Stage 1) and then uses them in PPO (Stage 2). This provides credit assignment: the model learns which specific token caused a negative reward. Result: +7.5 on AlpacaEval 2, +4.1 on Arena-Hard. MDP formulation: state = partial sequence, action = next token, reward = learned R(x_{1:t}, a_t)."

Practical Application¶

RLHF Pipeline (Production)¶

graph TD
    subgraph S1["1. Preference Collection"]
        A1["Human annotators<br/>(crowdsourcing)"]
        A2["User feedback<br/>(implicit/explicit)"]
        A3["Synthetic preferences<br/>(AI-generated)"]
    end
    subgraph S2["2. Reward Model Training"]
        B1["Bradley-Terry model"]
        B2["Cross-entropy loss"]
        B3["Validation on held-out"]
    end
    subgraph S3["3. Policy Optimization"]
        C1["PPO (online)"]
        C2["DPO (offline)"]
        C3["RTO (hybrid, token-level)"]
    end
    subgraph S4["4. Evaluation"]
        D1["LLM-as-judge<br/>(MT-Bench)"]
        D2["Human eval<br/>(win rate)"]
        D3["Safety benchmarks<br/>(Beavertails)"]
    end
    S1 --> S2 --> S3 --> S4
    style A1 fill:#e8eaf6,stroke:#3f51b5
    style A2 fill:#e8eaf6,stroke:#3f51b5
    style A3 fill:#e8eaf6,stroke:#3f51b5
    style B1 fill:#fff3e0,stroke:#ef6c00
    style B2 fill:#fff3e0,stroke:#ef6c00
    style B3 fill:#fff3e0,stroke:#ef6c00
    style C1 fill:#e8f5e9,stroke:#4caf50
    style C2 fill:#e8f5e9,stroke:#4caf50
    style C3 fill:#e8f5e9,stroke:#4caf50
    style D1 fill:#f3e5f5,stroke:#9c27b0
    style D2 fill:#f3e5f5,stroke:#9c27b0
    style D3 fill:#f3e5f5,stroke:#9c27b0

Hyperparameters¶

| Parameter | Typical | Notes |
|-----------|---------|-------|
| KL penalty (β) | 0.01-0.2 | Higher = more conservative |
| PPO clip (ε) | 0.2 | Prevents large policy updates |
| Learning rate | 1e-6 to 1e-5 | Policy networks are sensitive |
| Batch size | 512-2048 | Larger = more stable but slower |
| Epochs | 1-5 | More epochs risk overfitting to preferences |

My Notes¶

Trends 2024-2025:

1. Token-level RLHF: RTO (sentence → token)
2. Multi-feedback: RLBFF (human + verifiable)
3. Online learning: continuous preference collection
4. Theoretical understanding: bandit formulations

Production considerations:

- Reward model stability: needs periodic retraining
- Preference diversity: cultural bias issues
- Safety: red teaming before deployment
- Cost: human labeling is expensive

Open questions:

- [ ] Constitutional AI vs RLHF comparison
- [ ] Multi-objective RLHF (safety + helpfulness)
- [ ] Recursive alignment (self-improvement)
- [ ] Scalable preference collection

Gaps remaining:

- [ ] Constitutional AI details
- [ ] ORPO/KTO implementation guides
- [ ] Multi-agent RLHF
- [ ] RLHF for multimodal models

Code skills:

- Implement a preference dataset loader
- PPO update loop for language models
- DPO training from scratch
- Reward model evaluation pipeline

GRPO Advances 2025-2026 (Updated Feb 2026)¶

What is GRPO?¶

Group Relative Policy Optimization (GRPO) — RL algorithm using group-level statistics instead of critic networks.

Key difference from PPO:

\[ \text{PPO}: A(s,a) = Q(s,a) - V(s) \quad \text{(requires critic)} \]

\[ \text{GRPO}: A_i = \frac{r_i - \text{mean}(r_{\text{group}})}{\text{std}(r_{\text{group}})} \quad \text{(no critic)} \]
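A minimal sketch of the group-relative advantage for one prompt. Using the population standard deviation and adding a small `eps` to the denominator are assumptions of this sketch, not details stated in the formula above.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-8):
    """A_i = (r_i - mean(group)) / (std(group) + eps), no critic involved."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four rollouts for one prompt with binary correctness rewards.
advantages = group_advantages([1.0, 0.0, 0.0, 1.0])
# If every rollout gets the same reward, all advantages are zero (no learning signal).
flat = group_advantages([0.0, 0.0, 0.0, 0.0])
```

The baseline is the group mean, so the only extra cost compared with a critic is sampling several rollouts per prompt.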

GRPO Variants 2025-2026¶

| Variant | Key Innovation | Improvement |
|---------|----------------|-------------|
| GDRO | Distributionally robust optimization | +10.6% pass@8 |
| GRPO-CARE | Consistency-aware rewards | +6.7% accuracy, +24.5% consistency |
| Scaf-GRPO | Scaffolded hints for hard problems | +44.3% AIME24 |
| 2-GRPO | Minimal two-rollout config | 98.1% of 16-rollout performance, 12.5% of the rollouts |

GDRO (Jan 2026)¶

Problem: GRPO uses uniform sampling — wastes compute on solved problems.

Solution: two independent GDRO games:

1. Prompt-GDRO: dynamically upweights hard prompts
2. Rollout-GDRO: reallocates rollouts to reduce gradient variance

Formula:

\[ \text{Prompt-GDRO}: w_g \propto \exp(\eta \cdot \text{difficulty}_g) \]

Results: +10.6% (Prompt), +10.1% (Rollout) on pass@8
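The Prompt-GDRO weighting rule can be sketched as a softmax over per-group difficulty. How difficulty is measured is not specified here, so the proxy below (1 minus pass rate) and the value of `eta` are hypothetical.

```python
import math

def prompt_weights(difficulties, eta=1.0):
    """Normalized weights w_g proportional to exp(eta * difficulty_g)."""
    peak = max(difficulties)
    # Subtract the max before exponentiating for numerical stability.
    unnorm = [math.exp(eta * (d - peak)) for d in difficulties]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Hypothetical difficulty proxy: 1 - pass rate of each prompt group.
weights = prompt_weights([0.1, 0.5, 0.9], eta=2.0)
```

A larger `eta` shifts more sampling mass toward hard prompts, and `eta = 0` recovers uniform sampling.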

GRPO-CARE (Jun 2025)¶

Problem: GRPO improves accuracy but reduces logical coherence (57.9% consistency).

Solution: two-tiered reward:

1. Base reward for answer correctness
2. Adaptive consistency bonus for reasoning coherence

Consistency bonus:

\[ \text{Bonus} = \text{sigmoid}\left(P_{\text{ref}}(\text{answer} \mid \text{reasoning}) - P_{\text{group\_mean}}\right) \]

Results: +6.7% accuracy, +24.5% consistency improvement
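A sketch of the consistency bonus exactly as it is written in this section. It assumes the group-mean term is the average, over the rollouts of one group, of the reference model's likelihood of the answer given the reasoning. That reading and the numbers are assumptions; the paper's actual bonus may be computed differently.

```python
import math
from statistics import mean

def consistency_bonuses(ref_likelihoods):
    """sigmoid(P_ref(answer | reasoning) - group mean) for each rollout in a group."""
    baseline = mean(ref_likelihoods)
    return [1.0 / (1.0 + math.exp(-(p - baseline))) for p in ref_likelihoods]

# Hypothetical likelihoods for three rollouts of the same prompt.
bonuses = consistency_bonuses([0.9, 0.5, 0.1])
```

Rollouts whose reasoning supports their answer better than the group average get a bonus above 0.5, and the others fall below it.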

Scaf-GRPO (Oct 2025)¶

Problem: "Learning cliff" — zero reward for hard problems stalls learning.

Solution: scaffolded hints when learning plateaus:

- Tier 1: abstract concepts
- Tier 2: partial steps
- Tier 3: concrete guidance

Algorithm:

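def scaffold(problem, model_stuck, inject_hint, max_tier=3):
    """Escalate hints tier by tier while the model stays stuck on the problem."""
    # model_stuck and inject_hint are hypothetical callables supplied by the training loop
    for tier in range(1, max_tier + 1):
        if not model_stuck(problem):
            break
        problem = inject_hint(problem, tier=tier)
    return problem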

Results: +44.3% on AIME24 benchmark

2-GRPO (Oct 2025)¶

Insight: GRPO is secretly DPO with a contrastive objective.

Proof idea: group size affects only the Monte Carlo estimators, not the fundamental objective.

Results:

- 2-GRPO: 98.1% of 16-GRPO performance
- 12.5% of the rollouts, 21% of the training time

Formula:

\[ \mathcal{L}_{\text{2-GRPO}} \approx \mathcal{L}_{\text{DPO}} + O(1/n) \]
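A small worked example of why a two-rollout group behaves like a preference pair. With the population standard deviation (an assumption of this sketch), the normalized advantages of two rollouts with different rewards are always +1 and -1, regardless of how large the reward gap is.

```python
from statistics import mean, pstdev

def pair_advantages(r_a, r_b):
    """Group-normalized advantages for a group of exactly two rollouts."""
    rewards = [r_a, r_b]
    sigma = pstdev(rewards)
    if sigma == 0.0:
        return [0.0, 0.0]  # tied rollouts carry no signal
    mu = mean(rewards)
    return [(r - mu) / sigma for r in rewards]

wide_gap = pair_advantages(1.0, 0.0)
narrow_gap = pair_advantages(0.7, 0.6)
```

Both calls return approximately `[1, -1]`. The better rollout is pushed up and the worse one pushed down by the same amount, which mirrors the chosen/rejected structure of a DPO pair.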

Interview Questions (GRPO)¶

Q: Compare GRPO and PPO for LLM post-training.

Red flag: "GRPO is an improved version of PPO" (conflates fundamentally different approaches)

Strong answer: "PPO: requires a critic network (value function), sample-inefficient, but stable. GRPO: no critic, advantage from group statistics (r_i - mean)/std, compute-efficient. Use GRPO under memory constraints or when the critic is expensive. Use PPO for long training runs with stable value estimates. 2-GRPO showed that GRPO with K=2 is equivalent to DPO up to an O(1/n) term."

Q: How does GDRO improve GRPO?

Red flag: "GDRO is GRPO with dropout" (confuses it with regularization)

Strong answer: "GDRO addresses the uniform sampling problem: GRPO spends compute on problems that are already solved. Prompt-GDRO dynamically upweights hard prompts, and Rollout-GDRO reallocates rollouts to reduce gradient variance. These are two independent distributionally robust games. Result: +10.6% pass@8 with Prompt-GDRO, +10.1% with Rollout-GDRO."

Q: What is the learning cliff, and how does Scaf-GRPO address it?

Red flag: "It is when the model stops learning" (no explanation of the zero-gradient mechanism)

Strong answer: "Learning cliff: the problem is too hard, so every rollout in the group fails and gets reward 0. The group-relative advantages are then all zero, the gradient is zero, and training stalls. Scaf-GRPO injects tiered hints at a plateau: Tier 1 (abstract concepts), Tier 2 (partial steps), Tier 3 (concrete guidance). This gives enough guidance to learn without solving the problem outright. +44.3% on AIME24."

Misconception: more rollouts in GRPO is always better

2-GRPO (arXiv:2510.00977) showed that group size affects only the Monte Carlo estimators, not the fundamental objective. 2 rollouts give 98.1% of the 16-rollout performance with 12.5% of the rollouts. GRPO with K=2 is mathematically equivalent to DPO up to an O(1/n) term. Do not spend compute on large groups without benchmarking first.

Misconception: DPO has fully replaced PPO for alignment

DPO dominates in open source, but frontier models (GPT-4, Claude, Gemini) reportedly use PPO or its variants at the final stage. DPO is offline, so it cannot adapt to distribution shift at deployment. Iterative (online) DPO partly addresses this, and PPO with token-level rewards (RTO) adds a further +7.5 on AlpacaEval 2 over the PPO baseline.

Misconception: an RLHF reward model stays stable after training

A reward model degrades as the policy drifts away from the reward model's training distribution. According to Anthropic data (2024), reward model accuracy drops from 85% to 65% over 3 months without retraining. RLBFF partly addresses this by using verifiable binary principles (accuracy, safety) instead of a monolithic reward.

Sources¶

arXiv:2601.19280 (GDRO, Jan 2026)
arXiv:2506.16141 (GRPO-CARE, Jun 2025)
arXiv:2510.19807 (Scaf-GRPO, Oct 2025)
arXiv:2510.00977 (2-GRPO, Oct 2025)