
Alignment and PEFT: RLHF, DPO, GRPO, LoRA, QLoRA


Prerequisites: RLHF Progress | Alignment Methods

Full fine-tuning of a 70B-parameter model costs $50,000-100,000 and requires a cluster of 16 H100s. LoRA cuts costs by 10-20x while retaining 90-95% of the quality. DPO has replaced the unstable PPO loop of RLHF in most production systems of 2025-2026: Llama 3, Mistral, and Qwen all use DPO or its variants. Understanding the trade-offs between alignment methods and PEFT is a must-have skill for any ML engineer working with LLMs.

URL: Zylos AI, Hugging Face TRL, Sebastian Raschka | Type: alignment / rlhf / dpo / lora / peft | Date: February 2026 | Collection: Ralph Research, PHASE 5


Part 1: Overview

Fine-Tuning Landscape 2026

Three Main Categories:
1. Alignment Methods (RLHF, DPO, GRPO): train models to follow human preferences
2. Parameter-Efficient Methods (LoRA, QLoRA): reduce compute costs 10-20x
3. Supervised Fine-Tuning (SFT): adapt pre-trained models to specific tasks

Key Insight 2026:

Data quality has emerged as the most critical success factor, consistently outweighing hyperparameter optimization in importance.


Part 2: Alignment Methods

2.1 Direct Preference Optimization (DPO)

Concept: Simplifies alignment by directly updating the language model using preference data, without requiring a separate reward model.
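
The loss DPO optimizes (the standard formulation from the original DPO paper, reproduced here for reference) is

\[ \mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right] \]

where \(y_w\) is the preferred and \(y_l\) the rejected response, \(\pi_{\mathrm{ref}}\) is a frozen reference model, and \(\beta\) controls the implicit KL penalty.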

Advantages over RLHF:
- Stable and computationally lightweight
- No RL-loop instability (avoids PPO issues)
- No sampling from the LM during training
- Minimal hyperparameter tuning

Performance:
- Exceeds PPO-based RLHF in sentiment control
- Matches or improves response quality in summarization/dialogue
- Substantially simpler to implement

Implementation:

from trl import DPOConfig, DPOTrainer

# preference_dataset: rows with "prompt", "chosen" and "rejected" fields
trainer = DPOTrainer(
    model=model,                      # policy being aligned
    ref_model=ref_model,              # frozen reference copy for the implicit KL term
    args=DPOConfig(output_dir="dpo-out", beta=0.1),  # beta = KL penalty strength; older TRL versions took beta directly
    train_dataset=preference_dataset,
)
trainer.train()

2.2 Reinforcement Learning from Human Feedback (RLHF)

Three-Step Process:
1. Collect human feedback on LLM behavior
2. Train a reward model on the feedback (see the sketch below)
3. Fine-tune the LLM using RL (PPO, DPO, GRPO)
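
A minimal sketch of step 2, the pairwise (Bradley-Terry) loss typically used to train the reward model; reward_model, chosen_ids and rejected_ids are placeholder names, not a specific library API:

import torch.nn.functional as F

def reward_pair_loss(reward_model, chosen_ids, rejected_ids):
    # Scalar reward for the human-preferred completion and for the rejected one
    r_chosen = reward_model(chosen_ids)      # shape: (batch,)
    r_rejected = reward_model(rejected_ids)  # shape: (batch,)
    # Bradley-Terry objective: push the chosen reward above the rejected one
    return -F.logsigmoid(r_chosen - r_rejected).mean()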

Challenges:
- More complex than DPO
- Requires separate reward model training
- PPO optimization can be unstable

When to Use:
- Complex alignment requiring nuanced reward modeling
- When DPO is insufficient for the specific use case

2.3 GRPO (Group Relative Policy Optimization)

Concept: Introduced by DeepSeek in January 2025 (DeepSeek-R1 paper, arXiv:2501.12948), gained wide adoption in 2025-2026.

Key Innovation: Uses group-relative advantages instead of absolute rewards.
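
A minimal sketch of that computation, following the formulation in the DeepSeek-R1 paper (variable names are illustrative):

import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: (num_prompts, group_size), one row of sampled completions per prompt
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    # Each completion is scored relative to its own group,
    # so no separate learned value function (critic) is needed
    return (rewards - mean) / (std + eps)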

Status: Supported in Hugging Face TRL library.

2.4 RLAIF (AI Supervision)

Concept: Use AI models to fine-tune other AI models, reducing human annotation needs.

Related (but distinct) Concepts: Scalable Oversight (broader field), Superalignment (OpenAI's program for superintelligent AI alignment — different scope)

Applications:
- Reducing annotation costs
- Scaling alignment to larger models
- Self-improvement loops


Part 3: Parameter-Efficient Fine-Tuning (PEFT)

3.1 Cost and Memory Comparison

Method | Memory Usage | Quality | Cost Reduction
Full Fine-Tuning | 16 GB per B params | 100% | baseline
LoRA | ~3 GB per B params | 90-95% | 10-20x
QLoRA | ~1 GB per B params | 80-90% | 10-20x

Hardware Implications:

Model Size | Full FT | LoRA | QLoRA
7B | 2x A100 80GB | RTX 4090 (24GB) | RTX 3090 (24GB)
70B | 16x H100 (or 8x with ZeRO-Infinity offload) | 2x A100 (80GB) | 1x A100 (80GB)

3.2 LoRA (Low-Rank Adaptation)

How It Works:
- Adds small trainable low-rank matrices to frozen weights
- Formula: \(W' = W + \frac{\alpha}{r}BA\) (see the PyTorch sketch below)
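
A minimal PyTorch sketch of this update for one linear layer (illustrative only, not the internal implementation of the peft library):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)        # W stays frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # B starts at zero, so W' = W initially
        self.scale = alpha / r

    def forward(self, x):
        # W'x = Wx + (alpha/r) * B(Ax); only A and B receive gradients
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)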

Performance: Recovers 90-95% of full fine-tuning quality

Key Hyperparameters:

Parameter | Typical Range | Effect
Rank (r) | 8-64 | Higher = more trainable params, better adaptation
Alpha (α) | 16-32 | Scales the update magnitude
Target Modules | all linear layers | Targeting all layers improves quality
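
A typical peft configuration matching these ranges; the target module names below are the common ones for Llama-style blocks and are an assumption, not a universal default:

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                 # rank of the low-rank update
    lora_alpha=32,        # scaling factor alpha
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # "all linear" layers in a Llama-style block
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)  # base_model: an already loaded transformers model
model.print_trainable_parameters()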

Best Practices:
- For static datasets, avoid multi-epoch training (it overfits quickly)
- Data quality > hyperparameter tuning
- Start simple, measure everything
- Optimizer choice (AdamW vs SGD) shows minimal variation

3.3 QLoRA (Quantized LoRA)

How It Works:
- Loads the pre-trained model as 4-bit quantized weights
- Only the LoRA adapters are trained in full precision

Performance:
- 80-90% of full fine-tuning quality
- 33% memory savings vs LoRA
- 39% increased runtime vs LoRA

When to Use:
- Start with LoRA if the base model fits in GPU memory
- Use QLoRA to squeeze large models onto limited VRAM

Notable Achievement: 7B model on 24GB VRAM (consumer GPU)
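
A minimal 4-bit loading sketch with transformers, bitsandbytes and peft (one common recipe; the model id is a placeholder):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 quantization from the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # casts norms to fp32, enables gradient checkpointing
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))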

3.4 Production PEFT Deployment

Adapter-Based Serving (vLLM):
- 1 base model in GPU memory
- N small adapter files (MBs each)
- Serves N fine-tuned versions efficiently

Benefits:
- Small checkpoints (a few MBs per task)
- Mitigates catastrophic forgetting (though not fully; see Part 6)
- Less prone to overfitting
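
A minimal multi-adapter serving sketch with vLLM (adapter names and paths are placeholders; exact options vary by vLLM version):

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)  # one base model in GPU memory
params = SamplingParams(max_tokens=256)

# Each request can point at a different adapter checkpoint (a few MB each)
outputs = llm.generate(
    ["Summarize this support ticket: ..."],
    params,
    lora_request=LoRARequest("support-adapter", 1, "/path/to/support-adapter"),  # placeholder path
)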


Part 4: Supervised Fine-Tuning (SFT)

4.1 Core Concepts

Purpose: Transform pre-trained "base model" into instruction-following assistant.

Process: Next-token prediction on high-quality instruction data.

Data Requirements:
- Tens of thousands of examples are typical
- Start with 50+ well-crafted examples
- Quality > quantity (critical insight)

4.2 Data Preparation Pipeline

Seven-Stage Pipeline:
1. Dataset Preparation
2. Model Initialization
3. Training Environment Setup
4. Fine-Tuning
5. Evaluation and Validation
6. Deployment
7. Monitoring and Maintenance

Data Techniques:

Technique | Tools | Purpose
Balancing | SMOTE, ensemble | Handle class imbalance
Augmentation | NLP-AUG, TextAttack | Generate variations
Annotation | Snorkel | Precise labeling
Safety | Custom filters | Remove harmful content

Augmentation Methods:
- Word embeddings: replace words with semantically similar ones
- Back translation: paraphrase via a round trip through another language (example below)
- Adversarial attacks: generate deliberately challenging examples
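
A small back-translation example with the nlpaug library from the table above (the translation model names are examples; any round-trip pair works):

import nlpaug.augmenter.word as naw

# Round trip EN -> DE -> EN produces paraphrases of the original instruction
back_translation = naw.BackTranslationAug(
    from_model_name="facebook/wmt19-en-de",
    to_model_name="facebook/wmt19-de-en",
)
print(back_translation.augment("Explain the difference between LoRA and QLoRA."))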

4.3 SFT vs RL

Aspect | SFT | RL (DPO/RLHF)
Accuracy | 88.3% | Variable
Learning | Intuitive + counter-intuitive rules | Optimization-based
Role | Foundation for correctness | Refine behavior
Relationship | Complementary, not competing

Part 5: Hugging Face TRL (2026)

Overview

TRL: Transformer Reinforcement Learning library for post-training foundation models.

Supported Methods:
- SFT (Supervised Fine-Tuning)
- DPO (Direct Preference Optimization)
- GRPO (Group Relative Policy Optimization)

Key Features:
- Automatic tokenizer updates
- Tight PEFT integration
- Support for various architectures
- Scales across hardware setups

SFT Trainer Example

from trl import SFTTrainer
from datasets import load_dataset

# Conversational instruction dataset (prompt/response pairs)
dataset = load_dataset("trl-lib/Capybara", split="train")
trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",   # passing a model id lets TRL load the model and tokenizer for you
    train_dataset=dataset,
)
trainer.train()

Training Modes:
- Default: loss is computed on all tokens (the full sequence)
- Use DataCollatorForCompletionOnlyLM with a response_template for completion-only loss (see the sketch below)
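
A sketch of the completion-only setup; the "### Answer:" marker is an assumption and must match whatever prompt format your dataset actually uses:

from transformers import AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM, SFTTrainer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
collator = DataCollatorForCompletionOnlyLM(
    response_template="### Answer:",   # tokens before this marker are masked out of the loss
    tokenizer=tokenizer,
)
trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,             # the dataset from the example above, formatted with the marker
    data_collator=collator,
)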


Part 6: Catastrophic Forgetting

Problem Overview

Definition: Model forgets previously learned information while acquiring new knowledge.

Scale Effects:
- Observed in models from 1B to 7B+ parameters
- Larger models experience MORE severe forgetting

Mitigation Techniques

Technique | Effectiveness | Notes
Full Fine-Tuning | ❌ Highest forgetting | Best task performance
PEFT (LoRA) | ⚠️ Partial protection | Better than full FT
FIP | ✅ Best retention | Functionally Invariant Paths

Key Finding: LoRA alone does NOT fully mitigate catastrophic forgetting in continual learning.

Best Practices

  • Use PEFT for better preservation
  • Consider FIP for sequential task learning
  • Monitor previous task performance
  • Accept some forgetting as unavoidable

Part 7: Provider Comparison (2026)

Provider | Fine-Tuning Support | Best Use Case
OpenAI | Custom training | Multimodal, general chat
Anthropic (Bedrock) | AWS Bedrock tools | Complex reasoning, coding
Google Vertex AI | Gemini 2.5, Llama 3.1 | RAG over massive documents

2026 Strategy:
- Context caching = biggest cost saver
- Multi-provider strategy via an LLM router (LiteLLM); see the sketch below
- Don't lock into one provider
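
A minimal multi-provider sketch with LiteLLM (model ids are examples; provider API keys are read from environment variables):

import litellm

messages = [{"role": "user", "content": "Summarize the LoRA paper in two sentences."}]

# Same call shape for every provider; switching backends is just a model-string change
for model in ["gpt-4o-mini", "anthropic/claude-3-5-sonnet-20240620", "gemini/gemini-1.5-pro"]:
    response = litellm.completion(model=model, messages=messages)
    print(model, response.choices[0].message.content)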


Part 8: Evaluation (2026)

Leading Frameworks

Tools: DeepEval, W&B Weave, MLflow, Humanloop, Arize AI, Langfuse, RAGAS

2026 Priority: Traceability — linking scores to exact prompt/model/dataset versions

Evaluation Metrics

Category | Metrics
Core | Accuracy, relevance, factuality, toxicity, hallucination
RAG Retrieval | Precision@k, Recall@k, MRR, nDCG (see the sketch below)
RAG Generation | Faithfulness, relevance, citation coverage
End-to-End | Correctness, latency, cost, safety
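
A small self-contained sketch of the ranking metrics in the RAG Retrieval row (pure Python, no library assumptions):

def precision_at_k(retrieved, relevant, k):
    # Fraction of the top-k retrieved documents that are relevant
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    # Fraction of all relevant documents that appear in the top k
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)

def mrr(retrieved, relevant):
    # Reciprocal rank of the first relevant document (0 if none retrieved)
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

# Example: 2 of the 3 relevant docs are in the top 5, first hit at rank 2
print(precision_at_k(["d7", "d1", "d9", "d3", "d4"], {"d1", "d3", "d5"}, k=5))  # 0.4
print(recall_at_k(["d7", "d1", "d9", "d3", "d4"], {"d1", "d3", "d5"}, k=5))     # 0.666...
print(mrr(["d7", "d1", "d9", "d3", "d4"], {"d1", "d3", "d5"}))                  # 0.5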

Benchmarks

Standard: MMLU, BigBench, TruthfulQA, GSM8K, Hellaswag, ARC

RAG-Specific: RAGBench, CRAG, LegalBench-RAG, WixQA, T²-RAGBench


Part 9: Interview-Relevant Numbers

Memory & Cost

Metric | Value
Full FT memory | 16 GB per billion params
LoRA memory reduction | 5.6x (7B model)
PEFT cost reduction | 10-20x
LoRA quality retention | 90-95%
QLoRA quality retention | 80-90%
QLoRA memory savings | 33% vs LoRA
QLoRA runtime increase | 39%

Alignment Performance

Metric | Value
DPO vs RLHF (sentiment control) | Exceeds PPO
SFT accuracy | 88.3%
Minimum SFT examples | 50+
Typical SFT dataset | Tens of thousands of examples

Hardware Pricing (On-Demand)

GPU | VRAM | Price/Hour
RTX 4000 Ada | 20 GB | $0.76
RTX 6000 Ada / L40S | 48 GB | $1.57
AMD MI300X | 192 GB | $1.99
H100 | 80 GB | $3.39


Misconception: LoRA fully solves catastrophic forgetting

LoRA reduces forgetting compared to full fine-tuning, but it does not eliminate it. Research (arXiv:2504.01241) shows that even with LoRA, models lose 15-25% of performance on previous tasks during continual learning. The FIP method (Functionally Invariant Paths) shows better retention, but it is harder to implement.

Misconception: DPO is always better than RLHF

DPO is simpler and more stable, but RLHF with PPO outperforms DPO on tasks with a complex reward landscape, for example safety alignment with multiple constraints. GRPO (DeepSeek-R1) combines advantages of both approaches via group-relative advantages. The choice of method depends on the task, not on how new the method is.

Misconception: more SFT data = a better model

Sebastian Raschka and the TRL team have shown that 50-1,000 high-quality examples often give better results than 100K noisy ones. With multi-epoch training on static datasets, LoRA overfits quickly. Data quality beats quantity by a margin of up to 20% on metrics.


Interview Questions

Q: When would you choose DPO over RLHF, and vice versa?

❌ Red flag: "DPO is always better because it is simpler"

✅ Strong answer: "DPO is optimal when you have clean preference pairs and the alignment task is relatively straightforward: it is more stable and does not require a separate reward model. RLHF with PPO is preferable for a complex reward landscape with multiple constraints (safety + helpfulness + honesty), where a reward model can learn nuances that DPO misses. GRPO from DeepSeek-R1 is an intermediate option that uses group-relative advantages without a separate reward model."

Q: How does LoRA reduce fine-tuning costs, and what are its limitations?

❌ Red flag: "LoRA just freezes part of the weights"

✅ Strong answer: "LoRA adds low-rank matrices BA to the frozen weights: W' = W + (alpha/r)*BA. At rank 8-64 this amounts to 0.1-1% of the total parameter count. Memory drops by 5.6x for a 7B model, cost by 10-20x. Limitations: a 5-10% quality loss vs full FT, catastrophic forgetting is not fully eliminated, and optimizer choice (AdamW vs SGD) barely matters; the main factor is data quality."

Q: You are fine-tuning a 70B model with a budget of a single A100 80GB. What is your approach?

❌ Red flag: "Use full fine-tuning with gradient checkpointing"

✅ Strong answer: "QLoRA is the only way to fit a 70B model into 80GB. Load the model with 4-bit quantization and train the adapters in full precision. Expect a 10-20% quality loss vs full FT and a runtime about 39% longer than LoRA. For serving, use vLLM with an adapter-based setup: one base model in memory and N adapters of a few MB each. If 80-90% of the quality is not enough, consider a smaller model (7B-13B) with LoRA or full fine-tuning."


Sources

  1. Zylos AI — "LLM Fine-tuning Techniques 2026" (Jan 13, 2026)
  2. Hugging Face TRL Documentation — DPO, SFT, GRPO Trainers
  3. Sebastian Raschka — "Practical Tips for Finetuning LLMs Using LoRA"
  4. Cameron R. Wolfe — "Direct Preference Optimization"
  5. Databricks — "Efficient Fine-Tuning with LoRA"
  6. arXiv:2308.08747 — Catastrophic Forgetting in LLMs
  7. arXiv:2504.01241 — Catastrophic Forgetting: Comparative Analysis