Alignment and PEFT: RLHF, DPO, GRPO, LoRA, QLoRA¶
~3 minute read
Prerequisites: RLHF Progress | Alignment Methods
Full fine-tuning of a 70B-parameter model costs $50,000-100,000 and requires a cluster of 16 H100s. LoRA cuts that cost 10-20× while retaining 90-95% of quality. DPO has replaced the unstable PPO loop of RLHF in most production systems of 2025-2026: Llama 3, Mistral, and Qwen all use DPO or its variants. Understanding the trade-offs between alignment methods and PEFT is a mandatory skill for an ML engineer working with LLMs.
URL: Zylos AI, Hugging Face TRL, Sebastian Raschka | Type: alignment / rlhf / dpo / lora / peft | Date: February 2026 | Collection: Ralph Research PHASE 5
Part 1: Overview¶
Fine-Tuning Landscape 2026¶
Three Main Categories:
1. Alignment Methods: RLHF, DPO, GRPO — train models to follow human preferences
2. Parameter-Efficient Methods: LoRA, QLoRA — reduce compute costs 10-20×
3. Supervised Fine-Tuning (SFT): adapt pre-trained models to specific tasks
Key Insight 2026:
Data quality has emerged as the most critical success factor, consistently outweighing hyperparameter optimization in importance.
Part 2: Alignment Methods¶
2.1 Direct Preference Optimization (DPO)¶
Concept: Simplifies alignment by directly updating the language model using preference data, without requiring a separate reward model.
Advantages over RLHF:
- Stable and computationally lightweight
- No RL-loop instability (PPO issues)
- No sampling from the LM during training
- Minimal hyperparameter tuning
Performance:
- Exceeds PPO-based RLHF in sentiment control
- Matches or improves response quality in summarization/dialogue
- Substantially simpler to implement
Implementation:
```python
from trl import DPOConfig, DPOTrainer

# model, ref_model, tokenizer and preference_dataset are assumed to be defined
training_args = DPOConfig(output_dir="dpo-model", beta=0.1)  # beta = KL penalty strength

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,               # frozen reference policy
    args=training_args,
    train_dataset=preference_dataset,  # "prompt" / "chosen" / "rejected" columns
    processing_class=tokenizer,        # tokenizer= in older TRL releases
)
trainer.train()
```
2.2 Reinforcement Learning from Human Feedback (RLHF)¶
Three-Step Process (step 2 is sketched below):
1. Collect human feedback on LLM behavior
2. Train a reward model on that feedback
3. Fine-tune the LLM with RL against the reward model, typically via PPO (DPO and GRPO, covered in 2.1 and 2.3, avoid parts of this loop)
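To make step 2 concrete, here is a minimal hedged sketch of reward-model training with TRL's RewardTrainer; the base model name is a placeholder and `preference_dataset` (chosen/rejected pairs) is assumed to exist:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardConfig, RewardTrainer

# A scalar reward head (num_labels=1) is fit so that r(chosen) > r(rejected).
model = AutoModelForSequenceClassification.from_pretrained("Qwen/Qwen2.5-0.5B", num_labels=1)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")

trainer = RewardTrainer(
    model=model,
    args=RewardConfig(output_dir="reward-model"),
    train_dataset=preference_dataset,  # rows with "chosen" and "rejected" (assumed defined)
    processing_class=tokenizer,        # tokenizer= in older TRL releases
)
trainer.train()
```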
Challenges:
- More complex than DPO
- Requires separate reward-model training
- PPO optimization can be unstable
When to Use:
- Complex alignment requiring nuanced reward modeling
- When DPO is insufficient for a specific use case
2.3 GRPO (Group Relative Policy Optimization)¶
Concept: Introduced by DeepSeek (first described for DeepSeekMath, then popularized by the DeepSeek-R1 paper, arXiv:2501.12948, January 2025); gained wide adoption in 2025-2026.
Key Innovation: Uses group-relative advantages instead of absolute rewards: each prompt is sampled multiple times, and a completion's advantage is its reward normalized by the mean and standard deviation of its group, removing the need for a separate value/critic model.
Status: Supported in Hugging Face TRL library.
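A minimal GRPO sketch with TRL, assuming a recent release; the toy length-based reward, the model name, and `prompt_dataset` are placeholders for a real task-specific setup:

```python
from trl import GRPOConfig, GRPOTrainer

# Toy reward function: GRPO scores each completion in a group and uses
# group-relative (mean/std-normalized) advantages instead of a learned critic.
def reward_len(completions, **kwargs):
    # Assumes plain-text prompts, so each completion is a string.
    return [float(len(c)) for c in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B",
    reward_funcs=reward_len,
    train_dataset=prompt_dataset,  # dataset with a "prompt" column, assumed defined
    args=GRPOConfig(output_dir="grpo-out", num_generations=8),  # group size G = 8
)
trainer.train()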
2.4 RLAIF (AI Supervision)¶
Concept: Use AI models to fine-tune other AI models, reducing human annotation needs.
Related (but distinct) Concepts: Scalable Oversight (broader field), Superalignment (OpenAI's program for superintelligent AI alignment — different scope)
Applications:
- Reducing annotation costs
- Scaling alignment to larger models
- Self-improvement loops
Part 3: Parameter-Efficient Fine-Tuning (PEFT)¶
3.1 Cost and Memory Comparison¶
| Method | Memory Usage | Quality | Cost Reduction |
|---|---|---|---|
| Full Fine-Tuning | 16 GB/B params | 100% | 1× |
| LoRA | ~3 GB/B params | 90-95% | 10-20× |
| QLoRA | ~1 GB/B params | 80-90% | 10-20× |
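Where the 16 GB/B figure comes from (a rough estimate assuming mixed-precision training with Adam; activation memory and optimizer sharding shift the numbers): \(2\,(\text{bf16 weights}) + 2\,(\text{bf16 grads}) + 4\,(\text{fp32 master weights}) + 8\,(\text{Adam } m, v) = 16\) bytes per parameter, i.e. roughly 16 GB per billion parameters.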
Hardware Implications:
| Model Size | Full FT | LoRA | QLoRA |
|---|---|---|---|
| 7B | 2× A100 80GB | RTX 4090 (24GB) | RTX 3090 (24GB) |
| 70B | 16× H100 (or 8× with ZeRO-Infinity offload) | 2× A100 (80GB) | 1× A100 (80GB) |
3.2 LoRA (Low-Rank Adaptation)¶
How It Works:
- Adds small trainable low-rank matrices to frozen weights
- Formula: \(W' = W + \frac{\alpha}{r}BA\)
Performance: Recovers 90-95% of full fine-tuning quality
Key Hyperparameters:
| Parameter | Typical Range | Effect |
|---|---|---|
| Rank (r) | 8-64 | Higher = more params, better adaptation |
| Alpha (α) | 16-32 | Scales the update magnitude |
| Target Modules | all linear | Targeting all layers improves quality |
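As an illustration of these hyperparameters, here is a minimal sketch using the Hugging Face peft library; the base model name and the specific r/alpha values are arbitrary examples, not recommendations, and `target_modules="all-linear"` requires a recent peft release:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")

lora_cfg = LoraConfig(
    r=16,                         # rank of the low-rank matrices B and A
    lora_alpha=32,                # update is scaled by alpha / r
    target_modules="all-linear",  # adapt every linear projection
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of all params
```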
Best Practices:
- For static datasets, avoid multi-epoch training (overfitting)
- Data quality > hyperparameter tuning
- Start simple, measure everything
- Optimizer choice (AdamW vs SGD) shows minimal variation
3.3 QLoRA (Quantized LoRA)¶
How It Works:
- Loads the pre-trained model as 4-bit quantized weights
- Only the LoRA adapters are trained in full precision
Performance:
- 80-90% of full fine-tuning quality
- 33% memory savings vs LoRA
- ~39% longer runtime than LoRA
When to Use:
- Start with LoRA if the base model fits in GPU memory
- Use QLoRA to squeeze large models onto limited VRAM
Notable Achievement: fine-tuning a 7B model on a single 24 GB consumer GPU
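A minimal sketch of the QLoRA loading recipe with transformers + bitsandbytes + peft, assuming recent versions of these libraries; the model name is a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Frozen base weights are loaded in 4-bit NF4; only the LoRA adapters stay trainable.
bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",   # placeholder model
    quantization_config=bnb_cfg,
    device_map="auto",
)
model = get_peft_model(base, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))
```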
3.4 Production PEFT Deployment¶
Adapter-Based Serving (vLLM), sketched in the example below:
- 1 base model in GPU memory
- N small adapter files (MBs each)
- Serves N fine-tuned versions efficiently
Benefits:
- Small checkpoints (a few MBs per task)
- Prevents catastrophic forgetting
- Less prone to overfitting
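A hedged sketch of multi-adapter serving with vLLM's LoRA support; the model name and adapter path are placeholders, and exact flags vary between vLLM releases:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# One base model resident in GPU memory; adapters are attached per request.
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True, max_loras=4)

outputs = llm.generate(
    ["Summarize the attached quarterly report."],
    SamplingParams(max_tokens=128),
    lora_request=LoRARequest("summarize-adapter", 1, "/adapters/summarize"),  # placeholder path
)
print(outputs[0].outputs[0].text)
```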
Part 4: Supervised Fine-Tuning (SFT)¶
4.1 Core Concepts¶
Purpose: Transform pre-trained "base model" into instruction-following assistant.
Process: Next-token prediction on high-quality instruction data.
Data Requirements:
- Tens of thousands of examples is typical
- Start with 50+ well-crafted examples
- Quality > quantity (critical insight)
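For illustration, one common record layout accepted by TRL's SFTTrainer is the conversational "messages" format; this is a sketch, and the actual schema should match your training stack and chat template:

```python
# One SFT training example in conversational ("messages") format.
example = {
    "messages": [
        {"role": "user", "content": "Explain LoRA in two sentences."},
        {"role": "assistant", "content": "LoRA freezes the base weights and trains small "
                                         "low-rank matrices added to them, which cuts "
                                         "trainable parameters and memory."},
    ]
}
```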
4.2 Data Preparation Pipeline¶
Seven-Stage Pipeline:
1. Dataset Preparation
2. Model Initialization
3. Training Environment Setup
4. Fine-Tuning
5. Evaluation and Validation
6. Deployment
7. Monitoring and Maintenance
Data Techniques:
| Technique | Tools | Purpose |
|---|---|---|
| Balancing | SMOTE, ensemble | Handle class imbalance |
| Augmentation | NLP-AUG, TextAttack | Generate variations |
| Annotation | Snorkel | Precise labeling |
| Safety | Custom filters | Remove harmful content |
Augmentation Methods:
- Word embeddings: replace words with semantically similar ones
- Back translation: paraphrase via round-trip translation
- Adversarial attacks: generate challenging examples
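A small hedged sketch of the first two augmentation ideas using the nlpaug library (assuming it is installed; the augmenter classes and translation model names are examples, and back translation downloads its models on first use):

```python
import nlpaug.augmenter.word as naw

text = "LoRA fine-tuning recovers most of full fine-tuning quality."

# Synonym / word-embedding substitution: swap words for semantically similar ones.
syn_aug = naw.SynonymAug(aug_src="wordnet")
print(syn_aug.augment(text))

# Back translation: paraphrase by translating to another language and back.
bt_aug = naw.BackTranslationAug(
    from_model_name="facebook/wmt19-en-de",
    to_model_name="facebook/wmt19-de-en",
)
print(bt_aug.augment(text))
```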
4.3 SFT vs RL¶
| Aspect | SFT | RL (DPO/RLHF) |
|---|---|---|
| Accuracy | 88.3% | Variable |
| Learning | Intuitive + counter-intuitive rules | Optimization-based |
| Role | Foundation for correctness | Refine behavior |
Relationship: SFT and RL are complementary, not competing.
Part 5: Hugging Face TRL (2026)¶
Overview¶
TRL: Transformer Reinforcement Learning library for post-training foundation models.
Supported Methods:
- SFT (Supervised Fine-Tuning)
- DPO (Direct Preference Optimization)
- GRPO (Group Relative Policy Optimization)
Key Features:
- Automatic tokenizer updates
- Tight PEFT integration
- Support for various architectures
- Scales across hardware setups
SFT Trainer Example¶
```python
from trl import SFTTrainer
from datasets import load_dataset

dataset = load_dataset("trl-lib/Capybara", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,
)
trainer.train()
```
Training Modes:
- Default: Loss on all tokens (full sequence)
- Use DataCollatorForCompletionOnlyLM with response_template for completion-only loss
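A hedged sketch of the completion-only setup; the response_template string and model name are placeholders and must match how responses are marked in your dataset, and in some TRL releases this collator is superseded by built-in completion-only handling:

```python
from transformers import AutoTokenizer
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")

# Mask the loss on prompt tokens; only tokens after response_template are scored.
collator = DataCollatorForCompletionOnlyLM(response_template="### Answer:", tokenizer=tokenizer)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,   # assumed formatted with the same "### Answer:" marker
    data_collator=collator,
)
```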
Part 6: Catastrophic Forgetting¶
Problem Overview¶
Definition: Model forgets previously learned information while acquiring new knowledge.
Scale Effects:
- Observed in models from 1B to 7B+ parameters
- Larger models experience more severe forgetting
Mitigation Techniques¶
| Technique | Effectiveness | Notes |
|---|---|---|
| Full Fine-Tuning | ❌ Highest forgetting | Best task performance |
| PEFT (LoRA) | ⚠️ Partial protection | Better than full FT |
| FIP | ✅ Best retention | Functionally Invariant Paths |
Key Finding: LoRA alone does NOT fully mitigate catastrophic forgetting in continual learning.
Best Practices¶
- Use PEFT for better preservation
- Consider FIP for sequential task learning
- Monitor previous task performance
- Accept some forgetting as unavoidable
Part 7: Provider Comparison (2026)¶
| Provider | Fine-Tuning Support | Best Use Case |
|---|---|---|
| OpenAI | Custom training | Multimodal, general chat |
| Anthropic (Bedrock) | AWS Bedrock tools | Complex reasoning, coding |
| Google Vertex AI | Gemini 2.5, Llama 3.1 | RAG over massive documents |
2026 Strategy:
- Context caching = biggest cost saver
- Multi-provider strategy via an LLM router such as LiteLLM (see the sketch below)
- Don't lock into one provider
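A hedged sketch of a multi-provider setup with LiteLLM's Router; provider model IDs are placeholders, credentials are read from environment variables, and exact config keys may differ between LiteLLM versions:

```python
from litellm import Router

# Route one logical model name across two providers; LiteLLM load-balances and falls back.
router = Router(
    model_list=[
        {"model_name": "chat", "litellm_params": {"model": "openai/gpt-4o-mini"}},
        {"model_name": "chat", "litellm_params": {"model": "anthropic/claude-3-5-haiku-20241022"}},
    ]
)

response = router.completion(
    model="chat",
    messages=[{"role": "user", "content": "One-sentence summary of LoRA."}],
)
print(response.choices[0].message.content)
```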
Part 8: Evaluation (2026)¶
Leading Frameworks¶
Tools: DeepEval, W&B Weave, MLflow, Humanloop, Arize AI, Langfuse, RAGAS
2026 Priority: Traceability — linking scores to exact prompt/model/dataset versions
Evaluation Metrics¶
| Category | Metrics |
|---|---|
| Core | Accuracy, relevance, factuality, toxicity, hallucination |
| RAG Retrieval | Precision@k, Recall@k, MRR, nDCG |
| RAG Generation | Faithfulness, relevance, citation coverage |
| End-to-End | Correctness, latency, cost, safety |
Benchmarks¶
Standard: MMLU, BigBench, TruthfulQA, GSM8K, Hellaswag, ARC
RAG-Specific: RAGBench, CRAG, LegalBench-RAG, WixQA, T²-RAGBench
Part 9: Interview-Relevant Numbers¶
Memory & Cost¶
| Metric | Value |
|---|---|
| Full FT memory | 16 GB per billion params |
| LoRA memory reduction | 5.6× for 7B |
| PEFT cost reduction | 10-20× |
| LoRA quality retention | 90-95% |
| QLoRA quality retention | 80-90% |
| QLoRA memory savings | 33% vs LoRA |
| QLoRA runtime increase | 39% |
Alignment Performance¶
| Metric | Value |
|---|---|
| DPO vs RLHF sentiment | Exceeds PPO |
| SFT accuracy | 88.3% |
| Minimum SFT examples | 50+ |
| Typical SFT dataset | Tens of thousands |
Hardware Pricing (On-Demand)¶
| GPU | VRAM | Price/Hour |
|---|---|---|
| RTX 4000 Ada | 20 GB | $0.76 |
| RTX 6000 Ada/L40S | 48 GB | $1.57 |
| AMD MI300X | 192 GB | $1.99 |
| H100 | 80 GB | $3.39 |
Misconception: LoRA fully solves catastrophic forgetting
LoRA reduces forgetting compared to full fine-tuning but does not eliminate it. Research (arXiv:2504.01241) shows that even with LoRA, models lose 15-25% of performance on previous tasks during continual learning. FIP (Functionally Invariant Paths) shows better retention but is harder to implement.
Misconception: DPO is always better than RLHF
DPO is simpler and more stable, but RLHF with PPO outperforms DPO on tasks with a complex reward landscape -- for example, safety alignment with multiple constraints. GRPO (DeepSeek-R1) combines the advantages of both approaches via group-relative advantages. The choice of method depends on the task, not on how new the method is.
Misconception: more SFT data = a better model
Sebastian Raschka and the TRL team have shown that 50-1,000 high-quality examples often give a better result than 100K noisy ones. With multi-epoch training on static datasets, LoRA overfits quickly. Data quality beats quantity by up to 20% on the metrics.
Interview Questions¶
Q: When should you choose DPO over RLHF, and vice versa?
Red flag: "DPO is always better because it's simpler"
Strong answer: "DPO is optimal when you have clean preference pairs and the alignment task is relatively straightforward -- it is more stable and does not require a separate reward model. RLHF with PPO is preferable for a complex reward landscape with multiple constraints (safety + helpfulness + honesty), where a reward model can learn nuances that DPO misses. GRPO from DeepSeek-R1 is an intermediate option that uses group-relative advantages without a separate reward model."
Q: How does LoRA reduce fine-tuning costs, and what are its limitations?
Red flag: "LoRA just freezes part of the weights"
Strong answer: "LoRA adds low-rank matrices BA to the frozen weights: W' = W + (alpha/r)*BA. At rank 8-64 this is 0.1-1% of the total parameter count. Memory drops 5.6x for a 7B model, cost drops 10-20x. Limitations: a 5-10% quality loss vs full FT, catastrophic forgetting is not fully eliminated, and optimizer choice (AdamW vs SGD) has almost no effect -- data quality is the main factor."
Q: You are fine-tuning a 70B model with a budget of a single A100 80GB. What is your approach?
Red flag: "Use full fine-tuning with gradient checkpointing"
Strong answer: "QLoRA is the only way to fit a 70B model into 80GB: load the model with 4-bit quantization and train the adapters in full precision. Expect a 10-20% quality loss vs full FT and roughly 39% longer runtime than LoRA. For serving, use vLLM with an adapter-based setup: one base model in memory, N adapters of a few MB each. If 80-90% quality is not enough, consider a smaller model (7B-13B) with LoRA or full FT."
Sources¶
- Zylos AI — "LLM Fine-tuning Techniques 2026" (Jan 13, 2026)
- Hugging Face TRL Documentation — DPO, SFT, GRPO Trainers
- Sebastian Raschka — "Practical Tips for Finetuning LLMs Using LoRA"
- Cameron R. Wolfe — "Direct Preference Optimization"
- Databricks — "Efficient Fine-Tuning with LoRA"
- arXiv:2308.08747 — Catastrophic Forgetting in LLMs
- arXiv:2504.01241 — Catastrophic Forgetting: Comparative Analysis