LoRA and Fine-Tuning Variants¶
~9 min read
Prerequisites: LLM Fine-Tuning Techniques, LLM Quantization
With d=k=4096 and rank r=8, LoRA trains 65.5K parameters instead of 16.7M, a 256x reduction. For a 70B model this cuts training VRAM from 1.1 TB (full fine-tuning) to 280 GB (LoRA) or 48 GB (QLoRA on a single A100). As of 2026, 90%+ of PEFT projects use LoRA or one of its variants, with rsLoRA and PiSSA adding 3-5% quality at no extra resource cost.
LoRA formula (W = W0 + BA), QLoRA (4-bit NF4), DoRA (magnitude-direction decomposition), rsLoRA (rank-stabilized), PiSSA (SVD init), AdaLoRA, PyTorch implementation from scratch, PEFT/BitsAndBytes stack, FinLoRA benchmark, production considerations (2025-2026)
Key Concepts¶
LoRA (Low-Rank Adaptation) freezes the pretrained weights and adds trainable low-rank matrices:
- \(W_0 \in \mathbb{R}^{d \times k}\) -- frozen pretrained weight
- \(B \in \mathbb{R}^{d \times r}\), \(A \in \mathbb{R}^{r \times k}\) -- trainable matrices
- \(r \ll \min(d, k)\) -- rank (typically 8-64)
Parameter Reduction¶
| d, k | r | Full Params | LoRA Params | Reduction |
|---|---|---|---|---|
| 4096 | 4 | 16.7M | 32.8K | 512x |
| 4096 | 8 | 16.7M | 65.5K | 256x |
| 4096 | 16 | 16.7M | 131K | 128x |
| 4096 | 64 | 16.7M | 524K | 32x |
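The table values follow directly from the parameter counts; a quick sanity check in plain Python (no assumptions beyond the formulas above):

```python
def lora_params(d: int, k: int, r: int) -> int:
    """Trainable parameters of one LoRA adapter: B (d x r) plus A (r x k)."""
    return r * (d + k)

def reduction(d: int, k: int, r: int) -> float:
    """How many times fewer parameters than the full d x k weight."""
    return (d * k) / lora_params(d, k, r)

for r in (4, 8, 16, 64):
    print(r, lora_params(4096, 4096, r), reduction(4096, 4096, r))
# r=8 gives 65536 trainable params and a 256.0x reduction, matching the table
```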
Variant Ranking (2026)¶
| Method | GPU Memory | Performance | Stability | Best For |
|---|---|---|---|---|
| LoRA | High | Good | Excellent | Stable training |
| QLoRA | Very Low | Slightly below LoRA | Good | Limited VRAM |
| DoRA | High | Better | Excellent | Higher capacity |
| rsLoRA | Same as LoRA | Better | Excellent | Default choice |
| PiSSA | Same as LoRA | Best | Good | Max performance |
1. PyTorch Implementation¶
LoRA Layer¶
```python
import torch
import torch.nn as nn
import math

class LoRALayer(nn.Module):
    def __init__(self, in_features, out_features, rank=8, alpha=16.0, dropout=0.0):
        super().__init__()
        self.scaling = alpha / rank
        self.lora_A = nn.Parameter(torch.zeros(rank, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        self.dropout = nn.Dropout(dropout) if dropout > 0 else nn.Identity()
        # A: Kaiming uniform, B: zeros -> delta W = BA starts at zero
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        nn.init.zeros_(self.lora_B)

    def forward(self, x):
        x = self.dropout(x)
        return (x @ self.lora_A.T) @ self.lora_B.T * self.scaling
```
LoRA Linear Wrapper¶
```python
class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, rank=8, alpha=16.0, dropout=0.0, bias=True):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features, bias=bias)
        self.lora = LoRALayer(in_features, out_features, rank, alpha, dropout)

    def forward(self, x):
        return self.linear(x) + self.lora(x)

    def merge_weights(self):
        # fold delta W = scaling * B A into the frozen weight; adapters are
        # zeroed afterwards so the layer keeps producing identical outputs
        with torch.no_grad():
            self.linear.weight.data += (
                self.lora.lora_B @ self.lora.lora_A
            ) * self.lora.scaling
            self.lora.lora_A.zero_()
            self.lora.lora_B.zero_()
```
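merge_weights relies on the identity \(W_0 x + s\,BAx = (W_0 + s\,BA)x\). A minimal numeric check with hand-picked 2x2 matrices (plain Python, no torch, values chosen so the arithmetic is exact):

```python
def matmul(X, Y):
    """Naive matrix product for small nested lists."""
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def matvec(X, v):
    return [sum(row[j] * v[j] for j in range(len(v))) for row in X]

W0 = [[1.0, 2.0], [3.0, 4.0]]   # frozen weight
B  = [[1.0], [0.5]]             # d x r, with r = 1
A  = [[2.0, -1.0]]              # r x k
s  = 2.0                        # scaling = alpha / r
x  = [1.0, 1.0]

# adapter path: W0 x + s * B (A x)
h_adapter = [w + s * b for w, b in zip(matvec(W0, x), matvec(B, matvec(A, x)))]

# merged path: (W0 + s * BA) x
BA = matmul(B, A)
W_merged = [[W0[i][j] + s * BA[i][j] for j in range(2)] for i in range(2)]
h_merged = matvec(W_merged, x)

print(h_adapter, h_merged)  # [5.0, 8.0] [5.0, 8.0]
```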
Model Wrapper¶
```python
class LoRAModel(nn.Module):
    def __init__(self, model, rank=8, alpha=16.0, dropout=0.0,
                 target_modules=None):
        super().__init__()
        self.model = model
        self.target_modules = target_modules or ["q_proj", "v_proj", "k_proj", "o_proj"]
        self._replace_with_lora(rank, alpha, dropout)
        self._freeze_base_model()

    def _replace_with_lora(self, rank, alpha, dropout):
        # collect targets first: mutating the module tree while iterating
        # named_modules() is unsafe
        targets = [
            (name, module) for name, module in self.model.named_modules()
            if isinstance(module, nn.Linear)
            and any(t in name for t in self.target_modules)
        ]
        for name, module in targets:
            lora_linear = LoRALinear(
                module.in_features, module.out_features,
                rank=rank, alpha=alpha, dropout=dropout,
                bias=module.bias is not None,
            )
            lora_linear.linear.weight.data = module.weight.data.clone()
            if module.bias is not None:
                lora_linear.linear.bias.data = module.bias.data.clone()
            parent = self.model.get_submodule('.'.join(name.split('.')[:-1]))
            setattr(parent, name.split('.')[-1], lora_linear)

    def _freeze_base_model(self):
        for name, param in self.model.named_parameters():
            if 'lora_' not in name:
                param.requires_grad = False

    def forward(self, *args, **kwargs):
        return self.model(*args, **kwargs)

    def merge_and_save(self, path):
        for module in self.model.modules():
            if isinstance(module, LoRALinear):
                module.merge_weights()
        torch.save(self.model.state_dict(), path)
```
Gradient Flow¶
Since \(W_0\) is frozen, backpropagation touches only the adapter. With \(h = W_0 x + s\,BAx\) and \(s = \alpha/r\): \(\frac{\partial L}{\partial B} = s\,\frac{\partial L}{\partial h}(Ax)^T\), \(\frac{\partial L}{\partial A} = s\,B^T \frac{\partial L}{\partial h} x^T\). Because B is zero-initialized, A receives zero gradient on the very first step; B moves first, then both train.
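A standalone sketch (assuming PyTorch is available) that mirrors the LoRA forward pass with a frozen base weight, confirming that gradients reach only the adapter:

```python
import torch

torch.manual_seed(0)
d, k, r, s = 6, 4, 2, 2.0                      # toy sizes; s = alpha / r
W0 = torch.randn(d, k, requires_grad=False)    # frozen pretrained weight
A = torch.randn(r, k, requires_grad=True)      # trainable, Kaiming-like random
B = torch.zeros(d, r, requires_grad=True)      # trainable, zero-init

x = torch.randn(k)
h = W0 @ x + s * (B @ (A @ x))                 # LoRA forward
h.sum().backward()

print(W0.grad)                 # None: the frozen weight receives no gradient
print(A.grad is not None, B.grad is not None)  # True True
# subtlety of zero-init B: on step 0, A's gradient is exactly zero
print(torch.all(A.grad == 0).item())           # True
```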
2. QLoRA (Quantized LoRA)¶
The base model is stored in 4-bit; the LoRA adapters stay in full precision.
| Innovation | Description |
|---|---|
| 4-bit NF4 | NormalFloat for normally distributed weights |
| Double quantization | Quantizes quantization constants |
| Paged optimizers | CPU offload for memory spikes |
Memory Comparison¶
| Model | Full FT | LoRA | QLoRA |
|---|---|---|---|
| 7B | 112 GB | 28 GB | 6 GB |
| 13B | 208 GB | 52 GB | 10 GB |
| 70B | 1.1 TB | 280 GB | 48 GB |
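The table is consistent with a rough bytes-per-parameter model. The constants below are hand-fit assumptions, not exact figures: ~16 B/param for full fine-tuning (weights, gradients, Adam moments), ~4 B/param for LoRA over a half-precision base, and well under 1 B/param for a 4-bit QLoRA base:

```python
BYTES_PER_PARAM = {          # rough, hand-fit to the table above
    "full_ft": 16.0,         # weights + grads + Adam moments (mixed precision)
    "lora":     4.0,         # frozen fp16/bf16 base + small adapter overhead
    "qlora":    0.8,         # 4-bit base + quant constants + paged optimizer
}

def vram_gb(n_params: float, method: str) -> float:
    """Very rough training-VRAM estimate in GB (decimal)."""
    return n_params * BYTES_PER_PARAM[method] / 1e9

print(vram_gb(7e9, "full_ft"))   # 112.0
print(vram_gb(70e9, "lora"))     # 280.0
```

The QLoRA column fits only approximately (actual usage also depends on sequence length and batch size), which is why the constant there is the loosest.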
PEFT/BitsAndBytes Stack¶
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

lora_config = LoraConfig(
    r=8, lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05, bias="none", task_type="CAUSAL_LM",
)

model_name = "mistralai/Mistral-7B-v0.1"  # any causal LM checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```
3. DoRA (Weight-Decomposed LoRA)¶
Decompose: \(W = m \cdot \frac{V}{\|V\|}\), apply LoRA to direction V only.
```python
import torch.nn.functional as F

class DoRALinear(nn.Module):
    def __init__(self, in_features, out_features, rank=8, alpha=16.0):
        super().__init__()
        self.weight = nn.Parameter(torch.zeros(out_features, in_features))
        self.lora_A = nn.Parameter(torch.zeros(rank, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        self.magnitude = nn.Parameter(torch.ones(out_features))
        self.scaling = alpha / rank
        nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        nn.init.zeros_(self.lora_B)

    def forward(self, x):
        # direction: row-normalized weight, adapted by the low-rank delta
        V = self.weight / (self.weight.norm(dim=1, keepdim=True) + 1e-6)
        delta_V = (self.lora_B @ self.lora_A) * self.scaling
        V_new = V + delta_V
        V_new = V_new / (V_new.norm(dim=1, keepdim=True) + 1e-6)
        # learned per-row magnitude rescales the new direction
        W_new = self.magnitude.unsqueeze(1) * V_new
        return F.linear(x, W_new)
```
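The decomposition is lossless at initialization: with row magnitudes \(m_i = \|W_i\|\) and unit directions \(V_i = W_i / \|W_i\|\), the product \(m_i V_i\) reconstructs W exactly. A plain-Python check on a small matrix:

```python
import math

W = [[3.0, 4.0], [8.0, 6.0]]                 # toy 2 x 2 weight

m = [math.hypot(*row) for row in W]          # per-row magnitude
V = [[w / mi for w in row] for row, mi in zip(W, m)]   # unit-norm directions

W_rebuilt = [[mi * v for v in row] for row, mi in zip(V, m)]
print(m)          # [5.0, 10.0]
print(W_rebuilt)  # [[3.0, 4.0], [8.0, 6.0]] up to float rounding
```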
4. rsLoRA (Rank-Stabilized)¶
Standard LoRA: \(\Delta W = \frac{\alpha}{r} BA\). rsLoRA: \(\Delta W = \frac{\alpha}{\sqrt{r}} BA\)
Standard LoRA degrades as rank increases; rsLoRA stays stable or improves at higher ranks.
"Start with rsLoRA: strictly better than standard LoRA with no downsides."
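The difference is only the denominator, but it matters at high rank: with a fixed alpha = 16, the standard factor alpha/r shrinks the update 8x when going from r=8 to r=64, while alpha/sqrt(r) shrinks it only ~2.8x:

```python
import math

alpha = 16.0
for r in (8, 16, 32, 64, 128):
    std = alpha / r              # standard LoRA scaling
    rs = alpha / math.sqrt(r)    # rsLoRA scaling
    print(f"r={r:3d}  standard={std:.3f}  rsLoRA={rs:.3f}")
# r=8:  standard=2.000, rsLoRA=5.657
# r=64: standard=0.250, rsLoRA=2.000
```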
5. PiSSA (Principal Singular Values Adaptation)¶
SVD-based initialization: \(W = U \Sigma V^T \approx U_r \Sigma_r V_r^T\)
Initialize A and B from the principal singular values; the low-magnitude residual stays frozen. Faster convergence and the best performance among the variants.
| Model | Task | PiSSA | LoRA | Improvement |
|---|---|---|---|---|
| Gemma-7B | GSM8K | 77.7% | 74.53% | +3.17% |
| Mistral-7B | GSM8K | 72.86% | 67.7% | +5.16% |
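A sketch of the initialization (assuming PyTorch). The split of the principal factor shown here, B = U_r Σ_r^{1/2} and A = Σ_r^{1/2} V_r^T, is one standard way to factor U_r Σ_r V_r^T into the LoRA shapes; the paper's naming of A and B may differ:

```python
import torch

torch.manual_seed(0)
d, k, r = 16, 12, 4
W = torch.randn(d, k)

# full SVD: W = U diag(S) Vh
U, S, Vh = torch.linalg.svd(W, full_matrices=False)

sqrt_S = torch.sqrt(S[:r])
B = U[:, :r] * sqrt_S             # d x r: columns scaled by sqrt(singular values)
A = sqrt_S[:, None] * Vh[:r, :]   # r x k: rows scaled the same way

W_res = W - B @ A                 # frozen residual; B, A are the trainable adapter

# principal part + residual reconstructs W, and B A has rank exactly r
print(torch.allclose(W_res + B @ A, W, atol=1e-5))      # True
print(int(torch.linalg.matrix_rank(B @ A)))             # 4
```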
6. Other Variants¶
| Variant | Key Innovation |
|---|---|
| AdaLoRA | Learns which layers need higher rank, prunes during training |
| NoRA (ICCV 2025) | Nested structure for better initialization |
7. Comparison¶
Performance (Mistral-7B GSM8K)¶
| Method | Score | Memory (7B) | Speed |
|---|---|---|---|
| Full FT | 73.0% | 112 GB | Slow |
| PiSSA | 72.86% | 28 GB | Fast |
| DoRA | ~70% | 28 GB | Fast |
| rsLoRA | ~69% | 28 GB | Fast |
| LoRA | 67.7% | 28 GB | Fast |
| QLoRA | ~66% | 6 GB | Medium |
Quality Benchmarks¶
| Method | MMLU | GSM8K | Quality vs Full |
|---|---|---|---|
| Full fine-tuning | 65.2 | 52.1 | 100% |
| LoRA (r=16) | 64.8 | 51.8 | 99% |
| LoRA (r=8) | 64.1 | 50.9 | 97% |
| QLoRA | 64.5 | 51.2 | 98% |
8. Selection Guide¶
Limited VRAM (<16GB)? -> QLoRA
Maximum performance? -> PiSSA
Drop-in LoRA upgrade? -> rsLoRA
Higher learning capacity? -> DoRA
Stable, battle-tested? -> Standard LoRA
| Use Case | Recommended | Reason |
|---|---|---|
| Consumer GPU (RTX 4090) | QLoRA | Fits in 24 GB |
| Single A100 (40 GB) | QLoRA or LoRA | Both work |
| Multi-GPU (2x A100) | rsLoRA or DoRA | No memory constraint |
| Production deployment | rsLoRA | Best tradeoff |
| Research/benchmarking | PiSSA | Maximum performance |
| Continual learning | DoRA | Better stability |
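The decision list above can be encoded as a tiny helper (illustrative only; the thresholds and goal labels are this guide's conventions, not hard rules):

```python
def pick_variant(vram_gb: float, goal: str = "default") -> str:
    """Map available VRAM and objective to a variant, per the guide above."""
    if vram_gb < 16:
        return "QLoRA"               # 4-bit base is the only fit
    if goal == "max_performance":
        return "PiSSA"
    if goal == "higher_capacity":
        return "DoRA"
    if goal == "battle_tested":
        return "LoRA"
    return "rsLoRA"                  # default: strict LoRA upgrade

print(pick_variant(12))                      # QLoRA
print(pick_variant(40, "max_performance"))   # PiSSA
print(pick_variant(80))                      # rsLoRA
```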
9. Best Practices¶
Hyperparameters¶
| Parameter | Recommended | Notes |
|---|---|---|
| Rank (r) | 8-64 | Higher = more capacity |
| Alpha | 16-32 (alpha = 2r typical) | Scaling = alpha/r |
| Target modules | q_proj, v_proj minimum | All linear = best quality |
| Learning rate | 1e-4 to 5e-4 | Higher than full FT |
| Dropout | 0.05-0.1 | Optional regularization |
| Warmup ratio | 0.03-0.05 | Cosine annealing |
Target Modules (LLaMA/Mistral)¶
```python
# Attention only (minimal)
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]

# All linear layers (best quality)
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                  "gate_proj", "up_proj", "down_proj"]
```
Common Pitfalls¶
| Pitfall | Solution |
|---|---|
| Too high rank | Start r=16, increase if needed |
| Wrong alpha | Keep alpha = 2r |
| Learning too fast | Reduce LR, increase warmup |
| Catastrophic forgetting | Lower LR, LoRA on all layers |
LoRA does NOT prevent catastrophic forgetting
LoRA reduces the risk thanks to the small number of trainable parameters, but it does not eliminate it. With a high LR (>5e-4) or long training runs the model still loses general capabilities. Mitigations: a low LR (1e-4), LoRA on ALL linear layers (not just q/v), and regular checks on held-out general benchmarks.
alpha and r are coupled -- scaling = alpha/r
alpha=16, r=8 gives the same scaling (2.0) as alpha=32, r=16. Many people tune both, which is pointless. Fix alpha = 2 * r and tune ONLY the rank. rsLoRA sidesteps the issue differently: scaling = alpha/sqrt(r), which stays stable at any rank.
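The redundancy is easy to see numerically; the rule of thumb above (alpha = 2r) keeps the standard-LoRA scale constant at 2.0 for every rank:

```python
import math

def scaling(alpha: float, r: int, rs: bool = False) -> float:
    """Effective update scale: alpha/r (standard) or alpha/sqrt(r) (rsLoRA)."""
    return alpha / math.sqrt(r) if rs else alpha / r

# alpha=16, r=8 and alpha=32, r=16 are the *same* standard-LoRA configuration
print(scaling(16, 8), scaling(32, 16))               # 2.0 2.0

# with alpha = 2r the standard scale is always 2.0
print([scaling(2 * r, r) for r in (8, 16, 32, 64)])  # [2.0, 2.0, 2.0, 2.0]
```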
Frameworks (2025-2026)¶
| Framework | Pros | Cons |
|---|---|---|
| HuggingFace PEFT | Standard, well-documented | Verbose API |
| Unsloth | 2x faster, memory efficient | Newer |
| Axolotl | Config-driven YAML | Steep learning curve |
Production Considerations¶
- Adapter merging: merge LoRA into base weights for zero inference overhead
- Multi-LoRA serving: multiple adapters for different tasks on same base model
- A/B testing: compare LoRA configs before deployment
For Interviews¶
Q: "Explain LoRA and why it is needed."¶
LoRA freezes the pretrained weights W0 and adds trainable low-rank matrices: W' = W0 + (alpha/r) * BA, where B in R^(d x r), A in R^(r x k), r << min(d,k). With d=k=4096, r=8: trainable params = 65.5K vs 16.7M (256x reduction). Memory: 10-15% of full fine-tuning. Quality: 97-99% of full FT (MMLU, GSM8K). Inference: zero overhead after merging. Initialization: A is Kaiming uniform, B is zeros (the output starts at zero).
Q: "Compare LoRA, QLoRA, DoRA, rsLoRA."¶
QLoRA: base model in 4-bit NF4 + LoRA adapters in full precision. A 70B model fits in 48 GB (vs 1.1 TB full FT). Innovations: NF4, double quantization, paged optimizers. Trade-off: slightly slower training, marginal quality loss. DoRA: decompose W = m * V/||V|| (magnitude x direction), apply LoRA to V only. +2-4% vs standard LoRA. rsLoRA: scaling alpha/sqrt(r) instead of alpha/r -- stable across rank values, strictly better. PiSSA: SVD-based init, best performance (+3-5% vs LoRA), slightly longer init.
Q: "Write the LoRA forward pass."¶
h = W0 * x + (alpha/r) * B * A * x. A in R^(r x k) initialized Kaiming, B in R^(d x r) initialized zeros. Gradients (up to the alpha/r scaling): dL/dB = dL/dh * (Ax)^T, dL/dA = B^T * dL/dh * x^T. Merge for inference: W_merged = W0 + B * A * (alpha/r), zero overhead.
Key Numbers¶
| Fact | Value |
|---|---|
| Parameter reduction (r=8, d=4096) | 256x |
| Parameter reduction (r=4) | 512x |
| Memory reduction LoRA vs full FT | 75-90% |
| Memory reduction QLoRA vs full FT | 95%+ |
| QLoRA 70B model memory | 48 GB |
| LoRA quality vs full FT | 97-99% |
| PiSSA vs LoRA (GSM8K) | +3-5% |
| DoRA vs LoRA | +2-4% |
| FinLoRA avg gain over base | 36% |
| Training speed LoRA vs full FT | 2-3x faster |
| LoRA adoption in PEFT | 90%+ |
Formulas¶
LoRA Forward¶
\(h = W_0 x + \frac{\alpha}{r} B A x\)
rsLoRA Forward¶
\(h = W_0 x + \frac{\alpha}{\sqrt{r}} B A x\)
DoRA Forward¶
\(W' = m \cdot \frac{V + \Delta V}{\|V + \Delta V\|}, \quad \Delta V = \frac{\alpha}{r} B A\)
Parameter Reduction¶
\(\frac{\text{full}}{\text{LoRA}} = \frac{d \cdot k}{r (d + k)}\)
PiSSA Init¶
\(W = U \Sigma V^T, \quad B = U_{[:, :r]} \Sigma_{[:r]}^{1/2}, \quad A = \Sigma_{[:r]}^{1/2} V_{[:r, :]}^T\); the residual \(W^{res} = W - BA\) stays frozen.
Misconception: LoRA is always better than full fine-tuning
No. On tasks with a domain shift >30% (medicine, law, specialized languages), full fine-tuning still wins by 2-5% on specialized benchmarks. LoRA is optimal for adaptation (style, format, tone), not for a deep domain switch.
Misconception: a higher rank is always better
With standard LoRA, raising the rank past 32-64 often hurts results because the scaling becomes unstable (alpha/r shrinks). rsLoRA (alpha/sqrt(r)) fixes this: with it, a higher rank really does help. If you use standard LoRA, start at r=16 and increase only with validation-set checks.
Misconception: target_modules with only q_proj and v_proj is enough
Early work recommended adapting only the attention projections. Research from 2025-2026 shows that LoRA on ALL linear layers (q/k/v/o_proj + gate/up/down_proj) gains +2-4% quality while only doubling the trainable params. For LLaMA/Mistral, always use all 7 linear layers.
Interview Questions¶
Q: When is LoRA better than full fine-tuning?
Red flag: "LoRA is always better because it is cheaper"
Strong answer: "LoRA is optimal for adaptation tasks (chatbot style, format, tone). Under a deep domain shift, full fine-tuning gains 2-5%. The key factor is rank selection: r=8 for simple tasks, r=64-128 for complex ones. QLoRA adds 4-bit quantization with minimal quality loss. On MMLU, LoRA r=16 reaches 99% of full-FT quality (64.8 vs 65.2)."
Q: Write the LoRA forward pass and explain the initialization.
Red flag: "A and B are initialized randomly" (no details)
Strong answer: "h = W0*x + (alpha/r)*B*A*x. A in R^(r x k) is initialized Kaiming uniform, B in R^(d x r) with zeros. This guarantees delta W = 0 at the start of training (a stable start). Merge for inference: W_merged = W0 + B*A*(alpha/r), zero overhead. Gradients: dL/dB = dL/dh * (Ax)^T, dL/dA = B^T * dL/dh * x^T."
Q: Compare QLoRA, DoRA, rsLoRA, PiSSA -- when to choose which?
Red flag: describing only QLoRA with no knowledge of the newer variants
Strong answer: "QLoRA: base model in 4-bit NF4, adapters in full precision -- for limited VRAM (70B in 48 GB). DoRA: decompose W = m * V/||V||, LoRA only on the direction V -- +2-4% quality, best stability for continual learning. rsLoRA: scaling alpha/sqrt(r) instead of alpha/r -- strictly better, stable at any rank. PiSSA: SVD init from the principal singular values -- +3-5% vs LoRA (77.7% vs 74.5% on GSM8K, Gemma-7B), but slower initialization. Default choice for 2026: rsLoRA."
Sources¶
- Hu et al. -- "LoRA: Low-Rank Adaptation of Large Language Models" (arXiv, 2021)
- Dettmers et al. -- "QLoRA: Efficient Finetuning of Quantized LLMs" (arXiv)
- NVIDIA -- "Introducing DoRA: A High-Performing Alternative to LoRA"
- FinLoRA Benchmark (arXiv 2505.19819, May 2025)
- Sebastian Raschka -- "LoRA and DoRA from Scratch"
- PiSSA GitHub & arXiv (Apr 2024)
- ICCV 2025 -- "NoRA: Nested Low-Rank Adaptation"
- Unsloth Documentation (Jan 2026)
- HuggingFace PEFT Documentation
- Lightning AI -- "Parameter-Efficient LLM Finetuning With LoRA"
See Also¶
- Fine-Tuning Techniques -- full fine-tuning vs PEFT, data preparation, evaluation
- Quantization -- QLoRA relies on 4-bit NF4 quantization of the base model
- Distributed Training -- FSDP2 natively supports LoRA adapters via DTensor
- Alignment Methods -- DPO/RLHF fine-tuning is often combined with LoRA for efficiency
- Production Deploy -- multi-LoRA serving, adapter merging for zero-overhead inference