
LoRA and Fine-Tuning Variants

~9 min read

Prerequisites: LLM Fine-Tuning Techniques, LLM Quantization

At d=k=4096 and rank r=8, LoRA trains 65.5K parameters instead of 16.7M -- a 256x reduction. For a 70B model this cuts VRAM from 1.1 TB (full fine-tuning) to 280 GB (LoRA) or 48 GB (QLoRA on a single A100). As of 2026, 90%+ of PEFT projects use LoRA or one of its variants, with rsLoRA and PiSSA adding +3-5% quality at no extra resource cost.

LoRA formula (W = W0 + BA), QLoRA (4-bit NF4), DoRA (magnitude-direction decomposition), rsLoRA (rank-stabilized), PiSSA (SVD init), AdaLoRA, PyTorch implementation from scratch, PEFT/BitsAndBytes stack, FinLoRA benchmark, production considerations (2025-2026)


Key Concepts

LoRA (Low-Rank Adaptation) -- freeze the pretrained weights and add trainable low-rank matrices:

\[W' = W_0 + \Delta W = W_0 + \frac{\alpha}{r} BA\]
  • \(W_0 \in \mathbb{R}^{d \times k}\) -- frozen pretrained weight
  • \(B \in \mathbb{R}^{d \times r}\), \(A \in \mathbb{R}^{r \times k}\) -- trainable matrices
  • \(r \ll \min(d, k)\) -- rank (typically 8-64)

Parameter Reduction

\[\text{Reduction} = \frac{d \times k}{r \times (d + k)} = \frac{k}{2r} \quad (d = k)\]
| d = k | r  | Full Params | LoRA Params | Reduction |
|-------|----|-------------|-------------|-----------|
| 4096  | 4  | 16.7M       | 32.8K       | 512x      |
| 4096  | 8  | 16.7M       | 65.5K       | 256x      |
| 4096  | 16 | 16.7M       | 131K        | 128x      |
| 4096  | 64 | 16.7M       | 524K        | 32x       |
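The table's numbers can be reproduced directly; a quick sanity check in plain Python (d = k = 4096, as in the table):

```python
def lora_stats(d, k, r):
    """Full vs LoRA trainable parameter counts for one d x k weight."""
    full = d * k            # every entry of W is trainable
    lora = r * (d + k)      # A is r x k, B is d x r
    return full, lora, full / lora

full, lora, reduction = lora_stats(4096, 4096, 8)
print(full, lora, reduction)    # 16777216 65536 256.0
```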

Variant Ranking (2026)

| Method | GPU Memory   | Performance    | Stability | Best For        |
|--------|--------------|----------------|-----------|-----------------|
| LoRA   | High         | Good           | Excellent | Stable training |
| QLoRA  | Very Low     | Slightly below | Good      | Limited VRAM    |
| DoRA   | High         | Better         | Excellent | Higher capacity |
| rsLoRA | Same as LoRA | Better         | Excellent | Default choice  |
| PiSSA  | Same as LoRA | Best           | Good      | Max performance |

1. PyTorch Implementation

LoRA Layer

import torch
import torch.nn as nn
import math

class LoRALayer(nn.Module):
    def __init__(self, in_features, out_features, rank=8, alpha=16.0, dropout=0.0):
        super().__init__()
        self.scaling = alpha / rank
        self.lora_A = nn.Parameter(torch.zeros(rank, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        self.dropout = nn.Dropout(dropout) if dropout > 0 else nn.Identity()
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        nn.init.zeros_(self.lora_B)

    def forward(self, x):
        x = self.dropout(x)
        return (x @ self.lora_A.T) @ self.lora_B.T * self.scaling
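The zero init of lora_B matters: the adapter contributes exactly nothing at step 0, so the wrapped model starts out identical to the base model. A tensor-level check of that property (standalone, with a random stand-in for the Kaiming-initialized A):

```python
import torch

d, k, r = 16, 12, 4
A = torch.randn(r, k)        # stand-in for Kaiming-initialized lora_A
B = torch.zeros(d, r)        # zero init, as in LoRALayer
x = torch.randn(3, k)

delta = (x @ A.T) @ B.T      # the LoRA branch output at initialization
print(delta.abs().max().item())  # 0.0
```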

LoRA Linear Wrapper

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, rank=8, alpha=16.0, dropout=0.0, bias=True):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features, bias=bias)
        self.lora = LoRALayer(in_features, out_features, rank, alpha, dropout)

    def forward(self, x):
        return self.linear(x) + self.lora(x)

    def merge_weights(self):
        with torch.no_grad():
            self.linear.weight.data += (
                self.lora.lora_B @ self.lora.lora_A
            ) * self.lora.scaling
            self.lora.lora_A.zero_()
            self.lora.lora_B.zero_()

Model Wrapper

class LoRAModel(nn.Module):
    def __init__(self, model, rank=8, alpha=16.0, dropout=0.0,
                 target_modules=None):
        super().__init__()
        self.model = model
        self.target_modules = target_modules or ["q_proj", "v_proj", "k_proj", "o_proj"]
        self._replace_with_lora(rank, alpha, dropout)
        self._freeze_base_model()

    def _replace_with_lora(self, rank, alpha, dropout):
        # Collect matches first: replacing modules while iterating
        # named_modules() can skip entries in the mutated tree.
        targets = [
            (name, module)
            for name, module in self.model.named_modules()
            if isinstance(module, nn.Linear)
            and any(t in name for t in self.target_modules)
        ]
        for name, module in targets:
            lora_linear = LoRALinear(
                module.in_features, module.out_features,
                rank=rank, alpha=alpha, dropout=dropout,
                bias=module.bias is not None,
            )
            lora_linear.linear.weight.data = module.weight.data.clone()
            if module.bias is not None:
                lora_linear.linear.bias.data = module.bias.data.clone()
            parent = self.model.get_submodule('.'.join(name.split('.')[:-1]))
            setattr(parent, name.split('.')[-1], lora_linear)

    def _freeze_base_model(self):
        for name, param in self.model.named_parameters():
            if 'lora_' not in name:
                param.requires_grad = False

    def forward(self, *args, **kwargs):
        return self.model(*args, **kwargs)

    def merge_and_save(self, path):
        for module in self.model.modules():
            if isinstance(module, LoRALinear):
                module.merge_weights()
        torch.save(self.model.state_dict(), path)

Gradient Flow

\[\frac{\partial \mathcal{L}}{\partial B} = \frac{\partial \mathcal{L}}{\partial h} \cdot (Ax)^T, \quad \frac{\partial \mathcal{L}}{\partial A} = B^T \cdot \frac{\partial \mathcal{L}}{\partial h} \cdot x^T\]
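These expressions can be verified numerically against autograd; a minimal sketch for a single column vector x with loss = sum(h), scaling omitted for clarity:

```python
import torch

d, k, r = 6, 5, 2
A = torch.randn(r, k, requires_grad=True)
B = torch.randn(d, r, requires_grad=True)
x = torch.randn(k, 1)

h = B @ (A @ x)              # LoRA branch: h = BAx
loss = h.sum()
loss.backward()

with torch.no_grad():
    g = torch.ones(d, 1)     # dL/dh for loss = sum(h)
    manual_dB = g @ (A @ x).T
    manual_dA = B.T @ g @ x.T

print(torch.allclose(B.grad, manual_dB, atol=1e-6))  # True
print(torch.allclose(A.grad, manual_dA, atol=1e-6))  # True
```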

2. QLoRA (Quantized LoRA)

The base model is stored in 4-bit; the LoRA adapters stay in full precision.

| Innovation          | Description                               |
|---------------------|-------------------------------------------|
| 4-bit NF4           | NormalFloat for normally distributed weights |
| Double quantization | Quantizes the quantization constants      |
| Paged optimizers    | CPU offload for memory spikes             |

Memory Comparison

| Model | Full FT | LoRA   | QLoRA |
|-------|---------|--------|-------|
| 7B    | 112 GB  | 28 GB  | 6 GB  |
| 13B   | 208 GB  | 52 GB  | 10 GB |
| 70B   | 1.1 TB  | 280 GB | 48 GB |
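The rough arithmetic behind these numbers follows from NF4 storage cost. A sketch under the QLoRA paper's assumptions (block size 64, absmax constants re-quantized to 8-bit in blocks of 256):

```python
def nf4_bits_per_param(blocksize=64, dq_blocksize=256, double_quant=True):
    """Approximate storage bits per weight for NF4 quantization."""
    bits = 4.0                                  # 4-bit NF4 code per weight
    if double_quant:
        # 8-bit absmax per block, plus one fp32 scale
        # per dq_blocksize group of those constants
        bits += 8 / blocksize + 32 / (blocksize * dq_blocksize)
    else:
        bits += 32 / blocksize                  # fp32 absmax per block
    return bits

print(round(nf4_bits_per_param(), 3))                    # 4.127
print(round(nf4_bits_per_param(double_quant=False), 3))  # 4.5
# 70B weights at ~4.127 bits/param is roughly 36 GB; the 48 GB
# figure in the table adds adapters, optimizer state, and activations.
```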

PEFT/BitsAndBytes Stack

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16, store in NF4
)

lora_config = LoraConfig(
    r=8, lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05, bias="none", task_type="CAUSAL_LM",
)

model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)
model = get_peft_model(model, lora_config)

3. DoRA (Weight-Decomposed LoRA)

Decompose: \(W = m \cdot \frac{V}{\|V\|}\), apply LoRA to direction V only.

import torch.nn.functional as F

class DoRALinear(nn.Module):
    def __init__(self, in_features, out_features, rank=8, alpha=16.0):
        super().__init__()
        self.weight = nn.Parameter(torch.zeros(out_features, in_features))
        self.lora_A = nn.Parameter(torch.zeros(rank, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        # The paper initializes m = ||W_0|| so that W' = W_0 at step 0;
        # ones() is a simplification for this from-scratch sketch.
        self.magnitude = nn.Parameter(torch.ones(out_features))
        self.scaling = alpha / rank
        nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        nn.init.zeros_(self.lora_B)

    def forward(self, x):
        # Row-wise norm here; the DoRA paper normalizes column-wise.
        V = self.weight / (self.weight.norm(dim=1, keepdim=True) + 1e-6)
        delta_V = (self.lora_B @ self.lora_A) * self.scaling
        V_new = V + delta_V
        V_new = V_new / (V_new.norm(dim=1, keepdim=True) + 1e-6)
        W_new = self.magnitude.unsqueeze(1) * V_new
        return F.linear(x, W_new)

4. rsLoRA (Rank-Stabilized)

Standard LoRA: \(\Delta W = \frac{\alpha}{r} BA\). rsLoRA: \(\Delta W = \frac{\alpha}{\sqrt{r}} BA\)

Standard LoRA degrades as rank increases; rsLoRA stays stable or improves at higher ranks.

"Start with rsLoRA: strictly better than standard LoRA with no downsides."
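The entire difference is the denominator of the scaling factor. A minimal comparison showing why standard scaling shrinks the update as rank grows:

```python
import math

def lora_scale(alpha, r):
    return alpha / r           # standard LoRA

def rslora_scale(alpha, r):
    return alpha / math.sqrt(r)  # rank-stabilized LoRA

for r in (8, 16, 64, 256):
    print(r, lora_scale(16, r), round(rslora_scale(16, r), 2))
# At r=256 the standard scaling collapses to 0.0625,
# while rsLoRA keeps the update magnitude at 1.0.
```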


5. PiSSA (Principal Singular Values Adaptation)

SVD-based initialization: \(W = U \Sigma V^T \approx U_r \Sigma_r V_r^T\)

Initialize B and A from the top-r singular triples and train those; the frozen residual keeps W unchanged at init. Faster convergence and the best performance among the variants.
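A sketch of the initialization on a plain weight matrix, using torch.linalg.svd (a simplified view of the PiSSA scheme):

```python
import torch

def pissa_init(W, r):
    """Split W into a trainable principal part (B @ A) and a frozen residual."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    sqrt_S = S[:r].sqrt()
    B = U[:, :r] * sqrt_S              # d x r, absorbs sqrt of top singular values
    A = sqrt_S.unsqueeze(1) * Vh[:r]   # r x k
    W_res = W - B @ A                  # frozen residual; W is exact at init
    return B, A, W_res

W = torch.randn(64, 48)
B, A, W_res = pissa_init(W, r=8)
print(torch.allclose(W_res + B @ A, W, atol=1e-5))  # True
```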

| Model      | Task  | PiSSA  | LoRA   | Improvement |
|------------|-------|--------|--------|-------------|
| Gemma-7B   | GSM8K | 77.7%  | 74.53% | +3.17%      |
| Mistral-7B | GSM8K | 72.86% | 67.7%  | +5.16%      |

6. Other Variants

| Variant          | Key Innovation                                            |
|------------------|-----------------------------------------------------------|
| AdaLoRA          | Learns which layers need higher rank, prunes during training |
| NoRA (ICCV 2025) | Nested structure for better initialization                |

7. Comparison

Performance (Mistral-7B GSM8K)

| Method  | Score  | Memory (7B) | Speed  |
|---------|--------|-------------|--------|
| Full FT | 73.0%  | 112 GB      | Slow   |
| PiSSA   | 72.86% | 28 GB       | Fast   |
| DoRA    | ~70%   | 28 GB       | Fast   |
| rsLoRA  | ~69%   | 28 GB       | Fast   |
| LoRA    | 67.7%  | 28 GB       | Fast   |
| QLoRA   | ~66%   | 6 GB        | Medium |

Quality Benchmarks

| Method           | MMLU | GSM8K | Quality vs Full |
|------------------|------|-------|-----------------|
| Full fine-tuning | 65.2 | 52.1  | 100%            |
| LoRA (r=16)      | 64.8 | 51.8  | 99%             |
| LoRA (r=8)       | 64.1 | 50.9  | 97%             |
| QLoRA            | 64.5 | 51.2  | 98%             |

8. Selection Guide

Limited VRAM (<16GB)?     -> QLoRA
Maximum performance?      -> PiSSA
Drop-in LoRA upgrade?     -> rsLoRA
Higher learning capacity? -> DoRA
Stable, battle-tested?    -> Standard LoRA
| Use Case                | Recommended    | Reason               |
|-------------------------|----------------|----------------------|
| Consumer GPU (RTX 4090) | QLoRA          | Fits in 24 GB        |
| Single A100 (40 GB)     | QLoRA or LoRA  | Both work            |
| Multi-GPU (2x A100)     | rsLoRA or DoRA | No memory constraint |
| Production deployment   | rsLoRA         | Best tradeoff        |
| Research/benchmarking   | PiSSA          | Maximum performance  |
| Continual learning      | DoRA           | Better stability     |

9. Best Practices

Hyperparameters

| Parameter      | Recommended                | Notes                     |
|----------------|----------------------------|---------------------------|
| Rank (r)       | 8-64                       | Higher = more capacity    |
| Alpha          | 16-32 (alpha = 2r typical) | Scaling = alpha/r         |
| Target modules | q_proj, v_proj minimum     | All linear = best quality |
| Learning rate  | 1e-4 to 5e-4               | Higher than full FT       |
| Dropout        | 0.05-0.1                   | Optional regularization   |
| Warmup ratio   | 0.03-0.05                  | With cosine annealing     |

Target Modules (LLaMA/Mistral)

# Attention only (minimal)
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]

# All linear layers (best quality)
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                  "gate_proj", "up_proj", "down_proj"]

Common Pitfalls

| Pitfall                 | Solution                        |
|-------------------------|---------------------------------|
| Too high rank           | Start r=16, increase if needed  |
| Wrong alpha             | Keep alpha = 2r                 |
| Learning too fast       | Reduce LR, increase warmup      |
| Catastrophic forgetting | Lower LR, LoRA on all layers    |

LoRA does NOT prevent catastrophic forgetting

LoRA reduces the risk thanks to its small parameter count, but does NOT eliminate it. With a high LR (>5e-4) or long training runs, the model still loses general capabilities. Mitigations: a low LR (1e-4), LoRA on ALL linear layers (not just q/v), and regular checks on held-out general benchmarks.

alpha and r are coupled -- scaling = alpha/r

alpha=16, r=8 gives the same scaling (2.0) as alpha=32, r=16. Many people tune both, which is pointless. Fix alpha = 2 * r and tune ONLY the rank. rsLoRA sidesteps the issue: scaling = alpha/sqrt(r) is stable at any rank.

Frameworks (2025-2026)

| Framework        | Pros                        | Cons                 |
|------------------|-----------------------------|----------------------|
| HuggingFace PEFT | Standard, well-documented   | Verbose API          |
| Unsloth          | 2x faster, memory efficient | Newer                |
| Axolotl          | Config-driven YAML          | Steep learning curve |

Production Considerations

  • Adapter merging: merge LoRA into base weights for zero inference overhead
  • Multi-LoRA serving: multiple adapters for different tasks on same base model
  • A/B testing: compare LoRA configs before deployment
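Merging is exact, not approximate; a quick check on plain tensors, assuming the W' = W0 + (alpha/r)BA convention used throughout:

```python
import torch

d, k, r, alpha = 32, 16, 4, 8.0
scaling = alpha / r
W0 = torch.randn(d, k)
A = torch.randn(r, k)
B = torch.randn(d, r)
x = torch.randn(5, k)                              # batch of 5 inputs

y_adapter = x @ W0.T + (x @ A.T @ B.T) * scaling   # base + LoRA branch
W_merged = W0 + (B @ A) * scaling
y_merged = x @ W_merged.T                          # one matmul after merging

print(torch.allclose(y_adapter, y_merged, atol=1e-4))  # True
```

Because the merged weight reproduces the adapter path exactly, serving after the merge costs the same as serving the base model.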

For Interviews

Q: "Explain LoRA and why it is needed."

LoRA freezes the pretrained weights W0 and adds trainable low-rank matrices: W' = W0 + (alpha/r) * BA, where B in R^(d x r), A in R^(r x k), r << min(d,k). At d=k=4096, r=8: 65.5K trainable params vs 16.7M (256x reduction). Memory: 10-15% of full fine-tuning. Quality: 97-99% of full FT (MMLU, GSM8K). Inference: zero overhead after merging. Initialization: A via Kaiming uniform, B as zeros (output starts at zero).

Q: "Compare LoRA, QLoRA, DoRA, rsLoRA."

QLoRA: base model in 4-bit NF4 + LoRA adapters in full precision. A 70B model on 48 GB (vs 1.1 TB full FT). Innovations: NF4, double quantization, paged optimizers. Trade-off: slightly slower training, marginal quality loss. DoRA: decompose W = m * V/||V|| (magnitude x direction), apply LoRA to V only. +2-4% vs standard LoRA. rsLoRA: scaling alpha/sqrt(r) instead of alpha/r -- stable across rank values, strictly better. PiSSA: SVD-based init, best performance (+3-5% vs LoRA), slightly longer initialization.

Q: "Write the LoRA forward pass."

h = W0 * x + (alpha/r) * B * A * x. A in R^(r x k) is initialized with Kaiming, B in R^(d x r) with zeros. Gradients: dL/dB = dL/dh * (Ax)^T, dL/dA = B^T * dL/dh * x^T. Merge for inference: W_merged = W0 + B * A * (alpha/r), zero overhead.


Key Numbers

| Fact                                | Value        |
|-------------------------------------|--------------|
| Parameter reduction (r=8, d=4096)   | 256x         |
| Parameter reduction (r=4)           | 512x         |
| Memory reduction LoRA vs full FT    | 75-90%       |
| Memory reduction QLoRA vs full FT   | 95%+         |
| QLoRA 70B model memory              | 48 GB        |
| LoRA quality vs full FT             | 97-99%       |
| PiSSA vs LoRA (GSM8K)               | +3-5%        |
| DoRA vs LoRA                        | +2-4%        |
| FinLoRA avg gain over base          | 36%          |
| Training speed LoRA vs full FT      | 2-3x faster  |
| LoRA adoption in PEFT               | 90%+         |

Formulas

LoRA Forward

\[h = W_0 x + \frac{\alpha}{r} B A x\]

rsLoRA Forward

\[h = W_0 x + \frac{\alpha}{\sqrt{r}} B A x\]

DoRA Forward

\[W' = m' \cdot \frac{V + \Delta V}{\|V + \Delta V\|}, \quad \Delta V = BA\]

Parameter Reduction

\[\text{Reduction} = \frac{d \times k}{r \times (d + k)} = \frac{k}{2r} \quad (d = k)\]

PiSSA Init

\[W = U \Sigma V^T \approx U_r \Sigma_r V_r^T\]

Misconception: LoRA is always better than full fine-tuning

No. On tasks with a domain shift of >30% (medicine, law, specialized languages), full fine-tuning still wins by 2-5% on specialized benchmarks. LoRA is optimal for adaptation (style, format, tone), not for a deep domain change.

Misconception: higher rank is always better

With standard LoRA, raising the rank above 32-64 often hurts results because of unstable scaling (alpha/r shrinks). rsLoRA (alpha/sqrt(r)) fixes this: with it, higher rank actually helps. If you use standard LoRA, start at r=16 and increase only with validation-set checks.

Misconception: target_modules with only q_proj and v_proj is enough

Early work recommended adapting only the attention projections. Studies from 2025-2026 show that LoRA on ALL linear layers (q/k/v/o_proj + gate/up/down_proj) gains +2-4% quality while only doubling the trainable params. For LLaMA/Mistral, always use all 7 linear layers.

Interview Questions

Q: When is LoRA better than full fine-tuning?

❌ Red flag: "LoRA is always better because it's cheaper"

✅ Strong answer: "LoRA is optimal for adaptation tasks (chatbot style, format, tone). Under a deep domain shift, full fine-tuning gains 2-5%. The key factor is rank selection: r=8 for simple tasks, r=64-128 for hard ones. QLoRA adds 4-bit quantization with minimal loss. On MMLU, LoRA r=16 reaches 99% of full-FT quality (64.8 vs 65.2)."

Q: Write the LoRA forward pass and explain the initialization.

❌ Red flag: "A and B are initialized randomly" (no details)

✅ Strong answer: "h = W0*x + (alpha/r)*B*A*x. A in R^(r x k) is initialized Kaiming uniform, B in R^(d x r) to zeros. This guarantees delta W = 0 at the start of training (a stable start). Merge for inference: W_merged = W0 + B*A*(alpha/r), zero overhead. Gradients: dL/dB = dL/dh * (Ax)^T, dL/dA = B^T * dL/dh * x^T."

Q: Compare QLoRA, DoRA, rsLoRA, PiSSA -- when to choose which?

❌ Red flag: Describing only QLoRA with no knowledge of the newer variants

✅ Strong answer: "QLoRA: base model in 4-bit NF4, adapters in full precision -- for limited VRAM (70B on 48 GB). DoRA: decompose W = m * V/||V||, LoRA only on the direction V -- +2-4% quality, best stability for continual learning. rsLoRA: scaling alpha/sqrt(r) instead of alpha/r -- strictly better, stable at any rank. PiSSA: SVD init from the principal singular values -- +3-5% vs LoRA (77.7% vs 74.5% on GSM8K, Gemma-7B), but slower initialization. Default choice for 2026: rsLoRA."


Sources

  1. Hu et al. -- "LoRA: Low-Rank Adaptation of Large Language Models" (arXiv, 2021)
  2. Dettmers et al. -- "QLoRA: Efficient Finetuning of Quantized LLMs" (arXiv)
  3. NVIDIA -- "Introducing DoRA: A High-Performing Alternative to LoRA"
  4. FinLoRA Benchmark (arXiv 2505.19819, May 2025)
  5. Sebastian Raschka -- "LoRA and DoRA from Scratch"
  6. PiSSA GitHub & arXiv (Apr 2024)
  7. ICCV 2025 -- "NoRA: Nested Low-Rank Adaptation"
  8. Unsloth Documentation (Jan 2026)
  9. HuggingFace PEFT Documentation
  10. Lightning AI -- "Parameter-Efficient LLM Finetuning With LoRA"

See Also

  • Fine-Tuning Techniques -- full fine-tuning vs PEFT, data preparation, evaluation
  • Quantization -- QLoRA relies on 4-bit NF4 quantization of the base model
  • Distributed Training -- FSDP2 natively supports LoRA adapters via DTensor
  • Alignment Methods -- DPO/RLHF fine-tuning is often combined with LoRA for efficiency
  • Production Deploy -- multi-LoRA serving, adapter merging для zero-overhead inference