
LoRA and Fine-Tuning Variants

~9 min read

Prerequisites: LLM Fine-Tuning Techniques, LLM Quantization

At d=k=4096 and rank r=8, LoRA trains 65.5K parameters instead of 16.7M -- a 256x reduction. For a 70B model this cuts VRAM from 1.1 TB (full fine-tuning) to 280 GB (LoRA) or 48 GB (QLoRA on a single A100). As of 2026, 90%+ of PEFT projects use LoRA or one of its variants, with rsLoRA and PiSSA adding +3-5% quality at no extra resource cost.

LoRA formula (W = W0 + BA), QLoRA (4-bit NF4), DoRA (magnitude-direction decomposition), rsLoRA (rank-stabilized), PiSSA (SVD init), AdaLoRA, PyTorch implementation from scratch, PEFT/BitsAndBytes stack, FinLoRA benchmark, production considerations (2025-2026)


Key Concepts

LoRA (Low-Rank Adaptation) -- freeze the pretrained weights and add trainable low-rank matrices:

\[W' = W_0 + \Delta W = W_0 + \frac{\alpha}{r} BA\]
  • \(W_0 \in \mathbb{R}^{d \times k}\) -- frozen pretrained weight
  • \(B \in \mathbb{R}^{d \times r}\), \(A \in \mathbb{R}^{r \times k}\) -- trainable matrices
  • \(r \ll \min(d, k)\) -- rank (typically 8-64)

Parameter Reduction

\[\text{Reduction} = \frac{d \times k}{r \times (d + k)} = \frac{k}{2r} \quad (d = k)\]
| d = k | r  | Full Params | LoRA Params | Reduction |
|-------|----|-------------|-------------|-----------|
| 4096  | 4  | 16.7M       | 32.8K       | 512x      |
| 4096  | 8  | 16.7M       | 65.5K       | 256x      |
| 4096  | 16 | 16.7M       | 131K        | 128x      |
| 4096  | 64 | 16.7M       | 524K        | 32x       |
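The table's numbers can be reproduced directly; a quick sanity check in plain Python (d = k = 4096, as in the table):

```python
def lora_stats(d, k, r):
    """Full vs LoRA trainable parameter counts for one d x k weight."""
    full = d * k            # every entry of W is trainable
    lora = r * (d + k)      # A is r x k, B is d x r
    return full, lora, full / lora

full, lora, reduction = lora_stats(4096, 4096, 8)
print(full, lora, reduction)    # 16777216 65536 256.0
```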

Variant Ranking (2026)

| Method | GPU Memory   | Performance    | Stability | Best For        |
|--------|--------------|----------------|-----------|-----------------|
| LoRA   | High         | Good           | Excellent | Stable training |
| QLoRA  | Very Low     | Slightly below | Good      | Limited VRAM    |
| DoRA   | High         | Better         | Excellent | Higher capacity |
| rsLoRA | Same as LoRA | Better         | Excellent | Default choice  |
| PiSSA  | Same as LoRA | Best           | Good      | Max performance |

1. PyTorch Implementation

LoRA Layer

import torch
import torch.nn as nn
import math

class LoRALayer(nn.Module):
    def __init__(self, in_features, out_features, rank=8, alpha=16.0, dropout=0.0):
        super().__init__()
        self.scaling = alpha / rank
        self.lora_A = nn.Parameter(torch.zeros(rank, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        self.dropout = nn.Dropout(dropout) if dropout > 0 else nn.Identity()
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        nn.init.zeros_(self.lora_B)

    def forward(self, x):
        x = self.dropout(x)
        return (x @ self.lora_A.T) @ self.lora_B.T * self.scaling
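The zero init of lora_B matters: the adapter contributes exactly nothing at step 0, so the wrapped model starts out identical to the base model. A tensor-level check of that property (standalone, with a random stand-in for the Kaiming-initialized A):

```python
import torch

d, k, r = 16, 12, 4
A = torch.randn(r, k)        # stand-in for Kaiming-initialized lora_A
B = torch.zeros(d, r)        # zero init, as in LoRALayer
x = torch.randn(3, k)

delta = (x @ A.T) @ B.T      # the LoRA branch output at initialization
print(delta.abs().max().item())  # 0.0
```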

LoRA Linear Wrapper

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, rank=8, alpha=16.0, dropout=0.0, bias=True):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features, bias=bias)
        self.lora = LoRALayer(in_features, out_features, rank, alpha, dropout)

    def forward(self, x):
        return self.linear(x) + self.lora(x)

    def merge_weights(self):
        with torch.no_grad():
            self.linear.weight.data += (
                self.lora.lora_B @ self.lora.lora_A
            ) * self.lora.scaling
            self.lora.lora_A.zero_()
            self.lora.lora_B.zero_()

Model Wrapper

class LoRAModel(nn.Module):
    def __init__(self, model, rank=8, alpha=16.0, dropout=0.0,
                 target_modules=None):
        super().__init__()
        self.model = model
        self.target_modules = target_modules or ["q_proj", "v_proj", "k_proj", "o_proj"]
        self._replace_with_lora(rank, alpha, dropout)
        self._freeze_base_model()

    def _replace_with_lora(self, rank, alpha, dropout):
        # Collect matches first: replacing modules while iterating
        # named_modules() can skip entries in the mutated tree.
        targets = [
            (name, module)
            for name, module in self.model.named_modules()
            if isinstance(module, nn.Linear)
            and any(t in name for t in self.target_modules)
        ]
        for name, module in targets:
            lora_linear = LoRALinear(
                module.in_features, module.out_features,
                rank=rank, alpha=alpha, dropout=dropout,
                bias=module.bias is not None,
            )
            lora_linear.linear.weight.data = module.weight.data.clone()
            if module.bias is not None:
                lora_linear.linear.bias.data = module.bias.data.clone()
            parent = self.model.get_submodule('.'.join(name.split('.')[:-1]))
            setattr(parent, name.split('.')[-1], lora_linear)

    def _freeze_base_model(self):
        for name, param in self.model.named_parameters():
            if 'lora_' not in name:
                param.requires_grad = False

    def forward(self, *args, **kwargs):
        return self.model(*args, **kwargs)

    def merge_and_save(self, path):
        for module in self.model.modules():
            if isinstance(module, LoRALinear):
                module.merge_weights()
        torch.save(self.model.state_dict(), path)

Gradient Flow

\[\frac{\partial \mathcal{L}}{\partial B} = \frac{\partial \mathcal{L}}{\partial h} \cdot (Ax)^T, \quad \frac{\partial \mathcal{L}}{\partial A} = B^T \cdot \frac{\partial \mathcal{L}}{\partial h} \cdot x^T\]
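These expressions can be verified numerically against autograd; a minimal sketch for a single column vector x with loss = sum(h), scaling omitted for clarity:

```python
import torch

d, k, r = 6, 5, 2
A = torch.randn(r, k, requires_grad=True)
B = torch.randn(d, r, requires_grad=True)
x = torch.randn(k, 1)

h = B @ (A @ x)              # LoRA branch: h = BAx
loss = h.sum()
loss.backward()

with torch.no_grad():
    g = torch.ones(d, 1)     # dL/dh for loss = sum(h)
    manual_dB = g @ (A @ x).T
    manual_dA = B.T @ g @ x.T

print(torch.allclose(B.grad, manual_dB, atol=1e-6))  # True
print(torch.allclose(A.grad, manual_dA, atol=1e-6))  # True
```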

2. QLoRA (Quantized LoRA)

The base model is stored in 4-bit; the LoRA adapters stay in full precision.

| Innovation          | Description                               |
|---------------------|-------------------------------------------|
| 4-bit NF4           | NormalFloat for normally distributed weights |
| Double quantization | Quantizes the quantization constants      |
| Paged optimizers    | CPU offload for memory spikes             |

Memory Comparison

| Model | Full FT | LoRA   | QLoRA |
|-------|---------|--------|-------|
| 7B    | 112 GB  | 28 GB  | 6 GB  |
| 13B   | 208 GB  | 52 GB  | 10 GB |
| 70B   | 1.1 TB  | 280 GB | 48 GB |
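The rough arithmetic behind these numbers follows from NF4 storage cost. A sketch under the QLoRA paper's assumptions (block size 64, absmax constants re-quantized to 8-bit in blocks of 256):

```python
def nf4_bits_per_param(blocksize=64, dq_blocksize=256, double_quant=True):
    """Approximate storage bits per weight for NF4 quantization."""
    bits = 4.0                                  # 4-bit NF4 code per weight
    if double_quant:
        # 8-bit absmax per block, plus one fp32 scale
        # per dq_blocksize group of those constants
        bits += 8 / blocksize + 32 / (blocksize * dq_blocksize)
    else:
        bits += 32 / blocksize                  # fp32 absmax per block
    return bits

print(round(nf4_bits_per_param(), 3))                    # 4.127
print(round(nf4_bits_per_param(double_quant=False), 3))  # 4.5
# 70B weights at ~4.127 bits/param is roughly 36 GB; the 48 GB
# figure in the table adds adapters, optimizer state, and activations.
```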

PEFT/BitsAndBytes Stack

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16, store in NF4
)

lora_config = LoraConfig(
    r=8, lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05, bias="none", task_type="CAUSAL_LM",
)

model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)
model = get_peft_model(model, lora_config)

3. DoRA (Weight-Decomposed LoRA)

Decompose: \(W = m \cdot \frac{V}{\|V\|}\), apply LoRA to direction V only.

import torch.nn.functional as F

class DoRALinear(nn.Module):
    def __init__(self, in_features, out_features, rank=8, alpha=16.0):
        super().__init__()
        self.weight = nn.Parameter(torch.zeros(out_features, in_features))
        self.lora_A = nn.Parameter(torch.zeros(rank, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        # The paper initializes m = ||W_0|| so that W' = W_0 at step 0;
        # ones() is a simplification for this from-scratch sketch.
        self.magnitude = nn.Parameter(torch.ones(out_features))
        self.scaling = alpha / rank
        nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        nn.init.zeros_(self.lora_B)

    def forward(self, x):
        # Row-wise norm here; the DoRA paper normalizes column-wise.
        V = self.weight / (self.weight.norm(dim=1, keepdim=True) + 1e-6)
        delta_V = (self.lora_B @ self.lora_A) * self.scaling
        V_new = V + delta_V
        V_new = V_new / (V_new.norm(dim=1, keepdim=True) + 1e-6)
        W_new = self.magnitude.unsqueeze(1) * V_new
        return F.linear(x, W_new)

4. rsLoRA (Rank-Stabilized)

Standard LoRA: \(\Delta W = \frac{\alpha}{r} BA\). rsLoRA: \(\Delta W = \frac{\alpha}{\sqrt{r}} BA\)

Standard LoRA degrades as rank increases; rsLoRA stays stable or improves at higher ranks.

"Start with rsLoRA: strictly better than standard LoRA with no downsides."
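The entire difference is the denominator of the scaling factor. A minimal comparison showing why standard scaling shrinks the update as rank grows:

```python
import math

def lora_scale(alpha, r):
    return alpha / r           # standard LoRA

def rslora_scale(alpha, r):
    return alpha / math.sqrt(r)  # rank-stabilized LoRA

for r in (8, 16, 64, 256):
    print(r, lora_scale(16, r), round(rslora_scale(16, r), 2))
# At r=256 the standard scaling collapses to 0.0625,
# while rsLoRA keeps the update magnitude at 1.0.
```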


5. PiSSA (Principal Singular Values Adaptation)

SVD-based initialization: \(W = U \Sigma V^T \approx U_r \Sigma_r V_r^T\)

Initialize B and A from the top-r singular triples and train those; the frozen residual keeps W unchanged at init. Faster convergence and the best performance among the variants.
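A sketch of the initialization on a plain weight matrix, using torch.linalg.svd (a simplified view of the PiSSA scheme):

```python
import torch

def pissa_init(W, r):
    """Split W into a trainable principal part (B @ A) and a frozen residual."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    sqrt_S = S[:r].sqrt()
    B = U[:, :r] * sqrt_S              # d x r, absorbs sqrt of top singular values
    A = sqrt_S.unsqueeze(1) * Vh[:r]   # r x k
    W_res = W - B @ A                  # frozen residual; W is exact at init
    return B, A, W_res

W = torch.randn(64, 48)
B, A, W_res = pissa_init(W, r=8)
print(torch.allclose(W_res + B @ A, W, atol=1e-5))  # True
```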

| Model      | Task  | PiSSA  | LoRA   | Improvement |
|------------|-------|--------|--------|-------------|
| Gemma-7B   | GSM8K | 77.7%  | 74.53% | +3.17%      |
| Mistral-7B | GSM8K | 72.86% | 67.7%  | +5.16%      |

6. Other Variants

| Variant          | Key Innovation                                            |
|------------------|-----------------------------------------------------------|
| AdaLoRA          | Learns which layers need higher rank, prunes during training |
| NoRA (ICCV 2025) | Nested structure for better initialization                |

7. Comparison

Performance (Mistral-7B GSM8K)

| Method  | Score  | Memory (7B) | Speed  |
|---------|--------|-------------|--------|
| Full FT | 73.0%  | 112 GB      | Slow   |
| PiSSA   | 72.86% | 28 GB       | Fast   |
| DoRA    | ~70%   | 28 GB       | Fast   |
| rsLoRA  | ~69%   | 28 GB       | Fast   |
| LoRA    | 67.7%  | 28 GB       | Fast   |
| QLoRA   | ~66%   | 6 GB        | Medium |

Quality Benchmarks

| Method           | MMLU | GSM8K | Quality vs Full |
|------------------|------|-------|-----------------|
| Full fine-tuning | 65.2 | 52.1  | 100%            |
| LoRA (r=16)      | 64.8 | 51.8  | 99%             |
| LoRA (r=8)       | 64.1 | 50.9  | 97%             |
| QLoRA            | 64.5 | 51.2  | 98%             |

8. Selection Guide

Limited VRAM (<16GB)?     -> QLoRA
Maximum performance?      -> PiSSA
Drop-in LoRA upgrade?     -> rsLoRA
Higher learning capacity? -> DoRA
Stable, battle-tested?    -> Standard LoRA
| Use Case                | Recommended    | Reason               |
|-------------------------|----------------|----------------------|
| Consumer GPU (RTX 4090) | QLoRA          | Fits in 24 GB        |
| Single A100 (40 GB)     | QLoRA or LoRA  | Both work            |
| Multi-GPU (2x A100)     | rsLoRA or DoRA | No memory constraint |
| Production deployment   | rsLoRA         | Best tradeoff        |
| Research/benchmarking   | PiSSA          | Maximum performance  |
| Continual learning      | DoRA           | Better stability     |

9. Best Practices

Hyperparameters

| Parameter      | Recommended                | Notes                     |
|----------------|----------------------------|---------------------------|
| Rank (r)       | 8-64                       | Higher = more capacity    |
| Alpha          | 16-32 (alpha = 2r typical) | Scaling = alpha/r         |
| Target modules | q_proj, v_proj minimum     | All linear = best quality |
| Learning rate  | 1e-4 to 5e-4               | Higher than full FT       |
| Dropout        | 0.05-0.1                   | Optional regularization   |
| Warmup ratio   | 0.03-0.05                  | With cosine annealing     |

Target Modules (LLaMA/Mistral)

# Attention only (minimal)
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]

# All linear layers (best quality)
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                  "gate_proj", "up_proj", "down_proj"]

Common Pitfalls

| Pitfall                 | Solution                        |
|-------------------------|---------------------------------|
| Too high rank           | Start r=16, increase if needed  |
| Wrong alpha             | Keep alpha = 2r                 |
| Learning too fast       | Reduce LR, increase warmup      |
| Catastrophic forgetting | Lower LR, LoRA on all layers    |

LoRA does NOT prevent catastrophic forgetting

LoRA reduces the risk thanks to its small parameter count, but does NOT eliminate it. With a high LR (>5e-4) or long training runs, the model still loses general capabilities. Mitigations: a low LR (1e-4), LoRA on ALL linear layers (not just q/v), and regular checks on held-out general benchmarks.

alpha and r are coupled -- scaling = alpha/r

alpha=16, r=8 gives the same scaling (2.0) as alpha=32, r=16. Many people tune both, which is pointless. Fix alpha = 2 * r and tune ONLY the rank. rsLoRA sidesteps the issue: scaling = alpha/sqrt(r) is stable at any rank.

Frameworks (2025-2026)

| Framework        | Pros                        | Cons                 |
|------------------|-----------------------------|----------------------|
| HuggingFace PEFT | Standard, well-documented   | Verbose API          |
| Unsloth          | 2x faster, memory efficient | Newer                |
| Axolotl          | Config-driven YAML          | Steep learning curve |

Production Considerations

  • Adapter merging: merge LoRA into base weights for zero inference overhead
  • Multi-LoRA serving: multiple adapters for different tasks on same base model
  • A/B testing: compare LoRA configs before deployment
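Merging is exact, not approximate; a quick check on plain tensors, assuming the W' = W0 + (alpha/r)BA convention used throughout:

```python
import torch

d, k, r, alpha = 32, 16, 4, 8.0
scaling = alpha / r
W0 = torch.randn(d, k)
A = torch.randn(r, k)
B = torch.randn(d, r)
x = torch.randn(5, k)                              # batch of 5 inputs

y_adapter = x @ W0.T + (x @ A.T @ B.T) * scaling   # base + LoRA branch
W_merged = W0 + (B @ A) * scaling
y_merged = x @ W_merged.T                          # one matmul after merging

print(torch.allclose(y_adapter, y_merged, atol=1e-4))  # True
```

Because the merged weight reproduces the adapter path exactly, serving after the merge costs the same as serving the base model.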

For Interviews

Q: "Explain LoRA and why it is needed."

LoRA freezes the pretrained weights W0 and adds trainable low-rank matrices: W' = W0 + (alpha/r) * BA, where B in R^(d x r), A in R^(r x k), r << min(d,k). At d=k=4096, r=8: 65.5K trainable params vs 16.7M (256x reduction). Memory: 10-15% of full fine-tuning. Quality: 97-99% of full FT (MMLU, GSM8K). Inference: zero overhead after merging. Initialization: A via Kaiming uniform, B as zeros (output starts at zero).

Q: "Compare LoRA, QLoRA, DoRA, rsLoRA."

QLoRA: base model in 4-bit NF4 + LoRA adapters in full precision. A 70B model on 48 GB (vs 1.1 TB full FT). Innovations: NF4, double quantization, paged optimizers. Trade-off: slightly slower training, marginal quality loss. DoRA: decompose W = m * V/||V|| (magnitude x direction), apply LoRA to V only. +2-4% vs standard LoRA. rsLoRA: scaling alpha/sqrt(r) instead of alpha/r -- stable across rank values, strictly better. PiSSA: SVD-based init, best performance (+3-5% vs LoRA), slightly longer initialization.

Q: "Write the LoRA forward pass."

h = W0 * x + (alpha/r) * B * A * x. A in R^(r x k) is initialized with Kaiming, B in R^(d x r) with zeros. Gradients: dL/dB = dL/dh * (Ax)^T, dL/dA = B^T * dL/dh * x^T. Merge for inference: W_merged = W0 + B * A * (alpha/r), zero overhead.


Key Numbers

| Fact                                | Value        |
|-------------------------------------|--------------|
| Parameter reduction (r=8, d=4096)   | 256x         |
| Parameter reduction (r=4)           | 512x         |
| Memory reduction LoRA vs full FT    | 75-90%       |
| Memory reduction QLoRA vs full FT   | 95%+         |
| QLoRA 70B model memory              | 48 GB        |
| LoRA quality vs full FT             | 97-99%       |
| PiSSA vs LoRA (GSM8K)               | +3-5%        |
| DoRA vs LoRA                        | +2-4%        |
| FinLoRA avg gain over base          | 36%          |
| Training speed LoRA vs full FT      | 2-3x faster  |
| LoRA adoption in PEFT               | 90%+         |

Formulas

LoRA Forward

\[h = W_0 x + \frac{\alpha}{r} B A x\]

rsLoRA Forward

\[h = W_0 x + \frac{\alpha}{\sqrt{r}} B A x\]

DoRA Forward

\[W' = m' \cdot \frac{V + \Delta V}{\|V + \Delta V\|}, \quad \Delta V = BA\]

Parameter Reduction

\[\text{Reduction} = \frac{d \times k}{r \times (d + k)} = \frac{k}{2r} \quad (d = k)\]

PiSSA Init

\[W = U \Sigma V^T \approx U_r \Sigma_r V_r^T\]

Misconception: LoRA is always better than full fine-tuning

No. On tasks with a domain shift of >30% (medicine, law, specialized languages), full fine-tuning still wins by 2-5% on specialized benchmarks. LoRA is optimal for adaptation (style, format, tone), not for a deep domain change.

Misconception: higher rank is always better

With standard LoRA, raising the rank above 32-64 often hurts results because of unstable scaling (alpha/r shrinks). rsLoRA (alpha/sqrt(r)) fixes this: with it, higher rank actually helps. If you use standard LoRA, start at r=16 and increase only with validation-set checks.

Misconception: target_modules with only q_proj and v_proj is enough

Early work recommended adapting only the attention projections. Studies from 2025-2026 show that LoRA on ALL linear layers (q/k/v/o_proj + gate/up/down_proj) gains +2-4% quality while only doubling the trainable params. For LLaMA/Mistral, always use all 7 linear layers.

Interview Questions

Q: When is LoRA better than full fine-tuning?

❌ Red flag: "LoRA is always better because it's cheaper"

✅ Strong answer: "LoRA is optimal for adaptation tasks (chatbot style, format, tone). Under a deep domain shift, full fine-tuning gains 2-5%. The key factor is rank selection: r=8 for simple tasks, r=64-128 for hard ones. QLoRA adds 4-bit quantization with minimal loss. On MMLU, LoRA r=16 reaches 99% of full-FT quality (64.8 vs 65.2)."

Q: Write the LoRA forward pass and explain the initialization.

❌ Red flag: "A and B are initialized randomly" (no details)

✅ Strong answer: "h = W0*x + (alpha/r)*B*A*x. A in R^(r x k) is initialized Kaiming uniform, B in R^(d x r) to zeros. This guarantees delta W = 0 at the start of training (a stable start). Merge for inference: W_merged = W0 + B*A*(alpha/r), zero overhead. Gradients: dL/dB = dL/dh * (Ax)^T, dL/dA = B^T * dL/dh * x^T."

Q: Compare QLoRA, DoRA, rsLoRA, PiSSA -- when to choose which?

❌ Red flag: Describing only QLoRA with no knowledge of the newer variants

✅ Strong answer: "QLoRA: base model in 4-bit NF4, adapters in full precision -- for limited VRAM (70B on 48 GB). DoRA: decompose W = m * V/||V||, LoRA only on the direction V -- +2-4% quality, best stability for continual learning. rsLoRA: scaling alpha/sqrt(r) instead of alpha/r -- strictly better, stable at any rank. PiSSA: SVD init from the principal singular values -- +3-5% vs LoRA (77.7% vs 74.5% on GSM8K, Gemma-7B), but slower initialization. Default choice for 2026: rsLoRA."


Sources

  1. Hu et al. -- "LoRA: Low-Rank Adaptation of Large Language Models" (arXiv, 2021)
  2. Dettmers et al. -- "QLoRA: Efficient Finetuning of Quantized LLMs" (arXiv)
  3. NVIDIA -- "Introducing DoRA: A High-Performing Alternative to LoRA"
  4. FinLoRA Benchmark (arXiv 2505.19819, May 2025)
  5. Sebastian Raschka -- "LoRA and DoRA from Scratch"
  6. PiSSA GitHub & arXiv (Apr 2024)
  7. ICCV 2025 -- "NoRA: Nested Low-Rank Adaptation"
  8. Unsloth Documentation (Jan 2026)
  9. HuggingFace PEFT Documentation
  10. Lightning AI -- "Parameter-Efficient LLM Finetuning With LoRA"

See Also

  • Fine-Tuning Techniques -- full fine-tuning vs PEFT, data preparation, evaluation
  • Quantization -- QLoRA relies on 4-bit NF4 quantization of the base model
  • Distributed Training -- FSDP2 natively supports LoRA adapters via DTensor
  • Alignment Methods -- DPO/RLHF fine-tuning is often combined with LoRA for efficiency
  • Production Deploy -- multi-LoRA serving, adapter merging для zero-overhead inference