Mixture of Experts (MoE): Architecture and Load Balancing¶

~10 min read

Prerequisites: Efficient Transformers | Distributed Training with FSDP/DeepSpeed

MoE is an architecture that scales models to hundreds of billions of parameters while keeping the per-token inference cost roughly fixed. DeepSeek V3 has 671B total parameters but activates only 37B (5.5%) per token. Relative to a dense model of the same total size, that is a 90-95% reduction in per-token compute; reported vLLM throughput is 18K+ tokens/sec versus about 8K for dense Llama-70B. The key problem is load balancing: without it, 60-80% of capacity is lost to expert collapse (rich-get-richer). Loss-free routing (DeepSeek V3) addresses this without gradient interference.

Key Concepts¶

MoE is a sparse architecture: each token is processed by a subset of experts (top-K) rather than by the whole network.

How MoE Works¶

Component	Description
Experts	Specialized FFN modules
Router/Gating	Decides which experts process each token
Active parameters	Subset used per token (e.g. 37B of 671B)
Total parameters	Full model size

Key Design Decisions¶

Choice	Trade-off
Number of experts	More = finer specialization, harder balancing
Experts per token	More = better quality, smaller compute savings
Expert size	Larger = more capacity, more compute
Routing strategy	Top-K, expert choice, loss-free

MoE Efficiency¶

Metric	Value
Compute reduction per inference	90-95% vs dense (same total parameter count)
Training efficiency	2-7x faster than dense
Memory growth	Sub-linear in parameter count
Power reduction	Up to 50%

1. MoE Forward Pass¶

\[y = \sum_{i=1}^{N} g_i(x) \cdot E_i(x)\]

\(g_i(x)\) -- gating weight for expert \(i\) (from the softmax router); zero for experts outside the top-K
\(E_i(x)\) -- output of expert \(i\)
Top-K: typically K=2 (Mixtral) or K=8+1 (DeepSeek V3: 8 routed + 1 shared)
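The forward pass above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the scalar "experts" and the names `softmax` and `moe_forward` are hypothetical, and the gate weights are renormalized over the selected experts, as the PyTorch code later in this page does.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, router_logits, experts, top_k):
    """y = sum_i g_i(x) * E_i(x), with g_i = 0 outside the top-k experts."""
    probs = softmax(router_logits)
    # Indices of the k largest router probabilities.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Renormalize the selected probabilities so the gates sum to 1.
    norm = sum(probs[i] for i in top)
    gates = {i: probs[i] / norm for i in top}
    y = sum(g * experts[i](x) for i, g in gates.items())
    return y, gates

# Toy scalar experts (hypothetical, for illustration only).
experts = [lambda x: 2.0 * x, lambda x: x + 1.0, lambda x: -x, lambda x: 0.5 * x]
y, gates = moe_forward(3.0, [2.0, 2.0, 0.0, -1.0], experts, top_k=2)
print(y, gates)
```

Only the two selected experts are evaluated; the other two contribute nothing and cost nothing, which is where the compute savings come from.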

2. Leading MoE Models, 2025-2026¶

Model	Total Params	Active Params	Experts	Architecture
DeepSeek V3	671B	37B	256	9 experts/token (8+1 shared), MLA
DeepSeek V3.2	685B	37B	256	Enhanced load balancing
Llama 4 Maverick	402B	17B	128	2 experts/token, 8192 hidden
Qwen3-235B	235B	22B	128	8 experts/token
Qwen3-480B Coder	480B	~40B	TBD	Code-specialized
GPT-OSS-120B	120B	~20B	TBD	Open-source MoE
Kimi K2	~1T	32B	TBD	Long context optimized
Mixtral 8x7B	47B	13B	8	2 experts/token, auxiliary loss

Parameter Efficiency¶

Model	Active/Total	Efficiency Gain (total/active)
DeepSeek V3	5.5% (37B/671B)	~18x
Llama 4 Maverick	4.2% (17B/402B)	~24x
Qwen3-235B	9.4% (22B/235B)	~11x
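The two derived columns follow directly from the parameter counts. A short check, using the totals and active counts from the table above (the helper name `efficiency` is illustrative):

```python
# (total, active) parameter counts in billions, taken from the table above.
models = {
    "DeepSeek V3": (671, 37),
    "Llama 4 Maverick": (402, 17),
    "Qwen3-235B": (235, 22),
}

def efficiency(total_b, active_b):
    """Return (active fraction, total/active ratio)."""
    return active_b / total_b, total_b / active_b

for name, (total_b, active_b) in models.items():
    frac, gain = efficiency(total_b, active_b)
    print(f"{name}: {frac:.1%} active, ~{gain:.0f}x")
```

The "efficiency gain" here is only the ratio of stored to active parameters; it says nothing about quality per parameter.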

DeepSeek V3 vs Llama 4¶

DeepSeek V3: 9 active experts per token (8 routed + 1 shared), fine-grained design (256 smaller experts), auxiliary-loss-free routing, MLA attention. Llama 4 Maverick: 2 active experts per token, fewer but larger experts (8192 hidden), simpler routing. Both are reported to match GPT-4o on coding and reasoning.

3. MoE vs Dense¶

Inference Efficiency¶

Metric	MoE (DeepSeek V3)	Dense (Llama-70B)
Active parameters	37B	70B
Compute reduction	~90%	Baseline
Tokens/sec (vLLM)	18,000+	~8,000
Memory per request	Sub-linear	Linear

Training Efficiency¶

Benchmark	MoE Advantage
MLPerf Training v5.0	2.1x faster (1024 H100s)
Perplexity	4.92 vs 5.15 (dense Llama-70B)

4. Load Balancing¶

The Problem¶

\[\text{Imbalance} \rightarrow \text{Rich-Get-Richer} \rightarrow \text{Expert Collapse}\]

Without balancing, a few experts process most of the tokens while the rest stay undertrained, so 60-80% of capacity is wasted.

4.1 Standard Auxiliary Loss¶

\[\mathcal{L}_{aux} = \alpha \cdot N \cdot \sum_{i=1}^{N} f_i \cdot P_i\]

\(f_i\) = fraction of tokens routed to expert \(i\)
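\(P_i\) = average routing probability for expert \(i\)
\(\alpha \approx 0.01\)

Complete MoE Training Loss (\(\alpha\) is already included in \(\mathcal{L}_{aux}\) as defined above):

\[\mathcal{L}_{total} = \mathcal{L}_{task} + \mathcal{L}_{aux} + \beta \cdot \mathcal{L}_{z}\]

\(\mathcal{L}_z\) = Router Z-Loss (stability): \(\frac{1}{BS} \sum_{i,j} (\log \sum_k e^{z_{ijk}})^2\), where \(z_{ijk}\) is the router logit for batch item \(i\), position \(j\), expert \(k\), and \(B\), \(S\) are the batch size and sequence length
\(\beta \approx 0.001\)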
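To make the Z-loss concrete, here is a small stdlib-only sketch that evaluates it on nested lists. The function name `router_z_loss` and the toy inputs are illustrative; a real implementation would operate on tensors.

```python
import math

def router_z_loss(logits):
    """Mean squared log-sum-exp of router logits.

    logits[b][s] is the list of router logits over experts for one token.
    """
    total = 0.0
    count = 0
    for sequence in logits:
        for token_logits in sequence:
            m = max(token_logits)
            # Stable log-sum-exp over the expert dimension.
            lse = m + math.log(sum(math.exp(z - m) for z in token_logits))
            total += lse ** 2
            count += 1
    return total / count

small = [[[0.0, 0.0], [0.0, 0.0]]]      # B=1, S=2, two experts, small logits
large = [[[10.0, 10.0], [10.0, 10.0]]]  # same routing preference, larger magnitude
print(router_z_loss(small), router_z_loss(large))
```

Both inputs give the same routing probabilities, but the larger logits are penalized much more: the Z-loss discourages logit growth rather than any particular routing decision.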

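import torch
import torch.nn.functional as F

def load_balancing_loss(gates, expert_indices, num_experts):
    """Unscaled auxiliary loss N * sum_i f_i * P_i; the caller multiplies by alpha.

    gates: [batch, seq, num_experts] router probabilities (softmax output).
    expert_indices: [batch, seq, top_k] int64 indices of the selected experts.
    """
    # f_i = fraction of routed token slots assigned to expert i
    one_hot = F.one_hot(expert_indices, num_experts).float()  # [B, S, K, N]
    token_counts = one_hot.sum(dim=(0, 1, 2))
    f = token_counts / token_counts.sum()

    # P_i = average routing probability per expert
    P = gates.mean(dim=(0, 1))

    aux_loss = (f * P).sum() * num_experts
    return aux_loss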

Problems: gradient interference (the balancing gradient competes with the task gradient), hyperparameter sensitivity (\(\alpha\) is critical), and a trade-off between balance and task performance.
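A worked example shows the range of the auxiliary loss. Under the formula above, a perfectly balanced router gives \(N \sum f_i P_i = 1\), while a fully collapsed router gives \(N\). The sketch below assumes four experts and \(\alpha = 0.01\); the helper name `aux_loss` is illustrative.

```python
def aux_loss(f, p, alpha=0.01):
    """alpha * N * sum_i f_i * P_i for per-expert load f and mean probability p."""
    n = len(f)
    return alpha * n * sum(fi * pi for fi, pi in zip(f, p))

n = 4
# Balanced: every expert gets 1/N of the tokens and 1/N mean probability.
balanced = aux_loss([1 / n] * n, [1 / n] * n)
# Collapsed: one expert gets every token with probability 1.
collapsed = aux_loss([1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0])
print(balanced, collapsed)
```

The loss is N times larger in the collapsed case, and its gradient (through \(P_i\)) pushes router probability away from the overloaded expert.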

4.2 Loss-Free Balancing (DeepSeek V3)¶

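Instead of an auxiliary loss, a per-expert bias \(b_i\) is adjusted dynamically. The bias only influences which experts are selected; the gating weights that multiply the expert outputs are still computed from the unbiased scores, so \(b\) never enters the gradient path:

\[\mathcal{S}(x) = \text{TopK}(z + b), \qquad g_i(x) = \frac{s_i}{\sum_{j \in \mathcal{S}(x)} s_j}\ \text{for } i \in \mathcal{S}(x), \quad s = \text{softmax}(z)\]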

Bias Update Rule:

\[b_i \leftarrow b_i - \gamma \cdot (f_i - \frac{1}{N})\]

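import torch
import torch.nn as nn
import torch.nn.functional as F

class LossFreeMoERouter(nn.Module):
    def __init__(self, d_model, num_experts, top_k, bias_update_rate=0.001):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)
        self.top_k = top_k
        self.num_experts = num_experts
        # Buffers, not parameters: the optimizer never touches them.
        self.register_buffer('expert_bias', torch.zeros(num_experts))
        self.register_buffer('load_tracker', torch.zeros(num_experts))
        self.bias_update_rate = bias_update_rate

    def forward(self, x):
        logits = self.gate(x)
        probs = F.softmax(logits, dim=-1)
        # The bias only steers expert selection; it receives no gradient.
        _, topk_indices = torch.topk(logits + self.expert_bias, self.top_k, dim=-1)
        # Gating weights come from the unbiased probabilities.
        topk_probs = probs.gather(-1, topk_indices)
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)

        if self.training:
            self._update_load_tracker(topk_indices)
        return topk_indices, topk_probs

    @torch.no_grad()
    def _update_load_tracker(self, topk_indices):
        # Accumulate how many token slots each expert received since the last update.
        counts = torch.bincount(topk_indices.reshape(-1), minlength=self.num_experts)
        self.load_tracker += counts.to(self.load_tracker.dtype)

    @torch.no_grad()
    def update_bias(self):
        # Call once per training step (or every few steps), after forward().
        total = self.load_tracker.sum().clamp(min=1.0)
        f = self.load_tracker / total  # f_i: fraction of slots routed to expert i
        target = 1.0 / self.num_experts
        self.expert_bias -= self.bias_update_rate * (f - target)
        self.load_tracker.zero_()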

Aspect	Auxiliary Loss	Loss-Free
Gradient interference	Yes	No
Model quality	Limited by the balance trade-off	Better
Hyperparameters	\(\alpha\) is critical	\(\gamma\) is less sensitive
Training stability	Can fluctuate	More stable
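The effect of the bias update rule can be seen in a toy simulation. This is a sketch under stated assumptions, not DeepSeek's training setup: top-1 routing, four experts, synthetic scores in which expert 0 starts with a fixed advantage of 2.0, uniform noise, and \(\gamma = 0.1\). All names (`route_top1`, `simulate`) are hypothetical.

```python
import random

def route_top1(scores, bias):
    # Select the expert with the highest biased score.
    biased = [s + b for s, b in zip(scores, bias)]
    return max(range(len(biased)), key=lambda i: biased[i])

def simulate(num_experts=4, tokens_per_step=200, steps=100, gamma=0.1, seed=0):
    rng = random.Random(seed)
    base = [2.0] + [0.0] * (num_experts - 1)  # expert 0 starts out strongly preferred
    bias = [0.0] * num_experts
    history = []  # largest per-expert load fraction at each step
    for _ in range(steps):
        counts = [0] * num_experts
        for _ in range(tokens_per_step):
            scores = [b + rng.uniform(-0.5, 0.5) for b in base]
            counts[route_top1(scores, bias)] += 1
        load = [c / tokens_per_step for c in counts]
        history.append(max(load))
        # b_i <- b_i - gamma * (f_i - 1/N): no gradients involved.
        bias = [b - gamma * (f - 1.0 / num_experts) for b, f in zip(bias, load)]
    return history

history = simulate()
print(f"max load: first step {history[0]:.2f}, last step {history[-1]:.2f}")
```

In the first step expert 0 receives every token; as its bias falls and the others rise, the load spreads out without any extra term in the loss.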

4.3 SIMBAL (Similarity-Preserving)¶

\[\mathcal{L}_{SIMBAL} = \sum_i f_i \cdot P_i + \lambda \cdot \mathcal{L}_{sim}\]

\(\mathcal{L}_{sim}\) penalizes inconsistent routing of similar inputs. Reported result: 36% faster convergence and lower expert redundancy.

4.4 Capacity Factor¶

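\[\text{Capacity}_i = \text{cf} \times \frac{T}{N}\]

\(T\) = number of tokens in the batch, \(N\) = number of experts, cf = capacity factor. Tokens routed to an expert beyond its capacity are dropped.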

Capacity Factor	Token Drop Rate	Compute
1.0	10-20%	Minimal
1.25	5-10%	Low
1.5	1-5%	Medium
2.0	<1%	High
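The capacity rule can be illustrated with a few lines of Python. The sketch assumes top-1 routing and first-come-first-served slots; the function name `apply_capacity` and the toy assignment are hypothetical.

```python
def apply_capacity(assignments, num_experts, capacity_factor):
    """assignments[t] is the expert chosen for token t (top-1 routing assumed)."""
    capacity = int(capacity_factor * len(assignments) / num_experts)
    used = [0] * num_experts
    kept, dropped = [], []
    for token, expert in enumerate(assignments):
        if used[expert] < capacity:
            used[expert] += 1
            kept.append(token)
        else:
            # Expert is full: the token skips the MoE layer (residual path only).
            dropped.append(token)
    return capacity, kept, dropped

# 8 tokens, 4 experts, expert 0 is overloaded.
assignments = [0, 0, 0, 0, 1, 1, 2, 3]
capacity, kept, dropped = apply_capacity(assignments, num_experts=4, capacity_factor=1.25)
print(capacity, kept, dropped)
```

With cf=1.25 each expert holds 2 tokens, so two of the four tokens sent to expert 0 are dropped; with cf=2.0 the capacity is 4 and nothing is dropped, at the cost of larger padded buffers.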

Choosing a Strategy¶

Scenario	Recommendation
Research/experiments	Auxiliary loss (simpler)
Production training	Loss-free (DeepSeek)
Long training runs	Loss-free + periodic bias updates
Constrained compute	SIMBAL (faster convergence)

5. Advances 2025-2026¶

MoE++ (Zero-Computation Experts)¶

Three types of zero-computation experts:
1. Zero expert -- discards the input (outputs zero)
2. Copy expert -- passes the input through unchanged (skip connection)
3. Constant expert -- replaces the input with a learned vector

\[\text{Output} = \sum_{i \in \text{active}} g_i(x) \cdot E_i(x) + \sum_{j \in \text{zero}} g_j(x) \cdot Z_j(x)\]

Result: 1.1-2.1x expert forward throughput. Simple tokens use fewer FFN experts.
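The three zero-computation experts are cheap enough to write out in full. The sketch below follows the descriptions given here (discard, copy, replace with a vector) and is not taken from the MoE++ code; the names and the fixed "learned" vector are hypothetical.

```python
def zero_expert(x):
    # Discard: contributes nothing to the output.
    return [0.0] * len(x)

def copy_expert(x):
    # Skip connection: passes the input through unchanged.
    return list(x)

def make_constant_expert(vector):
    # Replaces the input with a fixed vector (learned during training in practice).
    def constant_expert(x):
        return list(vector)
    return constant_expert

def combine(x, gated_experts):
    """Weighted sum of expert outputs; gated_experts is a list of (gate, expert_fn)."""
    out = [0.0] * len(x)
    for gate, expert in gated_experts:
        y = expert(x)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out

x = [1.0, -2.0]
constant = make_constant_expert([0.5, 0.5])  # stands in for a learned vector
y = combine(x, [(0.5, copy_expert), (0.25, zero_expert), (0.25, constant)])
print(y)
```

None of these experts performs a matrix multiplication, so a token routed mostly to them costs far less than one routed to FFN experts.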

ExpertFlow (Predictive Routing Offloading)¶

Lightweight predictor forecasts routing paths -> dynamic token scheduling -> real-time error correction. 93.72% GPU memory savings, 2-10x inference speedup.

LPR (Latent Prototype Routing)¶

Treats routing as a clustering problem. Gini coefficient of expert load: 0.70 -> 0.035 (near-perfect balance). Tested on DeepSeek-V3, Qwen3-MoE, and Mixtral.
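For reference, the Gini coefficient of an expert-load vector can be computed as below. This uses the standard mean-absolute-difference definition; how the LPR paper measures load exactly is not stated here, so treat the inputs as an assumption.

```python
def gini(loads):
    """Gini coefficient of non-negative loads: 0 = perfectly even, (n-1)/n = one expert takes all."""
    n = len(loads)
    total = sum(loads)
    if n == 0 or total == 0:
        return 0.0
    diff_sum = sum(abs(a - b) for a in loads for b in loads)
    # Equivalent to diff_sum / (2 * n^2 * mean).
    return diff_sum / (2 * n * total)

print(gini([25, 25, 25, 25]))  # balanced load
print(gini([100, 0, 0, 0]))    # collapsed onto one expert
```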

GuiLoMo (LoRA + MoE)¶

Allocates the number of experts and the LoRA ranks per layer using GuidedSelection Vectors (bilevel optimization), yielding a task-specific expert configuration.

Multi-head Latent Attention (MLA)¶

DeepSeek V2/V3: compresses the KV cache through a low-rank latent projection. 90%+ KV cache reduction.
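A back-of-the-envelope calculation shows where a reduction of this size can come from. The dimensions below are assumptions chosen for illustration (128 heads of size 128, a 512-dimensional latent plus a 64-dimensional positional key), and the baseline is standard multi-head attention; the actual figure depends on the model configuration and on the baseline being compared.

```python
def kv_elements_mha(num_heads, head_dim):
    # Standard attention caches one key and one value vector per head.
    return 2 * num_heads * head_dim

def kv_elements_latent(latent_dim, rope_dim):
    # Latent attention caches one compressed vector plus a small positional key.
    return latent_dim + rope_dim

# Assumed, illustrative dimensions (per token, per layer).
mha = kv_elements_mha(num_heads=128, head_dim=128)
mla = kv_elements_latent(latent_dim=512, rope_dim=64)
reduction = 1 - mla / mha
print(f"{mha} -> {mla} cached elements per token per layer ({reduction:.1%} smaller)")
```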

6. Expert Parallelism¶

Distributed Pattern¶

graph TD
    subgraph GPU0["GPU 0"]
        E0[Expert 0]
        E1[Expert 1]
    end
    subgraph GPU1["GPU 1"]
        E2[Expert 2]
        E3[Expert 3]
    end
    subgraph GPU2["GPU 2"]
        E4[Expert 4]
        E5[Expert 5]
    end
    subgraph GPU3["GPU 3"]
        E6[Expert 6]
        E7[Expert 7]
    end

    GPU0 <-->|All-to-All| GPU1
    GPU1 <-->|All-to-All| GPU2
    GPU2 <-->|All-to-All| GPU3

    style E0 fill:#e8eaf6,stroke:#3f51b5
    style E1 fill:#e8eaf6,stroke:#3f51b5
    style E2 fill:#e8f5e9,stroke:#4caf50
    style E3 fill:#e8f5e9,stroke:#4caf50
    style E4 fill:#fff3e0,stroke:#ef6c00
    style E5 fill:#fff3e0,stroke:#ef6c00
    style E6 fill:#f3e5f5,stroke:#9c27b0
    style E7 fill:#f3e5f5,stroke:#9c27b0

All-to-All communication: send tokens to expert owners -> local computation -> return outputs.
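The dispatch, compute, and return steps can be simulated on a single machine. The sketch below assumes top-1 routing and a contiguous expert-to-GPU mapping (expert `e` lives on GPU `e // experts_per_gpu`), matching the 8-experts-on-4-GPUs layout in the diagram; real systems do this with collective communication, which is not modeled here.

```python
from collections import defaultdict

def expert_parallel_forward(tokens, expert_ids, experts, experts_per_gpu):
    """Simulate dispatch -> local compute -> combine for top-1 routing."""
    # Dispatch: group token positions by the GPU that owns their expert.
    inbox = defaultdict(list)
    for pos, expert_id in enumerate(expert_ids):
        gpu = expert_id // experts_per_gpu
        inbox[gpu].append(pos)

    # Local compute on each GPU, then return outputs to their original positions.
    outputs = [None] * len(tokens)
    for gpu, positions in inbox.items():
        for pos in positions:
            outputs[pos] = experts[expert_ids[pos]](tokens[pos])
    return outputs, dict(inbox)

experts = [lambda x, k=k: x + k for k in range(8)]  # toy experts: add the expert id
tokens = [10, 20, 30, 40]
expert_ids = [7, 0, 3, 6]
outputs, inbox = expert_parallel_forward(tokens, expert_ids, experts, experts_per_gpu=2)
print(outputs, inbox)
```

The `inbox` sizes are exactly what load balancing tries to even out: a GPU whose experts receive most of the tokens becomes the straggler for the whole step.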

Fine-Grained vs Traditional¶

Aspect	Traditional	Fine-Grained
Expert size	Full FFN	Smaller chunks
Number of experts	8-64	64-256+
Routing flexibility	Lower	Higher
Collapse risk	Higher	Lower

DeepSeek V3 uses 256 fine-grained routed experts (DeepSeek V2: 160), versus 8 in a traditional design such as Mixtral 8x7B.

7. Expert Specialization¶

Pattern	Description
Domain experts	Specialize in code, math, etc.
Syntax experts	Punctuation, formatting
Reasoning experts	Logical inference, multi-step reasoning
Super experts	A critical subset with disproportionate impact

8. Complete MoE Layer¶

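import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model, d_ff, num_experts, top_k=2,
                 capacity_factor=1.25, aux_loss_coef=0.01, use_loss_free=False):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        # Stored for completeness; this reference loop never drops tokens.
        self.capacity_factor = capacity_factor
        self.aux_loss_coef = aux_loss_coef
        self.use_loss_free = use_loss_free
        self.gate = nn.Linear(d_model, num_experts)

        if use_loss_free:
            self.register_buffer('expert_bias', torch.zeros(num_experts))
            self.register_buffer('load_tracker', torch.zeros(num_experts))
            self.bias_update_rate = 0.001

        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):
        # x: [batch, seq, d_model]
        logits = self.gate(x)
        probs = F.softmax(logits, dim=-1)

        # Loss-free mode: the bias steers selection only, not the gate weights.
        select_scores = logits + self.expert_bias if self.use_loss_free else logits
        _, topk_indices = torch.topk(select_scores, self.top_k, dim=-1)
        topk_probs = probs.gather(-1, topk_indices)
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)

        if self.use_loss_free and self.training:
            with torch.no_grad():
                counts = torch.bincount(topk_indices.reshape(-1), minlength=self.num_experts)
                self.load_tracker += counts.to(self.load_tracker.dtype)

        # Simple (slow) dispatch: loop over the K slots and over all experts.
        output = torch.zeros_like(x)
        for k in range(self.top_k):
            expert_ids = topk_indices[:, :, k]
            expert_weights = topk_probs[:, :, k:k+1]
            for e in range(self.num_experts):
                mask = (expert_ids == e)
                if mask.any():
                    expert_input = x[mask]
                    output[mask] += expert_weights[mask] * self.experts[e](expert_input)

        aux_loss = None
        if not self.use_loss_free and self.training:
            one_hot = F.one_hot(topk_indices, self.num_experts).float()
            f = one_hot.sum(dim=(0, 1, 2))
            f = f / f.sum()
            P = probs.mean(dim=(0, 1))
            aux_loss = self.aux_loss_coef * (f * P).sum() * self.num_experts

        return output, aux_loss

    @torch.no_grad()
    def update_bias(self):
        # Loss-free mode only: call after each training step.
        f = self.load_tracker / self.load_tracker.sum().clamp(min=1.0)
        self.expert_bias -= self.bias_update_rate * (f - 1.0 / self.num_experts)
        self.load_tracker.zero_()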

For Interviews¶

Q: "Explain MoE architecture and why it's better than dense."¶

MoE is a sparse architecture: each token is processed by the top-K experts out of N total. DeepSeek V3: 671B total, 37B active (5.5%), 256 routed experts, K=9 (8 routed + 1 shared). Gains: 90-95% compute reduction per inference, 2-7x faster training, sub-linear memory growth. vLLM: 18K+ tokens/sec vs 8K for dense. Trade-offs: harder load balancing, all-to-all communication for expert parallelism, risk of expert collapse.

Q: "Compare auxiliary loss vs loss-free load balancing."¶

Auxiliary loss: \(\mathcal{L}_{aux} = \alpha N \sum f_i P_i\), \(\alpha \approx 0.01\). Problems: gradient interference, sensitivity to \(\alpha\), a trade-off between quality and balance. Loss-free (DeepSeek V3): a dynamic bias \(b_i\) is updated without gradients: \(b_i \leftarrow b_i - \gamma(f_i - 1/N)\). Advantages: no gradient interference, better quality, more stable training. DeepSeek V3 was trained on 14.8T tokens and remained remarkably stable. SIMBAL: preserves routing similarity for similar inputs, 36% faster convergence. Recommendation: loss-free for production, auxiliary loss for research.

Q: "What is expert collapse and how to prevent it?"¶

Rich-get-richer: without balancing, a few experts receive 60%+ of the tokens and the rest stay undertrained, so 60-80% of capacity is wasted. Prevention: (1) Auxiliary loss. (2) Loss-free bias (DeepSeek V3). (3) Fine-grained segmentation (256 small experts vs 8 large ones, as in DeepSeek V2/V3). (4) Capacity factor (1.25-2.0) with token dropping. (5) MoE++ zero-computation experts (zero/copy/constant, 1.1-2.1x throughput). (6) LPR: Gini 0.70 -> 0.035, near-perfect balance.

Key Numbers¶

Fact	Value
DeepSeek V3 total/active	671B / 37B (5.5%)
Llama 4 Maverick total/active	402B / 17B (4.2%)
Qwen3-235B total/active	235B / 22B (9.4%)
MoE compute reduction	90-95% vs dense
MoE training speedup	2-7x vs dense
vLLM tokens/sec (DeepSeek V3)	18,000+
MLPerf: MoE vs dense	2.1x faster (1024 H100s)
DeepSeek V3 training tokens	14.8T
MoE++ throughput gain	1.1-2.1x
ExpertFlow memory savings	93.72%
ExpertFlow inference speedup	2-10x
LPR Gini improvement	0.70 -> 0.035
SIMBAL convergence speedup	36%
MLA KV cache reduction	90%+
Capacity waste without balancing	60-80%
Typical \(\alpha\) (aux loss)	0.01
Typical \(\beta\) (z-loss)	0.001
Typical \(\gamma\) (bias update)	0.001
Typical capacity_factor	1.25-2.0

Formulas¶

\[y = \sum_{i=1}^{N} g_i(x) \cdot E_i(x) \quad \text{(MoE forward)}\]

\[\mathcal{L}_{aux} = \alpha \cdot N \cdot \sum_{i=1}^{N} f_i \cdot P_i \quad \text{(Auxiliary loss)}\]

\[\mathcal{L}_z = \frac{1}{BS} \sum_{i,j} (\log \sum_k e^{z_{ijk}})^2 \quad \text{(Router Z-loss)}\]

\[b_i \leftarrow b_i - \gamma \cdot (f_i - \frac{1}{N}) \quad \text{(Loss-free bias update)}\]

\[\text{Capacity}_i = \text{cf} \times \frac{T}{N} \quad \text{(Token capacity)}\]

\[\text{Memory savings} = 1 - \frac{\text{Active Experts on GPU}}{\text{Total Experts}} \quad \text{(ExpertFlow)}\]

Sources¶

FriendliAI -- "The Rise of MoE: Comparing 2025's Leading MoE AI Models"
Sebastian Raschka -- "The Big LLM Architecture Comparison" (2025)
arXiv -- MoE-CAP: Benchmarking Cost, Accuracy and Performance (2412.07067)
arXiv -- ExpertFlow: Predictive Routing Path Offloading (2410.17954)
arXiv -- LPR: Latent Prototype Routing (2506.21328)
arXiv -- MoE++: Zero-Computation Experts (2410.07348)
arXiv -- SIMBAL: Similarity Preserving Routers (2506.14038)
arXiv -- Loss-Free Balancing (2408.15664)
arXiv -- DeepSeek-V3 Technical Report (2412.19437)
arXiv -- GuiLoMo: LoRA-MoE Optimization (2506.14646)
arXiv -- FLEX-MoE: Federated MoE (2512.23070)
arXiv -- Least-Loaded Expert Parallelism (2601.17111)
Michael Brenndoerfer -- "MoE Load Balancing" (Jan 2026)
Cameron Wolfe -- "nanoMoE: MoE LLMs from Scratch" (Mar 2025)

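Misconception: MoE memory scales with active parameters, the way compute does

Per-token compute scales with the 37B active parameters, but weight storage does not: any expert can be selected for the next token, so all 671B parameters of DeepSeek V3 have to be held somewhere (sharded across GPUs, or offloaded to host memory or disk). Without offloading, its weight footprint is far larger than that of a dense 70B model. What can shrink is the GPU-resident portion: ExpertFlow reports 93.72% GPU memory savings by predicting routing paths and keeping only the experts about to be used on the GPU.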

Misconception: an auxiliary loss is enough for stable balancing

An auxiliary loss creates gradient interference: the gradients of \(\mathcal{L}_{aux}\) conflict with those of \(\mathcal{L}_{task}\), and the hyperparameter \(\alpha\) is highly sensitive. DeepSeek V3 showed that a loss-free bias update (no gradients) gives better model quality with stable balancing over 14.8T tokens. For production training, loss-free balancing is preferable.

Misconception: more experts per token always means a better model

Increasing K (experts per token) improves quality but erodes compute efficiency, which is the main advantage of MoE. Mixtral (K=2 of 8) activates about 28% of its parameters; DeepSeek V3 (K=9, 8 of 256 routed + 1 shared) activates about 5.5%. At K=N, MoE degenerates into a dense model. The optimum is the smallest K that delivers sufficient quality.

Interview Questions¶

Q: Explain the MoE architecture. Why is it more efficient than dense?

Red flag: "MoE uses several experts and picks the best one for each request."

Strong answer: "MoE replaces the FFN in a transformer block with N specialized experts plus a router. Each token is processed by the top-K experts, not by all of them: \(y = \sum g_i(x) E_i(x)\). DeepSeek V3: 671B total, 37B active (5.5%), K=9 (8 of 256 routed + 1 shared). Gains: 90-95% compute reduction per inference, 2-7x faster training, 18K tokens/sec vs 8K dense. Trade-offs: all-to-all communication overhead under expert parallelism, risk of expert collapse without balancing, and harder deployment (all parameters must be held in memory, even the inactive ones)."

Q: Compare auxiliary-loss and loss-free balancing.

Red flag: "Loss-free is better because it doesn't need an extra loss."

Strong answer: "Auxiliary loss: \(\mathcal{L}_{aux} = \alpha N \sum f_i P_i\) penalizes an uneven distribution of tokens. Problems: gradient interference (the gradients of \(\mathcal{L}_{aux}\) vs \(\mathcal{L}_{task}\)), sensitivity to \(\alpha \approx 0.01\), a trade-off between quality and balance. Loss-free (DeepSeek V3): a dynamic bias \(b_i \leftarrow b_i - \gamma(f_i - 1/N)\) is updated without gradients. No gradient interference, better quality, stable over 14.8T tokens. A third option is SIMBAL: it preserves routing similarity for similar inputs and converges 36% faster. Recommendation: loss-free for production, auxiliary loss for quick experiments."

Q: What is expert collapse and how do you prevent it?

Red flag: "Expert collapse is when experts stop working. The fix is to train longer."

Strong answer: "Rich-get-richer: popular experts receive more tokens, train better, and then receive even more, while the rest stay undertrained. 60-80% of capacity is wasted. Six prevention strategies: (1) Auxiliary loss. (2) Loss-free bias. (3) Fine-grained segmentation: 256 small experts instead of 8 large ones lowers the risk (DeepSeek V2/V3). (4) Capacity factor 1.25-2.0 with token dropping. (5) MoE++ zero-computation experts (zero/copy/constant), so simple tokens do not spend compute. (6) LPR: Gini coefficient from 0.70 down to 0.035 (near-perfect balance)."

Q: How do you deploy an MoE model with 671B parameters?

Red flag: "You need a lot of GPUs with enough memory for all the parameters."

Strong answer: "Expert parallelism: experts are distributed across GPUs (8 experts on 4 GPUs = 2 per GPU). All-to-all communication sends each token to the GPU that holds its experts. ExpertFlow: a lightweight predictor forecasts routing paths and loads experts dynamically, with 93.72% memory savings and a 2-10x speedup. In addition: MLA (Multi-head Latent Attention) from DeepSeek V2/V3 compresses the KV cache by 90%+. Quantization (GPTQ/AWQ) shrinks the expert weights, most of which are inactive for any given token. vLLM with PagedAttention for production serving."