Speculative Decoding¶
~9 min read
Prerequisites: KV Cache, vLLM and PagedAttention
Why It Matters¶
An LLM generates tokens one at a time: 1 forward pass = 1 token. During decoding the GPU runs at roughly 30% utilization -- the bottleneck is memory bandwidth (loading weights from VRAM), not compute. Compute sits idle.
An analogy: imagine an editor who proofreads text one word at a time. An assistant (the draft model) proposes 5 words at once, and the editor checks them in a single glance. If 4 of the 5 are correct, speed has roughly quadrupled. If all 5 are wrong, only one glance is lost, and the assistant tries again.
Key property: the output is identical to standard decoding. This is not an approximation -- the target model always verifies the draft. 1.5-4x speedup with no loss of quality.
Обзор¶
Speculative decoding is an LLM inference acceleration technique: a lightweight draft model proposes several tokens, and the target model verifies them all in a single forward pass.
Key Innovation¶
Speculative decoding is LOSSLESS acceleration
Unlike quantization or pruning, speculative decoding produces exactly the same output as standard autoregressive decoding. The target model always verifies the draft -- if a draft token is wrong, it is rejected and resampled. 1.5-4x speedup with no loss of quality.
Instead of 1 token per forward pass (memory-bound), you get N tokens per forward pass (shifting the work toward compute-bound).
Comparison: Standard vs Speculative Decoding¶
| Aspect | Standard Autoregressive | Speculative Decoding |
|---|---|---|
| Tokens per pass | 1 | 2-8 (avg) |
| Compute utilization | Low (compute idle) | High |
| Latency reduction | Baseline | 1.5-4× |
| Output quality | Exact | Exact (lossless) |
| Extra VRAM | None | +3-10% |
1. Why Speculative Decoding Exists¶
Problem: Memory-Bound Inference¶
LLM inference is limited by memory bandwidth, not by compute:
Every generated token requires:

1. Loading the model weights from VRAM
2. One forward pass
3. Memory synchronization
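A back-of-the-envelope check makes the bound concrete (numbers are illustrative assumptions: a 7B model in fp16 on a GPU with ~2 TB/s of HBM bandwidth):

```python
# Decode-speed ceiling implied by memory bandwidth alone (illustrative numbers)
params = 7e9             # 7B-parameter model
bytes_per_param = 2      # fp16 weights
bandwidth = 2e12         # ~2 TB/s HBM, A100/H100-class GPU (assumed)

weights_bytes = params * bytes_per_param     # ~14 GB read for EVERY token
ceiling = bandwidth / weights_bytes          # upper bound, ignoring the KV cache
print(f"~{ceiling:.0f} tokens/s ceiling")    # ~143 tokens/s, no matter how fast the ALUs are
```

No amount of extra compute raises this ceiling; speculative decoding instead makes each weight-load pay for several tokens.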
Solution: Parallel Verification¶
Speculative decoding использует idle compute для параллельной верификации:
# Standard: N passes for N tokens
for _ in range(N):
token = model(input) # N expensive passes
input = concat(input, token)
# Speculative: 1 pass for N tokens (if accepted)
draft_tokens = draft_model.generate(input, n=N) # Fast, cheap
verified_tokens = target_model.verify(input, draft_tokens) # Single pass
Key insight: the target model can check several tokens in one forward pass thanks to:

1. The KV cache covering the shared prefix
2. Parallel attention operations
3. Batch processing on the GPU

The sketch below demonstrates this with a real model.
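One forward pass over `prefix + draft` yields a logit row for every draft position at once (model choice here is purely illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # illustrative stand-in target
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prefix = tok("The cat", return_tensors="pt").input_ids
draft = tok(" sat on the mat", return_tensors="pt").input_ids  # pretend draft tokens

seq = torch.cat([prefix, draft], dim=-1)
with torch.no_grad():
    logits = model(seq).logits              # ONE pass scores all positions

# logits at position prefix_len - 1 + i give the target distribution for draft token i
for i in range(draft.shape[1]):
    pos = prefix.shape[1] - 1 + i
    p = torch.softmax(logits[0, pos], dim=-1)[draft[0, i]]
    print(f"draft token {i}: target prob {p:.3f}")
```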
2. Draft-Target Approach (Classic)¶
Architecture¶
graph LR
DRAFT["Draft Model<br/>(small, fast)"] -->|"Draft N tokens"| INPUT["Input +<br/>Draft Tokens"]
INPUT --> TARGET["Target Model<br/>(large)"]
TARGET -->|"Verify in 1 pass"| ACC["Accepted Tokens"]
style DRAFT fill:#e8f5e9,stroke:#4caf50
style TARGET fill:#e8eaf6,stroke:#3f51b5
style ACC fill:#fff3e0,stroke:#ef6c00
Rejection Sampling¶
Acceptance Logic:

\[
P(\text{accept } t) = \min\left(1,\ \frac{p_{\text{target}}(t)}{p_{\text{draft}}(t)}\right)
\]

If the token is rejected, a replacement is drawn from the adjusted distribution:

\[
p'(t) = \frac{\max\left(0,\ p_{\text{target}}(t) - p_{\text{draft}}(t)\right)}{\sum_{t'} \max\left(0,\ p_{\text{target}}(t') - p_{\text{draft}}(t')\right)}
\]

Where:

- \(p_{\text{target}}(t)\) — probability of token \(t\) under the target model
- \(p_{\text{draft}}(t)\) — probability of token \(t\) under the draft model
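For example, with \(p_{\text{draft}}(t) = 0.6\) and \(p_{\text{target}}(t) = 0.3\), the draft token is kept with probability 0.5. A minimal sketch of the rule for a single position (pure NumPy; `p_target` and `p_draft` are full vocabulary distributions):

```python
import numpy as np

def accept_or_resample(token, p_target, p_draft, rng=None):
    """Keep the draft token with prob min(1, p_target/p_draft); on rejection,
    resample from the renormalized residual max(0, p_target - p_draft)."""
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < min(1.0, p_target[token] / p_draft[token]):
        return token, True
    adjusted = np.maximum(p_target - p_draft, 0.0)
    return rng.choice(len(p_target), p=adjusted / adjusted.sum()), False
```

This accept/resample pair is exactly what makes the scheme lossless: marginalizing over the draft's proposals recovers the target distribution.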
Algorithm¶
# Pseudocode: softmax/sample/concat/normalize/maximum/finished are assumed helpers
def speculative_decode(draft_model, target_model, prompt, num_speculative=4):
    tokens = prompt
    while not finished(tokens):
        # 1. Draft model proposes N tokens autoregressively (cheap passes)
        draft_tokens = []
        draft_dists = []  # keep FULL distributions: needed for the adjusted resample
        current = tokens
        for _ in range(num_speculative):
            logits = draft_model(current)
            dist = softmax(logits[-1])
            token = sample(dist)
            draft_tokens.append(token)
            draft_dists.append(dist)
            current = concat(current, token)

        # 2. Target model scores every draft position in ONE pass
        target_logits = target_model(concat(tokens, draft_tokens))

        # 3. Acceptance sampling: keep each token with prob min(1, p_target/p_draft)
        accepted = 0
        for i, token in enumerate(draft_tokens):
            # logits at position len(tokens) + i - 1 predict draft token i
            target_dist = softmax(target_logits[len(tokens) + i - 1])
            if random() < min(1, target_dist[token] / draft_dists[i][token]):
                accepted += 1
            else:
                # Reject: resample from p'(t) ∝ max(0, p_target(t) - p_draft(t))
                corrected = normalize(maximum(target_dist - draft_dists[i], 0))
                new_token = sample(corrected)
                tokens = concat(tokens, draft_tokens[:accepted], new_token)
                break
        if accepted == num_speculative:
            # All accepted: take a free bonus token from the target's last position
            bonus = sample(softmax(target_logits[-1]))
            tokens = concat(tokens, draft_tokens, bonus)
    return tokens
Draft Model Selection¶
| Draft Size | Target Size | Acceptance Rate | Speedup |
|---|---|---|---|
| 125M | 7B | 60-70% | 1.5-2× |
| 350M | 13B | 55-65% | 1.4-1.8× |
| 1B | 70B | 45-55% | 1.3-1.6× |
Rule of thumb: the draft model should be roughly 1/10 the size of the target model.
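These numbers follow from the standard speedup model of Leviathan et al. (acceptance rate \(\alpha\), speculation depth \(\gamma\), draft/target cost ratio \(c\)); a quick sanity check:

```python
def expected_speedup(alpha, gamma, c):
    """Leviathan et al. (2022): E[tokens/pass] = (1 - alpha^(gamma+1)) / (1 - alpha),
    divided by the relative cost of one cycle: gamma draft passes + 1 target pass."""
    expected_tokens = (1 - alpha ** (gamma + 1)) / (1 - alpha)
    return expected_tokens / (gamma * c + 1)

# 125M draft for a 7B target: c ≈ 125/7000 ≈ 0.018, alpha ≈ 0.65 (from the table)
print(f"{expected_speedup(alpha=0.65, gamma=4, c=0.018):.2f}x")  # ≈ 2.36x
# Real deployments land lower (1.5-2x): the model ignores scheduling overheads.
```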
3. EAGLE-2 (June 2024)¶
Paper: "EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees"
Key Innovation: Feature-Level Drafting¶
Instead of a separate model, EAGLE uses a lightweight head that operates on the target model's hidden states:
graph LR
TM["Target Model<br/>Layers 0-N"] --> HS["Final Hidden State"]
HS --> EH["EAGLE Head"]
EH --> DT["Draft Tokens"]
TM --> LM["lm_head"]
style EH fill:#e8f5e9,stroke:#4caf50
style DT fill:#fff3e0,stroke:#ef6c00
Draft Tree¶
EAGLE-2 builds a tree of candidate tokens instead of a single linear sequence:
graph LR
INPUT["Input"] --> A["the"] --> A1["cat"] --> A2["sat"]
INPUT --> B["a"] --> B1["dog"] --> B2["ran"]
INPUT --> C["one"] --> C1["bird"] --> C2["flew"]
style A fill:#e8f5e9,stroke:#4caf50
style B fill:#e8eaf6,stroke:#3f51b5
style C fill:#fff3e0,stroke:#ef6c00
Advantages:

- Adaptive depth based on confidence
- Parallel verification via tree attention (see the sketch below)
- Instance-adaptive: more candidates for "easy" tokens
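Tree attention is what allows the whole candidate tree to be verified in one pass: each node may attend only to its own ancestors, never to sibling branches. A minimal sketch of building such a mask (assumed representation: each node stores its parent's index, −1 for a root):

```python
import torch

def tree_attention_mask(parents):
    """parents[i] = index of node i's parent in the draft tree (-1 for roots).
    mask[i, j] is True iff attention from node i to node j is allowed,
    i.e. j is i itself or one of its ancestors."""
    n = len(parents)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        j = i
        while j != -1:            # walk up to the root, marking the path
            mask[i, j] = True
            j = parents[j]
    return mask

# The three-branch tree from the diagram: nodes 0-2 are "the"/"a"/"one",
# nodes 3-5 their children, nodes 6-8 the leaves
parents = [-1, -1, -1, 0, 1, 2, 3, 4, 5]
print(tree_attention_mask(parents).int())
```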
Parameters¶
| Parameter | Description | Default |
|---|---|---|
| `speculative_num_steps` | Depth of drafting | 5 |
| `speculative_eagle_topk` | Branching factor | 4 |
| `speculative_num_draft_tokens` | Max tree size | 8 |
4. EAGLE-3 (March 2025)¶
Paper: "EAGLE-3: Efficient Speculative Decoding with Multi-Layer Features"
Key Innovation: Multi-Layer Feature Fusion¶
EAGLE-3 takes hidden states from three layers of the target model:
Target Model:
├── Low Layer  (L/4)  ──┐
├── Mid Layer  (L/2)  ──┼──▶ Fusion ──▶ EAGLE-3 Head ──▶ Draft Tree
├── High Layer (3L/4) ──┘
└── lm_head
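A rough sketch of the fusion step (layer choice, dimensions, and the linear fusion operator are assumptions for illustration, not the official EAGLE-3 code):

```python
import torch
import torch.nn as nn

class MultiLayerFusion(nn.Module):
    """Concatenate low/mid/high hidden states and project back to model width,
    producing the feature the draft head consumes (illustrative sketch)."""
    def __init__(self, hidden_size):
        super().__init__()
        self.proj = nn.Linear(3 * hidden_size, hidden_size)

    def forward(self, h_low, h_mid, h_high):
        # [batch, seq, 3*hidden] -> [batch, seq, hidden]
        return self.proj(torch.cat([h_low, h_mid, h_high], dim=-1))

fusion = MultiLayerFusion(hidden_size=4096)
h_low, h_mid, h_high = (torch.randn(1, 8, 4096) for _ in range(3))
features = fusion(h_low, h_mid, h_high)   # input to the EAGLE-3 head
```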
Training: Train-Time-Testing¶
EAGLE-3 uses a special training technique that simulates multi-step draft sampling:

For each position:
1. Generate the first draft token
2. Generate the second draft token conditioned on the first
3. Generate the third draft token conditioned on the first two
...
This relies on FlexAttention to compute the sparse attention mask efficiently.
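FlexAttention (PyTorch ≥ 2.5) lets such masks be declared as a predicate over (query, key) index pairs and compiled into a sparse kernel. An illustrative declaration (the real train-time-testing mask is more intricate, with one block of positions per simulated draft step, but it is declared the same way):

```python
from torch.nn.attention.flex_attention import create_block_mask

SEQ_LEN = 1024  # assumed context length for the example

def causal_mask(b, h, q_idx, kv_idx):
    # Each position may attend only to itself and the past; the EAGLE-3
    # training mask adds draft-step structure on top of this predicate.
    return q_idx >= kv_idx

# create_block_mask builds on a CUDA device by default
block_mask = create_block_mask(causal_mask, B=None, H=None,
                               Q_LEN=SEQ_LEN, KV_LEN=SEQ_LEN)
# Then passed as: flex_attention(q, k, v, block_mask=block_mask)
```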
Performance (SGLang, Llama-3.1-8B, MT-Bench)¶
| Method | Throughput (tokens/s) | Speedup |
|---|---|---|
| Standard | 158.34 | 1.0× |
| EAGLE-2 | 244.10 | 1.54× |
| EAGLE-3 | 373.25 | 2.36× |
SpecBundle (LMSYS, Dec 2025)¶
SpecBundle Phase 1 is a collection of production-ready EAGLE-3 models:
- Trained on: Perfect-Blend dataset (1.4M samples vs 320K in original)
- Models: Llama-3.1-8B, Llama-3.3-70B, Qwen3-32B, Qwen3-235B-MoE, Kimi-K2
- Speedup: Up to 4× end-to-end
5. MTP: Multi-Token Prediction (DeepSeek)¶
Architecture¶
DeepSeek V3/R1 ship with built-in multi-token prediction heads:
graph LR
M["Model"] --> LM["Main LM Head"] --> T1["Token t+1"]
M --> H1["MTP Head 1"] --> T2["Token t+2"]
M --> H2["MTP Head 2"] --> T3["Token t+3"]
M --> HN["MTP Head N"] --> TN["Token t+N"]
style LM fill:#e8eaf6,stroke:#3f51b5
style H1 fill:#e8f5e9,stroke:#4caf50
style H2 fill:#e8f5e9,stroke:#4caf50
style HN fill:#e8f5e9,stroke:#4caf50
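A schematic of the head layout (plain linear heads are a simplification; DeepSeek's actual MTP modules are small transformer blocks trained jointly with the trunk):

```python
import torch
import torch.nn as nn

class MTPHeads(nn.Module):
    """Illustrative sketch: N extra heads over the trunk's last hidden state,
    head k drafting the token at offset t+2+k (the main lm_head covers t+1)."""
    def __init__(self, hidden_size, vocab_size, num_heads=3):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_size, vocab_size) for _ in range(num_heads)]
        )

    def forward(self, hidden):                 # hidden: [batch, seq, hidden]
        last = hidden[:, -1, :]
        return [head(last) for head in self.heads]  # one draft distribution each

heads = MTPHeads(hidden_size=1024, vocab_size=32000)
draft_logits = heads(torch.randn(1, 16, 1024))   # 3 future tokens, one trunk pass
```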
Comparison: MTP vs EAGLE¶
| Aspect | MTP (DeepSeek) | EAGLE-3 |
|---|---|---|
| Draft source | Multiple prediction heads | Single feature-based head |
| Training | Jointly with model | Post-hoc |
| Overhead | Built-in | Separate model weights |
| Best for | Native MTP models | Any model |
SGLang MTP Support¶
python -m sglang.launch_server \
--model deepseek-ai/DeepSeek-V3 \
--speculative-algorithm MTP \
--speculative-num-steps 3
6. Другие методы¶
Medusa (Multi-Head Prediction)¶
Parallel draft heads attached to the target model, each predicting a token at a different future offset. The Hydra variant (Oct 2024) makes the heads sequentially dependent, improving token acceptance.
| Method | Speedup | Pros | Cons |
|---|---|---|---|
| Medusa | 2-3x | Multi-head prediction | Training multiple heads |
| PPD (EMNLP 2025) | 2x | Memory-efficient, <50% training time of Medusa | Newer, less tested |
| OPT-Tree | -- | Adaptive draft tree, dynamic candidates | Complexity |
| Batch SpecDec (arXiv:2510.22876) | 3x at BS=8 | Proper batch handling | CUDA-only |
Online Speculative Decoding¶
Periodically fine-tune the draft model on corrections from the target model, so that the draft adapts to the target's behavior over time. A sketch of one update step follows below.
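A sketch under stated assumptions (distilling the target's distribution into the draft on recently rejected positions; `replay_buffer` and its `sample` method are placeholders, not a real library API):

```python
import torch.nn.functional as F

def online_draft_update(draft_model, optimizer, replay_buffer, batch_size=16):
    """One illustrative distillation step: push the draft's next-token
    distribution toward the target's on positions where the draft was rejected."""
    input_ids, target_probs = replay_buffer.sample(batch_size)  # placeholder API
    draft_logits = draft_model(input_ids).logits[:, -1, :]
    loss = F.kl_div(F.log_softmax(draft_logits, dim=-1),
                    target_probs, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```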
AWS SageMaker (Production)¶
EAGLE-based adaptive speculative decoding, reporting a 2.5x speedup while maintaining quality. Red Hat OpenShift AI also supports EAGLE-3 with vLLM.
7. NGRAM Speculative Decoding¶
A training-free method that builds n-gram statistics from the context itself:
from collections import Counter

class NGramSpeculator:
    def __init__(self, n=4):
        self.n = n
        self.ngram_cache = {}  # (n-token tuple) -> Counter of next tokens

    def update(self, tokens):
        """Update n-gram statistics from the context"""
        for i in range(len(tokens) - self.n):
            key = tuple(tokens[i:i + self.n])
            next_token = tokens[i + self.n]
            if key not in self.ngram_cache:
                self.ngram_cache[key] = Counter()
            self.ngram_cache[key][next_token] += 1

    def speculate(self, tokens, k=4):
        """Greedily extend the context by up to k draft tokens using n-grams"""
        draft, context = [], list(tokens)
        for _ in range(k):
            key = tuple(context[-self.n:])
            if key not in self.ngram_cache:
                break
            next_token = self.ngram_cache[key].most_common(1)[0][0]
            draft.append(next_token)
            context.append(next_token)
        return draft
Pros:

- No extra model
- No training required
- Works with any LLM

Cons:

- CUDA-only
- Lower acceptance rate than model-based methods
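A self-contained demo of where n-gram speculation pays off -- repetitive text (code, logs, templated output), where the context literally predicts itself:

```python
# Toy repetitive sequence: the cache learns the cycle and extends it
tokens = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4]

spec = NGramSpeculator(n=4)
spec.update(tokens)
print(spec.speculate(tokens, k=3))   # -> [5, 1, 2]: continues the repetition
```

In a real engine these drafts would then go through the same single-pass target verification as any other speculation method.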
8. Training Draft Models¶
Speculators v0.3.0 (vLLM, Dec 2025)¶
Pipeline:
1. Data Generation:
├── Preprocess: chat template, tokenize, loss mask
├── Hidden States: extract from vLLM forward pass
└── Save: input_ids, hidden_states, loss_mask
2. Training:
├── Eagle3DraftModel initialization
├── Train-time-testing simulation
├── FlexAttention for sparse masks
└── Vocabulary mapping (target-to-draft)
3. Deployment:
├── speculators_config in model weights
└── Seamless vLLM integration
SpecForge v0.2 (SGLang, Dec 2025)¶
Features:

- Multi-backend support (SGLang, HuggingFace)
- 10× faster data regeneration
- Unified online/offline training
# SpecForge multi-backend
target_model = get_eagle3_target_model(
pretrained_model_name_or_path="meta-llama/Llama-3.1-8B-Instruct",
backend="sglang", # or "huggingface"
torch_dtype=torch.bfloat16,
)
9. Code Examples¶
vLLM with EAGLE-3¶
from vllm import LLM, SamplingParams
# Standard serving
llm = LLM(
model="Qwen/Qwen3-32B",
tensor_parallel_size=1,
)
# With speculative decoding
llm_spec = LLM(
model="Qwen/Qwen3-32B",
speculative_config={
"model": "RedHatAI/Qwen3-32B-speculator.eagle3",
"num_speculative_tokens": 3,
"method": "eagle3",
},
tensor_parallel_size=1,
)
# Benchmark
sampling_params = SamplingParams(temperature=0, max_tokens=256)
outputs = llm_spec.generate(prompts, sampling_params)
# ~1.8-2× faster than standard
SGLang with EAGLE-3¶
# Server launch
python -m sglang.launch_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--speculative-algorithm EAGLE3 \
--speculative-draft-model-path lmsys/sglang-EAGLE3-llama3.1-8B \
--speculative-num-steps 5 \
--speculative-eagle-topk 8 \
--speculative-num-draft-tokens 64
Python: Standalone Draft Model¶
import torch
import torch.nn.functional as F
class SpeculativeDecoder:
def __init__(self, target_model, draft_model, device="cuda"):
self.target = target_model
self.draft = draft_model
self.device = device
def generate(self, input_ids, max_tokens=100, num_speculative=4):
generated = input_ids.clone()
for _ in range(max_tokens):
# Draft phase
draft_tokens = []
draft_probs = []
current = generated
with torch.no_grad():
for _ in range(num_speculative):
logits = self.draft(current).logits[:, -1, :]
probs = F.softmax(logits, dim=-1)
token = torch.argmax(probs, dim=-1, keepdim=True)
draft_tokens.append(token)
draft_probs.append(probs.gather(-1, token))
current = torch.cat([current, token], dim=-1)
# Verify phase
draft_sequence = torch.cat(draft_tokens, dim=-1)
verify_input = torch.cat([generated, draft_sequence], dim=-1)
with torch.no_grad():
target_logits = self.target(verify_input).logits
# Acceptance
num_accepted = 0
for i, (token, draft_prob) in enumerate(zip(draft_tokens, draft_probs)):
pos = generated.shape[1] + i - 1
target_probs = F.softmax(target_logits[:, pos, :], dim=-1)
target_prob = target_probs.gather(-1, token)
if target_prob >= draft_prob:
num_accepted += 1
else:
# Rejection sampling
alpha = target_prob / (draft_prob + 1e-10)
if torch.rand(1, device=self.device) < alpha:
num_accepted += 1
else:
break
            # Append the accepted prefix
            if num_accepted > 0:
                generated = torch.cat([
                    generated,
                    torch.cat(draft_tokens[:num_accepted], dim=-1)
                ], dim=-1)
            if num_accepted == num_speculative:
                # All drafts accepted: free bonus token from the target's last position
                bonus = torch.argmax(target_logits[:, -1, :], dim=-1, keepdim=True)
                generated = torch.cat([generated, bonus], dim=-1)
            else:
                # Rejection: append the target's own prediction at the first
                # rejected position, so every iteration is guaranteed to progress
                correction = torch.argmax(
                    target_logits[:, generated.shape[1] - 1, :], dim=-1, keepdim=True
                )
                generated = torch.cat([generated, correction], dim=-1)
if generated.shape[1] >= input_ids.shape[1] + max_tokens:
break
return generated
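Usage sketch (the model pair is illustrative -- any small/large pair sharing a tokenizer works; assumes a CUDA device):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")   # gpt2 drafts for gpt2-large
draft = AutoModelForCausalLM.from_pretrained("gpt2").eval().cuda()
target = AutoModelForCausalLM.from_pretrained("gpt2-large").eval().cuda()

decoder = SpeculativeDecoder(target, draft)
input_ids = tok("Speculative decoding works because",
                return_tensors="pt").input_ids.cuda()
output = decoder.generate(input_ids, max_tokens=64, num_speculative=4)
print(tok.decode(output[0]))
```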
10. Production Benchmarks¶
Qwen3-32B with EAGLE-3 (RTX 6000, 96GB)¶
| Metric | Standard | Speculative | Speedup |
|---|---|---|---|
| Response tokens/s | 22.96 | 41.88 | 1.82× |
| Time to First Token | 25.62s | 15.98s | 1.60× |
| Prompt tokens/s | 3.05 | 5.22 | 1.71× |
| Acceptance Rate | N/A | 33.8% | — |
Llama-3.1-8B with EAGLE-3 (H100)¶
| Method | Throughput | Speedup |
|---|---|---|
| Standard | 158 tok/s | 1.0× |
| EAGLE-2 | 244 tok/s | 1.54× |
| EAGLE-3 | 373 tok/s | 2.36× |
SpecBundle (Multiple Models, SGLang)¶
| Model | SpecBundle Speedup |
|---|---|
| Llama-3.1-8B | 2.8× |
| Llama-3.3-70B | 2.2× |
| Qwen3-235B-MoE | 2.5× |
| Kimi-K2 | 2.1× |
11. Decision Framework¶
Method Selection¶
| Scenario | Recommended Method |
|---|---|
| Best speed/quality | EAGLE-3 (SGLang/vLLM) |
| Broad compatibility | EAGLE-2 |
| Native MTP model | MTP (DeepSeek V3/R1) |
| No extra model | NGRAM (training-free) |
| Have smaller LLM | STANDALONE draft-target |
| Maximum throughput | EAGLE-3 + torch.compile |
Engine Selection¶
| Engine | EAGLE-3 | MTP | NGRAM | STANDALONE |
|---|---|---|---|---|
| vLLM | ✅ | ✅ | ✅ | ✅ |
| SGLang | ✅ | ✅ | ✅ | ✅ |
| TensorRT-LLM | ✅ | ❌ | ❌ | ✅ |
| llama.cpp | ❌ | ❌ | ❌ | ✅ |
Performance Tuning¶
- Start with defaults: `num_speculative_tokens=3-5`
- Tune for your workload:
  - Short outputs: increase `num_speculative_tokens`
  - Long outputs: decrease it to save memory
- Monitor the acceptance rate:
  - <50%: draft model too weak -- increase its size
  - >80%: speculation depth can be increased
- VRAM constraints:
  - EAGLE: +3-5% VRAM
  - STANDALONE draft: +10-20% VRAM
12. Interview Questions¶
Basic¶
- "What is speculative decoding and why does it speed up inference?"
- "Explain the draft-target approach"
- "What is acceptance rate in speculative decoding?"
Advanced¶
- "Compare EAGLE-3 vs classic draft-target approach"
- "How does train-time-testing work in EAGLE-3?"
- "Explain why speculative decoding is lossless"
- "What factors affect acceptance rate?"
System Design¶
- "Design an LLM serving system with speculative decoding for a chatbot"
- "When would you choose NGRAM over EAGLE-3?"
- "How to train a custom EAGLE-3 draft model?"
- "When does speculative decoding NOT help?"
Q: "When does speculative decoding NOT help?"
Three cases: 1) High temperature sampling reduces acceptance rate. 2) Very diverse outputs need too many candidates. 3) Small batch sizes where verification overhead dominates. Batch SpecDec (2026) helps with case 3. Also: domain mismatch between draft and target reduces acceptance rate.
13. Formulas Summary¶
Acceptance Probability¶
\[
P(\text{accept } t) = \min\left(1,\ \frac{p_{\text{target}}(t)}{p_{\text{draft}}(t)}\right)
\]
Expected Tokens Per Pass¶
With per-token acceptance rate \(\alpha\) and speculation depth \(\gamma\):
\[
E[\text{tokens}] = \frac{1 - \alpha^{\gamma+1}}{1 - \alpha}
\]
Speedup Formula¶
With \(c\) the draft-to-target cost ratio (Leviathan et al., 2022):
\[
\text{Speedup} = \frac{1 - \alpha^{\gamma+1}}{(1 - \alpha)(\gamma c + 1)}
\]
Memory Overhead¶
\[
\text{VRAM}_{\text{extra}} = \text{VRAM}_{\text{draft weights}} + \text{VRAM}_{\text{draft KV cache}} \approx 3\text{--}10\%\ \text{of the target}
\]
14. Sources & Further Reading¶
Papers¶
- Speculative Sampling: Chen et al. (Google DeepMind), 2023 — arXiv:2302.01318
- Fast Inference from Transformers via Speculative Decoding: Leviathan, Kalman, Matias (Google), 2022 — arXiv:2211.17192
- EAGLE: Li et al., 2024 — arXiv:2401.15077
- EAGLE-2: Li et al., 2024 — arXiv:2406.16858
- EAGLE-3: Zhang et al., 2025 — arXiv:2503.01840
- DART: Liu et al., 2026 — arXiv:2601.19278
- Hydra (Medusa variant): arXiv:2402.05109
- Batch Speculative Decoding: arXiv:2510.22876 (Oct 2025)
- PPD (Parallel Prompt Decoding): EMNLP 2025
- Online Speculative Decoding: arXiv:2310.07177
Documentation¶
- SGLang Speculative Decoding: docs.sglang.io/advanced_features/speculative_decoding.html
- vLLM Speculative Decoding: docs.vllm.ai/en/latest/features/spec_decode/
- NVIDIA TensorRT-LLM: github.com/NVIDIA/TensorRT-LLM
Production¶
- SpecBundle: huggingface.co/collections/lmsys/specbundle
- SpecForge: github.com/sgl-project/SpecForge
- Speculators (vLLM): github.com/vllm-project/speculators
Interview Questions¶
Conceptual:

- "How does speculative decoding speed up inference without losing quality?" -- The draft model (small, fast) generates K candidate tokens. The target model (large) verifies all K in a single forward pass, in parallel. Accepted tokens are identical to what the target would have generated on its own. Speedup ≈ K × acceptance_rate, typically 2-3x.
- "Why is speculative decoding mathematically lossless?" -- Rejection sampling: if a draft token does not match the target model's distribution, it is rejected and replaced by a sample from the adjusted distribution. The final output is equivalent to autoregressive generation by the target model.
- "Separate draft model vs self-drafting: what are the trade-offs?" -- A separate draft model means keeping two models in memory, but the acceptance rate is higher with a well-matched pair. Self-drafting (Medusa, EAGLE) adds prediction heads on top of the target model, with little extra memory, but the heads must be fine-tuned.

System Design:

- "When does speculative decoding NOT give a speedup?" -- When the acceptance rate is below 50% (the draft diverges too much from the target), when the batch size exceeds ~8 (the GPU is already compute-bound, so memory bandwidth is no longer the bottleneck), or on creative tasks (high temperature lowers the acceptance rate).

Common Mistakes

"Speculative decoding always gives a 2-3x speedup" -- The speedup depends on the acceptance rate, which in turn depends on draft/target similarity and the task. On creative writing (temperature 1.0), acceptance drops to 30-40% and speedup falls below 1.5x. On code or factual tasks (temperature 0), acceptance reaches 70-80% and speedup hits 2.5-3x.

"Any small model will do as a draft" -- The draft model must be trained on the same data or fine-tuned for alignment with the target. A random small model yields an acceptance rate below 20%, which is slower than no speculation at all due to verification overhead.

"Speculative decoding and batching are the same kind of speedup" -- Batching improves throughput (more requests in parallel). Speculative decoding improves the latency of a single request. They are complementary, but at large batch sizes speculative decoding loses effectiveness (the GPU is already compute-bound).
See Also¶
- Inference Engines -- vLLM, SGLang, and TensorRT-LLM all support speculative decoding
- KV Cache Optimization -- KV cache management for the draft + target pair
- Inference Optimization -- an overview of all inference acceleration techniques
- Quantization -- a complementary technique: quantize + speculate
- Knowledge Distillation -- distilling draft models for a higher acceptance rate