
Speculative Decoding

~9 min read

Prerequisites: KV Cache, vLLM and PagedAttention


Why This Matters

An LLM generates tokens one at a time: 1 forward pass = 1 token. During decoding the GPU is only ~30% utilized -- the bottleneck is not compute but memory bandwidth (loading the weights from VRAM). Compute sits idle.

Analogy: imagine an editor who proofreads text one word at a time. An assistant (the draft model) proposes 5 words at once, and the editor checks them in a single glance. If 4 out of 5 are correct, throughput grows 4x. If all 5 are wrong, only one glance is wasted and the assistant tries again.

Key property: the output is identical to standard decoding. This is not an approximation -- the target model always verifies the draft. Speedup of 1.5-4x with no quality loss.


Overview

Speculative Decoding is a technique for accelerating LLM inference: a lightweight draft model proposes several tokens, and the target model verifies them in a single forward pass.

Key Innovation

\[ \text{Speedup} \approx N_{\text{tokens per pass}} \times \text{Acceptance Rate} \]

Speculative decoding is LOSSLESS acceleration

Unlike quantization or pruning, speculative decoding produces exactly the same output as standard autoregressive decoding. The target model always verifies the draft -- if a draft token is wrong, it is rejected and resampled. Speedup of 1.5-4x with no quality loss.

Instead of 1 token per forward pass (memory-bound), you get N tokens per forward pass (compute-bound).

Comparison: Standard vs Speculative Decoding

Aspect Standard Autoregressive Speculative Decoding
Tokens per pass 1 2-8 (avg)
Compute utilization Low (~30%, idle) High
Latency reduction Baseline 1.5-4×
Output quality Exact Exact (lossless)
Extra VRAM None +3-10%

1. Why Speculative Decoding Exists

Problem: Memory-Bound Inference

LLM inference is limited by memory bandwidth, not by compute:

GPU Compute: 1000 TFLOPS
Memory BW: 3 TB/s
Effective utilization: ~30% (most compute sits idle)

Each token requires: 1. Loading the model weights from VRAM 2. One forward pass 3. Memory synchronization
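
A back-of-the-envelope check of the memory-bound claim -- a sketch using the illustrative hardware numbers above and a hypothetical 70B fp16 model:

# Rough per-token cost floors for a hypothetical 70B-parameter fp16 model,
# using the illustrative numbers above (1000 TFLOPS, 3 TB/s).
params = 70e9                 # model parameters (hypothetical)
weight_bytes = params * 2     # fp16/bf16: 2 bytes per parameter (~140 GB)
mem_bw = 3e12                 # bytes/s
compute = 1000e12             # FLOP/s

t_memory = weight_bytes / mem_bw     # time just to stream the weights once
t_compute = 2 * params / compute     # ~2 FLOPs per parameter per token

print(f"memory floor: {t_memory * 1e3:.1f} ms/token")    # ~47 ms
print(f"compute floor: {t_compute * 1e3:.2f} ms/token")   # ~0.14 ms

Streaming the weights dominates by orders of magnitude, which is why the GPU's arithmetic units mostly sit idle during decoding.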

Solution: Parallel Verification

Speculative decoding uses the idle compute for parallel verification:

# Standard: N passes for N tokens
for _ in range(N):
    token = model(input)  # N expensive passes
    input = concat(input, token)

# Speculative: 1 pass for N tokens (if accepted)
draft_tokens = draft_model.generate(input, n=N)  # Fast, cheap
verified_tokens = target_model.verify(input, draft_tokens)  # Single pass

Key insight: the target model can verify multiple tokens in a single forward pass thanks to: 1. KV Cache for the prefix 2. Parallel attention operations 3. Batch processing on the GPU
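
A minimal sketch of why one pass is enough: the target's forward over prefix + drafts produces a distribution at every position, and the slice below picks out exactly the positions that score the N draft tokens (hypothetical lengths, with a random tensor standing in for real logits):

import torch

prefix_len, num_draft, vocab = 6, 4, 32000
# Stand-in for target_model(prefix + draft_tokens).logits
logits = torch.randn(1, prefix_len + num_draft, vocab)

# Logits at position p predict the token at position p + 1, so the N draft
# tokens are scored by positions prefix_len - 1 ... prefix_len + num_draft - 2.
verify_logits = logits[:, prefix_len - 1 : prefix_len + num_draft - 1, :]
print(verify_logits.shape)  # (1, 4, 32000): one distribution per draft token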


2. Draft-Target Approach (Classic)

Architecture

graph LR
    DRAFT["Draft Model<br/>(small, fast)"] -->|"Draft N tokens"| INPUT["Input +<br/>Draft Tokens"]
    INPUT --> TARGET["Target Model<br/>(large)"]
    TARGET -->|"Verify in 1 pass"| ACC["Accepted Tokens"]
    style DRAFT fill:#e8f5e9,stroke:#4caf50
    style TARGET fill:#e8eaf6,stroke:#3f51b5
    style ACC fill:#fff3e0,stroke:#ef6c00

Rejection Sampling

Acceptance Logic:

\[ P(\text{accept } t) = \min\left(1, \frac{p_{\text{target}}(t)}{p_{\text{draft}}(t)}\right) \]

If the draft token is rejected, a replacement is sampled from the adjusted distribution \( \operatorname{norm}\bigl(\max(0,\ p_{\text{target}} - p_{\text{draft}})\bigr) \).

Where: - \(p_{\text{target}}(t)\) is the token probability under the target model - \(p_{\text{draft}}(t)\) is the token probability under the draft model

Algorithm

def speculative_decode(draft_model, target_model, prompt, num_speculative=4):
    tokens = prompt

    while not finished:
        # 1. Draft model generates N tokens
        draft_tokens = []
        draft_dists = []  # full draft distributions, needed for corrected resampling
        current = tokens
        for _ in range(num_speculative):
            logits = draft_model(current)
            prob = softmax(logits[-1])
            token = sample(prob)
            draft_tokens.append(token)
            draft_dists.append(prob)
            current = concat(current, token)

        # 2. Target model verifies all in ONE pass
        target_logits = target_model(concat(tokens, draft_tokens))

        # 3. Acceptance sampling
        accepted = 0
        for i, (draft_token, draft_dist) in enumerate(zip(draft_tokens, draft_dists)):
            target_dist = softmax(target_logits[len(tokens) + i - 1])
            target_prob = target_dist[draft_token]
            draft_prob = draft_dist[draft_token]

            if target_prob >= draft_prob:
                # Accept
                accepted += 1
            else:
                # Rejection sampling
                alpha = target_prob / draft_prob
                if random() < alpha:
                    accepted += 1
                else:
                    # Resample from the adjusted distribution norm(max(0, p_target - p_draft))
                    corrected_dist = normalize(maximum(0, target_dist - draft_dist))
                    new_token = sample(corrected_dist)
                    tokens = concat(tokens, draft_tokens[:accepted], new_token)
                    break

        if accepted == num_speculative:
            # Bonus token from target
            bonus = sample(softmax(target_logits[-1]))
            tokens = concat(tokens, draft_tokens, bonus)

    return tokens

Draft Model Selection

Draft Size Target Size Acceptance Rate Speedup
125M 7B 60-70% 1.5-2×
350M 13B 55-65% 1.4-1.8×
1B 70B 45-55% 1.3-1.6×

Rule of thumb: the draft model should be ≈ 1/10 the size of the target model.
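
For example, the 1B/70B pairing from the table could be served as a classic draft-target setup. A minimal sketch, assuming a recent vLLM that accepts speculative_config as a dict; the key names and the specific model choices are illustrative and may differ between versions:

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",        # target
    speculative_config={
        "model": "meta-llama/Llama-3.2-1B-Instruct",  # draft from the same family (shared tokenizer)
        "num_speculative_tokens": 5,
    },
    tensor_parallel_size=4,
)

out = llm.generate(["Speculative decoding works because"],
                   SamplingParams(temperature=0, max_tokens=64))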


3. EAGLE-2 (June 2024)

Paper: "EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees"

Key Innovation: Feature-Level Drafting

Instead of a separate model, EAGLE uses a lightweight head that operates on the target model's hidden states:

graph LR
    TM["Target Model<br/>Layers 0-N"] --> HS["Final Hidden State"]
    HS --> EH["EAGLE Head"]
    EH --> DT["Draft Tokens"]
    TM --> LM["lm_head"]
    style EH fill:#e8f5e9,stroke:#4caf50
    style DT fill:#fff3e0,stroke:#ef6c00

Draft Tree

EAGLE-2 builds a tree of candidate tokens instead of a linear sequence:

graph LR
    INPUT["Input"] --> A["the"] --> A1["cat"] --> A2["sat"]
    INPUT --> B["a"] --> B1["dog"] --> B2["ran"]
    INPUT --> C["one"] --> C1["bird"] --> C2["flew"]
    style A fill:#e8f5e9,stroke:#4caf50
    style B fill:#e8eaf6,stroke:#3f51b5
    style C fill:#fff3e0,stroke:#ef6c00

Advantages: - Adaptive depth based on confidence - Parallel verification via tree attention (see the sketch below) - Instance-adaptive: more candidates for "easy" tokens
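
A minimal sketch of how tree attention lets the whole candidate tree be verified in one pass: each node may attend only to itself and its ancestors. This is an illustrative mask builder, not EAGLE-2's actual implementation; the node order and parent array correspond to the small tree in the diagram above.

import torch

def tree_attention_mask(parent):
    """parent[i] = index of node i's parent in the flattened tree, -1 for roots."""
    n = len(parent)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        mask[i, i] = True          # each node attends to itself...
        j = parent[i]
        while j != -1:             # ...and to all of its ancestors
            mask[i, j] = True
            j = parent[j]
    return mask

# Flattened tree from the diagram: the->cat->sat, a->dog->ran, one->bird->flew
parent = [-1, 0, 1, -1, 3, 4, -1, 6, 7]
print(tree_attention_mask(parent).int())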

Parameters

Parameter Description Default
speculative_num_steps Depth of drafting 5
speculative_eagle_topk Branching factor 4
speculative_num_draft_tokens Max tree size 8

4. EAGLE-3 (March 2025)

Paper: "EAGLE-3: Efficient Speculative Decoding with Multi-Layer Features"

Key Innovation: Multi-Layer Feature Fusion

EAGLE-3 takes hidden states from three layers of the target model:

Target Model:
├── Low Layer (L/4)    ──┐
├── Mid Layer (L/2)    ──┼──▶ Fusion ──▶ EAGLE-3 Head ──▶ Draft Tree
├── High Layer (3L/4)  ──┘
└── lm_head

Training: Train-Time-Testing

EAGLE-3 uses a special training technique that simulates multi-step draft sampling:

For each position:
  1. Generate first draft token (blue)
  2. Generate second draft token (yellow) conditioned on first
  3. Generate third draft token (green) conditioned on first two
  ...

This relies on FlexAttention to compute the sparse attention mask efficiently.

Performance (SGLang, Llama-3.1-8B, MT-Bench)

Method Throughput (tokens/s) Speedup
Standard 158.34 1.0×
EAGLE-2 244.10 1.54×
EAGLE-3 373.25 2.36×

SpecBundle (LMSYS, Dec 2025)

SpecBundle Phase 1 is a collection of production-ready EAGLE-3 models:

  • Trained on: Perfect-Blend dataset (1.4M samples vs 320K in original)
  • Models: Llama-3.1-8B, Llama-3.3-70B, Qwen3-32B, Qwen3-235B-MoE, Kimi-K2
  • Speedup: up to 2.8× end-to-end (see Production Benchmarks below)

5. MTP: Multi-Token Prediction (DeepSeek)

Architecture

DeepSeek V3/R1 use built-in multi-token prediction heads:

graph LR
    M["Model"] --> LM["Main LM Head"] --> T1["Token t+1"]
    M --> H1["MTP Head 1"] --> T2["Token t+2"]
    M --> H2["MTP Head 2"] --> T3["Token t+3"]
    M --> HN["MTP Head N"] --> TN["Token t+N"]
    style LM fill:#e8eaf6,stroke:#3f51b5
    style H1 fill:#e8f5e9,stroke:#4caf50
    style H2 fill:#e8f5e9,stroke:#4caf50
    style HN fill:#e8f5e9,stroke:#4caf50
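
A toy sketch of the idea (not DeepSeek's actual MTP module, which uses full transformer blocks rather than linear heads): extra heads read the shared hidden state and each predicts one additional future token.

import torch
import torch.nn as nn

class ToyMTPHeads(nn.Module):
    def __init__(self, d_model=64, vocab_size=1000, num_extra=2):
        super().__init__()
        # One extra head per additional future position (t+2, t+3, ...)
        self.heads = nn.ModuleList(nn.Linear(d_model, vocab_size) for _ in range(num_extra))

    def forward(self, hidden):            # hidden: [batch, d_model] at position t
        return [head(hidden) for head in self.heads]

heads = ToyMTPHeads()
extra_logits = heads(torch.randn(1, 64))
print([x.shape for x in extra_logits])    # [(1, 1000), (1, 1000)] -> tokens t+2, t+3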

Comparison: MTP vs EAGLE

Aspect MTP (DeepSeek) EAGLE-3
Draft source Multiple prediction heads Single feature-based head
Training Jointly with model Post-hoc
Overhead Built-in Separate model weights
Best for Native MTP models Any model

SGLang MTP Support

python -m sglang.launch_server \
  --model deepseek-ai/DeepSeek-V3 \
  --speculative-algorithm MTP \
  --speculative-num-steps 3

6. Other Methods

Medusa (Multi-Head Prediction)

Extra draft heads attached to the target model, each predicting a token at a later offset. The Hydra variant (2024) makes the heads sequentially dependent, improving token acceptance via prediction-confidence correlation.

Method Speedup Advantages Drawbacks
Medusa 2-3x Multi-head prediction Training multiple heads
PPD (EMNLP 2025) 2x Memory-efficient, <50% training time of Medusa Newer, less tested
OPT-Tree -- Adaptive draft tree, dynamic candidates Complexity
Batch SpecDec (arXiv:2510.22876) 3x at BS=8 Proper batch handling CUDA-only

Online Speculative Decoding

Periodically fine-tune the draft model on corrections from the target model. The draft adapts to the target's behavior over time.

AWS SageMaker (Production)

EAGLE-based adaptive speculative decoding. 2.5x speedup while maintaining quality. Red Hat OpenShift AI also supports EAGLE-3 with vLLM.


7. NGRAM Speculative Decoding

A training-free method that uses n-gram statistics from the context:

from collections import Counter

class NGramSpeculator:
    def __init__(self, n=4):
        self.n = n
        self.ngram_cache = {}  # (token_seq) -> Counter of next-token counts

    def update(self, tokens):
        """Update n-gram statistics from the context"""
        for i in range(len(tokens) - self.n):
            key = tuple(tokens[i:i + self.n])
            next_token = tokens[i + self.n]
            if key not in self.ngram_cache:
                self.ngram_cache[key] = Counter()
            self.ngram_cache[key][next_token] += 1

    def speculate(self, tokens, k=4):
        """Propose up to k candidate next tokens based on the last n-gram"""
        key = tuple(tokens[-self.n:])
        if key in self.ngram_cache:
            return [tok for tok, _ in self.ngram_cache[key].most_common(k)]
        return []
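
Hypothetical usage: warm the cache from the current context, then ask for candidates for the next position (the target model still has to verify them):

spec = NGramSpeculator(n=2)
context = [5, 7, 9, 5, 7, 9, 5, 7]   # toy token ids with a repeating pattern
spec.update(context)
print(spec.speculate(context, k=4))   # [9] -- "5, 7" was always followed by 9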

Advantages: - No extra model - No training required - Works with any LLM

Drawbacks: - CUDA-only - Lower acceptance rate than model-based methods


8. Training Draft Models

Speculators v0.3.0 (vLLM, Dec 2025)

Pipeline:

1. Data Generation:
   ├── Preprocess: chat template, tokenize, loss mask
   ├── Hidden States: extract from vLLM forward pass
   └── Save: input_ids, hidden_states, loss_mask

2. Training:
   ├── Eagle3DraftModel initialization
   ├── Train-time-testing simulation
   ├── FlexAttention for sparse masks
   └── Vocabulary mapping (target-to-draft)

3. Deployment:
   ├── speculators_config in model weights
   └── Seamless vLLM integration

SpecForge v0.2 (SGLang, Dec 2025)

Features: - Multi-backend support (SGLang, HuggingFace) - 10× faster data regeneration - Unified online/offline training

# SpecForge multi-backend
import torch

target_model = get_eagle3_target_model(
    pretrained_model_name_or_path="meta-llama/Llama-3.1-8B-Instruct",
    backend="sglang",  # or "huggingface"
    torch_dtype=torch.bfloat16,
)

9. Code Examples

vLLM with EAGLE-3

from vllm import LLM, SamplingParams

# Standard serving
llm = LLM(
    model="Qwen/Qwen3-32B",
    tensor_parallel_size=1,
)

# With speculative decoding
llm_spec = LLM(
    model="Qwen/Qwen3-32B",
    speculative_config={
        "model": "RedHatAI/Qwen3-32B-speculator.eagle3",
        "num_speculative_tokens": 3,
        "method": "eagle3",
    },
    tensor_parallel_size=1,
)

# Benchmark
prompts = ["Explain speculative decoding in one paragraph."]
sampling_params = SamplingParams(temperature=0, max_tokens=256)

outputs = llm_spec.generate(prompts, sampling_params)
# ~1.8-2× faster than standard

SGLang with EAGLE-3

# Server launch
python -m sglang.launch_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path lmsys/sglang-EAGLE3-llama3.1-8B \
  --speculative-num-steps 5 \
  --speculative-eagle-topk 8 \
  --speculative-num-draft-tokens 64

Python: Standalone Draft Model

import torch
import torch.nn.functional as F

class SpeculativeDecoder:
    def __init__(self, target_model, draft_model, device="cuda"):
        self.target = target_model
        self.draft = draft_model
        self.device = device

    def generate(self, input_ids, max_tokens=100, num_speculative=4):
        # Illustrative implementation, assumes batch size 1
        generated = input_ids.clone()

        while generated.shape[1] < input_ids.shape[1] + max_tokens:
            # Draft phase: sample N tokens from the draft model
            draft_tokens = []
            draft_dists = []  # full draft distributions, needed for corrected resampling
            current = generated

            with torch.no_grad():
                for _ in range(num_speculative):
                    logits = self.draft(current).logits[:, -1, :]
                    probs = F.softmax(logits, dim=-1)
                    token = torch.multinomial(probs, num_samples=1)
                    draft_tokens.append(token)
                    draft_dists.append(probs)
                    current = torch.cat([current, token], dim=-1)

            # Verify phase: one target forward pass over prefix + all drafts
            draft_sequence = torch.cat(draft_tokens, dim=-1)
            verify_input = torch.cat([generated, draft_sequence], dim=-1)

            with torch.no_grad():
                target_logits = self.target(verify_input).logits

            # Acceptance: keep a draft token with probability min(1, p_target / p_draft)
            num_accepted = 0
            corrected = None
            for i, (token, draft_dist) in enumerate(zip(draft_tokens, draft_dists)):
                pos = generated.shape[1] + i - 1  # logits at pos score draft token i
                target_dist = F.softmax(target_logits[:, pos, :], dim=-1)
                target_prob = target_dist.gather(-1, token)
                draft_prob = draft_dist.gather(-1, token)

                if target_prob >= draft_prob:
                    num_accepted += 1
                else:
                    # Rejection sampling
                    alpha = target_prob / (draft_prob + 1e-10)
                    if torch.rand(1, device=self.device) < alpha:
                        num_accepted += 1
                    else:
                        # Resample from the adjusted distribution max(0, p_target - p_draft)
                        adjusted = torch.clamp(target_dist - draft_dist, min=0.0)
                        adjusted = adjusted / adjusted.sum(dim=-1, keepdim=True)
                        corrected = torch.multinomial(adjusted, num_samples=1)
                        break

            # Update generated sequence
            if num_accepted > 0:
                generated = torch.cat([generated] + draft_tokens[:num_accepted], dim=-1)

            if corrected is not None:
                generated = torch.cat([generated, corrected], dim=-1)
            elif num_accepted == num_speculative:
                # All drafts accepted: bonus token from the target's last position
                bonus_dist = F.softmax(target_logits[:, -1, :], dim=-1)
                bonus = torch.multinomial(bonus_dist, num_samples=1)
                generated = torch.cat([generated, bonus], dim=-1)

        return generated
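
Hypothetical usage with HuggingFace models (the draft and target must share a tokenizer; the specific model choices are illustrative):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
target = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", torch_dtype=torch.bfloat16, device_map="cuda")
draft = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct", torch_dtype=torch.bfloat16, device_map="cuda")

decoder = SpeculativeDecoder(target, draft)
input_ids = tok("The capital of France is", return_tensors="pt").input_ids.cuda()
print(tok.decode(decoder.generate(input_ids, max_tokens=32)[0]))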

10. Production Benchmarks

Qwen3-32B with EAGLE-3 (RTX 6000, 96GB)

Metric Standard Speculative Speedup
Response tokens/s 22.96 41.88 1.82×
Time to First Token 25.62s 15.98s 1.60×
Prompt tokens/s 3.05 5.22 1.71×
Acceptance Rate N/A 33.8%

Llama-3.1-8B with EAGLE-3 (H100)

Method Throughput Speedup
Standard 158 tok/s 1.0×
EAGLE-2 244 tok/s 1.54×
EAGLE-3 373 tok/s 2.36×

SpecBundle (Multiple Models, SGLang)

Model SpecBundle Speedup
Llama-3.1-8B 2.8×
Llama-3.3-70B 2.2×
Qwen3-235B-MoE 2.5×
Kimi-K2 2.1×

11. Decision Framework

Method Selection

Scenario Recommended Method
Best speed/quality EAGLE-3 (SGLang/vLLM)
Broad compatibility EAGLE-2
Native MTP model MTP (DeepSeek V3/R1)
No extra model NGRAM (training-free)
Have smaller LLM STANDALONE draft-target
Maximum throughput EAGLE-3 + torch.compile

Engine Selection

Candidate engines: vLLM, SGLang, TensorRT-LLM, llama.cpp. Support for EAGLE-3, MTP, NGRAM, and STANDALONE draft models varies by engine and version; check each engine's speculative decoding documentation.

Performance Tuning

  1. Start with defaults: num_speculative_tokens=3-5
  2. Tune for your workload:
     • Short outputs: increase num_speculative_tokens
     • Long outputs: decrease for better memory behavior
  3. Monitor acceptance rate:
     • <50%: draft model too weak, increase its size
     • >80%: you can increase speculation depth
  4. Mind VRAM constraints:
     • EAGLE: +3-5% VRAM
     • STANDALONE draft: +10-20% VRAM

12. Interview Questions

Basic

  1. "What is speculative decoding and why does it speed up inference?"
  2. "Explain the draft-target approach"
  3. "What is acceptance rate in speculative decoding?"

Advanced

  1. "Compare EAGLE-3 vs classic draft-target approach"
  2. "How does train-time-testing work in EAGLE-3?"
  3. "Explain why speculative decoding is lossless"
  4. "What factors affect acceptance rate?"

System Design

  1. "Design an LLM serving system with speculative decoding for a chatbot"
  2. "When would you choose NGRAM over EAGLE-3?"
  3. "How to train a custom EAGLE-3 draft model?"
  4. "When does speculative decoding NOT help?"

Q: "When does speculative decoding NOT help?"

Three cases: 1) High-temperature sampling reduces the acceptance rate. 2) Very diverse outputs need too many candidates. 3) Large batch sizes, where the GPU is already compute-bound and the extra verification work competes with other requests. Batch SpecDec (arXiv:2510.22876) helps with case 3. Also: domain mismatch between draft and target reduces the acceptance rate.


13. Formulas Summary

Acceptance Probability

\[P(\text{accept}) = \min\left(1, \frac{p_{\text{target}}(t)}{p_{\text{draft}}(t)}\right)\]

Expected Tokens Per Pass

\[\mathbb{E}[\text{tokens}] = 1 + \sum_{i=1}^{n} P(\text{accept}_1 \cap ... \cap \text{accept}_i)\]

Speedup Formula

\[\text{Speedup} \approx \frac{\mathbb{E}[\text{accepted tokens}] + 1}{1 + \frac{\text{draft cost}}{\text{target cost}}}\]
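
A quick worked example of the two formulas above -- a sketch with an assumed per-step acceptance rate of 0.7, speculation depth 4, and a draft whose per-token cost is 5% of a target pass:

alpha = 0.7                   # assumed per-step acceptance probability
n = 4                         # speculative tokens per round
draft_cost_ratio = 0.05 * n   # total draft cost relative to one target pass

expected_tokens = 1 + sum(alpha ** i for i in range(1, n + 1))  # E[tokens] ≈ 2.77
speedup = expected_tokens / (1 + draft_cost_ratio)              # ≈ 2.3x

print(f"expected tokens per target pass: {expected_tokens:.2f}")
print(f"approximate speedup: {speedup:.2f}x")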

Memory Overhead

\[\text{VRAM}_{\text{extra}} = \text{Draft params} + \text{KV cache for draft}\]
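
For instance, a hypothetical 1B-parameter bf16 draft model adds roughly the following (the architecture shapes and context length below are assumptions for illustration):

params = 1e9
weights_gb = params * 2 / 1e9                      # ~2 GB of bf16 weights

# Draft KV cache: 2 (K and V) * layers * kv_heads * head_dim * 2 bytes * seq * batch
layers, kv_heads, head_dim, seq, batch = 16, 8, 64, 8192, 1
kv_gb = 2 * layers * kv_heads * head_dim * 2 * seq * batch / 1e9   # ~0.27 GB

print(f"extra VRAM ≈ {weights_gb + kv_gb:.1f} GB")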

14. Sources & Further Reading

Papers

  1. Speculative Sampling: Chen et al. (Google DeepMind), 2023 — arXiv:2302.01318
  1b. Fast Inference from Transformers via Speculative Decoding: Leviathan, Kalman, Matias (Google), 2022 — arXiv:2211.17192
  2. EAGLE: Li et al., 2024 — arXiv:2401.15077
  3. EAGLE-2: Li et al., 2024 — arXiv:2406.16858
  4. EAGLE-3: Zhang et al., 2025 — arXiv:2503.01840
  5. DART: Liu et al., 2026 — arXiv:2601.19278
  6. Hydra (Medusa variant): arXiv:2402.05109
  7. Batch Speculative Decoding: arXiv:2510.22876 (Oct 2025)
  8. PPD (Parallel Prompt Decoding): EMNLP 2025
  9. Online Speculative Decoding: arXiv:2310.07177

Documentation

  1. SGLang Speculative Decoding: docs.sglang.io/advanced_features/speculative_decoding.html
  2. vLLM Speculative Decoding: docs.vllm.ai/en/latest/features/spec_decode/
  3. NVIDIA TensorRT-LLM: github.com/NVIDIA/TensorRT-LLM

Production

  1. SpecBundle: huggingface.co/collections/lmsys/specbundle
  2. SpecForge: github.com/sgl-project/SpecForge
  3. Speculators (vLLM): github.com/vllm-project/speculators

Interview Questions

Conceptual:

  1. "Как speculative decoding ускоряет inference без потери качества?" -- Draft model (маленькая, быстрая) генерирует K токенов-кандидатов. Target model (большая) верифицирует все K за один forward pass (параллельно). Принятые токены идентичны тем, что target сгенерировала бы сама. Speedup = K * acceptance_rate, обычно 2-3x.
  2. "Почему speculative decoding математически lossless?" -- Rejection sampling: если draft token не совпадает с distribution target модели, он отвергается и заменяется sample из скорректированного distribution. Финальный output эквивалентен авторегрессивной генерации target модели.
  3. "Draft model vs self-drafting: trade-offs?" -- Отдельная draft model: нужно хранить две модели в памяти, но acceptance rate выше при хорошем выборе. Self-drafting (Medusa, EAGLE): дополнительные prediction heads на target model, нет extra memory, но требует fine-tuning heads.

System Design:

  1. "Когда speculative decoding НЕ даёт ускорения?" -- Когда acceptance rate <50% (draft слишком отличается от target), batch size >8 (GPU уже загружен compute, bottleneck не в memory bandwidth), или когда задача creative (высокий temperature снижает acceptance rate).

Common Mistakes

"Speculative decoding всегда даёт 2-3x speedup" -- Speedup зависит от acceptance rate, который зависит от similarity draft/target моделей и типа задачи. На creative writing (temperature 1.0) -- acceptance rate падает до 30-40%, speedup <1.5x. На code/factual (temperature 0) -- 70-80% acceptance, speedup 2.5-3x.

"Можно использовать любую маленькую модель как draft" -- Draft model должна быть обучена на тех же данных или fine-tuned для alignment с target. Random small model даёт acceptance rate <20% -- это медленнее чем без speculation (overhead верификации).

"Speculative decoding и batching -- одно и то же ускорение" -- Batching ускоряет throughput (больше запросов параллельно). Speculative decoding ускоряет latency одного запроса. Они complementary, но при большом batch speculative decoding теряет эффективность (GPU уже compute-bound).


See Also