
Speculative Decoding

~9 min read

Prerequisites: KV Cache, vLLM and PagedAttention


Why This Matters

An LLM generates tokens one at a time: 1 forward pass = 1 token. During decoding the GPU is only ~30% utilized -- the bottleneck is not compute but memory bandwidth (loading the weights from VRAM). Compute sits idle.

Analogy: imagine an editor who proofreads text one word at a time. An assistant (the draft model) proposes 5 words at once, and the editor checks them in a single glance. If 4 out of 5 are correct, throughput grows 4x. If all 5 are wrong, only one glance is wasted and the assistant tries again.

Key property: the output is identical to standard decoding. This is not an approximation -- the target model always verifies the draft. Speedup of 1.5-4x with no quality loss.


Overview

Speculative Decoding is a technique for accelerating LLM inference: a lightweight draft model proposes several tokens, and the target model verifies them in a single forward pass.

Key Innovation

\[ \text{Speedup} \approx N_{\text{tokens per pass}} \times \text{Acceptance Rate} \]

Speculative decoding is LOSSLESS acceleration

Unlike quantization or pruning, speculative decoding produces exactly the same output as standard autoregressive decoding. The target model always verifies the draft -- if a draft token is wrong, it is rejected and resampled. Speedup of 1.5-4x with no quality loss.

Instead of 1 token per forward pass (memory-bound), you get N tokens per forward pass (compute-bound).

Comparison: Standard vs Speculative Decoding

Aspect Standard Autoregressive Speculative Decoding
Tokens per pass 1 2-8 (avg)
Compute utilization Low (~30%, idle) High
Latency reduction Baseline 1.5-4×
Output quality Exact Exact (lossless)
Extra VRAM None +3-10%

1. Why Speculative Decoding Exists

Problem: Memory-Bound Inference

LLM inference is limited by memory bandwidth, not by compute:

GPU Compute: 1000 TFLOPS
Memory BW: 3 TB/s
Effective utilization: ~30% (most compute sits idle)

Each token requires: 1. Loading the model weights from VRAM 2. One forward pass 3. Memory synchronization
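
A back-of-the-envelope check of the memory-bound claim -- a sketch using the illustrative hardware numbers above and a hypothetical 70B fp16 model:

# Rough per-token cost floors for a hypothetical 70B-parameter fp16 model,
# using the illustrative numbers above (1000 TFLOPS, 3 TB/s).
params = 70e9                 # model parameters (hypothetical)
weight_bytes = params * 2     # fp16/bf16: 2 bytes per parameter (~140 GB)
mem_bw = 3e12                 # bytes/s
compute = 1000e12             # FLOP/s

t_memory = weight_bytes / mem_bw     # time just to stream the weights once
t_compute = 2 * params / compute     # ~2 FLOPs per parameter per token

print(f"memory floor: {t_memory * 1e3:.1f} ms/token")    # ~47 ms
print(f"compute floor: {t_compute * 1e3:.2f} ms/token")   # ~0.14 ms

Streaming the weights dominates by orders of magnitude, which is why the GPU's arithmetic units mostly sit idle during decoding.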

Solution: Parallel Verification

Speculative decoding uses the idle compute for parallel verification:

# Standard: N passes for N tokens
for _ in range(N):
    token = model(input)  # N expensive passes
    input = concat(input, token)

# Speculative: 1 pass for N tokens (if accepted)
draft_tokens = draft_model.generate(input, n=N)  # Fast, cheap
verified_tokens = target_model.verify(input, draft_tokens)  # Single pass

Key insight: the target model can verify multiple tokens in a single forward pass thanks to: 1. KV Cache for the prefix 2. Parallel attention operations 3. Batch processing on the GPU
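
A minimal sketch of why one pass is enough: the target's forward over prefix + drafts produces a distribution at every position, and the slice below picks out exactly the positions that score the N draft tokens (hypothetical lengths, with a random tensor standing in for real logits):

import torch

prefix_len, num_draft, vocab = 6, 4, 32000
# Stand-in for target_model(prefix + draft_tokens).logits
logits = torch.randn(1, prefix_len + num_draft, vocab)

# Logits at position p predict the token at position p + 1, so the N draft
# tokens are scored by positions prefix_len - 1 ... prefix_len + num_draft - 2.
verify_logits = logits[:, prefix_len - 1 : prefix_len + num_draft - 1, :]
print(verify_logits.shape)  # (1, 4, 32000): one distribution per draft token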


2. Draft-Target Approach (Classic)

Architecture

graph LR
    DRAFT["Draft Model<br/>(small, fast)"] -->|"Draft N tokens"| INPUT["Input +<br/>Draft Tokens"]
    INPUT --> TARGET["Target Model<br/>(large)"]
    TARGET -->|"Verify in 1 pass"| ACC["Accepted Tokens"]
    style DRAFT fill:#e8f5e9,stroke:#4caf50
    style TARGET fill:#e8eaf6,stroke:#3f51b5
    style ACC fill:#fff3e0,stroke:#ef6c00

Rejection Sampling

Acceptance Logic:

\[ P(\text{accept } t) = \min\left(1, \frac{p_{\text{target}}(t)}{p_{\text{draft}}(t)}\right) \]

If the draft token is rejected, a replacement is sampled from the adjusted distribution \( \operatorname{norm}\bigl(\max(0,\ p_{\text{target}} - p_{\text{draft}})\bigr) \).

Where: - \(p_{\text{target}}(t)\) is the token probability under the target model - \(p_{\text{draft}}(t)\) is the token probability under the draft model

Algorithm

def speculative_decode(draft_model, target_model, prompt, num_speculative=4):
    tokens = prompt

    while not finished:
        # 1. Draft model generates N tokens
        draft_tokens = []
        draft_dists = []  # full draft distributions, needed for corrected resampling
        current = tokens
        for _ in range(num_speculative):
            logits = draft_model(current)
            prob = softmax(logits[-1])
            token = sample(prob)
            draft_tokens.append(token)
            draft_dists.append(prob)
            current = concat(current, token)

        # 2. Target model verifies all in ONE pass
        target_logits = target_model(concat(tokens, draft_tokens))

        # 3. Acceptance sampling
        accepted = 0
        for i, (draft_token, draft_dist) in enumerate(zip(draft_tokens, draft_dists)):
            target_dist = softmax(target_logits[len(tokens) + i - 1])
            target_prob = target_dist[draft_token]
            draft_prob = draft_dist[draft_token]

            if target_prob >= draft_prob:
                # Accept
                accepted += 1
            else:
                # Rejection sampling
                alpha = target_prob / draft_prob
                if random() < alpha:
                    accepted += 1
                else:
                    # Resample from the adjusted distribution norm(max(0, p_target - p_draft))
                    corrected_dist = normalize(maximum(0, target_dist - draft_dist))
                    new_token = sample(corrected_dist)
                    tokens = concat(tokens, draft_tokens[:accepted], new_token)
                    break

        if accepted == num_speculative:
            # Bonus token from target
            bonus = sample(softmax(target_logits[-1]))
            tokens = concat(tokens, draft_tokens, bonus)

    return tokens

Draft Model Selection

Draft Size Target Size Acceptance Rate Speedup
125M 7B 60-70% 1.5-2×
350M 13B 55-65% 1.4-1.8×
1B 70B 45-55% 1.3-1.6×

Rule of thumb: the draft model should be ≈ 1/10 the size of the target model.
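
For example, the 1B/70B pairing from the table could be served as a classic draft-target setup. A minimal sketch, assuming a recent vLLM that accepts speculative_config as a dict; the key names and the specific model choices are illustrative and may differ between versions:

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",        # target
    speculative_config={
        "model": "meta-llama/Llama-3.2-1B-Instruct",  # draft from the same family (shared tokenizer)
        "num_speculative_tokens": 5,
    },
    tensor_parallel_size=4,
)

out = llm.generate(["Speculative decoding works because"],
                   SamplingParams(temperature=0, max_tokens=64))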


3. EAGLE-2 (June 2024)

Paper: "EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees"

Key Innovation: Feature-Level Drafting

Instead of a separate model, EAGLE uses a lightweight head that operates on the target model's hidden states:

graph LR
    TM["Target Model<br/>Layers 0-N"] --> HS["Final Hidden State"]
    HS --> EH["EAGLE Head"]
    EH --> DT["Draft Tokens"]
    TM --> LM["lm_head"]
    style EH fill:#e8f5e9,stroke:#4caf50
    style DT fill:#fff3e0,stroke:#ef6c00

Draft Tree

EAGLE-2 builds a tree of candidate tokens instead of a linear sequence:

graph LR
    INPUT["Input"] --> A["the"] --> A1["cat"] --> A2["sat"]
    INPUT --> B["a"] --> B1["dog"] --> B2["ran"]
    INPUT --> C["one"] --> C1["bird"] --> C2["flew"]
    style A fill:#e8f5e9,stroke:#4caf50
    style B fill:#e8eaf6,stroke:#3f51b5
    style C fill:#fff3e0,stroke:#ef6c00

Advantages: - Adaptive depth based on confidence - Parallel verification via tree attention (see the sketch below) - Instance-adaptive: more candidates for "easy" tokens
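
A minimal sketch of how tree attention lets the whole candidate tree be verified in one pass: each node may attend only to itself and its ancestors. This is an illustrative mask builder, not EAGLE-2's actual implementation; the node order and parent array correspond to the small tree in the diagram above.

import torch

def tree_attention_mask(parent):
    """parent[i] = index of node i's parent in the flattened tree, -1 for roots."""
    n = len(parent)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        mask[i, i] = True          # each node attends to itself...
        j = parent[i]
        while j != -1:             # ...and to all of its ancestors
            mask[i, j] = True
            j = parent[j]
    return mask

# Flattened tree from the diagram: the->cat->sat, a->dog->ran, one->bird->flew
parent = [-1, 0, 1, -1, 3, 4, -1, 6, 7]
print(tree_attention_mask(parent).int())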

Parameters

Parameter Description Default
speculative_num_steps Depth of drafting 5
speculative_eagle_topk Branching factor 4
speculative_num_draft_tokens Max tree size 8

4. EAGLE-3 (March 2025)

Paper: "EAGLE-3: Efficient Speculative Decoding with Multi-Layer Features"

Key Innovation: Multi-Layer Feature Fusion

EAGLE-3 takes hidden states from three layers of the target model:

Target Model:
├── Low Layer (L/4)    ──┐
├── Mid Layer (L/2)    ──┼──▶ Fusion ──▶ EAGLE-3 Head ──▶ Draft Tree
├── High Layer (3L/4)  ──┘
└── lm_head

Training: Train-Time-Testing

EAGLE-3 uses a special training technique that simulates multi-step draft sampling:

For each position:
  1. Generate first draft token (blue)
  2. Generate second draft token (yellow) conditioned on first
  3. Generate third draft token (green) conditioned on first two
  ...

This relies on FlexAttention to compute the sparse attention mask efficiently.

Performance (SGLang, Llama-3.1-8B, MT-Bench)

Method Throughput (tokens/s) Speedup
Standard 158.34 1.0×
EAGLE-2 244.10 1.54×
EAGLE-3 373.25 2.36×

SpecBundle (LMSYS, Dec 2025)

SpecBundle Phase 1 is a collection of production-ready EAGLE-3 models:

  • Trained on: Perfect-Blend dataset (1.4M samples vs 320K in original)
  • Models: Llama-3.1-8B, Llama-3.3-70B, Qwen3-32B, Qwen3-235B-MoE, Kimi-K2
  • Speedup: up to 2.8× end-to-end (see Production Benchmarks below)

5. MTP: Multi-Token Prediction (DeepSeek)

Architecture

DeepSeek V3/R1 use built-in multi-token prediction heads:

graph LR
    M["Model"] --> LM["Main LM Head"] --> T1["Token t+1"]
    M --> H1["MTP Head 1"] --> T2["Token t+2"]
    M --> H2["MTP Head 2"] --> T3["Token t+3"]
    M --> HN["MTP Head N"] --> TN["Token t+N"]
    style LM fill:#e8eaf6,stroke:#3f51b5
    style H1 fill:#e8f5e9,stroke:#4caf50
    style H2 fill:#e8f5e9,stroke:#4caf50
    style HN fill:#e8f5e9,stroke:#4caf50
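
A toy sketch of the idea (not DeepSeek's actual MTP module, which uses full transformer blocks rather than linear heads): extra heads read the shared hidden state and each predicts one additional future token.

import torch
import torch.nn as nn

class ToyMTPHeads(nn.Module):
    def __init__(self, d_model=64, vocab_size=1000, num_extra=2):
        super().__init__()
        # One extra head per additional future position (t+2, t+3, ...)
        self.heads = nn.ModuleList(nn.Linear(d_model, vocab_size) for _ in range(num_extra))

    def forward(self, hidden):            # hidden: [batch, d_model] at position t
        return [head(hidden) for head in self.heads]

heads = ToyMTPHeads()
extra_logits = heads(torch.randn(1, 64))
print([x.shape for x in extra_logits])    # [(1, 1000), (1, 1000)] -> tokens t+2, t+3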

Comparison: MTP vs EAGLE

Aspect MTP (DeepSeek) EAGLE-3
Draft source Multiple prediction heads Single feature-based head
Training Jointly with model Post-hoc
Overhead Built-in Separate model weights
Best for Native MTP models Any model

SGLang MTP Support

python -m sglang.launch_server \
  --model deepseek-ai/DeepSeek-V3 \
  --speculative-algorithm MTP \
  --speculative-num-steps 3

6. Other Methods

Medusa (Multi-Head Prediction)

Extra draft heads attached to the target model, each predicting a token at a later offset. The Hydra variant (2024) makes the heads sequentially dependent, improving token acceptance via prediction-confidence correlation.

Method Speedup Advantages Drawbacks
Medusa 2-3x Multi-head prediction Training multiple heads
PPD (EMNLP 2025) 2x Memory-efficient, <50% training time of Medusa Newer, less tested
OPT-Tree -- Adaptive draft tree, dynamic candidates Complexity
Batch SpecDec (arXiv:2510.22876) 3x at BS=8 Proper batch handling CUDA-only

Online Speculative Decoding

Periodically fine-tune the draft model on corrections from the target model. The draft adapts to the target's behavior over time.

AWS SageMaker (Production)

EAGLE-based adaptive speculative decoding. 2.5x speedup while maintaining quality. Red Hat OpenShift AI also supports EAGLE-3 with vLLM.


7. NGRAM Speculative Decoding

A training-free method that uses n-gram statistics from the context:

from collections import Counter

class NGramSpeculator:
    def __init__(self, n=4):
        self.n = n
        self.ngram_cache = {}  # (token_seq) -> Counter of next-token counts

    def update(self, tokens):
        """Update n-gram statistics from the context"""
        for i in range(len(tokens) - self.n):
            key = tuple(tokens[i:i + self.n])
            next_token = tokens[i + self.n]
            if key not in self.ngram_cache:
                self.ngram_cache[key] = Counter()
            self.ngram_cache[key][next_token] += 1

    def speculate(self, tokens, k=4):
        """Propose up to k candidate next tokens based on the last n-gram"""
        key = tuple(tokens[-self.n:])
        if key in self.ngram_cache:
            return [tok for tok, _ in self.ngram_cache[key].most_common(k)]
        return []
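
Hypothetical usage: warm the cache from the current context, then ask for candidates for the next position (the target model still has to verify them):

spec = NGramSpeculator(n=2)
context = [5, 7, 9, 5, 7, 9, 5, 7]   # toy token ids with a repeating pattern
spec.update(context)
print(spec.speculate(context, k=4))   # [9] -- "5, 7" was always followed by 9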

Advantages: - No extra model - No training required - Works with any LLM

Drawbacks: - CUDA-only - Lower acceptance rate than model-based methods


8. Training Draft Models

Speculators v0.3.0 (vLLM, Dec 2025)

Pipeline:

1. Data Generation:
   ├── Preprocess: chat template, tokenize, loss mask
   ├── Hidden States: extract from vLLM forward pass
   └── Save: input_ids, hidden_states, loss_mask

2. Training:
   ├── Eagle3DraftModel initialization
   ├── Train-time-testing simulation
   ├── FlexAttention for sparse masks
   └── Vocabulary mapping (target-to-draft)

3. Deployment:
   ├── speculators_config in model weights
   └── Seamless vLLM integration

SpecForge v0.2 (SGLang, Dec 2025)

Features: - Multi-backend support (SGLang, HuggingFace) - 10× faster data regeneration - Unified online/offline training

# SpecForge multi-backend
import torch

target_model = get_eagle3_target_model(
    pretrained_model_name_or_path="meta-llama/Llama-3.1-8B-Instruct",
    backend="sglang",  # or "huggingface"
    torch_dtype=torch.bfloat16,
)

9. Code Examples

vLLM with EAGLE-3

from vllm import LLM, SamplingParams

# Standard serving
llm = LLM(
    model="Qwen/Qwen3-32B",
    tensor_parallel_size=1,
)

# With speculative decoding
llm_spec = LLM(
    model="Qwen/Qwen3-32B",
    speculative_config={
        "model": "RedHatAI/Qwen3-32B-speculator.eagle3",
        "num_speculative_tokens": 3,
        "method": "eagle3",
    },
    tensor_parallel_size=1,
)

# Benchmark
prompts = ["Explain speculative decoding in one paragraph."]
sampling_params = SamplingParams(temperature=0, max_tokens=256)

outputs = llm_spec.generate(prompts, sampling_params)
# ~1.8-2× faster than standard

SGLang with EAGLE-3

# Server launch
python -m sglang.launch_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path lmsys/sglang-EAGLE3-llama3.1-8B \
  --speculative-num-steps 5 \
  --speculative-eagle-topk 8 \
  --speculative-num-draft-tokens 64

Python: Standalone Draft Model

import torch
import torch.nn.functional as F

class SpeculativeDecoder:
    def __init__(self, target_model, draft_model, device="cuda"):
        self.target = target_model
        self.draft = draft_model
        self.device = device

    def generate(self, input_ids, max_tokens=100, num_speculative=4):
        # Illustrative implementation, assumes batch size 1
        generated = input_ids.clone()

        while generated.shape[1] < input_ids.shape[1] + max_tokens:
            # Draft phase: sample N tokens from the draft model
            draft_tokens = []
            draft_dists = []  # full draft distributions, needed for corrected resampling
            current = generated

            with torch.no_grad():
                for _ in range(num_speculative):
                    logits = self.draft(current).logits[:, -1, :]
                    probs = F.softmax(logits, dim=-1)
                    token = torch.multinomial(probs, num_samples=1)
                    draft_tokens.append(token)
                    draft_dists.append(probs)
                    current = torch.cat([current, token], dim=-1)

            # Verify phase: one target forward pass over prefix + all drafts
            draft_sequence = torch.cat(draft_tokens, dim=-1)
            verify_input = torch.cat([generated, draft_sequence], dim=-1)

            with torch.no_grad():
                target_logits = self.target(verify_input).logits

            # Acceptance: keep a draft token with probability min(1, p_target / p_draft)
            num_accepted = 0
            corrected = None
            for i, (token, draft_dist) in enumerate(zip(draft_tokens, draft_dists)):
                pos = generated.shape[1] + i - 1  # logits at pos score draft token i
                target_dist = F.softmax(target_logits[:, pos, :], dim=-1)
                target_prob = target_dist.gather(-1, token)
                draft_prob = draft_dist.gather(-1, token)

                if target_prob >= draft_prob:
                    num_accepted += 1
                else:
                    # Rejection sampling
                    alpha = target_prob / (draft_prob + 1e-10)
                    if torch.rand(1, device=self.device) < alpha:
                        num_accepted += 1
                    else:
                        # Resample from the adjusted distribution max(0, p_target - p_draft)
                        adjusted = torch.clamp(target_dist - draft_dist, min=0.0)
                        adjusted = adjusted / adjusted.sum(dim=-1, keepdim=True)
                        corrected = torch.multinomial(adjusted, num_samples=1)
                        break

            # Update generated sequence
            if num_accepted > 0:
                generated = torch.cat([generated] + draft_tokens[:num_accepted], dim=-1)

            if corrected is not None:
                generated = torch.cat([generated, corrected], dim=-1)
            elif num_accepted == num_speculative:
                # All drafts accepted: bonus token from the target's last position
                bonus_dist = F.softmax(target_logits[:, -1, :], dim=-1)
                bonus = torch.multinomial(bonus_dist, num_samples=1)
                generated = torch.cat([generated, bonus], dim=-1)

        return generated
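
Hypothetical usage with HuggingFace models (the draft and target must share a tokenizer; the specific model choices are illustrative):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
target = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", torch_dtype=torch.bfloat16, device_map="cuda")
draft = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct", torch_dtype=torch.bfloat16, device_map="cuda")

decoder = SpeculativeDecoder(target, draft)
input_ids = tok("The capital of France is", return_tensors="pt").input_ids.cuda()
print(tok.decode(decoder.generate(input_ids, max_tokens=32)[0]))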

10. Production Benchmarks

Qwen3-32B with EAGLE-3 (RTX 6000, 96GB)

Metric Standard Speculative Speedup
Response tokens/s 22.96 41.88 1.82×
Time to First Token 25.62s 15.98s 1.60×
Prompt tokens/s 3.05 5.22 1.71×
Acceptance Rate N/A 33.8%

Llama-3.1-8B with EAGLE-3 (H100)

Method Throughput Speedup
Standard 158 tok/s 1.0×
EAGLE-2 244 tok/s 1.54×
EAGLE-3 373 tok/s 2.36×

SpecBundle (Multiple Models, SGLang)

Model SpecBundle Speedup
Llama-3.1-8B 2.8×
Llama-3.3-70B 2.2×
Qwen3-235B-MoE 2.5×
Kimi-K2 2.1×

11. Decision Framework

Method Selection

Scenario Recommended Method
Best speed/quality EAGLE-3 (SGLang/vLLM)
Broad compatibility EAGLE-2
Native MTP model MTP (DeepSeek V3/R1)
No extra model NGRAM (training-free)
Have smaller LLM STANDALONE draft-target
Maximum throughput EAGLE-3 + torch.compile

Engine Selection

Candidate engines: vLLM, SGLang, TensorRT-LLM, llama.cpp. Support for EAGLE-3, MTP, NGRAM, and STANDALONE draft models varies by engine and version; check each engine's speculative decoding documentation.

Performance Tuning

  1. Start with defaults: num_speculative_tokens=3-5
  2. Tune for your workload:
     • Short outputs: increase num_speculative_tokens
     • Long outputs: decrease for better memory behavior
  3. Monitor acceptance rate:
     • <50%: draft model too weak, increase its size
     • >80%: you can increase speculation depth
  4. Mind VRAM constraints:
     • EAGLE: +3-5% VRAM
     • STANDALONE draft: +10-20% VRAM

12. Interview Questions

Basic

  1. "What is speculative decoding and why does it speed up inference?"
  2. "Explain the draft-target approach"
  3. "What is acceptance rate in speculative decoding?"

Advanced

  1. "Compare EAGLE-3 vs classic draft-target approach"
  2. "How does train-time-testing work in EAGLE-3?"
  3. "Explain why speculative decoding is lossless"
  4. "What factors affect acceptance rate?"

System Design

  1. "Design an LLM serving system with speculative decoding for a chatbot"
  2. "When would you choose NGRAM over EAGLE-3?"
  3. "How to train a custom EAGLE-3 draft model?"
  4. "When does speculative decoding NOT help?"

Q: "When does speculative decoding NOT help?"

Three cases: 1) High-temperature sampling reduces the acceptance rate. 2) Very diverse outputs need too many candidates. 3) Large batch sizes, where the GPU is already compute-bound and the extra verification work competes with other requests. Batch SpecDec (arXiv:2510.22876) helps with case 3. Also: domain mismatch between draft and target reduces the acceptance rate.


13. Formulas Summary

Acceptance Probability

\[P(\text{accept}) = \min\left(1, \frac{p_{\text{target}}(t)}{p_{\text{draft}}(t)}\right)\]

Expected Tokens Per Pass

\[\mathbb{E}[\text{tokens}] = 1 + \sum_{i=1}^{n} P(\text{accept}_1 \cap ... \cap \text{accept}_i)\]

Speedup Formula

\[\text{Speedup} \approx \frac{\mathbb{E}[\text{accepted tokens}] + 1}{1 + \frac{\text{draft cost}}{\text{target cost}}}\]
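
A quick worked example of the two formulas above -- a sketch with an assumed per-step acceptance rate of 0.7, speculation depth 4, and a draft whose per-token cost is 5% of a target pass:

alpha = 0.7                   # assumed per-step acceptance probability
n = 4                         # speculative tokens per round
draft_cost_ratio = 0.05 * n   # total draft cost relative to one target pass

expected_tokens = 1 + sum(alpha ** i for i in range(1, n + 1))  # E[tokens] ≈ 2.77
speedup = expected_tokens / (1 + draft_cost_ratio)              # ≈ 2.3x

print(f"expected tokens per target pass: {expected_tokens:.2f}")
print(f"approximate speedup: {speedup:.2f}x")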

Memory Overhead

\[\text{VRAM}_{\text{extra}} = \text{Draft params} + \text{KV cache for draft}\]
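
For instance, a hypothetical 1B-parameter bf16 draft model adds roughly the following (the architecture shapes and context length below are assumptions for illustration):

params = 1e9
weights_gb = params * 2 / 1e9                      # ~2 GB of bf16 weights

# Draft KV cache: 2 (K and V) * layers * kv_heads * head_dim * 2 bytes * seq * batch
layers, kv_heads, head_dim, seq, batch = 16, 8, 64, 8192, 1
kv_gb = 2 * layers * kv_heads * head_dim * 2 * seq * batch / 1e9   # ~0.27 GB

print(f"extra VRAM ≈ {weights_gb + kv_gb:.1f} GB")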

14. Sources & Further Reading

Papers

  1. Speculative Sampling: Chen et al. (Google DeepMind), 2023 — arXiv:2302.01318
  1b. Fast Inference from Transformers via Speculative Decoding: Leviathan, Kalman, Matias (Google), 2022 — arXiv:2211.17192
  2. EAGLE: Li et al., 2024 — arXiv:2401.15077
  3. EAGLE-2: Li et al., 2024 — arXiv:2406.16858
  4. EAGLE-3: Zhang et al., 2025 — arXiv:2503.01840
  5. DART: Liu et al., 2026 — arXiv:2601.19278
  6. Hydra (Medusa variant): arXiv:2402.05109
  7. Batch Speculative Decoding: arXiv:2510.22876 (Oct 2025)
  8. PPD (Parallel Prompt Decoding): EMNLP 2025
  9. Online Speculative Decoding: arXiv:2310.07177

Documentation

  1. SGLang Speculative Decoding: docs.sglang.io/advanced_features/speculative_decoding.html
  2. vLLM Speculative Decoding: docs.vllm.ai/en/latest/features/spec_decode/
  3. NVIDIA TensorRT-LLM: github.com/NVIDIA/TensorRT-LLM

Production

  1. SpecBundle: huggingface.co/collections/lmsys/specbundle
  2. SpecForge: github.com/sgl-project/SpecForge
  3. Speculators (vLLM): github.com/vllm-project/speculators

Interview Questions

Conceptual:

  1. "Как speculative decoding ускоряет inference без потери качества?" -- Draft model (маленькая, быстрая) генерирует K токенов-кандидатов. Target model (большая) верифицирует все K за один forward pass (параллельно). Принятые токены идентичны тем, что target сгенерировала бы сама. Speedup = K * acceptance_rate, обычно 2-3x.
  2. "Почему speculative decoding математически lossless?" -- Rejection sampling: если draft token не совпадает с distribution target модели, он отвергается и заменяется sample из скорректированного distribution. Финальный output эквивалентен авторегрессивной генерации target модели.
  3. "Draft model vs self-drafting: trade-offs?" -- Отдельная draft model: нужно хранить две модели в памяти, но acceptance rate выше при хорошем выборе. Self-drafting (Medusa, EAGLE): дополнительные prediction heads на target model, нет extra memory, но требует fine-tuning heads.

System Design:

  1. "Когда speculative decoding НЕ даёт ускорения?" -- Когда acceptance rate <50% (draft слишком отличается от target), batch size >8 (GPU уже загружен compute, bottleneck не в memory bandwidth), или когда задача creative (высокий temperature снижает acceptance rate).

Common Mistakes

"Speculative decoding всегда даёт 2-3x speedup" -- Speedup зависит от acceptance rate, который зависит от similarity draft/target моделей и типа задачи. На creative writing (temperature 1.0) -- acceptance rate падает до 30-40%, speedup <1.5x. На code/factual (temperature 0) -- 70-80% acceptance, speedup 2.5-3x.

"Можно использовать любую маленькую модель как draft" -- Draft model должна быть обучена на тех же данных или fine-tuned для alignment с target. Random small model даёт acceptance rate <20% -- это медленнее чем без speculation (overhead верификации).

"Speculative decoding и batching -- одно и то же ускорение" -- Batching ускоряет throughput (больше запросов параллельно). Speculative decoding ускоряет latency одного запроса. Они complementary, но при большом batch speculative decoding теряет эффективность (GPU уже compute-bound).


See Also