LLM Inference Optimization¶
~10 minute read
Continuous batching, PagedAttention, FlashAttention, speculative decoding (EAGLE-3), quantization (FP8/INT4/NVFP4), MoE, engine comparison (vLLM, SGLang, TensorRT-LLM) (2025-2026)
Prerequisites: LLM memory architecture, KV cache optimization
Why this matters. LLM inference accounts for 80-90% of total cost of ownership in production. Serving Llama 70B on 4xA100 runs ~$25K/month at cloud prices. Without optimization, GPU utilization sits at 30-50%, and you pay for idle hardware. Together, the five core techniques (continuous batching, PagedAttention, quantization, FlashAttention, speculative decoding) deliver a 10-50x throughput improvement and a 2-4x reduction in cost per token. Adoption order is critical: continuous batching gives 2-24x throughput at low complexity, while speculative decoding gives a 2-6x latency improvement (EAGLE-3) at medium complexity.
Key Concepts¶
The Five Core Techniques¶
| # | Technique | Speedup | Adoption Complexity |
|---|---|---|---|
| 1 | Continuous batching | 2-24x throughput | Low |
| 2 | KV cache (PagedAttention) | 2-4x memory efficiency | Low |
| 3 | Quantization (FP8/INT8) | 2x memory, 2x speed | Low |
| 4 | FlashAttention | 10-20x memory savings | Low |
| 5 | Speculative decoding | 2-3x latency | Medium |
Adoption order: top to bottom -- highest ROI at lowest complexity.
1. Continuous Batching¶
Problem: with static batching, the GPU idles while it waits for the slowest requests in the batch.
Solution: schedule at the iteration level -- remove requests the moment they finish and admit queued ones immediately (see the toy scheduler sketch after the tables below).
| Metric | Static Batching | Continuous Batching |
|---|---|---|
| GPU utilization | 30-50% | 70-90% |
| Throughput | Baseline | 2-4x |
| Latency variance | High | Low |
| vLLM vs baseline | -- | Up to 23x |
| vLLM vs TGI (high concurrency) | -- | 24x |
Framework support:
| Framework | Feature Name |
|---|---|
| vLLM | Iteration-level scheduling (core) |
| SGLang | Built-in |
| TensorRT-LLM | In-flight batching |
| LMDeploy | Persistent batching |
| HuggingFace TGI | Supported |
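A minimal sketch of iteration-level scheduling, assuming a hypothetical `decode_step` model hook (one forward pass that returns the next token for every active sequence); real engines add preemption, prefill/decode interleaving, and memory-aware admission.

```python
# Toy continuous-batching loop: finished sequences leave the batch after every
# iteration and queued requests join immediately, so the GPU never sits idle
# waiting for the slowest request. `decode_step` is a hypothetical model hook.
from collections import deque

def serve(requests, decode_step, eos=0, max_batch=8):
    queue, active, done = deque(requests), [], []
    while queue or active:
        # Admit new requests the moment slots free up (iteration-level scheduling)
        while queue and len(active) < max_batch:
            active.append({"tokens": list(queue.popleft())})
        # One decode iteration over the whole current batch
        next_tokens = decode_step([r["tokens"] for r in active])
        for r, t in zip(active, next_tokens):
            r["tokens"].append(t)
        # Retire finished sequences; survivors stay for the next iteration
        done += [r for r in active if r["tokens"][-1] == eos]
        active = [r for r in active if r["tokens"][-1] != eos]
    return [r["tokens"] for r in done]
```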
2. PagedAttention (KV Cache)¶
Problem: contiguous KV cache allocation -> 60-80% memory waste due to fragmentation.
Solution: page-based memory management, analogous to virtual memory in an OS (a toy block-table sketch follows the diagram and tables below).
graph LR
subgraph Traditional["Traditional KV Cache (60-80% waste)"]
direction LR
T1["Request 1: contiguous block"] --> TF["Fragmentation (waste)"]
T2["Request 2: contiguous block"] --> TF
end
subgraph Paged["PagedAttention (<4% waste)"]
direction LR
P1["R1"] --> P2["R1"] --> P3["R2"] --> P4["R1"] --> P5["R2"] --> P6["R2"] --> P7["R1"]
end
style T1 fill:#e8eaf6,stroke:#3f51b5
style T2 fill:#e8f5e9,stroke:#4caf50
style TF fill:#fce4ec,stroke:#c62828
style P1 fill:#e8eaf6,stroke:#3f51b5
style P2 fill:#e8eaf6,stroke:#3f51b5
style P3 fill:#e8f5e9,stroke:#4caf50
style P4 fill:#e8eaf6,stroke:#3f51b5
style P5 fill:#e8f5e9,stroke:#4caf50
style P6 fill:#e8f5e9,stroke:#4caf50
style P7 fill:#e8eaf6,stroke:#3f51b5
| Metric | Traditional | PagedAttention |
|---|---|---|
| Memory waste | 60-80% | <4% |
| Memory efficiency | 60-70% | 95%+ |
| Max batch size | Limited by longest seq | 2-4x larger |
| Throughput vs FasterTransformer | -- | 2-4x |
Advanced techniques (2025-2026):
| Technique | Description |
|---|---|
| LMCache | Hierarchical caching: GPU -> CPU -> network |
| Prefix caching | Reuse common prompt prefixes |
| KV compression | FP8/INT8 for 2-3x memory savings |
| Automatic prefix caching | vLLM built-in feature |
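A toy sketch of the block-table idea (hypothetical class and method names; 16-token blocks as in vLLM's default): each request's logical KV blocks map to arbitrary physical slots in a shared pool, so no request ever needs a contiguous region.

```python
# Toy PagedAttention bookkeeping: logical KV blocks -> non-contiguous physical
# blocks drawn from a shared free pool, like an OS page table.
BLOCK_TOKENS = 16  # vLLM's default block size

class BlockManager:
    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))
        self.tables = {}                        # request id -> [physical block ids]

    def append_token(self, req_id, pos):
        # A new physical block is needed only once every BLOCK_TOKENS tokens
        if pos % BLOCK_TOKENS == 0:
            self.tables.setdefault(req_id, []).append(self.free.pop())

    def lookup(self, req_id, pos):
        # Virtual -> physical translation for the attention kernel
        return self.tables[req_id][pos // BLOCK_TOKENS], pos % BLOCK_TOKENS

    def release(self, req_id):
        self.free.extend(self.tables.pop(req_id))  # blocks return to the pool
```

Internal fragmentation is bounded by one partially filled block per request, which is where the <4% waste figure comes from.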
3. Quantization¶
| Format | Memory (7B) | Quality Loss | Best For |
|---|---|---|---|
| FP16/BF16 | ~14 GB | Baseline | Maximum quality |
| FP8 | ~7 GB | <1% | Production inference (Hopper+) |
| INT8 | ~7 GB | ~2% | Balanced deployment |
| INT4 | ~3.5 GB | 8-10% | Edge/resource-constrained |
| NVFP4 | ~4 GB | <1% | Next-gen GPUs (Blackwell) |
FP8 (State of the Art): requires Hopper (H100) or Ada. 2.3x inference speedup vs FP16 on LLaMA-v2-7B.
NVFP4 (Blackwell): 3.5x memory reduction vs FP16, <1% accuracy degradation (LiveCodeBench, MMLU-PRO).
Rankings:
| Rank | Method | Best For |
|---|---|---|
| 1 | FP8 | Batch >= 16, optimal perf/accuracy |
| 2 | Q5_K_M / GPTQ-INT8 | Best trade-off for most domains |
| 3 | AWQ | Better than GPTQ weight-only |
| 4 | INT4 (GPTQ) | Use cautiously, significant loss |
Task-specific impact: coding and STEM tasks degrade the most. Models of 70B+ can hold up at 4-bit; smaller models need 8-bit.
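The memory column in the format table above is essentially params x bytes per parameter. A quick weights-only sanity check (ignores KV cache, activations, and quantization scales):

```python
# Weight memory for a 7B model at different precisions: N params x bits / 8.
def weight_memory_gb(params_billions, bits_per_param):
    return params_billions * bits_per_param / 8  # 1e9 params and GB cancel out

for fmt, bits in [("FP16", 16), ("FP8", 8), ("INT8", 8), ("INT4", 4)]:
    print(f"{fmt}: {weight_memory_gb(7, bits):.1f} GB")  # 14.0 / 7.0 / 7.0 / 3.5
```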
4. FlashAttention¶
Fused operations: a single kernel processes attention block by block and never materializes the full attention matrix.
| Version | GPU | TFLOPS (FP16) | Utilization | Key Features |
|---|---|---|---|---|
| FA-1 | A100 | ~300 | ~50% | Basic fusion |
| FA-2 | A100/H100 | ~400 | ~35% | Improved kernels |
| FA-3 | H100 | 840 | 85% | Warp specialization, FP8 |
| FA-4 | Blackwell | TBD | Higher | Scalability, reduced overhead |
FA-3 on H100: BF16 = 840 TFLOPS (85% utilization). FP8 = 1,300 TFLOPS (1.3 PFLOPS). 1.5-2x faster than FA-2.
Memory: O(n^2) -> O(n). Standard attention at 128K context is prohibitive; FlashAttention makes it manageable.
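To make the block-processing idea concrete, here is a toy single-head version in NumPy (illustrative only -- the real thing is a fused CUDA kernel with tiling along both axes): it streams over K/V in blocks while keeping a running max and softmax normalizer per query row, so the full n x n score matrix never exists.

```python
# Toy "flash-style" attention with an online softmax over K/V blocks.
import numpy as np

def flash_attention(Q, K, V, block_size=64):
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q)
    m = np.full(n, -np.inf)   # running row max of the scores
    l = np.zeros(n)           # running softmax denominator
    for start in range(0, n, block_size):
        Kb, Vb = K[start:start + block_size], V[start:start + block_size]
        S = (Q @ Kb.T) * scale                 # (n, block) scores for this block only
        m_new = np.maximum(m, S.max(axis=1))
        p = np.exp(S - m_new[:, None])         # unnormalized block probabilities
        corr = np.exp(m - m_new)               # rescale previously accumulated terms
        l = l * corr + p.sum(axis=1)
        out = out * corr[:, None] + p @ Vb
        m = m_new
    return out / l[:, None]

# Agrees with naive attention up to float error:
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 32)) for _ in range(3))
S = Q @ K.T / np.sqrt(32)
ref = (np.exp(S - S.max(1, keepdims=True)) /
       np.exp(S - S.max(1, keepdims=True)).sum(1, keepdims=True)) @ V
assert np.allclose(flash_attention(Q, K, V), ref, atol=1e-6)
```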
5. Speculative Decoding¶
| Step | Description |
|---|---|
| Draft | Smaller model predicts multiple tokens ahead |
| Verify | Target model verifies candidates in single forward pass |
| Accept/Reject | Accepted tokens used; rejected trigger regeneration |
| Output | Mathematically identical to standard decoding |
EAGLE-3 (NeurIPS 2025, SOTA):
- Architecture: 1-2 transformer layers as a "draft head" (2-5% of target size)
- Innovation: fusion of low-, mid-, and high-level semantic features
- Speedup: 2-6x
- Frameworks: vLLM v0.8.5+, SGLang via SpecForge
Adoption: Google Search (AI Overviews), vLLM (native EAGLE-1/3 support), SGLang.
Limiting factors: acceptance rate is typically 0.6-0.8; domain mismatch between draft and target reduces effectiveness.
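A toy greedy version of the draft-verify loop (hypothetical `draft_next` and `target_argmax` callables; real systems score all positions in one batched target pass and use rejection sampling when decoding with temperature):

```python
# Greedy speculative decoding: the draft proposes k tokens, the target keeps
# the longest prefix it agrees with, then emits one token of its own.
def speculative_decode_greedy(ctx, draft_next, target_argmax, k=4, max_new=64):
    out = list(ctx)
    while len(out) - len(ctx) < max_new:
        # 1. Draft: propose k tokens autoregressively with the small model
        proposal, tmp = [], list(out)
        for _ in range(k):
            tmp.append(draft_next(tmp))
            proposal.append(tmp[-1])
        # 2. Verify: in practice all k positions are checked in ONE forward pass
        accepted = 0
        for i, tok in enumerate(proposal):
            if target_argmax(out + proposal[:i]) != tok:
                break
            accepted += 1
        out.extend(proposal[:accepted])
        # 3. The verification pass yields the target's own next token "for free"
        out.append(target_argmax(out))
    return out
```

Every emitted token equals the target model's greedy choice at that position, which is why the output is identical to decoding with the target alone.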
6. MoE for Inference¶
| Metric | Improvement |
|---|---|
| Compute per inference | 90-95% reduction |
| Training efficiency | 2-7x faster |
| Power consumption | Up to 50% reduction |
Notable models:
| Model | Total Params | Active Params | Context |
|---|---|---|---|
| DeepSeek R1 | 671B | 37B | Standard |
| Gemini 1.5 | ~1T | 150-200B | 1M tokens |
| Kimi K2 | ~1T | 32B | Long context |
Research (2025): Super Experts (critical subset), MaxScore routing (constrained optimization), MegaScale-Infer (disaggregated expert parallelism).
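The compute savings come entirely from sparse routing: only k of E expert FFNs run per token. A minimal top-k router sketch in NumPy (toy shapes and names; production routers add load-balancing losses and capacity limits):

```python
# Toy top-k MoE layer: route each token to k of E experts, e.g. 8 of 256 in
# DeepSeek-style models, which is how 671B total -> 37B active works.
import numpy as np

def moe_layer(x, W_gate, experts, k=2):
    """x: (d,) token activation; W_gate: (d, E); experts: list of E callables."""
    logits = x @ W_gate                        # (E,) routing scores
    topk = np.argsort(logits)[-k:]             # indices of the k best experts
    w = np.exp(logits[topk] - logits[topk].max())
    w /= w.sum()                               # softmax over the selected experts
    # Only k expert FFNs execute; the remaining E - k are skipped entirely
    return sum(wi * experts[i](x) for wi, i in zip(w, topk))
```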
Inference Engines¶
Comparing the Leaders¶
| Engine | Best For | Throughput | TTFT | License |
|---|---|---|---|---|
| TensorRT-LLM | NVIDIA hardware, max perf | 180-220 req/s | 35-50ms | NVIDIA |
| vLLM | Flexibility, community | 100-150 req/s | 50-80ms | Apache 2.0 |
| SGLang | Structured output, multi-turn | 120-180 req/s | 40-70ms | Apache 2.0 |
| llama.cpp | CPU/edge | 10-50 req/s | 100-500ms | MIT |
| TGI | HuggingFace ecosystem | -- | -- | HF License |
vLLM¶
graph TD
VS["vLLM Server"] --> SCH["Scheduler<br/>continuous batching, preemption/swapping"]
SCH --> BM["Block Manager<br/>PagedAttention, copy-on-write prefix caching"]
BM --> W["Worker(s)<br/>FlashAttention, tensor parallelism"]
style VS fill:#e8eaf6,stroke:#3f51b5
style SCH fill:#e8f5e9,stroke:#4caf50
style BM fill:#fff3e0,stroke:#ef6c00
style W fill:#f3e5f5,stroke:#9c27b0
- Developer: UC Berkeley Sky Lab (2023)
- 50+ model architectures, 2M+ monthly downloads
- Key: PagedAttention (basic usage sketched below)
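A minimal offline-inference example with the vLLM Python API (the model name is a placeholder; `quantization="fp8"` assumes Hopper-class or newer hardware and a recent vLLM build):

```python
from vllm import LLM, SamplingParams

# Continuous batching and PagedAttention are on by default; the flags below
# are illustrative knobs rather than required settings.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model id
    quantization="fp8",               # on-the-fly FP8 weight quantization
    gpu_memory_utilization=0.90,      # VRAM fraction for weights + KV cache
    enable_prefix_caching=True,       # automatic prefix caching
)
params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain PagedAttention in one paragraph."], params)
print(outputs[0].outputs[0].text)
```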
SGLang¶
- Developer: Stanford/Berkeley (2024)
- Key: RadixAttention (tree-based KV cache sharing), best structured output (regex-constrained JSON/XML)
- Optimized for multi-turn, 5x throughput in multi-call workloads
- Most stable per-token latency (4-21ms)
TensorRT-LLM¶
graph TD
TRT["TensorRT-LLM"] --> RT["Runtime<br/>in-flight batching, KV cache management"]
RT --> KL["Kernel Library<br/>custom attention kernels, quantized matmul, fused ops"]
KL --> TE["TensorRT Engine<br/>graph optimization, kernel auto-tuning"]
style TRT fill:#fff3e0,stroke:#ef6c00
style RT fill:#e8eaf6,stroke:#3f51b5
style KL fill:#e8f5e9,stroke:#4caf50
style TE fill:#f3e5f5,stroke:#9c27b0
- Developer: NVIDIA (2023)
- Best on A100/H100/B200, highest throughput, lowest latency
Benchmarks by GPU Tier (Llama 70B)¶
Throughput (req/sec):
| Engine | A100 (80GB) | H100 (80GB) | B200 |
|---|---|---|---|
| TensorRT-LLM | 120-150 | 180-220 | 250-300 |
| SGLang | 80-120 | 120-180 | 180-240 |
| vLLM | 70-100 | 100-150 | 150-200 |
| llama.cpp | 10-20 | 15-30 | N/A |
TTFT (ms) by sequence length:
| Engine | 128 tokens | 1024 tokens | 8192 tokens |
|---|---|---|---|
| TensorRT-LLM | 35-50 | 60-100 | 200-400 |
| SGLang | 40-70 | 80-120 | 250-500 |
| vLLM | 50-80 | 100-150 | 300-600 |
| llama.cpp | 100-200 | 200-400 | 500-1500 |
Memory:
| Engine | 7B | 70B | 120B |
|---|---|---|---|
| vLLM | 16GB | 140GB | 240GB |
| SGLang | 14GB | 130GB | 220GB |
| TensorRT-LLM | 12GB | 120GB | 200GB |
Prefix Caching¶
| Engine | Method | Cache Hit Benefit |
|---|---|---|
| vLLM | Automatic | 5-10x faster |
| SGLang | RadixAttention | 10-20x faster |
| TensorRT-LLM | Manual | 3-5x faster |
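A toy illustration of the mechanism behind the table above (hypothetical `compute_kv` stand-in that returns per-token KV entries as a list): on a cache hit, only the new suffix needs prefill, which is where the speedups come from. RadixAttention generalizes this lookup to a radix tree over token sequences.

```python
# Toy prefix cache: KV state keyed by token prefix; hits skip most of prefill.
cache = {}  # tuple of token ids -> cached KV entries (a list, in this toy)

def prefill_with_cache(tokens, compute_kv):
    for cut in range(len(tokens), 0, -1):          # longest cached prefix wins
        prefix = tuple(tokens[:cut])
        if prefix in cache:
            kv = cache[prefix] + compute_kv(tokens[cut:])  # prefill suffix only
            break
    else:
        kv, cut = compute_kv(tokens), 0            # cold start: full prefill
    cache[tuple(tokens)] = kv
    return kv, len(tokens) - cut                   # number of tokens prefilled
```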
Decision Tree¶
graph TD
START["Выбор движка"] --> Q1{"Max throughput<br/>на NVIDIA?"}
Q1 -->|"Да"| TRT["TensorRT-LLM"]
Q1 -->|"Нет"| Q2{"Structured output<br/>(JSON/XML)?"}
Q2 -->|"Да"| SGL["SGLang"]
Q2 -->|"Нет"| Q3{"Multi-turn<br/>conversations?"}
Q3 -->|"Да"| SGL2["SGLang"]
Q3 -->|"Нет"| Q4{"CPU / edge?"}
Q4 -->|"Да"| LLAMA["llama.cpp"]
Q4 -->|"Нет"| VLLM["vLLM"]
style START fill:#e8eaf6,stroke:#3f51b5
style TRT fill:#fff3e0,stroke:#ef6c00
style SGL fill:#e8f5e9,stroke:#4caf50
style SGL2 fill:#e8f5e9,stroke:#4caf50
style LLAMA fill:#f3e5f5,stroke:#9c27b0
style VLLM fill:#e8f5e9,stroke:#4caf50
Interview Questions¶
Q: Name the five key LLM inference optimization techniques in priority order.
Red flag: "Quantization, pruning, knowledge distillation..." (confuses inference optimization with model compression)
Strong answer: "1) Continuous batching (2-24x throughput, dynamically add/remove requests -- biggest ROI at low complexity). 2) PagedAttention (KV cache paging, memory waste: 60-80% -> <4%, enables 2-4x larger batch sizes). 3) Quantization FP8 (2x memory + 2x speed, <1% quality loss on Hopper+). 4) FlashAttention (O(n^2) -> O(n) memory, FA-3: 840 TFLOPS, 85% utilization on H100). 5) Speculative decoding (draft-verify pattern, 2-6x latency with EAGLE-3, mathematically identical output). Order: throughput -> memory -> latency."
Q: How does PagedAttention work, and why does it matter?
Red flag: "It's an attention optimization that speeds up computation."
Strong answer: "An analog of virtual memory in an OS. The KV cache is split into fixed-size blocks (16 tokens) allocated non-contiguously on demand. Block tables map virtual -> physical addresses. This solves fragmentation: a traditional KV cache wastes 60-80% of memory due to contiguous allocation; PagedAttention wastes <4%. That allows 2-4x larger batch sizes on the same GPU. Built into vLLM, with copy-on-write for prefix caching. Without PagedAttention, Llama 70B on an A100 80GB serves ~8 concurrent requests; with it, up to 32."
Q: Speculative decoding -- how does it work, and when does it not help?
Red flag: "A small model generates the answer and a big one checks it."
Strong answer: "A draft model (2-5% of the target's parameters) generates N candidate tokens. The target model verifies all N in a single forward pass (batch verification). Accepted tokens are kept; a rejection triggers regeneration. The output is mathematically identical -- this is not an approximation. EAGLE-3 (NeurIPS 2025): 2-6x speedup, 1-2 transformer layers as the draft head. Google Search uses it for AI Overviews. Limitation: acceptance rate of 0.6-0.8, and under domain mismatch (draft not trained on your domain) it can be slower than direct generation."
Q: vLLM vs SGLang vs TensorRT-LLM -- how do you choose?
Red flag: "vLLM is the most popular, so that's what we use."
Strong answer: "It depends on the workload. TensorRT-LLM: max throughput (180-220 req/s), lowest latency (35-50ms TTFT), leads at concurrency < 16, but proprietary and harder to deploy. vLLM: best flexibility, 50+ model architectures, Apache 2.0, leads at high concurrency (128+), 2M+ monthly downloads. SGLang: best structured output (RadixAttention + regex-constrained JSON/XML), best multi-turn (5x throughput), and RadixAttention gives a 10-20x speedup on cache hits. Chat with JSON output -- SGLang. Batch processing -- TRT-LLM. Prototyping and flexibility -- vLLM."
Key Numbers¶
| Fact | Value |
|---|---|
| Continuous batching (vLLM vs TGI) | Up to 24x throughput |
| PagedAttention memory waste | 60-80% -> <4% |
| FP8 speedup vs FP16 | 2.3x |
| FlashAttention-3 BF16 | 840 TFLOPS, 85% utilization |
| FlashAttention-3 FP8 | 1.3 PFLOPS |
| EAGLE-3 speedup | 2-6x |
| Speculative acceptance rate | 0.6-0.8 |
| TensorRT-LLM throughput (H100) | 180-220 req/s |
| vLLM throughput (H100) | 100-150 req/s |
| SGLang RadixAttention cache hit | 10-20x faster |
| MoE compute reduction | 90-95% |
| TTFT latency target | <100ms |
| TBT latency alert threshold | >200ms |
Misconception: quantization is always safe
FP8 gives <1% quality loss on general benchmarks, but degradation on coding/STEM tasks can reach 3-5%. INT4 loses 8-10% on general benchmarks and up to 15% on math. Models under 13B parameters are especially sensitive: a 7B model at INT4 can lose 12-15% on HumanEval, while a 70B model at INT4 loses only 5-7%. Rule of thumb: use at least INT8 for models under 13B, and at least FP8 for coding tasks.
Misconception: speculative decoding always gives a 2-3x speedup
The 2-6x speedup (EAGLE-3) assumes an acceptance rate of 0.6-0.8. But acceptance rate depends on the domain match between the draft and target models. On out-of-distribution data (model trained on English, queries in Chinese) the acceptance rate drops to 0.3-0.4, and speculative decoding can become slower than direct generation because of verification overhead. Always measure the acceptance rate on your actual workload.
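A rough model of why low acceptance kills the speedup: with i.i.d. per-token acceptance probability a and draft length k, the expected tokens produced per target forward pass follow the geometric series from the original speculative sampling analysis (Leviathan et al., 2023). Net speedup is lower still once draft-model cost is subtracted.

```python
# Expected tokens per target pass: (1 - a^(k+1)) / (1 - a), for acceptance
# probability a and k drafted tokens; ignores the cost of running the draft.
def expected_tokens(a, k=4):
    return (1 - a ** (k + 1)) / (1 - a)

for a in (0.8, 0.6, 0.35):
    print(f"a={a}: {expected_tokens(a):.2f} tokens per target pass")
# a=0.8 -> 3.36, a=0.6 -> 2.31, a=0.35 -> 1.53 (barely worth the overhead)
```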
Misconception: more GPUs = linearly more throughput
Tensor parallelism across 2 GPUs yields ~1.8x throughput (not 2x) due to communication overhead. Across 4 GPUs, ~3.2x; across 8 GPUs, ~5-6x. Single-request latency can even increase because of all-reduce synchronization. For latency-sensitive workloads, one GPU with a smaller model (FP8 quantization) often beats four GPUs with the full one.
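The same numbers as per-GPU efficiency (illustrative figures taken from this paragraph, not a benchmark; 5.5x stands in for the 5-6x range):

```python
# Per-GPU efficiency implied above: total speedup divided by GPU count.
for gpus, speedup in [(1, 1.0), (2, 1.8), (4, 3.2), (8, 5.5)]:
    print(f"{gpus} GPU(s): {speedup:>3}x total, {speedup / gpus:.0%} per GPU")
# 100% -> 90% -> 80% -> ~69%: every added all-reduce participant costs efficiency
```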
Sources¶
- Zylos Research -- "AI Inference Optimization Techniques (2025-2026)"
- NVIDIA -- Speculative Decoding Blog, NVFP4 Blog
- Google Research -- Speculative Decoding retrospective
- arXiv -- EAGLE-3 (2503.01840), PagedAttention (2309.06180)
- vLLM Blog -- "Anatomy of vLLM"
- Clarifai -- "Comparing SGLang, vLLM, and TensorRT-LLM with GPT-OSS-120B"
- Kanerika -- "SGLang vs vLLM: Which is Better in 2026?"
- arXiv -- "A Survey of LLM Inference Systems" (2506.21901)
- SemiAnalysis -- "InferenceMAX: Open Source Inference Benchmarking"
- DigitalOcean -- "FlashAttention 4: Faster, Memory-Efficient Attention"
- Modal -- "High-performance LLM Inference Guide"
See Also¶
- vLLM & PagedAttention -- углубленный разбор PagedAttention и архитектуры vLLM
- Speculative Decoding -- детали draft-verify паттерна и EAGLE
- Serving Benchmarks -- методология бенчмаркинга: TTFT, TPOT, SLO
- Inference Engine Comparison -- vLLM vs SGLang vs TRT-LLM детальное сравнение
- Flash Attention 3 -- FA-3 kernel optimizations для H100
- LLM Production Deploy -- от бенчмарков к production: мониторинг, масштабирование