LLM Inference Optimization¶
~10 minute read
Continuous batching, PagedAttention, FlashAttention, speculative decoding (EAGLE-3), quantization (FP8/INT4/NVFP4), MoE, engine comparison (vLLM, SGLang, TensorRT-LLM) (2025-2026)
Prerequisites: LLM memory architecture, KV cache optimization
Why this matters. LLM inference accounts for 80-90% of total cost of ownership in production. Serving Llama 70B on 4xA100 runs ~$25K/month at cloud prices. Without optimization, GPU utilization sits at 30-50%, and you pay for idle hardware. Together, the five core techniques (continuous batching, PagedAttention, quantization, FlashAttention, speculative decoding) deliver a 10-50x throughput improvement and a 2-4x reduction in cost per token. Adoption order is critical: continuous batching gives 2-24x throughput at low complexity, while speculative decoding gives a 2-6x latency improvement (EAGLE-3) at medium complexity.
Key Concepts¶
The Five Core Techniques¶
| # | Technique | Speedup | Adoption Complexity |
|---|---|---|---|
| 1 | Continuous batching | 2-24x throughput | Low |
| 2 | KV cache (PagedAttention) | 2-4x memory efficiency | Low |
| 3 | Quantization (FP8/INT8) | 2x memory, 2x speed | Low |
| 4 | FlashAttention | 10-20x memory savings | Low |
| 5 | Speculative decoding | 2-3x latency | Medium |
Adoption order: top to bottom -- highest ROI at lowest complexity.
1. Continuous Batching¶
Problem: with static batching, the GPU idles while it waits for the slowest requests in the batch.
Solution: schedule at the iteration level -- remove requests the moment they finish and admit queued ones immediately (see the toy scheduler sketch after the tables below).
| Metric | Static Batching | Continuous Batching |
|---|---|---|
| GPU utilization | 30-50% | 70-90% |
| Throughput | Baseline | 2-4x |
| Latency variance | High | Low |
| vLLM vs baseline | -- | Up to 23x |
| vLLM vs TGI (high concurrency) | -- | 24x |
Framework support:
| Framework | Feature Name |
|---|---|
| vLLM | Iteration-level scheduling (core) |
| SGLang | Built-in |
| TensorRT-LLM | In-flight batching |
| LMDeploy | Persistent batching |
| HuggingFace TGI | Supported |
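A minimal sketch of iteration-level scheduling, assuming a hypothetical `decode_step` model hook (one forward pass that returns the next token for every active sequence); real engines add preemption, prefill/decode interleaving, and memory-aware admission.

```python
# Toy continuous-batching loop: finished sequences leave the batch after every
# iteration and queued requests join immediately, so the GPU never sits idle
# waiting for the slowest request. `decode_step` is a hypothetical model hook.
from collections import deque

def serve(requests, decode_step, eos=0, max_batch=8):
    queue, active, done = deque(requests), [], []
    while queue or active:
        # Admit new requests the moment slots free up (iteration-level scheduling)
        while queue and len(active) < max_batch:
            active.append({"tokens": list(queue.popleft())})
        # One decode iteration over the whole current batch
        next_tokens = decode_step([r["tokens"] for r in active])
        for r, t in zip(active, next_tokens):
            r["tokens"].append(t)
        # Retire finished sequences; survivors stay for the next iteration
        done += [r for r in active if r["tokens"][-1] == eos]
        active = [r for r in active if r["tokens"][-1] != eos]
    return [r["tokens"] for r in done]
```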
2. PagedAttention (KV Cache)¶
Problem: contiguous KV cache allocation -> 60-80% memory waste due to fragmentation.
Solution: page-based memory management, analogous to virtual memory in an OS (a toy block-table sketch follows the diagram and tables below).
graph LR
subgraph Traditional["Traditional KV Cache (60-80% waste)"]
direction LR
T1["Request 1: contiguous block"] --> TF["Fragmentation (waste)"]
T2["Request 2: contiguous block"] --> TF
end
subgraph Paged["PagedAttention (<4% waste)"]
direction LR
P1["R1"] --> P2["R1"] --> P3["R2"] --> P4["R1"] --> P5["R2"] --> P6["R2"] --> P7["R1"]
end
style T1 fill:#e8eaf6,stroke:#3f51b5
style T2 fill:#e8f5e9,stroke:#4caf50
style TF fill:#fce4ec,stroke:#c62828
style P1 fill:#e8eaf6,stroke:#3f51b5
style P2 fill:#e8eaf6,stroke:#3f51b5
style P3 fill:#e8f5e9,stroke:#4caf50
style P4 fill:#e8eaf6,stroke:#3f51b5
style P5 fill:#e8f5e9,stroke:#4caf50
style P6 fill:#e8f5e9,stroke:#4caf50
style P7 fill:#e8eaf6,stroke:#3f51b5
| Metric | Traditional | PagedAttention |
|---|---|---|
| Memory waste | 60-80% | <4% |
| Memory efficiency | 60-70% | 95%+ |
| Max batch size | Limited by longest seq | 2-4x larger |
| Throughput vs FasterTransformer | -- | 2-4x |
Advanced techniques (2025-2026):
| Technique | Description |
|---|---|
| LMCache | Hierarchical caching: GPU -> CPU -> network |
| Prefix caching | Reuse common prompt prefixes |
| KV compression | FP8/INT8 for 2-3x memory savings |
| Automatic prefix caching | vLLM built-in feature |
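A toy sketch of the block-table idea (hypothetical class and method names; 16-token blocks as in vLLM's default): each request's logical KV blocks map to arbitrary physical slots in a shared pool, so no request ever needs a contiguous region.

```python
# Toy PagedAttention bookkeeping: logical KV blocks -> non-contiguous physical
# blocks drawn from a shared free pool, like an OS page table.
BLOCK_TOKENS = 16  # vLLM's default block size

class BlockManager:
    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))
        self.tables = {}                        # request id -> [physical block ids]

    def append_token(self, req_id, pos):
        # A new physical block is needed only once every BLOCK_TOKENS tokens
        if pos % BLOCK_TOKENS == 0:
            self.tables.setdefault(req_id, []).append(self.free.pop())

    def lookup(self, req_id, pos):
        # Virtual -> physical translation for the attention kernel
        return self.tables[req_id][pos // BLOCK_TOKENS], pos % BLOCK_TOKENS

    def release(self, req_id):
        self.free.extend(self.tables.pop(req_id))  # blocks return to the pool
```

Internal fragmentation is bounded by one partially filled block per request, which is where the <4% waste figure comes from.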
3. Quantization¶
| Format | Memory (7B) | Quality Loss | Best For |
|---|---|---|---|
| FP16/BF16 | ~14 GB | Baseline | Maximum quality |
| FP8 | ~7 GB | <1% | Production inference (Hopper+) |
| INT8 | ~7 GB | ~2% | Balanced deployment |
| INT4 | ~3.5 GB | 8-10% | Edge/resource-constrained |
| NVFP4 | ~4 GB | <1% | Next-gen GPUs (Blackwell) |
FP8 (State of the Art): requires Hopper (H100) or Ada. 2.3x inference speedup vs FP16 on LLaMA-v2-7B.
NVFP4 (Blackwell): 3.5x memory reduction vs FP16, <1% accuracy degradation (LiveCodeBench, MMLU-PRO).
Rankings:
| Rank | Method | Best For |
|---|---|---|
| 1 | FP8 | Batch >= 16, optimal perf/accuracy |
| 2 | Q5_K_M / GPTQ-INT8 | Best trade-off for most domains |
| 3 | AWQ | Better than GPTQ weight-only |
| 4 | INT4 (GPTQ) | Use cautiously, significant loss |
Task-specific impact: coding and STEM tasks degrade the most. Models of 70B+ can hold up at 4-bit; smaller models need 8-bit.
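The memory column in the format table above is essentially params x bytes per parameter. A quick weights-only sanity check (ignores KV cache, activations, and quantization scales):

```python
# Weight memory for a 7B model at different precisions: N params x bits / 8.
def weight_memory_gb(params_billions, bits_per_param):
    return params_billions * bits_per_param / 8  # 1e9 params and GB cancel out

for fmt, bits in [("FP16", 16), ("FP8", 8), ("INT8", 8), ("INT4", 4)]:
    print(f"{fmt}: {weight_memory_gb(7, bits):.1f} GB")  # 14.0 / 7.0 / 7.0 / 3.5
```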
4. FlashAttention¶
Fused operations: a single kernel processes attention block by block and never materializes the full attention matrix.
| Version | GPU | TFLOPS (FP16) | Utilization | Key Features |
|---|---|---|---|---|
| FA-1 | A100 | ~300 | ~50% | Basic fusion |
| FA-2 | A100/H100 | ~400 | ~35% | Improved kernels |
| FA-3 | H100 | 840 | 85% | Warp specialization, FP8 |
| FA-4 | Blackwell | TBD | Higher | Scalability, reduced overhead |
FA-3 on H100: BF16 = 840 TFLOPS (85% utilization). FP8 = 1,300 TFLOPS (1.3 PFLOPS). 1.5-2x faster than FA-2.
Memory: O(n^2) -> O(n). Standard attention at 128K context is prohibitive; FlashAttention makes it manageable.
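To make the block-processing idea concrete, here is a toy single-head version in NumPy (illustrative only -- the real thing is a fused CUDA kernel with tiling along both axes): it streams over K/V in blocks while keeping a running max and softmax normalizer per query row, so the full n x n score matrix never exists.

```python
# Toy "flash-style" attention with an online softmax over K/V blocks.
import numpy as np

def flash_attention(Q, K, V, block_size=64):
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q)
    m = np.full(n, -np.inf)   # running row max of the scores
    l = np.zeros(n)           # running softmax denominator
    for start in range(0, n, block_size):
        Kb, Vb = K[start:start + block_size], V[start:start + block_size]
        S = (Q @ Kb.T) * scale                 # (n, block) scores for this block only
        m_new = np.maximum(m, S.max(axis=1))
        p = np.exp(S - m_new[:, None])         # unnormalized block probabilities
        corr = np.exp(m - m_new)               # rescale previously accumulated terms
        l = l * corr + p.sum(axis=1)
        out = out * corr[:, None] + p @ Vb
        m = m_new
    return out / l[:, None]

# Agrees with naive attention up to float error:
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 32)) for _ in range(3))
S = Q @ K.T / np.sqrt(32)
ref = (np.exp(S - S.max(1, keepdims=True)) /
       np.exp(S - S.max(1, keepdims=True)).sum(1, keepdims=True)) @ V
assert np.allclose(flash_attention(Q, K, V), ref, atol=1e-6)
```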
5. Speculative Decoding¶
| Step | Description |
|---|---|
| Draft | Smaller model predicts multiple tokens ahead |
| Verify | Target model verifies candidates in single forward pass |
| Accept/Reject | Accepted tokens used; rejected trigger regeneration |
| Output | Mathematically identical to standard decoding |
EAGLE-3 (NeurIPS 2025, SOTA):
- Architecture: 1-2 transformer layers as a "draft head" (2-5% of target size)
- Innovation: fusion of low-, mid-, and high-level semantic features
- Speedup: 2-6x
- Frameworks: vLLM v0.8.5+, SGLang via SpecForge
Adoption: Google Search (AI Overviews), vLLM (native EAGLE-1/3 support), SGLang.
Limiting factors: acceptance rate is typically 0.6-0.8; domain mismatch between draft and target reduces effectiveness.
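A toy greedy version of the draft-verify loop (hypothetical `draft_next` and `target_argmax` callables; real systems score all positions in one batched target pass and use rejection sampling when decoding with temperature):

```python
# Greedy speculative decoding: the draft proposes k tokens, the target keeps
# the longest prefix it agrees with, then emits one token of its own.
def speculative_decode_greedy(ctx, draft_next, target_argmax, k=4, max_new=64):
    out = list(ctx)
    while len(out) - len(ctx) < max_new:
        # 1. Draft: propose k tokens autoregressively with the small model
        proposal, tmp = [], list(out)
        for _ in range(k):
            tmp.append(draft_next(tmp))
            proposal.append(tmp[-1])
        # 2. Verify: in practice all k positions are checked in ONE forward pass
        accepted = 0
        for i, tok in enumerate(proposal):
            if target_argmax(out + proposal[:i]) != tok:
                break
            accepted += 1
        out.extend(proposal[:accepted])
        # 3. The verification pass yields the target's own next token "for free"
        out.append(target_argmax(out))
    return out
```

Every emitted token equals the target model's greedy choice at that position, which is why the output is identical to decoding with the target alone.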
6. MoE for Inference¶
| Metric | Improvement |
|---|---|
| Compute per inference | 90-95% reduction |
| Training efficiency | 2-7x faster |
| Power consumption | Up to 50% reduction |
Notable models:
| Model | Total Params | Active Params | Context |
|---|---|---|---|
| DeepSeek R1 | 671B | 37B | Standard |
| Gemini 1.5 | ~1T | 150-200B | 1M tokens |
| Kimi K2 | ~1T | 32B | Long context |
Research (2025): Super Experts (critical subset), MaxScore routing (constrained optimization), MegaScale-Infer (disaggregated expert parallelism).
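The compute savings come entirely from sparse routing: only k of E expert FFNs run per token. A minimal top-k router sketch in NumPy (toy shapes and names; production routers add load-balancing losses and capacity limits):

```python
# Toy top-k MoE layer: route each token to k of E experts, e.g. 8 of 256 in
# DeepSeek-style models, which is how 671B total -> 37B active works.
import numpy as np

def moe_layer(x, W_gate, experts, k=2):
    """x: (d,) token activation; W_gate: (d, E); experts: list of E callables."""
    logits = x @ W_gate                        # (E,) routing scores
    topk = np.argsort(logits)[-k:]             # indices of the k best experts
    w = np.exp(logits[topk] - logits[topk].max())
    w /= w.sum()                               # softmax over the selected experts
    # Only k expert FFNs execute; the remaining E - k are skipped entirely
    return sum(wi * experts[i](x) for wi, i in zip(w, topk))
```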
Inference Engines¶
Comparing the Leaders¶
| Engine | Best For | Throughput | TTFT | License |
|---|---|---|---|---|
| TensorRT-LLM | NVIDIA hardware, max perf | 180-220 req/s | 35-50ms | NVIDIA |
| vLLM | Flexibility, community | 100-150 req/s | 50-80ms | Apache 2.0 |
| SGLang | Structured output, multi-turn | 120-180 req/s | 40-70ms | Apache 2.0 |
| llama.cpp | CPU/edge | 10-50 req/s | 100-500ms | MIT |
| TGI | HuggingFace ecosystem | -- | -- | HF License |
vLLM¶
graph TD
VS["vLLM Server"] --> SCH["Scheduler<br/>continuous batching, preemption/swapping"]
SCH --> BM["Block Manager<br/>PagedAttention, copy-on-write prefix caching"]
BM --> W["Worker(s)<br/>FlashAttention, tensor parallelism"]
style VS fill:#e8eaf6,stroke:#3f51b5
style SCH fill:#e8f5e9,stroke:#4caf50
style BM fill:#fff3e0,stroke:#ef6c00
style W fill:#f3e5f5,stroke:#9c27b0
- Developer: UC Berkeley Sky Lab (2023)
- 50+ model architectures, 2M+ monthly downloads
- Key: PagedAttention (basic usage sketched below)
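A minimal offline-inference example with the vLLM Python API (the model name is a placeholder; `quantization="fp8"` assumes Hopper-class or newer hardware and a recent vLLM build):

```python
from vllm import LLM, SamplingParams

# Continuous batching and PagedAttention are on by default; the flags below
# are illustrative knobs rather than required settings.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model id
    quantization="fp8",               # on-the-fly FP8 weight quantization
    gpu_memory_utilization=0.90,      # VRAM fraction for weights + KV cache
    enable_prefix_caching=True,       # automatic prefix caching
)
params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain PagedAttention in one paragraph."], params)
print(outputs[0].outputs[0].text)
```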
SGLang¶
- Developer: Stanford/Berkeley (2024)
- Key: RadixAttention (tree-based KV cache sharing), best structured output (regex-constrained JSON/XML)
- Optimized for multi-turn, 5x throughput in multi-call workloads
- Most stable per-token latency (4-21ms)
TensorRT-LLM¶
graph TD
TRT["TensorRT-LLM"] --> RT["Runtime<br/>in-flight batching, KV cache management"]
RT --> KL["Kernel Library<br/>custom attention kernels, quantized matmul, fused ops"]
KL --> TE["TensorRT Engine<br/>graph optimization, kernel auto-tuning"]
style TRT fill:#fff3e0,stroke:#ef6c00
style RT fill:#e8eaf6,stroke:#3f51b5
style KL fill:#e8f5e9,stroke:#4caf50
style TE fill:#f3e5f5,stroke:#9c27b0
- Developer: NVIDIA (2023)
- Best on A100/H100/B200, highest throughput, lowest latency
Benchmarks by GPU Tier (Llama 70B)¶
Throughput (req/sec):
| Engine | A100 (80GB) | H100 (80GB) | B200 |
|---|---|---|---|
| TensorRT-LLM | 120-150 | 180-220 | 250-300 |
| SGLang | 80-120 | 120-180 | 180-240 |
| vLLM | 70-100 | 100-150 | 150-200 |
| llama.cpp | 10-20 | 15-30 | N/A |
TTFT (ms) by sequence length:
| Engine | 128 tokens | 1024 tokens | 8192 tokens |
|---|---|---|---|
| TensorRT-LLM | 35-50 | 60-100 | 200-400 |
| SGLang | 40-70 | 80-120 | 250-500 |
| vLLM | 50-80 | 100-150 | 300-600 |
| llama.cpp | 100-200 | 200-400 | 500-1500 |
Memory:
| Engine | 7B | 70B | 120B |
|---|---|---|---|
| vLLM | 16GB | 140GB | 240GB |
| SGLang | 14GB | 130GB | 220GB |
| TensorRT-LLM | 12GB | 120GB | 200GB |
Prefix Caching¶
| Engine | Method | Cache Hit Benefit |
|---|---|---|
| vLLM | Automatic | 5-10x faster |
| SGLang | RadixAttention | 10-20x faster |
| TensorRT-LLM | Manual | 3-5x faster |
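A toy illustration of the mechanism behind the table above (hypothetical `compute_kv` stand-in that returns per-token KV entries as a list): on a cache hit, only the new suffix needs prefill, which is where the speedups come from. RadixAttention generalizes this lookup to a radix tree over token sequences.

```python
# Toy prefix cache: KV state keyed by token prefix; hits skip most of prefill.
cache = {}  # tuple of token ids -> cached KV entries (a list, in this toy)

def prefill_with_cache(tokens, compute_kv):
    for cut in range(len(tokens), 0, -1):          # longest cached prefix wins
        prefix = tuple(tokens[:cut])
        if prefix in cache:
            kv = cache[prefix] + compute_kv(tokens[cut:])  # prefill suffix only
            break
    else:
        kv, cut = compute_kv(tokens), 0            # cold start: full prefill
    cache[tuple(tokens)] = kv
    return kv, len(tokens) - cut                   # number of tokens prefilled
```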
Decision Tree¶
graph TD
START["Выбор движка"] --> Q1{"Max throughput<br/>на NVIDIA?"}
Q1 -->|"Да"| TRT["TensorRT-LLM"]
Q1 -->|"Нет"| Q2{"Structured output<br/>(JSON/XML)?"}
Q2 -->|"Да"| SGL["SGLang"]
Q2 -->|"Нет"| Q3{"Multi-turn<br/>conversations?"}
Q3 -->|"Да"| SGL2["SGLang"]
Q3 -->|"Нет"| Q4{"CPU / edge?"}
Q4 -->|"Да"| LLAMA["llama.cpp"]
Q4 -->|"Нет"| VLLM["vLLM"]
style START fill:#e8eaf6,stroke:#3f51b5
style TRT fill:#fff3e0,stroke:#ef6c00
style SGL fill:#e8f5e9,stroke:#4caf50
style SGL2 fill:#e8f5e9,stroke:#4caf50
style LLAMA fill:#f3e5f5,stroke:#9c27b0
style VLLM fill:#e8f5e9,stroke:#4caf50
Interview Questions¶
Q: Name the five key LLM inference optimization techniques in priority order.
Red flag: "Quantization, pruning, knowledge distillation..." (confuses inference optimization with model compression)
Strong answer: "1) Continuous batching (2-24x throughput, dynamically add/remove requests -- biggest ROI at low complexity). 2) PagedAttention (KV cache paging, memory waste: 60-80% -> <4%, enables 2-4x larger batch sizes). 3) Quantization FP8 (2x memory + 2x speed, <1% quality loss on Hopper+). 4) FlashAttention (O(n^2) -> O(n) memory, FA-3: 840 TFLOPS, 85% utilization on H100). 5) Speculative decoding (draft-verify pattern, 2-6x latency with EAGLE-3, mathematically identical output). Order: throughput -> memory -> latency."
Q: How does PagedAttention work, and why does it matter?
Red flag: "It's an attention optimization that speeds up computation."
Strong answer: "An analog of virtual memory in an OS. The KV cache is split into fixed-size blocks (16 tokens) allocated non-contiguously on demand. Block tables map virtual -> physical addresses. This solves fragmentation: a traditional KV cache wastes 60-80% of memory due to contiguous allocation; PagedAttention wastes <4%. That allows 2-4x larger batch sizes on the same GPU. Built into vLLM, with copy-on-write for prefix caching. Without PagedAttention, Llama 70B on an A100 80GB serves ~8 concurrent requests; with it, up to 32."
Q: Speculative decoding -- how does it work, and when does it not help?
Red flag: "A small model generates the answer and a big one checks it."
Strong answer: "A draft model (2-5% of the target's parameters) generates N candidate tokens. The target model verifies all N in a single forward pass (batch verification). Accepted tokens are kept; a rejection triggers regeneration. The output is mathematically identical -- this is not an approximation. EAGLE-3 (NeurIPS 2025): 2-6x speedup, 1-2 transformer layers as the draft head. Google Search uses it for AI Overviews. Limitation: acceptance rate of 0.6-0.8, and under domain mismatch (draft not trained on your domain) it can be slower than direct generation."
Q: vLLM vs SGLang vs TensorRT-LLM -- how do you choose?
Red flag: "vLLM is the most popular, so that's what we use."
Strong answer: "It depends on the workload. TensorRT-LLM: max throughput (180-220 req/s), lowest latency (35-50ms TTFT), leads at concurrency < 16, but proprietary and harder to deploy. vLLM: best flexibility, 50+ model architectures, Apache 2.0, leads at high concurrency (128+), 2M+ monthly downloads. SGLang: best structured output (RadixAttention + regex-constrained JSON/XML), best multi-turn (5x throughput), and RadixAttention gives a 10-20x speedup on cache hits. Chat with JSON output -- SGLang. Batch processing -- TRT-LLM. Prototyping and flexibility -- vLLM."
Key Numbers¶
| Fact | Value |
|---|---|
| Continuous batching (vLLM vs TGI) | Up to 24x throughput |
| PagedAttention memory waste | 60-80% -> <4% |
| FP8 speedup vs FP16 | 2.3x |
| FlashAttention-3 BF16 | 840 TFLOPS, 85% utilization |
| FlashAttention-3 FP8 | 1.3 PFLOPS |
| EAGLE-3 speedup | 2-6x |
| Speculative acceptance rate | 0.6-0.8 |
| TensorRT-LLM throughput (H100) | 180-220 req/s |
| vLLM throughput (H100) | 100-150 req/s |
| SGLang RadixAttention cache hit | 10-20x faster |
| MoE compute reduction | 90-95% |
| TTFT latency target | <100ms |
| TBT latency alert threshold | >200ms |
Misconception: quantization is always safe
FP8 gives <1% quality loss on general benchmarks, but degradation on coding/STEM tasks can reach 3-5%. INT4 loses 8-10% on general benchmarks and up to 15% on math. Models under 13B parameters are especially sensitive: a 7B model at INT4 can lose 12-15% on HumanEval, while a 70B model at INT4 loses only 5-7%. Rule of thumb: use at least INT8 for models under 13B, and at least FP8 for coding tasks.
Misconception: speculative decoding always gives a 2-3x speedup
The 2-6x speedup (EAGLE-3) assumes an acceptance rate of 0.6-0.8. But acceptance rate depends on the domain match between the draft and target models. On out-of-distribution data (model trained on English, queries in Chinese) the acceptance rate drops to 0.3-0.4, and speculative decoding can become slower than direct generation because of verification overhead. Always measure the acceptance rate on your actual workload.
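A rough model of why low acceptance kills the speedup: with i.i.d. per-token acceptance probability a and draft length k, the expected tokens produced per target forward pass follow the geometric series from the original speculative sampling analysis (Leviathan et al., 2023). Net speedup is lower still once draft-model cost is subtracted.

```python
# Expected tokens per target pass: (1 - a^(k+1)) / (1 - a), for acceptance
# probability a and k drafted tokens; ignores the cost of running the draft.
def expected_tokens(a, k=4):
    return (1 - a ** (k + 1)) / (1 - a)

for a in (0.8, 0.6, 0.35):
    print(f"a={a}: {expected_tokens(a):.2f} tokens per target pass")
# a=0.8 -> 3.36, a=0.6 -> 2.31, a=0.35 -> 1.53 (barely worth the overhead)
```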
Misconception: more GPUs = linearly more throughput
Tensor parallelism across 2 GPUs yields ~1.8x throughput (not 2x) due to communication overhead. Across 4 GPUs, ~3.2x; across 8 GPUs, ~5-6x. Single-request latency can even increase because of all-reduce synchronization. For latency-sensitive workloads, one GPU with a smaller model (FP8 quantization) often beats four GPUs with the full one.
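The same numbers as per-GPU efficiency (illustrative figures taken from this paragraph, not a benchmark; 5.5x stands in for the 5-6x range):

```python
# Per-GPU efficiency implied above: total speedup divided by GPU count.
for gpus, speedup in [(1, 1.0), (2, 1.8), (4, 3.2), (8, 5.5)]:
    print(f"{gpus} GPU(s): {speedup:>3}x total, {speedup / gpus:.0%} per GPU")
# 100% -> 90% -> 80% -> ~69%: every added all-reduce participant costs efficiency
```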
Sources¶
- Zylos Research -- "AI Inference Optimization Techniques (2025-2026)"
- NVIDIA -- Speculative Decoding Blog, NVFP4 Blog
- Google Research -- Speculative Decoding retrospective
- arXiv -- EAGLE-3 (2503.01840), PagedAttention (2309.06180)
- vLLM Blog -- "Anatomy of vLLM"
- Clarifai -- "Comparing SGLang, vLLM, and TensorRT-LLM with GPT-OSS-120B"
- Kanerika -- "SGLang vs vLLM: Which is Better in 2026?"
- arXiv -- "A Survey of LLM Inference Systems" (2506.21901)
- SemiAnalysis -- "InferenceMAX: Open Source Inference Benchmarking"
- DigitalOcean -- "FlashAttention 4: Faster, Memory-Efficient Attention"
- Modal -- "High-performance LLM Inference Guide"
See Also¶
- vLLM & PagedAttention -- углубленный разбор PagedAttention и архитектуры vLLM
- Speculative Decoding -- детали draft-verify паттерна и EAGLE
- Serving Benchmarks -- методология бенчмаркинга: TTFT, TPOT, SLO
- Inference Engine Comparison -- vLLM vs SGLang vs TRT-LLM детальное сравнение
- Flash Attention 3 -- FA-3 kernel optimizations для H100
- LLM Production Deploy -- от бенчмарков к production: мониторинг, масштабирование