
Comparing LLM Inference Engines


Prerequisites: vLLM and PagedAttention, LLM Quantization

The choice of inference engine determines a 3-5x difference in the cost of serving the same traffic. On an H100, vLLM delivers 680 tok/s on Llama-70B, TensorRT-LLM delivers 850 tok/s (+25%), and SGLang on agentic workloads reaches 1800 tok/s (3x vLLM). At 10M tokens/day this is the difference between $3K and $9K per month in GPU spend. Yet 90% of production systems run vLLM not because it is the fastest, but because pip install vllm && vllm serve takes 5 minutes versus hours of compiling a TensorRT-LLM engine. The right choice is not "the fastest" but "the fastest for your workload at an acceptable level of operational complexity".
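To make the arithmetic concrete, here is a back-of-envelope sizing sketch. The GPU price and peak demand below are illustrative assumptions, not benchmarks; only the per-engine throughput figures come from the comparison above.

# Back-of-envelope serving-cost sketch. The GPU price and peak demand are
# illustrative assumptions; only the throughput figures come from the text.
import math

H100_HOURLY_USD = 2.50        # assumed on-demand price per GPU
PEAK_TOKENS_PER_SEC = 5_000   # assumed peak aggregate demand

def monthly_cost(tok_per_sec_per_gpu: float) -> tuple[int, float]:
    """GPUs provisioned for peak load, priced over a 30-day month."""
    gpus = math.ceil(PEAK_TOKENS_PER_SEC / tok_per_sec_per_gpu)
    return gpus, gpus * H100_HOURLY_USD * 24 * 30

for engine, tps in [("vLLM", 680), ("TensorRT-LLM", 850), ("SGLang (agents)", 1800)]:
    n, usd = monthly_cost(tps)
    print(f"{engine:16s} {n} GPU(s), ~${usd:,.0f}/month")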

PagedAttention, RadixAttention, CUDA Graph Fusion, GGUF, continuous batching, disaggregated inference, benchmarks, decision framework, code examples (2025-2026)


Key Concepts

Comparison Matrix (2026)

| Engine | Best For | Key Feature | Hardware | Throughput | Ease of Use |
|---|---|---|---|---|---|
| vLLM | Production serving | PagedAttention | NVIDIA/AMD/Ascend | High | Easy |
| SGLang | Agents, structured output | RadixAttention | NVIDIA | Highest (agents) | Medium |
| TensorRT-LLM | Lowest latency | CUDA optimization | NVIDIA only | Best single-req | Hard |
| llama.cpp | CPU, Edge, Apple | GGUF format | CPU/Apple/ARM | Medium | Easy |
| Ollama | Development, local | Wrapper over llama.cpp | CPU/GPU | Low | Easiest |
| LMDeploy | InternLM ecosystem | TurboMind kernels | NVIDIA | High | Medium |
| TGI | Enterprise | HuggingFace integration | NVIDIA/AMD | Medium | Easy |

1. vLLM (PagedAttention Engine)

Paper: Kwon et al., SOSP 2023. Stars: 35k+ (Feb 2026).

Core Innovation: PagedAttention

Traditional allocation:
Request 1: [Reserved 4096 tokens] -> Actual 500  -> 87% waste
Request 2: [Reserved 4096 tokens] -> Actual 1200 -> 70% waste

PagedAttention:
GPU Memory -> Fixed 16-token blocks
Request 1: 32 blocks (512 tokens) -> <4% waste
Request 2: 75 blocks (1200 tokens) -> <4% waste
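A minimal sketch of this accounting (block size 16, vLLM's default) shows where the <4% figure comes from:

# Minimal sketch of block-based allocation accounting (block size 16,
# as in vLLM's default PagedAttention configuration).
BLOCK = 16

def paged_waste(actual_tokens: int) -> float:
    """Fraction of allocated KV-cache slots left unused."""
    allocated = -(-actual_tokens // BLOCK) * BLOCK   # round up to block boundary
    return (allocated - actual_tokens) / allocated

def static_waste(actual_tokens: int, reserved: int = 4096) -> float:
    """Waste under traditional max-length pre-allocation."""
    return (reserved - actual_tokens) / reserved

for tokens in (500, 1200):
    print(f"{tokens:5d} tokens: static {static_waste(tokens):.0%}, "
          f"paged {paged_waste(tokens):.0%}")
# 500 tokens: static 88%, paged 2%
# 1200 tokens: static 71%, paged 0%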

Key Features

| Feature | Description |
|---|---|
| PagedAttention | Block-based KV cache allocation |
| Continuous Batching | Iteration-level scheduling |
| Prefix Caching | Automatic prefix sharing |
| OpenAI-compatible API | Drop-in replacement |
| Multi-GPU | Tensor + pipeline parallelism |
| Multi-modal | Vision-language models |
| Hardware | NVIDIA, AMD (ROCm), Huawei Ascend |

Performance (Llama-70B on H100)

| Metric | vLLM | HF Transformers |
|---|---|---|
| Throughput | 300 tok/s | 50 tok/s |
| Memory waste | <4% | 60-80% |
| Max batch size | 128+ | 8 |
| TTFT (128 ctx) | 0.3s | 1.8s |

Code Example

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B",
    tensor_parallel_size=4,
    gpu_memory_utilization=0.9,
    max_model_len=32768,
)

prompts = ["Hello"] * 100
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=256)
outputs = llm.generate(prompts, sampling_params)

# OpenAI-compatible server:
# vllm serve meta-llama/Llama-3.1-70B --port 8000

Production Setup

# For production, run the OpenAI-compatible server via the CLI
# (the flags mirror the Python arguments above):
vllm serve meta-llama/Llama-3.1-70B \
    --tensor-parallel-size 4 \
    --host 0.0.0.0 \
    --port 8000 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.9 \
    --enable-prefix-caching
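Once the server is up, any OpenAI SDK client works against it unchanged; the sketch below assumes the server from the previous step is listening on localhost:8000.

# Querying the vLLM server above with the standard OpenAI SDK.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256,
    temperature=0.7,
)
print(response.choices[0].message.content)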

Strengths: easiest deployment (pip install, one command), best community, widest hardware support, extensive ecosystem (LangChain, LlamaIndex).

Limitations: not the fastest single-request latency (TensorRT-LLM is faster), SGLang outperforms it on agent/structured workloads, less aggressive CUDA graph optimization.


2. SGLang (Structured Generation Language)

Paper: Zheng et al., arXiv:2312.07104. Developer: LMSYS (UC Berkeley).

Core Innovation: RadixAttention

Radix Tree for KV Cache:

                    [root]
                   /      \
              "System"    "User"
              /    \         \
         "helpful" "smart"   "query1"
            |         |         |
         [KV1]     [KV2]     [KV3]

Benefits:
- Automatic prefix sharing across requests
- Token-level granularity (not block-level like PagedAttention)
- Perfect for multi-turn agents
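For illustration, here is a toy token-level prefix cache. It uses a plain trie rather than a compressed radix tree and omits eviction and block management, so it shows the idea rather than SGLang's implementation:

# Toy token-level prefix cache: finds the longest cached prefix of a new
# request so its KV entries can be reused. SGLang's radix tree adds path
# compression, LRU eviction, and block management on top of this idea.

class PrefixNode:
    def __init__(self):
        self.children: dict[int, "PrefixNode"] = {}  # next token -> node
        self.has_kv = False                           # KV cached up to here

class PrefixCache:
    def __init__(self):
        self.root = PrefixNode()

    def insert(self, tokens: list[int]) -> None:
        """Record that the KV cache for this token sequence exists."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, PrefixNode())
            node.has_kv = True

    def longest_cached_prefix(self, tokens: list[int]) -> int:
        """Number of leading tokens whose KV can be reused."""
        node, matched = self.root, 0
        for t in tokens:
            if t not in node.children or not node.children[t].has_kv:
                break
            node = node.children[t]
            matched += 1
        return matched

cache = PrefixCache()
cache.insert([1, 2, 3, 4, 5])                      # first request, fully computed
print(cache.longest_cached_prefix([1, 2, 3, 9]))   # -> 3 tokens reused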

Key Features

| Feature | Description |
|---|---|
| RadixAttention | Radix tree for prefix sharing |
| Structured Output | Native JSON/regex constrained decoding |
| Disaggregated Inference | Prefill-decode separation |
| Speculative Decoding | Built-in speculation |
| Function Calling | Native tool use |

Performance

H800 Benchmarks (July 2025):

| Metric | SGLang | vLLM | Speedup |
|---|---|---|---|
| Agent throughput | 1800 tok/s | 600 tok/s | 3x |
| Structured output | 1200 tok/s | 400 tok/s | 3x |
| Multi-turn (5 turns) | 900 tok/s | 300 tok/s | 3x |
| Cost efficiency | $0.12/1M tok | $0.20/1M tok | 40% cheaper |

Prefix Reuse Latency (Llama-70B, 4x A100):

| Scenario | vLLM | SGLang | Speedup |
|---|---|---|---|
| Unique prompts | 100ms | 90ms | 1.1x |
| Shared prefix (10 req) | 1000ms | 200ms | 5x |
| Multi-turn conversation | 500ms | 150ms | 3x |

Code Example

import sglang as sgl

@sgl.function
def extract_info(s, text):
    s += "Extract from: " + text + "\n\n"
    s += sgl.gen(
        "json_output",
        max_tokens=512,
        regex=r'\{\s*"name":\s*"[^"]+",\s*"age":\s*\d+,\s*"email":\s*"[^"]+"\s*\}'
    )

runtime = sgl.Runtime(model_path="meta-llama/Llama-3.1-70B", tp_size=4)
sgl.set_default_backend(runtime)

# Repeated calls share the "Extract from: " prefix -- RadixAttention reuses its KV cache
state = extract_info.run(text="John Smith is 30. Email: john@example.com")
print(state["json_output"])
# {"name": "John Smith", "age": 30, "email": "john@example.com"}

Strengths: best for agents (multi-turn, shared context), native structured output (JSON schema, regex), disaggregated inference, 40% cost savings on agent workloads, built-in speculative decoding.

Limitations: NVIDIA only (no AMD/Ascend), no pipeline parallelism (planned), smaller community than vLLM.


3. TensorRT-LLM (NVIDIA Optimized)

Developer: NVIDIA. Purpose: Maximum performance on NVIDIA GPUs.

Core Innovation: CUDA Graph Fusion

Standard PyTorch:
Layer 1 -> Kernel Launch -> Layer 2 -> Kernel Launch -> ...
Overhead: 5-10us per kernel

TensorRT-LLM:
Fused Kernel (Layer 1 + Layer 2 + ...) -> Single Launch
Overhead: 5-10us total
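For intuition, the same mechanism is exposed in PyTorch as CUDA graph capture and replay. The sketch below is illustrative PyTorch, not TensorRT-LLM code; the model and shapes are arbitrary, and it requires a CUDA GPU:

# PyTorch sketch of CUDA graph capture/replay -- the mechanism TensorRT-LLM
# builds on to amortize kernel-launch overhead. Requires a CUDA GPU.
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
).cuda().eval()
static_input = torch.randn(8, 1024, device="cuda")

# Warm-up on a side stream (required before capture).
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture the whole forward pass into a single replayable graph.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_output = model(static_input)

# Replay: one launch for the entire captured sequence of kernels.
static_input.copy_(torch.randn(8, 1024, device="cuda"))
graph.replay()
print(static_output.shape)  # torch.Size([8, 1024])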

Key Features

| Feature | Description |
|---|---|
| CUDA Graphs | Kernel fusion, reduced launch overhead |
| INT4/FP8 | Native quantization support |
| In-flight Batching | NVIDIA-optimized scheduling |
| Multi-Query Attention | Optimized attention kernels |
| B200 Optimization | Best on Blackwell |

Performance (Llama-70B on B200, Dec 2025)

| Metric | TensorRT-LLM | vLLM | Advantage |
|---|---|---|---|
| Single-req latency | 28ms | 45ms | 1.6x |
| Throughput (batch=1) | 520 tok/s | 300 tok/s | 1.7x |
| Throughput (batch=64) | 850 tok/s | 680 tok/s | 1.25x |
| Memory efficiency | 85% | 75% | +10% |

Code Example

# Step 1: Convert HuggingFace to TRT format
python convert_checkpoint.py \
    --model_dir ./llama-70b-hf \
    --output_dir ./llama-70b-trt \
    --tp_size 4

# Step 2: Build TensorRT engine
trtllm-build \
    --checkpoint_dir ./llama-70b-trt \
    --output_dir ./engine \
    --max_batch_size 128 \
    --max_input_len 4096 \
    --max_seq_len 8192 \
    --gemm_plugin auto

# Step 3: Run inference
python run.py --engine_dir ./engine --tokenizer_dir ./llama-70b-hf

Strengths: best single-request latency, best on H100/B200, native INT4/FP8, NVIDIA's recommended solution.

Limitations: NVIDIA only, complex setup (engine compilation takes hours), fewer model architectures, harder to customize.

Setup Complexity

| Engine | Setup Time | Maintenance |
|---|---|---|
| vLLM | Minutes | Easy |
| SGLang | Minutes | Easy |
| TensorRT-LLM | Hours | Medium |
| llama.cpp | Minutes | Easy |

4. llama.cpp (CPU + Edge)

Developer: Georgi Gerganov. Stars: 70k+ (Feb 2026).

Core Innovation: GGUF Format

GGUF File:
+-- Header (magic, version)
+-- Metadata (architecture, params)
+-- Tokenizer (vocabulary)
+-- Tensors (quantized weights)

Key Features

| Feature | Description |
|---|---|
| CPU inference | Runs on laptop CPU |
| Apple Metal | M-series GPU acceleration |
| GGUF format | Efficient quantized storage |
| Cross-platform | Linux, macOS, Windows, Android, iOS |
| Ecosystem | Ollama, LM Studio built on it |

Quantization Trade-offs (Llama-7B, MacBook Pro M3 Max)

| Quantization | Memory | Speed | Quality Loss |
|---|---|---|---|
| FP16 | 14 GB | 18 tok/s | 0% |
| Q8_0 | 7.5 GB | 28 tok/s | ~1% |
| Q5_K_M | 5.5 GB | 38 tok/s | ~2% |
| Q4_K_M | 4.5 GB | 45 tok/s | ~3-5% |
| Q3_K_M | 3 GB | ~55 tok/s | ~8% |
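The memory column follows directly from bits-per-weight. A rough estimator is sketched below; the bpw values are assumed approximate averages (K-quants mix precisions across tensors), and the estimate covers weights only:

# Rough GGUF size estimate from bits-per-weight. The bpw values are
# approximate averages (assumption); K-quants mix precisions per tensor.
APPROX_BPW = {"FP16": 16.0, "Q8_0": 8.5, "Q5_K_M": 5.7, "Q4_K_M": 4.85, "Q3_K_M": 3.9}

def gguf_size_gb(params_billions: float, quant: str) -> float:
    """Weights only; runtime adds KV cache and activation memory on top."""
    return params_billions * APPROX_BPW[quant] / 8

for quant in APPROX_BPW:
    print(f"{quant:7s} ~{gguf_size_gb(7, quant):.1f} GB for a 7B model")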

Code Example

from llama_cpp import Llama

llm = Llama(
    model_path="./llama-3.1-8b-q5_k_m.gguf",
    n_ctx=8192,
    n_gpu_layers=35,  # Offload to GPU
    verbose=False,
)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256,
)

Strengths: no GPU required, best on Apple Silicon, edge/mobile deployment, single binary, Ollama/LM Studio ecosystem.

Limitations: 10-20x slower than vLLM on H100, limited batching (not for high concurrency), manual model conversion, single-node only.


5. LMDeploy and TGI

LMDeploy (Shanghai AI Laboratory)

| Feature | Description |
|---|---|
| TurboMind | Custom CUDA kernels |
| FasterTransformer | Built on FasterTransformer optimizations |
| INT4/INT8 | SmoothQuant quantization |

Performance (InternLM-20B on A100): 680 tok/s (vs vLLM 620 tok/s).

TGI (Text Generation Inference)

| Feature | Description |
|---|---|
| HF integration | Direct Hub model loading |
| Flash Attention | Optimized attention |
| Docker | Containerized deployment |

Best for: teams already using HuggingFace ecosystem, enterprise deployments.


6. Benchmarks

Throughput (Llama-70B, H100, batch=64)

| Engine | Throughput (tok/s) | Relative |
|---|---|---|
| TensorRT-LLM | 850 | 100% |
| SGLang | 820 | 96% |
| vLLM | 680 | 80% |
| LMDeploy | 650 | 76% |
| TGI | 580 | 68% |

Throughput by Batch Size (Llama-7B, A100)

| Engine | Batch=1 | Batch=32 | Batch=128 |
|---|---|---|---|
| HF Transformers | 50 tok/s | 500 tok/s | OOM |
| vLLM | 100 tok/s | 2,500 tok/s | 5,000 tok/s |
| SGLang | 110 tok/s | 2,400 tok/s | 4,800 tok/s |
| TensorRT-LLM | 120 tok/s | 3,000 tok/s | 5,500 tok/s |

Latency (Llama-70B, single request)

| Engine | TTFT (ms) | TBT (ms) | Total, 100 tok (ms) |
|---|---|---|---|
| TensorRT-LLM | 28 | 15 | 1,800 |
| SGLang | 40 | 20 | 2,150 |
| vLLM | 35 | 25 | 2,700 |
| TGI | 48 | ~30 | ~3,500 |

Agent Workloads (multi-turn, shared context)

| Engine | Throughput | Relative to vLLM |
|---|---|---|
| SGLang | 1800 tok/s | 3x |
| vLLM | 600 tok/s | 1x |
| TensorRT-LLM | 520 tok/s | 0.87x |

Structured Output (JSON)

| Engine | Throughput | Constraint Support |
|---|---|---|
| SGLang | 1200 tok/s | Native |
| vLLM + Outlines | 400 tok/s | Plugin |
| TensorRT-LLM | 350 tok/s | Limited |

Memory Efficiency (Llama-70B)

| Engine | Memory | Max Batch | Max Seq Len |
|---|---|---|---|
| HF Transformers | 140 GB | 4 | 2K |
| vLLM | 80 GB | 32 | 8K |
| TensorRT-LLM | 70 GB | 40 | 8K |

7. Architecture Comparison

KV Cache Management

| Engine | Approach | Memory Waste |
|---|---|---|
| vLLM | PagedAttention (blocks) | <4% |
| SGLang | RadixAttention (tree) | <1% |
| TensorRT-LLM | Pre-allocated | 10-20% |
| llama.cpp | Static | 15-30% |

Batching Strategy

| Engine | Strategy | Benefit |
|---|---|---|
| vLLM | Continuous batching | No waiting |
| SGLang | Continuous + radix | Prefix sharing |
| TensorRT-LLM | In-flight batching | NVIDIA optimized |
| llama.cpp | Simple batching | CPU-friendly |
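As a toy illustration of why iteration-level scheduling avoids head-of-line waiting, the simulation below admits queued requests into freed batch slots at every decode step (hypothetical requests and output lengths):

# Toy continuous-batching loop: scheduling happens per decode iteration,
# not per batch, so a finished sequence frees its slot immediately.
from collections import deque
import random

MAX_BATCH = 4
waiting = deque(f"req{i}" for i in range(8))
running: dict[str, int] = {}          # request -> tokens still to generate

step = 0
while waiting or running:
    # Admit new requests into free slots (iteration-level scheduling).
    while waiting and len(running) < MAX_BATCH:
        running[waiting.popleft()] = random.randint(2, 6)
    # One decode iteration: every running sequence emits one token.
    for req in list(running):
        running[req] -= 1
        if running[req] == 0:          # done -> slot is reused next step
            del running[req]
    step += 1
print(f"finished in {step} decode steps")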

Multi-GPU Support

| Engine | TP | PP | Notes |
|---|---|---|---|
| vLLM | Yes | Yes | Best multi-node |
| SGLang | Yes | No | Planned |
| TensorRT-LLM | Yes | Yes | NVIDIA optimized |
| llama.cpp | No | No | Single-node |

Feature Matrix

| Feature | vLLM | SGLang | TensorRT-LLM | llama.cpp |
|---|---|---|---|---|
| GPU support | All | NVIDIA | NVIDIA | All |
| CPU support | No | No | No | Yes |
| Quantization | Yes | Yes | Yes | Yes |
| Streaming | Yes | Yes | Yes | Yes |
| OpenAI API | Yes | Yes | Yes | Yes |
| Prefix caching | Yes | Best | Yes | No |
| Structured output | Plugin | Native | Limited | No |
| Multi-modal | Yes | Yes | Yes | Partial |

8. Decision Framework

By Hardware

graph TD
    HW{"Hardware?"} -->|"NVIDIA H100/B200"| N1{"Priority?"}
    HW -->|"NVIDIA Consumer<br/>(RTX 4090)"| N2["vLLM<br/>(llama.cpp if VRAM limited)"]
    HW -->|"AMD MI300X"| AMD["vLLM (ROCm)"]
    HW -->|"Apple M-series"| APPLE["llama.cpp (Metal)"]
    HW -->|"CPU only"| CPU["llama.cpp"]
    HW -->|"Local dev"| DEV["Ollama"]

    N1 -->|"Max latency"| TRT["TensorRT-LLM"]
    N1 -->|"Agents"| SGL["SGLang"]
    N1 -->|"General"| VLLM["vLLM"]
    N1 -->|"HF ecosystem"| TGI["TGI"]

    style TRT fill:#e8eaf6,stroke:#3f51b5
    style SGL fill:#fff3e0,stroke:#ff9800
    style VLLM fill:#e8f5e9,stroke:#4caf50

By Use Case

| Use Case | Recommended | Why |
|---|---|---|
| Chatbot (high QPS) | vLLM | Best throughput/latency balance |
| Multi-turn agents | SGLang | RadixAttention, prefix sharing, 3x throughput |
| JSON output | SGLang | Native constrained decoding |
| Lowest latency | TensorRT-LLM | CUDA optimization, 1.6x faster |
| Edge deployment | llama.cpp | No GPU required |
| Local development | Ollama | One-command setup |
| Production scale | TensorRT-LLM or vLLM | Battle-tested |
| Cost optimization | SGLang | 40% cheaper for agents |

By Priority

| Priority | Ranking |
|---|---|
| Throughput | TensorRT-LLM > SGLang > vLLM > LMDeploy > TGI |
| Latency | TensorRT-LLM > vLLM > SGLang > TGI |
| Agent workloads | SGLang >> vLLM > TensorRT-LLM |
| Ease of use | Ollama > vLLM > llama.cpp > SGLang > TensorRT-LLM |
| Hardware flexibility | llama.cpp > vLLM > TensorRT-LLM |
| Community | vLLM > llama.cpp > SGLang > TensorRT-LLM |

9. Formulas

Memory Savings (PagedAttention)

\[\text{Waste}_{static} = \frac{T_{max} - T_{actual}}{T_{max}} \approx 60\text{--}80\%\]
\[\text{Waste}_{paged} = \frac{B \lceil T/B \rceil - T}{B \lceil T/B \rceil} < \frac{B}{T}\]

Where \(B\) = block size (typically 16 tokens) and \(T\) = actual sequence length. Example: \(T = 500\), \(B = 16\) allocates 512 slots, wasting \(12/512 \approx 2.3\%\) -- hence the <4% figure.

Throughput Calculation

\[\text{Throughput} = \frac{N \times L}{T}\]
  • \(N\) = completed sequences, \(L\) = avg output length (tokens), \(T\) = total wall-clock time

Latency Components

\[\text{TTFT} = T_{queue} + T_{prefill}\]
\[\text{TPOT} = \frac{T_{decode}}{\text{output tokens}}\]
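Plugging the benchmark numbers into these formulas gives a quick sanity check; the results land somewhat below the end-to-end totals in the latency table, which also include queueing and scheduling overhead:

# Worked example of the latency decomposition (numbers from the tables above).
def total_latency_ms(ttft_ms: float, tpot_ms: float, out_tokens: int) -> float:
    # First token arrives at TTFT; each later token adds one TPOT interval.
    return ttft_ms + tpot_ms * (out_tokens - 1)

print(total_latency_ms(28, 15, 100))  # TensorRT-LLM: ~1513 ms
print(total_latency_ms(35, 25, 100))  # vLLM: ~2510 ms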

For Interviews

Q: "Сравните vLLM, SGLang и TensorRT-LLM."

vLLM: PagedAttention (block-based KV cache, <4% waste), continuous batching, easiest deployment, NVIDIA/AMD/Ascend -- best general-purpose. SGLang: RadixAttention (tree-based prefix sharing, <1% waste), 3x throughput on agent workloads, native JSON constraints -- best for agents. TensorRT-LLM: CUDA graph fusion, lowest single-request latency (1.6x faster), NVIDIA only, complex setup (hours) -- best raw performance.

Q: "Что такое PagedAttention и RadixAttention?"

PagedAttention (vLLM): the KV cache is split into fixed-size blocks (16 tokens), allocated on demand. Memory waste drops from 60-80% to <4%. Analogous to virtual memory in an OS. RadixAttention (SGLang): the KV cache is stored in a radix tree. Automatic prefix detection and sharing across requests. Token-level granularity (not block-level). 5x speedup with shared prefixes, <1% waste.

Q: "Когда выбрать llama.cpp?"

CPU-only / edge / Apple M-series. GGUF format with quantization (Q4_K_M: 4.5 GB for 7B, ~3-5% quality loss). Runs on mobile (Android, iOS), no GPU required. Ecosystem: Ollama, LM Studio. Limitations: 10-20x slower than vLLM on H100, no distributed inference, limited batching.

Q: "Спроектируйте систему инференса для 10K concurrent users."

(1) vLLM or TensorRT-LLM on an H100 cluster with tensor parallelism. (2) Load balancer -> multiple vLLM instances. (3) Continuous batching for 70-90% GPU utilization. (4) Prefix caching (RadixAttention if multi-turn). (5) FP8 quantization for 2x memory savings. (6) Autoscaling on queue depth. Target: TTFT <100ms, throughput 850+ tok/s per node.

Q: "Disaggregated inference -- что это?"

Separating the prefill and decode phases onto different GPUs. Prefill is compute-bound (parallel processing of input tokens). Decode is memory-bound (sequential token generation). SGLang supports disaggregated inference: prefill workers run separately from decode workers. The result: better utilization of both GPU types, with each phase optimized independently.

Key Numbers

| Fact | Value |
|---|---|
| vLLM PagedAttention waste | <4% (vs 60-80% traditional) |
| SGLang RadixAttention waste | <1% |
| SGLang agent speedup vs vLLM | 3x |
| SGLang cost savings (agents) | 40% cheaper |
| SGLang prefix reuse speedup | 5x (10 shared requests) |
| TensorRT-LLM single-req latency advantage | 1.6x faster |
| llama.cpp Q4_K_M (7B) | 4.5 GB, 45 tok/s, ~3-5% loss |
| vLLM vs HF Transformers | 6x throughput (Llama-70B) |
| TensorRT-LLM H100 throughput (batch=64) | 850 tok/s |
| vLLM GitHub stars | 35k+ |
| llama.cpp GitHub stars | 70k+ |
| TensorRT-LLM setup time | Hours (engine compilation) |

Misconception: TensorRT-LLM is always faster than vLLM

TensorRT-LLM wins by 1.6x on single-request latency (28ms vs 45ms), but at batch=64 the gap narrows to 1.25x (850 vs 680 tok/s). On agent workloads with multi-turn context, SGLang outpaces TensorRT-LLM by ~3.5x thanks to RadixAttention. "Fastest" depends on the workload: batched serving (TensorRT-LLM), agents (SGLang), general purpose (vLLM).

Misconception: vLLM and SGLang are interchangeable

With unique prompts (zero prefix sharing) the difference is minimal: 100ms vs 90ms. But with 10 requests sharing a prefix, SGLang wins by 5x (200ms vs 1000ms). If your workload is a chatbot with a system prompt plus multi-turn context, SGLang saves ~40% of compute. For single-shot APIs without shared context, vLLM is simpler and has better AMD/Ascend support.

Misconception: Q4 quantization in llama.cpp is a free speedup

Q4_K_M yields 45 tok/s instead of 18 tok/s (FP16) on a MacBook M3, but the quality loss is 3-5% on general benchmarks and up to 15% on math/reasoning tasks. Acceptable for a production chatbot; for legal or medical AI, use Q8_0 (~1% loss) or FP16. Always measure quality on YOUR domain, not on MMLU.


Interview Questions

Q: You need to deploy Llama-70B for 10K concurrent users. Which engine do you choose, and why?

❌ Red flag: "vLLM, потому что он самый популярный"

✅ Strong answer: "Зависит от workload profile. Если chatbot с multi-turn (shared system prompt, 3-5 turns) -- SGLang: RadixAttention дает 5x speedup на shared prefix, 3x throughput на agent workloads. Если single-shot API (translation, classification) -- vLLM: проще deployment, лучше community, NVIDIA+AMD support. Если latency-critical (real-time trading, voice) -- TensorRT-LLM: 28ms TTFT vs 45ms vLLM. Конкретно для 10K concurrent: vLLM на 4xH100 с tensor parallelism, continuous batching, prefix caching. HPA по queue depth, min 4 / max 20 replicas. Target: TTFT <200ms p95, 680+ tok/s per node."

Q: What is PagedAttention, and why does it matter for serving?

❌ Red flag: "Это способ ускорить attention computation"

✅ Strong answer: "PagedAttention решает проблему memory fragmentation в KV cache. Традиционный подход: pre-allocate max_seq_len (4096 tokens) на каждый запрос, реальное использование -- 500 tokens, waste 87%. PagedAttention разбивает KV cache на fixed-size blocks (16 tokens), allocated on-demand -- аналог virtual memory в OS. Waste падает с 60-80% до <4%. Практический эффект: на тех же 80GB H100 max batch size растет с 8 до 128+, throughput 6x vs HuggingFace Transformers (300 vs 50 tok/s на Llama-70B). RadixAttention (SGLang) развивает идею: tree-based structure с token-level granularity, waste <1%, автоматический prefix sharing."

Q: When is llama.cpp justified instead of vLLM?

❌ Red flag: "Когда нет GPU"

✅ Strong answer: "Три сценария: (1) Edge/mobile deployment -- llama.cpp единственный движок работающий на Android/iOS, single binary без зависимостей. (2) Apple Silicon development -- Metal GPU acceleration, Q5_K_M на M3 Max дает 38 tok/s для 7B модели при 5.5GB RAM, достаточно для local dev. (3) CPU-only inference для low-traffic внутренних инструментов (<100 req/day) где GPU стоимость не оправдана. Но НЕ для production high-concurrency: llama.cpp 10-20x медленнее vLLM на H100, нет distributed inference, limited batching. Ecosystem (Ollama, LM Studio) -- для разработки и прототипирования, не для serving."

Q: Disaggregated inference -- what is it, and when should you apply it?

❌ Red flag: "Это когда модель распределена по нескольким GPU"

✅ Strong answer: "Disaggregated inference -- разделение prefill и decode фаз на разные GPU pools. Prefill = compute-bound (параллельная обработка всех input tokens, нагружает FLOPS). Decode = memory-bound (последовательная генерация по одному токену, нагружает memory bandwidth). На одном GPU оптимизация под обе фазы -- компромисс. Disaggregated: prefill workers на compute-оптимальных GPU (H100 SXM), decode workers на memory-оптимальных (с высоким bandwidth). SGLang поддерживает нативно. Применять при >1000 QPS: экономия 20-30% GPU за счет лучшего utilization каждого типа worker. При малом трафике overhead координации съедает выигрыш."


Sources

  1. Kwon et al. -- "Efficient Memory Management for Large Language Model Serving with PagedAttention" (SOSP 2023)
  2. Zheng et al. -- "SGLang: Efficient Execution of Structured Language Model Programs" (arXiv:2312.07104)
  3. NVIDIA -- TensorRT-LLM In-flight Batching whitepaper
  4. Clarifai -- "Comparing SGLang, vLLM, and TensorRT-LLM with GPT-OSS-120B" (Aug 2025)
  5. Kanerika -- "SGLang vs vLLM: Which is Better in 2026?" (Sept 2025)
  6. LangCopilot -- "SGLang 3x Faster" (July 2025)
  7. SemiAnalysis -- "InferenceMAX" (Oct 2025)
  8. Medium (Ordina Data) -- "Choosing Your LLM Framework: Ollama, vLLM, SGLang, TensorRT-LLM"
  9. AI Merge -- "The AI Engineer's Guide to Inference Engines"
