
Comparing LLM Inference Engines


Prerequisites: vLLM and PagedAttention, LLM Quantization

The choice of inference engine determines a 3-5x difference in the cost of serving the same traffic. On an H100, vLLM delivers 680 tok/s on Llama-70B, TensorRT-LLM delivers 850 tok/s (+25%), and SGLang on agentic workloads reaches 1800 tok/s (3x vLLM). At 10M tokens/day this is the difference between $3K and $9K per month in GPU spend. Yet 90% of production systems run vLLM not because it is the fastest, but because pip install vllm && vllm serve takes 5 minutes versus hours of compiling a TensorRT-LLM engine. The right choice is not "the fastest" but "the fastest for your workload at an acceptable level of operational complexity".
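To make the arithmetic concrete, here is a back-of-envelope sizing sketch. The GPU price and peak demand below are illustrative assumptions, not benchmarks; only the per-engine throughput figures come from the comparison above.

# Back-of-envelope serving-cost sketch. The GPU price and peak demand are
# illustrative assumptions; only the throughput figures come from the text.
import math

H100_HOURLY_USD = 2.50        # assumed on-demand price per GPU
PEAK_TOKENS_PER_SEC = 5_000   # assumed peak aggregate demand

def monthly_cost(tok_per_sec_per_gpu: float) -> tuple[int, float]:
    """GPUs provisioned for peak load, priced over a 30-day month."""
    gpus = math.ceil(PEAK_TOKENS_PER_SEC / tok_per_sec_per_gpu)
    return gpus, gpus * H100_HOURLY_USD * 24 * 30

for engine, tps in [("vLLM", 680), ("TensorRT-LLM", 850), ("SGLang (agents)", 1800)]:
    n, usd = monthly_cost(tps)
    print(f"{engine:16s} {n} GPU(s), ~${usd:,.0f}/month")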

PagedAttention, RadixAttention, CUDA Graph Fusion, GGUF, continuous batching, disaggregated inference, benchmarks, decision framework, code examples (2025-2026)


Key Concepts

Comparison Matrix (2026)

| Engine | Best For | Key Feature | Hardware | Throughput | Ease of Use |
|---|---|---|---|---|---|
| vLLM | Production serving | PagedAttention | NVIDIA/AMD/Ascend | High | Easy |
| SGLang | Agents, structured output | RadixAttention | NVIDIA | Highest (agents) | Medium |
| TensorRT-LLM | Lowest latency | CUDA optimization | NVIDIA only | Best single-req | Hard |
| llama.cpp | CPU, Edge, Apple | GGUF format | CPU/Apple/ARM | Medium | Easy |
| Ollama | Development, local | Wrapper over llama.cpp | CPU/GPU | Low | Easiest |
| LMDeploy | InternLM ecosystem | TurboMind kernels | NVIDIA | High | Medium |
| TGI | Enterprise | HuggingFace integration | NVIDIA/AMD | Medium | Easy |

1. vLLM (PagedAttention Engine)

Paper: Kwon et al., SOSP 2023. Stars: 35k+ (Feb 2026).

Core Innovation: PagedAttention

Traditional allocation:
Request 1: [Reserved 4096 tokens] -> Actual 500  -> 87% waste
Request 2: [Reserved 4096 tokens] -> Actual 1200 -> 70% waste

PagedAttention:
GPU Memory -> Fixed 16-token blocks
Request 1: 32 blocks (512 tokens) -> <4% waste
Request 2: 75 blocks (1200 tokens) -> <4% waste
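A minimal sketch of this accounting (block size 16, vLLM's default) shows where the <4% figure comes from:

# Minimal sketch of block-based allocation accounting (block size 16,
# as in vLLM's default PagedAttention configuration).
BLOCK = 16

def paged_waste(actual_tokens: int) -> float:
    """Fraction of allocated KV-cache slots left unused."""
    allocated = -(-actual_tokens // BLOCK) * BLOCK   # round up to block boundary
    return (allocated - actual_tokens) / allocated

def static_waste(actual_tokens: int, reserved: int = 4096) -> float:
    """Waste under traditional max-length pre-allocation."""
    return (reserved - actual_tokens) / reserved

for tokens in (500, 1200):
    print(f"{tokens:5d} tokens: static {static_waste(tokens):.0%}, "
          f"paged {paged_waste(tokens):.0%}")
# 500 tokens: static 88%, paged 2%
# 1200 tokens: static 71%, paged 0%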

Key Features

| Feature | Description |
|---|---|
| PagedAttention | Block-based KV cache allocation |
| Continuous Batching | Iteration-level scheduling |
| Prefix Caching | Automatic prefix sharing |
| OpenAI-compatible API | Drop-in replacement |
| Multi-GPU | Tensor + pipeline parallelism |
| Multi-modal | Vision-language models |
| Hardware | NVIDIA, AMD (ROCm), Huawei Ascend |

Performance (Llama-70B on H100)

| Metric | vLLM | HF Transformers |
|---|---|---|
| Throughput | 300 tok/s | 50 tok/s |
| Memory waste | <4% | 60-80% |
| Max batch size | 128+ | 8 |
| TTFT (128 ctx) | 0.3s | 1.8s |

Code Example

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B",
    tensor_parallel_size=4,
    gpu_memory_utilization=0.9,
    max_model_len=32768,
)

prompts = ["Hello"] * 100
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=256)
outputs = llm.generate(prompts, sampling_params)

# OpenAI-compatible server:
# vllm serve meta-llama/Llama-3.1-70B --port 8000

Production Setup

# For production, run the OpenAI-compatible server via the CLI
# (the flags mirror the Python arguments above):
vllm serve meta-llama/Llama-3.1-70B \
    --tensor-parallel-size 4 \
    --host 0.0.0.0 \
    --port 8000 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.9 \
    --enable-prefix-caching
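Once the server is up, any OpenAI SDK client works against it unchanged; the sketch below assumes the server from the previous step is listening on localhost:8000.

# Querying the vLLM server above with the standard OpenAI SDK.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256,
    temperature=0.7,
)
print(response.choices[0].message.content)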

Strengths: easiest deployment (pip install, one command), best community, widest hardware support, extensive ecosystem (LangChain, LlamaIndex).

Limitations: not the fastest single-request latency (TensorRT-LLM is faster), SGLang outperforms it on agent/structured workloads, less aggressive CUDA graph optimization.


2. SGLang (Structured Generation Language)

Paper: Zheng et al., arXiv:2312.07104. Developer: LMSYS (UC Berkeley).

Core Innovation: RadixAttention

Radix Tree for KV Cache:

                    [root]
                   /      \
              "System"    "User"
              /    \         \
         "helpful" "smart"   "query1"
            |         |         |
         [KV1]     [KV2]     [KV3]

Benefits:
- Automatic prefix sharing across requests
- Token-level granularity (not block-level like PagedAttention)
- Perfect for multi-turn agents
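For illustration, here is a toy token-level prefix cache. It uses a plain trie rather than a compressed radix tree and omits eviction and block management, so it shows the idea rather than SGLang's implementation:

# Toy token-level prefix cache: finds the longest cached prefix of a new
# request so its KV entries can be reused. SGLang's radix tree adds path
# compression, LRU eviction, and block management on top of this idea.

class PrefixNode:
    def __init__(self):
        self.children: dict[int, "PrefixNode"] = {}  # next token -> node
        self.has_kv = False                           # KV cached up to here

class PrefixCache:
    def __init__(self):
        self.root = PrefixNode()

    def insert(self, tokens: list[int]) -> None:
        """Record that the KV cache for this token sequence exists."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, PrefixNode())
            node.has_kv = True

    def longest_cached_prefix(self, tokens: list[int]) -> int:
        """Number of leading tokens whose KV can be reused."""
        node, matched = self.root, 0
        for t in tokens:
            if t not in node.children or not node.children[t].has_kv:
                break
            node = node.children[t]
            matched += 1
        return matched

cache = PrefixCache()
cache.insert([1, 2, 3, 4, 5])                      # first request, fully computed
print(cache.longest_cached_prefix([1, 2, 3, 9]))   # -> 3 tokens reused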

Key Features

| Feature | Description |
|---|---|
| RadixAttention | Radix tree for prefix sharing |
| Structured Output | Native JSON/regex constrained decoding |
| Disaggregated Inference | Prefill-decode separation |
| Speculative Decoding | Built-in speculation |
| Function Calling | Native tool use |

Performance

H800 Benchmarks (July 2025):

| Metric | SGLang | vLLM | Speedup |
|---|---|---|---|
| Agent throughput | 1800 tok/s | 600 tok/s | 3x |
| Structured output | 1200 tok/s | 400 tok/s | 3x |
| Multi-turn (5 turns) | 900 tok/s | 300 tok/s | 3x |
| Cost efficiency | $0.12/1M tok | $0.20/1M tok | 40% cheaper |

Prefix Reuse Latency (Llama-70B, 4x A100):

| Scenario | vLLM | SGLang | Speedup |
|---|---|---|---|
| Unique prompts | 100ms | 90ms | 1.1x |
| Shared prefix (10 req) | 1000ms | 200ms | 5x |
| Multi-turn conversation | 500ms | 150ms | 3x |

Code Example

import sglang as sgl

@sgl.function
def extract_info(s, text):
    s += "Extract from: " + text + "\n\n"
    s += sgl.gen(
        "json_output",
        max_tokens=512,
        regex=r'\{\s*"name":\s*"[^"]+",\s*"age":\s*\d+,\s*"email":\s*"[^"]+"\s*\}'
    )

runtime = sgl.Runtime(model_path="meta-llama/Llama-3.1-70B", tp_size=4)
sgl.set_default_backend(runtime)

# Repeated calls share the "Extract from: " prefix -- RadixAttention reuses its KV cache
state = extract_info.run(text="John Smith is 30. Email: john@example.com")
print(state["json_output"])
# {"name": "John Smith", "age": 30, "email": "john@example.com"}

Strengths: best for agents (multi-turn, shared context), native structured output (JSON schema, regex), disaggregated inference, 40% cost savings on agent workloads, built-in speculative decoding.

Limitations: NVIDIA only (no AMD/Ascend), no pipeline parallelism (planned), smaller community than vLLM.


3. TensorRT-LLM (NVIDIA Optimized)

Developer: NVIDIA. Purpose: Maximum performance on NVIDIA GPUs.

Core Innovation: CUDA Graph Fusion

Standard PyTorch:
Layer 1 -> Kernel Launch -> Layer 2 -> Kernel Launch -> ...
Overhead: 5-10us per kernel

TensorRT-LLM:
Fused Kernel (Layer 1 + Layer 2 + ...) -> Single Launch
Overhead: 5-10us total
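For intuition, the same mechanism is exposed in PyTorch as CUDA graph capture and replay. The sketch below is illustrative PyTorch, not TensorRT-LLM code; the model and shapes are arbitrary, and it requires a CUDA GPU:

# PyTorch sketch of CUDA graph capture/replay -- the mechanism TensorRT-LLM
# builds on to amortize kernel-launch overhead. Requires a CUDA GPU.
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
).cuda().eval()
static_input = torch.randn(8, 1024, device="cuda")

# Warm-up on a side stream (required before capture).
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture the whole forward pass into a single replayable graph.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_output = model(static_input)

# Replay: one launch for the entire captured sequence of kernels.
static_input.copy_(torch.randn(8, 1024, device="cuda"))
graph.replay()
print(static_output.shape)  # torch.Size([8, 1024])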

Key Features

| Feature | Description |
|---|---|
| CUDA Graphs | Kernel fusion, reduced launch overhead |
| INT4/FP8 | Native quantization support |
| In-flight Batching | NVIDIA-optimized scheduling |
| Multi-Query Attention | Optimized attention kernels |
| B200 Optimization | Best on Blackwell |

Performance (Llama-70B on B200, Dec 2025)

| Metric | TensorRT-LLM | vLLM | Advantage |
|---|---|---|---|
| Single-req latency | 28ms | 45ms | 1.6x |
| Throughput (batch=1) | 520 tok/s | 300 tok/s | 1.7x |
| Throughput (batch=64) | 850 tok/s | 680 tok/s | 1.25x |
| Memory efficiency | 85% | 75% | +10% |

Code Example

# Step 1: Convert HuggingFace to TRT format
python convert_checkpoint.py \
    --model_dir ./llama-70b-hf \
    --output_dir ./llama-70b-trt \
    --tp_size 4

# Step 2: Build TensorRT engine
trtllm-build \
    --checkpoint_dir ./llama-70b-trt \
    --output_dir ./engine \
    --max_batch_size 128 \
    --max_input_len 4096 \
    --max_seq_len 8192 \
    --gemm_plugin auto

# Step 3: Run inference
python run.py --engine_dir ./engine --tokenizer_dir ./llama-70b-hf

Strengths: best single-request latency, best on H100/B200, native INT4/FP8, NVIDIA's recommended solution.

Limitations: NVIDIA only, complex setup (engine compilation takes hours), fewer model architectures, harder to customize.

Setup Complexity

| Engine | Setup Time | Maintenance |
|---|---|---|
| vLLM | Minutes | Easy |
| SGLang | Minutes | Easy |
| TensorRT-LLM | Hours | Medium |
| llama.cpp | Minutes | Easy |

4. llama.cpp (CPU + Edge)

Developer: Georgi Gerganov. Stars: 70k+ (Feb 2026).

Core Innovation: GGUF Format

GGUF File:
+-- Header (magic, version)
+-- Metadata (architecture, params)
+-- Tokenizer (vocabulary)
+-- Tensors (quantized weights)

Key Features

| Feature | Description |
|---|---|
| CPU inference | Runs on laptop CPU |
| Apple Metal | M-series GPU acceleration |
| GGUF format | Efficient quantized storage |
| Cross-platform | Linux, macOS, Windows, Android, iOS |
| Ecosystem | Ollama, LM Studio built on it |

Quantization Trade-offs (Llama-7B, MacBook Pro M3 Max)

| Quantization | Memory | Speed | Quality Loss |
|---|---|---|---|
| FP16 | 14 GB | 18 tok/s | 0% |
| Q8_0 | 7.5 GB | 28 tok/s | ~1% |
| Q5_K_M | 5.5 GB | 38 tok/s | ~2% |
| Q4_K_M | 4.5 GB | 45 tok/s | ~3-5% |
| Q3_K_M | 3 GB | ~55 tok/s | ~8% |
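The memory column follows directly from bits-per-weight. A rough estimator is sketched below; the bpw values are assumed approximate averages (K-quants mix precisions across tensors), and the estimate covers weights only:

# Rough GGUF size estimate from bits-per-weight. The bpw values are
# approximate averages (assumption); K-quants mix precisions per tensor.
APPROX_BPW = {"FP16": 16.0, "Q8_0": 8.5, "Q5_K_M": 5.7, "Q4_K_M": 4.85, "Q3_K_M": 3.9}

def gguf_size_gb(params_billions: float, quant: str) -> float:
    """Weights only; runtime adds KV cache and activation memory on top."""
    return params_billions * APPROX_BPW[quant] / 8

for quant in APPROX_BPW:
    print(f"{quant:7s} ~{gguf_size_gb(7, quant):.1f} GB for a 7B model")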

Code Example

from llama_cpp import Llama

llm = Llama(
    model_path="./llama-3.1-8b-q5_k_m.gguf",
    n_ctx=8192,
    n_gpu_layers=35,  # Offload to GPU
    verbose=False,
)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256,
)

Strengths: no GPU required, best on Apple Silicon, edge/mobile deployment, single binary, Ollama/LM Studio ecosystem.

Limitations: 10-20x slower than vLLM on H100, limited batching (not for high concurrency), manual model conversion, single-node only.


5. LMDeploy and TGI

LMDeploy (Shanghai AI Laboratory)

| Feature | Description |
|---|---|
| TurboMind | Custom CUDA kernels |
| FasterTransformer | Built on FasterTransformer optimizations |
| INT4/INT8 | SmoothQuant quantization |

Performance (InternLM-20B on A100): 680 tok/s (vs vLLM 620 tok/s).

TGI (Text Generation Inference)

| Feature | Description |
|---|---|
| HF integration | Direct Hub model loading |
| Flash Attention | Optimized attention |
| Docker | Containerized deployment |

Best for: teams already using HuggingFace ecosystem, enterprise deployments.


6. Benchmarks

Throughput (Llama-70B, H100, batch=64)

| Engine | Throughput (tok/s) | Relative |
|---|---|---|
| TensorRT-LLM | 850 | 100% |
| SGLang | 820 | 96% |
| vLLM | 680 | 80% |
| LMDeploy | 650 | 76% |
| TGI | 580 | 68% |

Throughput by Batch Size (Llama-7B, A100)

| Engine | Batch=1 | Batch=32 | Batch=128 |
|---|---|---|---|
| HF Transformers | 50 tok/s | 500 tok/s | OOM |
| vLLM | 100 tok/s | 2,500 tok/s | 5,000 tok/s |
| SGLang | 110 tok/s | 2,400 tok/s | 4,800 tok/s |
| TensorRT-LLM | 120 tok/s | 3,000 tok/s | 5,500 tok/s |

Latency (Llama-70B, single request)

| Engine | TTFT (ms) | TBT (ms) | Total, 100 tok (ms) |
|---|---|---|---|
| TensorRT-LLM | 28 | 15 | 1,800 |
| SGLang | 40 | 20 | 2,150 |
| vLLM | 35 | 25 | 2,700 |
| TGI | 48 | ~30 | ~3,500 |

Agent Workloads (multi-turn, shared context)

| Engine | Throughput | Relative to vLLM |
|---|---|---|
| SGLang | 1800 tok/s | 3x |
| vLLM | 600 tok/s | 1x |
| TensorRT-LLM | 520 tok/s | 0.87x |

Structured Output (JSON)

| Engine | Throughput | Constraint Support |
|---|---|---|
| SGLang | 1200 tok/s | Native |
| vLLM + Outlines | 400 tok/s | Plugin |
| TensorRT-LLM | 350 tok/s | Limited |

Memory Efficiency (Llama-70B)

| Engine | Memory | Max Batch | Max Seq Len |
|---|---|---|---|
| HF Transformers | 140 GB | 4 | 2K |
| vLLM | 80 GB | 32 | 8K |
| TensorRT-LLM | 70 GB | 40 | 8K |

7. Architecture Comparison

KV Cache Management

| Engine | Approach | Memory Waste |
|---|---|---|
| vLLM | PagedAttention (blocks) | <4% |
| SGLang | RadixAttention (tree) | <1% |
| TensorRT-LLM | Pre-allocated | 10-20% |
| llama.cpp | Static | 15-30% |

Batching Strategy

| Engine | Strategy | Benefit |
|---|---|---|
| vLLM | Continuous batching | No waiting |
| SGLang | Continuous + radix | Prefix sharing |
| TensorRT-LLM | In-flight batching | NVIDIA optimized |
| llama.cpp | Simple batching | CPU-friendly |
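As a toy illustration of why iteration-level scheduling avoids head-of-line waiting, the simulation below admits queued requests into freed batch slots at every decode step (hypothetical requests and output lengths):

# Toy continuous-batching loop: scheduling happens per decode iteration,
# not per batch, so a finished sequence frees its slot immediately.
from collections import deque
import random

MAX_BATCH = 4
waiting = deque(f"req{i}" for i in range(8))
running: dict[str, int] = {}          # request -> tokens still to generate

step = 0
while waiting or running:
    # Admit new requests into free slots (iteration-level scheduling).
    while waiting and len(running) < MAX_BATCH:
        running[waiting.popleft()] = random.randint(2, 6)
    # One decode iteration: every running sequence emits one token.
    for req in list(running):
        running[req] -= 1
        if running[req] == 0:          # done -> slot is reused next step
            del running[req]
    step += 1
print(f"finished in {step} decode steps")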

Multi-GPU Support

| Engine | TP | PP | Notes |
|---|---|---|---|
| vLLM | Yes | Yes | Best multi-node |
| SGLang | Yes | No | Planned |
| TensorRT-LLM | Yes | Yes | NVIDIA optimized |
| llama.cpp | No | No | Single-node |

Feature Matrix

| Feature | vLLM | SGLang | TensorRT-LLM | llama.cpp |
|---|---|---|---|---|
| GPU support | All | NVIDIA | NVIDIA | All |
| CPU support | No | No | No | Yes |
| Quantization | Yes | Yes | Yes | Yes |
| Streaming | Yes | Yes | Yes | Yes |
| OpenAI API | Yes | Yes | Yes | Yes |
| Prefix caching | Yes | Best | Yes | No |
| Structured output | Plugin | Native | Limited | No |
| Multi-modal | Yes | Yes | Yes | Partial |

8. Decision Framework

By Hardware

graph TD
    HW{"Hardware?"} -->|"NVIDIA H100/B200"| N1{"Priority?"}
    HW -->|"NVIDIA Consumer<br/>(RTX 4090)"| N2["vLLM<br/>(llama.cpp if VRAM limited)"]
    HW -->|"AMD MI300X"| AMD["vLLM (ROCm)"]
    HW -->|"Apple M-series"| APPLE["llama.cpp (Metal)"]
    HW -->|"CPU only"| CPU["llama.cpp"]
    HW -->|"Local dev"| DEV["Ollama"]

    N1 -->|"Max latency"| TRT["TensorRT-LLM"]
    N1 -->|"Agents"| SGL["SGLang"]
    N1 -->|"General"| VLLM["vLLM"]
    N1 -->|"HF ecosystem"| TGI["TGI"]

    style TRT fill:#e8eaf6,stroke:#3f51b5
    style SGL fill:#fff3e0,stroke:#ff9800
    style VLLM fill:#e8f5e9,stroke:#4caf50

By Use Case

| Use Case | Recommended | Why |
|---|---|---|
| Chatbot (high QPS) | vLLM | Best throughput/latency balance |
| Multi-turn agents | SGLang | RadixAttention, prefix sharing, 3x throughput |
| JSON output | SGLang | Native constrained decoding |
| Lowest latency | TensorRT-LLM | CUDA optimization, 1.6x faster |
| Edge deployment | llama.cpp | No GPU required |
| Local development | Ollama | One-command setup |
| Production scale | TensorRT-LLM or vLLM | Battle-tested |
| Cost optimization | SGLang | 40% cheaper for agents |

By Priority

| Priority | Ranking |
|---|---|
| Throughput | TensorRT-LLM > SGLang > vLLM > LMDeploy > TGI |
| Latency | TensorRT-LLM > vLLM > SGLang > TGI |
| Agent workloads | SGLang >> vLLM > TensorRT-LLM |
| Ease of use | Ollama > vLLM > llama.cpp > SGLang > TensorRT-LLM |
| Hardware flexibility | llama.cpp > vLLM > TensorRT-LLM |
| Community | vLLM > llama.cpp > SGLang > TensorRT-LLM |

9. Formulas

Memory Savings (PagedAttention)

\[\text{Waste}_{static} = \frac{T_{max} - T_{actual}}{T_{max}} \approx 60\text{--}80\%\]
\[\text{Waste}_{paged} = \frac{B \lceil T/B \rceil - T}{B \lceil T/B \rceil} < \frac{B}{T}\]

Where \(B\) = block size (typically 16 tokens) and \(T\) = actual sequence length. Example: \(T = 500\), \(B = 16\) allocates 512 slots, wasting \(12/512 \approx 2.3\%\) -- hence the <4% figure.

Throughput Calculation

\[\text{Throughput} = \frac{N \times L}{T}\]
  • \(N\) = completed sequences, \(L\) = avg output length (tokens), \(T\) = total wall-clock time

Latency Components

\[\text{TTFT} = T_{queue} + T_{prefill}\]
\[\text{TPOT} = \frac{T_{decode}}{\text{output tokens}}\]
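Plugging the benchmark numbers into these formulas gives a quick sanity check; the results land somewhat below the end-to-end totals in the latency table, which also include queueing and scheduling overhead:

# Worked example of the latency decomposition (numbers from the tables above).
def total_latency_ms(ttft_ms: float, tpot_ms: float, out_tokens: int) -> float:
    # First token arrives at TTFT; each later token adds one TPOT interval.
    return ttft_ms + tpot_ms * (out_tokens - 1)

print(total_latency_ms(28, 15, 100))  # TensorRT-LLM: ~1513 ms
print(total_latency_ms(35, 25, 100))  # vLLM: ~2510 ms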

For Interviews

Q: "Сравните vLLM, SGLang и TensorRT-LLM."

vLLM: PagedAttention (block-based KV cache, <4% waste), continuous batching, easiest deployment, NVIDIA/AMD/Ascend -- best general-purpose. SGLang: RadixAttention (tree-based prefix sharing, <1% waste), 3x throughput on agent workloads, native JSON constraints -- best for agents. TensorRT-LLM: CUDA graph fusion, lowest single-request latency (1.6x faster), NVIDIA only, complex setup (hours) -- best raw performance.

Q: "Что такое PagedAttention и RadixAttention?"

PagedAttention (vLLM): the KV cache is split into fixed-size blocks (16 tokens), allocated on demand. Memory waste drops from 60-80% to <4%. Analogous to virtual memory in an OS. RadixAttention (SGLang): the KV cache is stored in a radix tree. Automatic prefix detection and sharing across requests. Token-level granularity (not block-level). 5x speedup with shared prefixes, <1% waste.

Q: "Когда выбрать llama.cpp?"

CPU-only / edge / Apple M-series. GGUF format with quantization (Q4_K_M: 4.5 GB for 7B, ~3-5% quality loss). Runs on mobile (Android, iOS), no GPU required. Ecosystem: Ollama, LM Studio. Limitations: 10-20x slower than vLLM on H100, no distributed inference, limited batching.

Q: "Спроектируйте систему инференса для 10K concurrent users."

(1) vLLM or TensorRT-LLM on an H100 cluster with tensor parallelism. (2) Load balancer -> multiple vLLM instances. (3) Continuous batching for 70-90% GPU utilization. (4) Prefix caching (RadixAttention if multi-turn). (5) FP8 quantization for 2x memory savings. (6) Autoscaling on queue depth. Target: TTFT <100ms, throughput 850+ tok/s per node.

Q: "Disaggregated inference -- что это?"

Separating the prefill and decode phases onto different GPUs. Prefill is compute-bound (parallel processing of input tokens). Decode is memory-bound (sequential token generation). SGLang supports disaggregated inference: prefill workers run separately from decode workers. The result: better utilization of both GPU types, with each phase optimized independently.

Key Numbers

| Fact | Value |
|---|---|
| vLLM PagedAttention waste | <4% (vs 60-80% traditional) |
| SGLang RadixAttention waste | <1% |
| SGLang agent speedup vs vLLM | 3x |
| SGLang cost savings (agents) | 40% cheaper |
| SGLang prefix reuse speedup | 5x (10 shared requests) |
| TensorRT-LLM single-req latency advantage | 1.6x faster |
| llama.cpp Q4_K_M (7B) | 4.5 GB, 45 tok/s, ~3-5% loss |
| vLLM vs HF Transformers | 6x throughput (Llama-70B) |
| TensorRT-LLM H100 throughput (batch=64) | 850 tok/s |
| vLLM GitHub stars | 35k+ |
| llama.cpp GitHub stars | 70k+ |
| TensorRT-LLM setup time | Hours (engine compilation) |

Misconception: TensorRT-LLM is always faster than vLLM

TensorRT-LLM wins by 1.6x on single-request latency (28ms vs 45ms), but at batch=64 the gap narrows to 1.25x (850 vs 680 tok/s). On agent workloads with multi-turn context, SGLang outpaces TensorRT-LLM by ~3.5x thanks to RadixAttention. "Fastest" depends on the workload: batched serving (TensorRT-LLM), agents (SGLang), general purpose (vLLM).

Misconception: vLLM and SGLang are interchangeable

With unique prompts (zero prefix sharing) the difference is minimal: 100ms vs 90ms. But with 10 requests sharing a prefix, SGLang wins by 5x (200ms vs 1000ms). If your workload is a chatbot with a system prompt plus multi-turn context, SGLang saves ~40% of compute. For single-shot APIs without shared context, vLLM is simpler and has better AMD/Ascend support.

Misconception: Q4 quantization in llama.cpp is a free speedup

Q4_K_M yields 45 tok/s instead of 18 tok/s (FP16) on a MacBook M3, but the quality loss is 3-5% on general benchmarks and up to 15% on math/reasoning tasks. Acceptable for a production chatbot; for legal or medical AI, use Q8_0 (~1% loss) or FP16. Always measure quality on YOUR domain, not on MMLU.


Interview Questions

Q: You need to deploy Llama-70B for 10K concurrent users. Which engine do you choose, and why?

❌ Red flag: "vLLM, потому что он самый популярный"

✅ Strong answer: "Зависит от workload profile. Если chatbot с multi-turn (shared system prompt, 3-5 turns) -- SGLang: RadixAttention дает 5x speedup на shared prefix, 3x throughput на agent workloads. Если single-shot API (translation, classification) -- vLLM: проще deployment, лучше community, NVIDIA+AMD support. Если latency-critical (real-time trading, voice) -- TensorRT-LLM: 28ms TTFT vs 45ms vLLM. Конкретно для 10K concurrent: vLLM на 4xH100 с tensor parallelism, continuous batching, prefix caching. HPA по queue depth, min 4 / max 20 replicas. Target: TTFT <200ms p95, 680+ tok/s per node."

Q: What is PagedAttention, and why does it matter for serving?

❌ Red flag: "Это способ ускорить attention computation"

✅ Strong answer: "PagedAttention решает проблему memory fragmentation в KV cache. Традиционный подход: pre-allocate max_seq_len (4096 tokens) на каждый запрос, реальное использование -- 500 tokens, waste 87%. PagedAttention разбивает KV cache на fixed-size blocks (16 tokens), allocated on-demand -- аналог virtual memory в OS. Waste падает с 60-80% до <4%. Практический эффект: на тех же 80GB H100 max batch size растет с 8 до 128+, throughput 6x vs HuggingFace Transformers (300 vs 50 tok/s на Llama-70B). RadixAttention (SGLang) развивает идею: tree-based structure с token-level granularity, waste <1%, автоматический prefix sharing."

Q: When is llama.cpp justified instead of vLLM?

❌ Red flag: "Когда нет GPU"

✅ Strong answer: "Три сценария: (1) Edge/mobile deployment -- llama.cpp единственный движок работающий на Android/iOS, single binary без зависимостей. (2) Apple Silicon development -- Metal GPU acceleration, Q5_K_M на M3 Max дает 38 tok/s для 7B модели при 5.5GB RAM, достаточно для local dev. (3) CPU-only inference для low-traffic внутренних инструментов (<100 req/day) где GPU стоимость не оправдана. Но НЕ для production high-concurrency: llama.cpp 10-20x медленнее vLLM на H100, нет distributed inference, limited batching. Ecosystem (Ollama, LM Studio) -- для разработки и прототипирования, не для serving."

Q: Disaggregated inference -- what is it, and when should you apply it?

❌ Red flag: "Это когда модель распределена по нескольким GPU"

✅ Strong answer: "Disaggregated inference -- разделение prefill и decode фаз на разные GPU pools. Prefill = compute-bound (параллельная обработка всех input tokens, нагружает FLOPS). Decode = memory-bound (последовательная генерация по одному токену, нагружает memory bandwidth). На одном GPU оптимизация под обе фазы -- компромисс. Disaggregated: prefill workers на compute-оптимальных GPU (H100 SXM), decode workers на memory-оптимальных (с высоким bandwidth). SGLang поддерживает нативно. Применять при >1000 QPS: экономия 20-30% GPU за счет лучшего utilization каждого типа worker. При малом трафике overhead координации съедает выигрыш."


Sources

  1. Kwon et al. -- "Efficient Memory Management for Large Language Model Serving with PagedAttention" (SOSP 2023)
  2. Zheng et al. -- "SGLang: Efficient Execution of Structured Language Model Programs" (arXiv:2312.07104)
  3. NVIDIA -- TensorRT-LLM In-flight Batching whitepaper
  4. Clarifai -- "Comparing SGLang, vLLM, and TensorRT-LLM with GPT-OSS-120B" (Aug 2025)
  5. Kanerika -- "SGLang vs vLLM: Which is Better in 2026?" (Sept 2025)
  6. LangCopilot -- "SGLang 3x Faster" (July 2025)
  7. SemiAnalysis -- "InferenceMAX" (Oct 2025)
  8. Medium (Ordina Data) -- "Choosing Your LLM Framework: Ollama, vLLM, SGLang, TensorRT-LLM"
  9. AI Merge -- "The AI Engineer's Guide to Inference Engines"
