Stitching every idea into a single engine

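Every trick from Parts II-VI (FlashAttention, PagedAttention, KV quantization, speculative decoding, MoE expert parallelism (EP), INT4 weights) ultimately has to be scheduled by one serving engine. vLLM made paged KV + continuous batching the de facto standard; SGLang adds RadixAttention and structured decoding on top; TRT-LLM / LMDeploy offer the hand-written-kernel path to peak performance; Splitwise / Mooncake decouple prefill and decode onto separate GPU pools.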

5 sections on this page

§7.1 vLLM
§7.2 SGLang
§7.3 TensorRT-LLM
§7.4 LMDeploy / TurboMind
§7.5 P/D Disaggregation

§7.1 vLLM · continuous batching + paged

vLLM — High-throughput LLM serving

industrial · open-source · github · original paper: SOSP 2023

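Packs PagedAttention (§2.4), chunked prefill (§3.5), speculative decoding (§4.1–§4.3), and expert parallelism (§6.4) into one Python library. It loads HuggingFace-format models directly, so a single line, llm = LLM(model), is enough to get running. It is today's de facto baseline, and it supports nearly all mainstream model architectures and most common quantization schemes.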

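continuous batching · in one sentence

Traditional static batching: 32 requests prefill / decode together, and the whole batch waits until the longest request has finished generating. Continuous batching: requests are added and removed dynamically between forward steps, so a finished request leaves immediately and a new one takes its slot immediately. GPU utilization rises from roughly 30% to roughly 80%.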

# Pseudocode for one step of the vLLM scheduler
def step(engine):
    # 1. Decide which sequences form this step's batch
    seq_groups = engine.scheduler.schedule()
    # 2. One forward pass (paged attention kernel)
    output = engine.worker.execute_model(seq_groups)
    # 3. For each sequence: append the output token and update its state
    for seq, tok in zip(seq_groups, output):
        seq.append_token(tok)
        if seq.is_finished():
            engine.scheduler.free(seq)  # return its KV pages to the pool
    # 4. Admit requests that were waiting
    engine.scheduler.admit_new_requests()
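
To make the utilization gap concrete, here is a minimal toy simulation of the two batching policies. It is a sketch under simplifying assumptions, not vLLM's scheduler: each request only needs a fixed number of decode steps, every step costs the same, prefill and KV memory limits are ignored, and the function names are made up for this illustration.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Static batching: each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    """Continuous batching: after every step, finished requests leave
    and waiting requests take the freed slots."""
    waiting = deque(lengths)
    running = []  # remaining decode steps for each active request
    steps = 0
    while waiting or running:
        # Admit waiting requests into free batch slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One forward step: every active request emits one token.
        running = [r - 1 for r in running]
        # Finished requests leave the batch immediately.
        running = [r for r in running if r > 0]
        steps += 1
    return steps

def utilization(lengths, batch_size, steps):
    """Fraction of (step, slot) pairs that produced a useful token."""
    return sum(lengths) / (steps * batch_size)

# Two long requests mixed with six short ones, batch size 4.
lengths = [8, 1, 1, 1, 8, 1, 1, 1]
static_steps = static_batching_steps(lengths, 4)
continuous_steps = continuous_batching_steps(lengths, 4)
print(static_steps, utilization(lengths, 4, static_steps))
print(continuous_steps, utilization(lengths, 4, continuous_steps))
```

On this toy workload static batching needs 16 steps and continuous batching needs 9, because the short requests no longer wait behind the long ones. The exact percentages depend entirely on the length distribution, so treat the numbers as illustrative only.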

§7.2 SGLang · RadixAttention + structured generation

SGLang — Programming model for structured LLM workflows

UC Berkeley + LMSYS · github · arXiv:2312.07104

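Compared with vLLM, it adds:

RadixAttention (§3.4), which keeps a prefix cache organized as a tree.
Constrained decoding: a grammar / regex / JSON schema masks the logits while inference runs, with near-zero overhead.
First-class multimodal support: Llava / Qwen-VL / DeepSeek-VL run directly.
Frontier techniques such as EAGLE-3 and NSA are available out of the box.

On agent / RAG / tool-use workflows, where many requests share long prefixes, it is roughly 3× faster than vLLM.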
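
The logit-masking idea behind constrained decoding can be sketched in a few lines. This is a toy illustration, not SGLang's implementation: the five-token vocabulary, the hand-written two-state automaton (it accepts one or more digit tokens followed by `<eos>`), and all function names are assumptions made for the example.

```python
import math

# Hypothetical toy vocabulary; a real tokenizer has tens of thousands of entries.
VOCAB = ["yes", "no", "7", "42", "<eos>"]
EOS = VOCAB.index("<eos>")

def allowed(state):
    """Token ids the constraint permits.
    state 0: nothing emitted yet, state 1: at least one digit token emitted."""
    digits = {i for i, tok in enumerate(VOCAB) if tok.isdigit()}
    return digits if state == 0 else digits | {EOS}

def mask_logits(logits, allowed_ids):
    """Set every disallowed token's logit to -inf so it can never be chosen."""
    return [x if i in allowed_ids else -math.inf for i, x in enumerate(logits)]

def constrained_greedy(logits_per_step):
    """Greedy decoding where each step's logits are masked by the automaton."""
    state, out = 0, []
    for logits in logits_per_step:
        masked = mask_logits(logits, allowed(state))
        tok = max(range(len(VOCAB)), key=masked.__getitem__)
        if tok == EOS:
            break
        out.append(VOCAB[tok])
        state = 1
    return "".join(out)

# The model "prefers" the word "yes" at step one, but the mask forces digits.
steps = [
    [5.0, 4.0, 1.0, 0.5, 3.0],
    [0.0, 0.0, 0.1, 2.0, 1.0],
    [9.0, 0.0, 0.0, 0.0, 1.0],
]
result = constrained_greedy(steps)
print(result)
```

Production engines precompile the grammar or regex into a token-level automaton so that computing the allowed set per step is cheap; the mask itself is the same operation shown here.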

What an SGLang program looks like

import sglang as sgl

@sgl.function
def multi_turn_chat(s, system_prompt, history, question):
    s += system_prompt          # this prefix gets cached (shared by all samples)
    for u, a in history:
        s += f"User: {u}\nAssistant: {a}\n"
    s += f"User: {question}\nAssistant: "
    s += sgl.gen("answer", max_tokens=256,
                  regex=r"[\w\s\.,?!]+")  # constrained decoding

# 1000 requests sharing the same system_prompt → RadixAttention prefills that prefix only once
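
The prefix-sharing effect can be illustrated with a toy cache. The sketch below uses a plain token-level trie, whereas a real radix tree compresses single-child chains and the real system also evicts entries under memory pressure. `PrefixCache` and `match_and_insert` are hypothetical names for this example, not SGLang's API.

```python
class PrefixCache:
    """Token-level trie that records which prompt prefixes already have KV."""

    def __init__(self):
        self.root = {}

    def match_and_insert(self, tokens):
        """Return how many leading tokens were already cached,
        then record the remaining tokens as cached."""
        node, hit = self.root, 0
        for tok in tokens:
            if tok in node:
                hit += 1          # KV for this position can be reused
            else:
                node[tok] = {}    # first miss: everything after it is new too
            node = node[tok]
        return hit

# 1000 requests share a 100-token system prompt and differ in a 2-token suffix.
system_prompt = list(range(100))
requests = [system_prompt + [1000 + i, 2000 + i] for i in range(1000)]

cache = PrefixCache()
prefilled_with_cache = 0
for req in requests:
    hit = cache.match_and_insert(req)
    prefilled_with_cache += len(req) - hit   # only the uncached tail is computed

prefilled_without_cache = sum(len(req) for req in requests)
print(prefilled_with_cache, prefilled_without_cache)
```

Once a miss occurs the walk continues in a freshly created node, so `hit` counts only the matched prefix, which is exactly the part whose KV can be reused.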

§7.3 TensorRT-LLM · NVIDIA's in-house kernel arsenal

TensorRT-LLM

NVIDIA · github

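All fused attention / GEMM kernels are tuned directly by NVIDIA engineers; combined with SmoothQuant / FP8 / AWQ kernels, it is the peak-performance engineering option on H100. Downsides: the compiled graph is static and iteration is slow, and each new model needs a hand-written plugin. Production deployments often pair it with Triton Inference Server.

Peak speed is typically 10-30% higher than vLLM / SGLang (same hardware, same model), but development and iteration are much slower. Nearly all of NVIDIA's internal reference benchmarks and public marketing numbers use TRT-LLM.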

§7.4 LMDeploy / TurboMind · a small, fast, nimble homegrown engine for H100-class GPUs

LMDeploy / TurboMind

OpenMMLab / Shanghai AI Lab · github

TurboMind is the spiritual successor to FasterTransformer: hand-written CUDA, INT8/INT4 KV cache, persistent kernels, W4A16, and more. It is common on H800 / 4090 clusters in China.
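
As a reminder of what W4A16 means in practice (weights stored as 4-bit codes, dequantized to 16-bit values when used), here is a minimal pure-Python sketch of quantizing one weight group. It assumes a simple asymmetric min/max scheme and a low-nibble-first packing; TurboMind's actual kernels, group sizes, and memory layout are not shown here, and all function names are illustrative.

```python
def quantize_int4(weights):
    """Asymmetric 4-bit quantization of one weight group.
    Returns (codes in 0..15, scale, zero offset)."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 15 or 1.0          # avoid a zero scale for constant groups
    codes = [min(15, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo

def pack(codes):
    """Pack two 4-bit codes per byte, low nibble first."""
    if len(codes) % 2:
        codes = codes + [0]                # pad to an even count
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))

def unpack(packed, n):
    """Inverse of pack: recover the first n 4-bit codes."""
    out = []
    for b in packed:
        out += [b & 0xF, b >> 4]
    return out[:n]

def dequantize(codes, scale, lo):
    """Map codes back to approximate real-valued weights."""
    return [lo + c * scale for c in codes]

weights = [-0.8, -0.1, 0.0, 0.3, 0.7, 1.0, -0.5, 0.2]
codes, scale, lo = quantize_int4(weights)
packed = pack(codes)
restored = dequantize(unpack(packed, len(weights)), scale, lo)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(len(packed), max_err)
```

Eight weights fit in four bytes, and the round-trip error per weight is bounded by half a quantization step. A real kernel fuses the dequantization into the GEMM so the 16-bit weights never hit global memory.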

NanoFlow / Aphrodite / LightLLM / MAX (Modular)

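Other serving frameworks

Open-source options at the edge of the ecosystem: NanoFlow (UMich+Snowflake) pushes CUDA graphs + nano-batching to the limit; Aphrodite is a community fork of vLLM, while LightLLM is an independent lightweight Python framework; Modular's MAX uses its own Mojo language for cross-platform support. vLLM/SGLang remain the mainstream picks, but these are good choices for niche scenarios.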

§7.5 Prefill / Decode disaggregation · Splitwise / DistServe / Mooncake

Splitwise / DistServe / Mooncake — Disaggregated Prefill & Decode

ISCA 2024 / OSDI 2024 / 2024-06 · Splitwise · DistServe · Mooncake (Kimi)

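Prefill is compute-bound and decode is memory-bandwidth-bound, so putting both on the same GPU makes them fight each other. After disaggregation, prefill nodes compute KV at high utilization with large batches, then ship the KV over NVLink / RDMA to decode nodes, which are lightweight, run large batches, and emit output token by token. Measured end-to-end latency against the SLO metrics (TTFT + TBT) drops significantly. Mooncake is Kimi's production system, with cross-cluster KV pooling and cache-hit-rate optimization.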
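
A back-of-the-envelope roofline calculation shows why the two phases land on opposite sides. The sketch below only models weight traffic for dense layers: each parameter is read once per forward step and contributes 2 FLOPs per token. It ignores KV-cache reads and activations, and the hardware numbers are approximate H100 SXM figures assumed for illustration.

```python
def arithmetic_intensity(tokens_per_step, bytes_per_param=2):
    """FLOPs per byte of weight traffic for y = x @ W.
    Each parameter does one multiply and one add per token
    and is loaded from memory once per step."""
    return 2 * tokens_per_step / bytes_per_param

# Assumed, approximate H100 SXM figures.
PEAK_FLOPS = 989e12   # dense BF16 FLOP/s
PEAK_BW = 3.35e12     # HBM bytes/s
RIDGE = PEAK_FLOPS / PEAK_BW   # FLOPs/byte needed to saturate compute

def bound(tokens_per_step):
    """Classify a forward step by comparing its intensity to the ridge point."""
    if arithmetic_intensity(tokens_per_step) >= RIDGE:
        return "compute-bound"
    return "memory-bound"

# Prefill pushes every prompt token through the weights in one step;
# decode pushes one token per request in the batch.
print(round(RIDGE), bound(2048), bound(32))
```

Under these assumptions a 2048-token prefill is far above the ridge point, while a decode step with 32 concurrent requests is far below it, which is why decode nodes want very large batches and prefill nodes do not need them.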

The disaggregated data path

# A request passes through two node pools, with the KV transferred in between
async def serve(request):
    # 1. Route to a prefill node
    prefill_node = router.pick_prefill_node()
    kv_handle = await prefill_node.prefill(request.prompt)
    #    → kv_handle carries metadata for the RDMA-pinned buffer

    # 2. Route to a decode node (based on KV cache affinity)
    decode_node = router.pick_decode_node(kv_handle.size_gb)

    # 3. KV transfer: over NVLink (RDMA when crossing PCIe domains)
    await decode_node.ingest_kv(kv_handle)

    # 4. The decode node starts emitting output token by token
    async for token in decode_node.decode(request):
        yield token
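
The same flow can be exercised end to end with in-process stand-ins. This is a runnable mock under stated assumptions: `PrefillNode`, `DecodeNode`, and `KVHandle` are hypothetical classes, `asyncio.sleep(0)` stands in for both the prompt pass and the NVLink / RDMA copy, and routing is omitted by passing the nodes in directly.

```python
import asyncio
from dataclasses import dataclass

@dataclass
class KVHandle:
    """Metadata describing where a request's KV cache lives (mocked)."""
    request_id: int
    num_tokens: int

class PrefillNode:
    async def prefill(self, request_id, prompt_tokens):
        await asyncio.sleep(0)   # stands in for the compute-heavy prompt pass
        return KVHandle(request_id, len(prompt_tokens))

class DecodeNode:
    def __init__(self):
        self.kv = {}             # request_id -> number of cached tokens

    async def ingest_kv(self, handle):
        await asyncio.sleep(0)   # stands in for the NVLink / RDMA copy
        self.kv[handle.request_id] = handle.num_tokens

    async def decode(self, request_id, max_new_tokens):
        for i in range(max_new_tokens):
            await asyncio.sleep(0)
            self.kv[request_id] += 1   # every new token extends the KV cache
            yield f"tok{i}"

async def serve(request_id, prompt_tokens, prefill_node, decode_node,
                max_new_tokens=3):
    handle = await prefill_node.prefill(request_id, prompt_tokens)
    await decode_node.ingest_kv(handle)
    async for token in decode_node.decode(request_id, max_new_tokens):
        yield token

async def main():
    prefill_node, decode_node = PrefillNode(), DecodeNode()
    tokens = [t async for t in serve(0, [1, 2, 3, 4], prefill_node, decode_node)]
    return tokens, decode_node.kv[0]

tokens, kv_len = asyncio.run(main())
print(tokens, kv_len)
```

The point of the mock is the ownership handoff: after `ingest_kv` the prefill node is free to take the next prompt, and only the decode node's cache grows during generation.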

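Cheat sheet · which serving framework to pick

Need	First choice
"I just want to get a model running"	vLLM
agent / RAG / few-shot	SGLang (RadixAttention)
Lowest possible latency, single-model production	TensorRT-LLM
Edge / 4090 / Chinese domestic accelerators	LMDeploy / TurboMind
Thousand-GPU cluster + long prompts	vLLM + chunked prefill + Mooncake-style disaggregation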