Decoding Beyond Token-by-Token

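Decode is memory-bound: every token requires streaming the full model weights out of HBM while the compute units sit idle. If a cheap small model first "guesses" K tokens and the large model verifies them in a single forward pass, each step yields 2.5+ tokens on average with almost no loss in quality. Since 2023, the speculative decoding family has become a standard switch in serving frameworks.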

On this page: 4 sections · 1 demo

§4.1 Speculative Decoding quick tour
§4.2 Medusa
§4.3 The EAGLE family
§4.4 Lookahead / Self-spec

§4.1 Speculative Decoding · Quick Tour

Speculative Decoding — Fast Inference from Transformers via Speculative Decoding

ICML 2023 Leviathan, Kalman, Matias · Google · arXiv:2211.17192 + a concurrent DeepMind version

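Decode is memory-bound: each step produces one token per sequence in the batch, so every weight read from HBM is used only once per sequence, and the GPU's compute sits mostly idle.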

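Key idea: let a small, cheap drafter emit $K$ candidate tokens in one go; the large verifier then scores all $K$ of them in a single forward pass (effectively a prefill over a $K$-long "fake" input). At the first position $j$ where a draft token is rejected, we resample that position from a residual distribution derived from the verifier and drafter distributions (see the rejection-sampling rule below) and discard the remaining drafts; if all $K$ drafts are accepted, we sample one bonus token from the verifier's own distribution. Each forward therefore emits $\mathbb{E}[\text{accepted}+1]$ tokens in expectation, and the output is distributed exactly as if the large model had sampled token by token.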

Why is it faster yet distribution-preserving?

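The rejection-sampling identity. Let the drafter distribution be $q(x)$ and the verifier distribution be $p(x)$. Accept a drafted $x$ with probability $\min(1, p(x)/q(x))$; on rejection, resample from the normalized positive part of $p - q$, i.e. $\max(0, p(x) - q(x)) / \sum_{x'} \max(0, p(x') - q(x'))$. The output is then distributed exactly according to $p$, so specdec is not an approximation but a lossless speedup.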
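The identity can be checked in a few lines. The sketch below reuses $p$ and $q$ from the paragraph above; the symbols $\beta$ (overall acceptance probability) and $p_{\text{res}}$ (the resampling distribution) are names introduced here for illustration only.

```latex
\begin{aligned}
\beta &= \sum_{x} q(x)\,\min\!\Big(1, \tfrac{p(x)}{q(x)}\Big) = \sum_{x} \min\big(q(x), p(x)\big), \\
p_{\text{res}}(x) &= \frac{\max\big(0,\; p(x) - q(x)\big)}{\sum_{x'} \max\big(0,\; p(x') - q(x')\big)} = \frac{p(x) - \min\big(p(x), q(x)\big)}{1 - \beta}, \\
\Pr[\text{output} = x] &= \underbrace{\min\big(q(x), p(x)\big)}_{\text{drafted and accepted}} + \underbrace{(1 - \beta)\, p_{\text{res}}(x)}_{\text{rejected, then resampled}} = p(x).
\end{aligned}
```

The denominator of $p_{\text{res}}$ equals $1 - \beta$ because $\sum_{x} \big(p(x) - \min(p(x), q(x))\big) = 1 - \beta$, which is why the two terms in the last line add up to exactly $p(x)$.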

Main loop skeleton

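# Teaching version of the specdec main loop.
# Assumes target(x) and drafter(x) each return a [len(x), vocab] tensor of logits
# and that the prompt is non-empty.
import random
import torch

def softmax(logits):
    return torch.softmax(logits, dim=-1)

def sample(probs):
    return torch.multinomial(probs, 1).item()

def speculative_decode(prompt, target, drafter, K=4, max_tokens=200):
    tokens = list(prompt)
    while len(tokens) < max_tokens:
        # 1. The drafter autoregressively proposes K tokens (cheap)
        draft = []
        d_probs = []
        x = tokens[:]
        for _ in range(K):
            logits = drafter(x)
            q = softmax(logits[-1])
            t = sample(q)
            draft.append(t); d_probs.append(q)
            x.append(t)
        # 2. One verifier forward scores all K+1 positions at once;
        #    row len(tokens)-1+j holds the distribution for draft slot j
        logits = target(tokens + draft)
        v_probs = [softmax(logits[len(tokens) - 1 + j]) for j in range(K + 1)]
        # 3. Rejection sampling: accept the longest prefix
        n_accept = 0
        for j in range(K):
            t = draft[j]
            r = random.random()
            if r < min(1.0, (v_probs[j][t] / d_probs[j][t]).item()):
                n_accept += 1
            else:
                # Reject: resample from the normalized (p - q)+
                resid = (v_probs[j] - d_probs[j]).clamp(min=0)
                resid /= resid.sum()
                draft[j] = sample(resid)
                break
        else:
            # All K drafts accepted: sample the bonus token from the verifier
            draft.append(sample(v_probs[K]))
        # 4. Accepted prefix + one correction/bonus token
        tokens.extend(draft[:n_accept + 1])
    return tokens[:max_tokens]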

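Demo 4 · Speculative Decoding · The Better the Guess, the Bigger the Speedup

Parameters: K = draft length · p = acceptance probability

The drafter guesses K tokens at a time; the verifier accepts the longest matching prefix in one forward. A high p means the drafter is well aligned with the verifier, so the speedup is large; at low p, the fixed cost of each verifier pass (plus the drafting overhead) can make decoding slower than the baseline. Choosing K is a trade-off: a larger K also wastes more rejected draft tokens. In practice the EAGLE series pushes p to roughly 0.85+, for about a 3× speedup.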
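The K versus p trade-off can be made concrete with a small cost model. This is a minimal sketch under simplifying assumptions that are not from the demo itself: each draft token is accepted independently with probability p, one verifier forward over K+1 positions costs the same as a plain decode step (the memory-bound regime), and one drafter step costs a fraction `c` of a verifier forward. The function names and the value `c=0.05` are illustrative.

```python
def expected_tokens(p: float, K: int) -> float:
    """Expected tokens emitted per verifier forward, assuming each draft
    token is accepted independently with probability p."""
    if p >= 1.0:
        return float(K + 1)
    # 1 + p + p^2 + ... + p^K
    return (1.0 - p ** (K + 1)) / (1.0 - p)


def speedup(p: float, K: int, c: float) -> float:
    """Estimated speedup over plain decoding; c is the assumed cost of one
    drafter step relative to one verifier forward."""
    return expected_tokens(p, K) / (c * K + 1.0)


def best_K(p: float, c: float, max_K: int = 16) -> int:
    """Draft length in 1..max_K that maximizes the estimated speedup."""
    return max(range(1, max_K + 1), key=lambda K: speedup(p, K, c))


for p in (0.5, 0.7, 0.85):
    K = best_K(p, c=0.05)
    print(f"p={p:.2f}  best K={K}  est. speedup={speedup(p, K, 0.05):.2f}x")
```

Under this model the best K grows with p: a well-aligned drafter justifies longer drafts, while a poorly aligned one should keep K small. When p is near zero the estimate drops below 1×, which is the "slower than baseline" regime described above.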

§4.2 Medusa · Guessing in Parallel with Multiple Heads, No Drafter Network

Medusa — Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

2024-01 Cai et al. · Princeton / Together · arXiv:2401.10774

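A separate drafter model has to be trained and maintained. Medusa instead attaches several extra prediction heads to the same large model, predicting positions t+2, t+3, ... while the original LM head still predicts t+1; the heads' candidates are verified together in a single forward pass via tree attention. Each head is just a small MLP and can be trained together with the backbone, e.g. via LoRA. One forward yields about 2.2 tokens, roughly a 2× speedup.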

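# Medusa head: each head directly emits a token distribution
# head_k predicts position (t + k + 1)
import torch.nn as nn

class MedusaHead(nn.Module):
    def __init__(self, D, V):
        super().__init__()  # must run before submodules are registered
        self.res = nn.Linear(D, D)
        self.act = nn.SiLU()
        self.out = nn.Linear(D, V)
    def forward(self, h):
        # residual block h + SiLU(W h), then project to vocabulary logits
        return self.out(h + self.act(self.res(h)))

# Inference: one forward → take top-k_per_head candidates from each head → build a tree → verify it in one pass with tree attention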

§4.3 EAGLE / EAGLE-2 / EAGLE-3 · Guessing at the Hidden-State Level

EAGLE — Speculative Sampling Requires Rethinking Feature Uncertainty

ICML 2024 Li et al. · Vector Institute · arXiv:2401.15077

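Medusa predicts future tokens directly, which is too "far" a jump. EAGLE instead predicts the next hidden-state feature, which is closer to the target model's output space, so the drafter learns more accurately. It uses the target model's own last-layer hidden states as supervision, with a lightweight transformer drafter plus tree-attention candidate verification. On Vicuna 7B: about a 3× speedup.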

EAGLE-2 / EAGLE-3 — Dynamic Draft Trees + Multi-feature

2024-06 / 2025-03 · EAGLE-2 · EAGLE-3

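EAGLE-2: a dynamic token tree that expands branches adaptively according to the drafter's confidence, about 20% faster than a fixed tree. EAGLE-3: lets the drafter see features from low, middle, and high layers of the large model, which makes it more robust; 4–5× measured on Llama-3-70B. EAGLE-3 has been integrated into SGLang / vLLM / TGI.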
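The idea of confidence-driven tree growth can be sketched as a best-first expansion. This is a simplified illustration, not the EAGLE-2 algorithm itself: it scores each node by the product of drafter probabilities along its path and greedily keeps the highest-scoring nodes up to a budget. `children_fn`, `toy_children`, and all probabilities below are made up for the example.

```python
import heapq


def build_dynamic_tree(children_fn, budget):
    """Pick `budget` draft nodes best-first by cumulative path confidence.

    children_fn(path) -> list of (token, prob) is a hypothetical stand-in
    for the drafter's top-k continuations of `path` (a tuple of tokens).
    Returns [(path, score), ...] in the order the nodes were selected.
    """
    # heapq is a min-heap, so scores are stored negated
    frontier = [(-prob, (tok,)) for tok, prob in children_fn(())]
    heapq.heapify(frontier)
    chosen = []
    while frontier and len(chosen) < budget:
        neg_score, path = heapq.heappop(frontier)
        chosen.append((path, -neg_score))
        for tok, prob in children_fn(path):
            # child score = parent score * child probability
            heapq.heappush(frontier, (neg_score * prob, path + (tok,)))
    return chosen


def toy_children(path):
    # Made-up drafter output: very confident about "the", unsure afterwards
    table = {
        (): [("the", 0.9), ("a", 0.1)],
        ("the",): [("cat", 0.6), ("dog", 0.4)],
        ("a",): [("cat", 0.5), ("dog", 0.5)],
    }
    return table.get(path, [])


tree = build_dynamic_tree(toy_children, budget=3)
for path, score in tree:
    print(" ".join(path), round(score, 2))
```

With a budget of three nodes, the whole budget goes to the confident "the" branch and the unlikely "a" branch is never expanded. A fixed tree would have spent one of its three nodes on "a" regardless.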

Token Tree · Verifying N Drafts in One Forward

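Both Medusa and EAGLE use a token tree: take the top-k candidates at each position; enumerating every combination would give $k^K$ candidate sequences, so in practice they are merged into a tree of N nodes that the large model verifies in a single forward pass. This is implemented with a tree attention mask, in which each node may attend only to itself and its ancestors.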

# Simplified: depth=2, top-2 token tree
# root = "the"
#  ├─ "cat"
#  │   ├─ "is"
#  │   └─ "sat"
#  └─ "dog"
#      ├─ "is"
#      └─ "ran"
# One forward of the large model over 7 positions; each attends only to itself and its ancestor chain
import torch

tree_attn_mask = torch.tensor([
    [1,0,0,0,0,0,0],  # the (root)
    [1,1,0,0,0,0,0],  # cat → the
    [1,0,1,0,0,0,0],  # dog → the
    [1,1,0,1,0,0,0],  # is  → the,cat
    [1,1,0,0,1,0,0],  # sat → the,cat
    [1,0,1,0,0,1,0],  # is  → the,dog
    [1,0,1,0,0,0,1],  # ran → the,dog
])
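Writing such a mask by hand does not scale, so it is usually derived from the tree structure. A minimal sketch in plain Python, assuming the tree is given as a hypothetical `parents` array (index of each node's parent, `-1` for the root) with parents listed before their children:

```python
def ancestor_mask(parents):
    """Build a tree attention mask from parent pointers.

    parents[i] is the index of node i's parent, or -1 for the root.
    Nodes must be ordered so that every parent precedes its children.
    mask[i][j] == 1 means node i may attend to node j.
    """
    n = len(parents)
    mask = [[0] * n for _ in range(n)]
    for i, parent in enumerate(parents):
        mask[i][i] = 1  # every node attends to itself
        if parent >= 0:
            # inherit everything the parent can see (its ancestor chain)
            for j in range(n):
                if mask[parent][j]:
                    mask[i][j] = 1
    return mask


# Node order: the, cat, dog, is, sat, is, ran
parents = [-1, 0, 0, 1, 1, 2, 2]
mask = ancestor_mask(parents)
for row in mask:
    print(row)
```

The result matches the seven-row mask for the "the / cat / dog" tree row by row, and the same function works for unbalanced trees such as those produced by dynamic expansion.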

§4.4 Lookahead Decoding · Self-Drafting, No Drafter Model

Lookahead Decoding — Break the Sequential Dependency of LLM Inference

ICML 2024 Fu et al. · UCSD · arXiv:2402.02057

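No drafter: it uses Jacobi iteration, guessing the next $n$ tokens simultaneously and iterating until they stabilize. Each forward is also fed candidates from an n-gram pool (harvested from earlier, not-yet-converged iterates), so in the same pass the model both predicts the next token and verifies n-grams. The model itself needs no changes at all; about 1.5–2× speedup.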
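The Jacobi part can be illustrated on its own. The sketch below is plain Jacobi decoding, not the full Lookahead algorithm (it has no n-gram pool and no verification branch), and `toy_next_token` is a made-up deterministic stand-in for a model's greedy next-token function. In a real model the list comprehension inside the loop would be one batched forward under a causal mask.

```python
def toy_next_token(prefix):
    # Hypothetical "model": greedy next token as a function of the prefix
    return (31 * sum(prefix) + 7 * len(prefix)) % 50


def greedy_decode(prompt, n, next_token=toy_next_token):
    """Reference: ordinary one-token-at-a-time greedy decoding."""
    out = list(prompt)
    for _ in range(n):
        out.append(next_token(out))
    return out[len(prompt):]


def jacobi_decode(prompt, n, next_token=toy_next_token):
    """Guess all n future tokens at once and refine until a fixed point."""
    guess = [0] * n  # arbitrary initial guess
    forwards = 0
    while True:
        forwards += 1
        # One parallel "forward": slot i is re-predicted from prompt + guess[:i]
        new = [next_token(list(prompt) + guess[:i]) for i in range(n)]
        if new == guess:  # nothing changed, so every slot is self-consistent
            return guess, forwards
        guess = new


tokens, forwards = jacobi_decode([3, 1, 4], n=6)
print(tokens, "in", forwards, "parallel forwards")
```

The fixed point is exactly the greedy output, because slot i becomes correct once slots 0..i-1 are correct; that also bounds the number of forwards by n+1. Plain Jacobi rarely converges much faster than that bound, which is why Lookahead adds the n-gram pool to accept several tokens per step.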

Self-Speculative · LayerSkip / KangarooLLM / Draft & Verify

2024 · LayerSkip

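The drafter is simply the first few layers of the same large model plus an early exit. The extra cost is close to zero and no additional model has to be deployed. KangarooLLM, Draft & Verify, and LayerSkip all belong to this line. The accuracy-versus-speed trade-off is slightly worse than with a dedicated drafter, but the approach is extremely portable from an engineering standpoint.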

MTP · Multi-Token Prediction (DeepSeek-V3)

DeepSeek 2024-12 · arXiv:2412.19437

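DeepSeek-V3 trains the model during pretraining to predict both position t+1 and position t+2, which yields a built-in drafter "for free". At inference the MTP head is used like a Medusa head. Training gains quality and inference gains about a 1.8× speedup: two wins from one change.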

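Cheat Sheet · How to Pick a Specdec Scheme

You already have a small model from the same family (e.g. Llama-3.2-1B paired with Llama-3-70B): standard specdec is the safest choice.
No small model, but you can run LoRA or train heads: Medusa / EAGLE-3.
No training resources at all: Lookahead or LayerSkip.
Training a new model: add an MTP head directly (the DeepSeek-V3 route).
Chat-assistant workloads: EAGLE-3 + tree (the usual choice in vLLM/SGLang).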