Computing less = computing right

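The core idea of sparsity: no single token needs all of a model's parameters. Unstructured pruning (SparseGPT / Wanda) targets theoretical FLOPs savings; 2:4 / N:M sparsity is shaped to fit what the hardware can accelerate; MoE builds "pick a small sub-model per token" into the architecture itself, and DeepSeek-V3 is the design that took this route to production scale.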

On this page: 4 sections · 1 demo

§6.1 SparseGPT / Wanda
§6.2 2:4 / N:M structured sparsity
§6.3 MoE · the routing dilemma
§6.4 Expert Parallelism

§6.1 SparseGPT / Wanda · one-shot pruning

SparseGPT — Massive Language Models Can be Accurately Pruned in One-Shot

ICML 2023 Frantar, Alistarh · arXiv:2301.00774

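Applies GPTQ's second-order approach to pruning: the weight matrix is processed column by column, a mask decides which elements to drop, and the inverse Hessian is used to update the not-yet-processed columns so that they absorb the error each pruned weight introduces. The paper prunes OPT-175B to 50% sparsity in one shot, without retraining, at a small perplexity cost, in under 4.5 hours.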
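The "prune one weight, let the others absorb the error" step is the classic Optimal Brain Surgeon (OBS) update. The sketch below shows only that single step for one weight row; it is not SparseGPT's full column-wise algorithm. The damping value `1e-2`, the sizes, and the helper names are illustrative assumptions.

```python
import torch

def obs_prune_one(w, H, p):
    """Zero weight p of row w and adjust the other weights to compensate (OBS step)."""
    H_inv = torch.linalg.inv(H)
    # delta[p] equals -w[p], so the pruned weight lands exactly on zero
    delta = -(w[p] / H_inv[p, p]) * H_inv[:, p]
    return w + delta

torch.manual_seed(0)
X = torch.randn(64, 8, dtype=torch.float64)               # calibration inputs [n_calib, d_in]
H = X.T @ X + 1e-2 * torch.eye(8, dtype=torch.float64)    # damped layer Hessian (assumed damping)
w = torch.randn(8, dtype=torch.float64)                   # one output row of the weight matrix
p = 3                                                     # index of the weight to prune

w_obs = obs_prune_one(w, H, p)
w_zero = w.clone()
w_zero[p] = 0.0                                           # naive pruning: zero it, touch nothing else

def quad_err(w_new):
    # Layer-output error under the quadratic model: (w_new - w)^T H (w_new - w)
    d = w_new - w
    return (d @ H @ d).item()

print("naive zeroing error:", quad_err(w_zero))
print("OBS-compensated error:", quad_err(w_obs))
```

Under this quadratic model the compensated error can never exceed the naive one, which is why second-order pruning tolerates higher sparsity than simply zeroing weights.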

Wanda — A Simple and Effective Pruning Approach for LLMs

ICLR 2024 Sun, Liu, Bair, Kolter · CMU · arXiv:2306.11695

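Needs only a few lines of code: within each output row, prune the weights $W_{ij}$ whose score $|W_{ij}| \cdot \|X_{:, j}\|_2$ is smallest. No Hessian computation and no weight update are required, and results come close to SparseGPT. The takeaway: weight magnitude alone is not enough; the magnitude of the input activation each weight multiplies is the key signal.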

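# Wanda: score = |weight| * input-activation norm, pruned per output row
import torch

def wanda_prune(W, X, sparsity=0.5):
    # W: [d_out, d_in],  X: [n_calib, d_in]
    act_norm = X.norm(p=2, dim=0)            # [d_in], L2 norm of each input feature
    importance = W.abs() * act_norm[None, :] # [d_out, d_in]
    # per row, keep the top (1 - sparsity) fraction of weights
    n_keep = max(1, int(W.shape[1] * (1 - sparsity)))    # guard: topk(0) has no cutoff
    thr = importance.topk(n_keep, dim=1).values[:, -1:]  # per-row cutoff, [d_out, 1]
    mask = importance >= thr                 # ties at the cutoff are all kept
    return W * mask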
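A tiny worked example makes the difference from plain magnitude pruning concrete. This is a sketch with hand-picked numbers; `keep_top_per_row` is a hypothetical helper, not part of any library.

```python
import torch

def keep_top_per_row(score, W, n_keep):
    # Keep the n_keep highest-scoring weights in each row and zero the rest.
    idx = score.topk(n_keep, dim=1).indices
    mask = torch.zeros_like(W, dtype=torch.bool)
    mask.scatter_(1, idx, True)
    return W * mask

W = torch.tensor([[0.9, 0.1, 0.5, 0.4]])
X = torch.tensor([[0.1, 10.0, 1.0, 1.0]])   # one calibration row; input feature 1 is very active

# Magnitude pruning scores by |W| only.
magnitude_pruned = keep_top_per_row(W.abs(), W, n_keep=2)
# Wanda scores by |W| * per-feature activation norm: [0.09, 1.0, 0.5, 0.4].
wanda_pruned = keep_top_per_row(W.abs() * X.norm(p=2, dim=0), W, n_keep=2)

print(magnitude_pruned)   # keeps columns 0 and 2
print(wanda_pruned)       # keeps columns 1 and 2
```

The largest weight (0.9) is dropped by Wanda because its input is almost always near zero, while the small weight (0.1) survives because its input is large.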

§6.2 2:4 / N:M structured sparsity · speedups you can actually get

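The unstructured sparsity above is 50% on paper but rarely yields a real speedup on GPUs: with zeros scattered at random positions, the hardware still schedules and executes the same dense tiles. 2:4 sparsity is supported in hardware by Ampere and Hopper tensor cores: every group of 4 consecutive weights must contain at least 2 zeros. The matrix is stored compressed (the 2 kept values plus small position indices per group), and the sparse tensor cores skip the zeros, for up to 2× peak matmul throughput.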
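The storage idea behind 2:4 can be pictured as "half the values plus a position index per kept value". The sketch below is a conceptual illustration only; it is not the actual memory layout used by NVIDIA's libraries, and both function names are hypothetical.

```python
import torch

def compress_24(W):
    # W: [d_out, d_in] that already satisfies the 2:4 pattern, d_in % 4 == 0
    d_out, d_in = W.shape
    g = W.reshape(d_out, d_in // 4, 4)
    # positions of the 2 kept entries in each group, in ascending order
    idx = g.abs().topk(2, dim=-1).indices.sort(dim=-1).values
    vals = g.gather(-1, idx)                          # [d_out, d_in//4, 2]
    return vals.reshape(d_out, d_in // 2), idx.to(torch.uint8)

def decompress_24(vals, idx):
    d_out = vals.shape[0]
    g = torch.zeros(d_out, idx.shape[1], 4, dtype=vals.dtype)
    g.scatter_(-1, idx.long(), vals.reshape(d_out, -1, 2))
    return g.reshape(d_out, -1)

W = torch.tensor([[0.0, 3.0, 0.0, -1.0, 2.0, 0.0, 0.0, 5.0]])
vals, idx = compress_24(W)
restored = decompress_24(vals, idx)
print(vals)       # half as many stored values as W has columns
print(idx)        # position (0..3) of each stored value inside its group
```

Each index needs only 2 bits, so the compressed form is close to half the size of the dense matrix.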

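Method	Key idea
SparseGPT 2:4	Choose the best positions to keep within each group of 4
Wanda 2:4	Keep the 2 weights per group with the highest |weight| × activation-norm importance
MaskLLM	Learn the 2:4 mask as a categorical variable
2:4 + LoRA / SFT recovery	After pruning, fine-tune briefly with LoRA or SFT to recover quality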

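# 2:4 mask: in every group of 4 consecutive weights, keep the 2 with the largest magnitude
import torch

def make_24_mask(W):
    # W: [d_out, d_in], d_in % 4 == 0
    d_out, d_in = W.shape
    assert d_in % 4 == 0, "d_in must be a multiple of 4"
    g = W.abs().reshape(d_out, d_in // 4, 4)
    top2 = g.topk(2, dim=-1).indices       # [d_out, d_in//4, 2]
    mask = torch.zeros_like(g, dtype=torch.bool)
    mask.scatter_(-1, top2, True)          # mark the 2 kept positions per group
    return mask.reshape(d_out, d_in)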
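The same grouping trick generalizes from 2:4 to any N:M pattern and to any importance score, for example the Wanda score from the table above. A minimal sketch, where `make_nm_mask` is a hypothetical helper and the random tensors stand in for real weights and calibration activations:

```python
import torch

def make_nm_mask(score, n=2, m=4):
    # Keep the n highest-scoring entries in every group of m consecutive columns.
    d_out, d_in = score.shape
    assert d_in % m == 0, "d_in must be a multiple of m"
    g = score.reshape(d_out, d_in // m, m)
    keep = g.topk(n, dim=-1).indices
    mask = torch.zeros_like(g, dtype=torch.bool)
    mask.scatter_(-1, keep, True)
    return mask.reshape(d_out, d_in)

torch.manual_seed(0)
W = torch.randn(8, 16)                       # stand-in weights [d_out, d_in]
X = torch.randn(32, 16)                      # stand-in calibration activations [n_calib, d_in]
score = W.abs() * X.norm(p=2, dim=0)         # Wanda-style importance

mask_24 = make_nm_mask(score, n=2, m=4)
mask_48 = make_nm_mask(score, n=4, m=8)
print(mask_24.float().mean().item())         # overall fraction of weights kept
```

Both masks keep half the weights; 4:8 gives the pruner more freedom per group, but only the 2:4 pattern is the one described above as hardware-accelerated.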

§6.3 Mixture of Experts · the routing dilemma

Switch Transformer / GShard / Mixtral / DeepSeek-MoE

2021–2024 · Switch · Mixtral · DeepSeek-MoE

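Replace the FFN with $N$ parallel "experts"; a router sends each token to its top-K experts. FFN parameters grow roughly ×N, while per-token compute scales with K rather than N: much more capacity at nearly constant compute. Mixtral 8×7B has 47B total / 13B active parameters; DeepSeek-V3 has 671B / 37B. Llama 4 is an MoE, Gemini 1.5 is described by Google as one, and GPT-4 is widely reported (though not officially confirmed) to be one.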
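The total-versus-active figures quoted above translate directly into "what fraction of the model does one token touch". A small arithmetic sketch using only those numbers (in billions of parameters):

```python
# (total parameters, active parameters per token), in billions, as quoted in the text
models = {"Mixtral 8x7B": (47, 13), "DeepSeek-V3": (671, 37)}

active_fraction = {name: active / total for name, (total, active) in models.items()}

for name, frac in active_fraction.items():
    print(f"{name}: {frac:.1%} of parameters active per token")
```

Mixtral activates a bit over a quarter of its weights per token, while DeepSeek-V3 activates only about one in eighteen, which is why it can carry far more capacity at modest per-token compute.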

The shape of a router

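# Teaching-version top-K router: a small linear layer + softmax + top-K
import torch
import torch.nn as nn

class FFN(nn.Module):
    # Minimal stand-in expert (assumed here; the 4*D hidden size is illustrative)
    def __init__(self, D):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(D, 4 * D), nn.GELU(), nn.Linear(4 * D, D))
    def forward(self, x):
        return self.net(x)

class MoELayer(nn.Module):
    def __init__(self, D, N_experts, K=2):
        super().__init__()
        self.gate    = nn.Linear(D, N_experts, bias=False)
        self.experts = nn.ModuleList([FFN(D) for _ in range(N_experts)])
        self.K = K
    def forward(self, x):                  # x: [B, S, D]
        scores = self.gate(x)              # [B, S, N]
        probs = scores.softmax(dim=-1)
        # top-K selection, then renormalize the K gate weights to sum to 1
        top_p, top_i = probs.topk(self.K, dim=-1)
        top_p = top_p / top_p.sum(dim=-1, keepdim=True)
        # dispatch: each expert runs only on the tokens routed to it
        out = torch.zeros_like(x)
        for e_idx, expert in enumerate(self.experts):
            hit = top_i == e_idx           # [B, S, K]
            mask = hit.any(dim=-1)         # [B, S], tokens that chose this expert
            if not mask.any(): continue
            weight = (top_p * hit).sum(dim=-1, keepdim=True)   # [B, S, 1]
            out[mask] += weight[mask] * expert(x[mask])
        return out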

Three main engineering challenges

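Load balancing: left alone, the router tends to pour most tokens into a few experts, so an auxiliary loss and/or a capacity factor is needed to keep it in check. DeepSeek-V3 instead uses an "auxiliary-loss-free" per-expert bias, which avoids the quality cost of an aux loss.
Communication: an all-to-all sends each token to the devices holding its experts, and the traffic grows quickly as the MoE is scaled out. DeepEP addresses this with FP8 sends and double-buffered overlap of communication with compute.
Irregular computation from sparse activation: each expert receives a different number of tokens, which calls for a "grouped GEMM" or padding.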
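The auxiliary loss mentioned under load balancing is commonly written in the Switch Transformer form, $N \sum_i f_i P_i$, where $f_i$ is the fraction of tokens sent to expert $i$ and $P_i$ is the mean router probability for expert $i$. A minimal sketch for top-1 routing, leaving out the scaling coefficient that multiplies it in training; the toy inputs are illustrative:

```python
import torch
import torch.nn.functional as F

def switch_aux_loss(probs, top1):
    # probs: [T, N] router probabilities, top1: [T] chosen expert per token
    N = probs.shape[-1]
    f = F.one_hot(top1, N).float().mean(dim=0)   # fraction of tokens dispatched to each expert
    P = probs.mean(dim=0)                        # mean router probability per expert
    return N * (f * P).sum()

N = 4
# Perfectly balanced routing: uniform probabilities, tokens spread evenly.
balanced = switch_aux_loss(torch.full((8, N), 1.0 / N), torch.arange(8) % N)
# Collapsed routing: every token goes to expert 0 with probability 1.
collapsed_top1 = torch.zeros(8, dtype=torch.long)
collapsed = switch_aux_loss(F.one_hot(collapsed_top1, N).float(), collapsed_top1)

print(balanced.item(), collapsed.item())
```

The loss sits at its minimum of 1 when routing is uniform and grows to N when everything collapses onto one expert, so minimizing it pushes the router toward balance.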

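Demo 6 · MoE Routing · watch how tokens are assigned to experts

Controls: N experts · top-K · balance loss / bias correction

On the left are the incoming tokens, colored by their top-1 expert; on the right is each expert's cumulative token count. Turn off the balance loss and a few experts get swamped; when capacity runs out, gray cells mark tokens dropped as overflow. The capacity factor is set to 1.25× the expected load, so even with balancing on, randomness still causes a small amount of overflow. In production the capacity factor is typically set between 1.25 and 2.0.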
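The overflow behavior in the demo can be reproduced in a few lines. This sketch assumes top-1 routing, first-come-first-served queues, and a capacity rounded down; real systems differ in rounding and in which tokens they drop. `apply_capacity` is a hypothetical helper.

```python
import math
import torch
import torch.nn.functional as F

def apply_capacity(top1, n_experts, capacity_factor=1.25):
    # top1: [T] expert chosen by each token, in arrival order
    T = top1.numel()
    capacity = math.floor(capacity_factor * T / n_experts)   # slots per expert
    one_hot = F.one_hot(top1, n_experts)                     # [T, N]
    queue_pos = one_hot.cumsum(dim=0) - 1                    # running position in each expert's queue
    pos_in_expert = queue_pos.gather(1, top1[:, None]).squeeze(1)
    keep = pos_in_expert < capacity                          # False = overflow, token is dropped
    return keep, capacity

top1 = torch.tensor([0, 0, 0, 0, 1, 2, 3, 1])   # expert 0 is overloaded
keep, capacity = apply_capacity(top1, n_experts=4)
print(capacity)          # slots available per expert
print(keep.tolist())     # which tokens survive
```

With 8 tokens and 4 experts the expected load is 2 per expert, so a factor of 1.25 still leaves only 2 slots after rounding down, and the third and fourth tokens sent to expert 0 are dropped.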

§6.4 Expert Parallelism · DeepSeek-V3's engineering paradigm

DeepSeek-V3 / DeepEP — Open-source All-to-All Library for MoE

2024-12 / 2025 · arXiv:2412.19437 · DeepEP

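DeepSeek-V3 is the system that brought MoE to production at scale: each MoE layer has 256 routed experts (plus 1 shared expert), every token selects 8 of the routed ones, and the 671B-parameter model is both trained and served on large GPU clusters.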

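Auxiliary-loss-free load balancing (per-expert bias): a bias $b_i$ is added to each expert's routing score when choosing the top-K. It is not trained by gradient; after each step, $b_i$ is nudged down for overloaded experts and up for underloaded ones. The bias affects only which experts are selected, while the gate weights still come from the unbiased scores, so no aux loss is needed and balance is not traded against quality.
Multi-Token Prediction (MTP) gives a denser training signal, and the MTP module can be reused as a drafter for speculative decoding.
Node-limited routing: each token may be sent to at most 4 nodes, which bounds the all-to-all traffic.
DeepEP: an open-source fine-grained EP communication library with FP8 sends and double-buffered overlap.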

This stack is reported to reach roughly 2× the throughput of a Megatron-style mixed EP/TP configuration on H800. vLLM and SGLang have both integrated DeepEP.
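Node-limited routing from the list above can be sketched as a two-stage top-K: first pick the best few nodes for each token, then pick experts only inside those nodes. The sketch assumes experts are laid out contiguously per node and ranks a node by the sum of its best `k // max_nodes` expert scores, following the description in the DeepSeek-V3 report; the function name and sizes are illustrative.

```python
import torch

def node_limited_topk(scores, n_nodes, max_nodes, k):
    # scores: [T, N] routing scores; experts [i*per_node, (i+1)*per_node) live on node i
    T, N = scores.shape
    per_node = N // n_nodes
    s = scores.reshape(T, n_nodes, per_node)
    # rank each node by the sum of its strongest experts for this token
    node_score = s.topk(k // max_nodes, dim=-1).values.sum(dim=-1)   # [T, n_nodes]
    keep_nodes = node_score.topk(max_nodes, dim=-1).indices          # [T, max_nodes]
    node_mask = torch.zeros(T, n_nodes, dtype=torch.bool)
    node_mask.scatter_(1, keep_nodes, True)
    # experts on non-selected nodes can no longer win the final top-k
    masked = s.masked_fill(~node_mask[..., None], float("-inf")).reshape(T, N)
    return masked.topk(k, dim=-1).indices

torch.manual_seed(0)
scores = torch.rand(16, 32)                  # 16 tokens, 32 experts on 8 nodes (4 per node)
top_i = node_limited_topk(scores, n_nodes=8, max_nodes=4, k=8)
nodes_touched = top_i // 4                   # node id of every selected expert
print(nodes_touched[0].unique().tolist())    # nodes used by the first token
```

Every token still gets 8 experts, but they are spread over at most 4 nodes, so its activations cross the slower inter-node links at most 4 times.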

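# auxiliary-loss-free balance: each expert keeps a bias that steers selection only
import torch
import torch.nn as nn

class AuxFreeRouter(nn.Module):
    def __init__(self, D, N, K=8, step=0.01):        # K=8 is an assumed default
        super().__init__()
        self.N, self.K, self.step = N, K, step
        self.gate = nn.Linear(D, N, bias=False)
        # bias is a buffer, not a parameter: updated by rule, never by gradient
        self.register_buffer('bias', torch.zeros(N))
    def forward(self, x):                            # x: [..., D]
        probs = self.gate(x).softmax(dim=-1)         # unbiased routing scores
        # the bias influences which experts are picked, not their gate weights
        top_i = (probs + self.bias).topk(self.K, dim=-1).indices
        top_p = probs.gather(-1, top_i)
        # update bias based on observed load
        if self.training:
            with torch.no_grad():
                load = torch.bincount(top_i.flatten(), minlength=self.N).float()
                avg = load.mean()
                self.bias -= self.step * (load - avg).sign()  # overloaded -> lower bias
        return top_p, top_i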
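To see the bias rule at work without a full model, the update can be simulated on fixed routing scores. This is a toy sketch: the skew of `+2.0` on expert 0, the step size, and the number of iterations are arbitrary illustrative choices.

```python
import torch

torch.manual_seed(0)
T, N, K, step = 1024, 4, 1, 0.01
logits = torch.randn(T, N)
logits[:, 0] += 2.0                          # expert 0 is strongly favored by the gate
scores = logits.softmax(dim=-1)              # fixed, unbiased routing scores
bias = torch.zeros(N)

def route(bias):
    # Select experts on biased scores and return the token count per expert.
    top_i = (scores + bias).topk(K, dim=-1).indices
    return torch.bincount(top_i.flatten(), minlength=N).float()

load_before = route(bias)
for _ in range(300):
    load = route(bias)
    bias -= step * (load - load.mean()).sign()   # overloaded -> lower bias, underloaded -> higher
load_after = route(bias)

print("load before:", load_before.tolist())
print("load after: ", load_after.tolist())
print("learned bias:", bias.tolist())
```

The gate itself never changes here; only the selection bias moves, which is the point of the method: balance is enforced without adding a gradient term that competes with the language-modeling loss.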

EP × TP × PP in concert

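A typical production MoE deployment: EP=8 (experts sharded across 8 GPUs), TP=2 (each expert additionally tensor-parallel), PP=4 (different layers on different nodes), DP=N. Megatron-Core, DeepSpeed-MoE and DeepEP all work within this configuration space. DeepSeek-V3's training setup shows that dropping TP altogether (16-way PP, 64-way EP across 8 nodes, ZeRO-1 DP) and relying on node-limited routing costs less communication than an EP × TP mix.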
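What EP means for a single token can be made concrete by mapping expert ids to EP ranks and counting how many ranks its top-K selection touches. The sketch assumes a contiguous expert-to-rank placement, which is one common choice and not necessarily what any particular framework does; both helpers are hypothetical.

```python
def expert_to_rank(expert_id, n_experts, ep_size):
    # Contiguous placement: rank r holds experts [r*per_rank, (r+1)*per_rank).
    per_rank = n_experts // ep_size
    return expert_id // per_rank

def alltoall_fanout(top_experts, n_experts, ep_size):
    # Number of distinct EP ranks one token's activations must be sent to.
    return len({expert_to_rank(e, n_experts, ep_size) for e in top_experts})

experts_per_rank = 256 // 8                                    # 256 experts with EP=8
fanout = alltoall_fanout([0, 1, 40, 200], n_experts=256, ep_size=8)
print(experts_per_rank, fanout)
```

A larger EP degree puts fewer experts on each GPU but raises the number of ranks a token may have to reach, which is the communication pressure that node-limited routing is meant to cap.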