A visual essay · CUDA + computer graphics · 2026-05

How 3D Gaussian Splatting got fast.

From the 2023 paper that shocked SIGGRAPH to the 2025–2026 systems that train scenes in seconds. Every major rendering speedup, drawn out in pictures and the bare minimum of code.


Assumed background: basic NeRF / SDF, machine learning, linear algebra, and the chain rule. You do not need to know CUDA — by the end you'll have intuition for tiles, warps, shared memory, and atomics, in that order.

24 papers on the timeline · 6 interactive demos · ~1500 lines of canvas-only JS, no WebGL · last updated 2026-05-19

Reading path

Never read the original 3DGS paper: read §1–§5 in order, try the first two demos, then pick the deep dives you care about. Already comfortable with 3DGS: skip to §6 for the timeline, or jump to §7.3 (DISTWAR) for the most surprising single CUDA trick.

The problem, in one paragraph

You already know the NeRF setup: photos in, a volumetric radiance field out, novel views by integrating $\text{color} \times \text{density}$ along rays. You also know the catch — every pixel of every frame calls a small MLP a few hundred times. Even Instant-NGP, with its hash grid, struggles to keep a 1080p frame under 30 ms on consumer cards.


In July 2023, Kerbl, Kopanas, Leimkühler & Drettakis (Inria + MPI) shipped a paper that simply walked around the problem. They replaced the MLP-evaluated continuous field with an explicit mixture: a few million anisotropic 3D Gaussians, each carrying a position, covariance, opacity, and view-dependent color. Volume rendering became splatting — rasterize each Gaussian's 2D footprint, alpha-composite front-to-back. No MLP in the inner loop. A custom CUDA rasterizer pushed it to ~135 FPS at 1080p on an RTX A6000, matching Mip-NeRF 360 quality with ~30 minutes of training instead of days.


The interesting question isn't "what is a radiance field" — you know — but "what does the CUDA actually do, kernel by kernel, that made this $1000\times$ faster than NeRF rendering?" And then: every year since, someone has found another constant factor and removed it. SnugBox tile bounds, warp-aggregated atomics, per-pixel sort, MCMC densification, hardware rasterization. This essay walks all of it.

§1 The atom: a single 3D Gaussian, dissected

Forget neural networks. The fundamental unit of a 3DGS scene is a fuzzy ellipsoid:


$$ G(\mathbf{x}) = \exp\!\Big(-\tfrac{1}{2}\,(\mathbf{x}-\boldsymbol{\mu})^{\!\top}\,\boldsymbol{\Sigma}^{-1}\,(\mathbf{x}-\boldsymbol{\mu})\Big) $$

Two parameters describe its shape and place: a center $\boldsymbol{\mu} \in \mathbb{R}^3$ and a $3\times 3$ covariance $\boldsymbol{\Sigma}$ that says how stretched and oriented the blob is. Plus an opacity $o \in (0,1)$ and a color $\mathbf{c}$ that depends on viewing direction (so reflections work) — more on that in a moment.

Why factor $\Sigma$ into $\text{rotation} \times \text{scale}$?

Covariance matrices have to be symmetric positive semi-definite. If you optimize the 6 independent entries of $\boldsymbol{\Sigma}$ directly, gradient descent will happily push them into an invalid region (negative eigenvalues = imaginary ellipsoid). Kerbl et al. dodge this by storing $\Sigma$ as


$$ \boldsymbol{\Sigma} = R\,S\,S^{\!\top}\,R^{\!\top} $$

where $S = \mathrm{diag}(s_x, s_y, s_z)$ is a diagonal scale and $R$ is a rotation matrix built from a unit quaternion $\mathbf{q}$. Any 3D ellipsoid can be written this way, and the parameterization is automatically valid no matter what gradient lands on $(\mathbf{q}, \mathbf{s})$. This trick — pick coordinates so the constraints are free — appears again and again in the field.
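The factorization can be sketched on the CPU in a few lines. This is a minimal host-side C++ sketch, not the reference implementation: the names `quat_to_rotation` and `build_covariance` and the (w, x, y, z) quaternion ordering are assumptions made for illustration.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Mat3 = std::array<std::array<double, 3>, 3>;

// Rotation matrix from a quaternion q = (w, x, y, z). The quaternion is
// normalized first, so any non-zero optimizer output maps to a valid rotation.
Mat3 quat_to_rotation(const std::array<double, 4>& q) {
    const double n = std::sqrt(q[0] * q[0] + q[1] * q[1] + q[2] * q[2] + q[3] * q[3]);
    assert(n > 0.0);
    const double w = q[0] / n, x = q[1] / n, y = q[2] / n, z = q[3] / n;
    Mat3 R{};
    R[0] = {1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)};
    R[1] = {2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)};
    R[2] = {2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)};
    return R;
}

// Sigma = R S S^T R^T with S = diag(s). Since Sigma = (R S)(R S)^T, it is
// symmetric positive semi-definite for every (q, s).
Mat3 build_covariance(const std::array<double, 4>& q, const std::array<double, 3>& s) {
    const Mat3 R = quat_to_rotation(q);
    Mat3 sigma{};
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            for (int k = 0; k < 3; ++k)
                sigma[i][j] += R[i][k] * s[k] * s[k] * R[j][k];
    return sigma;
}
```

With the identity quaternion and scales (1, 2, 3) the result is diag(1, 4, 9); any other quaternion only rotates that ellipsoid, so the trace stays 14.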

Demo 1 · Shape one yourself

Drag the orange dot to move $\mu$; the cyan / magenta dots to scale along the two principal axes; the green dot to rotate. The readout updates the $2\times 2$ covariance live. This is the 2D analog of the 3D parameterization above — the math is identical.

The splatting projection

To draw the ellipsoid into a 2D image, project the center $\boldsymbol{\mu}$ through the camera (standard pinhole) and the covariance too — that's the EWA splatting trick from Zwicker et al. (2001). If $W$ is the world-to-camera matrix and $J$ is the Jacobian of the perspective projection at $\boldsymbol{\mu}$, then the 2D screen-space covariance is


$$ \boldsymbol{\Sigma}' = J\,W\,\boldsymbol{\Sigma}\,W^{\!\top}\,J^{\!\top} $$

Drop the bottom row and right column and you get a $2\times 2$ covariance: the ellipse the 3D ellipsoid casts onto the image. Every pixel within that ellipse receives this Gaussian with weight $\alpha = o\cdot G_{\text{2D}}(\mathbf{p})$, where $o$ is the Gaussian's opacity. That's it. That's "splatting."

Intuition

Slap a rubber ball against a wall — it deforms into an ellipse on impact. EWA does exactly that: the image plane is the wall, $J W \Sigma W^\top J^\top$ describes how flat-and-wide that ellipse comes out.
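To make the projection concrete, here is a hedged host-side C++ sketch of $\boldsymbol{\Sigma}' = J W \boldsymbol{\Sigma} W^\top J^\top$ for a pinhole camera. The function name `project_covariance` and the argument layout are assumptions for illustration; the sketch keeps only the first two rows of $J$, which gives the same $2\times 2$ block as forming the full product and dropping the last row and column.

```cpp
#include <array>
#include <cassert>

using Mat3 = std::array<std::array<double, 3>, 3>;
using Mat2 = std::array<std::array<double, 2>, 2>;

// sigma: 3D covariance in world space.
// W:     rotation part of the world-to-camera transform.
// t:     Gaussian center in camera space (t[2] is depth, must be > 0).
// fx,fy: focal lengths in pixels.
Mat2 project_covariance(const Mat3& sigma, const Mat3& W,
                        const std::array<double, 3>& t, double fx, double fy) {
    assert(t[2] > 0.0);
    const double z2 = t[2] * t[2];
    // Jacobian of the pinhole map (x, y, z) -> (fx * x / z, fy * y / z) at t.
    const double J[2][3] = {{fx / t[2], 0.0, -fx * t[0] / z2},
                            {0.0, fy / t[2], -fy * t[1] / z2}};
    double M[2][3] = {};  // M = J * W
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 3; ++j)
            for (int k = 0; k < 3; ++k)
                M[i][j] += J[i][k] * W[k][j];
    Mat2 out{};           // out = M * sigma * M^T
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 2; ++j)
            for (int a = 0; a < 3; ++a)
                for (int b = 0; b < 3; ++b)
                    out[i][j] += M[i][a] * sigma[a][b] * M[j][b];
    return out;
}
```

A unit sphere on the optical axis at depth 2 with a 100-pixel focal length projects to an isotropic splat with variance $(100/2)^2 = 2500$ square pixels; moving it off-axis or stretching it in depth makes the footprint anisotropic.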

View-dependent color via spherical harmonics

Same problem NeRFs solved with a view-direction input to the MLP — gloss and specularity demand $c$ depend on $\mathbf{d}$. 3DGS handles it without an MLP: store color as spherical harmonic coefficients up to degree 3 (16 per channel, 48 numbers). Evaluate at the per-Gaussian viewing direction, get an RGB. Critically this is evaluated once per Gaussian per frame in the preprocess kernel — not per pixel — so the inner compositor loop sees only a precomputed color and pays nothing extra for view-dependence.
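As a rough illustration of "evaluate once per Gaussian", the sketch below evaluates only the degree-0 and degree-1 terms (4 of the 16 coefficients per channel). The basis constants are the standard real spherical-harmonic values; the sign pattern, the `+0.5` offset and the clamp at zero follow the convention of the reference code as best understood here, so treat them as assumptions. The name `sh_deg1_to_rgb` is hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>

using Vec3 = std::array<double, 3>;

constexpr double kSH0 = 0.28209479177387814;  // degree-0 basis constant
constexpr double kSH1 = 0.4886025119029199;   // degree-1 basis constant

// sh[k] holds the k-th coefficient for the three color channels (k = 0..3).
// dir points from the camera to the Gaussian center; it is normalized here.
Vec3 sh_deg1_to_rgb(const std::array<Vec3, 4>& sh, const Vec3& dir) {
    const double n = std::sqrt(dir[0] * dir[0] + dir[1] * dir[1] + dir[2] * dir[2]);
    assert(n > 0.0);
    const double x = dir[0] / n, y = dir[1] / n, z = dir[2] / n;
    Vec3 rgb{};
    for (int c = 0; c < 3; ++c) {
        const double v = kSH0 * sh[0][c]
                       - kSH1 * y * sh[1][c]
                       + kSH1 * z * sh[2][c]
                       - kSH1 * x * sh[3][c];
        rgb[c] = std::max(0.0, v + 0.5);  // assumed offset and clamp
    }
    return rgb;
}
```

The result is a plain RGB triple, which is all the per-pixel compositing loop ever sees.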

§2 Stacking blobs: how a pixel gets its color

Same volume-rendering equation you've used for NeRFs, but evaluated over a sorted list of Gaussians instead of MLP samples along a ray. A pixel typically sits "under" hundreds of Gaussians; the renderer sorts them front-to-back by depth and composites:


$$ C \;=\; \sum_{i=1}^{N} \mathbf{c}_i\,\alpha_i\,T_i, \qquad T_i = \prod_{j=1}^{i-1}(1-\alpha_j) $$

$T_i$ is the transmittance — how much light from Gaussian $i$ survives the Gaussians in front of it. Notice $T_1 = 1$ (nothing in front), and $T_i$ shrinks fast once you hit opaque Gaussians. As soon as $T$ falls below a threshold ($10^{-4}$ in the paper), no later Gaussian can contribute meaningfully — the pixel terminates early. This single observation is the backbone of the entire speed story.
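The compositing sum and the early-termination rule fit in a short loop. This is a single-channel host-side C++ sketch under the assumption that the samples are already sorted front to back; `Sample`, `PixelResult` and `composite` are illustrative names, not identifiers from the reference code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Sample { double alpha; double color; };            // one channel only
struct PixelResult { double color; double T; std::size_t used; };

// Front-to-back alpha compositing: C = sum_i c_i * alpha_i * T_i, stopping
// as soon as the transmittance T drops below t_min.
PixelResult composite(const std::vector<Sample>& sorted, double t_min = 1e-4) {
    PixelResult r{0.0, 1.0, 0};
    for (const Sample& s : sorted) {
        assert(s.alpha >= 0.0 && s.alpha <= 1.0);
        r.color += s.color * s.alpha * r.T;   // contribution weighted by T_i
        r.T *= (1.0 - s.alpha);               // T_{i+1} = T_i * (1 - alpha_i)
        ++r.used;
        if (r.T < t_min) break;               // nothing behind can matter
    }
    return r;
}
```

Two half-transparent layers with colors 1 and 0 give color 0.5 and leave T = 0.25; a single nearly opaque layer ends the loop after one sample, no matter how many Gaussians sit behind it.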

Demo 2 · March a pixel

Drag the slider to walk front-to-back through a synthetic stack of Gaussians. Watch $T$ decay and $C$ accumulate. Hit an opaque cluster — $T$ collapses, the pixel could early-out, and every Gaussian behind that point contributes essentially nothing. That's "early termination," the single biggest reason render kernels finish quickly on real scenes.

§3 One frame, six CUDA kernels

Now we have the pieces. A trained scene has roughly 1–6 million Gaussians. To render one frame, the original diff-gaussian-rasterization CUDA pipeline does six things:


Preprocess. For every Gaussian, project $\mu$ to screen, project $\Sigma$ to 2D, decide which $16\times 16$ tiles it overlaps.
Duplicate-with-keys. Every Gaussian writes one (tile_id, depth) key per tile it touches into a flat array.
Sort. One global radix sort on those 64-bit keys.
Find tile ranges. Locate each tile's run in the sorted array.
Render. One CUDA block per tile, threads cooperate to walk front-to-back.
Backward. The same in reverse, with atomic gradient adds.


The single most important architectural decision is the tile. Instead of asking "for each pixel, which Gaussians touch me?" (a many-to-many nightmare), the renderer asks "for each $16\times 16$ tile, which Gaussians touch me?" and a CUDA block of 256 threads handles that tile in lockstep. $16\times 16 = 256$ threads = exactly 8 warps = one perfectly-sized CUDA block.

Demo 3 · Tile binning

Drag the Gaussian (center, principal axes, rotation handle). Tiles its 2D ellipse actually overlaps light up in burnt orange. Tiles inside the loose AABB but missed by the true ellipse light up yellow — those are wasted entries the original pipeline queues anyway. The Speedy-Splat paper (§7.4) attacks exactly that waste.

The radix-sort trick

Here is the cleverest line in the paper. The duplicate-list keys are 64 bits: the high 32 bits are the tile id, the low 32 bits are the depth (as a float bit pattern, which orders correctly as an unsigned integer because visible depths are positive). One cub::DeviceRadixSort call sorts the whole thing. The result? Every tile's Gaussians are now contiguous in memory and sorted by depth. No per-tile sort. No two-stage anything. A global sort of ~10–30 M keys is bandwidth-bound and typically lands around a millisecond or a few on a modern GPU.

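// Pack tile id into the high bits, depth into the low bits.
// depth > 0 for visible Gaussians, so its bit pattern sorts like its value.
uint64_t key = ((uint64_t)tile_id << 32) | __float_as_uint(depth);
keys_unsorted[idx] = key;
values_unsorted[idx] = gaussian_id;

// One global sort (host side). By default CUB sorts on all 64 key bits; the
// optional begin_bit / end_bit arguments can restrict it to the bits in use.
// Calling SortPairs with a null workspace first only fills workspace_bytes.
cub::DeviceRadixSort::SortPairs(
    workspace, workspace_bytes,
    keys_unsorted, keys_sorted,
    values_unsorted, values_sorted,
    n_entries);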
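The key packing and the "find tile ranges" step can be checked on the CPU. The sketch below is a host-side C++ stand-in that uses `std::sort` in place of the GPU radix sort; `Entry`, `pack_key` and `sort_and_find_ranges` are illustrative names, and depths are assumed to be positive.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

struct Entry { std::uint64_t key; std::uint32_t gaussian_id; };

// Tile id in the high 32 bits, float bit pattern of the depth in the low 32.
std::uint64_t pack_key(std::uint32_t tile_id, float depth) {
    std::uint32_t bits;
    std::memcpy(&bits, &depth, sizeof bits);  // host analog of __float_as_uint
    return (static_cast<std::uint64_t>(tile_id) << 32) | bits;
}

// Sorts entries by key, then returns for every tile the half-open range
// [first, last) it occupies in the sorted array (empty tiles get [0, 0)).
std::vector<std::pair<std::size_t, std::size_t>>
sort_and_find_ranges(std::vector<Entry>& entries, std::uint32_t n_tiles) {
    std::sort(entries.begin(), entries.end(),
              [](const Entry& a, const Entry& b) { return a.key < b.key; });
    std::vector<std::pair<std::size_t, std::size_t>> ranges(n_tiles);
    for (std::size_t i = 0; i < entries.size(); ++i) {
        const std::uint32_t tile = static_cast<std::uint32_t>(entries[i].key >> 32);
        assert(tile < n_tiles);
        const bool starts_run =
            (i == 0) || tile != static_cast<std::uint32_t>(entries[i - 1].key >> 32);
        if (starts_run) ranges[tile].first = i;   // first entry of this tile
        ranges[tile].second = i + 1;              // one past the latest entry
    }
    return ranges;
}
```

After the sort, each tile's entries form one contiguous run ordered near to far, which is exactly what the render kernel walks.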

Rendering a tile

The render kernel launches one block per tile. Inside the block, 256 threads share the work of loading Gaussians from global memory into shared memory in batches of 256; every thread (one per pixel) walks through that batch front-to-back, accumulating its own pixel's color. When every thread in the block hits $T \lt 10^{-4}$, the block exits.


// One block per tile, 256 threads (16x16 pixels); BATCH == 256.
__shared__ Gaussian batch[BATCH];          // cooperative load target
const int tid = threadIdx.y * blockDim.x + threadIdx.x;   // 0..255 in the tile
float T = 1.0f, C[3] = {0.0f, 0.0f, 0.0f};
bool done = false;                         // this pixel has saturated
for (int b = start; b < end; b += BATCH) {
    // Leave once every pixel in the tile is done. This call is also the
    // barrier that keeps `batch` from being overwritten while still in use.
    if (__syncthreads_count(done) == BATCH) break;

    // cooperative load: each thread fetches one Gaussian (guarding the tail)
    if (b + tid < end)
        batch[tid] = gaussians[ sorted_ids[b + tid] ];
    __syncthreads();

    const int n = min(BATCH, end - b);
    for (int k = 0; !done && k < n; ++k) {
        float g = eval_2d_gaussian(batch[k], pixel);     // exp(-1/2 x^T Σ⁻¹ x)
        float a = batch[k].opacity * g;
        C[0] += batch[k].color[0] * a * T;
        C[1] += batch[k].color[1] * a * T;
        C[2] += batch[k].color[2] * a * T;
        T *= (1.0f - a);
        if (T < 1e-4f) done = true;        // early termination for this pixel
    }
}

Three details to notice. (1) Each Gaussian is fetched from slow global memory once per tile, not once per pixel: shared memory turns 256 redundant loads into one. (2) The pixel loop has almost no branching beyond the early-out, so the warp stays coherent. (3) Once every pixel in a tile has finished, the whole block exits on a single __syncthreads_count; that block-wide vote rides on a barrier the loop needs anyway, so it is nearly free.

§4 The backward pass: chain rule, by hand

What makes 3DGS a learning system is that the rasterizer is differentiable. Every step above (the SH evaluation, the EWA projection, the alpha composition) has a hand-written backward kernel. PyTorch's autograd never traces through the rasterizer; the CUDA kernels compute the gradients directly and hand them back.

The trick: instead of recording every intermediate value (which would cost gigabytes), the backward pass replays the forward composition in reverse, using two stored pieces of state per pixel — final $T$ and the index of the last Gaussian that contributed. From those two numbers it can reconstruct every $T_i$ on the fly via $T_i = T_{i+1} / (1 - \alpha_i)$.
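The reverse replay can be sketched for one pixel and one color channel. Writing $A_i$ for the color composited behind Gaussian $i$ (as seen from just behind it), the chain rule gives $\partial C / \partial \alpha_i = T_i\,(c_i - A_i)$ with $A_{i-1} = \alpha_i c_i + (1-\alpha_i) A_i$. The host-side C++ below is a simplified sketch under stated assumptions: no background color, no early termination in the forward pass, every $\alpha_i < 1$, and hypothetical names `composite_forward` and `dcolor_dalpha`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Layer { double alpha; double color; };  // one channel, front to back

// Forward pass: returns the pixel color and stores only the final T.
double composite_forward(const std::vector<Layer>& layers, double& final_T) {
    double C = 0.0, T = 1.0;
    for (const Layer& l : layers) {
        assert(l.alpha >= 0.0 && l.alpha < 1.0);
        C += l.color * l.alpha * T;
        T *= (1.0 - l.alpha);
    }
    final_T = T;
    return C;
}

// Backward pass: walks back to front, rebuilding each T_i from the final T
// instead of reading stored intermediates. Returns dC/dalpha_i for every i.
std::vector<double> dcolor_dalpha(const std::vector<Layer>& layers, double final_T) {
    std::vector<double> grad(layers.size());
    double T = final_T;     // starts as T_{N+1}
    double behind = 0.0;    // A_i: color accumulated behind layer i
    for (std::size_t i = layers.size(); i-- > 0;) {
        const Layer& l = layers[i];
        T /= (1.0 - l.alpha);                       // T_i = T_{i+1} / (1 - alpha_i)
        grad[i] = T * (l.color - behind);           // dC/dalpha_i
        behind = l.alpha * l.color + (1.0 - l.alpha) * behind;
    }
    return grad;
}
```

For two layers with $\alpha = (0.5, 0.5)$ and colors $(1, 0.5)$, the color is $C = \alpha_1 + 0.5\,(1-\alpha_1)\,\alpha_2$, so the gradients are $1 - 0.5\,\alpha_2 = 0.75$ and $0.5\,(1-\alpha_1) = 0.25$, which is what the replay returns.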


The unpleasant part: many Gaussians contribute to many pixels, so the gradient updates $\partial L / \partial \mu_i$ for one Gaussian arrive from many threads at once. The original implementation uses atomicAdd on global memory — correct, but a contention nightmare. DISTWAR and gsplat's fused backward later attack exactly this (see §7.3).

Aside

Why $\exp$, and not a more general kernel? Because $\exp(-\tfrac{1}{2} \mathbf{x}^{\!\top}\Sigma^{-1}\mathbf{x})$ has a closed-form gradient with respect to $\Sigma$, $\mu$, and the pixel coordinate. GES (Hamdi et al., CVPR 2024) later showed that a generalized exponential (same gradient story, sharper falloff) can match quality with roughly half as many primitives. The math is a tool, not a master.

§5 What "fast" meant in 2023

135 FPS · 1080p, Mip-NeRF 360 garden scene, RTX A6000

~30 min · Train time per scene to SOTA quality

~1–6 M · Final Gaussians per scene

~1 GB · Disk size for a 5M-Gaussian scene
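The ~1 GB figure follows from counting the parameters introduced in §1. The arithmetic below is a sketch that assumes every attribute is stored as an uncompressed 32-bit float and ignores any extra per-point fields a particular file format might add; `scene_bytes` is an illustrative name.

```cpp
#include <cassert>
#include <cstdint>

// position (3) + scale (3) + rotation quaternion (4) + opacity (1)
// + spherical-harmonic color (16 coefficients x 3 channels = 48)
constexpr std::uint64_t kFloatsPerGaussian = 3 + 3 + 4 + 1 + 48;

// Raw attribute storage for a scene, assuming 4 bytes per float.
inline std::uint64_t scene_bytes(std::uint64_t n_gaussians) {
    assert(n_gaussians > 0);
    return n_gaussians * kFloatsPerGaussian * 4;
}
```

That is 59 floats, or 236 bytes, per Gaussian, so 5 million Gaussians come to about 1.18 GB. The 48 color coefficients account for roughly four fifths of it, which is why the compression work in §7.6 targets them.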

NeRF at the time: well under 1 FPS for the highest-quality models such as Mip-NeRF 360, days of training, and network weights measured in megabytes. 3DGS was about three orders of magnitude faster to render and $\sim 50\times$ faster to train, at the price of a much larger file. And the file you shipped was not a black-box neural network: it was a transparent point cloud you could open in a debugger.

§6 Three years of speedups, on one rail

The 2023 paper opened a floodgate. Below is a curated timeline of the works that pushed rendering or training efficiency the most. Click any entry for the one-line CUDA idea.

§7 The CUDA tricks, one by one

Picking from the timeline, six ideas have done the most to make 3DGS render faster without changing what 3DGS fundamentally is. Each is its own little CUDA lesson.

7.1 · gsplat: the open rewrite

gsplat: a clean PyTorch + CUDA rewrite for 3DGS

Nerfstudio · 2024-onwards Ye, Turkulainen, Kerr, et al. · arXiv:2409.06765 · code

The original Inria rasterizer was research-grade C++ with PyTorch hooks. Hard to extend, hard to fuse new ideas into.


Key idea. gsplat is the production rewrite: clean Python API, fused forward+backward, a tighter projection bounding box, packed/sparse modes, and a plug-in registry. Today a large share of new 3DGS papers build on gsplat. Two CUDA-level wins worth knowing:

Tighter screen-space bound. The original bounds each splat with a square: the axis-aligned box around a circle whose radius is $3\sigma$ along the ellipse's major axis. gsplat uses the tight AABB of the ellipse itself, often cutting the touched-tile count by 30–50%.
Fused backward. Computing $\partial L / \partial \Sigma$ and $\partial L / \partial \mu$ in the same pass reduces global-memory traffic; gsplat's backward is $\sim 1.5\text{–}2\times$ faster than the original for typical scenes.
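The difference between the two bounds is easy to quantify. In the host-side C++ sketch below, the loose bound is a square of half-width $\lceil 3\sqrt{\lambda_{\max}} \rceil$ and the tight bound uses the half-extents $3\sqrt{\Sigma'_{xx}}$ and $3\sqrt{\Sigma'_{yy}}$ of the $3\sigma$ ellipse. The tile-rounding rule and all names (`Cov2`, `tiles_in_box`, `loose_tile_count`, `tight_tile_count`) are assumptions for illustration and do not mirror either code base exactly.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct Cov2 { double xx, xy, yy; };  // symmetric 2x2 screen-space covariance

// Number of tiles overlapped by the box [cx-hx, cx+hx] x [cy-hy, cy+hy],
// clipped to a grid of grid_w x grid_h tiles of `tile` pixels each.
int tiles_in_box(double cx, double cy, double hx, double hy,
                 int tile, int grid_w, int grid_h) {
    assert(tile > 0);
    auto clip = [](double v, int hi) {
        return std::max(0, std::min(hi, static_cast<int>(std::floor(v))));
    };
    const int x0 = clip((cx - hx) / tile, grid_w);
    const int x1 = clip((cx + hx) / tile + 1.0, grid_w);
    const int y0 = clip((cy - hy) / tile, grid_h);
    const int y1 = clip((cy + hy) / tile + 1.0, grid_h);
    return (x1 - x0) * (y1 - y0);
}

// Loose bound: square around a circle of radius 3 * sqrt(largest eigenvalue).
int loose_tile_count(const Cov2& s, double cx, double cy, int tile, int gw, int gh) {
    const double mid = 0.5 * (s.xx + s.yy);
    const double det = s.xx * s.yy - s.xy * s.xy;
    const double lambda_max = mid + std::sqrt(std::max(0.0, mid * mid - det));
    const double r = std::ceil(3.0 * std::sqrt(lambda_max));
    return tiles_in_box(cx, cy, r, r, tile, gw, gh);
}

// Tight bound: axis-aligned box of the 3-sigma ellipse itself.
int tight_tile_count(const Cov2& s, double cx, double cy, int tile, int gw, int gh) {
    return tiles_in_box(cx, cy, 3.0 * std::sqrt(s.xx), 3.0 * std::sqrt(s.yy),
                        tile, gw, gh);
}
```

For a splat with standard deviations of 10 and 1 pixels centered at (40, 40) on 16-pixel tiles, the loose square covers 25 tiles and the tight box covers 5. An axis-aligned sliver is the best case for the tight box; a sliver rotated by 45 degrees gains almost nothing, which is the gap §7.4 addresses.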

Original Inria code: monolithic, hard to modify. gsplat: clean Python+CUDA API, fused passes, the de facto base for much of the work after mid-2024.

Demo 4 · Loose vs tight tile bound

Same ellipse, two ways to bound it. The yellow tiles on the left are wasted work in vanilla 3DGS; gsplat (and later Speedy-Splat) skip them. The cost of skipping is ~one ellipse-test per candidate tile.

7.2 · StopThePop: sort, per pixel

StopThePop: Hierarchical Per-Pixel Sort & Cull for Real-Time 3DGS

SIGGRAPH 2024 Radl, Steiner, Parger, Weinrauch, Kerbl, Steinberger · arXiv:2402.00525

Sorting by the view-space depth of each Gaussian's center gives one order shared by every pixel, and that is wrong in general. The right order for a pixel is by depth along that pixel's own ray, which differs across the pixels of a tile. Move the camera and the chosen order can flip, producing the visible "popping" artifact 3DGS is famous for.

Key idea. Sort per pixel. Naively that's catastrophic ($256\times$ more sorts per tile). The trick is hierarchical: a coarse per-tile sort first, then a tiny insertion-sorted window of size 4 per pixel that repairs the order locally. The pop is gone, and the paper reports a cost of only about 4% over baseline on average. A great example of "do the expensive thing, but only on the small set that needs it."

// Per-pixel insertion buffer of size K=4 (pseudocode)
struct Entry { Gaussian g; float d; };
Entry queue[4 + 1];        // one spare slot: insert first, then evict
int  qlen = 0;
for (each gaussian g in tile order) {
    float d_pix = depth_along_ray(g, pixel);
    int pos = qlen;
    while (pos > 0 && queue[pos-1].d > d_pix) {   // insertion sort by per-pixel depth
        queue[pos] = queue[pos-1]; --pos;
    }
    queue[pos] = {g, d_pix};
    if (++qlen > 4) { composite(queue[0]); shift_left(queue); --qlen; }   // emit the nearest
}
for (int i = 0; i < qlen; ++i) composite(queue[i]);   // drain what is left in the window

Vanilla: one global sort, popping. StopThePop: same global sort + a 4-slot per-pixel buffer, no popping, about 4% slower. Or $1.6\times$ faster once the consistent ordering is used to train a scene with roughly half as many Gaussians.

7.3 · DISTWAR: fixing the backward atomic storm

DISTWAR: Distributed Warp Atomic Reduction for Differentiable Rasterization

HPCA 2024 Durvasula, Zhao, Chen, et al. · arXiv:2401.05345

During backward, every pixel that touched Gaussian $g$ wants to atomicAdd to grad[g]. In a $16\times 16$ tile, that's potentially 256 atomic adds to the same address. Atomics serialize. The backward pass spends 30–60% of its time waiting on these.


Key idea. Aggregate within a warp first. Every thread of a tile processes the same Gaussian at the same step, so the 32 lanes of a warp are normally adding to the same address. Use warp-level shuffles (__shfl_xor_sync), which stay entirely in registers, to sum the 32 contributions across the warp, then have one thread per warp (lane 0) do a single atomicAdd: up to $32\times$ fewer atomics. The paper reports $2.44\times$ average, up to $5.7\times$, backward speedup on contention-heavy scenes. gsplat now does this by default.

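// Warp-level reduction before atomicAdd.
// Assumes all 32 lanes are active and target the same g_id (a lane whose
// pixel skipped this Gaussian passes 0), and a 1D thread index.
float v = local_grad_mu_x;
#pragma unroll
for (int off = 16; off > 0; off >>= 1)
    v += __shfl_xor_sync(0xffffffff, v, off);   // butterfly: every lane ends with the sum
if ((threadIdx.x & 31) == 0)            // lane 0 of each warp
    atomicAdd(&grad_mu_x[g_id], v);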

Demo 5 · Atomic contention, before and after

32 threads, 1 target memory address. Naive mode: 32 serialized atomics, each waiting for the previous. Warp-reduce mode: 1 atomic after a register-only butterfly reduction. Same answer, $32\times$ less contention. This is the most under-appreciated CUDA trick in the entire 3DGS literature.
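The butterfly pattern itself can be emulated on the CPU to see why one atomic is enough. In this host-side C++ sketch an array of 32 floats stands in for one register per lane, and `shuffle_xor` mimics what `__shfl_xor_sync` does; both function names are illustrative, and the emulation says nothing about timing.

```cpp
#include <array>
#include <cassert>

constexpr int kWarpSize = 32;
using Warp = std::array<float, kWarpSize>;  // one register value per lane

// CPU stand-in for __shfl_xor_sync: lane i reads the value of lane i ^ offset.
Warp shuffle_xor(const Warp& v, int offset) {
    assert(offset > 0 && offset < kWarpSize);
    Warp out{};
    for (int lane = 0; lane < kWarpSize; ++lane) out[lane] = v[lane ^ offset];
    return out;
}

// Butterfly reduction: after log2(32) = 5 exchange steps every lane holds
// the sum of all 32 inputs, so a single lane can issue the one atomicAdd.
Warp butterfly_sum(Warp v) {
    for (int offset = kWarpSize / 2; offset > 0; offset >>= 1) {
        const Warp other = shuffle_xor(v, offset);
        for (int lane = 0; lane < kWarpSize; ++lane) v[lane] += other[lane];
    }
    return v;
}
```

Feeding in the values 1 through 32 leaves 528 in every lane after five steps. On the GPU those five steps are register exchanges with no memory traffic, and the single write that follows replaces 32 serialized ones.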

7.4 · Speedy-Splat: count tiles correctly

Speedy-Splat: Fast 3DGS with Accurate Tile Selection & Pruning

CVPR 2025 Hanson, Tu, Lin, Singla, Zwicker, Goldstein · arXiv:2412.00578 · code

Even with the tight AABB from §7.1, tiles are still over-counted. An AABB-based binner asks "what's the bounding box of the ellipse?" and adds every tile inside that box. For an elongated, rotated ellipse, many of those tiles contain only a corner of the box, not the ellipse itself.

Key idea. SnugBox computes a tight bounding box from the Gaussian's actual extent; AccuTile then walks that box and keeps only the tiles the ellipse really intersects, removing the false positives (on the order of half of what a plain box would queue). Combined with smarter pruning (drop low-contribution Gaussians during training), Speedy-Splat reports $6.7\times$ faster rendering on average at close to baseline PSNR. A great case study: the bottleneck was bookkeeping, not arithmetic.
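An exact tile test is cheap to write down. The host-side C++ sketch below is not the SnugBox or AccuTile algorithm from the paper; it is one straightforward way to decide whether the $3\sigma$ ellipse touches a tile, by minimizing the quadratic form $q(\mathbf{p}) = (\mathbf{p}-\mathbf{c})^\top Q\,(\mathbf{p}-\mathbf{c})$ over the tile rectangle and comparing the minimum against 9. `Conic` holds $Q = \Sigma'^{-1}$, and all names are illustrative.

```cpp
#include <algorithm>
#include <cassert>
#include <initializer_list>
#include <limits>

struct Conic { double a, b, d; };  // Q = [[a, b], [b, d]], inverse 2D covariance

double quad_form(const Conic& Q, double dx, double dy) {
    return Q.a * dx * dx + 2.0 * Q.b * dx * dy + Q.d * dy * dy;
}

double clamp_to(double v, double lo, double hi) {
    return std::max(lo, std::min(hi, v));
}

// True if the ellipse {p : q(p) <= level} centered at (cx, cy) intersects the
// tile [x0, x1] x [y0, y1]. q is convex, so if the center lies outside the
// tile the minimum over the tile sits on one of its four edges.
bool ellipse_touches_tile(const Conic& Q, double cx, double cy,
                          double x0, double y0, double x1, double y1,
                          double level = 9.0) {
    assert(Q.a > 0.0 && Q.d > 0.0 && x0 <= x1 && y0 <= y1);
    if (cx >= x0 && cx <= x1 && cy >= y0 && cy <= y1) return true;
    double best = std::numeric_limits<double>::infinity();
    for (double x : {x0, x1}) {  // vertical edges: minimize over y
        const double dx = x - cx;
        const double dy = clamp_to(-Q.b * dx / Q.d, y0 - cy, y1 - cy);
        best = std::min(best, quad_form(Q, dx, dy));
    }
    for (double y : {y0, y1}) {  // horizontal edges: minimize over x
        const double dy = y - cy;
        const double dx = clamp_to(-Q.b * dy / Q.a, x0 - cx, x1 - cx);
        best = std::min(best, quad_form(Q, dx, dy));
    }
    return best <= level;
}
```

For a thin ellipse stretched along the screen diagonal, the tiles along the diagonal pass the test while the off-diagonal corner tiles of its AABB are rejected, even though an AABB-based binner would queue them.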

7.5 · 3DGS-MCMC: kill the densification hacks

3DGS-MCMC: 3D Gaussian Splatting as Markov Chain Monte Carlo

NeurIPS 2024 Kheradmand, Rebain, Sharma, Sun, Tseng, Isack, Kar, Tagliasacchi, Yi · arXiv:2404.09591

Not a render-time speedup but a train-time one. The original paper's densification ("clone Gaussians with large gradients, split Gaussians with large variance, reset opacity every N steps") is a pile of well-tuned heuristics that are brittle across scenes. MCMC reframes the entire optimization as sampling from a posterior over Gaussians: a "dead" (low-opacity) Gaussian is teleported to a new location, chosen with probability proportional to the current opacity there. No more clone/split rules. Training is more stable and the final scene has fewer wasted Gaussians, which makes rendering faster too.

Original: heuristic clone/split + opacity reset. MCMC: relocation moves designed to (approximately) preserve the sample distribution, under a hard budget on the Gaussian count.

7.6 · LightGaussian & RadSplat: fewer Gaussians, free FPS

LightGaussian: $15\times$ Compression, 200+ FPS

NeurIPS 2024 Spotlight Fan et al. · arXiv:2311.17245

Most Gaussians in a trained scene contribute almost nothing to the final image. LightGaussian ranks each Gaussian by a "global significance" score (sum of $\alpha\cdot T$ over training views) and prunes the bottom 66%. Then it vector-quantizes SH coefficients and distills to lower SH degrees. Result: $15\times$ smaller files, $\sim 2\times$ faster rendering, near-identical PSNR.
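The pruning step reduces to ranking and thresholding. The host-side C++ sketch below assumes the per-Gaussian scores (the summed $\alpha\cdot T$ over training views) have already been accumulated; `significance_keep_mask` is a hypothetical name, and the quantization and distillation stages are not shown.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// score[i]: accumulated alpha * T of Gaussian i over all training views.
// Returns keep[i] == false for the prune_fraction lowest-scoring Gaussians.
std::vector<bool> significance_keep_mask(const std::vector<double>& score,
                                         double prune_fraction) {
    assert(prune_fraction >= 0.0 && prune_fraction <= 1.0);
    const std::size_t n_prune = static_cast<std::size_t>(
        prune_fraction * static_cast<double>(score.size()));
    // Rank Gaussian indices from least to most significant.
    std::vector<std::size_t> order(score.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::sort(order.begin(), order.end(),
              [&](std::size_t a, std::size_t b) { return score[a] < score[b]; });
    std::vector<bool> keep(score.size(), true);
    for (std::size_t k = 0; k < n_prune; ++k) keep[order[k]] = false;
    return keep;
}
```

With six scores and a prune fraction of 0.66, the three lowest-scoring Gaussians are dropped. Every dropped Gaussian removes its entries from the duplicate list of §3, so the sort and the per-tile walk both shrink with it.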

RadSplat: NeRF-Supervised 3DGS Pruning to 900+ FPS

3DV 2025 Niemeyer et al. · arXiv:2403.13806

RadSplat goes further by using a slower but more accurate NeRF as a teacher to decide which Gaussians matter, hitting 900+ FPS at 1080p on the original Mip-NeRF 360 scenes. Same scene, comparable quality, several times the framerate of vanilla 3DGS (roughly $7\times$ the 135 FPS quoted in §5).

LightGaussian uses heuristic significance scores; RadSplat borrows judgement from a slower but accurate NeRF. The trade is more training time for steeper inference gains.

§8 2026 and onwards

The pattern is consistent: every year someone notices a constant factor and removes it. Tiles got tighter, sorts got finer, atomics got aggregated, bookkeeping got smarter. The remaining big targets:


Memory bandwidth. Even with all the above, the render kernel is memory-bound. SoA layouts (struct-of-arrays) and FP16 attributes are already in flight (EAGLES, Compact-3DGS, Reduced-3DGS).
Hardware rasterization. Several 2025 papers (Petrov et al.) show that you can emit Gaussians as tessellated triangles to the standard hardware rasterizer and let the GPU's fixed-function pipeline do the heavy lifting. Faster on consumer cards, friendlier to VR and AR.
Dynamic and editable scenes. Spacetime Gaussians, Deformable-GS, and 4D-GS extend the primitive to time. The CUDA story is mostly "the same pipeline, indexed by $(\text{gaussian}, t)$" with new tricks for temporal coherence.
Mobile and web. Several open-source WebGL / WebGPU implementations now hit 60 FPS on a phone for small scenes (1M Gaussians). The bottleneck shifts from arithmetic to texture-cache hit rate.


What's striking is how durable the original 2023 architecture has proven. Three years and dozens of papers later, every fast renderer still has a tile loop, a sorted Gaussian list, a front-to-back walk, and an early-out. The constants change. The shape doesn't.

Reading list

Kerbl et al. — 3D Gaussian Splatting (SIGGRAPH 2023) — the foundation paper.
Zwicker et al. — EWA Splatting (IEEE TVCG 2002). Where the 2D covariance projection comes from.
diff-gaussian-rasterization — the original CUDA code, ~1500 lines worth reading in full.
gsplat — the well-engineered modern base.
awesome-3D-gaussian-splatting — community-maintained paper list.
