A visual essay · CUDA + computer graphics · 2026-05
How 3D Gaussian Splatting got fast.
From the 2023 paper that shocked SIGGRAPH to the 2025–2026 systems that train scenes in
seconds. Every major rendering speedup, drawn out in pictures and the bare minimum of code.
Assumed background: basic NeRF / SDF, machine learning, linear algebra, and the chain rule.
You do not need to know CUDA — by the end you'll have intuition for tiles, warps, shared
memory, and atomics, in that order.
24 papers on the timeline · 6 interactive demos · ~1500 lines of canvas-only JS, no WebGL · last updated 2026-05-19
Reading path
Never read the original 3DGS paper? Read §1–§5 in order, do the two demos, then pick the
deep-dives you care about. Already comfortable with 3DGS? Skip to §6
for the timeline, or jump to §7.3 DISTWAR for the most surprising
single CUDA trick.
You already know the NeRF setup: photos in, a volumetric radiance field out, novel views by
integrating $\text{color} \times \text{density}$ along rays. You also know the catch — every pixel of every frame
calls a small MLP a few hundred times. Even Instant-NGP, with its hash grid, struggles to keep
a 1080p frame under 30 ms on consumer cards.
In July 2023, Kerbl, Kopanas, Leimkühler & Drettakis (Inria + MPI) shipped a paper that
simply walked around the problem. They replaced the MLP-evaluated continuous field with an
explicit mixture: a few million anisotropic 3D Gaussians, each carrying a position,
covariance, opacity, and view-dependent color. Volume rendering became splatting —
rasterize each Gaussian's 2D footprint, alpha-composite front-to-back. No MLP in the inner
loop. A custom CUDA rasterizer pushed it to ~135 FPS at 1080p on an RTX A6000,
matching Mip-NeRF 360 quality with ~30 minutes of training instead of days.
The interesting question isn't "what is a radiance field" — you know — but
"what does the CUDA actually do, kernel by kernel, that made this $1000\times$ faster than NeRF
rendering?" And then: every year since, someone has found another constant factor and
removed it. SnugBox tile bounds, warp-aggregated atomics, per-pixel sort, MCMC densification,
hardware rasterization. This essay walks all of it.
Two parameters describe each Gaussian's shape and place: a center
$\boldsymbol{\mu} \in \mathbb{R}^3$ and a $3\times 3$ covariance $\boldsymbol{\Sigma}$
that says how stretched and oriented the blob is. Plus an opacity $o \in (0,1)$ and a color
$\mathbf{c}$ that depends on viewing direction (so reflections work) — more on that in a moment.
Covariance matrices have to be symmetric positive semi-definite. If you optimize the 6
independent entries of $\boldsymbol{\Sigma}$ directly, gradient descent will happily push them
into an invalid region (negative eigenvalues = imaginary ellipsoid). Kerbl et al. dodge
this by storing $\boldsymbol{\Sigma}$ as
$$ \boldsymbol{\Sigma} \;=\; R\,S\,S^{\top}R^{\top} $$
where $S = \mathrm{diag}(s_x, s_y, s_z)$ is a diagonal scale and $R$ is a rotation matrix built
from a unit quaternion $\mathbf{q}$. Any 3D ellipsoid can be written this way, and the
parameterization is automatically valid no matter what gradient lands on $(\mathbf{q}, \mathbf{s})$.
This trick — pick coordinates so the constraints are free — appears again and again in the field.
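A minimal CPU sketch of this parameterization, written in plain C++ so it compiles without nvcc. The function name `covariance_from_qs` and the `(w, x, y, z)` quaternion order are assumptions made for illustration, not the paper's code. It normalizes the quaternion, builds $R$, and forms $\Sigma = (RS)(RS)^\top$, which is symmetric positive semi-definite whatever values the optimizer writes into $\mathbf{q}$ and $\mathbf{s}$.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Mat3 = std::array<std::array<float, 3>, 3>;

// Build Sigma = R S S^T R^T from a quaternion (w, x, y, z) and per-axis scales.
// The quaternion is normalized here, so any nonzero 4-vector is accepted.
Mat3 covariance_from_qs(std::array<float, 4> q, std::array<float, 3> s) {
    const float n = std::sqrt(q[0]*q[0] + q[1]*q[1] + q[2]*q[2] + q[3]*q[3]);
    assert(n > 0.0f);
    const float w = q[0] / n, x = q[1] / n, y = q[2] / n, z = q[3] / n;

    // Standard unit-quaternion to rotation-matrix formula.
    const Mat3 R = {{
        {1.0f - 2.0f * (y*y + z*z), 2.0f * (x*y - w*z),        2.0f * (x*z + w*y)},
        {2.0f * (x*y + w*z),        1.0f - 2.0f * (x*x + z*z), 2.0f * (y*z - w*x)},
        {2.0f * (x*z - w*y),        2.0f * (y*z + w*x),        1.0f - 2.0f * (x*x + y*y)}
    }};

    // M = R S, then Sigma = M M^T (symmetric PSD by construction).
    Mat3 M{}, Sigma{};
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            M[i][j] = R[i][j] * s[j];
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            for (int k = 0; k < 3; ++k)
                Sigma[i][j] += M[i][k] * M[j][k];
    return Sigma;
}
```

With the identity quaternion `(1, 0, 0, 0)` and scales `(1, 2, 3)` the result is `diag(1, 4, 9)`: the scales are standard deviations along the principal axes, so they enter the covariance squared.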
Drag the orange dot to move $\mu$; the cyan / magenta dots to scale along the two principal axes;
the green dot to rotate. The readout updates the $2\times 2$ covariance live. This is the 2D analog of
the 3D parameterization above — the math is identical.
To draw the ellipsoid into a 2D image, project the center $\boldsymbol{\mu}$ through the camera
(standard pinhole) and the covariance too — that's the EWA splatting trick from
Zwicker et al. (2001). If $W$ is the world-to-camera matrix and $J$ is the Jacobian of the
perspective projection at $\boldsymbol{\mu}$, then the screen-space covariance is
$$ \boldsymbol{\Sigma}' \;=\; J\,W\,\boldsymbol{\Sigma}\,W^{\top}J^{\top} $$
Drop the bottom row and right column and you get a $2\times 2$ covariance: the ellipse the 3D
ellipsoid casts onto the image. Every pixel $\mathbf{p}$ within that ellipse gets a contribution
$\alpha = o\cdot G_{\text{2D}}(\mathbf{p})$ from this Gaussian. That's it. That's "splatting."
Slap a rubber ball against a wall — it deforms into an ellipse on impact. EWA does exactly
that: the image plane is the wall, $J W \Sigma W^\top J^\top$ describes how flat-and-wide that
ellipse comes out.
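The projection can be sketched on the CPU as follows. This is an illustration under stated assumptions rather than the reference kernel: `project_covariance` is a hypothetical name, `W` is taken to be only the rotation part of the world-to-camera transform, and the camera is a plain pinhole with focal lengths `fx`, `fy` in pixels. Only the first two rows of $J$ are built, which is equivalent to dropping the bottom row and right column afterwards.

```cpp
#include <array>
#include <cassert>

using Mat3 = std::array<std::array<float, 3>, 3>;
struct Cov2D { float xx, xy, yy; };  // symmetric 2x2, upper triangle

// Top-left 2x2 block of J W Sigma W^T J^T.
// (x, y, z): Gaussian center in camera space; fx, fy: focal lengths in pixels.
Cov2D project_covariance(const Mat3& Sigma, const Mat3& W,
                         float x, float y, float z, float fx, float fy) {
    assert(z > 0.0f);  // only Gaussians in front of the camera
    // First two rows of the Jacobian of (u, v) = (fx * x / z, fy * y / z).
    const float J[2][3] = {
        {fx / z, 0.0f, -fx * x / (z * z)},
        {0.0f, fy / z, -fy * y / (z * z)}
    };
    float T[2][3] = {};   // T = J W
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 3; ++j)
            for (int k = 0; k < 3; ++k)
                T[i][j] += J[i][k] * W[k][j];
    float TS[2][3] = {};  // TS = T Sigma
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 3; ++j)
            for (int k = 0; k < 3; ++k)
                TS[i][j] += T[i][k] * Sigma[k][j];
    float out[2][2] = {}; // out = TS T^T
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 2; ++j)
            for (int k = 0; k < 3; ++k)
                out[i][j] += TS[i][k] * T[j][k];
    return {out[0][0], out[0][1], out[1][1]};
}
```

For a Gaussian on the optical axis the Jacobian reduces to a uniform scale `f / z`, so the footprint shrinks with the square of the distance, which is the rubber-ball picture in numbers.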
View-dependent color via spherical harmonics
NeRFs solve the same problem with a view-direction input to the MLP: gloss and specularity require
the color $\mathbf{c}$ to depend on the viewing direction $\mathbf{d}$. 3DGS handles it without an MLP: store color as spherical
harmonic coefficients up to degree 3 (16 per channel, 48 numbers). Evaluate at the
per-Gaussian viewing direction, get an RGB. Critically this is evaluated once per Gaussian
per frame in the preprocess kernel — not per pixel — so the inner compositor loop sees
only a precomputed color and pays nothing extra for view-dependence.
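To make the per-Gaussian evaluation concrete, here is a sketch restricted to degrees 0 and 1 (4 of the 16 coefficients per channel). The function name `sh_to_channel` is hypothetical, and the sign pattern, the `+0.5` offset, and the clamp at zero are assumptions chosen to mirror the convention commonly used in 3DGS code; treat them as illustrative rather than authoritative.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

constexpr float SH_C0 = 0.28209479177387814f;  // degree-0 basis constant
constexpr float SH_C1 = 0.4886025119029199f;   // degree-1 basis constant

// One color channel from the first four SH coefficients.
// (dx, dy, dz) is the unit direction from the camera to the Gaussian.
float sh_to_channel(const float sh[4], float dx, float dy, float dz) {
    assert(std::fabs(dx*dx + dy*dy + dz*dz - 1.0f) < 1e-3f);
    float v = SH_C0 * sh[0];                                    // view-independent base color
    v += SH_C1 * (-dy * sh[1] + dz * sh[2] - dx * sh[3]);       // linear view dependence
    return std::max(v + 0.5f, 0.0f);                            // offset, then clamp at zero
}
```

Because this runs once per Gaussian per frame, the compositing loop only ever reads the resulting RGB triple.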
§2 Stacking blobs: how a pixel gets its color
Same volume-rendering equation you've used for NeRFs, but evaluated over a sorted list of
Gaussians instead of MLP samples along a ray. A pixel typically sits "under" hundreds of
Gaussians; the renderer sorts them front-to-back by depth and composites:
$$ C \;=\; \sum_{i=1}^{N} \mathbf{c}_i\,\alpha_i\,T_i, \qquad T_i = \prod_{j=1}^{i-1}(1-\alpha_j) $$
$T_i$ is the transmittance — how much light from Gaussian $i$ survives the Gaussians
in front of it. Notice $T_1 = 1$ (nothing in front), and $T_i$ shrinks fast once you hit opaque
Gaussians. As soon as $T$ falls below a threshold ($10^{-4}$ in the paper), no later Gaussian
can contribute meaningfully — the pixel terminates early. This single
observation is the backbone of the entire speed story.
Drag the slider to walk front-to-back through a synthetic stack of Gaussians. Watch $T$ decay
and $C$ accumulate. Hit an opaque cluster — $T$ collapses, the pixel could early-out,
and every Gaussian behind that point contributes essentially nothing. That's "early
termination," the single biggest reason render kernels finish quickly on real scenes.
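The compositing sum and its early-out fit in a few lines. The sketch below is a single-channel CPU version with hypothetical names (`Splat`, `composite`); it assumes the list is already sorted front-to-back and that each `alpha` is the opacity already multiplied by the 2D Gaussian falloff at this pixel.

```cpp
#include <cassert>
#include <vector>

struct Splat { float color; float alpha; };          // single channel for brevity
struct PixelResult { float C; float T; int used; };  // color, final transmittance, splats visited

// Front-to-back alpha compositing with early termination.
PixelResult composite(const std::vector<Splat>& sorted, float t_min = 1e-4f) {
    float C = 0.0f, T = 1.0f;
    int used = 0;
    for (const Splat& s : sorted) {
        C += s.color * s.alpha * T;   // c_i * alpha_i * T_i
        T *= (1.0f - s.alpha);        // T_{i+1} = T_i * (1 - alpha_i)
        ++used;
        if (T < t_min) break;         // nothing behind this point is visible
    }
    return {C, T, used};
}
```

Two half-transparent splats with colors 1 and 0 give `C = 0.5` and `T = 0.25`; a fully opaque first splat stops the loop after one iteration, however long the list is.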
Now we have the pieces. A trained scene has roughly 1–6 million Gaussians. To render one frame,
the original diff-gaussian-rasterization CUDA pipeline does six things:
Render. One CUDA block per tile; its 256 threads cooperatively walk the tile's Gaussians front-to-back.
Backward. Run the same walk in reverse, accumulating gradients with atomic adds.
The single most important architectural decision is the tile. Instead of asking
"for each pixel, which Gaussians touch me?" (a many-to-many nightmare), the renderer asks "for
each $16\times 16$ tile, which Gaussians touch me?" and a CUDA block of 256 threads handles
that tile in lockstep. $16\times 16 = 256$ threads = exactly 8 warps = one perfectly-sized CUDA block.
Drag the Gaussian (center, principal axes, rotation handle). Tiles its 2D ellipse actually
overlaps light up in burnt orange. Tiles inside the loose AABB but missed by the true ellipse
light up yellow — those are wasted entries the original pipeline queues anyway. The
Speedy-Splat paper (§7.4) attacks exactly that waste.
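The loose bound can be sketched as a clamped range of tile indices. This is an illustration, not the reference code: `touched_tiles` and `TileRect` are hypothetical names, `radius` is the half-width in pixels of the square placed around the projected center, and ranges are half-open.

```cpp
#include <algorithm>
#include <cassert>

constexpr int TILE = 16;  // tile edge in pixels

struct TileRect { int x0, y0, x1, y1; };  // half-open tile ranges [x0, x1) x [y0, y1)

static int clamp_tile(int t, int limit) { return std::min(limit, std::max(0, t)); }

// Tiles overlapped by the square of half-width `radius` around pixel (cx, cy),
// clamped to a grid of tiles_x by tiles_y tiles.
TileRect touched_tiles(float cx, float cy, float radius, int tiles_x, int tiles_y) {
    assert(radius >= 0.0f && tiles_x > 0 && tiles_y > 0);
    TileRect r;
    r.x0 = clamp_tile(static_cast<int>((cx - radius) / TILE), tiles_x);
    r.y0 = clamp_tile(static_cast<int>((cy - radius) / TILE), tiles_y);
    r.x1 = clamp_tile(static_cast<int>((cx + radius + TILE - 1) / TILE), tiles_x);
    r.y1 = clamp_tile(static_cast<int>((cy + radius + TILE - 1) / TILE), tiles_y);
    return r;
}

// Number of (Gaussian, tile) entries this Gaussian adds to the sort list.
int tile_count(const TileRect& r) { return (r.x1 - r.x0) * (r.y1 - r.y0); }
```

A Gaussian centered at pixel `(40, 40)` with radius 10 covers tiles 1 to 3 on each axis and so queues 9 entries, even if its true ellipse is a thin sliver touching only a few of them.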
Here is the cleverest line in the paper. The duplicate-list keys are 64 bits — high 32 bits are
the tile id, low 32 bits are the depth (as a float bit-pattern). One
cub::DeviceRadixSort call sorts the whole thing. The result?
Every tile's Gaussians are now contiguous in memory AND sorted by depth.
No per-tile sort. No two-stage anything. A global sort of ~10–30 M keys runs in a few hundred
microseconds on a modern GPU.
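// Device side, one thread per (Gaussian, tile) entry:
// pack tile id into the high 32 bits, depth bits into the low 32.
// Depth is positive for visible Gaussians, so its IEEE-754 bit
// pattern sorts in the same order as the float itself.
uint64_t key = ((uint64_t)tile_id << 32) | __float_as_uint(depth);
keys_unsorted[idx] = key;
values_unsorted[idx] = gaussian_id;

// Host side: one global sort. Without begin/end bit arguments CUB
// sorts on all 64 key bits. `workspace` is sized by a prior call
// that passes a null workspace pointer.
cub::DeviceRadixSort::SortPairs(
    workspace, workspace_bytes,
    keys_unsorted, keys_sorted,
    values_unsorted, values_sorted,
    n_entries);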
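The same key layout can be checked on the CPU. In this sketch `std::sort` stands in for the GPU radix sort and `float_bits` stands in for `__float_as_uint`; both substitutions, and the names `make_key` and `sorted_ids`, are choices made for illustration. The trick relies on depths being non-negative, because only then does the unsigned integer order of the bit patterns match the float order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

using Entry = std::pair<std::uint64_t, int>;  // (key, gaussian_id)

// Host stand-in for __float_as_uint: reinterpret the float's bits.
std::uint32_t float_bits(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

// Tile id in the high 32 bits, depth bit pattern in the low 32 bits.
std::uint64_t make_key(std::uint32_t tile_id, float depth) {
    assert(depth >= 0.0f);  // bit order equals float order only for non-negative floats
    return (static_cast<std::uint64_t>(tile_id) << 32) | float_bits(depth);
}

// Sort by key and return the Gaussian ids in the resulting order.
std::vector<int> sorted_ids(std::vector<Entry> entries) {
    std::sort(entries.begin(), entries.end());
    std::vector<int> ids;
    for (const Entry& e : entries) ids.push_back(e.second);
    return ids;
}
```

After the sort, all entries of tile 0 come first in increasing depth, then all entries of tile 1, and so on, so each tile only needs the start and end offsets of its run.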
Rendering a tile
The render kernel launches one block per tile. Inside the block, 256 threads share the work of
loading Gaussians from global memory into shared memory in batches of 256;
every thread (one per pixel) walks through that batch front-to-back, accumulating its own
pixel's color. When every thread in the block hits $T \lt 10^{-4}$, the block exits.
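// One block per tile, 256 threads (16x16 pixels), BATCH == 256.
__shared__ Gaussian batch[BATCH];                 // filled cooperatively
const int tid = threadIdx.y * blockDim.x + threadIdx.x;
const int block_size = blockDim.x * blockDim.y;
float T = 1.0f, C[3] = {0, 0, 0};
bool done = false;                                // this pixel has saturated
for (int b = start; b < end; b += BATCH) {
    // Barrier plus vote: leave once every pixel in the tile is done.
    // It also keeps the next load from overwriting a batch still in use.
    if (__syncthreads_count(done) == block_size) break;
    // Cooperative load: each thread fetches one Gaussian of the batch.
    if (b + tid < end)
        batch[tid] = gaussians[sorted_ids[b + tid]];
    __syncthreads();
    const int n = min(BATCH, end - b);            // last batch may be partial
    for (int k = 0; !done && k < n; ++k) {
        float g = eval_2d_gaussian(batch[k], pixel);  // exp(-1/2 x^T Σ⁻¹ x)
        float a = batch[k].opacity * g;
        C[0] += batch[k].color[0] * a * T;
        C[1] += batch[k].color[1] * a * T;
        C[2] += batch[k].color[2] * a * T;
        T *= (1.0f - a);
        if (T < 1e-4f) done = true;               // early termination for this pixel
    }
}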
Three details to notice. (1) Each Gaussian is fetched from slow global memory
once per tile, not once per pixel: shared memory turns 256 redundant loads into one.
(2) The pixel loop has no data-dependent branches except the early-out, so the warp stays
coherent. (3) Once every pixel in a tile has finished, the whole block exits on a
single __syncthreads_count, a block-wide vote that costs about as much as a plain barrier.
§4 The backward pass: chain rule, by hand
What makes 3DGS a learning system is that the rasterizer is differentiable. Every step
above — the SH evaluation, the EWA projection, the alpha composition — has a hand-written
backward kernel. PyTorch's autograd never traces through the rasterizer; the CUDA kernels compute the gradients directly.
The trick: instead of recording every intermediate value (which would cost gigabytes), the
backward pass replays the forward composition in reverse, using two stored pieces of
state per pixel — final $T$ and the index of the last Gaussian that contributed. From those two
numbers it can reconstruct every $T_i$ on the fly via $T_i = T_{i+1} / (1 - \alpha_i)$.
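The replay can be sketched on the CPU for a single pixel. The names `final_transmittance` and `replay_transmittance` are hypothetical. The division requires every $\alpha_i < 1$, so the sketch asserts it; a real implementation has to guarantee this bound some other way, for example by clamping alpha in the forward pass.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Forward pass: only the final transmittance is kept.
float final_transmittance(const std::vector<float>& alpha) {
    float T = 1.0f;
    for (float a : alpha) T *= (1.0f - a);
    return T;
}

// Backward pass: walk last-to-first and recover T_i, the transmittance
// in front of Gaussian i, using T_i = T_{i+1} / (1 - alpha_i).
std::vector<float> replay_transmittance(const std::vector<float>& alpha, float T_final) {
    std::vector<float> T(alpha.size());
    float t = T_final;
    for (std::size_t i = alpha.size(); i-- > 0;) {
        assert(alpha[i] < 1.0f);  // otherwise the division is undefined
        t /= (1.0f - alpha[i]);
        T[i] = t;
    }
    return T;
}
```

For alphas `{0.5, 0.25, 0.5}` the forward pass stores only `0.1875`, and the reverse walk recovers `0.375`, `0.5`, and `1.0` in that order, which are exactly the $T_i$ the forward pass used.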
The unpleasant part: many Gaussians contribute to many pixels, so the gradient updates
$\partial L / \partial \mu_i$ for one Gaussian arrive from many threads at once. The original
implementation uses atomicAdd on global memory — correct, but a contention
nightmare. DISTWAR and gsplat's fused backward later attack exactly this (see
§7.3).
Why $\exp$, and not a more general kernel? Because $\exp(-\tfrac{1}{2} \mathbf{x}^{\!\top}\Sigma^{-1}\mathbf{x})$
has a closed-form gradient with respect to $\Sigma$, $\mu$, and the pixel coordinate. GES (Hamdi et al.,
CVPR 2024) later showed that a generalized exponential, with the same gradient story and a sharper
falloff, can match quality with roughly half as many primitives. The math is a friend, not a master.
Train time per scene to SOTA quality
Final Gaussians per scene: ~1–6 M
Disk size for a 5M-Gaussian scene: ~1 GB
NeRF at the time: well under 1 FPS at comparable quality, days of training, a few hundred MB of network weights. 3DGS was
three orders of magnitude faster to render and $\sim 50\times$ faster to train. And the file you
shipped was not a black-box neural network — it was a transparent point cloud you could open in
a debugger.
§6 Three years of speedups, on one rail
The 2023 paper opened a floodgate. Below is a curated timeline of the works that pushed
rendering or training efficiency the most. Click any entry for the one-line CUDA idea.
§7 The CUDA tricks, one by one
Picking from the timeline, six ideas have done the most to make 3DGS render faster without
changing what 3DGS fundamentally is. Each is its own little CUDA lesson.
7.1 · gsplat: the open rewrite
gsplat: a clean PyTorch+CUDA rewrite for 3DGS
Nerfstudio · 2024 onwards · Ye, Turkulainen, Kerr, et al. · arXiv:2409.06765
The original Inria rasterizer was research-grade C++ with PyTorch hooks. Hard to extend,
hard to fuse new ideas into.
Key idea. gsplat is the production rewrite: clean Python API, fused
forward+backward, a tighter projection bounding box, packed/sparse modes, and a plug-in
registry. Many new 3DGS papers now build on gsplat. Two CUDA-level wins worth knowing:
Tighter screen-space bound. The original used an axis-aligned bounding box
around a $3\sigma$ circle around the projected ellipse. gsplat uses the actual ellipse's tight
AABB, often cutting touched-tile count by 30–50%.
Fused backward. Computing $\partial L / \partial \Sigma$ and
$\partial L / \partial \mu$ in the same pass reduces global-memory traffic; gsplat's
backward is $\sim 1.5\text{–}2\times$ faster than the original for typical scenes.
Original Inria code: monolithic, hard to mod. gsplat: clean Python+CUDA API, fused passes,
de-facto base of every paper after mid-2024.
Demo 4 · Loose vs tight tile bound
Same ellipse, two ways to bound it. The yellow tiles on the left are wasted work in vanilla
3DGS; gsplat (and later Speedy-Splat) skip them. The cost of skipping is ~one ellipse-test
per candidate tile.
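The difference between the two bounds is easy to quantify. The sketch below compares the square around a $3\sigma$ circle (radius from the larger eigenvalue) with the axis-aligned box of the $3\sigma$ ellipse itself, whose half-widths are $3\sqrt{\Sigma_{xx}}$ and $3\sqrt{\Sigma_{yy}}$. It is one way to obtain a tight box and is not claimed to be gsplat's exact code; `loose_extent` and `tight_extent` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct Extent { float hx, hy; };  // half-widths of the bounding box, in pixels

// 2D covariance [[a, b], [b, c]].
// Loose: square around a circle of radius 3 * sqrt(lambda_max).
Extent loose_extent(float a, float b, float c) {
    assert(a > 0.0f && c > 0.0f && a * c - b * b > 0.0f);
    const float mid = 0.5f * (a + c);
    const float det = a * c - b * b;
    const float lambda_max = mid + std::sqrt(std::max(0.0f, mid * mid - det));
    const float r = 3.0f * std::sqrt(lambda_max);
    return {r, r};
}

// Tight: axis-aligned box of the ellipse x^T Sigma^{-1} x = 9.
Extent tight_extent(float a, float b, float c) {
    assert(a > 0.0f && c > 0.0f && a * c - b * b > 0.0f);
    return {3.0f * std::sqrt(a), 3.0f * std::sqrt(c)};
}
```

For a 16:1 covariance `diag(16, 1)` the loose box is 24 by 24 pixels while the tight one is 24 by 6, a 4x reduction in area before any per-tile ellipse test is applied.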
7.2 · StopThePop
Sorting once per view by the depth of each Gaussian's center is wrong. The right order for a pixel is by where the
Gaussian's center projects along that pixel's ray, which differs across pixels in a tile.
Move the camera and the chosen order can flip, producing the visible "popping" artifact 3DGS
is famous for.
Key idea. Sort per pixel. Naively that's catastrophic ($256\times$ more
sorts per tile). The trick is hierarchical: a coarse per-tile sort first, then a tiny
insertion-sorted window of size 4 per pixel. The pop is gone, and the cost is only ~10% over
baseline. A great example of "do the expensive thing, but only on the small set that needs it."
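// Per-pixel resorting window of K = 4 entries, plus one spare slot for the insert
constexpr int K = 4;
struct Entry { Gaussian g; float d; };
Entry queue[K + 1];
int qlen = 0;
for (each gaussian g in tile order) {
    float d_pix = depth_along_ray(g, pixel);
    // Insertion sort: shift farther entries right so queue[0] stays nearest.
    int pos = qlen;
    while (pos > 0 && queue[pos-1].d > d_pix) {
        queue[pos] = queue[pos-1]; --pos;
    }
    queue[pos] = {g, d_pix};
    // Window over capacity: the nearest entry is final, so composite and drop it.
    if (++qlen > K) { composite(queue[0]); shift_left(queue); --qlen; }
}
// Flush what is left in the window, nearest first.
for (int i = 0; i < qlen; ++i) composite(queue[i]);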
Vanilla: one global sort, popping. StopThePop: same global sort + a 4-slot per-pixel buffer,
no popping, ~10% slower. Or $1.6\times$ faster if you co-train Gaussians for consistency.
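The sliding window is easiest to understand on bare depth values. The CPU sketch below applies the same insert-then-pop logic to a stream of floats; `window_resort` is a hypothetical name and the code illustrates the windowing idea only, not the paper's full hierarchy.

```cpp
#include <cassert>
#include <vector>

constexpr int K = 4;  // window capacity

// Re-sort a nearly sorted depth stream with a K-entry sliding window.
// Emits the nearest entry whenever the window exceeds K entries.
std::vector<float> window_resort(const std::vector<float>& depths) {
    float window[K + 1];  // one spare slot so the insert never overflows
    int len = 0;
    std::vector<float> out;
    for (float d : depths) {
        int pos = len;
        while (pos > 0 && window[pos - 1] > d) {  // insertion sort step
            window[pos] = window[pos - 1];
            --pos;
        }
        window[pos] = d;
        if (++len > K) {
            out.push_back(window[0]);             // nearest entry is final
            for (int i = 1; i < len; ++i) window[i - 1] = window[i];
            --len;
        }
    }
    for (int i = 0; i < len; ++i) out.push_back(window[i]);  // flush
    return out;
}
```

An input whose entries are at most a few places out of order, such as `{2, 1, 4, 3, 6, 5, 8, 7}`, comes out fully sorted. An entry that arrives more than `K` places late, such as the `1` in `{5, 6, 7, 8, 9, 1}`, cannot be repaired, which is why the window only works on top of a coarse sort that is already nearly right.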
7.3 · DISTWAR
During backward, every pixel that touched Gaussian $g$ wants to atomicAdd to
grad[g]. In a $16\times 16$ tile, that's potentially 256 atomic adds to the same
address. Atomics serialize. The backward pass spends 30–60% of its time waiting on these.
Key idea. Aggregate within a warp first. Use warp-level shuffles
(__shfl_xor_sync) to sum the 32 contributions across threads of a warp, then have
one thread per warp do a single atomicAdd. $32\times$ fewer atomics. The paper reports $2.44\times$ average,
up to $5.7\times$, backward speedup on contention-heavy scenes. gsplat now does this by default.
// Warp-level reduction before atomicAdd
float v = local_grad_mu_x;
#pragma unroll
for (int off = 16; off > 0; off >>= 1)
v += __shfl_xor_sync(0xffffffff, v, off);
if ((threadIdx.x & 31) == 0) // lane 0 of each warp
atomicAdd(&grad_mu_x[g_id], v);
Demo 5 · Atomic contention, before and after
32 threads, 1 target memory address. Naive mode: 32 serialized atomics, each waiting for the
previous. Warp-reduce mode: 1 atomic after a register-only butterfly reduction. Same answer,
$32\times$ less contention. This is the most under-appreciated CUDA trick in the entire 3DGS
literature.
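The butterfly pattern can be emulated on the CPU to see why five rounds are enough. In the sketch below each array slot plays the role of one lane's register and `v[lane ^ off]` plays the role of `__shfl_xor_sync`; `butterfly_sum` is a hypothetical name. The reduction is only valid when all 32 lanes are contributing to the same destination address.

```cpp
#include <array>
#include <cassert>

constexpr int WARP = 32;

// Emulate the xor-shuffle butterfly: in each round, every lane adds the
// value held by the lane whose index differs in exactly one bit.
std::array<float, WARP> butterfly_sum(std::array<float, WARP> v) {
    for (int off = WARP / 2; off > 0; off >>= 1) {
        std::array<float, WARP> next = v;
        for (int lane = 0; lane < WARP; ++lane)
            next[lane] = v[lane] + v[lane ^ off];  // partner lane's register
        v = next;
    }
    return v;  // every lane now holds the full sum
}
```

After offsets 16, 8, 4, 2, and 1 every lane holds the sum of all 32 inputs, so any single lane can issue the one remaining atomic add.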
7.4 · Speedy-Splat
Everything up to §7.1 still over-counts tiles. The original asks "what's the AABB of the
ellipse?" and adds every tile inside that box. Many of those tiles only contain a corner of
the box, not the ellipse itself.
Key idea. SnugBox tests each candidate tile against the actual
conic, dropping ~50% of false positives. AccuTile walks only the tiles SnugBox
accepts. Combined with smarter pruning (drop low-contribution Gaussians during training),
Speedy-Splat reports $6.7\times$ faster rendering while matching baseline PSNR. A
great case study: the bottleneck was bookkeeping, not arithmetic.
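To illustrate what testing a tile against the actual ellipse can look like, here is a generic exact ellipse-versus-rectangle test. It is not the paper's SnugBox or AccuTile algorithm; the names `Conic` and `ellipse_touches_tile` and the cutoff of 9 (a $3\sigma$ ellipse) are assumptions for this sketch. It finds the minimum of the quadratic form over the tile: zero if the center lies inside, otherwise the smallest value on the four edges, each found by clamping a 1D parabola's minimizer.

```cpp
#include <algorithm>
#include <cassert>

// Inverse 2D covariance: q(dx, dy) = A dx^2 + 2 B dx dy + C dy^2.
struct Conic { float A, B, C; };

static float quad(const Conic& k, float dx, float dy) {
    return k.A * dx * dx + 2.0f * k.B * dx * dy + k.C * dy * dy;
}

static float clampf(float v, float lo, float hi) {
    return std::max(lo, std::min(v, hi));
}

// True if the ellipse q <= cutoff around (mx, my) overlaps [x0, x1] x [y0, y1].
bool ellipse_touches_tile(const Conic& k, float mx, float my,
                          float x0, float y0, float x1, float y1,
                          float cutoff = 9.0f) {
    assert(k.A > 0.0f && k.C > 0.0f && k.A * k.C - k.B * k.B > 0.0f);
    if (mx >= x0 && mx <= x1 && my >= y0 && my <= y1) return true;  // center inside
    const float lx = x0 - mx, hx = x1 - mx, ly = y0 - my, hy = y1 - my;
    const float xs[2] = {lx, hx};
    const float ys[2] = {ly, hy};
    float best = quad(k, lx, ly);
    // Vertical edges: dx fixed, minimize over dy, clamped to the edge.
    for (float dx : xs)
        best = std::min(best, quad(k, dx, clampf(-k.B * dx / k.C, ly, hy)));
    // Horizontal edges: dy fixed, minimize over dx, clamped to the edge.
    for (float dy : ys)
        best = std::min(best, quad(k, clampf(-k.B * dy / k.A, lx, hx), dy));
    return best <= cutoff;
}
```

For a covariance `[[5, 4], [4, 5]]` (an ellipse stretched along the diagonal) a tile at `[5, 10] x [5, 10]` is hit, while the mirrored tile at `[5, 10] x [-10, -5]` sits inside the ellipse's bounding box but is rejected.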
7.5 · MCMC densification
Not a render-time speedup but a train-time one. The original paper's densification
("clone Gaussians with large gradients, split Gaussians with large variance, reset opacity
every N steps") is a pile of well-tuned heuristics that are brittle across scenes. MCMC
reframes the entire optimization as sampling from a posterior over Gaussians: a "dead"
low-opacity Gaussian is teleported onto a live one, chosen with probability proportional to its opacity. No more
clone/split rules. Training is more stable and the final scene has fewer wasted Gaussians,
which makes rendering faster too.
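The opacity-proportional choice of a relocation target can be sketched as an inverse-CDF lookup. `sample_by_opacity` is a hypothetical helper that takes a uniform random number `u` in `[0, 1)` from the caller, which keeps the sketch deterministic. Any adjustment of the relocated Gaussian's parameters after the move is outside the scope of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Pick an index with probability proportional to its opacity.
// u is a uniform random number in [0, 1) supplied by the caller.
std::size_t sample_by_opacity(const std::vector<float>& opacity, float u) {
    assert(!opacity.empty() && u >= 0.0f && u < 1.0f);
    float total = 0.0f;
    for (float o : opacity) total += o;
    assert(total > 0.0f);
    const float target = u * total;
    float acc = 0.0f;
    for (std::size_t i = 0; i < opacity.size(); ++i) {
        acc += opacity[i];               // running cumulative sum
        if (target < acc) return i;      // first index whose CDF exceeds target
    }
    return opacity.size() - 1;           // guard against rounding at the top end
}
```

With opacities `{0.1, 0.0, 0.9}` the zero-opacity entry is never chosen and the third entry receives about 90% of the draws.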
7.6 · LightGaussian and RadSplat
Most Gaussians in a trained scene contribute almost nothing to the final image. LightGaussian
ranks each Gaussian by a "global significance" score (sum of $\alpha\cdot T$ over training
views) and prunes the bottom 66%. Then it vector-quantizes SH coefficients and distills to
lower SH degrees. Result: $15\times$ smaller files, $\sim 2\times$ faster
rendering, near-identical PSNR.
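The pruning step itself is a rank-and-cut. The sketch below assumes the per-Gaussian scores (the accumulated $\alpha\cdot T$ over training views) have already been computed; `keep_mask` is a hypothetical name and the code shows only the selection, not the quantization or distillation stages.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Mark the lowest-scoring fraction of Gaussians for removal.
// Returns keep[i] == true for Gaussians that survive.
std::vector<bool> keep_mask(const std::vector<float>& score, float prune_fraction) {
    assert(prune_fraction >= 0.0f && prune_fraction <= 1.0f);
    const std::size_t n = score.size();
    const std::size_t n_prune =
        static_cast<std::size_t>(prune_fraction * static_cast<float>(n));
    // Indices sorted by ascending score.
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::sort(order.begin(), order.end(),
              [&score](std::size_t a, std::size_t b) { return score[a] < score[b]; });
    std::vector<bool> keep(n, true);
    for (std::size_t i = 0; i < n_prune; ++i) keep[order[i]] = false;
    return keep;
}
```

Calling it with `prune_fraction = 0.66f` corresponds to the "bottom 66%" cut described above; every pruned Gaussian is one fewer entry in each tile list it used to touch.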
RadSplat goes further by using a NeRF as a teacher to decide which Gaussians matter, hitting
900+ FPS at 1080p on the original Mip-NeRF 360 scenes. Same scene, same
quality, $\sim 10\times$ the framerate of vanilla 3DGS.
LightGaussian uses heuristic significance scores; RadSplat borrows judgement from a slower but
accurate NeRF. The trade is more training time for steeper inference gains.
The pattern is consistent: every year someone notices a constant factor and removes it. Tiles
got tighter, sorts got finer, atomics got aggregated, bookkeeping got smarter. The remaining
big targets:
Memory bandwidth. Even with all the above, the render kernel is
memory-bound. SoA layouts (struct-of-arrays) and FP16 attributes are already in flight
(EAGLES, Compact-3DGS, Reduced-3DGS).
Hardware rasterization. Several 2025 papers (Petrov et al.) show that you
can emit Gaussians as tessellated triangles to the standard hardware rasterizer and let the
GPU's fixed-function pipeline do the heavy lifting. Faster on consumer cards, friendlier to
VR and AR.
Dynamic and editable scenes. Spacetime Gaussians, Deformable-GS, and 4D-GS
extend the primitive to time. The CUDA story is mostly "the same pipeline, indexed by
$(\text{gaussian}, t)$" with new tricks for temporal coherence.
Mobile and web. Several open-source WebGL / WebGPU implementations now hit
60 FPS on a phone for small scenes (1M Gaussians). The bottleneck shifts from arithmetic to
texture-cache hit rate.
What's striking is how durable the original 2023 architecture has proven. Three years and
dozens of papers later, every fast renderer still has a tile loop, a sorted Gaussian list, a
front-to-back walk, and an early-out. The constants change. The shape doesn't.