A visual essay · the foundations of 3DGS

3D Gaussian Splatting,
built from scratch.

You know NeRF. You've heard "splatting is faster." This essay walks you through the why and the how: every ingredient of a 3DGS scene, drawn out by hand, from a single ellipsoid to a full optimization loop.

For readers who know NeRF/SDF and basic linear algebra. No CUDA needed.

§ 0 · The premise

From rays to splats

A NeRF stores the world as a function: feed it a 3D point and a viewing direction, get back a color and a density. To render a pixel you trace its ray, sample the function a few hundred times along the way, and combine the samples with the volume rendering integral. It works, beautifully, but it is brute force: a small MLP gets queried a few hundred million times per frame.

3D Gaussian Splatting (Kerbl, Kopanas, Leimkühler & Drettakis, SIGGRAPH 2023) flipped the question. Instead of asking "what's at point x?", it asks "which Gaussian blobs touch this pixel?". The world is no longer an opaque function — it's an explicit cloud of a few million tiny, fuzzy ellipsoids. Rendering becomes projection plus alpha compositing, the kind of thing a rasterizer is born to do.

NeRF · implicit field

  • Scene = weights of an MLP (~5 MB).
  • Render: cast a ray, query MLP at ~256 points, integrate.
  • 1–10 FPS at 1080p on a good GPU. Days of training.
  • Quality is excellent; latency is the killer.

3DGS · explicit primitives

  • Scene = list of 1–6 M Gaussians (~1 GB). Each has μ, Σ, α, SH color.
  • Render: project each Gaussian to 2D, sort by depth, composite.
  • 100+ FPS at 1080p. ~30 min training. Same quality as Mip-NeRF 360.
  • You can open the scene in a debugger and look at the points.

The interesting question isn't "which one is better" — both rest on the same volume rendering equation. The interesting question is what was the right set of design choices that let you swap an MLP for an explicit primitive without giving up the photorealism. This essay walks every one of those choices, in the order you'd discover them if you tried to invent 3DGS yourself.

§ 1 · The atom

One Gaussian, dissected

Forget neural networks for a moment. The smallest unit of a 3DGS scene is a 3D Gaussian blob — a fuzzy ellipsoid in space:

$$ G(\mathbf{x}) \;=\; \exp\!\Big(-\tfrac{1}{2}\,(\mathbf{x}-\boldsymbol{\mu})^{\!\top}\,\boldsymbol{\Sigma}^{-1}\,(\mathbf{x}-\boldsymbol{\mu})\Big) $$

Two parameters give it shape and place: the center \(\boldsymbol{\mu}\in\mathbb{R}^3\) is where it lives; the 3×3 covariance \(\boldsymbol{\Sigma}\) says how stretched, how rotated, and how anisotropic it is. Two more parameters dress it up: an opacity \(\alpha\in(0,1)\), and a view-dependent color \(c(\mathbf{d})\) that varies with the direction you look from (so reflections can work). We'll meet \(c(\mathbf{d})\) in §4.

Why a Gaussian, of all functions? Three reasons that compound:

  1. Closed-form projection. The 2D image of a 3D Gaussian under a linear-ish camera is, to first order, another Gaussian. We get a 2D ellipse for free. (EWA splatting, §3.)
  2. Differentiable everywhere. Smooth in μ, smooth in Σ. Gradients land cleanly, no kinks, no NaNs.
  3. Compact support, in practice. Outside about 3σ the contribution is negligible, so we can clip rendering to a tight footprint per Gaussian. That's what makes splatting fast.
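
To put a number on the third reason, here is a minimal sketch that evaluates \(G(\mathbf{x})\) at the center, at 1σ and at 3σ. It assumes an axis-aligned Gaussian, so that \(\boldsymbol{\Sigma}^{-1}\) reduces to one reciprocal per axis; the helper name `gaussian_value` is made up for this illustration.

```python
import math

def gaussian_value(x, mu, sigmas):
    """Unnormalized Gaussian G(x) for a diagonal covariance diag(sigmas**2)."""
    # Squared Mahalanobis distance: sum of ((x_i - mu_i) / sigma_i)^2.
    d2 = sum(((xi - mi) / si) ** 2 for xi, mi, si in zip(x, mu, sigmas))
    return math.exp(-0.5 * d2)

mu = (0.0, 0.0, 0.0)
sigmas = (2.0, 1.0, 0.5)   # standard deviations along x, y, z (illustrative values)

at_center = gaussian_value(mu, mu, sigmas)
at_1_sigma = gaussian_value((2.0, 0.0, 0.0), mu, sigmas)   # one sigma along x
at_3_sigma = gaussian_value((6.0, 0.0, 0.0), mu, sigmas)   # three sigma along x
print(at_center, at_1_sigma, at_3_sigma)
```

At three standard deviations the value is \(e^{-4.5}\), roughly 1% of the peak, which is why clipping each footprint there costs almost nothing visually.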

Interactive · sculpt a 2D Gaussian

Drag the handles. The dot is μ. The side handles set the principal scales. The arc rotates. Watch how Σ updates and how the ellipse changes shape.

Σ = …

A 3D Gaussian is the same idea with one more axis. The 2D shape you sculpt in the box above is the kind of footprint a 3D ellipsoid leaves on the screen: projecting a Gaussian onto the image plane (integrating it along the viewing direction, rather than slicing it) gives another Gaussian, as §3 shows.

§ 2 · The Σ trick

Why we never optimize Σ directly

Here's a trap. Σ is a 3×3 symmetric matrix, so it has 6 unique entries — you might think you can just put those 6 numbers in your parameter vector and let SGD figure it out. Don't. A covariance matrix must be symmetric positive semi-definite (PSD); gradient descent doesn't know that, and the moment one of your eigenvalues goes negative your "ellipsoid" turns imaginary.

The 3DGS paper sidesteps the constraint by storing Σ as

$$ \boldsymbol{\Sigma} \;=\; R\,S\,S^{\!\top}\,R^{\!\top}, \qquad S = \mathrm{diag}(s_x, s_y, s_z), \quad R \text{ from a unit quaternion } \mathbf{q}. $$

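Two facts make this great. (a) Every symmetric positive-definite Σ can be written this way: it is the eigendecomposition \(\boldsymbol{\Sigma} = R\,\Lambda\,R^{\!\top}\) with \(\Lambda = S^2\), where the eigenvectors are chosen so that \(R\) is a proper rotation. So we don't lose any expressivity. (b) Any choice of \((\mathbf{q}, \mathbf{s})\) with \(\mathbf{q}\neq 0\) gives a valid Σ once \(\mathbf{q}\) is normalized, even an absurd one. The constraint vanishes; gradient descent operates in an unconstrained space.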

This pattern — pick coordinates so the constraint is free — recurs everywhere in the field. SDFs use it (any scalar field is a valid SDF candidate, you re-distance after). Rotations use it (quaternions instead of Euler angles to dodge gimbal lock). Variational autoencoders use it (reparameterize σ as \(\exp(\log\sigma)\) so you can never sample a negative variance). 3DGS just applies it to ellipsoid shape.
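
Claim (b) is easy to check numerically. The sketch below draws arbitrary quaternions and scales (including negative ones), builds \(\boldsymbol{\Sigma} = R\,S\,S^{\!\top}R^{\!\top}\), and records the smallest quadratic form \(\mathbf{v}^{\!\top}\boldsymbol{\Sigma}\,\mathbf{v}\) it sees. The helper names and the sampling ranges are assumptions for illustration, not anything from the paper's code.

```python
import math
import random

def quat_to_rot(q):
    """3x3 rotation matrix from a quaternion (w, x, y, z); normalizes first."""
    n = math.sqrt(sum(c * c for c in q))
    w, x, y, z = (c / n for c in q)
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]

def covariance(q, s):
    """Sigma = R S S^T R^T with S = diag(s)."""
    R = quat_to_rot(q)
    return [[sum(R[i][k] * s[k] ** 2 * R[j][k] for k in range(3))
             for j in range(3)] for i in range(3)]

def quad_form(M, v):
    """v^T M v for a 3x3 matrix M."""
    return sum(v[i] * M[i][j] * v[j] for i in range(3) for j in range(3))

random.seed(0)
min_quad = float("inf")   # smallest v^T Sigma v seen; should never be negative
max_asym = 0.0            # largest |Sigma_ij - Sigma_ji| seen; should be ~0
for _ in range(1000):
    q = [random.gauss(0, 1) for _ in range(4)]      # arbitrary, unnormalized
    s = [random.uniform(-3, 3) for _ in range(3)]   # even negative "scales"
    Sigma = covariance(q, s)
    v = [random.gauss(0, 1) for _ in range(3)]
    min_quad = min(min_quad, quad_form(Sigma, v))
    max_asym = max(max_asym, max(abs(Sigma[i][j] - Sigma[j][i])
                                 for i in range(3) for j in range(3)))
print(min_quad, max_asym)
```

Up to floating-point rounding, the quadratic form never goes negative and the matrix stays symmetric, whatever the optimizer does to the raw parameters.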

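# How the paper actually stores a Gaussian: 11 floats of geometry and opacity, plus 48 floats of SH color.
import torch

gaussian = dict(
    mu      = torch.zeros(3),                      # position (3)
    q       = torch.tensor([1.0, 0.0, 0.0, 0.0]),  # rotation as quaternion (w, x, y, z) (4)
    s_log   = torch.zeros(3),                      # log-scale per axis (3); exp() recovers s
    a_logit = torch.zeros(1),                      # opacity in logit space (1); sigmoid() gives α
    sh      = torch.zeros(48),                     # spherical harmonics color (16 per channel × 3)
)

def quat_to_R(q):
    w, x, y, z = q / q.norm()                      # normalize so any nonzero q is a rotation
    return torch.stack([
        torch.stack([1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)]),
        torch.stack([2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)]),
        torch.stack([2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)]),
    ])

# Σ is never stored. We recompute it on the fly when needed:
def sigma(g):
    R = quat_to_R(g["q"])
    S = torch.diag(g["s_log"].exp())
    return R @ S @ S.T @ R.T                       # 3x3, always PSD by construction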

Two more reparameterizations in that snippet are worth flagging. \(\alpha\) is stored as its logit, so SGD can push it anywhere in \(\mathbb{R}\) and a sigmoid maps it back to \((0,1)\). Scales are stored as log-scale so \(\exp(\cdot)\) keeps them positive. Same trick, different constraint. None of these are arbitrary: they're the bare minimum to make a Gaussian's parameters well-conditioned for gradient descent.
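
A few lines are enough to see both maps at work. This is only a sketch with made-up raw values; the point is that any real number the optimizer lands on decodes to a legal opacity or scale.

```python
import math

def sigmoid(x):
    """Map any real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    """Inverse of sigmoid, defined for p in (0, 1)."""
    return math.log(p / (1.0 - p))

# Raw, unconstrained parameters as the optimizer sees them (illustrative values).
raw_opacity = [-8.0, -1.0, 0.0, 2.5, 8.0]
raw_log_scale = [-6.0, -1.0, 0.0, 1.0, 3.0]

opacities = [sigmoid(a) for a in raw_opacity]    # always strictly inside (0, 1)
scales = [math.exp(s) for s in raw_log_scale]    # always strictly positive

# Round trip: to start a Gaussian at alpha = 0.1, store logit(0.1).
stored = logit(0.1)
recovered = sigmoid(stored)
print(opacities, scales, stored, recovered)
```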

§ 3 · The projection

Splatting: how a 3D blob becomes a 2D ellipse

To draw the Gaussian into an image we need its image-plane footprint. The center is easy: apply the camera matrix to \(\boldsymbol{\mu}\), do the perspective divide, and you have a screen-space pixel. The covariance is the subtle part — and the one piece of math you really have to grasp to understand 3DGS.

The camera's job is a function \(\phi: \mathbb{R}^3 \to \mathbb{R}^2\) (world point → pixel). It's not linear: the perspective divide is a ratio of coordinates. But it's smooth, so near any one point we can replace it by its best linear approximation. That's just a first-order Taylor expansion around the Gaussian's center:

$$ \mathbf{x}' \;\approx\; \phi(\boldsymbol{\mu}) + J_{\phi}(\boldsymbol{\mu})\,(\mathbf{x}-\boldsymbol{\mu}) $$

where \(J_\phi\) is the 2×3 Jacobian of the projection evaluated at the Gaussian's center. Under that linearization, a 3D Gaussian remains a Gaussian, and its image-plane covariance is just the pushed-forward 3D one. Split \(\phi\) into the world-to-camera transform, whose linear part is the rotation \(W\), and the perspective projection, whose Jacobian at the camera-space center is \(J\), so that \(J_\phi = J\,W\), and you get the famous EWA splatting formula (Zwicker et al., 2001):

$$ \boldsymbol{\Sigma}' \;=\; J\,W\,\boldsymbol{\Sigma}\,W^{\!\top}\,J^{\!\top} \quad \in \mathbb{R}^{2\times 2}. $$
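
As a worked instance of this formula, the sketch below assumes a plain pinhole camera that maps a camera-space point \((x, y, z)\) to \((f_x x/z,\; f_y y/z)\); the principal point is a constant offset and drops out of the Jacobian. The helper names and the numbers are illustrative only.

```python
def matmul(A, B):
    """Plain nested-list matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

def projection_jacobian(t, fx, fy):
    """2x3 Jacobian of (x, y, z) -> (fx*x/z, fy*y/z) at camera-space point t."""
    x, y, z = t
    return [[fx / z, 0.0, -fx * x / (z * z)],
            [0.0, fy / z, -fy * y / (z * z)]]

def project_covariance(Sigma, W, t, fx, fy):
    """Screen-space covariance Sigma' = J W Sigma W^T J^T (2x2)."""
    J = projection_jacobian(t, fx, fy)
    JW = matmul(J, W)
    return matmul(matmul(JW, Sigma), transpose(JW))

# Identity camera rotation; isotropic Gaussian with std 0.1 on the optical axis.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
Sigma = [[0.01, 0.0, 0.0], [0.0, 0.01, 0.0], [0.0, 0.0, 0.01]]
near = project_covariance(Sigma, W, (0.0, 0.0, 2.0), 500.0, 500.0)
far = project_covariance(Sigma, W, (0.0, 0.0, 4.0), 500.0, 500.0)
print(near, far)
```

Doubling the depth halves the screen-space standard deviation (25 px down to 12.5 px here), which is the familiar "farther things look smaller" falling out of \(J\).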

\(\Sigma'\) is the screen-space covariance — the ellipse the 3D blob casts onto the image. Once you have it, every pixel under the ellipse gets the Gaussian's contribution:

$$ G_{\text{2D}}(\mathbf{p}) \;=\; \exp\!\Big(-\tfrac{1}{2}\,(\mathbf{p}-\mathbf{p}_\mu)^{\!\top}\,\boldsymbol{\Sigma}'^{-1}\,(\mathbf{p}-\mathbf{p}_\mu)\Big). $$

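Evaluating that quadratic form is the inner loop of the renderer: the 2×2 inverse is computed once per Gaussian, then reused for every pixel the ellipse covers. If you can compute that fast (and a GPU sure can) you can splat a million Gaussians a frame.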
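
For a sense of how little work this is, here is a sketch of the per-pixel evaluation of \(G_{\text{2D}}\): a closed-form 2×2 inverse done once per Gaussian, then a handful of multiplies per pixel. The function names and the example covariance are assumptions for illustration.

```python
import math

def inverse_2x2(S):
    """Closed-form inverse of a symmetric 2x2 matrix [[a, b], [b, c]]."""
    a, b, c = S[0][0], S[0][1], S[1][1]
    det = a * c - b * b
    if det <= 0.0:
        raise ValueError("covariance is not positive definite")
    return [[c / det, -b / det], [-b / det, a / det]]

def splat_weight(pixel, center, inv_cov):
    """G_2D(p) = exp(-0.5 * d^T Sigma'^-1 d), with d = p - p_mu."""
    dx, dy = pixel[0] - center[0], pixel[1] - center[1]
    power = -0.5 * (inv_cov[0][0] * dx * dx
                    + 2.0 * inv_cov[0][1] * dx * dy
                    + inv_cov[1][1] * dy * dy)
    return math.exp(power)

cov2d = [[625.0, 0.0], [0.0, 156.25]]   # std 25 px horizontally, 12.5 px vertically
inv_cov = inverse_2x2(cov2d)            # done once per Gaussian
center = (320.0, 240.0)

w_center = splat_weight((320.0, 240.0), center, inv_cov)
w_right = splat_weight((345.0, 240.0), center, inv_cov)   # one std to the right
w_up = splat_weight((320.0, 215.0), center, inv_cov)      # two std vertically
print(w_center, w_right, w_up)
```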

Interactive · 3D ellipsoid → 2D ellipse

Spin the camera. The 3D Gaussian on the left projects to the 2D ellipse on the right via \(\Sigma' = J W \Sigma W^\top J^\top\). Notice how rotation changes the screen ellipse's aspect: that's the projection geometry, not your eyes.

§ 4 · The color

Spherical harmonics, the cheap reflection trick

A real surface looks different depending on where you stand. A matte wall barely changes; a polished apple gleams at a specific angle. To match a NeRF's photorealism, our Gaussians need view-dependent color too — a function from viewing direction \(\mathbf{d}\) on the unit sphere to RGB.

NeRFs do this by feeding \(\mathbf{d}\) into the MLP. 3DGS can't afford an MLP, so it stores a compact basis expansion: spherical harmonics (SH). SH are to the sphere what Fourier series are to the circle — a complete orthonormal basis \(\{Y_\ell^m\}\) you can linearly combine to represent any (smooth) function on \(S^2\).

$$ c(\mathbf{d}) \;=\; \sum_{\ell=0}^{L}\sum_{m=-\ell}^{\ell} k_{\ell,m}\,Y_\ell^m(\mathbf{d}). $$

3DGS uses degree \(L=3\) by default. That's \((L+1)^2 = 16\) basis functions per channel, 48 floats of color per Gaussian. The DC term \(Y_0^0\) is just a constant: the Gaussian's "base color." Degree 1 adds smooth left/right/up/down variation (good enough for diffuse-ish surfaces). Degrees 2 and 3 sharpen the lobes, enough to capture soft highlights and some edge specularity without an MLP in the inner loop.
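
The first two bands are small enough to write out by hand. The sketch below evaluates \(c(\mathbf{d})\) up to degree 1 for RGB coefficients; the constants are the standard real-SH normalization factors, but the sign and ordering convention of the degree-1 terms varies between libraries, so treat the one used here as an assumption. `sh_color` and `num_sh_coeffs` are hypothetical names.

```python
SH_C0 = 0.28209479177387814   # Y_0^0 = 1 / (2 * sqrt(pi))
SH_C1 = 0.4886025119029199    # sqrt(3 / (4 * pi)), magnitude of the degree-1 terms

def num_sh_coeffs(degree):
    """Number of basis functions per color channel up to the given degree."""
    return (degree + 1) ** 2

def sh_color(coeffs, d):
    """Evaluate c(d) from a list of (r, g, b) coefficient triples.

    Uses the DC term, plus the three degree-1 terms when at least four
    coefficients are given. d must be a unit direction (x, y, z).
    """
    x, y, z = d
    basis = [SH_C0]
    if len(coeffs) >= 4:
        basis += [-SH_C1 * y, SH_C1 * z, -SH_C1 * x]   # assumed sign convention
    return tuple(sum(b * k[ch] for b, k in zip(basis, coeffs)) for ch in range(3))

# DC color plus a single degree-1 coefficient that brightens red along +z.
coeffs = [(2.0, 1.0, 0.5), (0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (0.0, 0.0, 0.0)]
front = sh_color(coeffs, (0.0, 0.0, 1.0))
back = sh_color(coeffs, (0.0, 0.0, -1.0))
print(num_sh_coeffs(3), front, back)
```

Seen from the front the red channel is brighter than from behind, while green and blue stay fixed: view dependence from four numbers per channel.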

Interactive · view direction → color

Orbit around the sphere. The color you see depends on the SH coefficients on the right. Crank up the degree-2 sliders to add a glossy lobe. Reset and you've got a Lambert ball.

There's no magic here: it's a linear function of 48 numbers, evaluated once per Gaussian per frame, with no per-pixel branching. Compared to an MLP that's effectively free, and the closed-form gradient with respect to each \(k_{\ell,m}\) is just the basis value \(Y_\ell^m(\mathbf{d})\), so its cost disappears in the noise of the backward pass. You give up some ability to model sharp, microfacet-class specularity (which is why follow-up papers like GaussianShader add proper BRDFs on top; see 3dgs-relighting), but for the original goal of photorealistic novel view synthesis, SH is plenty.

§ 5 · The composite

How a pixel gets its color

Each Gaussian gives a pixel one tiny contribution. Hundreds of Gaussians might touch one pixel — in front of each other, layered like an onion. We have to combine them in a way that respects occlusion. This is the same volume rendering equation you've seen in NeRF, evaluated discretely over a list of Gaussians sorted front-to-back by depth:

$$ C \;=\; \sum_{i=1}^{N}\mathbf{c}_i\,\alpha_i\,T_i, \qquad T_i \;=\; \prod_{j=1}^{i-1}(1-\alpha_j). $$

\(\alpha_i\) here is the Gaussian's stored opacity multiplied by its 2D Gaussian value at this pixel: an off-center pixel gets a smaller \(\alpha\). \(T_i\) is the transmittance — how much of the original light from Gaussian \(i\) survives the Gaussians in front of it. \(T_1 = 1\) (nothing in front), and \(T_i\) shrinks fast: once you hit a few solid Gaussians, \(T\) falls below \(10^{-4}\) and no later one can matter. The pixel terminates early. This early-out is what makes 3DGS rendering O(visible Gaussians per pixel) instead of O(total Gaussians in scene).
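
The sum and its early-out translate almost line for line into code. The sketch below composites one pixel from an already-sorted list; the function name, the return shape, and exactly where the \(10^{-4}\) check sits in the loop are assumptions for illustration.

```python
def composite(colors, alphas, t_min=1e-4):
    """Front-to-back alpha compositing for one pixel.

    colors: list of (r, g, b) tuples, sorted nearest first.
    alphas: per-Gaussian alpha at this pixel (stored opacity * 2D Gaussian value).
    Returns (pixel_color, number_of_gaussians_used).
    """
    pixel = [0.0, 0.0, 0.0]
    T = 1.0                      # transmittance; T_1 = 1, nothing in front yet
    used = 0
    for c, a in zip(colors, alphas):
        if T < t_min:            # early-out: nothing further back can matter
            break
        for ch in range(3):
            pixel[ch] += c[ch] * a * T
        T *= 1.0 - a
        used += 1
    return tuple(pixel), used

# Two half-transparent Gaussians: red in front of green.
two_layer, _ = composite([(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)], [0.5, 0.5])

# Twenty fairly solid Gaussians: the loop stops long before the end.
_, used = composite([(1.0, 1.0, 1.0)] * 20, [0.8] * 20)
print(two_layer, used)
```

With twenty Gaussians at α = 0.8, transmittance drops by a factor of five per step and the loop exits after six of them.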

Interactive · march a pixel front-to-back

A synthetic stack of 20 Gaussians. Drag the slider to step through. The bar chart tracks \(T_i\) decaying and the running color accumulating. Partway in, the remaining Gaussians stop contributing: that's the early-out.

step 0 / 20

Notice the structure: the per-pixel computation is a tight loop over a sorted list, with one multiply-add and one early-out check per Gaussian. No MLP, no per-sample interpolation, and no branching beyond that single check. On a modern GPU the working set fits into shared memory and the loop compiles to roughly 30 instructions per Gaussian. That is where the 100× speedup over NeRFs lives.

§ 6 · The optimization

How a point cloud learns to be a photo

The renderer is differentiable end-to-end. Given a training image \(I^*\) and a candidate render \(I\), the loss is plain photometric:

$$ \mathcal{L} \;=\; (1-\lambda)\,\|I - I^*\|_1 \;+\; \lambda\,\mathcal{L}_{\text{D-SSIM}}(I, I^*), \quad \lambda = 0.2 $$

An L1 term forces colors to match. A D-SSIM term (1 − SSIM, structural similarity) forces local structure to match — this is what stops the optimizer from settling on a mushy "averaged" solution. Gradients flow back through the alpha composite, through the SH evaluation, through \(\Sigma' = JW\Sigma W^\top J^\top\), all the way to \(\boldsymbol{\mu},\mathbf{q},\mathbf{s},\alpha\) and the SH coefficients of every contributing Gaussian.
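
To see why the D-SSIM term matters, here is a deliberately simplified sketch of the loss on flattened grayscale images in \([0,1]\). It computes SSIM over the whole image as a single window, which is an assumption made to keep the example short; practical SSIM uses small local windows. All names are hypothetical.

```python
def mean(v):
    return sum(v) / len(v)

def global_ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """SSIM over the whole image as one window (a simplification)."""
    mx, my = mean(x), mean(y)
    vx = mean([(a - mx) ** 2 for a in x])
    vy = mean([(b - my) ** 2 for b in y])
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return (((2 * mx * my + c1) * (2 * cov + c2))
            / ((mx * mx + my * my + c1) * (vx + vy + c2)))

def photometric_loss(img, gt, lam=0.2):
    """(1 - lam) * L1 + lam * (1 - SSIM), with lam = 0.2 as in the text."""
    l1 = mean([abs(a - b) for a, b in zip(img, gt)])
    d_ssim = 1.0 - global_ssim(img, gt)
    return (1.0 - lam) * l1 + lam * d_ssim

gt = [0.0, 0.25, 0.5, 0.75, 1.0, 0.75, 0.5, 0.25]   # a ramp with real structure
mushy = [0.5] * len(gt)                              # the "averaged" solution

loss_same = photometric_loss(gt, gt)
loss_mushy = photometric_loss(mushy, gt)
print(loss_same, loss_mushy)
```

The flat image has the right mean, so L1 alone only charges it 0.25; the structural term adds almost its full weight of 0.2 on top because the render has no structure at all.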

Where the extra Gaussians come from: densification

Initialize with a few hundred thousand SfM points and let SGD run. Two problems emerge fast:

  • Under-reconstruction. Some regions have no Gaussians but need detail. The few Gaussians nearby grow large gradients but can't move far without trashing other views.
  • Over-reconstruction. Other regions have one Gaussian trying to cover too much area — also large gradients, but the cure is "split me into smaller pieces."

Every 100 steps, the trainer inspects each Gaussian's positional gradient magnitude. If it's large and the Gaussian is small: clone (copy it nearby — fill in the missing region). If it's large and the Gaussian is big: split (replace it with two smaller children sampled from its own ellipsoid). If a Gaussian's opacity drifts near zero or it grows huge: prune. Every ~3000 steps, the opacity of every Gaussian is reset toward zero, forcing the optimizer to re-justify each one — a clever way to cull dead points.
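
The decision rule in that paragraph fits in one small function. In the sketch below only the 0.005 opacity floor comes from this essay's training loop; every other threshold is an illustrative placeholder, and `densify_action` is a hypothetical name rather than anything from the reference code.

```python
def densify_action(grad_norm, max_scale, opacity,
                   grad_threshold=0.0002,   # illustrative: "large" positional gradient
                   size_threshold=0.01,     # illustrative: small vs. big Gaussian
                   min_opacity=0.005,       # opacity floor used in the training loop
                   max_size=1.0):           # illustrative: "grew huge"
    """Return 'prune', 'clone', 'split', or 'keep' for one Gaussian."""
    if opacity < min_opacity or max_scale > max_size:
        return "prune"                       # nearly invisible, or far too big
    if grad_norm >= grad_threshold:
        # Large gradient: the optimizer keeps trying to move this Gaussian.
        return "clone" if max_scale <= size_threshold else "split"
    return "keep"

actions = [
    densify_action(grad_norm=0.001, max_scale=0.005, opacity=0.6),    # small, restless
    densify_action(grad_norm=0.001, max_scale=0.2, opacity=0.6),      # big, restless
    densify_action(grad_norm=0.00001, max_scale=0.2, opacity=0.001),  # faded away
    densify_action(grad_norm=0.00001, max_scale=0.05, opacity=0.9),   # settled
]
print(actions)
```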

Interactive · densification in action

A toy 2D scene starts as 30 random Gaussians trying to fit the target image (top). Click "step 100" repeatedly. Watch clone/split fill in the missing detail and prune kill the dead points. The real implementation runs on the GPU with a custom CUDA rasterizer; here it's pure JavaScript.

iter 0 · N = 30

The full forward/backward loop, in about 30 lines

# Schematic: random_view, rasterize, ssim and the densification helpers stand in for
# the real implementation; gaussians and optimizer are assumed to exist already.
for step in range(1, 30_001):
    cam, gt = random_view()                       # pick a training camera + its ground-truth image

    # ---- forward ----
    img = rasterize(gaussians, cam)               # (project Σ, sort by depth, alpha composite)
    loss = (1 - 0.2) * (img - gt).abs().mean() \
         + 0.2 * (1 - ssim(img, gt))

    # ---- backward ----  (autograd calls the rasterizer's hand-written backward pass)
    loss.backward()

    # ---- densification, every 100 steps; reads ∇μ, so it runs before grads are zeroed ----
    if step % 100 == 0 and step < 15_000:
        clone_under_reconstructed(gaussians)      # |∇μ| large, small Σ
        split_over_reconstructed(gaussians)       # |∇μ| large, big Σ
        prune_low_alpha(gaussians)                # α < 0.005 → delete
    if step % 3_000 == 0 and step < 15_000:
        reset_opacity(gaussians)                  # opacity ← logit(0.01)

    optimizer.step(); optimizer.zero_grad()

That's it. That's all of 3DGS. ~30 lines of training logic plus a custom CUDA rasterizer for the inner loop. The rest of the field is variations on each piece of this skeleton — better densification (MCMC, Taming-3DGS), better projection (Mip-Splatting), tighter Σ (2DGS, GES), cheaper inner loop (gsplat, Speedy-Splat), or whole new domains where the same skeleton surprisingly still works (SLAM, avatars, generation).

§ 7 · Where this lands

What the original paper achieved

  • 135 FPS: 1080p, Mip-NeRF 360 garden scene, RTX A6000
  • ~30 min: training time per scene to reach SOTA quality
  • ~1–6 M: final Gaussians per scene
  • 27+ PSNR: on Mip-NeRF 360 outdoor scenes, matching or beating Mip-NeRF 360 itself

NeRF in 2023: ~1 FPS, days of training, ~5 MB of MLP weights. 3DGS: 100× faster to render, 50× faster to train, ~1 GB on disk, and the file is a transparent point cloud you can open in MeshLab. That last point matters — it's why every other vertical in this series (SLAM, avatars, generation, editing) got off the ground so fast. An explicit primitive is a primitive other systems can use.

§ 8 · The roots

3DGS didn't appear from nowhere

Every idea in the 2023 paper has lineage. EWA splatting is from 2001. SH for radiance dates to the 1980s. Differentiable rendering as a paradigm is from 2014.

§ 9 · Two tangents worth a side trip

Things the intro doesn't tell you

9.1 · Why is the volume rendering equation the same in both?

NeRFs sample a continuous field along a ray. 3DGS sums discrete primitives along the same ray. They look completely different — but the rendering formula is identical. Why?

Because both are discretizations of the same thing: the radiative transfer equation. The continuous form is

$$ C \;=\; \int_0^\infty T(t)\,\sigma(t)\,c(t)\,dt, \quad T(t) = \exp\!\Big(-\int_0^t \sigma(s)\,ds\Big). $$

Discretize this integral with samples at depths \(t_1 \lt t_2 \lt \dots\), define \(\alpha_i = 1 - \exp(-\sigma_i \Delta t_i)\), and you get the discrete sum we used in §5. Whether each "sample" is an MLP query (NeRF) or a Gaussian's center along the ray (3DGS), the emission-absorption model underneath is the same. That's the reason the formula is universal: it isn't a rendering trick, it's the physics of a medium that emits and absorbs light.
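
The step from the integral to the product is exact when the density is constant within each interval, and that is easy to verify. The sketch below compares \(\exp(-\sum_i \sigma_i \Delta t_i)\) with \(\prod_i (1-\alpha_i)\) for some made-up densities and step sizes; the function names are hypothetical.

```python
import math

def transmittance_continuous(sigmas, deltas):
    """T = exp(-integral of sigma) for a piecewise-constant density."""
    return math.exp(-sum(s * d for s, d in zip(sigmas, deltas)))

def transmittance_discrete(sigmas, deltas):
    """T = product of (1 - alpha_i), with alpha_i = 1 - exp(-sigma_i * delta_i)."""
    T = 1.0
    for s, d in zip(sigmas, deltas):
        alpha = 1.0 - math.exp(-s * d)
        T *= 1.0 - alpha
    return T

sigmas = [0.0, 0.5, 3.0, 10.0, 0.2]    # density in each interval (illustrative)
deltas = [0.1, 0.1, 0.05, 0.02, 0.3]   # interval lengths along the ray

t_cont = transmittance_continuous(sigmas, deltas)
t_disc = transmittance_discrete(sigmas, deltas)
print(t_cont, t_disc)
```

The two numbers agree to rounding error, because \(1-\alpha_i = e^{-\sigma_i \Delta t_i}\) and a product of exponentials is the exponential of the sum.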

9.2 · "Explicit" is a feature, not a footnote

NeRFs were criticized for being black boxes. You couldn't easily inspect, edit, animate, or deform them — every operation went through "retrain the network." 3DGS made the scene a literal point cloud, and that single property is what unlocked the entire downstream ecosystem:

  • SLAM and reconstruction (Splatting-SLAM, MonoGS) can add Gaussians as new frames arrive — you can't easily add new neurons to a trained MLP.
  • Animatable avatars (GaussianAvatars, HUGS) attach Gaussians to a skeleton mesh and rig them like ordinary points — impossible if the body lives inside neural weights.
  • Editing (GaussianEditor) literally selects, moves, deletes Gaussians like you'd select points in Blender.
  • Generation (DreamGaussian, LGM) outputs Gaussians as the format, and any downstream tool that consumes Gaussians works.

Almost the entire 2024–2026 explosion of the field is "we tried plugging 3DGS as the representation into our existing pipeline and it worked." The MLP-as-scene representation couldn't have done that. Explicitness compounds.

§ 10 · Where to go next

The rest of the series

Now that you've got the atoms, every other essay in this series is "what happens when you change one piece":

  • 3dgs-cuda — the CUDA pipeline, kernel by kernel. The inner loop in §5 and §6, drawn out.
  • 3dgs-antialiasing — what goes wrong with §3's projection at the edges, and how Mip-Splatting and GES fix it.
  • 3dgs-compression — that 1 GB scene file in §7? It can be 30 MB. Here's how.
  • 3dgs-slam — what if cameras and Gaussians are both unknown? §6 with extra unknowns.
  • 3dgs-generation — text → Gaussians, without any photos at all.
  • 3dgs-large-scale — what if §5's sorted list is 100 M long? You partition.
  • 3dgs-avatars — attach §1's atoms to a skeleton. Now they move.
  • 3dgs-relighting — §4's SH lacks materials. Add BRDFs.
