A visual essay · packing 1 GB into 30 MB

3DGS, compressed.

The original 3D Gaussian Splatting paper buys its elegance with storage: a typical scene ships as a 700 MB to 1 GB .ply file. Two years later, you can hit 1–3% of that at a cost of a few tenths of a dB in PSNR. This essay walks through the four levers that get you there.

Builds on the foundations essay; read that for what each parameter means.

§ 0 · Why this matters

A gigabyte of fuzzy points is too many fuzzy points

A NeRF ships as ~5 MB of MLP weights. A 3DGS scene ships as a point cloud you can drop into MeshLab, which is great for tooling and terrible for bandwidth. The reason is structural: every Gaussian carries about 60 float32 values of state, and a good scene has 1–6 million of them. Multiply out: roughly 0.24 GB at the low end and ~1.4 GB at the high end, with no compression at all.

This is fine on a workstation. It's a deal-breaker if you want to put 3DGS in a phone app, ship a Sketchfab catalog, or send a scene over hotel Wi-Fi. The good news is that almost every one of those 60 floats is correlated with its neighbors, almost every primitive contributes negligibly to the final image, and the values that matter need far less than 32 bits of precision. The field exploited each of those observations in turn.

Assumed background. You know the layout of a Gaussian: μ ∈ ℝ³, rotation quaternion q, scale s ∈ ℝ³, opacity α, SH coefficients (16 per RGB channel = 48 floats), for 3 + 4 + 3 + 1 + 48 = 59 floats in total. If not, the Σ trick and SH sections of the foundations essay cover both.

§ 1 · Anatomy of a 3DGS file

Where do the bytes actually go?

Before optimizing anything, look at where the storage is. A typical 3DGS scene with 3 million Gaussians stores 59 floats per primitive (the standard graphdeco-inria layout):

Storage breakdown · per-Gaussian

Hover any bar to see what each byte does. Notice how dominant SH is: 48 of 59 floats. Any scheme that touches SH compresses dramatically; any scheme that doesn't, doesn't.

So a 3 M-Gaussian scene is ~3 M × 59 floats × 4 B ≈ 708 MB. With degree-3 SH, the SH coefficients alone are 48/59 ≈ 81% of the file. Reduce SH, by any means, and you've already won.
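
As a quick check on that arithmetic, here is a small sketch that tallies bytes per parameter group. The group names and the `scene_bytes` helper are made up for this example; the float counts are the ones quoted above, and nothing here reads a real .ply file.

```python
# Floats per Gaussian in the 59-float layout described above (float32 each).
FLOATS_PER_GAUSSIAN = {
    "position": 3,
    "rotation": 4,
    "scale": 3,
    "opacity": 1,
    "sh": 48,  # 16 coefficients x 3 color channels (degree-3 SH)
}
BYTES_PER_FLOAT = 4


def scene_bytes(num_gaussians: int) -> dict:
    """Return uncompressed bytes per parameter group for a scene."""
    return {
        name: num_gaussians * count * BYTES_PER_FLOAT
        for name, count in FLOATS_PER_GAUSSIAN.items()
    }


sizes = scene_bytes(3_000_000)
total = sum(sizes.values())
for name, size in sizes.items():
    print(f"{name:>8}: {size / 1e6:7.1f} MB  ({size / total:5.1%})")
print(f"   total: {total / 1e6:7.1f} MB")
```

Changing the SH entry (for example to 3 floats for DC-only color) shows immediately how much of the file depends on that one row.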

The four levers, in order of impact

  1. Pruning: kill Gaussians the image doesn't need. 50–90% reductions in count with little PSNR loss. LightGaussian, RadSplat, Mini-Splatting.
  2. Quantization: drop 32-bit floats to 16-bit, 8-bit, or even less. Especially aggressive on SH. EAGLES, Compact-3DGS.
  3. Vector quantization: group similar SH coefficients into a small codebook, store an index per Gaussian. LightGaussian, Self-Organizing Gaussians.
  4. Structure: replace the unstructured cloud with a sparse grid of "anchors" that each predict a small bundle of Gaussians. The cloud becomes a function of the anchors. Scaffold-GS, Octree-GS.

§ 2 · The first lever

Pruning: most Gaussians do almost nothing

The original training loop densifies aggressively — clone or split any time the position gradient is large — but it never properly culls. Once a Gaussian has been placed and its parameters stabilize, no signal pushes the optimizer to delete it unless its opacity drifts below the prune threshold (typically \(\alpha < 0.005\)). The result: a long tail of low-opacity, nearly-coplanar, or rarely-visible Gaussians clinging to the cloud.

LightGaussian's insight: rank every Gaussian by a "global significance" score summed over all training views:

$$ \text{sig}(g) \;=\; \sum_{\text{view } v} \;\sum_{\text{pixel } p}\; \alpha_g(v, p)\,T_g(v, p) $$

where \(\alpha_g(v,p)\,T_g(v,p)\) is the blending weight of Gaussian \(g\) at pixel \(p\) in view \(v\), i.e. how much it contributed to that pixel (you already compute this in the forward pass). Sort by significance, drop the bottom 66%, and fine-tune for another 5 K steps. PSNR loss: ~0.1 dB. File size: ~0.34× the original, since 34% of the Gaussians remain.
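
The score-and-drop step can be sketched in NumPy as follows. This is a toy illustration of the formula above, not LightGaussian's code: the dense `(gaussian, view, pixel)` arrays stand in for values a real rasterizer would accumulate per Gaussian during the forward pass, and `significance` and `prune_mask` are hypothetical helper names.

```python
import numpy as np


def significance(alpha: np.ndarray, transmittance: np.ndarray) -> np.ndarray:
    """Sum alpha * T over all views and pixels for each Gaussian.

    Both inputs have shape (num_gaussians, num_views, num_pixels); entries
    are zero wherever a Gaussian does not touch a pixel.
    """
    return (alpha * transmittance).sum(axis=(1, 2))


def prune_mask(sig: np.ndarray, drop_fraction: float) -> np.ndarray:
    """Boolean mask that drops the least significant fraction of Gaussians."""
    num_drop = int(round(drop_fraction * sig.size))
    order = np.argsort(sig, kind="stable")  # least significant first
    keep = np.ones(sig.size, dtype=bool)
    keep[order[:num_drop]] = False
    return keep


rng = np.random.default_rng(0)
# Toy stand-in for rasterizer outputs: 400 Gaussians, 4 views, 64 pixels.
alpha = rng.uniform(0.0, 1.0, size=(400, 4, 64))
transmittance = rng.uniform(0.0, 1.0, size=(400, 4, 64))

sig = significance(alpha, transmittance)
keep = prune_mask(sig, drop_fraction=0.66)
print(f"kept {keep.sum()} of {keep.size} Gaussians")
```

The surviving parameters would then be indexed with `keep` and fine-tuned; that part depends on the training framework and is left out here.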

Interactive · prune by significance

A toy scene of ~400 Gaussians with varying contribution. The histogram on the right shows the long tail of low-significance points. Drag the threshold; the scene above shows what survives.

§ 3 · The second lever

Quantization: do you really need 32 bits for opacity?

Float32 has 23 bits of mantissa — about 7 decimal digits of precision. Look at what each Gaussian parameter is actually used for and ask whether it needs 7 digits:

  • Opacity α: a blending weight in [0, 1]. Quantizing to 8 bits introduces at most ~0.2% error (half of a 1/255 step), which is below the visible threshold. 32→8, 4× reduction.
  • Scale s: stored in log space. Log-scales need a few extra bits of dynamic range, but 16-bit is fine. 32→16, 2× reduction.
  • Rotation q: a unit quaternion is on \(S^3\), so 4 floats is redundant anyway. Stereographic projection puts it in \(\mathbb{R}^3\); 16-bit per coordinate is ample. 4·32 → 3·16, ~2.7× reduction.
  • Position μ: needs more dynamic range (scenes can be 100 m across, details millimeter-scale). Most papers use 16-bit half-floats with a per-block scale, or split into integer-grid + sub-voxel offset. ~2× reduction.
  • SH coefficients: the easy win. The DC term needs reasonable precision; higher-order coefficients are smaller in magnitude and tolerate 8-bit quantization or even 4-bit. 4–8× reduction.
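
To make the opacity case concrete, here is a minimal sketch of uniform 8-bit quantization of a value in [0, 1], assuming opacity is stored post-activation as the list above describes. `quantize_unit` and `dequantize_unit` are hypothetical names, not functions from any 3DGS codebase.

```python
import numpy as np


def quantize_unit(x: np.ndarray, bits: int = 8) -> np.ndarray:
    """Map values in [0, 1] to integer codes in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    dtype = np.uint8 if bits <= 8 else np.uint16
    return np.rint(np.clip(x, 0.0, 1.0) * levels).astype(dtype)


def dequantize_unit(codes: np.ndarray, bits: int = 8) -> np.ndarray:
    """Map integer codes back to values in [0, 1]."""
    levels = (1 << bits) - 1
    return codes.astype(np.float64) / levels


# Sweep the whole range so the worst-case rounding error shows up.
opacity = np.linspace(0.0, 1.0, 100_001)
codes = quantize_unit(opacity)
restored = dequantize_unit(codes)
max_err = np.abs(restored - opacity).max()

print(f"bytes per value: {codes.itemsize} (float32 uses 4)")
print(f"max abs error:   {max_err:.4%} of full range")
```

The worst case is half a quantization step, 0.5 / 255, which is where the roughly 0.2% figure comes from.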

Interactive · 8-bit quantization staircase

A continuous value (opacity) maps to a discrete bin. The orange curve is the float32 original; the green staircase is what 8-bit storage gives you. The y-axis error is the worst-case visible artifact.

8 bits · 256 levels · max err ≈ 0.20%

Quantization is also one of the lowest-implementation-cost compression tricks: it's a few lines of code at save/load time and no change to training. Combined with pruning, you're at ~5× total reduction without touching the training pipeline.
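
For rotation, those few lines at save/load time might look like the sketch below. It assumes quaternions are stored as (w, x, y, z), uses the fact that q and −q encode the same rotation to force w ≥ 0, and then applies the stereographic projection mentioned earlier. `encode_rotation` and `decode_rotation` are illustrative names, and real pipelines may pick a different parameterization.

```python
import numpy as np


def encode_rotation(q: np.ndarray) -> np.ndarray:
    """Unit quaternions (N, 4) as (w, x, y, z) -> float16 array (N, 3)."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    q = np.where(q[:, :1] < 0.0, -q, q)   # q and -q are the same rotation
    p = q[:, 1:] / (1.0 + q[:, :1])       # stereographic projection, |p| <= 1
    return p.astype(np.float16)


def decode_rotation(p: np.ndarray) -> np.ndarray:
    """Inverse projection: float16 (N, 3) -> unit quaternions (N, 4)."""
    p = p.astype(np.float32)
    n = (p * p).sum(axis=1, keepdims=True)
    w = (1.0 - n) / (1.0 + n)
    xyz = 2.0 * p / (1.0 + n)
    q = np.concatenate([w, xyz], axis=1)
    return q / np.linalg.norm(q, axis=1, keepdims=True)


rng = np.random.default_rng(0)
q = rng.normal(size=(10_000, 4)).astype(np.float32)
q /= np.linalg.norm(q, axis=1, keepdims=True)

packed = encode_rotation(q)
restored = decode_rotation(packed)

# Compare against the w >= 0 representative of each input quaternion.
canonical = np.where(q[:, :1] < 0.0, -q, q)
print(f"bytes: {q.nbytes} -> {packed.nbytes}")
print(f"max component error: {np.abs(restored - canonical).max():.2e}")
```

Forcing w ≥ 0 keeps the projected point inside the unit ball, where float16 has its best absolute precision.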

§ 4 · The third lever

Vector quantization: SH coefficients aren't independent

Quantization treats every coefficient separately. But the 48 SH coefficients of a Gaussian aren't 48 unrelated numbers: they jointly encode a color-on-a-sphere, and a scene contains only so many distinct combinations of color and reflectance profile. The 48-D SH vectors therefore cluster near a low-dimensional manifold within \(\mathbb{R}^{48}\). A natural way to compress that is vector quantization: pick \(K\) representative 48-D vectors as a codebook, and replace each Gaussian's SH with an index \(k \in \{1, \dots, K\}\).

$$ \mathbf{sh}_g \;\approx\; \mathbf{c}_{k_g}, \qquad k_g = \arg\min_k \|\mathbf{sh}_g - \mathbf{c}_k\| $$

K-means on the scene's SH vectors gives you the codebook; one fine-tuning pass lets the Gaussians adjust to their assigned codes. With \(K = 4096\) codes, each SH index is 12 bits instead of 48 × 32 = 1536 bits: a 128× reduction on the dominant storage term, before counting the codebook itself. PSNR loss: ~0.2 dB.
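
A bare-bones version of the codebook step, using plain Lloyd's k-means in NumPy on synthetic vectors. This is a sketch of the general technique rather than any paper's pipeline: the data is random, the fine-tuning pass is omitted, and `kmeans_codebook` and `assign` are hypothetical helpers.

```python
import numpy as np


def assign(vectors: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Nearest-code index for each vector (squared Euclidean distance)."""
    # ||v - c||^2 = ||v||^2 - 2 v.c + ||c||^2; ||v||^2 does not affect argmin.
    scores = -2.0 * vectors @ codebook.T + (codebook ** 2).sum(axis=1)
    return scores.argmin(axis=1)


def kmeans_codebook(vectors, num_codes, iters=10, seed=0):
    """Plain Lloyd's k-means; returns (codebook, indices)."""
    rng = np.random.default_rng(seed)
    start = rng.choice(len(vectors), size=num_codes, replace=False)
    codebook = vectors[start].copy()
    for _ in range(iters):
        indices = assign(vectors, codebook)
        for k in range(num_codes):
            members = vectors[indices == k]
            if len(members):  # empty clusters keep their previous center
                codebook[k] = members.mean(axis=0)
    return codebook, assign(vectors, codebook)


# Synthetic stand-in for per-Gaussian SH vectors: 8 "materials" plus noise.
rng = np.random.default_rng(1)
centers = rng.normal(size=(8, 48))
labels = rng.integers(0, 8, size=2000)
sh = centers[labels] + 0.05 * rng.normal(size=(2000, 48))

codebook, indices = kmeans_codebook(sh, num_codes=16)
approx = codebook[indices]  # decode = table lookup
mse = np.mean((sh - approx) ** 2)
baseline = np.mean((sh - sh.mean(axis=0)) ** 2)
print(f"MSE with 16 codes: {mse:.4f} (single-mean baseline: {baseline:.4f})")
```

At scene scale you would likely reach for an optimized k-means implementation; the structure (assign, update, look up) stays the same.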

Interactive · color codebook

Each dot is a Gaussian, plotted by its first two color SH coefficients (the DC and one degree-1 term). Drag the slider to pick the codebook size. Each Gaussian snaps to the nearest cluster center. Most of the cloud collapses to ~50 centers; the rest is noise.

A nice property: at load time you reconstruct each Gaussian's SH by table lookup. The codebook is well under a megabyte (4096 × 48 × 4 B ≈ 786 KB at float32, half that at float16). Per-Gaussian SH storage drops from 48×4 B = 192 B to 1.5 B (12-bit index), and that's where the bulk of the compression ratio comes from in 2024-era pipelines.
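
The 1.5 B figure assumes the 12-bit indices are actually bit-packed, two indices per three bytes, rather than stored as 16-bit integers. One way to do that in NumPy is sketched below; `pack12` and `unpack12` are names invented for this example.

```python
import numpy as np


def pack12(indices: np.ndarray) -> np.ndarray:
    """Pack integers in [0, 4096) into bytes, two indices per 3 bytes."""
    idx = np.asarray(indices, dtype=np.uint16)
    if idx.size and idx.max() >= 4096:
        raise ValueError("indices must fit in 12 bits")
    if idx.size % 2:  # pad to an even count with a dummy index
        idx = np.concatenate([idx, np.zeros(1, dtype=np.uint16)])
    a, b = idx[0::2], idx[1::2]
    packed = np.stack(
        [a >> 4, ((a & 0xF) << 4) | (b >> 8), b & 0xFF], axis=1
    )
    return packed.astype(np.uint8).ravel()


def unpack12(packed: np.ndarray, count: int) -> np.ndarray:
    """Inverse of pack12; `count` is the original number of indices."""
    p = packed.reshape(-1, 3).astype(np.uint16)
    a = (p[:, 0] << 4) | (p[:, 1] >> 4)
    b = ((p[:, 1] & 0xF) << 8) | p[:, 2]
    out = np.empty(p.shape[0] * 2, dtype=np.uint16)
    out[0::2], out[1::2] = a, b
    return out[:count]


rng = np.random.default_rng(0)
indices = rng.integers(0, 4096, size=1001)
packed = pack12(indices)
restored = unpack12(packed, indices.size)

print(f"{indices.size} indices -> {packed.nbytes} bytes "
      f"({packed.nbytes / indices.size:.2f} B each)")
```

Storing the indices as plain `uint16` is simpler and costs 2 B per Gaussian instead of 1.5 B; whether the extra packing is worth it depends on how much the rest of the file has already shrunk.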

§ 5 · The fourth lever

Scaffold-GS: anchors predict bundles of Gaussians

Lu, Yu, Xu, Xiangli, Wang, Lin, Dai · CVPR 2024 · arXiv:2312.00109

The prior three levers compress the cloud's storage. Scaffold-GS asks a different question: what if we don't store the cloud at all, and instead store a sparser structure that predicts it on demand?

The scene becomes a regular sparse grid of "anchor points" (typically a voxel grid). Each anchor stores a position and a context vector \(\mathbf{f}_a \in \mathbb{R}^{32}\), and predicts \(k = 10\) child Gaussians by running \(\mathbf{f}_a\) plus the viewing direction through a small MLP. The MLP outputs each child's offset, scale, rotation, opacity, and color.

$$ \{(\mu_i, q_i, s_i, \alpha_i, c_i)\}_{i=1}^{k} \;=\; \text{MLP}_\theta(\mathbf{f}_a, \mathbf{d}, \mathbf{x}_a) $$

The anchors are sparse (~50 K instead of 3 M). The MLP is shared across the whole scene (~50 K parameters). Total storage: ~30 MB. PSNR: within 0.3 dB of full 3DGS. The cloud you render is huge — but it's materialized on the fly, never stored.
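
To show the shape of the idea, here is a toy decoder that turns anchors into Gaussians with a single shared MLP. It is a simplification under several assumptions: the hidden width, the random weights, the single-MLP layout, and the choice of activations are all made up for illustration and are not taken from the Scaffold-GS paper or code.

```python
import numpy as np

FEATURE_DIM = 32               # context vector size from the text
K = 10                         # child Gaussians per anchor
CHILD_DIM = 3 + 3 + 4 + 1 + 3  # offset, log-scale, rotation, opacity, RGB
HIDDEN = 64                    # assumed hidden width (hypothetical)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_mlp(rng):
    in_dim = FEATURE_DIM + 3 + 3  # f_a, view direction d, anchor position x_a
    return {
        "w1": rng.normal(0.0, 0.1, (in_dim, HIDDEN)),
        "b1": np.zeros(HIDDEN),
        "w2": rng.normal(0.0, 0.1, (HIDDEN, K * CHILD_DIM)),
        "b2": np.zeros(K * CHILD_DIM),
    }


def decode_anchors(params, features, anchor_pos, camera_pos):
    """Materialize K Gaussians per anchor for one camera position."""
    d = anchor_pos - camera_pos
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    x = np.concatenate([features, d, anchor_pos], axis=1)
    h = np.maximum(x @ params["w1"] + params["b1"], 0.0)  # ReLU
    out = (h @ params["w2"] + params["b2"]).reshape(-1, K, CHILD_DIM)
    offset, log_scale, quat, opacity, color = np.split(out, [3, 6, 10, 11], axis=2)
    return {
        "mu": anchor_pos[:, None, :] + offset,
        "scale": np.exp(log_scale),
        "rotation": quat / np.linalg.norm(quat, axis=2, keepdims=True),
        "opacity": sigmoid(opacity[..., 0]),
        "color": sigmoid(color),
    }


rng = np.random.default_rng(0)
params = init_mlp(rng)
features = rng.normal(size=(1000, FEATURE_DIM))
anchor_pos = rng.uniform(-1.0, 1.0, size=(1000, 3))

gaussians = decode_anchors(params, features, anchor_pos, np.array([0.0, 0.0, 5.0]))
stored = features.size + anchor_pos.size + sum(p.size for p in params.values())
rendered = sum(v.size for v in gaussians.values())
print(f"stored floats: {stored}, materialized floats per frame: {rendered}")
```

Calling `decode_anchors` again with a different `camera_pos` gives different Gaussians from the same stored anchors, which is the view-dependent behavior described above.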

Interactive · anchor → bundle

The grid is the anchor lattice; each cell hosts a context vector. Hover an anchor: the 10 Gaussians it predicts pop up around it. Move the viewer (drag) and the predictions change: that's the view-dependent MLP doing its job. The colors are demo only.

Scaffold-GS is the bridge between explicit and implicit representations: you keep the differentiable rasterizer and most of the 3DGS pipeline, but the scene state moves into a much smaller neural representation. Subsequent work (Octree-GS, Mini-Splatting v2) extends this in two directions — making the anchor grid adaptive (octree) so empty space stores nothing, and making the predicted bundles level-of-detail aware so distant anchors emit fewer Gaussians.

§ 6 · Pareto

The size–quality frontier, as of 2025

Stack the four levers and the size–quality curve looks like this:

Interactive · stack the levers

Each lever you toggle applies on top of the previous ones. The bar on the right tracks the file size; the bar on the left, the PSNR penalty. Stack all four and you're at ~30 MB with ~0.4 dB loss. That number is what's behind the "WebGL splat viewer on a phone" demos everyone is shipping.

  • ~700 MB: Baseline (3 M Gaussians, raw)
  • ~40 MB: LightGaussian (prune + quant + VQ)
  • ~30 MB: Scaffold-GS
  • ~10 MB: SOG / Compact-3DGS extreme (with quality drop)

§ 7 · The lineage

Twelve months of compression papers, ordered

[Interactive timeline: each entry gives a one-paragraph summary of the paper's idea and its compression ratio.]

§ 8 · Open

What's not solved

Streaming. All current methods compress the whole scene as a single blob. For very large scenes you want progressive streaming — show the user a coarse scene immediately and refine as bytes arrive. Octree-GS is the closest the field has gotten; a true progressive 3DGS codec is still open.

Edits without re-compressing. If you VQ the SH coefficients and the user then moves a Gaussian, you have to either re-cluster (slow) or rebuild the codebook (slower). Real-time editing of compressed scenes is an unsolved practical bottleneck; see the explicit-is-a-feature section in foundations for why this matters.

Render-time decompression overhead. Vector quantization needs a codebook lookup per Gaussian per frame; Scaffold-GS needs an MLP eval per anchor per frame. Both add a constant factor to the inner loop. The frontier here is fusing the decompression with the rasterizer kernel — see 3dgs-cuda for the CUDA-side story.
