§ 0 · The premise
The complexity isn't in the math, it's in the math times five million
Baseline 3DGS is fundamentally an optimization problem with millions of variables. Five million Gaussians × 60 parameters × 4 bytes is ~1.2 GB of parameters. Adam keeps two moments per parameter, so the optimizer state adds another ~2.4 GB, for roughly 3× the parameter memory in total. Activations and the framebuffer come on top of that. On a single 24 GB GPU you can fit maybe 6 million Gaussians during training. That's enough for an indoor scene at SOTA fidelity.
A drone capture of a city block (1 km × 1 km, hundreds of buildings, thousands of trees) needs 50–200 million Gaussians for comparable fidelity. That is roughly 8–30× more than fits on one GPU. And that's just the storage problem. There are three others, all of which compound:
- Initialization gaps. Structure-from-Motion gives you a sparse point cloud on captured surfaces; the inside of buildings, the sky, distant facades are empty. Densification has to fill enormous regions from nothing.
- View imbalance. Some Gaussians are seen in 1000 training views; others in 5. SGD overweights the well-seen regions and never converges in the others.
- LOD mismatch. A Gaussian sized appropriately for a distant building is huge and useless when you fly close to it. A Gaussian sized for the close-up is sub-pixel from far away — see antialiasing.
§ 1 · The wall
VRAM grows linearly with scene volume
Empirically, you need about 30 K Gaussians per cubic meter of "interesting" volume to reach indoor-quality PSNR (≥28 dB). At that density:
Interactive · how big a scene fits on your GPU?
Drag the scene volume. The bar tracks the Gaussian count required, the VRAM for the optimizer state, and where 24 GB / 80 GB GPUs hit their wall. Cities are well past both.
A typical room is 30 m³. A house is 300 m³. A block is 100 K m³. A city is 100 M m³. The Gaussian count, and with it VRAM, grows linearly with volume while a single GPU's memory stays fixed, so the math stops working past about 10 K m³ on one card.
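The arithmetic behind these figures can be sketched in a few lines. This is a rough lower bound under stated assumptions, not a measurement: it counts only the parameters and Adam's two moments (3 copies of 60 floats per Gaussian, as in § 0) at the 30 K per m³ density quoted above, and it ignores activations and the framebuffer, so real training hits the wall sooner. The function names are hypothetical.

```python
BYTES_PER_FLOAT = 4
PARAMS_PER_GAUSSIAN = 60      # figure used in § 0
COPIES = 3                    # parameters + two Adam moments (assumption: nothing else counted)
GAUSSIANS_PER_M3 = 30_000     # density quoted in this section


def training_bytes(volume_m3: float) -> float:
    """Lower-bound training memory in bytes for a scene of the given volume."""
    n_gaussians = GAUSSIANS_PER_M3 * volume_m3
    return n_gaussians * PARAMS_PER_GAUSSIAN * BYTES_PER_FLOAT * COPIES


def max_volume_m3(vram_gb: float) -> float:
    """Largest volume whose parameters + Adam state fit in vram_gb (decimal GB)."""
    return vram_gb * 1e9 / training_bytes(1.0)


def report() -> None:
    """Print the bound for the four scene sizes named in the text."""
    scenes = [("room", 30), ("house", 300), ("block", 100_000), ("city", 100_000_000)]
    for name, volume in scenes:
        print(f"{name:>5}: {training_bytes(volume) / 1e9:,.1f} GB")
```

With these constants one cubic meter costs 21.6 MB, so calling `report()` shows the block and city cases landing in the terabyte range even under this optimistic bound.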
§ 2 · Spatial partitioning
VastGaussian: train tiles, blend boundaries
The most direct approach: cut the scene into rectangular tiles, train one 3DGS model per tile (each tile gets only the training views that "see" it), and blend the per-tile renders at test time. VastGaussian is the canonical implementation.
- Camera-aware partitioning. Split the SfM point cloud into a grid of cells. For each cell, gather every training camera whose view frustum intersects it — these are the "owners" of this tile.
- Per-tile training, with overlap. Train one full 3DGS model per cell, using only its owner cameras. Cells overlap on the boundary so neighboring tiles agree about seams.
- Appearance modeling for free. Different cameras have different exposure / white-balance; if you train them all into one shared SH color you average out lighting across the scene. VastGaussian adds a learnable per-image appearance embedding that is concatenated to a downsampled copy of the rendered image and fed through a small CNN, which predicts a pixel-wise color adjustment applied during training. This restores per-image color faithfulness without polluting the shared geometry.
- Render-time merging. The full scene at test time is a union of all tile clouds. Rasterize from the camera, alpha-composite across tiles. Overlap regions average.
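The camera-aware partitioning step above can be sketched in 2D. This is a minimal sketch under simplifying assumptions, not VastGaussian's code: each camera is flattened to a top-down view wedge (position, heading, field of view, far distance), and a cell counts as seen when any point on a small sample lattice inside it falls in that wedge, which only approximates a true frustum/box intersection. `Camera`, `sees`, and `assign_owners` are hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass
class Camera:
    x: float
    y: float
    heading: float  # radians, direction the camera looks
    fov: float      # full horizontal field of view, radians
    far: float      # maximum viewing distance


def sees(cam: Camera, px: float, py: float) -> bool:
    """True if point (px, py) lies inside the camera's 2D view wedge."""
    dx, dy = px - cam.x, py - cam.y
    dist = math.hypot(dx, dy)
    if dist > cam.far:
        return False
    if dist == 0.0:
        return True
    # Signed angle between the viewing direction and the direction to the point,
    # wrapped into [-pi, pi).
    diff = math.atan2(dy, dx) - cam.heading
    diff = (diff + math.pi) % (2 * math.pi) - math.pi
    return abs(diff) <= cam.fov / 2


def assign_owners(cameras, extent: float, n_cells: int, samples: int = 4):
    """Map each grid cell (i, j) of a square scene to the indices of its owner cameras."""
    size = extent / n_cells
    owners = {}
    for i in range(n_cells):
        for j in range(n_cells):
            # Sample lattice of points strictly inside the cell.
            pts = [((i + (a + 0.5) / samples) * size, (j + (b + 0.5) / samples) * size)
                   for a in range(samples) for b in range(samples)]
            owners[(i, j)] = [k for k, cam in enumerate(cameras)
                              if any(sees(cam, px, py) for px, py in pts)]
    return owners
```

Raising `n_cells` shrinks each tile's owner list, which is the VRAM win, while cameras near cell borders end up owning several tiles, which is the redundant-training cost.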
Interactive · partition a "city"
A toy 2D scene with 8 camera positions (orange triangles) and a few hundred Gaussians. Drag the grid resolution slider. Each tile lights up the cameras whose frustums intersect it; those are the cameras that train this tile. Notice how higher resolution gives smaller, more independent tiles (good for VRAM) but more boundary overlap (more redundant training).
VastGaussian's contribution wasn't the partition idea — that's old — but making it work with the differentiable 3DGS pipeline. Two key engineering details: (a) the per-camera appearance vector decouples lighting drift from geometry drift, and (b) boundary blending uses a smooth weight based on distance to the tile's center, which keeps gradients consistent at seams.
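Detail (b) can be illustrated with a small blending routine. This is a sketch of the idea as described here, a smooth weight that depends on distance to the tile center, and the specific falloff (a smoothstep reaching zero at a chosen radius) is an assumption rather than the published formula. `tile_weight` and `blend` are hypothetical names, and each tile is represented as a `(center, radius, color)` tuple for illustration.

```python
import math


def smoothstep(t: float) -> float:
    """Cubic ease from 0 to 1 with zero slope at both ends; input is clamped to [0, 1]."""
    t = min(max(t, 0.0), 1.0)
    return t * t * (3.0 - 2.0 * t)


def tile_weight(point, center, radius: float) -> float:
    """Weight that is 1 at the tile center and falls smoothly to 0 at `radius`."""
    return smoothstep(1.0 - math.dist(point, center) / radius)


def blend(point, tiles):
    """Blend per-tile colors at `point`.

    `tiles` is a list of (center, radius, color) tuples. Returns the
    weight-normalized color, or None if no tile covers the point.
    """
    weights = [tile_weight(point, center, radius) for center, radius, _ in tiles]
    total = sum(weights)
    if total == 0.0:
        return None
    n_channels = len(tiles[0][2])
    return tuple(
        sum(w * color[ch] for w, (_, _, color) in zip(weights, tiles)) / total
        for ch in range(n_channels)
    )
```

Because the weight has zero slope where it reaches 0, a tile's contribution fades out at its border instead of switching off abruptly, which is what keeps the seam from showing up as a hard edge.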
§ 3 · Hierarchical LOD
Octree-GS: anchors at multiple resolutions
Tiling solves the VRAM problem but doesn't solve LOD — a tile near the camera and a tile a kilometer away both store the same number of Gaussians. Octree-GS replaces the unstructured Gaussian cloud with a sparse octree of anchors, each predicting a small bundle of Gaussians (Scaffold-GS style — see compression §5).
At each render call, the viewer's distance to a region of the octree decides which depth to sample from: distant cells use shallow (large) anchors, near cells descend into deeper (finer) anchors. This is exactly how mesh-based games have done LOD for thirty years; Octree-GS is the 3DGS version.
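The distance-to-depth rule can be sketched as a single function. The mapping below descends one octree level each time the viewing distance halves, which mirrors the usual LOD convention; treat it as an illustrative assumption, since the exact rule and constants used by Octree-GS are not given here. `lod_level` and `d_max` are hypothetical names.

```python
import math


def lod_level(distance: float, d_max: float, n_levels: int) -> int:
    """Pick an octree depth for a region at the given viewing distance.

    Level 0 is the coarsest (used at d_max and beyond); level
    n_levels - 1 is the finest. Each halving of the distance below
    d_max descends one level.
    """
    if distance <= 0.0:
        return n_levels - 1
    level = math.floor(math.log2(d_max / distance))
    return min(max(level, 0), n_levels - 1)
```

For example, with `d_max = 1000` and five levels, a region 500 m away uses level 1 and anything within about 60 m uses the finest level, so fine anchors only materialize in a band near the camera.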
Interactive · LOD by distance
Drag the camera. Cells of the octree get filled in different colors per LOD. Yellow = level 0 (coarse), copper = level 2 (medium), cyan = level 4 (fine). Notice the cone-shaped "fine band" follows the camera; that's the only region where fine anchors materialize.
The win compounds with §2 — Hierarchical 3DGS (Kerbl et al., SIGGRAPH 2024) combines tile partitioning and LOD octrees: each tile is internally an octree, the city-wide structure is a coarse octree of tiles. Net result: a city block streams at 30 FPS on a laptop GPU.
§ 4 · Streaming
What if the scene doesn't even fit on disk?
Even after compression (see 3dgs-compression) a city-block scene is hundreds of megabytes. Loading the whole thing for a 30-second flythrough is wasteful; the viewer only ever sees ~5% of the cloud per frame. Streaming systems load only the visible chunks.
The pattern is standard from web maps:
- Indexing. The scene is keyed by spatial cell at multiple resolutions (octree of mini scenes). A small header file lists what's available where.
- Visibility query. Each frame, the camera's frustum is intersected against the index to compute a set of (cell, LOD) pairs needed.
- Async fetch. Missing chunks are requested from disk or network. Already-loaded chunks at lower LOD are upscaled or kept as fallback.
- Cache eviction. Chunks that haven't been visible for N seconds are unloaded. Working set stays roughly camera-frustum-sized.
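The four steps above fit in a small bookkeeping class. This is a minimal sketch, not the code of any of the systems named in this section: the visibility query and the actual I/O are left to the caller, chunks are keyed by hypothetical `(cell, lod)` tuples, and the fallback lookup assumes coarser LODs share the same cell key, whereas a real octree would walk up to the parent cell.

```python
from dataclasses import dataclass, field


@dataclass
class ChunkCache:
    ttl: float                                  # seconds a chunk may stay unseen
    loaded: dict = field(default_factory=dict)  # (cell, lod) -> last time visible
    pending: set = field(default_factory=set)   # requested but not yet loaded

    def update(self, visible, now: float):
        """Process one frame.

        `visible` is the set of (cell, lod) keys from the visibility query.
        Returns (keys to fetch asynchronously, keys evicted this frame).
        """
        to_fetch = []
        for key in sorted(visible):
            if key in self.loaded:
                self.loaded[key] = now          # refresh last-seen time
            elif key not in self.pending:
                self.pending.add(key)           # request each chunk only once
                to_fetch.append(key)
        stale = [k for k, seen in self.loaded.items() if now - seen > self.ttl]
        for key in stale:
            del self.loaded[key]
        return to_fetch, stale

    def on_loaded(self, key, now: float) -> None:
        """Callback for when an asynchronous fetch completes."""
        self.pending.discard(key)
        self.loaded[key] = now

    def fallback(self, cell, lod: int):
        """Coarser loaded chunk to draw while (cell, lod) is still in flight."""
        for coarser in range(lod - 1, -1, -1):
            if (cell, coarser) in self.loaded:
                return (cell, coarser)
        return None
```

A viewer would call `update` once per frame with the result of the frustum query, hand the returned keys to its loader, and draw `fallback(cell, lod)` for anything still pending.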
The Hierarchical 3DGS paper (Kerbl et al., SIGGRAPH 2024) ships a working implementation. So does the CityGaussian (Liu et al., 2024) reference code. As of 2025, two open-source web viewers (SuperSplat, gsplat.js) support tiled streaming for scenes up to ~1 GB on disk.
§ 5 · The systems
Two years of large-scale 3DGS
§ 6 · Open
What still breaks at scale
Drift across tiles. Per-tile training assumes well-calibrated cameras. In drone capture, SfM pose error accumulates over hundreds of meters; neighboring tiles can disagree on geometry. Joint pose-refinement at the seam helps but doubles training cost.
Dynamic content. Cars, pedestrians, blowing trees: the entire pipeline assumes a static scene. Cleaning the SfM cloud of dynamic objects before training (using deformable methods or simple mask priors) is now table stakes for outdoor captures.
Multi-day capture. Lighting differs morning to evening; cars come and go. Tile-local appearance vectors (VastGaussian §2) help but don't fully solve it. A few 2025 papers treat each day as a separate "scene-style" and learn cross-day correspondence; consensus is still settling.