§ 0 · The setup
Two unknowns, one image
Classical visual SLAM (Simultaneous Localization And Mapping) splits the problem in two: tracking estimates the current camera pose given the existing map, and mapping updates the map given the latest pose. ORB-SLAM and DROID-SLAM share this two-phase loop, as does KISS-ICP on the LiDAR side. What changes from system to system is the map representation: sparse 3D keypoints, a TSDF voxel grid, an octree of surfels, or, since late 2023, a cloud of 3D Gaussians.
Why Gaussians? Three reasons that ride on top of the same arguments in foundations:
- Photometric loss is the whole game. SLAM has access to RGB images. A 3DGS renderer is differentiable end-to-end with respect to both the Gaussian parameters and the camera pose. The same gradient that tells you to move a Gaussian also tells you to nudge the camera.
- Dense reconstruction comes free. A classical sparse SLAM map is a scattering of 3D points. A 3DGS map is the scene itself, renderable at 100+ FPS. You get a navigable, photorealistic model as a side effect of tracking.
- Incremental updates are natural. New frames bring new viewpoints; you spawn a few thousand more Gaussians where the new view sees regions the existing map does not cover, leaving the rest of the cloud untouched. This is much harder with NeRFs, whose network weights are shared across the whole scene, so a local update can disturb regions far away.
§ 1 · The loop
Tracking, mapping, repeat
Almost every 3DGS-SLAM system follows the same four-step loop. Pseudocode looks like this:
scene = []       # Gaussian map, initially empty
poses = []       # estimated camera poses, one per frame
keyframes = []   # frames kept for bundle adjustment and loop detection

for t, frame in enumerate(stream):
    # ---- 1. tracking: optimize the camera pose with the scene frozen ----
    # the first frame has no predecessor, so it anchors the world frame
    T_init = poses[-1] @ delta_motion_model(t) if poses else IDENTITY_POSE
    # photometric loss on the pose only; nothing to track against while the map is empty
    T_t = optimize_pose(scene, frame, T_init) if scene else T_init
    poses.append(T_t)   # record now so the BA window below includes this frame

    # ---- 2. mapping: add new Gaussians where the new view is uncovered ----
    scene.extend(spawn_from_uncovered_pixels(scene, frame, T_t))

    # ---- 3. local BA: jointly refine a recent window of keyframes ----
    if t % keyframe_stride == 0:
        keyframes.append((t, frame))
        scene, poses = local_bundle_adjust(scene, poses, window=keyframes[-5:])

    # ---- 4. loop closure: only when the camera returns to a mapped area ----
    if detect_loop(frame, keyframes):
        scene, poses = global_optimize(scene, poses)
Step 1 (tracking) is fast — only the 6 camera-pose parameters move, the scene is frozen. Step 2 is fast too — just initialization. Step 3 (local bundle adjustment) is where the real optimization happens: jointly refine the last few keyframes' worth of Gaussians and poses. Step 4 is rare but crucial: when the camera returns to a previously-mapped area, drift in the trajectory is detected and the whole graph is re-optimized.
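The constant-velocity guess in step 1 can be sketched as follows. This is a minimal illustration, assuming poses are 4x4 camera-to-world matrices stored as nested lists; `constant_velocity_guess`, `inv_rigid`, and `pose_yaw_x` are hypothetical helpers standing in for `delta_motion_model`, not functions from any particular system.

```python
import math

def matmul(A, B):
    """Multiply two square matrices given as nested lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def inv_rigid(T):
    """Invert a 4x4 rigid transform [R t; 0 1] as [R^T, -R^T t; 0 1]."""
    Rt = [[T[j][i] for j in range(3)] for i in range(3)]
    t = [T[i][3] for i in range(3)]
    new_t = [-sum(Rt[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [Rt[i] + [new_t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def constant_velocity_guess(poses):
    """Predict the next pose by replaying the most recent frame-to-frame motion."""
    if len(poses) < 2:
        return poses[-1]  # not enough history: reuse the last pose
    delta = matmul(inv_rigid(poses[-2]), poses[-1])  # motion from t-2 to t-1
    return matmul(poses[-1], delta)

def pose_yaw_x(yaw, x):
    """Demo helper: rotation about z by `yaw`, translation `x` along the x axis."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, 0.0, x], [s, c, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

# Two frames of a camera sliding 0.1 units per frame along x.
poses = [pose_yaw_x(0.0, 0.0), pose_yaw_x(0.0, 0.1)]
T_init = constant_velocity_guess(poses)  # predicted x translation: 0.2
```

The prediction only needs to land inside the basin of convergence of the photometric optimizer; tracking then refines it.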
Interactive · run the loop yourself
A toy 2D scene. Drag the slider to advance time. Watch the camera move (orange triangle), new Gaussians appear (cyan blobs) only where the camera sees previously-uncovered pixels, and the trajectory accumulate. Click "drift" to simulate tracking error, then "close loop" to see the correction snap back.
§ 2 · Tracking
Photometric pose optimization
With the scene frozen, tracking reduces to a 6-DoF non-linear least squares: find the camera pose \(\mathbf{T}\in SE(3)\) that makes the rendered image match the observed image. The pose is parameterized on the Lie algebra \(\mathfrak{se}(3)\):
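Written out under assumed notation, the objective takes a form such as the one below. The observed image \(I\), the pixel set \(\Omega\), the initial guess \(\mathbf{T}_{0}\) (for example from the motion model), and the left-multiplicative update convention are assumptions introduced here for illustration; other conventions are equally valid.

\[
\boldsymbol{\xi}^{\star} = \arg\min_{\boldsymbol{\xi}\in\mathbb{R}^{6}} \sum_{p\in\Omega} \rho\!\left(\hat{I}(p;\,\mathbf{T}) - I(p)\right),
\qquad \mathbf{T} = \exp(\boldsymbol{\xi}^{\wedge})\,\mathbf{T}_{0}
\]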
where \(\hat{I}(p; T)\) is the 3DGS render at pose \(T\), \(\rho\) is a robust loss (Huber or log-cosh), and the optimization is a few iterations of Gauss-Newton or Adam. Critically, the derivative \(\partial \hat{I} / \partial \boldsymbol{\xi}\) is closed-form — the EWA projection's Jacobian (foundations §3) with respect to the camera matrix \(W\) gives it to you for free.
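To make the optimization concrete, here is a deliberately tiny sketch: a 1-D "image" of additive Gaussians and a camera with a single degree of freedom standing in for the six of \(\mathfrak{se}(3)\). The scene values, the additive (rather than alpha-composited) renderer, and the Huber threshold are all assumptions chosen for illustration. It runs Gauss-Newton with Huber reweighting and a closed-form Jacobian.

```python
import math

# Toy 1-D scene: (mean, sigma, brightness) per Gaussian. All values are made up.
SCENE = [(-1.0, 0.4, 1.0), (0.5, 0.3, 0.7), (2.0, 0.5, 0.9)]
PIXELS = [-3.0 + 0.1 * i for i in range(61)]  # pixel centres in camera coordinates

def render(x):
    """Image seen by a camera at offset x: pixel u looks at world point u + x."""
    return [sum(w * math.exp(-0.5 * ((u + x - m) / s) ** 2) for m, s, w in SCENE)
            for u in PIXELS]

def render_jacobian(x):
    """Closed-form d(render)/dx, one entry per pixel."""
    return [sum(-w * (u + x - m) / s ** 2 * math.exp(-0.5 * ((u + x - m) / s) ** 2)
                for m, s, w in SCENE)
            for u in PIXELS]

def huber_weight(r, delta=0.1):
    """IRLS weight of the Huber loss: 1 in the quadratic zone, delta/|r| outside."""
    return 1.0 if abs(r) <= delta else delta / abs(r)

def track(observed, x_init, iters=10):
    """Gauss-Newton on the camera offset with the scene held fixed."""
    x = x_init
    for _ in range(iters):
        residual = [a - b for a, b in zip(render(x), observed)]
        jac = render_jacobian(x)
        weights = [huber_weight(r) for r in residual]
        grad = sum(w * j * r for w, j, r in zip(weights, jac, residual))
        hess = sum(w * j * j for w, j in zip(weights, jac))
        x -= grad / (hess + 1e-9)  # 1-D normal equation with a tiny damping term
    return x

x_true = 0.30
observed = render(x_true)
x_est = track(observed, x_init=0.0)  # start 0.3 units off, as a motion model might
```

In a full system the scalar `x` becomes the 6-vector \(\boldsymbol{\xi}\), `hess` becomes a 6x6 matrix, and the Jacobian comes from the rasterizer's backward pass, but the structure of the loop is the same.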
Interactive · tracking gradient
The fixed scene is a few brightly-colored Gaussians. The "rendered" image (left) at the current pose estimate is compared to the "observed" image (right). The colored arrows show the gradient pushing the camera toward alignment. Drag the camera to misalign it; watch tracking pull it back over a few iterations.
§ 3 · Mapping
Spawning Gaussians from new views
Once the camera pose is locked in, the new frame contributes information about regions of the scene the existing Gaussians don't cover. The mapping step finds those regions and adds primitives there. Three signals identify uncovered pixels:
- High residual. Pixels where the rendered image disagrees with the observed one (after the pose is already optimized) are signals that no good Gaussian exists in that line-of-sight.
- High leftover transmittance. If a pixel finishes the alpha march with its transmittance \(T\) still large (here \(T\) is the accumulated transmittance, not the pose), very few opaque Gaussians lie along its ray: that part of the scene is still empty.
- Depth (RGB-D only). If you have a depth reading at \(z = d\), you know exactly where to place a new Gaussian: at the world-space point \(\pi^{-1}(p, d)\). This is the decisive advantage of RGB-D Gaussian SLAM.
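The first two signals can be sketched per pixel as below. This assumes each ray is summarized by a near-to-far list of `(alpha, intensity)` pairs and that a pixel is flagged when either test fires; the function names and both thresholds are illustrative assumptions, not values from any specific system.

```python
def composite(samples):
    """Front-to-back alpha compositing of (alpha, intensity) pairs, near to far.

    Returns the rendered intensity and the transmittance left at the end of the ray.
    """
    color, trans = 0.0, 1.0
    for alpha, intensity in samples:
        color += trans * alpha * intensity
        trans *= 1.0 - alpha
    return color, trans

def is_uncovered(samples, observed, residual_thresh=0.1, trans_thresh=0.5):
    """Flag a pixel whose render disagrees with the observation or whose ray is mostly empty."""
    color, trans = composite(samples)
    return abs(color - observed) > residual_thresh or trans > trans_thresh

well_covered = [(0.6, 0.8), (0.9, 0.8)]    # opaque Gaussians of the right color
nearly_empty = [(0.05, 0.8)]               # almost nothing along the ray
wrong_color = [(0.95, 0.2), (0.9, 0.2)]    # opaque, but the color is off

flags = [is_uncovered(s, observed=0.8)
         for s in (well_covered, nearly_empty, wrong_color)]
# flags == [False, True, True]
```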
SplaTAM, the canonical RGB-D system, spawns a Gaussian at every unmasked depth pixel with \(\boldsymbol{\mu} = \pi^{-1}(p, d_{p})\) and an initial isotropic Σ scaled to the local depth. The optimizer then refines all of them.
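A minimal sketch of the back-projection \(\pi^{-1}(p, d)\) and of an isotropic initialization, assuming a pinhole camera with intrinsics `(fx, fy, cx, cy)` and a camera-to-world pose given as a rotation and a translation. The radius rule used here (one pixel wide at the measured depth) is one plausible reading of "scaled to the local depth"; the dictionary layout and the starting opacity are arbitrary choices for the example.

```python
def backproject(u, v, depth, intrinsics, cam_to_world):
    """pi^{-1}(p, d): lift pixel (u, v) with depth d to a world-space point."""
    fx, fy, cx, cy = intrinsics
    rot, trans = cam_to_world  # 3x3 rotation (nested lists) and a 3-vector
    # Point in the camera frame under a pinhole model with z pointing forward.
    cam = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)
    return tuple(sum(rot[i][k] * cam[k] for k in range(3)) + trans[i]
                 for i in range(3))

def spawn_gaussian(u, v, depth, rgb, intrinsics, cam_to_world):
    """Create one isotropic Gaussian whose footprint is roughly one pixel."""
    fx, fy, _, _ = intrinsics
    radius = depth / (0.5 * (fx + fy))  # projects to about one pixel at this depth
    return {"mean": backproject(u, v, depth, intrinsics, cam_to_world),
            "scale": (radius, radius, radius),  # isotropic: Sigma = radius^2 * I
            "color": rgb,
            "opacity": 0.5}  # arbitrary starting value; the optimizer refines it

K = (500.0, 500.0, 320.0, 240.0)
identity_pose = ([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
                 (0.0, 0.0, 0.0))
g = spawn_gaussian(420.0, 240.0, 2.0, (0.9, 0.1, 0.1), K, identity_pose)
# g["mean"] is approximately (0.4, 0.0, 2.0); each scale entry is 2.0 / 500 = 0.004
```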
Interactive · mapping step
A camera observes a scene with some existing Gaussians. The pixels marked red are the "uncovered" ones: high residual and high leftover transmittance. Click "spawn" to add new Gaussians at those locations. After a few clicks, the residual map empties out and the scene converges.
§ 4 · The monocular case
What if you don't have depth?
RGB-only ("monocular") SLAM is the hard version of this problem. Without a depth sensor you have no direct measurement of distance; each frame gives you only a 2D grid of colors. A small Gaussian nearby and a large one far away can produce the same image, so scale is unobservable from a single frame. The classical fix is multi-frame triangulation, which recovers relative depth but still leaves one global scale factor undetermined; the 3DGS fix is more or less the same, plus regularization from a monocular depth model (DPT, Marigold) as a soft prior.
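The scale ambiguity is easy to verify numerically. The sketch below assumes an axis-aligned pinhole camera (no rotation) and made-up points and camera positions; it shows that scaling the scene and the camera positions by the same factor leaves every pixel coordinate unchanged, across more than one frame.

```python
def project(point, cam_center, focal=500.0):
    """Pinhole projection for an axis-aligned camera located at cam_center."""
    x, y, z = (p - c for p, c in zip(point, cam_center))
    return (focal * x / z, focal * y / z)

def scaled(vec, s):
    """Scale a 3-vector by s."""
    return tuple(s * v for v in vec)

points = [(0.2, -0.1, 2.0), (-0.5, 0.3, 4.0)]
cameras = [(0.0, 0.0, 0.0), (0.3, 0.0, 0.1)]  # two frames of a moving camera

s = 3.0  # any positive factor behaves the same way
original = [project(p, c) for c in cameras for p in points]
rescaled = [project(scaled(p, s), scaled(c, s)) for c in cameras for p in points]

# Largest pixel difference between the two worlds; zero up to rounding error.
max_diff = max(abs(a - b) for o, r in zip(original, rescaled) for a, b in zip(o, r))
```

Because the images are identical, no photometric loss can distinguish the two worlds; metric scale has to come from a depth sensor, a known baseline, or a prior.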
RGB-D (SplaTAM, Gaussian-SLAM)
Depth fixes scale and gives a strong tracking signal. Real-time on a single GPU at 1–10 Hz. Quality matches offline 3DGS to within ~1 dB PSNR.
Used in: Kinect-class consumer captures, robot navigation, AR with depth.
RGB-only (MonoGS, Photo-SLAM)
Scale unobservable from one frame. Needs multi-frame triangulation, mono-depth priors, and careful initialization. Slower convergence, noisier mapping. But cameras are everywhere; depth sensors aren't.
Used in: phone capture, drone capture, archival video.
MonoGS (Matsuki et al., CVPR 2024) was the first credible monocular system. Its tricks: use an off-the-shelf monocular depth estimator as a soft constraint during tracking, initialize new Gaussians using triangulation from co-visible keyframes, and never let a single frame's tracking update propagate to the whole scene (a small temporal window of Gaussians is "active" at any moment). Today's monocular systems run 5–15 Hz, with mapping quality ~2 dB below the RGB-D state of the art.
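One generic way to use a monocular depth prediction as a soft constraint is sketched below. It is not taken from MonoGS or any other system; it assumes the network output is only defined up to an unknown scale and shift, aligns it to the rendered depth by least squares, and penalizes what remains. The function names and the loss weight are hypothetical.

```python
def align_scale_shift(pred, target):
    """Least-squares a, b such that a * pred + b best matches target."""
    n = len(pred)
    mean_p, mean_t = sum(pred) / n, sum(target) / n
    var = sum((p - mean_p) ** 2 for p in pred)
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    a = cov / var
    return a, mean_t - a * mean_p

def depth_prior_loss(rendered_depth, predicted_depth, weight=0.1):
    """Soft prior: mean absolute gap between rendered depth and the aligned prediction."""
    a, b = align_scale_shift(predicted_depth, rendered_depth)
    gaps = [abs(a * p + b - d) for p, d in zip(predicted_depth, rendered_depth)]
    return weight * sum(gaps) / len(gaps)

predicted = [0.10, 0.20, 0.35, 0.50]              # network output, arbitrary units
consistent = [2.0 * p + 1.0 for p in predicted]   # same shape, other scale and shift
bumpy = [1.2, 1.4, 2.4, 2.0]                      # rendered depth with a different shape

loss_consistent = depth_prior_loss(consistent, predicted)  # essentially zero
loss_bumpy = depth_prior_loss(bumpy, predicted)            # clearly positive
```

Aligning first means the prior constrains the shape of the depth map without pretending the network knows metric scale.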
§ 5 · Loop closure
The accumulated drift problem
Every tracking step has a small error. Over a few hundred frames, the errors compound — the estimated trajectory drifts away from the true one. Classical SLAM catches this when the camera revisits a known place: a loop closure matches the current view to an old keyframe and triggers a global optimization to redistribute the error around the whole trajectory.
In 3DGS-SLAM the same idea applies, but the optimization variables include not just the poses but the entire Gaussian map, a much bigger problem. Most current systems therefore split the correction into stages, fixing the trajectory first and the map second:
- Detect — a place-recognition module (NetVLAD or a tiny CNN over the rendered image) hashes each keyframe and looks for similar ones.
- Pose graph — first optimize only the camera trajectory under a constraint that the matched frames must align.
- Map rectification — apply the pose corrections to the Gaussians' positions (rigid transform per local sub-cloud), then re-fine-tune.
- Resume tracking — the latest pose is now anchored to the corrected trajectory.
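The pose-graph and map-rectification steps can be sketched in a stripped-down setting: 2-D positions only, no rotations, equally weighted odometry edges, and a hard constraint that the last keyframe coincides with the first. Under those assumptions the least-squares solution simply spreads the loop error evenly along the trajectory. The `anchor` field, which records the keyframe that spawned each Gaussian, is a hypothetical bookkeeping choice for this example.

```python
def close_loop(positions):
    """Translation-only pose graph: force the last pose back onto the first.

    With equally weighted odometry edges, pose i absorbs a fraction
    i / (n - 1) of the total loop error.
    """
    n = len(positions)
    err = tuple(last - first for last, first in zip(positions[-1], positions[0]))
    corrections = [tuple(-e * i / (n - 1) for e in err) for i in range(n)]
    corrected = [tuple(p + c for p, c in zip(pos, cor))
                 for pos, cor in zip(positions, corrections)]
    return corrected, corrections

def rectify_map(gaussians, corrections):
    """Move each Gaussian by the correction of the keyframe that spawned it."""
    return [{"mean": tuple(m + c for m, c in zip(g["mean"], corrections[g["anchor"]])),
             "anchor": g["anchor"]}
            for g in gaussians]

# Estimated keyframe positions around a square; drift leaves the loop open by (0.4, -0.2).
trajectory = [(0.0, 0.0), (1.1, 0.0), (1.2, 1.0), (0.3, 1.0), (0.4, -0.2)]
gaussians = [{"mean": (1.2, 1.5), "anchor": 2}, {"mean": (0.4, 0.3), "anchor": 4}]

corrected, corrections = close_loop(trajectory)
moved = rectify_map(gaussians, corrections)
# corrected[-1] now coincides with corrected[0]; moved[0]["mean"] is about (1.0, 1.6)
```

Real systems optimize full SE(3) or Sim(3) poses with weighted edges and then fine-tune the Gaussians photometrically, but the order of operations is the same: trajectory first, map second.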
Interactive · drift and snap
Drag the slider to advance a loop trajectory. The dashed line is the ground truth; the solid line is the SLAM estimate, drifting more as you go. When you cross the loop closure threshold (the camera returns close to start), click "close": the trajectory snaps back and the Gaussians move with it.
§ 6 · The systems
The 2024–2025 lineup, by sensor
Click any system for the one-paragraph architecture summary.
§ 7 · Where this stands
State of the art, mid-2025
Both numbers are within an order of magnitude of offline 3DGS reconstruction with ground-truth poses — i.e., the joint tracking+mapping problem isn't materially harder than the pure mapping problem, once you have a good initialization and a decent sensor. That wasn't true of NeRF-based SLAM systems (iMAP, NICE-SLAM); the Gaussian representation closed the gap.
§ 8 · Open
Where the systems still break
Dynamic scenes. Almost every system assumes the world is static. Moving people, cars, and pets currently get fit as Gaussians and then "smear" across views. Some 2025 work (4D-SLAM, Dyna-SLAM) treats moving objects as a separate sub-cloud — see also 3dgs-deformable.
Large scenes. Hundred-meter or city-scale capture still trips the map size limit. Hierarchical Gaussian SLAM (HGS-SLAM) tiles the map; see 3dgs-large-scale.
Online compression. The map grows linearly with capture time. Without occasional Gaussian pruning the system blows past available VRAM. Most production systems integrate a LightGaussian-style pruner — see 3dgs-compression.
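A budget-based pruner might look like the sketch below. The score (opacity times volume times the number of pixels the Gaussian contributed to) is a simplified heuristic in the spirit of significance-based pruning, not LightGaussian's actual formula; the field names `scale`, `opacity`, and `hits` are assumptions for this example.

```python
def prune_to_budget(gaussians, budget):
    """Keep the `budget` highest-scoring Gaussians; score = opacity * volume * hits."""
    def score(g):
        sx, sy, sz = g["scale"]
        return g["opacity"] * (sx * sy * sz) * g["hits"]

    if len(gaussians) <= budget:
        return list(gaussians)  # already within budget, nothing to drop
    return sorted(gaussians, key=score, reverse=True)[:budget]

cloud = [
    {"id": "a", "opacity": 0.90, "scale": (0.10, 0.10, 0.10), "hits": 200},
    {"id": "b", "opacity": 0.05, "scale": (0.10, 0.10, 0.10), "hits": 150},  # nearly transparent
    {"id": "c", "opacity": 0.80, "scale": (0.20, 0.10, 0.10), "hits": 2},    # almost never seen
    {"id": "d", "opacity": 0.70, "scale": (0.05, 0.05, 0.05), "hits": 900},
]

kept = prune_to_budget(cloud, budget=2)
kept_ids = [g["id"] for g in kept]  # ["a", "d"]
```

Running such a pass every few keyframes keeps the map size roughly constant, at the cost of a brief re-optimization to compensate for the removed primitives.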