3D Gaussian Splatting SLAM
From the August 2023 paper that started everything to the foundation-model-driven frontier of May 2026. Written for someone who knows basic NeRF, SDF, ML, and linear algebra — but doesn't know SLAM yet.
1 · The big picture
What is SLAM?
Picture yourself dropped into a hotel room with the lights off. You can't see anything, but you can reach out and feel the wall. As your fingertips trace its edge, two things happen at once: you build up a mental map of where the wall is, and you figure out where you are standing relative to it. Crucially, the two pieces of knowledge are inseparable — to know where the wall is in the room, you have to know where you are; to know where you are, you have to know where the walls are.
That chicken-and-egg dance is called SLAM: Simultaneous Localization and Mapping. Algorithmically, you have a robot or AR headset streaming images (or depth, or LiDAR), and you want to recover two things together — the camera trajectory and a map of the world. SLAM is what makes a Roomba not bump into the same chair twice, what makes a Quest 3 keep your virtual screen pinned to your kitchen counter, and what every self-driving car runs in its head every millisecond.
Why 3DGS changed everything
For decades the "map" in SLAM was a list of points, a TSDF voxel grid, or (starting in 2021) a neural radiance field. NeRF maps looked gorgeous and could be queried for novel views, but they were brutally slow: every pixel needed dozens to hundreds of MLP evaluations along a ray, so mapping ran at fractions of a hertz. SLAM, which has to keep up with a 30 Hz camera, was effectively locked out of that representation.
Then in August 2023, Kerbl et al.'s 3D Gaussian Splatting showed up. Same alpha-compositing rendering equation as NeRF, same end-to-end differentiability — but instead of marching down a ray you rasterize a cloud of explicit blobs. Rendering jumped from seconds per frame to milliseconds. Within twelve weeks, five separate SLAM groups had a working 3DGS-SLAM prototype on arXiv. The dam had broken.
This survey walks through that lineage paper by paper, sub-direction by sub-direction. Along the way you'll play with three interactive demos that build the intuition for what 3D Gaussians do, why they make SLAM tractable, and what classical SLAM concepts (drift, loop closure, bundle adjustment) actually mean in pictures.
2 · Background you need
Skim or skip this chapter as needed. Each subsection assumes you already know what's in the previous one.
2.1 · A SLAM primer (for someone who has never read a SLAM paper)
Tracking vs. Mapping (the PTAM split)
Klein and Murray's PTAM (ISMAR 2007) made the architectural decision that everybody — including every 3DGS-SLAM paper — has copied since: split the system into two threads. A tracking thread runs at full frame rate and answers a narrow question: "given the current map, where am I right now?" A mapping thread runs slower, in the background, and answers: "given the last hundred frames, what does the world actually look like, and where were the keyframes really?" Tracking is local and cheap; mapping is global and expensive. They share the same map but operate on it asynchronously.
Drift and loop closure
Because every frame's pose is estimated relative to the previous one, tiny errors accumulate. After walking down a corridor for thirty seconds, your estimated trajectory is bent a few degrees away from where you actually went. This bending is called drift, and it is the central pathology of pure odometry, that is, SLAM without any global correction. The cure is loop closure: when the system recognizes a place it has visited before, it adds a single constraint ("the pose I have now must equal the pose I had on frame 137") and uses it to re-rigidify the whole trajectory. One good loop closure can erase most of the drift accumulated since that earlier visit.
The classical lineage in one paragraph
Three pre-NeRF families dominate the textbooks. Feature-based methods (the ORB-SLAM line) extract sparse keypoints and minimize reprojection error: fast and robust, but the maps are sparse. Direct methods (LSD-SLAM, DSO) skip features and minimize photometric error directly on pixels: denser, but more sensitive to lighting. Dense methods (KinectFusion, ElasticFusion) need a depth sensor and fuse measurements into a dense surface model (a TSDF volume in KinectFusion, surfels in ElasticFusion): beautiful geometry, but memory-heavy.
The map representation question, and why it's everything
What you choose to store determines what your SLAM can do:
- Sparse points — tiny memory, great for re-localization, useless for rendering.
- TSDF voxels — beautiful surfaces, memory blows up with volume.
- Neural fields (NeRF) — compact, photorealistic, agonizingly slow to query.
- 3D Gaussians — explicit (like points), differentiable (like neural fields), rasterizable (like meshes). The sweet spot.
Bundle Adjustment in one paragraph
The mapping thread's workhorse is bundle adjustment: stack every observation of every 3D entity into one giant nonlinear least-squares problem, then minimize jointly over both poses and the map.
$$ \min_{\{T_i\}, \{X_j\}} \;\sum_{(i,j) \in \mathcal{O}} \rho\bigl(\| \pi(T_i, X_j) - u_{ij} \|^2\bigr) $$
Here $T_i$ is the $i$-th camera pose, $X_j$ is the $j$-th map element, $\mathcal{O}$ is the set of (camera, map element) pairs that were actually observed, $\pi$ projects the map element through the camera into the image, $u_{ij}$ is the observed pixel, and $\rho$ is a robust loss. Levenberg-Marquardt solves it; the sparse block structure (every observation only touches one camera and one map element) lets you do the inversion via the Schur complement in a tractable amount of time. In 3DGS-SLAM, the analog is jointly optimizing camera poses and Gaussian parameters by backpropagating the photometric loss: the same joint objective over poses and map, but a very different map and, typically, a first-order optimizer such as Adam in place of Levenberg-Marquardt.
Datasets & metrics you'll see in every paper
| Dataset | What it is | Why it matters |
|---|---|---|
| Replica | Photorealistic synthetic rooms with perfect ground truth. | The default neural-SLAM benchmark; PSNR ceilings are achievable. |
| TUM-RGBD | Real handheld RGB-D sequences from TU Munich. | The classical small-scale SLAM benchmark; many trajectories. |
| ScanNet / ScanNet++ | 1500+ real indoor scans with reconstructed meshes. | "Does this generalize to real rooms?" sanity check. |
| EuRoC MAV | Stereo + IMU drone flights. | The visual-inertial standard, used by multi-modal works. |
| KITTI | Outdoor driving with LiDAR + stereo. | Car-scale benchmark; rarely used by indoor GS-SLAM. |
Standard metrics: ATE-RMSE (root-mean-squared distance between estimated and true trajectory after rigid alignment) for localization, PSNR / SSIM / LPIPS for rendering, Depth-L1 for geometry.
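To make the localization metric concrete, here is a minimal NumPy sketch of ATE-RMSE. The function name `ate_rmse` and the use of a Kabsch (SVD) rigid alignment are assumptions for illustration; benchmark toolkits may align differently, and monocular systems usually need a scale-aware (similarity) alignment instead.

```python
import numpy as np

def ate_rmse(est, gt):
    """ATE-RMSE between two (N, 3) position arrays after rigid (Kabsch) alignment."""
    est, gt = np.asarray(est, float), np.asarray(gt, float)
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    # Cross-covariance between the centered trajectories.
    H = (est - mu_e).T @ (gt - mu_g)
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection so that R is a proper rotation.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_g - R @ mu_e
    aligned = est @ R.T + t
    return float(np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1))))
```

A trajectory that differs from ground truth only by a rigid motion scores essentially zero, which is the point of aligning first: the metric measures shape error, not the arbitrary choice of world frame.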
2.2 · 3DGS recap (you already know NeRF, vaguely)
From NeRF to 3DGS: one mental flip
NeRF says: the scene is a function $f_\theta: (x, y, z) \mapsto (\sigma, c)$. To render a pixel, march samples along its ray and query $f_\theta$ at each one. Hundreds of network evaluations per pixel.
3DGS says: the scene is a list of stuff. A few million primitives, each one an explicit anisotropic blob. To render a pixel, find which blobs cover it and sum their contributions.
The rendering equation is literally identical:
$$ C = \sum_{i=1}^{N} T_i \, \alpha_i \, c_i, \qquad T_i = \prod_{j=1}^{i-1} (1 - \alpha_j). $$
What changes is the source of the samples: NeRF samples along a ray; 3DGS sorts over the primitives touching this pixel's tile. Same alpha compositing, completely different machine.
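The compositing equation can be sketched for a single pixel as follows. The function name `composite` is hypothetical, and the inputs are assumed to be already sorted front to back.

```python
import numpy as np

def composite(alphas, colors):
    """Front-to-back alpha compositing: C = sum_i T_i * alpha_i * c_i."""
    C = np.zeros(3)
    T = 1.0  # transmittance: fraction of light not yet absorbed
    for a, c in zip(alphas, colors):
        C += T * a * np.asarray(c, float)
        T *= 1.0 - a
    return C, T
```

The leftover transmittance `T` is how much of the background still shows through; `1 - T` is the accumulated coverage of that pixel.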
The anatomy of one 3D Gaussian
Each primitive in the cloud carries four parameter blocks:
- Position $\mu \in \mathbb{R}^3$ — where the blob's center is in the world.
- Anisotropic covariance $\Sigma = R\,S\,S^\top\,R^\top$, built from a rotation $R$ (stored as a unit quaternion $q$) and a diagonal scale matrix $S = \mathrm{diag}(s)$ with $s \in \mathbb{R}^3$. This factoring is not an accident: it keeps $\Sigma$ symmetric positive semi-definite under any gradient update, and positive-definite as long as every scale is nonzero.
- View-dependent color — spherical harmonics coefficients. Degree 0 is plain RGB; higher degrees let the color shift with view direction (specularities, anisotropic reflections).
- Opacity $\alpha \in [0, 1]$ — how much light the blob blocks.
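A small sketch of the covariance factoring, under two assumptions: the quaternion is ordered `(w, x, y, z)`, and scales are stored as logarithms so that they stay positive (the helper names `quat_to_rot` and `covariance` are illustrative).

```python
import numpy as np

def quat_to_rot(q):
    """Rotation matrix from a quaternion (w, x, y, z); normalizes the input first."""
    w, x, y, z = np.asarray(q, float) / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])

def covariance(q, log_s):
    """Sigma = R S S^T R^T with S = diag(exp(log_s))."""
    M = quat_to_rot(q) @ np.diag(np.exp(log_s))
    return M @ M.T
```

Whatever values the optimizer writes into `q` and `log_s`, the result is symmetric with strictly positive eigenvalues, which is why the parameters can be updated freely by gradient descent.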
How a 3D blob projects to a 2D blob (EWA splatting)
A 3D ellipsoid seen through a pinhole camera projects, to first order, to a 2D ellipse on the image. The projected covariance is
$$ \Sigma' = J\,W\,\Sigma\,W^\top\,J^\top $$
where $W$ is the rotational part of the world-to-camera transform and $J$ is the Jacobian of the perspective projection, evaluated at the Gaussian's center $\mu$ expressed in camera coordinates. This is the classical EWA splatting approximation (Zwicker et al., 2001). Geometrically: tilt and squash the 3D ellipsoid through the camera, drop the depth axis (keep the upper-left $2 \times 2$ block of $\Sigma'$), and you have an oriented ellipse. That ellipse is what the rasterizer draws.
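The projection can be sketched in a few lines of NumPy. This version assumes a plain pinhole camera with focal lengths `fx`, `fy` and uses the 2×3 projection Jacobian directly, which is equivalent to forming the 3×3 product and dropping the depth row and column. All names are illustrative.

```python
import numpy as np

def project_covariance(Sigma, mu_world, R_world_to_cam, t_world_to_cam, fx, fy):
    """Return the 2x2 image-space covariance of a 3D Gaussian (EWA approximation)."""
    # Gaussian center in camera coordinates.
    x, y, z = R_world_to_cam @ mu_world + t_world_to_cam
    # Jacobian of the pinhole projection (u, v) = (fx * x / z, fy * y / z).
    J = np.array([
        [fx / z, 0.0, -fx * x / z**2],
        [0.0, fy / z, -fy * y / z**2],
    ])
    W = R_world_to_cam
    return J @ W @ Sigma @ W.T @ J.T
```

For an isotropic blob on the optical axis this reduces to scaling the variance by `(f / z)**2`: the same blob looks smaller the farther away it is.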
Tile-based rasterization in one paragraph
The image is divided into tiles of $16 \times 16$ pixels. Each Gaussian is assigned to every tile its projected ellipse overlaps. Gaussians are then sorted by depth within each tile (in the reference implementation, one GPU radix sort over combined tile-and-depth keys), and the per-pixel alpha blend just walks the sorted list. Because tiles are independent, the whole thing fans out onto a CUDA grid. This engineering, not anything about the math, is why 3DGS renders at hundreds of FPS where NeRF crawls.
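A CPU-side sketch of the binning and sorting step may help. It assumes each splat is bounded by a square of half-width 3σ (taken from the largest eigenvalue of the 2D covariance) and mimics the combined sort with Python tuples; the function names are hypothetical and a real implementation does this on the GPU.

```python
import math
import numpy as np

TILE = 16  # tile edge length in pixels

def tiles_touched(mean2d, cov2d, width, height, tile=TILE):
    """List the (tx, ty) tiles overlapped by a splat's 3-sigma bounding square."""
    radius = 3.0 * math.sqrt(np.linalg.eigvalsh(cov2d)[-1])
    nx, ny = math.ceil(width / tile), math.ceil(height / tile)
    x0 = max(0, int((mean2d[0] - radius) // tile))
    x1 = min(nx - 1, int((mean2d[0] + radius) // tile))
    y0 = max(0, int((mean2d[1] - radius) // tile))
    y1 = min(ny - 1, int((mean2d[1] + radius) // tile))
    return [(tx, ty) for ty in range(y0, y1 + 1) for tx in range(x0, x1 + 1)]

def build_sorted_keys(means2d, covs2d, depths, width, height, tile=TILE):
    """Return (tile_id, depth, gaussian_index) triples sorted by tile, then depth."""
    nx = math.ceil(width / tile)
    keys = []
    for i, (m, c, d) in enumerate(zip(means2d, covs2d, depths)):
        for tx, ty in tiles_touched(m, c, width, height, tile):
            keys.append((ty * nx + tx, d, i))
    keys.sort()
    return keys
```

After the sort, all entries for one tile are contiguous and ordered near to far, so each tile can blend its own slice of the list without looking at any other tile.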
The property that makes SLAM possible
Every operation above — the projection, the Jacobian, the covariance push-forward, the alpha blend — is smooth and differentiable. The photometric loss between rendered and observed pixels therefore backpropagates straight through to both the Gaussian parameters $(\mu, q, s, c, \alpha)$ and the camera pose $T$. One unified loss optimizes the map and the trajectory. That single property is what turned 3DGS into a SLAM substrate.
2.3 · NeRF-SLAM precursors
The idea "use a neural scene representation as your SLAM map" did not start with 3DGS. It started two years earlier, and it ran into a wall that 3DGS later broke through.
- iMAP (Sucar et al., ICCV 2021) — first proof that one MLP could be the map of a real-time RGB-D SLAM. Worked on tiny scenes; one MLP cannot store an apartment.
- NICE-SLAM (Zhu et al., CVPR 2022) — replaced the single MLP with a hierarchy of voxel feature grids. Scaled to rooms; became the dominant baseline of the NeRF-SLAM era.
- Vox-Fusion (Yang et al., ISMAR 2022) — voxel features in an octree, so memory grew with the observed surface, not the bounding box.
- ESLAM (Johari et al., CVPR 2023) — tri-plane features + TSDF decoder; strong accuracy at low memory.
- Co-SLAM (Wang et al., CVPR 2023) — hash-grid + coordinate encoding; first to comfortably exceed 10 Hz mapping.
- Point-SLAM (Sandström et al., ICCV 2023) — neural features anchored to a growing point cloud, foreshadowing the explicit-primitives idea 3DGS-SLAM would soon push to its limit.
3 · Play with splats
Three interactive demos built directly in your browser. No videos, no Colab — every pixel below is computed live by JavaScript using exactly the math we just covered, dropped from 3D to 2D for visual clarity. Drag, scrub the sliders, hit play.
Demo 1 · The splat rasterizer
Eight anisotropic 2D Gaussians compositing front-to-back over a black background. Drag any Gaussian center to move it; scrub the sliders to scale all the ellipses or fade their opacity globally. Toggle the ellipse overlay to see the $2\sigma$ outlines.
What you're seeing: for each pixel we compute $G_i(p) = \exp\!\bigl(-\tfrac{1}{2}(p-\mu_i)^\top \Sigma_i^{-1} (p-\mu_i)\bigr)$ for each Gaussian, then accumulate $C \leftarrow C + T \cdot \alpha_i G_i(p) \cdot c_i,\;\; T \leftarrow T \cdot (1 - \alpha_i G_i(p))$. Identical to the 3DGS rendering equation; we just collapsed the projection step because we're already in image space.
Demo 2 · Fitting Gaussians to an image
On the left, a target image. On the right, 36 randomly-initialised 2D Gaussians. Hit play and watch them migrate, recolor, and resize themselves to reproduce the target. This is the same optimization loop that every 3DGS-SLAM paper runs every keyframe — just compressed to a single image instead of a multi-view photometric loss.
Note: real 3DGS uses Adam on analytic gradients. To keep the demo dependency-free and intuitive we use an EM-style update — push each Gaussian's center toward the residual centroid in its support region, and nudge color toward the residual mean. The dynamics are conceptually the same: large local residual → larger update.
Demo 3 · Drift and the loop-closure miracle
The blue path is the robot's true trajectory — a figure-eight. The orange path is what the robot thinks it walked, after we added a tiny per-step heading noise. Crank the noise slider up to see the orange path drift wildly off course. Now toggle loop closure on and re-run: when the robot returns to its start, a single constraint snaps the whole trajectory back into shape.
This is a toy: a real SLAM system never knows the true step length, the loop has to be detected (place recognition), and the correction is distributed via pose-graph optimization, not linear interpolation. But the qualitative dynamic — small per-step error compounding into huge global error, then a single global constraint correcting it — is exactly the problem and exactly the solution.
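The same toy dynamic can be reproduced offline in a few lines. This is a sketch, not the demo's code: it walks a closed circle instead of a figure-eight, assumes the true step length is known, and applies the naive linear end-point correction that the text contrasts with real pose-graph optimization. The noise level and seed are arbitrary.

```python
import numpy as np

def dead_reckon(headings, step=1.0):
    """Integrate fixed-length steps along the given headings, starting at the origin."""
    steps = step * np.stack([np.cos(headings), np.sin(headings)], axis=1)
    return np.vstack([np.zeros(2), np.cumsum(steps, axis=0)])

def rmse(traj, ref):
    """Root-mean-squared point-to-point distance between two trajectories."""
    return float(np.sqrt(np.mean(np.sum((traj - ref) ** 2, axis=1))))

rng = np.random.default_rng(0)
n = 200
true_turn = np.full(n, 2 * np.pi / n)              # closed circle: ends where it starts
noisy_turn = true_turn + rng.normal(0.0, 0.01, n)  # small per-step heading noise

truth = dead_reckon(np.cumsum(true_turn))
odom = dead_reckon(np.cumsum(noisy_turn))

# Toy loop closure: the last pose must coincide with the first, so spread the
# end-point gap linearly along the path.
gap = odom[-1] - odom[0]
closed = odom - np.linspace(0.0, 1.0, n + 1)[:, None] * gap

print("RMSE before closure:", rmse(odom, truth))
print("RMSE after closure: ", rmse(closed, truth))
```

Because the heading noise is integrated twice (once into heading, once into position), the end-point gap grows much faster than the per-step error would suggest, which is why a single constraint at the end is worth so much.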
4 · Taxonomy & timeline
A field that opened in twelve weeks
Below: an opinionated timeline of the most influential 3DGS-SLAM papers. Notice the density at the very beginning — five papers in three weeks, all reacting to the same August 2023 release.
Five axes every paper varies along
You can locate any 3DGS-SLAM work in a five-dimensional space. Once you have these axes in your head, reading the field is much easier.
| Axis | Common values | What's at stake |
|---|---|---|
| Input modality | RGB-D · monocular · stereo · LiDAR+IMU+RGB · event · multi-camera | Easier inputs (RGB-D) → easier mapping; harder inputs (mono) → bigger algorithmic burden but bigger market. |
| Tracking school | Gradient through Gaussians · Classical front-end (ORB / DROID) · ICP using Gaussian covariances · Feed-forward neural pose head | Determines the speed/robustness/elegance trade-off. |
| Map structure | One global cloud · Sub-maps · Anchors + neighbors · Hybrid 3DGS + SDF · 2D surfels | Drives scalability and loop closure. |
| Scene assumption | Static · Dynamic objects · Fully dynamic · In-the-wild outdoor · Large-scale city | Each generalisation removes a baked-in static-world assumption. |
| Output channels | Photometric only · + Geometry (depth/normals) · + Semantics · + Language features · + Motion forecasts | What downstream tasks (rendering vs. navigation vs. manipulation) the map can serve. |
5 · The early era (Nov 2023 → mid 2024)
The five papers that opened the field all appeared within a three-week window. Their differences crystallized the three tracking philosophies that the rest of the field would inherit:
- "Unified representation" school: optimize the camera pose by gradient descent on the photometric loss rendered from the Gaussians, so pose and map share one backpropagation chain (GS-SLAM, SplaTAM, Gaussian-SLAM, MonoGS, CG-SLAM, RTG-SLAM).
- "Gaussians for rendering only" school: keep a battle-tested classical tracker and bolt Gaussians on for photo-realism (Photo-SLAM).
- "Gaussians are probabilistic points" school: reuse the covariances for classical scan matching (GS-ICP-SLAM).
5.1 · GS-SLAM RGB-D
The argument. NeRF-SLAM is slow because volume rendering needs hundreds of MLP queries per ray. 3DGS gives the same dense photo-realistic map with rasterization that runs at frame rate. Therefore: replace the neural field with an explicit Gaussian cloud and watch everything else fall into place.
The two tricks. First, coarse-to-fine pose optimization — first match against a sparse set of high-confidence Gaussians, then refine on a denser selection. This isn't just a speed hack; it stabilizes the optimization, because rough alignment from sparse points keeps you in the basin of attraction for the dense step. Second, an adaptive expansion strategy: explicitly add Gaussians in newly observed regions, delete Gaussians whose accumulated photometric error is suspect.
5.2 · Photo-SLAM Monocular RGB-D Stereo
The argument. Why fight classical SLAM? ORB-SLAM3 already handles pose estimation beautifully across monocular, stereo, and RGB-D inputs. Bolt a 3D Gaussian field on top — purely for photo-realistic rendering — and you get the best of both worlds for free.
The architecture. ORB-SLAM3 runs as the front-end and produces keyframes with confident poses and sparse map points. Those map points seed a 3D Gaussian field; the Gaussians are then optimized via standard 3DGS photometric loss, with a multi-resolution training schedule to learn coarse colors first and refine. Crucially, the Gaussian gradients do not flow back to perturb the ORB-SLAM3 poses — the tracker is treated as a black-box source of ground truth.
Why it punches above its weight. The whole thing is implemented in C++/CUDA with LibTorch. It runs in real time on a Jetson AGX Orin, the only system of this generation to do so. The paper reports a ~30% PSNR improvement over its contemporaries on Replica, and rendering "hundreds of times faster" than NeRF-based competitors.
5.3 · SplaTAM RGB-D
The argument. Forget the neural field entirely. With an RGB-D camera you have everything you need to optimize an explicit Gaussian cloud directly: photometric loss for color, depth loss for geometry, and a differentiable rasterizer that backprops cleanly to pose. The whole system is just gradient descent on Gaussians and a transform.
The "what's new here" signal. Densification is the central technical contribution. When you arrive at a new frame, how do you know which pixels show geometry the map already explains versus geometry the map doesn't know about yet? SplaTAM renders a silhouette mask from the current Gaussians — for each pixel, the accumulated $1 - T$ tells you how much "stuff" already covers it. Pixels with low silhouette are unmapped; you spawn new Gaussians there. Simple, principled, and the move that lets the map grow without overlaying noise on already-modeled regions.
# SplaTAM densification pseudocode:
sil = render_silhouette(gaussians, T_cam) # per pixel, in [0, 1]
unmapped = (sil < 0.5) & (depth_obs > 0) # mask of pixels not yet mapped
new_xyz = backproject(depth_obs[unmapped], T_cam)
new_color = rgb_obs[unmapped]
gaussians.add(positions=new_xyz, colors=new_color, isotropic_init=True)
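For readers who want something executable, the pseudocode above can be fleshed out roughly as follows. This is a sketch under assumptions, not SplaTAM's implementation: the silhouette is taken as a given array, `K` is a 3×3 pinhole intrinsics matrix, `T_cam_to_world` is a 4×4 pose, and the initial radius (one pixel's footprint at the observed depth) is an illustrative choice. The actual system may use additional criteria, such as depth error, to decide where to add Gaussians.

```python
import numpy as np

def backproject(depth, mask, K, T_cam_to_world):
    """Lift the masked pixels of a depth image to world-space points, shape (M, 3)."""
    v, u = np.nonzero(mask)  # row-major order, matching boolean indexing below
    z = depth[v, u]
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)
    return (pts_cam @ T_cam_to_world.T)[:, :3]

def densify(silhouette, depth_obs, rgb_obs, K, T_cam_to_world, sil_thresh=0.5):
    """Return positions, colors and radii for new Gaussians in unmapped pixels."""
    unmapped = (silhouette < sil_thresh) & (depth_obs > 0)
    new_xyz = backproject(depth_obs, unmapped, K, T_cam_to_world)
    new_rgb = rgb_obs[unmapped]
    # Assumed isotropic init: radius of roughly one pixel at the observed depth.
    new_radius = depth_obs[unmapped] / (0.5 * (K[0, 0] + K[1, 1]))
    return new_xyz, new_rgb, new_radius
```

Only pixels with low silhouette and valid depth produce new Gaussians, so regions the map already explains are left alone.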
Numbers. Up to $\approx 2\times$ improvement in pose accuracy, mapping quality, and novel-view PSNR over NeRF-SLAM baselines (NICE-SLAM, Point-SLAM) on Replica, TUM-RGBD, ScanNet, and ScanNet++. It became the universal baseline that later papers compare against.
5.4 · Gaussian-SLAM RGB-D
The problem they spotted. One global Gaussian cloud doesn't scale. The longer the camera flies around, the bigger the map, the slower every gradient step, the harder it is to hold everything in GPU memory. Their answer: sub-maps. Decompose the world into local Gaussian fields, optimize only the active one, swap them in and out as the trajectory progresses.
Why it matters in the long run. Sub-maps are also the natural unit of loop closure — you can later register one Gaussian sub-map against another to add a global constraint. This is the seed that LoopSplat will eventually grow into a full global-consistency story.
5.5 · MonoGS (a.k.a. "Gaussian Splatting SLAM") Monocular RGB-D Stereo
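The hard mode. Monocular SLAM is the brutal case: no depth sensor, scale is fundamentally ambiguous, photometric cues are everything. Every other system in the November-December 2023 wave that tracked the camera through the Gaussians themselves used RGB-D (Photo-SLAM handled monocular input, but by delegating tracking to ORB-SLAM3). This one cracked monocular.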
The contribution that makes it work. They derive analytic Jacobians of the 3DGS rendering w.r.t. camera pose on the SE(3) manifold. That sentence sounds dry; here's the cash value: classical Gauss-Newton (the hammer of every SLAM textbook for the last 30 years) needs Jacobians, and once you have them on a Lie group you get clean second-order convergence for pose. So pose optimization isn't a hand-tuned Adam loop fighting curvature; it's the same well-conditioned trust-region solve that every BA system uses, with the differentiable splatter providing the residuals.
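In standard Gauss-Newton form, the pose update reads
$$ \xi = -\bigl(\mathbf{J}^\top \mathbf{J}\bigr)^{-1} \mathbf{J}^\top \mathbf{r}, \qquad T \leftarrow \exp(\xi)\, T $$
where $\xi \in \mathfrak{se}(3)$ is a tangent-space pose increment, $\mathbf{r}$ is the photometric residual, and $\mathbf{J}$ is the Jacobian of the rendered color w.r.t. $\xi$. The $\exp$ retracts the update back onto the manifold.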
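The pattern of solving for a tangent-space increment and then retracting with $\exp$ is easier to see on a toy problem. The sketch below is not MonoGS: it swaps the photometric residual for a 2D point-alignment residual and works on SE(2) instead of SE(3), purely to show the Gauss-Newton-plus-retraction loop. The function names are illustrative.

```python
import numpy as np

def se2_exp(xi):
    """Exponential map from a twist (rho_x, rho_y, phi) to a 3x3 SE(2) matrix."""
    rho, phi = xi[:2], xi[2]
    c, s = np.cos(phi), np.sin(phi)
    if abs(phi) < 1e-9:
        V = np.eye(2)
    else:
        V = np.array([[s, -(1 - c)], [1 - c, s]]) / phi
    T = np.eye(3)
    T[:2, :2] = [[c, -s], [s, c]]
    T[:2, 2] = V @ rho
    return T

def gauss_newton_align(p, q, iters=10):
    """Find T in SE(2) minimizing sum ||T p_i - q_i||^2 by Gauss-Newton."""
    T = np.eye(3)
    for _ in range(iters):
        y = p @ T[:2, :2].T + T[:2, 2]  # current transformed points, (N, 2)
        r = (y - q).reshape(-1)         # stacked residuals, (2N,)
        # Jacobian of each residual w.r.t. a left-multiplied increment xi.
        J = np.zeros((2 * len(p), 3))
        J[0::2, 0] = 1.0
        J[1::2, 1] = 1.0
        J[0::2, 2] = -y[:, 1]
        J[1::2, 2] = y[:, 0]
        xi = -np.linalg.solve(J.T @ J, J.T @ r)  # normal equations
        T = se2_exp(xi) @ T                      # retract onto the manifold
    return T
```

The pose is never updated by adding numbers to a matrix; the increment lives in the tangent space and `se2_exp` maps it back to a valid rigid transform, so the rotation block stays orthonormal at every iteration.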
Plus an isotropic-shape regularizer. In monocular mode the depth ambiguity loves to manifest as long, needle-like Gaussians that "shoot off into nothing" along the optical axis. They add a regularizer that gently penalizes extreme anisotropy, killing the degenerate solutions.
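One plausible form of such a regularizer, shown here only as an assumption about its shape, penalizes each Gaussian's scales for deviating from their own mean. The function name and the L1 choice are illustrative.

```python
import numpy as np

def isotropic_penalty(scales):
    """Sum over Gaussians of the L1 distance between each scale and the mean scale.

    scales: array of shape (N, 3), one row of axis scales per Gaussian.
    """
    mean = scales.mean(axis=1, keepdims=True)
    return float(np.abs(scales - mean).sum())
```

A perfectly round Gaussian contributes zero, while a needle such as `(1, 1, 4)` contributes 4, so adding a small multiple of this term to the loss nudges the optimizer away from needle-like solutions without forbidding moderate anisotropy.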
Numbers. ~3 FPS live monocular SLAM, full appearance and geometry recovery on Replica, TUM-RGBD, ScanNet++. The "Best Demo" award at CVPR 2024 made this the field's poster child.
5.6 · CG-SLAM RGB-D
The gripe. Earlier 3DGS-SLAMs treat every Gaussian equally during pose optimization. But Gaussians near depth discontinuities or in noisy regions hurt tracking — their gradients pull the camera toward bad poses. We need a way to say "trust this one, distrust that one."
The fix. Each Gaussian carries an explicit uncertainty score derived from a depth-noise model. The pose loss is reweighted: trustworthy Gaussians vote loudly, uncertain ones whisper. This — together with an efficient tile-based renderer specialized for tracking — gets them to ~15 Hz tracking, several times faster than SplaTAM/MonoGS at matched accuracy.
5.7 · GS-ICP-SLAM RGB-D ~107 FPS
The conceptual flip. A 3D Gaussian is literally a 3D point plus a covariance. That is exactly what Generalized ICP needs to do probabilistic scan matching — a beautiful pre-deep-learning method from 2009 that aligns point clouds by treating each point as a tiny ellipsoid. So instead of paying the cost of photometric gradient descent every frame, just reuse the Gaussian covariances as G-ICP inputs.
What that buys you. Tracking becomes a classical scan match — fast, well-understood, no learning rate to tune. They also exchange the covariances back and forth (with a careful scale alignment) so the same ellipsoids that fit appearance also serve scan matching. End-to-end up to ~107 FPS, easily the fastest of this generation.
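To see why a Gaussian's covariance is exactly what G-ICP consumes, here is a minimal sketch of the Generalized ICP objective for already-matched pairs. It is not the GS-ICP-SLAM code: correspondences are assumed given (the nearest-neighbor search is omitted), and the function name is hypothetical.

```python
import numpy as np

def gicp_cost(R, t, mu_src, cov_src, mu_tgt, cov_tgt):
    """Sum of Mahalanobis distances d^T (C_tgt + R C_src R^T)^-1 d over matched pairs."""
    total = 0.0
    for a, Ca, b, Cb in zip(mu_src, cov_src, mu_tgt, cov_tgt):
        d = b - (R @ a + t)              # residual between matched centers
        combined = Cb + R @ Ca @ R.T     # uncertainty of the residual
        total += d @ np.linalg.solve(combined, d)
    return float(total)
```

Minimizing this cost over `R` and `t` is the scan-matching step; directions in which either ellipsoid is elongated are down-weighted, so flat, surface-aligned Gaussians behave like plane-to-plane constraints.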
5.8 · RTG-SLAM RGB-D Large-scale
The double partition. RTG-SLAM is the first 3DGS-SLAM that takes large scenes seriously, and it does so via two orthogonal classifications of every Gaussian:
- Opaque vs. transparent. Every Gaussian is forced to be either nearly opaque (fitting surface + dominant color) or nearly transparent (modeling residual color, view-dependent highlights, lacy structures). This decouples geometry from appearance and prevents the "many half-opaque Gaussians" mess.
- Stable vs. unstable. After a few frames most of the map stops moving — surface positions converge. Once a Gaussian is "stable," it's frozen and excluded from re-optimization; only "unstable" Gaussians (near the current frustum, or with recent residual) keep getting updates. Crucially, only the pixels covered by unstable Gaussians are re-rendered each step.
- 不透明 vs. 透明。每个 Gaussian 被强行约束成 要么 几乎完全不透明(拟合曲面 + 主色),要么 几乎完全透明(建模残差颜色、视相关高光、镂空细节)。这一刀把几何和外观解耦,避免"一堆半透明 Gaussian 乱叠"的混乱。
- 稳定 vs. 不稳定。跑几帧之后大半地图已经不动了——曲面位置收敛。一个 Gaussian 被打上"稳定"标签就直接冻结,不参与后续优化;只有"不稳定"的 Gaussian(在当前 frustum 附近,或最近有残差)才会继续更新。关键是,每步只对被"不稳定 Gaussian"覆盖的像素做重渲染。
Together, these two partitions cut compute and memory roughly in half, making real-time large-scale reconstruction (whole rooms, multiple connected spaces) feasible.
两套划分合在一起,计算和显存大约都砍掉一半,整间房、多间相连空间的实时大场景重建变得可行。
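The stable/unstable split can be sketched as simple bookkeeping. This is only an illustration under stated assumptions: the `Gaussian` fields, the residual threshold, and the "low residual for several consecutive frames" rule are hypothetical, not RTG-SLAM's actual criteria.

```python
from dataclasses import dataclass


@dataclass
class Gaussian:
    recent_residual: float  # latest fitting error seen for this Gaussian
    stable_count: int = 0   # consecutive frames with low residual


def split_stable(gaussians, residual_thresh=0.05, min_stable_frames=3):
    """Return (stable, unstable); only the unstable list would be optimized."""
    stable, unstable = [], []
    for g in gaussians:
        # Count consecutive low-residual frames; any bad frame resets the count.
        if g.recent_residual < residual_thresh:
            g.stable_count += 1
        else:
            g.stable_count = 0
        target = stable if g.stable_count >= min_stable_frames else unstable
        target.append(g)
    return stable, unstable


converged = Gaussian(recent_residual=0.01, stable_count=5)
fresh = Gaussian(recent_residual=0.20)
stable, unstable = split_stable([converged, fresh])
print(len(stable), len(unstable))
```

Freezing the stable set is what saves work: the optimizer and the re-render only touch the (usually much smaller) unstable list.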
5.9 · LoopSplat RGB-D
The hole every previous paper had. All of GS-SLAM, SplaTAM, MonoGS, Gaussian-SLAM, CG-SLAM, RTG-SLAM, and GS-ICP-SLAM were essentially open-loop. Drift accumulated and nothing snapped it back. Classical SLAM has owned loop closure for a quarter century; the neural era had been ignoring it.
之前所有 paper 共同的漏洞。GS-SLAM、SplaTAM、MonoGS、Gaussian-SLAM、CG-SLAM、RTG-SLAM、GS-ICP-SLAM 本质上全是 开环 的。drift 越累越多,没有任何机制能把它拉回来。loop closure 是经典 SLAM 二十五年的领地,neural 时代一直在视而不见。
The idea, framed in one question. If your map is a cloud of Gaussians, can you register two clouds the way classical SLAM registers point clouds? Yes — and because Gaussians carry orientation (covariance) and color, you can do it more accurately than dumping to points first. LoopSplat builds on Gaussian-SLAM's sub-map structure: when the system detects a loop, it picks two sub-maps that should overlap, runs a direct splat-to-splat registration to estimate their relative transform, and feeds that constraint into a standard pose-graph optimizer. The whole trajectory then re-snaps into global consistency.
把想法压成一个问题。如果你的地图就是一坨 Gaussian,那你能不能像经典 SLAM 配准点云一样去配准两坨 Gaussian?能——而且因为 Gaussian 自带朝向(协方差)和颜色,比退化成点云再配准更准。LoopSplat 站在 Gaussian-SLAM 的 sub-map 结构上:系统检测到 loop,就挑两个应该重叠的 sub-map,直接做 splat-to-splat 配准估出相对变换,再把这条约束塞进标准的 pose-graph optimizer。整条轨迹一下子被拽回到全局一致。
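How a single loop constraint pulls a drifting trajectory back can be shown with a toy 1D pose graph. This is a sketch only: real systems optimize SE(3) poses with weighted, robust solvers, and the function name, unit weights, and plain gradient descent here are illustrative assumptions.

```python
def optimize_pose_graph(odometry, loop_value, iters=2000, lr=0.05):
    """Least-squares 1D pose graph: odometry edges plus one loop edge.

    odometry[i] measures x[i+1] - x[i]; loop_value measures x[n] - x[0].
    x[0] is held fixed at 0 to remove the gauge freedom.
    """
    n = len(odometry)
    x = [0.0]
    for u in odometry:              # open-loop initial guess
        x.append(x[-1] + u)
    for _ in range(iters):
        grad = [0.0] * (n + 1)
        for i, u in enumerate(odometry):
            r = (x[i + 1] - x[i]) - u
            grad[i + 1] += 2 * r
            grad[i] -= 2 * r
        r_loop = (x[n] - x[0]) - loop_value
        grad[n] += 2 * r_loop
        grad[0] -= 2 * r_loop
        for i in range(1, n + 1):   # skip x[0], which stays fixed
            x[i] -= lr * grad[i]
    return x


# Each step is really 1.0 but odometry reports 1.1, so drift accumulates.
# The loop constraint (e.g. from sub-map registration) says the total is 4.0.
poses = optimize_pose_graph([1.1, 1.1, 1.1, 1.1], loop_value=4.0)
print([round(p, 3) for p in poses])
```

With equal weights the 0.4 of accumulated error is spread evenly over the five constraints, so the end pose moves from 4.4 to 4.08. Trusting the loop edge more (a larger weight) would pull it closer to 4.0.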
6 · The middle era (late 2024 → mid 2025): specialisation explodes 6 · 中期时代(2024 末 → 2025 中):细分方向爆炸
Once the recipe was stable, papers started peeling off in five directions, each one removing a baked-in assumption of the early era:
基本配方稳定之后,paper 开始从五个方向往外拆,每个方向都对应着撕掉早期时代的一条隐含假设:
- Static world? No — handle moving people and cars.
- Indoor only? No — go outdoor, drone-scale, city-scale.
- One modality? No — fuse LiDAR + IMU + cameras.
- Geometry/photometry only? No — add semantic labels.
- Monocular is too noisy? Hybridise with a strong classical or learned frontend.
- 世界是静的?不,处理走动的人和开过的车。
- 只能室内?不,走到室外、无人机尺度、城市级。
- 一种模态?不,融合 LiDAR + IMU + 相机。
- 只有几何和光度?不,加上语义标签。
- 单目噪声太大?那就和一个强力的经典或学到的前端做混血。
6.1 · Monocular & in-the-wild 6.1 · 单目与野外场景
Monocular 3DGS-SLAM in the early era (MonoGS) was a tour de force, but its tracking was local and its maps couldn't absorb global corrections. Four mid-era papers tackle each pain point in turn — by importing a known monocular frontend (DROID-SLAM, DSO), a learned depth prior, or both.
早期的单目 3DGS-SLAM(MonoGS)是个壮举,但 tracking 是局部的,map 也吸收不了全局修正。中期四篇 paper 各自对症下药——要么引入已有的单目前端(DROID-SLAM、DSO),要么引入学到的 depth 先验,或者两者都用。
MonoGS only optimized poses locally. Splat-SLAM plugs in a DROID-SLAM-style dense-flow frontend with global bundle adjustment, then makes the Gaussian backend actively deformable: when global BA corrects the pose of an old keyframe, the Gaussians it created shift coherently with it instead of getting orphaned. A monocular depth prior fills in low-confidence pixels. Together: the strongest pure-RGB 3DGS-SLAM of the mid-era.
MonoGS 的 pose 优化只在局部。Splat-SLAM 接了一个 DROID-SLAM 风格的稠密 flow 前端 + 全局 BA,然后让 Gaussian 后端 可形变:全局 BA 修正了某个老 keyframe 的 pose 时,那个 keyframe 当年生成的 Gaussian 一起跟着平移,不会被孤立在原地。再加一个单目 depth 先验补低置信像素。综合起来——中期最强的纯 RGB 3DGS-SLAM。
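The "Gaussians move with their keyframe" idea reduces, in its simplest rigid form, to applying $T_\text{new} T_\text{old}^{-1}$ to every Gaussian anchored to the corrected keyframe. The 2D sketch below assumes poses are `(x, y, theta)` tuples and only moves the means; the function name is hypothetical, and Splat-SLAM's actual deformation may involve more than a rigid transform.

```python
import math


def reanchor_point(p, old_pose, new_pose):
    """Move world point p with its keyframe: p' = T_new * T_old^-1 * p (2D)."""
    ox, oy, oth = old_pose
    nx, ny, nth = new_pose
    # Express p in the old keyframe's local frame: R_old^T (p - t_old).
    dx, dy = p[0] - ox, p[1] - oy
    lx = math.cos(oth) * dx + math.sin(oth) * dy
    ly = -math.sin(oth) * dx + math.cos(oth) * dy
    # Map the local point back to the world with the corrected pose.
    return (nx + math.cos(nth) * lx - math.sin(nth) * ly,
            ny + math.sin(nth) * lx + math.cos(nth) * ly)


# Global BA shifts the keyframe by 1 m and rotates it by 90 degrees.
moved = reanchor_point((1.0, 0.0),
                       old_pose=(0.0, 0.0, 0.0),
                       new_pose=(1.0, 0.0, math.pi / 2))
print(moved)
```

A full implementation would also rotate each covariance ($\Sigma' = R\,\Sigma\,R^\top$ with $R = R_\text{new} R_\text{old}^\top$) so the ellipsoids keep their orientation relative to the keyframe.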
Mono 3DGS-SLAM was traditionally visually OK but geometrically rough. HI-SLAM2 imports learned monocular depth + normal priors to initialise scene geometry, decouples mapping from tracking for speed, then re-couples them on loop closure via pose-graph BA plus explicit Gaussian deformation. Claims to surpass RGB-D baselines on reconstruction quality from a single camera — a watershed for the monocular line.
单目 3DGS-SLAM 历来 视觉上 还行但几何粗糙。HI-SLAM2 引入学到的 单目 depth + normal 先验 来初始化几何,mapping 和 tracking 解耦加速,loop closure 时再通过 pose-graph BA + 显式 Gaussian 形变重新耦合。报告说单目重建质量 超过 RGB-D baseline——单目这条线的分水岭。
DROID-Splat. Same idea as Photo-SLAM, but for monocular input: keep DROID-SLAM as a tested learned tracker (with loop detection and dense BA built in) and put a 3DGS renderer on top. It works on monocular RGB, and on pseudo-RGB-D when a depth foundation model (FM) is available. It supports unknown intrinsics and runs on consumer GPUs.
DROID-Splat。和 Photo-SLAM 的思路一样,但换成了单目:DROID-SLAM 当久经考验的学到 tracker(内置 loop detection 和稠密 BA),3DGS 当渲染层挂上去。单目 RGB 能跑,给一个 depth FM 还能跑伪 RGB-D。内参未知也能用。消费级显卡跑得动。
MGSO. Pairs a classical photometric direct SLAM (DSO-style) with a parallel 3DGS mapper. The DSO point cloud is the seed, so Gaussians are born at reasonable locations instead of random noise. Adds opacity-aware densification (clone high-variance Gaussians, prune "floaters"). Result: cleaner monocular maps with less memory.
MGSO。把经典直接法 SLAM(DSO 路线)和一个并行的 3DGS mapper 配在一起。DSO 的点云当 seed,所以 Gaussian 一出生就站在合理位置,不是随机噪声。加上不透明度感知的致密化(克隆方差大的 Gaussian、剪掉"漂浮物")。结果:单目地图更干净,显存更省。
WildGS-SLAM. A small MLP on top of DINOv2 features predicts a per-pixel uncertainty map; pixels likely showing moving content (people, cars, animals) get downweighted in both the tracking loss and the mapping loss. On the new Wild-SLAM MoCap dataset, ATE-RMSE is 0.46 cm vs. MonoGS's 47.99 cm and Splat-SLAM's 8.71 cm: a $\approx 19\times$ reduction relative to Splat-SLAM and $\approx 100\times$ relative to MonoGS, which effectively retires the static-scene monocular benchmark.
WildGS-SLAM。在 DINOv2 特征上挂一个小 MLP,预测每像素的不确定性图;可能属于移动内容(人、车、动物)的像素,在 tracking loss 和 mapping loss 上都被 downweight。在新提出的 Wild-SLAM MoCap 数据集上 ATE-RMSE 做到 0.46 cm(MonoGS 是 47.99 cm,Splat-SLAM 是 8.71 cm),相对 Splat-SLAM 降低 $\approx 19\times$,相对 MonoGS 降低 $\approx 100\times$,基本宣告"静态场景单目 benchmark"退休。
6.2 · Semantic GS-SLAM labels-per-Gaussian 6.2 · 语义 GS-SLAM 每个 Gaussian 自带标签
Pure geometry and color are useless to a robot that wants to ask "where is the chair?" The semantic trio of early 2024 attached object labels to each Gaussian and proved you could do it without blowing up memory.
对一个想问"椅子在哪"的机器人来说,纯几何 + 颜色没用。2024 年初的语义三连给每个 Gaussian 挂上了对象标签,证明这件事可以不爆显存地做出来。
SGS-SLAM. Multi-channel optimization (appearance + geometry + semantics, all rendered through the same rasterizer) plus a semantic-guided keyframe selection that throws out keyframes whose label predictions are noisy. The crispness of Gaussian boundaries means object segmentation is much sharper than in the previous semantic-NeRF-SLAM work.
SGS-SLAM。多通道联合优化(外观 + 几何 + 语义全部通过同一个 rasterizer 渲出来),再加一个语义引导的关键帧选择,把标签预测不稳的关键帧扔掉。Gaussian 边界本来就锐利,所以物体分割比之前的 semantic-NeRF-SLAM 清晰得多。
Attaching a full CLIP-sized feature to every Gaussian doesn't scale. NEDS-SLAM trains a tiny encoder-decoder to compress the per-Gaussian semantic embedding, adds a Spatially Consistent Feature Fusion module to denoise the 2D segmentation backbone's outputs, and prunes outlier Gaussians from synthetic viewpoints. The blueprint for every subsequent semantic-GS work.
给每个 Gaussian 挂一个完整 CLIP 大小的特征不 scale。NEDS-SLAM 训一个轻量 encoder-decoder 把每 Gaussian 的语义 embedding 压扁,再加一个 Spatially Consistent Feature Fusion 模块给 2D 分割 backbone 的输出去噪,最后从合成视角里把离群 Gaussian 剪掉。后来所有 semantic-GS 工作的蓝本。
SemGauss-SLAM. Goes a step further than SGS-SLAM by using the semantic channel as a tracking signal, not just an output. Multi-frame semantic correspondences enter BA as additional residuals, cutting drift over long sequences in a way photometric BA alone cannot.
SemGauss-SLAM。比 SGS-SLAM 再进一步:语义通道不只是输出,还是 tracking 信号。多帧语义对应作为额外残差进入 BA,长序列下能压低 drift,纯 photometric BA 做不到这一点。
6.3 · Dynamic scenes — the static-world assumption finally cracks moving objects 6.3 · 动态场景——静态世界假设终于崩裂 moving objects
Until DG-SLAM, every 3DGS-SLAM baked a walking person into the map. Mid-era work attacks this in three flavours: mask-and-ignore, model-and-track, and uncertainty-downweight.
在 DG-SLAM 之前,每一个 3DGS-SLAM 都 把走过去的人焊进了地图。中期对这件事的攻法分三派:mask 掉、建模并跟踪、不确定性降权。
DG-SLAM. Combines spatio-temporally consistent depth masks with semantic priors to detect moving objects. Hybrid pose optimization (point-to-point + point-to-plane) keeps the camera locked to the static parts of the scene while the dynamic Gaussians are exiled to a separate stream. The reconstructed map shows only the static environment, which is usually what you want for navigation.
DG-SLAM。用时空一致的 depth mask 配语义先验检测移动物体。混合 pose 优化(point-to-point + point-to-plane)让相机锁在场景静态部分,动态 Gaussian 被流放到另一条流里。重建出来的地图只剩静态环境,大多数导航场景就是要这个。
6.4 · LiDAR / IMU / multi-modal fusion L+I+C 6.4 · LiDAR / IMU / 多模态融合 L+I+C
Indoor RGB-D 3DGS-SLAM falls apart outdoors: depth sensors don't see past 5 m, motion is fast, illumination changes. The fix is the classical robotics fix — add a LiDAR for geometry and an IMU for fast motion, and pay the engineering price of sensor calibration.
室内 RGB-D 的 3DGS-SLAM 到了室外就崩:depth sensor 看不远(超过 5 米就废),运动剧烈,光照变化。解药是机器人界的老解药——拿 LiDAR 解决几何、拿 IMU 解决快速运动,代价是 sensor 标定的工程成本。
MM3DGS-SLAM. RGB-D + IMU. A pre-integrated inertial term enters the joint loss alongside the photometric and depth terms, so the camera can be tracked through motion blur, texture-poor frames, and fast rotations, which are exactly the failure modes that wreck pure-photometric 3DGS-SLAM. Reports a $\approx 3\times$ tracking improvement over earlier 3DGS-SLAM SOTA and releases the UT-MM dataset.
MM3DGS-SLAM。RGB-D + IMU。预积分的惯性项和 photometric、depth 项一起进入联合 loss,所以相机能穿过运动模糊、弱纹理帧、快速旋转,这些恰好是纯 photometric 3DGS-SLAM 翻车的地方。报告 tracking 比早期 SOTA 提升 $\approx 3\times$,附带放出 UT-MM 数据集。
Gaussian-LIC. Tightly coupled LiDAR-inertial-camera (LIC) fusion handles pose tracking (with a continuous-time trajectory in v2); colorized LiDAR points seed the Gaussian map. The result is photo-realistic 3DGS mapping at robot-outdoor scale.
Gaussian-LIC。紧耦合的 LIC(LiDAR + IMU + 相机)做 pose tracking(v2 用连续时间轨迹建模),带颜色的 LiDAR 点 seed Gaussian 地图。把照片级 3DGS 建图推到 机器人户外尺度。
LIV-GaussMap. Two contributions: (a) a LiDAR-inertial frontend with size-adaptive voxels initialises both poses and surface Gaussians; (b) the Gaussians are 2D surface disks, not volumetric ellipsoids, which fits LiDAR returns much better. Supports both repetitive and solid-state (non-repetitive) LiDARs. Foreshadows the 2DGS-SLAM line (§7.3).
LIV-GaussMap。两个贡献:(a) LiDAR-惯性前端 + 尺寸自适应体素,同时初始化 pose 和表面 Gaussian;(b) Gaussian 是 2D 表面圆盘,不是体积椭球,更贴合 LiDAR 的实际返回。机械式和固态 LiDAR 都支持。是 2DGS-SLAM(§7.3)的前奏。
Outdoor LiDAR scans are sparse — naive Gaussian seeding leaves holes and floaters. GS-LIVM runs Gaussian Process Regression to fill in the inter-beam gaps before seeding, and uses a voxel-based map to bound memory at large scale. Close follow-up: GS-LIVO (HKUST-Aerial-Robotics).
室外 LiDAR 扫描很稀疏——朴素地 seed Gaussian 会出空洞和漂浮物。GS-LIVM 先跑一遍 Gaussian Process Regression 把激光线束之间的缝补起来,再 seed Gaussian;地图用体素结构限制大场景下的显存。近邻工作:GS-LIVO(港科大空中机器人组)。
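Filling inter-beam gaps with Gaussian Process Regression can be illustrated on a single scan column. This is a minimal sketch with an RBF kernel and a constant mean, assuming beam angles as the 1D input; the kernel, hyperparameters, and function names are illustrative choices, not GS-LIVM's actual formulation.

```python
import math


def rbf(a, b, length):
    """Squared-exponential kernel between two scalar inputs."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def gp_fill(angles, depths, query, length=1.0, noise=1e-4):
    """GP posterior mean of depth at `query`, given sparse beam returns."""
    mean = sum(depths) / len(depths)
    K = [[rbf(a, b, length) + (noise if i == j else 0.0)
          for j, b in enumerate(angles)] for i, a in enumerate(angles)]
    alpha = solve(K, [d - mean for d in depths])
    return mean + sum(rbf(query, a, length) * w for a, w in zip(angles, alpha))


angles = [0.0, 1.0, 2.0, 3.0]  # hypothetical beam elevations (degrees)
depths = [2.0, 2.1, 2.3, 2.2]  # measured ranges (metres)
print(gp_fill(angles, depths, 1.5))  # interpolated depth between two beams
```

The interpolated samples can then seed extra Gaussians between beams. A GP also yields a predictive variance, which a real system could use to skip gaps it cannot fill confidently.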
6.5 · Large-scale, outdoor & non-pinhole city-scale 6.5 · 大场景、室外、非针孔 城市级
Whole-room reconstruction was solved by mid-2024. The next ceiling: city blocks, kilometers, multi-agent fleets, and 360°-camera rigs. Most of these works inherit VastGaussian's "progressive partitioning + per-cell parallel optimization + airspace-aware visibility" recipe and graft it onto a SLAM loop.
整间房的重建到 2024 年中已经解决了。下一道天花板:城市街区、公里级、多机协同、360° 相机阵列。这一组工作大多继承 VastGaussian 的"渐进分区 + 每块并行优化 + 空间可见性感知"配方,再嫁接到 SLAM 闭环上。
7 · The modern frontier (late 2025 → May 2026) 7 · 现代前沿(2025 末 → 2026.05)
If the early era was about "can we even do it?" and the middle era was about "can we do it under harder conditions?", the modern era is about collapsing the optimization budget. The single biggest shift since late 2025: foundation models trained on internet-scale multi-view data (DUSt3R, MASt3R, VGGT) can produce per-pixel depth, camera intrinsics, and pose in one forward pass. That changes what 3DGS-SLAM has to do — increasingly, it does not have to find geometry from scratch, just refine and densify what the foundation model already proposed.
如果说早期时代在问"我们到底能不能做"、中期时代在问"换更难的条件还能不能做",那现代时代的关键词是 把优化预算压扁。2025 末以来最大的一次转向:在互联网级多视图数据上训练的 foundation model(DUSt3R、MASt3R、VGGT)能在 一次前向 里给出每像素 depth、相机内参和 pose。这件事改写了 3DGS-SLAM 要做的工作——越来越不需要 从零找出 几何,只要 精修和致密化 foundation model 已经提议出来的东西。
7.1 · Foundation-model priors meet 3DGS-SLAM 7.1 · 基础模型先验遇上 3DGS-SLAM
This is the hottest direction right now. Almost every CVPR/ICLR 2026 entry I found falls here.
这是现在最热的方向。CVPR/ICLR 2026 里我能查到的 GS-SLAM 工作几乎全在这一格。
7.2 · Feed-forward GS-SLAM (no per-scene optimization) 7.2 · 前馈式 GS-SLAM(不再做 per-scene 优化)
A closely related, partly overlapping cluster: papers that eliminate the per-scene Adam loop entirely. Pose and Gaussians come out of a single neural network forward pass.
和上一节高度重叠的一簇:彻底干掉 per-scene 的 Adam 循环。pose 和 Gaussian 全部从一次神经网络前向里掉出来。
7.3 · 2D Gaussian / surfel SLAM 7.3 · 2D 高斯 / surfel SLAM
Idea: 3D Gaussians are great for volume, but the world's surface is a 2-manifold. Replacing 3D ellipsoids with 2D Gaussian surfels — disk-shaped Gaussians embedded in tangent planes — gives sharper, more consistent surfaces at the cost of some appearance flexibility.
想法:3D Gaussian 擅长描述 体积,但世界的表面其实是 2 维流形。把 3D 椭球换成 2D Gaussian surfel——嵌在切平面里的圆盘形 Gaussian——曲面更锐利、更一致,代价是丢掉一点外观自由度。
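One way to see what "embedded in a tangent plane" means: a 2D surfel with tangent vectors $t_u, t_v$ and scales $s_u, s_v$ corresponds to the rank-2 covariance $\Sigma = s_u^2\, t_u t_u^\top + s_v^2\, t_v t_v^\top$, which has zero extent along the normal. The sketch below checks this numerically; the helper names and example values are assumptions for illustration.

```python
def outer(u, v):
    """Outer product of two 3-vectors as a 3x3 nested list."""
    return [[a * b for b in v] for a in u]


def cross(u, v):
    """Cross product of two 3-vectors."""
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]


def matvec(M, v):
    """3x3 matrix times 3-vector."""
    return [sum(M[i][j] * v[j] for j in range(3)) for i in range(3)]


def surfel_covariance(t_u, t_v, s_u, s_v):
    """Rank-2 covariance of a disk spanned by orthonormal tangents t_u, t_v."""
    A, B = outer(t_u, t_u), outer(t_v, t_v)
    return [[s_u ** 2 * A[i][j] + s_v ** 2 * B[i][j] for j in range(3)]
            for i in range(3)]


t_u, t_v = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]
cov = surfel_covariance(t_u, t_v, s_u=0.3, s_v=0.1)
normal = cross(t_u, t_v)

print(cov)
print(matvec(cov, normal))  # zero vector: no thickness along the normal
```

A 3D Gaussian would add a third term along the normal. Dropping it is what makes the surface sharp, and it also gives every primitive a well-defined normal for free.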
7.4 · Event-camera & motion-blur 7.4 · 事件相机与运动模糊
Event cameras don't return frames: each pixel independently reports "this pixel just got brighter" or "this pixel just got darker" events at microsecond resolution. They eat motion blur for breakfast, and motion blur is exactly the failure mode of every conventional frame-based GS-SLAM.
事件相机不返回帧:它每像素返回"这个像素刚才变亮了"或"变暗了"的事件,微秒分辨率。它把运动模糊当早餐吃,而运动模糊恰好是常规 GS-SLAM 的死穴。
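The standard idealised event model is easy to write down: a pixel fires when its log intensity has changed by more than a contrast threshold since its last event. The sketch below simulates one pixel; the function name, threshold value, and the simplification of emitting a single event per sample are assumptions for illustration.

```python
import math


def generate_events(intensities, threshold=0.2):
    """Simulate one pixel: return (sample_index, polarity) events.

    An event fires when |log I - log I_ref| >= threshold, with polarity +1
    for brighter and -1 for darker. Simplification: at most one event per
    sample, and the reference resets to the current value after firing.
    """
    events = []
    ref = math.log(intensities[0])
    for i, value in enumerate(intensities[1:], start=1):
        diff = math.log(value) - ref
        if abs(diff) >= threshold:
            events.append((i, 1 if diff > 0 else -1))
            ref = math.log(value)
    return events


# Brightness creeps up, jumps, holds, then drops.
events = generate_events([1.0, 1.1, 1.5, 1.5, 0.9])
print(events)
```

Nothing is reported while the pixel is static, which is why the data stream stays sparse and why there is no exposure window over which fast motion could smear.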
7.5 · Language-embedded GS-SLAM 7.5 · 语言嵌入的 GS-SLAM
Take the Gaussian map and stick a CLIP/SAM-derived embedding on each blob, distilled from a vision-language foundation model. Suddenly the map is open-vocabulary: you can ask "show me the blue mug" and the system can render exactly the Gaussians whose features cosine-match the query.
在 Gaussian 地图上,给每个 blob 挂一个从 vision-language foundation model 蒸馏出来的 CLIP/SAM 风格 embedding。地图突然变成 open-vocabulary 的:你可以问"给我看那个蓝色马克杯",系统就能渲染那些特征跟你查询余弦匹配的 Gaussian。
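The query step is just a cosine-similarity filter over per-Gaussian features. In the sketch below the three-dimensional features, the query vector, and the 0.8 threshold are made-up placeholders; a real system would use embeddings from a vision-language model and its text encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def query_gaussians(features, text_embedding, threshold=0.8):
    """Indices of Gaussians whose feature matches the query embedding."""
    return [i for i, f in enumerate(features)
            if cosine(f, text_embedding) >= threshold]


# Toy per-Gaussian features (placeholders, not real CLIP vectors).
features = [
    [0.9, 0.1, 0.0],  # close to the query direction
    [0.0, 1.0, 0.0],  # unrelated
    [0.8, 0.2, 0.1],  # close to the query direction
]
blue_mug = [1.0, 0.0, 0.0]  # placeholder embedding for "blue mug"

selected = query_gaussians(features, blue_mug)
print(selected)
```

Rendering only the selected indices gives the open-vocabulary highlight described above. Storing compressed features per Gaussian keeps this lookup cheap.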
7.6 · Dynamic-scene 3DGS-SLAM 7.6 · 动态场景 3DGS-SLAM
Static-world assumption: finally dying. By 2026 the field had moved from "remove dynamic regions before mapping" to "track the dynamics and render them too."
静态世界假设——终于死了。到 2026 年,整个领域从"建图前先把动态区域剔掉"进化到了"动态部分也一起跟踪、一起渲染"。
7.7 · On-device & hardware specialization 7.7 · 端上推理与硬件专用化
The signal that a subfield has matured: ASIC papers start to appear. As of HPCA 2026, they have.
一个子领域成熟的信号——ASIC paper 开始出现。HPCA 2026 上,它们出现了。
8 · Open problems & where the field is going 8 · 开放问题与未来方向
Five things you can still build a Ph.D. around in 2026:
2026 年还能撑起一个博士论文的五个方向:
- End-to-end uncertainty. Most systems propagate uncertainty heuristically, if at all. A principled Bayesian or variational treatment of pose and map uncertainty — VBGS-SLAM is an early attempt — is still open.
- Long-horizon consistency at city scale. Sub-maps + loop closure handle rooms and apartments. Whole-city sequences with hundreds of thousands of frames are still rough.
- True deformable / human-centric SLAM. Dynamic-object handling exists, but most systems still bake in "rigid scene + rigid objects." Cloth, liquid, animals — open.
- Foundation-model-aware tracking. The current pattern is "FM gives a prior, optimizer cleans it up." A tighter coupling — where the FM is itself updated online, or where the SLAM error signal trains a per-deployment adapter — is the next obvious step.
- Embodied use. The killer application is not "render a pretty room"; it's "let a robot plan, navigate, and manipulate using this map." Splat-aware planning, splat-aware collision, splat-aware grasping — all still in their first generation.
- 端到端的不确定性。大多数系统的不确定性传播要么是启发式的,要么压根没有。pose 和 map 不确定性的原理性 Bayesian / 变分处理——VBGS-SLAM 是早期尝试——仍是开放问题。
- 城市级的长时一致性。sub-map + loop closure 已经能搞定房间和公寓。但几十万帧的整城序列还很粗糙。
- 真·可形变 / 以人为中心的 SLAM。动态物体的处理已经有了,但大部分系统仍然默认"刚体场景 + 刚体物体"。布料、液体、动物——开放。
- 对 foundation model 敏感的 tracking。当前套路是"FM 给先验,optimizer 清理"。更紧的耦合——FM 本身 在线更新,或者用 SLAM 的误差信号训练一个每部署一份的 adapter——是下一步明显的方向。
- 具身用途。杀手级应用不是"渲染一间漂亮房间",而是"让机器人用这张地图去规划、导航、操作"。Splat 感知的规划、splat 感知的碰撞、splat 感知的抓取——全都还在第一代。
Glossary 术语表
| Term | Definition | 释义 |
| --- | --- | --- |
| ATE-RMSE | Absolute Trajectory Error, root-mean-square. Standard localization metric. | 绝对轨迹误差的均方根。定位的标准指标。 |
| BA (Bundle Adjustment) | Joint nonlinear least-squares over poses and map elements. The mapping workhorse. | 在 pose 和地图元素上联合做非线性 least-squares。mapping 的主力工具。 |
| EWA splatting | The "elliptical weighted average" approximation for projecting 3D Gaussians to 2D image-space ellipses. | "椭圆加权平均"近似,把 3D Gaussian 投影成图像空间的 2D 椭圆。 |
| ICP (Iterative Closest Point) | Classical algorithm aligning two point clouds by alternating closest-point assignment and rigid-transform fitting. G-ICP generalizes to ellipsoidal points. | 经典点云配准算法,交替做最近邻匹配和刚体变换拟合。G-ICP 把它推广到椭球点。 |
| Keyframe | A selected frame at which the mapping thread updates the map. Tracking runs at every frame; mapping runs at keyframes only. | 被挑出来用于更新地图的那些帧。tracking 每帧都跑,mapping 只在 keyframe 上跑。 |
| Loop closure | A constraint added when the system recognizes it has revisited a place. Corrects the drift accumulated over the spanned segment. | 系统认出"我又来过这里了"时加上的一条约束。能抹掉这段时间累积的 drift。 |
| PSNR / SSIM / LPIPS | Image-quality metrics. PSNR is MSE on a logarithmic (decibel) scale, where higher is better; SSIM measures structural similarity; LPIPS uses a learned perceptual distance. | 图像质量指标。PSNR 是 log 形式的 MSE;SSIM 衡量结构相似度;LPIPS 用学到的感知距离。 |
| SDF / TSDF | Signed Distance Field / Truncated SDF. A volumetric scalar field whose zero-level-set is the surface. | 有符号距离场 / 截断 SDF。体积标量场,零等值面就是表面。 |
| SE(3) | The Lie group of rigid 3D transformations. Camera poses live here. Tangent-space optimization avoids the parameterization headaches of Euler angles. | 3D 刚体变换的 Lie 群。相机 pose 就活在这里面。在切空间里优化能绕开欧拉角各种参数化坑。 |
| Sub-map | A locally consistent piece of the global map. Active sub-map gets optimized; others sit dormant until loop closure pulls them in. | 全局地图的一片局部一致的子集。活跃 sub-map 参与优化,其它沉睡,直到 loop closure 把它们拉进来。 |
| Surfel | "Surface element": a small oriented disk. 2D Gaussian surfels are an anisotropic, Gaussian-weighted version. | "surface element":一个有朝向的小圆盘。2D Gaussian surfel 是 Gaussian 加权的各向异性版本。 |
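Two of the metrics above are short enough to write out. The sketch assumes the estimated trajectory has already been aligned to the ground truth (benchmarks usually do a rigid or similarity alignment first) and that image values lie in `[0, max_value]`; the function names are illustrative.

```python
import math


def ate_rmse(estimated, ground_truth):
    """Root-mean-square of per-pose position errors (trajectories pre-aligned)."""
    squared = [sum((e - g) ** 2 for e, g in zip(p, q))
               for p, q in zip(estimated, ground_truth)]
    return math.sqrt(sum(squared) / len(squared))


def psnr(mse, max_value=1.0):
    """Peak signal-to-noise ratio in decibels for a given mean squared error."""
    return 10.0 * math.log10(max_value ** 2 / mse)


estimated = [(0.0, 0.0, 0.1), (1.0, 0.0, -0.1)]
ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]

print(ate_rmse(estimated, ground_truth))  # same units as the positions
print(psnr(0.01))                         # decibels; higher is better
```

Because PSNR is logarithmic, every tenfold drop in MSE adds 10 dB, which is why gains of one or two dB in rendering tables are larger than they look.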
References & further reading 参考文献与延伸阅读
Live links for every paper mentioned above. Many of these have project pages with video; click through. 上文提到的每一篇 paper 都有可点的链接。很多还有 project page 配视频,建议点进去看。
Foundations 基础
- Kerbl et al. 3D Gaussian Splatting for Real-Time Radiance Field Rendering. SIGGRAPH 2023.
- Mildenhall et al. NeRF: Representing Scenes as Neural Radiance Fields. ECCV 2020.
- Zwicker et al. EWA Splatting. TVCG 2002.
- Mur-Artal et al. ORB-SLAM. T-RO 2015.
- Klein & Murray. PTAM. ISMAR 2007.
NeRF-SLAM precursors NeRF-SLAM 前驱
- Sucar et al. iMAP. ICCV 2021.
- Zhu et al. NICE-SLAM. CVPR 2022.
- Johari et al. ESLAM. CVPR 2023.
- Wang et al. Co-SLAM. CVPR 2023.
- Sandström et al. Point-SLAM. ICCV 2023.
Early era 3DGS-SLAM 早期 3DGS-SLAM
- Yan et al. GS-SLAM. CVPR 2024 Highlight.
- Huang et al. Photo-SLAM. CVPR 2024.
- Keetha et al. SplaTAM. CVPR 2024.
- Yugay et al. Gaussian-SLAM. arXiv 2023.
- Matsuki et al. MonoGS / Gaussian Splatting SLAM. CVPR 2024 Highlight.
- Hu et al. CG-SLAM. ECCV 2024.
- Ha et al. GS-ICP-SLAM. ECCV 2024.
- Peng et al. RTG-SLAM. SIGGRAPH 2024.
- Zhu et al. LoopSplat. 3DV 2025 Oral.
Mid era (mid 2024 – mid 2025) 中期(2024 中 – 2025 中)
- Sandström et al. Splat-SLAM. NeurIPS 2024.
- Zhang et al. HI-SLAM2. 2024.
- Hoy et al. DROID-Splat. ICCVW 2025.
- MGSO. ICRA 2025.
- Zheng et al. WildGS-SLAM. CVPR 2025.
- Li et al. SGS-SLAM. ECCV 2024.
- NEDS-SLAM. IEEE RA-L 2024.
- SemGauss-SLAM. IROS 2025.
- DG-SLAM. NeurIPS 2024.
- MM3DGS-SLAM. IROS 2024.
- Gaussian-LIC. ICRA 2025.
- LIV-GaussMap. IEEE RA-L 2024.
- GS-LIVM. ICCV 2025.
- GigaSLAM. 2025.
- LSG-SLAM. 2025.
- VIGS-SLAM. 2025.
- GRAND-SLAM. 2025.
- OmniGS. WACV 2025.
- GaussNav. 2024.
Modern era (2025–2026) 现代(2025–2026)
- Wang et al. Survey: Towards Next-Generation SLAM. Feb 2026.
- Nguyen Xuan et al. Survey on Collaborative SLAM with 3DGS. Oct 2025.
- Maggio & Carlone. VGGT-SLAM 2.0. Jan 2026.
- Ren et al. M$^3$: Dense Matching Meets Multi-View FMs for Mono GS-SLAM. Mar 2026.
- Zhang et al. Flash-Mono. ICLR 2026.
- Wang et al. DepthGS. IROS 2025.
- Zhong et al. 2DGS-SLAM. 2025.
- S$^3$LAM. arXiv 2507.20854.
- Chen et al. EGS-SLAM. IEEE RA-L 2025.
- Lee et al. LEGO-SLAM. Nov 2025.
- Zhang et al. DAGS-SLAM. Feb 2026.
- Li et al. 4D Gaussian Splatting SLAM. ICCV 2025.
- Huang et al. Splatonic. HPCA 2026.
Curated lists 社区维护的清单
- Awesome-3DGS-SLAM (KwanWaiPang)
- awesome-NeRF-and-3DGS-SLAM
Built as a single-page bilingual pedagogical survey. Light theme, math by KaTeX, syntax by highlight.js, demos hand-rolled in vanilla JS. View source if you want to see how the splatting and the drift simulations work. 单页双语教学综述。浅色主题,公式用 KaTeX 渲染,代码高亮用 highlight.js,三个 demo 全是手写的 vanilla JS。想看 splatting 和 drift 仿真怎么实现的就直接看源码。