§ 0 · The setup
2D diffusion is huge. 3D data is scarce.
By late 2023, 2D image generation was a solved problem at consumer scale: Stable Diffusion, DALL·E 3, Midjourney. These models had been trained on billions of image–text pairs. The same was not true of 3D: Objaverse, the biggest open 3D dataset, had ~800 K shapes. That is roughly three orders of magnitude less data, and the shapes were noisy, badly textured, and inconsistent in scale.
So the 3D generation problem became: how do you get 3D out without training on 3D? Two answers emerged, both adopting 3DGS as the output representation:
Optimization · SDS-based
Use a frozen 2D diffusion model as a "critic." Render the candidate 3DGS scene from a random camera; ask the diffusion model "is this a good rendering of the prompt?"; backpropagate its gradient through the differentiable rasterizer to update the Gaussians. Per-scene optimization, slow (~5–30 minutes), but very high quality.
DreamGaussian, GaussianDreamer.
Feed-forward · LRM-style
Train a transformer once on rendered views of all 800 K Objaverse shapes. At inference, feed it a few views and read out the Gaussians of the depicted object directly. Per-scene time: seconds. Quality bounded by the training data; great on common categories, hallucinations on out-of-distribution prompts.
LGM, GRM, Trellis (with a rectified-flow twist).
§ 1 · SDS, briefly
Score Distillation Sampling
A 2D diffusion model \(\epsilon_\phi(\mathbf{x}_t, t, y)\) is a denoiser: given a noisy image \(\mathbf{x}_t\), the noise level \(t\), and a text prompt \(y\), it predicts the noise that was added. If we add known noise \(\epsilon\) to an image ourselves, the gap between the model's prediction and \(\epsilon\) points away from images the model finds likely for the prompt, so it works as a signal of "how off the prompt this image is." DreamFusion (Poole et al., 2022) noticed that you can use this signal to optimize anything that produces an image, including a 3D scene rendered from a random camera.
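\[
\nabla_\theta \mathcal{L}_{\text{SDS}} = \mathbb{E}_{t,\epsilon,c}\left( w(t)\,\big[\epsilon_\phi(\mathbf{x}_t, t, y) - \epsilon\big]\,\frac{\partial \mathbf{x}}{\partial \theta} \right)
\]
where \(\mathbf{x} = \text{render}(\theta; c)\) is the render of your scene \(\theta\) at random camera \(c\), \(\mathbf{x}_t = \alpha_t \mathbf{x} + \sigma_t \epsilon\) is its noised version with \(\epsilon \sim \mathcal{N}(0, I)\), and \(w(t)\) is a weighting over noise levels. The bracketed term is the "score difference": it encodes how the diffusion model thinks the image should change to match the prompt. Multiply by \(\partial \mathbf{x}/\partial \theta\), the differentiable renderer's Jacobian, and you get a gradient on the 3D scene.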
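A minimal sketch of the SDS update, under loudly stated assumptions: the "renderer" is the identity (so the Jacobian is the identity matrix), the "diffusion model" is the closed-form noise predictor for a data distribution that contains a single target image, and the cosine schedule and the weighting \(w(t) = \sigma_t^2\) are illustrative choices, not DreamFusion's. All function names are hypothetical.

```python
import math
import random


def toy_eps_pred(x_t, alpha, sigma, target):
    """Closed-form noise prediction when the data distribution is one target image.

    If every clean image equals `target`, then x_t = alpha * target + sigma * eps,
    so the noise that was added must be (x_t - alpha * target) / sigma.
    """
    return [(xt - alpha * tg) / sigma for xt, tg in zip(x_t, target)]


def sds_gradient(theta, target, rng):
    """One Monte Carlo sample of the SDS gradient for an identity renderer."""
    x = list(theta)                          # x = render(theta; c); here just a copy
    t = rng.uniform(0.02, 0.98)              # random noise level
    alpha = math.cos(0.5 * math.pi * t)      # assumed cosine schedule
    sigma = math.sin(0.5 * math.pi * t)
    eps = [rng.gauss(0.0, 1.0) for _ in x]   # the noise we inject ourselves
    x_t = [alpha * xi + sigma * ei for xi, ei in zip(x, eps)]
    eps_hat = toy_eps_pred(x_t, alpha, sigma, target)
    w = sigma ** 2                           # assumed weighting w(t)
    # [eps_hat - eps] times dx/dtheta; the Jacobian is the identity here.
    return [w * (eh - ei) for eh, ei in zip(eps_hat, eps)]


def optimize(theta, target, steps=500, lr=0.5, seed=0):
    """Plain gradient descent on theta using sampled SDS gradients."""
    rng = random.Random(seed)
    theta = list(theta)
    for _ in range(steps):
        grad = sds_gradient(theta, target, rng)
        theta = [p - lr * g for p, g in zip(theta, grad)]
    return theta


target = [0.2, 0.8, 0.5]
result = optimize([0.0, 0.0, 0.0], target)
max_error = max(abs(r - tg) for r, tg in zip(result, target))
print(max_error)
```

In this toy the bracketed difference simplifies to \(\alpha_t(\mathbf{x} - \text{target})/\sigma_t\), so the injected noise cancels and each step is a rescaled L2 gradient. With a real diffusion model the noise does not cancel, which is one reason real SDS needs many noisy iterations.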
Interactive · SDS gradient loop
A toy 2D "scene" of Gaussians, optimized toward a target image via a synthetic "denoiser" (we cheat and use the L2 gradient as a stand-in). Click "step" to run one SDS-ish update. Watch the Gaussians migrate toward the target: that's what's happening per-view in 3D, times ~1000 iterations.
§ 2 · DreamGaussian
The first 3DGS generator
DreamGaussian was the first paper to drop SDS onto a 3DGS scene. The architecture is conceptually simple — initialize a thousand random Gaussians, run SDS, watch them organize into something that matches the prompt — but two key details made it actually work:
- Densification, but driven by SDS gradients. Same clone-and-split trick as the original 3DGS, but the gradient comes from the diffusion model rather than a photometric loss. A region with high diffusion-gradient magnitude is a region the prompt says should have detail — so add Gaussians there.
- Mesh refinement, second stage. SDS on a Gaussian field converges to a plausible-but-blurry result. DreamGaussian then exports a textured mesh from the Gaussians (Marching Cubes on the opacity field), and fine-tunes the mesh's texture map with a second round of SDS at higher resolution. The Gaussian stage gives geometry; the mesh stage gives crisp texture.
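The densification rule in the first bullet can be sketched as follows. This is a simplified 2D illustration, not DreamGaussian's code: the `Gaussian2D` class, the thresholds, and the choice to leave clones at the parent's position are all assumptions made for the example. The shrink factor of 1.6 for splits follows the original 3DGS paper.

```python
import random
from dataclasses import dataclass


@dataclass
class Gaussian2D:
    x: float
    y: float
    scale: float
    grad_accum: float = 0.0   # gradient magnitude accumulated since the last densify


def densify(gaussians, grad_threshold, scale_threshold, rng, shrink=1.6):
    """Clone small high-gradient Gaussians, split large high-gradient ones."""
    out = []
    for g in gaussians:
        if g.grad_accum < grad_threshold:
            # Low gradient: the loss is happy with this region, keep as is.
            out.append(Gaussian2D(g.x, g.y, g.scale))
        elif g.scale <= scale_threshold:
            # Clone: a small Gaussian in an under-covered region gets a copy.
            # (Simplification: the copy starts at the same position and only
            # drifts apart through later gradient steps.)
            out.append(Gaussian2D(g.x, g.y, g.scale))
            out.append(Gaussian2D(g.x, g.y, g.scale))
        else:
            # Split: a large Gaussian is replaced by two smaller ones whose
            # centers are sampled from the parent's footprint.
            for _ in range(2):
                out.append(Gaussian2D(
                    g.x + rng.gauss(0.0, g.scale),
                    g.y + rng.gauss(0.0, g.scale),
                    g.scale / shrink,
                ))
    return out


rng = random.Random(0)
cloud = [
    Gaussian2D(0.0, 0.0, 0.1, grad_accum=0.0),   # kept
    Gaussian2D(1.0, 1.0, 0.1, grad_accum=5.0),   # cloned
    Gaussian2D(2.0, 2.0, 1.0, grad_accum=5.0),   # split
]
denser = densify(cloud, grad_threshold=1.0, scale_threshold=0.5, rng=rng)
print(len(cloud), "->", len(denser))
```

The only thing that changes between photometric 3DGS training and the SDS setting is where `grad_accum` comes from: a reconstruction loss in the former, the diffusion model's gradient in the latter.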
From "a teapot" to a 30 MB textured 3D asset in 2 minutes on a single GPU. The fastest SDS-based generator at the time by ~10×, and the quality matched DreamFusion's 8-hour NeRF optimizations. The follow-ups (GaussianDreamer, LucidDreamer) extended the basic loop with better priors and more sampling tricks; the architecture is essentially the same.
Interactive · SDS-driven convergence
Click "play." The scene starts as 200 random Gaussians; the SDS gradient (here simulated) pushes them toward a target silhouette. Note the densification spikes: every 100 steps the simulation clones high-gradient Gaussians, mirroring DreamGaussian.
§ 3 · LGM / GRM
Feed-forward: a transformer that emits Gaussians
SDS is slow because it's per-scene optimization. The alternative is to train once on a big dataset of 3D shapes and learn to directly predict the Gaussians of a new object from a few input views. This is the recipe of LRM (Large Reconstruction Model) — adapted with a 3DGS head as LGM and GRM.
- Input. A handful of views of the target object. These can be either real photos or "fake" multi-view images hallucinated by a 2D diffusion model conditioned on a single image (Zero-1-to-3 style). 4 views is the sweet spot in practice.
- Encoder. Each view goes through a vision transformer (ViT). The patch tokens of all 4 views are concatenated into one long sequence — typically 4 × 256 = 1024 tokens.
- 3D decoder. A second transformer attends across the 1024 tokens to produce a set of 3D queries; each query head is decoded into a Gaussian's (μ, q, s, α, SH). Typically 8 K to 100 K Gaussians per shape.
- Loss (training only). Render the predicted Gaussians from a held-out camera and L2 against the held-out ground-truth view. The whole stack is end-to-end differentiable because 3DGS is.
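To make the encoder and decoder steps above concrete, here is a small sketch of the token arithmetic and of turning one decoder output vector into a Gaussian's (μ, q, s, α, SH). The 14-channel layout (3 + 4 + 3 + 1 + 3, with degree-0 SH treated as RGB), the activation functions, and the 224-pixel / 14-pixel-patch sizes are assumptions chosen for illustration; they are not taken from the LGM or GRM code.

```python
import math


def num_tokens(num_views, image_size, patch_size):
    """Total patch tokens when every view is cut into square patches."""
    patches_per_side = image_size // patch_size
    return num_views * patches_per_side ** 2


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def decode_gaussian(raw):
    """Map 14 unconstrained numbers to one Gaussian's parameters."""
    if len(raw) != 14:
        raise ValueError("expected 14 channels: 3 + 4 + 3 + 1 + 3")
    mu = [math.tanh(v) for v in raw[0:3]]        # position inside [-1, 1]^3
    q_raw = raw[3:7]
    norm = math.sqrt(sum(v * v for v in q_raw))
    if norm == 0.0:
        q = [1.0, 0.0, 0.0, 0.0]                 # fall back to the identity rotation
    else:
        q = [v / norm for v in q_raw]            # unit quaternion
    s = [math.exp(v) for v in raw[7:10]]         # strictly positive scales
    alpha = sigmoid(raw[10])                     # opacity in (0, 1)
    color = [sigmoid(v) for v in raw[11:14]]     # degree-0 SH treated as RGB
    return {"mu": mu, "q": q, "s": s, "alpha": alpha, "color": color}


tokens = num_tokens(num_views=4, image_size=224, patch_size=14)
gaussian = decode_gaussian([0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
print(tokens, gaussian)
```

The activations are there so that whatever the network emits is a valid Gaussian (unit rotation, positive scale, bounded opacity), which keeps the rasterizer, and therefore the rendering loss, well defined throughout training.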
Interactive · the 4-view → Gaussians pipeline
Four placeholder views around a synthetic object on the left. The transformer (middle, just a schematic) attends across all 4 sets of patch tokens and emits Gaussian queries. The right shows the predicted Gaussians orbiting in 3D. Drag the orbit slider; pretend the transformer understood it.
Inference time: ~1 second on a single GPU for a typical mesh-class object. That speed comes at a cost — the model only generates what it's seen during training, and there's a fixed budget of Gaussians per object. GRM extends LGM with multi-resolution heads and higher-resolution input views; recent work (CRM, InstantMesh) trades some of the GS heads for direct mesh prediction.
§ 4 · Trellis
Sparse voxels + flow matching
Trellis takes a different bet: instead of a transformer that emits unstructured Gaussians, train a flow-matching model that emits a structured sparse voxel grid. Each occupied voxel encodes a small bundle of Gaussians (à la Scaffold-GS); the flow model generates the occupancy pattern conditioned on a text or image prompt.
Why this works better than dense diffusion: most of 3D space is empty, so a dense voxel grid spends the bulk of its model capacity on void. Trellis's sparse representation only places samples where surfaces are, and generates them with a rectified-flow sampler that needs ~30 steps at inference (versus ~1000 for a standard DDPM sampler). The result: 2–10 second generations, in the same range as LGM, while being more composable (you can edit sub-volumes, repaint regions, add geometry without redoing the whole sample).
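The "most of 3D space is empty" argument can be checked with a toy object. The sketch below voxelizes the surface of a sphere inside a unit cube and reports what fraction of a dense grid the surface actually touches. The sphere, its radius, and the "center within half a voxel of the surface" occupancy test are assumptions for illustration only; real assets and Trellis's own voxelization will differ.

```python
def shell_occupancy(resolution, radius=0.4):
    """Fraction of voxels in a resolution^3 grid lying on a sphere's surface.

    The grid spans the cube [-0.5, 0.5]^3. A voxel counts as occupied when its
    center is within half a voxel edge of the sphere surface.
    """
    half_edge = 0.5 / resolution
    occupied = 0
    for i in range(resolution):
        cx = (i + 0.5) / resolution - 0.5
        for j in range(resolution):
            cy = (j + 0.5) / resolution - 0.5
            for k in range(resolution):
                cz = (k + 0.5) / resolution - 0.5
                dist = (cx * cx + cy * cy + cz * cz) ** 0.5
                if abs(dist - radius) <= half_edge:
                    occupied += 1
    return occupied / resolution ** 3


fractions = {res: shell_occupancy(res) for res in (16, 32, 64)}
for res, frac in fractions.items():
    print(f"{res}^3 grid: {100 * frac:.1f}% of voxels touch the surface")
```

Surface voxels grow with the square of the resolution while the dense grid grows with the cube, so the occupied fraction roughly halves every time the resolution doubles. That widening gap is what a sparse representation exploits.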
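import torch

# Pseudocode sketch: velocity_net, decoder_occ, gaussian_velocity_net, decoder_gs,
# sparse_grid and latent_dim stand in for trained modules and config defined elsewhere.
def trellis_inference(prompt, num_steps=30):
    dt = 1.0 / num_steps
    # ---- stage 1: sparse-voxel occupancy via rectified flow ----
    z = torch.randn_like(sparse_grid)                # t = 1: noise on the sparse-grid latent
    for step in range(num_steps):
        t = 1.0 - step * dt                          # integrate from t = 1 (noise) toward t = 0 (data)
        v = velocity_net(z, t, prompt)               # learned velocity field
        z = z - dt * v                               # Euler step
    occupancy = decoder_occ(z)                       # which voxels are surfaces?
    # ---- stage 2: per-voxel Gaussian bundle ----
    z_g = torch.randn(*occupancy.shape, latent_dim)  # one latent vector per voxel
    for step in range(num_steps):
        t = 1.0 - step * dt
        v = gaussian_velocity_net(z_g, t, prompt, occupancy)
        z_g = z_g - dt * v
    gaussians = decoder_gs(z_g)                      # bundle of (μ, q, s, α, SH) per voxel
    return gaussians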
Trellis cleanly separates where the geometry is (occupancy) from what it looks like (per-voxel Gaussians). Each stage runs in seconds. Sampling with a flow-matching model (instead of a full 1000-step DDPM chain) cuts inference iterations by roughly 30×. It's the strongest open-source single-image-to-3DGS system as of early 2025.
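The Euler update used in both stages of `trellis_inference` can be run end to end on a toy problem. In the sketch below the learned network is replaced by `toy_velocity`, the exact velocity field for a data distribution that is a single point; that stand-in and the function names are assumptions for illustration. The time convention matches the pseudocode: t = 1 is noise, t = 0 is data.

```python
def toy_velocity(z, t, target):
    """Exact velocity field when the data distribution is a single point.

    Along the straight path z_t = (1 - t) * target + t * noise, the velocity
    dz_t/dt = noise - target, which can be rewritten as (z_t - target) / t.
    """
    return [(zi - tg) / t for zi, tg in zip(z, target)]


def euler_sample(noise, target, num_steps=30):
    """Integrate from t = 1 (noise) to t = 0 (data) with fixed Euler steps."""
    dt = 1.0 / num_steps
    z = list(noise)
    for step in range(num_steps):
        t = 1.0 - step * dt
        v = toy_velocity(z, t, target)
        z = [zi - dt * vi for zi, vi in zip(z, v)]
    return z


target = [0.5, 0.25]
sample_30 = euler_sample([3.0, -2.0], target, num_steps=30)
sample_1 = euler_sample([3.0, -2.0], target, num_steps=1)
print(sample_30, sample_1)
```

Because the path from noise to data is a straight line here, the velocity is constant along it and even a single Euler step lands on the target. Learned rectified-flow paths are only approximately straight, which is why a few dozen steps are used in practice rather than one.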
§ 5 · Where this lands
Numbers as of mid-2025
Three to four orders of magnitude faster than 2022's text-to-3D, with quality that's better. The bottleneck moved from "can the model converge?" to "can the asset go into a game engine without cleanup?" — see also 3dgs-surface for the meshing side of that question.
§ 6 · The era
Eighteen months of generation
§ 7 · Open
What's not solved
Game-engine quality. SDS and LRM both produce 3DGS clouds with great front-facing detail and mushy backs. Turning them into clean PBR-textured meshes is a separate problem (the upcoming 3dgs-surface essay covers that). Trellis is the closest we have to a single-pass solution.
Compositionality. "A dog on a horse" still gets you a hybrid creature, not a composed scene. The same compositional weaknesses 2D diffusion has show up worse in 3D — there's no global layout planner.
Animatable output. Generated Gaussians come without rig or skinning. 3dgs-avatars handles attaching primitives to a skeleton; combining "generate" with "rig" remains open.