3DGS generation — from text and images to fuzzy point clouds

§ 0 · The setup

2D diffusion is huge. 3D data is scarce.

By late 2023, 2D image generation worked at consumer scale: Stable Diffusion, DALL·E 3, Midjourney. These models had been trained on billions of image–text pairs. The same was not true of 3D: Objaverse, the open 3D dataset most methods trained on, had ~800 K shapes. That is roughly three orders of magnitude less data, and the shapes were noisy, badly textured, and inconsistent in scale.

So the 3D generation problem became: how do you get 3D out without much 3D data to train on? Two answers emerged, both adopting 3DGS as the output representation:

Optimization · SDS-based

Use a frozen 2D diffusion model as a "critic." Render the candidate 3DGS scene from a randomly sampled camera; ask the diffusion model "is this a good rendering of the prompt?"; backpropagate its gradient through the rasterizer to update the Gaussians. This is per-scene optimization: slow (~2–30 minutes per asset, depending on the method), but very high quality.

DreamGaussian, GaussianDreamer.

Feed-forward · LRM-style

Train a transformer once on rendered views of Objaverse-scale 3D data (up to the ~800 K shapes above). At inference, feed it a few views and read out the Gaussians of the depicted object directly. Per-scene time: seconds. Quality is bounded by the training data: great on common categories, hallucinations on out-of-distribution inputs.

LGM, GRM, Trellis (with a rectified-flow twist).

Why 3DGS as the output format? It's explicit (renderable with a plain rasterizer, no MLP query per sample), it's differentiable (you can backprop through it for SDS), and it renders fast enough for the inner loop (every SDS step renders at least one fresh view, and 3DGS makes that cheap). NeRF, voxel grids, and meshes each lose to it on at least one of those three counts.

§ 1 · SDS, briefly

Score Distillation Sampling

A 2D diffusion model \(\epsilon_\phi(\mathbf{x}_t, t, y)\) is a denoiser: given a noisy image \(\mathbf{x}_t\), the noise level \(t\), and a text prompt \(y\), it predicts the noise that was added. If the image already looks like something the model would draw for the prompt, the prediction lands close to the noise actually injected; if not, the gap between the two points toward a more plausible image. That gap is a signal of "how off the prompt this image is." DreamFusion (Poole et al., 2022) noticed that you can use this signal to optimize anything that produces an image, including a 3D scene rendered from a random camera.

\[
\nabla_\theta \mathcal{L}_{\text{SDS}}(\theta) \;\propto\; \mathbb{E}_{t,\epsilon,c}\!\left[ w(t)\,\big(\epsilon_\phi(\mathbf{x}_t, t, y) - \epsilon\big) \, \frac{\partial \mathbf{x}}{\partial \theta} \right]
\]

where \(\mathbf{x} = \text{render}(\theta; c)\) is the render of your scene \(\theta\) at random camera \(c\), and \(\mathbf{x}_t = \alpha_t \mathbf{x} + \sigma_t \epsilon\) is its noised version. The bracketed term is the "score difference" — what the diffusion model thinks should move to match the prompt. Multiply by \(\partial \mathbf{x}/\partial \theta\) — the differentiable renderer's Jacobian — and you get a gradient on the 3D scene.
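
The gradient above can be sketched in a few lines of plain Python. This is a toy under loud assumptions, not DreamFusion's implementation: the renderer is the identity (so \(\partial \mathbf{x}/\partial \theta\) is the identity matrix), `toy_denoiser` is a hypothetical oracle that believes the clean image is a fixed `target`, the noise schedule is an assumed cosine schedule, and \(w(t) = 1\).

```python
import math
import random

def toy_denoiser(x_t, alpha_t, sigma_t, target):
    """Hypothetical stand-in for eps_phi: assumes the clean image is `target`
    and returns the noise that would explain x_t under that belief."""
    return [(xt - alpha_t * tg) / sigma_t for xt, tg in zip(x_t, target)]

def sds_grad(theta, target, rng, w=1.0):
    """One Monte Carlo sample of the SDS gradient for an identity renderer (x = theta)."""
    t = rng.uniform(0.02, 0.98)                       # random noise level
    alpha_t = math.cos(0.5 * math.pi * t)             # assumed cosine schedule
    sigma_t = math.sin(0.5 * math.pi * t)
    eps = [rng.gauss(0.0, 1.0) for _ in theta]        # injected noise
    x_t = [alpha_t * x + sigma_t * e for x, e in zip(theta, eps)]
    eps_hat = toy_denoiser(x_t, alpha_t, sigma_t, target)
    # dx/dtheta is the identity here, so the gradient is w(t) * (eps_hat - eps).
    return [w * (eh - e) for eh, e in zip(eps_hat, eps)]

def optimize(theta, target, steps=200, lr=0.05, seed=0):
    """Plain gradient descent on the scene parameters using sampled SDS gradients."""
    rng = random.Random(seed)
    theta = list(theta)
    for _ in range(steps):
        g = sds_grad(theta, target, rng)
        theta = [x - lr * gi for x, gi in zip(theta, g)]
    return theta

if __name__ == "__main__":
    print(optimize([0.0, 0.0, 0.0], [1.0, -0.5, 0.25]))
```

With this oracle the bracketed term reduces to \((\alpha_t/\sigma_t)(\mathbf{x} - \text{target})\), so each step pulls the "scene" toward the target. A real diffusion model gives a much noisier direction, and a real renderer supplies a non-trivial Jacobian through autodiff.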

Interactive · SDS gradient loop

A toy 2D "scene" of Gaussians, optimized toward a target image via a synthetic "denoiser" (we cheat and use the L2 gradient as a stand-in). Click "step" to run one SDS-ish update. Watch the Gaussians migrate toward the target: that is what happens per view in 3D, times ~1000 iterations.

§ 2 · DreamGaussian

The first 3DGS generator

Tang, Ren, Zhou, Liu, Zeng · ICLR 2024 (Oral) · arXiv:2309.16653

DreamGaussian was the first paper to drop SDS onto a 3DGS scene. The architecture is conceptually simple — initialize a thousand random Gaussians, run SDS, watch them organize into something that matches the prompt — but two key details made it actually work:

Densification, but driven by SDS gradients. Same clone-and-split trick as the original 3DGS, but the gradient comes from the diffusion model rather than a photometric loss. A region with high diffusion-gradient magnitude is a region the prompt says should have detail — so add Gaussians there.
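Mesh refinement, second stage. SDS on a Gaussian field converges to a plausible-but-blurry result. DreamGaussian then exports a textured mesh from the Gaussians (Marching Cubes on the density field queried from them), and fine-tunes the mesh's texture map in a second stage: renders are lightly noised, denoised by the diffusion model, and the texture is fit to the cleaned-up images with a pixel-wise MSE loss, since applying SDS directly to the texture tends to produce artifacts. The Gaussian stage gives geometry; the mesh stage gives crisp texture.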

From "a teapot" to a 30 MB textured 3D asset in 2 minutes on a single GPU. That made it the fastest SDS-based generator at the time by ~10×, with quality comparable to the much slower NeRF-based SDS pipelines such as DreamFusion. The follow-ups (GaussianDreamer, LucidDreamer) extended the basic loop with better priors and more sampling tricks; the architecture is essentially the same.
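
The clone-and-split rule described above can be sketched as follows. This is an illustration, not DreamGaussian's code: the dictionary layout, the thresholds `grad_thresh` and `scale_thresh`, and the shrink factor of 1.6 are all assumptions chosen for the example, and `grad_norms` stands in for whatever per-Gaussian gradient statistic the optimizer has accumulated from the SDS loss.

```python
import random

def densify(gaussians, grad_norms, grad_thresh=0.5, scale_thresh=0.1, seed=0):
    """Clone small high-gradient Gaussians; split large ones into two smaller copies.

    gaussians:  list of {"mu": [x, y, z], "scale": float}  (hypothetical layout)
    grad_norms: one accumulated gradient magnitude per Gaussian
    """
    rng = random.Random(seed)
    out = []
    for g, gn in zip(gaussians, grad_norms):
        if gn < grad_thresh:
            out.append(g)                                   # low gradient: keep as-is
        elif g["scale"] <= scale_thresh:
            # Clone: a small Gaussian in a region that wants more detail.
            out.append(g)
            out.append({"mu": list(g["mu"]), "scale": g["scale"]})
        else:
            # Split: replace one large Gaussian by two smaller, jittered ones.
            for _ in range(2):
                mu = [m + rng.gauss(0.0, g["scale"]) for m in g["mu"]]
                out.append({"mu": mu, "scale": g["scale"] / 1.6})
    return out

if __name__ == "__main__":
    gs = [{"mu": [0, 0, 0], "scale": 0.05},
          {"mu": [1, 0, 0], "scale": 0.05},
          {"mu": [0, 1, 0], "scale": 0.4}]
    print(len(densify(gs, [0.1, 0.9, 0.9])))
```

In the example, the first Gaussian is left alone, the second is cloned, and the third is split, so three Gaussians become five.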

Interactive · SDS-driven convergence

Click "play." The scene starts as 200 random Gaussians; the SDS gradient (here simulated) pushes them toward a target silhouette. Note the densification spikes: every 100 steps the simulation clones high-gradient Gaussians, mirroring DreamGaussian's periodic densification.

§ 3 · LGM / GRM

Feed-forward: a transformer that emits Gaussians

Tang, Chen, et al. (LGM, ECCV 2024) · Xu, Shi, et al. (GRM, ECCV 2024)

SDS is slow because it's per-scene optimization. The alternative is to train once on a big dataset of 3D shapes and learn to directly predict the Gaussians of a new object from a few input views. This is the recipe of LRM (Large Reconstruction Model), adapted with a 3DGS output head in LGM and GRM. The steps below describe the generic transformer layout; LGM itself uses a U-Net backbone in place of the transformer.

Input. A handful of views of the target object. These can be either real photos or "fake" multi-view images hallucinated by a 2D diffusion model conditioned on a single image (Zero-1-to-3 style). 4 views is the sweet spot in practice.
Encoder. Each view goes through a vision transformer (ViT). The patch tokens of all 4 views are concatenated into one long sequence — typically 4 × 256 = 1024 tokens.
3D decoder. A second transformer attends across the 1024 tokens to produce a set of 3D queries; each query head is decoded into a Gaussian's (μ, q, s, α, SH). Typically 8 K to 100 K Gaussians per shape.
Loss (training only). Render the predicted Gaussians from a held-out camera and L2 against the held-out ground-truth view. The whole stack is end-to-end differentiable because 3DGS is.
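
The decoding step in the list above can be made concrete with a small sketch. The 14-number layout and the activation choices (tanh for position, normalization for the quaternion, exp for scale, sigmoid for opacity and colour) are assumptions picked for illustration; LGM and GRM each define their own head, and real models usually carry higher-order SH coefficients rather than a single colour.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def decode_gaussian(raw):
    """Map one unconstrained 14-vector from a query head to valid Gaussian parameters."""
    assert len(raw) == 14
    mu = [math.tanh(v) for v in raw[0:3]]                   # position inside [-1, 1]^3
    q_norm = math.sqrt(sum(v * v for v in raw[3:7])) or 1.0
    q = [v / q_norm for v in raw[3:7]]                      # unit quaternion (rotation)
    s = [math.exp(v) for v in raw[7:10]]                    # strictly positive scales
    alpha = sigmoid(raw[10])                                # opacity in (0, 1)
    rgb = [sigmoid(v) for v in raw[11:14]]                  # degree-0 SH, read as a colour
    return mu, q, s, alpha, rgb

# Token budget from the encoder description: 4 views x 256 patch tokens each.
views, patches_per_view = 4, 256
num_tokens = views * patches_per_view

if __name__ == "__main__":
    print(num_tokens)
    print(decode_gaussian([0.0] * 3 + [2.0, 0.0, 0.0, 0.0] + [0.0] * 7))
```

The point of the activations is that any raw network output decodes to a renderable Gaussian, which keeps the rendering loss well defined from the first training step.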

Interactive · the 4-view → Gaussians pipeline

Four placeholder views around a synthetic object on the left. The transformer (middle, just a schematic) attends across all 4 sets of patch tokens and emits Gaussian queries. The right shows the predicted Gaussians orbiting in 3D. Drag the orbit slider; pretend the transformer understood it.

Inference time: ~1 second on a single GPU for a typical single object. That speed comes at a cost: the model only generates well what resembles its training data, and there's a fixed budget of Gaussians per object. GRM extends LGM with multi-resolution heads and higher-resolution input views; recent work (CRM, InstantMesh) trades some of the GS heads for direct mesh prediction.

§ 4 · Trellis

Sparse voxels + flow matching

Xiang, Lv, et al. · 2024 · arXiv:2412.01506

Trellis takes a different bet: instead of a transformer that emits unstructured Gaussians, train a flow-matching model that emits a structured sparse voxel grid. Each occupied voxel encodes a small bundle of Gaussians (à la Scaffold-GS); the flow model generates the occupancy pattern conditioned on a text or image prompt.

Why this works better than dense diffusion: most of 3D space is empty. A dense voxel grid wastes 99% of its model capacity on void. Trellis's sparse representation only places samples where surfaces are — and generates them in one rectified-flow pass at inference (~30 steps versus DDPM's 1000). The result: 2–10 second generations that match LGM's speed while being more composable (you can edit sub-volumes, repaint regions, add geometry without redoing the whole sample).
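
To get a feel for the "mostly empty" claim, here is a rough count under stated assumptions: the object is a single sphere of radius 0.4 inside a unit cube, the grid is 64³, and a voxel counts as "surface" if its centre lies within half a voxel of the sphere. None of these numbers come from Trellis; they only illustrate the scaling.

```python
def surface_fraction(n=64, radius=0.4):
    """Fraction of an n^3 grid whose voxel centres lie within half a voxel of a sphere's surface."""
    half_thickness = 0.5 / n
    occupied = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                # Voxel centre in a unit cube centred at the origin.
                x = (i + 0.5) / n - 0.5
                y = (j + 0.5) / n - 0.5
                z = (k + 0.5) / n - 0.5
                r = (x * x + y * y + z * z) ** 0.5
                if abs(r - radius) <= half_thickness:
                    occupied += 1
    return occupied / n ** 3

if __name__ == "__main__":
    print(f"surface voxels: {surface_fraction():.1%} of the dense grid")
```

Analytically the shell volume is about \(4\pi r^2 / n\), a few percent at this resolution, and the fraction shrinks in proportion to \(1/n\) as the grid gets finer. A dense model spends capacity on every voxel; a sparse one spends it only on that thin shell.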

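import numpy as np

def trellis_inference(prompt, nets, grid_shape, latent_dim, steps=30, seed=0):
    # Schematic only. `nets` bundles the four learned modules; the attribute
    # names are illustrative and are not the real Trellis API.
    rng = np.random.default_rng(seed)
    dt = 1.0 / steps
    # ---- stage 1: sparse-voxel occupancy via rectified flow ----
    z = rng.standard_normal(grid_shape)              # pure noise at t = 1
    for step in range(steps):
        t = (steps - step) / steps                   # t runs 1 -> 1/steps
        v = nets.velocity_net(z, t, prompt)          # learned velocity field
        z = z - dt * v                               # Euler step toward t = 0
    occupancy = nets.decoder_occ(z)                  # boolean grid: which voxels are surfaces?
    # ---- stage 2: per-voxel Gaussian bundle ----
    n_active = int(occupancy.sum())                  # latents only for occupied voxels
    z_g = rng.standard_normal((n_active, latent_dim))
    for step in range(steps):
        t = (steps - step) / steps                   # same schedule as stage 1
        v = nets.gaussian_velocity_net(z_g, t, prompt, occupancy)
        z_g = z_g - dt * v
    gaussians = nets.decoder_gs(z_g)                 # bundle of (μ, q, s, α, SH) per voxel
    return gaussians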

Trellis cleanly separates where the geometry is (occupancy) from what it looks like (per-voxel Gaussians). Each stage runs in seconds. Sampling with ~30 rectified-flow steps instead of DDPM's 1000 cuts inference iterations by roughly 30×. It's among the strongest open-source single-image-to-3DGS systems as of early 2025.
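
Why so few steps can suffice is easiest to see on a toy. The sketch below assumes the straight-line path \(\mathbf{x}_t = (1 - t)\,\mathbf{x}_0 + t\,\epsilon\) used in rectified flow and replaces the learned network with a hypothetical `oracle_velocity` that already knows the target \(\mathbf{x}_0\); a trained model only approximates this field.

```python
def oracle_velocity(z, t, target):
    """Exact velocity dx/dt = noise - target, rewritten in terms of the current point z.

    From x_t = (1 - t) * target + t * noise it follows that noise - target = (x_t - target) / t.
    """
    return [(zi - xi) / t for zi, xi in zip(z, target)]

def euler_sample(noise, target, steps=30):
    """Integrate from t = 1 (pure noise) down to t = 0 with fixed-size Euler steps."""
    z = list(noise)
    for step in range(steps):
        t = (steps - step) / steps                 # 1, 29/30, ..., 1/30
        v = oracle_velocity(z, t, target)
        z = [zi - vi / steps for zi, vi in zip(z, v)]
    return z

if __name__ == "__main__":
    print(euler_sample([0.3, -1.2, 2.0], [1.0, 0.0, -1.0]))
```

Because the path is a straight line, Euler integration is exact for this oracle and the sampler lands on the target for any step count. With a learned field the paths are only approximately straight, which is why a few dozen steps are used in practice instead of one.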

§ 5 · Where this lands

Numbers as of mid-2025

~2 s · Trellis · image → 3DGS
~1 s · LGM · 4 views → 3DGS
~2 min · DreamGaussian · text → 3DGS (per scene)
~30 min · DreamFusion baseline · text → NeRF

Up to three orders of magnitude faster than 2022's text-to-3D (~30 min down to ~1–2 s for the feed-forward systems), with quality that's better. The bottleneck moved from "can the model converge?" to "can the asset go into a game engine without cleanup?"; see also 3dgs-surface for the meshing side of that question.
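
As a quick sanity check on the speedup, the figures listed above can be converted to orders of magnitude. The timings are the approximate ones quoted in this section, not fresh measurements, and the dictionary name is just for this example.

```python
import math

# Approximate per-asset times from the list above, in seconds.
times_s = {
    "DreamFusion": 30 * 60,
    "DreamGaussian": 2 * 60,
    "Trellis": 2,
    "LGM": 1,
}

def orders_faster(baseline, method):
    """log10 of the speedup of `method` over `baseline`."""
    return math.log10(times_s[baseline] / times_s[method])

if __name__ == "__main__":
    for name in ("DreamGaussian", "Trellis", "LGM"):
        print(f"{name}: {orders_faster('DreamFusion', name):.1f} orders of magnitude")
```

On these numbers DreamGaussian is a bit over one order of magnitude faster than the DreamFusion baseline, and the feed-forward systems are about three.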

§ 7 · Open

What's not solved

Game-engine quality. SDS and LRM both produce 3DGS clouds with great front-facing detail and mushy backs. Turning them into clean PBR-textured meshes is a separate problem (the upcoming 3dgs-surface essay covers that). Trellis is the closest we have to a single-pass solution.

Compositionality. "A dog on a horse" still gets you a hybrid creature, not a composed scene. The same compositional weaknesses 2D diffusion has show up worse in 3D — there's no global layout planner.

Animatable output. Generated Gaussians come without rig or skinning. 3dgs-avatars handles attaching primitives to a skeleton; combining "generate" with "rig" remains open.
