3DGS avatars — attaching fuzzy points to a skeleton

§ 0 · The setup

Three problems, one model

Capturing a moving human is much harder than capturing a static scene. The fundamental issue: each training frame shows the body in a different configuration. The same chest seen with arms down and with arms raised looks like two different objects geometrically, yet it belongs to one person. Three sub-problems fall out:

What's intrinsic vs what's pose? Skin color, hair length, body shape are constants of the person. Joint angles, expression, and clothing wrinkles are state. You have to decompose what you see into these two pieces.
How does state propagate to geometry? If you know the elbow bent 30°, you need to know where every point on the forearm moves. This is the deformation function.
What if the model is wrong? Body models (SMPL, FLAME) cover bones and skin, not loose clothing, long hair, or hand-held objects. You need a way to add detail that doesn't move rigidly with the skeleton.

3DGS-based avatars resolve these by composing two pieces. A parametric body model (SMPL for body, FLAME for face) handles the geometric template — a triangle mesh that deforms cleanly under a small set of pose parameters. A cloud of Gaussians attached to the template's triangles handles the appearance and details — and crucially, each Gaussian moves with its parent triangle when the skeleton moves.

Why this works. SMPL is a decade-old idea (Loper et al., 2015, building on SCAPE by Anguelov et al., 2005). 3DGS is a 2-year-old idea. The marriage took six months, because Gaussians are explicit primitives that you can simply parent to mesh triangles, inheriting for free all the skinning machinery the graphics community had already written.

§ 1 · The body model

SMPL in 90 seconds

SMPL (Skinned Multi-Person Linear model) is a parameterized triangle mesh of 6890 vertices. Its inputs are 10 shape coefficients (β: tall/short, thin/wide, etc.) and 72 pose parameters (θ: 24 axis-angle rotations of 3 components each, one for the root and one for each of the 23 body joints). Its output is a deformed mesh you can render. Internally, three operations compose:
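
To make the parameter layout concrete, here is a minimal NumPy sketch of how a 72-vector θ unpacks into 24 rotation matrices via the Rodrigues formula. The function name `rodrigues` and the choice of joint index 18 are assumptions made for illustration; this is not the SMPL reference code.

```python
import numpy as np

def rodrigues(axis_angle):
    """Convert (N, 3) axis-angle vectors into (N, 3, 3) rotation matrices."""
    axis_angle = np.asarray(axis_angle, dtype=np.float64)
    angle = np.linalg.norm(axis_angle, axis=1, keepdims=True)   # (N, 1)
    axis = axis_angle / np.maximum(angle, 1e-12)                 # a zero vector stays zero
    x, y, z = axis[:, 0], axis[:, 1], axis[:, 2]
    zero = np.zeros_like(x)
    # Skew-symmetric cross-product matrix K of each unit axis, shape (N, 3, 3).
    K = np.stack([zero, -z, y,
                  z, zero, -x,
                  -y, x, zero], axis=1).reshape(-1, 3, 3)
    s = np.sin(angle)[:, :, None]
    c = np.cos(angle)[:, :, None]
    return np.eye(3) + s * K + (1.0 - c) * (K @ K)

beta = np.zeros(10)                       # shape coefficients (mean body)
theta = np.zeros(72)                      # pose parameters (rest pose)
theta[3 * 18 + 2] = np.deg2rad(30.0)      # rotate one joint 30 degrees about its z axis
                                          # (index 18 is an arbitrary pick for this sketch)
joint_rotations = rodrigues(theta.reshape(24, 3))   # (24, 3, 3): root + 23 body joints
```

All joints except the one that was touched come out as the identity, which is what "rest pose" means in this parameterization.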

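$$ V(\boldsymbol{\beta}, \boldsymbol{\theta}) \;=\; W\!\Big(T(\boldsymbol{\beta}) + B_p(\boldsymbol{\theta}),\, J(\boldsymbol{\beta}),\, \boldsymbol{\theta},\, \mathbf{W}\Big) $$

Here $J(\boldsymbol{\beta})$ gives the joint locations for this body shape and $\mathbf{W}$ is the matrix of per-vertex skinning weights.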

$T(\boldsymbol{\beta})$ is the rest-pose mesh for this body shape — a linear function of β around a learned mean.
$B_p(\boldsymbol{\theta})$ is the pose-corrective blendshape: small per-vertex offsets that fix the bulging/creasing that pure rotation can't capture.
$W(\cdot)$ is linear blend skinning (LBS): every vertex $v$ (taken in the rest pose, after the blendshape offsets) has a fixed weight vector $\mathbf{w}\in\mathbb{R}^{24}$ over the 24 joints, with entries $w_j$ that sum to 1. Its world-space position is the weighted sum of the positions it would take if it were rigidly attached to each joint, where $(R_j, t_j)$ is joint $j$'s rest-to-posed transform accumulated along the kinematic chain: $$ v_{\text{world}} = \sum_{j=1}^{24} w_j \,(R_j(\boldsymbol{\theta})\,v + t_j(\boldsymbol{\theta})). $$

Three things matter for 3DGS avatars: (a) the mesh is deterministic and differentiable given (β, θ), so you can backprop through it; (b) every point on the mesh has a 24-D LBS weight, which we'll inherit; (c) the canonical (rest-pose) mesh is the natural coordinate system in which to define Gaussians, so we don't have to learn pose-specific Gaussians from scratch.
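
The LBS sum above is short enough to write out directly. The sketch below assumes the per-joint rest-to-posed transforms are already available as rotation matrices and translations; the two-joint toy rig and all variable names are hypothetical, chosen only to make the blend visible.

```python
import numpy as np

def linear_blend_skinning(rest_vertices, skin_weights, joint_rot, joint_trans):
    """
    rest_vertices: (V, 3) rest-pose positions
    skin_weights:  (V, J) per-vertex weights, each row sums to 1
    joint_rot:     (J, 3, 3) rest-to-posed rotation of each joint
    joint_trans:   (J, 3)    rest-to-posed translation of each joint
    """
    # Where each vertex would land if it were rigidly attached to each joint: (J, V, 3).
    per_joint = np.einsum("jab,vb->jva", joint_rot, rest_vertices) + joint_trans[:, None, :]
    # Blend those candidate positions with the skinning weights: (V, 3).
    return np.einsum("vj,jva->va", skin_weights, per_joint)

# Toy rig: joint 0 stays put, joint 1 rotates 90 degrees about z around the pivot (1, 0, 0).
rot_z90 = np.array([[0.0, -1.0, 0.0],
                    [1.0,  0.0, 0.0],
                    [0.0,  0.0, 1.0]])
pivot = np.array([1.0, 0.0, 0.0])
joint_rot = np.stack([np.eye(3), rot_z90])
joint_trans = np.stack([np.zeros(3), pivot - rot_z90 @ pivot])

rest_vertices = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
skin_weights = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
posed = linear_blend_skinning(rest_vertices, skin_weights, joint_rot, joint_trans)
```

The vertex at the pivot is shared half and half between the two joints and stays where it is, while the far vertex swings with joint 1.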

§ 2 · Attaching Gaussians

Parent each Gaussian to a triangle

Pattern: GaussianAvatars (Qian et al., CVPR 2024) · GART (Lei et al., CVPR 2024) · HUGS (Kocabas et al., CVPR 2024)

The trick that made the marriage work: every Gaussian is stored not in world space, but in a triangle-local frame. For each Gaussian $i$ we store its parent triangle id $t_i$ and its position in barycentric + normal coordinates on that triangle:

$$ \mathbf{p}_i^{\text{local}} \;=\; (u_i,\, v_i,\, h_i), \quad \text{third barycentric weight } 1 - u_i - v_i, \quad h_i \in \mathbb{R}. $$

$u_i, v_i$ are barycentric weights inside the triangle, and $h_i$ is the height above the triangle's surface (positive = outside the mesh, negative = inside). The Gaussian's covariance and orientation are similarly stored in the triangle's tangent frame. To render at pose θ:

Deform SMPL to get the world-space triangle: vertices $v_{a}, v_{b}, v_{c}$.
Compute the Gaussian's world-space center: $\mathbf{p}_i^{\text{world}} = u_i v_a + v_i v_b + (1 - u_i - v_i) v_c + h_i \mathbf{n}$, where $\mathbf{n}$ is the triangle normal.
Compute the Gaussian's world-space rotation: $R_i^{\text{world}} = R_{\text{triangle}} \cdot R_i^{\text{local}}$, where $R_{\text{triangle}}$ is the triangle's tangent-frame rotation.
Project and render exactly as in baseline 3DGS — EWA, alpha composite, etc.

Critically, all four steps are differentiable. The optimizer sees gradients on $u_i, v_i, h_i,$ the local rotation, scale, opacity, and SH — plus, indirectly, on the SMPL pose θ and shape β if you choose to refine them. The Gaussians effectively act as a learned "skin" stretched over the mesh — they fix what SMPL can't model (clothing folds, hair, accessories) while inheriting SMPL's articulation for free.
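
Steps 2 and 3 above can be sketched in a few lines of NumPy. The tangent-frame construction here (first edge as tangent, cross products for the rest) is one common convention and an assumption of this sketch, as are the function names; real systems may build the frame differently.

```python
import numpy as np

def triangle_frame(va, vb, vc):
    """Rotation whose columns are the triangle's tangent, bitangent and normal."""
    edge = vb - va
    normal = np.cross(edge, vc - va)
    normal = normal / np.linalg.norm(normal)
    tangent = edge / np.linalg.norm(edge)
    bitangent = np.cross(normal, tangent)
    return np.stack([tangent, bitangent, normal], axis=1)

def gaussian_to_world(u, v, h, R_local, va, vb, vc):
    """Map a triangle-local Gaussian (u, v, h, R_local) to a world center and rotation."""
    R_triangle = triangle_frame(va, vb, vc)
    normal = R_triangle[:, 2]
    center = u * va + v * vb + (1.0 - u - v) * vc + h * normal
    return center, R_triangle @ R_local

# Rest-pose triangle and one Gaussian floating 0.1 above it.
va, vb, vc = np.array([0.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
u, v, h, R_local = 0.25, 0.25, 0.1, np.eye(3)
center_rest, R_rest = gaussian_to_world(u, v, h, R_local, va, vb, vc)

# Move the triangle rigidly (90 degrees about x, then shift up) and re-evaluate.
R_x90 = np.array([[1.0, 0.0,  0.0],
                  [0.0, 0.0, -1.0],
                  [0.0, 1.0,  0.0]])
offset = np.array([0.0, 0.0, 1.0])
posed_triangle = [R_x90 @ p + offset for p in (va, vb, vc)]
center_posed, R_posed = gaussian_to_world(u, v, h, R_local, *posed_triangle)
```

Because the stored quantities never change, a rigid motion of the triangle carries the Gaussian's center and orientation along with it; only the triangle vertices are recomputed per frame.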

Interactive · parented Gaussians on a stick figure

A toy 2D "skeleton" (head + torso + 4 limbs) with Gaussians attached to each segment. Drag any joint; the bones deform under LBS-ish rules; the Gaussians follow rigidly with their parent segment. This is the kindergarten version of what SMPL-based avatars do with the 13,776 triangles of the SMPL mesh.

§ 3 · Deformation

What about clothes, hair, the things SMPL ignores?

LBS attached to the body works for a t-shirt. It fails for a billowing skirt, a swinging ponytail, a hand-held bag. These need additional state — they bend with the body, but not as a rigid function of pose.

Two patterns handle this:

Pose-dependent residual MLP

A small MLP takes the SMPL pose θ and outputs per-Gaussian residuals $(\Delta\mu, \Delta s, \Delta\alpha)$ on top of the rigidly-skinned base. The MLP learns "when the right arm is raised, this hair Gaussian shifts a centimeter laterally." Used by GART, HUGS, 3DGS-Avatar.
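
A forward pass of such a residual network might look like the sketch below. The layer sizes, the per-Gaussian embedding used to tell Gaussians apart, and the zero-initialised output layer are all assumptions of this illustration; none of them is taken from GART, HUGS, or 3DGS-Avatar.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_GAUSSIANS, POSE_DIM, EMBED_DIM, HIDDEN = 1000, 72, 16, 64

# Hypothetical learnable parameters (in practice these live in an autodiff framework).
embeddings = rng.normal(0.0, 0.01, (NUM_GAUSSIANS, EMBED_DIM))   # one code per Gaussian
W1 = rng.normal(0.0, 0.1, (POSE_DIM + EMBED_DIM, HIDDEN))
b1 = np.zeros(HIDDEN)
W2 = np.zeros((HIDDEN, 7))   # zero init: every residual starts at exactly zero
b2 = np.zeros(7)

def pose_residuals(theta):
    """Return (d_mu, d_scale, d_opacity) for every Gaussian at pose theta."""
    pose = np.broadcast_to(theta, (NUM_GAUSSIANS, POSE_DIM))
    features = np.concatenate([pose, embeddings], axis=1)
    hidden = np.maximum(features @ W1 + b1, 0.0)     # ReLU
    out = hidden @ W2 + b2
    return out[:, 0:3], out[:, 3:6], out[:, 6]

theta = rng.normal(0.0, 0.2, POSE_DIM)
d_mu, d_scale, d_opacity = pose_residuals(theta)
# The residuals are added on top of the rigidly skinned values, e.g. mu = mu_skinned + d_mu.
```

Starting the last layer at zero means training begins from the pure rigid-skinning solution and only departs from it where the photometric loss asks for it.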

Coarse cage + sub-cage Gaussians

Skip the body model and learn a per-subject coarse mesh cage instead. Gaussians are parented to cage tetrahedra (4 verts each). LBS reduces to barycentric interpolation inside the cage. More flexible at the cost of needing a captured t-pose. Used by GauHuman, AnimatableGS.
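
The barycentric interpolation inside one cage tetrahedron can be sketched as follows. This is a generic construction under the assumptions stated in the comments, with hypothetical function names; it is not code from any of the systems named above.

```python
import numpy as np

def tet_barycentric(point, tet_vertices):
    """Barycentric weights (4,) of a point with respect to a tetrahedron given as (4, 3)."""
    v0 = tet_vertices[0]
    edges = (tet_vertices[1:] - v0).T            # 3x3, columns are the edges leaving v0
    w123 = np.linalg.solve(edges, point - v0)    # weights of vertices 1..3
    return np.concatenate([[1.0 - w123.sum()], w123])

def deform_point(weights, deformed_tet_vertices):
    """Re-evaluate the same weights on the deformed tetrahedron."""
    return weights @ deformed_tet_vertices

rest_tet = np.array([[0.0, 0.0, 0.0],
                     [1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.0, 0.0, 1.0]])
gaussian_center = np.array([0.25, 0.25, 0.25])
weights = tet_barycentric(gaussian_center, rest_tet)   # computed once, in the rest pose

deformed_tet = rest_tet.copy()
deformed_tet[3] = [0.0, 0.0, 2.0]                      # stretch the apex upward
moved_center = deform_point(weights, deformed_tet)
```

The weights are fixed at binding time; per frame, only the cage vertices move and each Gaussian center is a four-term weighted sum.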

Either way, the deformation function is end-to-end differentiable. You train on a video of the person; gradients flow from photometric loss → screen-space Gaussian → world-space Gaussian → triangle-local Gaussian + pose + (optional residual MLP) → all the way to the SMPL parameters if you want.

§ 4 · Faces

FLAME + Gaussians = expressive avatars

GaussianAvatars (Qian et al., CVPR 2024) · arXiv:2312.02069

Faces are both easier and harder than bodies. Easier: only one rigid head, no limbs. Harder: the surface is more detailed (eyelashes, gum lines, pore-scale texture) and the user-facing demand for fidelity is brutal. The face equivalent of SMPL is FLAME, a parametric head mesh with 5023 vertices, up to 300 shape components (often truncated to 100), 100 expression components, plus neck, jaw, and eyeball rotations.

GaussianAvatars (the seminal paper) parents Gaussians to FLAME triangles in exactly the same way as §2, with two refinements specifically for faces:

Adaptive density. Eye and lip regions get ~10× the Gaussian density of the cheek. The densification heuristic uses pixel-error per FLAME region, not global gradient magnitude — so the optimizer concentrates capacity on what matters.
Lash and hair as floating Gaussians. Gaussians with $|h_i|$ very large get promoted to "floaters" that are anchored to the closest triangle but skinned with a small residual rotation MLP — handling the cases where FLAME's skin assumption breaks (hair, brows, lashes).
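
One way the per-region densification heuristic described above could look is sketched here: average the pixel error inside each mesh region and hand out new Gaussians in proportion. The function name, the region-id map, and the proportional rule are assumptions of this sketch, not the published implementation.

```python
import numpy as np

def densification_budget(pixel_error, region_ids, num_regions, total_new_gaussians):
    """Split a budget of new Gaussians across regions by mean per-pixel error."""
    flat_ids = region_ids.ravel()
    error_sum = np.bincount(flat_ids, weights=pixel_error.ravel(), minlength=num_regions)
    pixel_count = np.bincount(flat_ids, minlength=num_regions)
    mean_error = error_sum / np.maximum(pixel_count, 1)   # regions with no pixels get 0
    share = mean_error / mean_error.sum()
    return np.rint(share * total_new_gaussians).astype(int)

# Toy 4x4 render: region 0 ("cheek") on the left, region 1 ("eye") on the right.
region_ids = np.array([[0, 0, 1, 1]] * 4)
pixel_error = np.where(region_ids == 1, 0.75, 0.25)
budget = densification_budget(pixel_error, region_ids, num_regions=2, total_new_gaussians=1000)
```

In this toy case the high-error region receives three times as many new Gaussians as the low-error one, which is the behaviour the text attributes to the heuristic.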

Interactive · facial expression blend

Drag the sliders for jaw, smile, and brow. The face Gaussians follow the FLAME-like deformation; floaters (lashes, brows) get extra rotation. Very stylized: a real GaussianAvatar has 100K+ Gaussians on the face alone.

Quality benchmark: 30+ dB PSNR on novel expressions, with short clips that are hard to tell apart from the source footage. The remaining hard problems are gaze (eyes are visually salient, but FLAME models each eyeball only as a rigidly rotating sphere and gaze is hard to fit from video) and view-dependent skin appearance (subsurface scattering, which per-Gaussian SH captures only approximately). Both are active research areas.
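
For reference, PSNR is a log-scaled mean squared error. The helper below is a minimal sketch that assumes images are float arrays in [0, 1]; the variable names are illustrative.

```python
import numpy as np

def psnr(rendered, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images of the same shape."""
    diff = np.asarray(rendered, dtype=np.float64) - np.asarray(reference, dtype=np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0.0:
        return float("inf")
    return 10.0 * np.log10(max_value ** 2 / mse)

reference = np.zeros((64, 64, 3))
rendered = reference + 10.0 ** -1.5      # every pixel off by about 0.032
score = psnr(rendered, reference)        # mse = 1e-3, so about 30 dB
```

So 30 dB corresponds to a root-mean-square error of roughly 3% of the full intensity range.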

§ 5 · The systems

Twelve months of 3DGS humans

§ 6 · Numbers

Where this stands

~30 dB · PSNR on novel-view, novel-pose renders
~100 K · Gaussians on a full-body avatar
~5 min · Training time per subject (from a 5-min video)
~80 FPS · Render speed at 1080p, with skinning, on an RTX 4090

§ 7 · Open

What's not solved

Cloth simulation, not just skinning. Loose garments still warp incorrectly. Recent work such as PhysAvatar (2024) treats garments as a separate component driven by physics-based simulation, but coupling that to a differentiable optimizer is hard.

Multi-person interaction. Two avatars touching cause horrible Gaussian interpenetration. Contact handling needs an explicit collision model that nobody has cleanly integrated.

Relighting. The training video bakes lighting into SH. Putting the avatar in a new environment needs material decomposition — see 3dgs-relighting.
