TalkingGaussian: Structure-Persistent 3D Talking Head Synthesis via Gaussian Splatting
Abstract
Radiance fields have demonstrated impressive performance in synthesizing lifelike 3D talking heads. However, due to the difficulty in fitting steep appearance changes, the prevailing paradigm that presents facial motions by directly modifying point appearance may lead to distortions in dynamic regions. To tackle this challenge, we introduce TalkingGaussian, a deformation-based radiance fields framework for high-fidelity talking head synthesis. Leveraging point-based Gaussian Splatting, facial motions can be represented in our method by applying smooth and continuous deformations to persistent Gaussian primitives, without needing to learn the difficult appearance changes required by previous methods. Owing to this simplification, precise facial motions can be synthesized while keeping the facial features highly intact. Under such a deformation paradigm, we further identify a face-mouth motion inconsistency that would affect the learning of detailed speaking motions. To address this conflict, we decompose the model into two branches for the face and inside-mouth areas separately, simplifying the learning tasks and helping to reconstruct more accurate motion and structure of the mouth region. Extensive experiments demonstrate that our method renders high-quality lip-synchronized talking head videos, with better facial fidelity and higher efficiency compared with previous methods.
Keywords: talking head synthesis, 3D Gaussian Splatting
Figure 1:Inaccurate predictions of the rapidly changing appearance often produce distorted facial features in previous NeRF-based methods. By keeping a persistent head structure and predicting deformation to represent facial motion, our TalkingGaussian outperforms previous methods in synthesizing more precise and clear talking heads.
1 Introduction
Synthesizing audio-driven talking head videos is valuable to a wide range of digital applications such as virtual reality, film-making, and human-computer interaction. Recently, radiance fields like Neural Radiance Fields (NeRF) [31] have been adopted by many methods [15, 43, 24, 52, 36, 40, 5] to improve the stability of 3D head structure while providing photo-realistic rendering, which has achieved great success in synthesizing high-fidelity talking head videos.
Most of these NeRF-based approaches [15, 43, 24, 52, 36] synthesize different facial motions by directly modifying color and density with neural networks, predicting a temporary condition-dependent appearance for each spatial point in the radiance fields whenever a condition feature is received. This appearance-modification paradigm enables previous methods to achieve dynamic lip-audio synchronization within a fixed space representation. However, since even neighboring regions of a human face can show significantly different colors and structures, it is challenging for these continuous and smooth neural fields to accurately fit the rapidly changing appearance required to represent facial motions, which may lead to heavy distortions of facial features such as a messy mouth and transparent eyelids, as shown in Fig. 1.
In this paper, we propose TalkingGaussian, a deformation-based talking head synthesis framework, that attempts to utilize the recent 3D Gaussian Splatting (3DGS) [20] to address the facial distortion problem in existing radiance-fields-based methods. The core idea of our method is to represent complex and fine-grained facial motions with several individual smooth deformations to simplify the learning task. To achieve this goal, we first obtain a persistent head structure that keeps an unchangeable appearance and stable geometry with 3DGS. Then, motions can be precisely represented just by the deformation applied to the head structure, therefore eliminating distortions produced from inaccurately predicted appearance, and leading to better facial fidelity while synthesizing high-quality talking heads.
Specifically, we represent the dynamic talking head with a 3DGS-based Deformable Gaussian Field, consisting of a static Persistent Gaussian Field and a neural Grid-based Motion Field to decouple the persistent head structure and dynamic facial motions. Unlike previous continuous neural-based backbones [31, 32, 24], 3DGS provides an explicit space representation by a definite set of Gaussian primitives, enabling us to obtain a more stable head structure and accurate control of spatial points. Based on this, we apply a point-wise deformation, which changes the position and shape of each primitive while persisting its color and opacity, to represent facial motions via the motion fields. Then the deformed primitives are input into the 3DGS rasterizer to render the target images. To facilitate the smooth learning for a target facial motion, we introduce an incremental sampling strategy that utilizes face action priors to schedule the optimization process of deformation.
In the Deformable Gaussian Fields, we further decompose the entire head into a face branch and an inside-mouth branch to resolve the motion inconsistency between these two regions, which greatly improves synthesis quality in both static structure and dynamic performance. Since the motions of the face and the inside of the mouth are not tightly coupled and can sometimes differ substantially, it is hard to accurately represent these delicate but conflicting motions with a single motion field. To simplify the learning of these two distinct motions, we divide the two regions in the 2D input images with a semantic mask and build two model branches to represent them individually. As the motion in each branch becomes simpler and smoother, our method achieves better visual-audio synchronization and reconstructs a more accurate mouth structure.
The main contributions of our paper are summarized as follows:
- We present a novel deformation-based framework that synthesizes talking heads by applying deformations to a persistent head structure, avoiding the inherent facial distortion caused by inaccurately predicted appearance changes and enabling the generation of precise and intact facial details.
- We propose a Face-Mouth Decomposition module that facilitates motion modeling by decomposing the conflicting deformation learning tasks, providing accurate mouth reconstruction and lip synchronization.
- Extensive experiments show that the proposed TalkingGaussian renders realistic lip-synchronized talking head videos with high visual quality, generalization ability, and efficiency, outperforming state-of-the-art methods in both objective evaluation and human judgment.
2 Related Work
Talking Head Synthesis. Driving talking heads with arbitrary input audio is an active research topic, aiming to reenact a specific person and generate highly audio-visual consistent videos. Early methods based on 2D generative models synthesize audio-synchronized lip motions for a given facial image [37, 12, 18, 7, 48]. Later advancements [44, 46, 55, 29] incorporate intermediate representations such as facial landmarks and morphable models for better control, but suffer from errors and information loss during the intermediate estimation. Due to the lack of an explicit 3D structure, these 2D-based methods fall short of preserving naturalness and consistency when the head pose changes.
Recently, Neural Radiance Fields (NeRF) [31] has been introduced as a 3D representation of the talking head structure, providing photorealistic rendering and personalized talking styles via person-specific training. Earlier NeRF-based works [15, 40, 27] suffer from the expensive cost of vanilla NeRF. By successfully driving efficient neural fields [32, 4] with audio, RAD-NeRF [43] and ER-NeRF [24] have gained tremendous improvements in both visual quality and efficiency. To improve generalizability to cross-domain audio inputs, GeneFace [52] and SyncTalk [36] pre-train the audio encoder on large audio-visual datasets. However, most of these methods represent facial motions by changing the appearance of each sampling point, which burdens the network with learning jumping and unsmooth appearance changes and results in distorted facial features. Although some works [40, 25, 53] have introduced a pre-trained deformable fields module for few-shot settings, the lack of fine-grained point control and a precise head structure limits their static and dynamic quality. Instead, by utilizing 3DGS to maintain an accurate head structure, our method reduces the learning difficulty of facial motions with a pure deformation representation, thereby improving facial fidelity and lip synchronization.
Deformation in Radiance Fields. Deformation has been widely applied in radiance fields to synthesize dynamic novel views. Some NeRF methods [38, 33, 34, 41, 13] use a static canonical radiance field to capture geometry and appearance, and a time-dependent deformation field for dynamics. These methods predict an offset with respect to the sampling position, which runs opposite to the actual motion path and brings extra difficulty in fitting. To solve this problem, [14] uses a deformation that directly warps the canonical fields to represent dynamics. However, this method is costly, since the spatial points cannot be accurately and stably controlled in its grid-based NeRF representation.
More recently, 3D Gaussian Splatting [20] introduces an explicit point-based representation for radiance fields, where deformation can be easily applied to a definite set of Gaussian primitives to directly warp the canonical fields. Based on this idea, a number of dynamic 3DGS works [30, 51, 49, 26, 22] achieve significant improvements in visual quality and efficiency for dynamic novel-view synthesis. However, these methods only aim to memorize the fixed motion at each time stamp, which is insufficient to represent the various fine-grained motions driven by conditions, especially around the mouth. Although some attempts reconstruct the human head [39, 50, 8, 45] driven by parameterized facial models, the mapping from audio to these parameters is hard to learn and causes information loss, so they cannot easily be transferred to our audio-driven task. In this paper, we introduce deformable Gaussian fields with an incremental sampling strategy to facilitate learning multiple complex facial motions from a monocular speech video via pure deformation, and decompose the inconsistent motions of the face and inside-mouth areas to improve the quality of delicate talking motions.
3 Method
3.1 Preliminaries and Problem Setting
3D Gaussian Splatting. 3D Gaussian splatting (3DGS) [20] represents 3D information with a set of 3D Gaussians. It computes pixel-wise color 𝒞 with a set of 3D Gaussian primitives 𝜃 and the camera model information at the observing view. Specifically, a Gaussian primitive can be described with a center 𝜇∈ℝ3, a scaling factor 𝑠∈ℝ3, and a rotation quaternion 𝑞∈ℝ4. For rendering purposes, each Gaussian primitive also retains an opacity value 𝛼∈ℝ and a 𝑍-dimensional color feature 𝑓∈ℝ𝑍. Thus, the 𝑖-th Gaussian primitive 𝒢𝑖 keeps a set of parameters 𝜃𝑖={𝜇𝑖,𝑠𝑖,𝑞𝑖,𝛼𝑖,𝑓𝑖}. Its basis function is in the form of:
$$\mathcal{G}_i(\mathbf{x}) = e^{-\frac{1}{2}(\mathbf{x}-\mu_i)^{T}\Sigma_i^{-1}(\mathbf{x}-\mu_i)}, \tag{1}$$
where the covariance matrix Σ can be calculated from 𝑠 and 𝑞.
During the point-based rendering, a rasterizer would gather 𝑁 Gaussians following the camera model to compute the color 𝒞 of pixel 𝐱𝑝, with the decoded color 𝑐 of feature 𝑓 and the projected opacity 𝛼~ calculated by their projected 2D Gaussians 𝒢𝑝𝑟𝑜𝑗 on image plane:
$$\mathcal{C}(\mathbf{x}_p) = \sum_{i \in N} c_i \tilde{\alpha}_i \prod_{j=1}^{i-1}\left(1-\tilde{\alpha}_j\right), \qquad \tilde{\alpha}_i = \alpha_i\, \mathcal{G}_i^{proj}(\mathbf{x}_p). \tag{2}$$
Similarly, the opacity 𝒜∈[0,1] of pixel 𝐱𝑝 can be given:
$$\mathcal{A}(\mathbf{x}_p) = \sum_{i \in N} \tilde{\alpha}_i \prod_{j=1}^{i-1}\left(1-\tilde{\alpha}_j\right). \tag{3}$$
3DGS optimizes the parameters 𝜃 for all Gaussians through gradient descent under color supervision. During the optimization process, it applies a densification strategy to control the growth of the primitives, while also pruning unnecessary ones. This work inherits these optimization strategies for color supervision.
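For concreteness, the per-pixel compositing of Eqs. (2)-(3) can be sketched as below. This is a minimal NumPy illustration rather than the tile-based CUDA rasterizer used by 3DGS, and it assumes the decoded colors and projected opacities of the Gaussians covering a pixel have already been computed and sorted front to back.

```python
import numpy as np

def composite_pixel(colors, alphas):
    """Front-to-back alpha compositing for one pixel, following Eqs. (2)-(3).

    colors: (N, 3) decoded colors c_i of the N Gaussians covering the pixel,
            sorted front to back.
    alphas: (N,) projected opacities alpha~_i = alpha_i * G_i^proj(x_p).
    Returns the pixel color C(x_p) and accumulated opacity A(x_p).
    """
    transmittance = 1.0          # running product of (1 - alpha~_j) for j < i
    pixel_color = np.zeros(3)
    pixel_opacity = 0.0
    for c_i, a_i in zip(colors, alphas):
        pixel_color += c_i * a_i * transmittance
        pixel_opacity += a_i * transmittance
        transmittance *= 1.0 - a_i
    return pixel_color, pixel_opacity

# Toy usage: two Gaussians cover the pixel; the front (red) one dominates.
colors = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
alphas = np.array([0.6, 0.8])
print(composite_pixel(colors, alphas))
```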
Problem Setting. In this paper, we aim to present an audio-driven framework based on the 3DGS representation for high-fidelity talking head synthesis. Adopting a similar problem setting as NeRF-based works [15, 27, 43, 24], we take a few-minute speech video of a single person as the training data. A 3DMM model [35] is utilized to estimate the head pose and thereby infer the camera pose. To stay aligned with previous works [15, 40, 27, 24], we use a pre-trained DeepSpeech [16] model as the basic audio encoder to extract a generalizable audio feature from the raw input speech audio.
Figure 2:Overview of TalkingGaussian. Learning from the speech video with training frames 𝐼, TalkingGaussian builds two separate branches to represent the dynamic face and inside mouth areas. Queried by the primitives in Persistent Gaussian Fields with parameters 𝜃𝐶, a point-wise deformation can be predicted from Grid-based Motion Fields conditioned with audio feature 𝒂 and upper-face expression 𝒆. After that, the 3DGS rasterizer renders the deformed 3D Gaussian primitives into 2D images observed from the given camera, which are then fused to synthesize the entire talking head.
3.2 Deformable Gaussian Fields for Talking Head
Figure 3: (a) The reconstructed facial motion results represented by deformation and by appearance modification. (b) The visualized traces of the changing coordinate offset (deformation) and RGB color (appearance modification) of two points with the same initial position. During the process, the offset changes smoothly and the corresponding results are clear and accurate. In contrast, sudden changes with large step lengths may occur in color, which are difficult to fit and cause a distorted mouth (red box).
Although previous NeRF-based methods [15, 40, 43, 52, 24, 27] have achieved great success in synthesizing high-quality talking heads by generating point-wise appearance, they still cannot avoid generating distorted facial features in dynamic regions. One main reason is that the appearance space, including color and density, changes abruptly and is unsmooth, which makes it difficult for the continuous and smooth neural fields to fit. In comparison, deformation is another way to represent motions with better smoothness and continuity, as shown in Fig. 3. In this work, we propose to purely use deformation in Gaussian radiance fields to represent the different motions of the talking head in 3D space. In particular, the whole representation is decomposed into Persistent Gaussian Fields and Grid-based Motion Fields, as shown in Fig. 2. These fields are further refined for different regions in the next section.
Persistent Gaussian Fields. The Persistent Gaussian Fields preserve the persistent Gaussian primitives with the canonical parameters 𝜃𝐶={𝜇,𝑠,𝑞,𝛼,𝑓}. We first initialize this module with static 3DGS on the speech video frames to obtain a coarse mean field. It then joins a joint optimization with the Grid-based Motion Fields.
Grid-based Motion Fields. Although the primitives in Persistent Gaussian Fields can effectively represent the correct 3D head, a regional position encoding is lacking due to their fully explicit space structure. Considering most facial motions are regionally smooth and continuous, we adopt an efficient and expressive tri-plane hash encoder ℋ [24] for position encoding with an MLP decoder to build Grid-based Motion Fields for a continuous deformation space.
Specifically, the motion fields aim to represent facial motion by predicting a point-wise deformation 𝛿𝑖={Δ𝜇𝑖,Δ𝑠𝑖,Δ𝑞𝑖} for each primitive, taking its center 𝜇𝑖 as input; the deformation does not involve any change of color or opacity. For a given condition feature set 𝐂, the deformation 𝛿𝑖 is calculated by:
$$\delta_i = \mathrm{MLP}\left(\mathcal{H}(\mu_i) \oplus \mathbf{C}\right), \tag{4}$$
where ⊕ denotes concatenation.
Through a 3DGS rasterizer, these two fields are combined to generate deformed Gaussian primitives for rendering the output image, where the deformed parameters 𝜃𝐷 are obtained from the canonical parameters 𝜃𝐶 and the deformation 𝛿:
$$\theta^{D} = \{\mu + \Delta\mu,\ s + \Delta s,\ q + \Delta q,\ \alpha,\ f\}. \tag{5}$$
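A minimal PyTorch sketch of this deformation pipeline is given below. The tri-plane hash encoder ℋ is replaced by a plain linear layer purely for illustration, and all layer sizes are arbitrary; only the structure of Eqs. (4)-(5), i.e. predicting {Δ𝜇, Δ𝑠, Δ𝑞} from the encoded center and the condition features while leaving opacity and color untouched, follows the text.

```python
import torch
import torch.nn as nn

class MotionField(nn.Module):
    """Sketch of the Grid-based Motion Fields (Eq. 4). `self.encoder` is a
    stand-in for the tri-plane hash encoder H; sizes are illustrative only."""

    def __init__(self, cond_dim, enc_dim=32, hidden=64):
        super().__init__()
        self.encoder = nn.Linear(3, enc_dim)              # placeholder for H(mu)
        self.mlp = nn.Sequential(
            nn.Linear(enc_dim + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 3 + 3 + 4),                 # delta mu, delta s, delta q
        )

    def forward(self, mu, cond):
        # mu: (P, 3) primitive centers; cond: (cond_dim,) condition feature C
        h = torch.cat([self.encoder(mu), cond.expand(mu.shape[0], -1)], dim=-1)
        return self.mlp(h).split([3, 3, 4], dim=-1)

def deform(theta_c, motion_field, cond):
    """Eq. (5): shift position, scale, and rotation; keep opacity and color."""
    d_mu, d_s, d_q = motion_field(theta_c["mu"], cond)
    return {"mu": theta_c["mu"] + d_mu, "s": theta_c["s"] + d_s,
            "q": theta_c["q"] + d_q, "alpha": theta_c["alpha"], "f": theta_c["f"]}
```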
Optimization with Incremental Sampling. While learning the deformation, if the target primitive position is too far from the predicted result, the gradient vanishes and the motion fields may fail to be updated effectively. To tackle this problem, we introduce an incremental sampling strategy. Specifically, we first find a valid metric 𝑚 (e.g., action units [11] or landmarks) to measure the deformation degree of each target facial motion. Then, at the 𝑘-th training iteration, we use a sliding window to sample a required training frame at position 𝑗, whose motion metric 𝑚𝑗 satisfies the condition:
$$m_j \in \left[B_{lower} + k \times T,\ B_{upper} + k \times T\right], \tag{6}$$
where 𝐵𝑙𝑜𝑤𝑒𝑟 and 𝐵𝑢𝑝𝑝𝑒𝑟 denote the initial lower and upper bounds of the sliding window, and 𝑇 denotes the step length. Such a selected training frame offers sufficient new knowledge for the deformable fields to learn, but is not too hard to fit. To avoid catastrophic forgetting, we apply the incremental sampling strategy once every 𝐾 iterations.
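The sampling schedule can be sketched as follows, assuming a precomputed per-frame motion metric (e.g., the mouth-opening height or an action-unit intensity); the window bounds and step length shown here are placeholders, not the values used in our experiments.

```python
import random

def incremental_sample(metrics, k, b_lower, b_upper, step):
    """Pick one training frame whose motion metric m_j lies inside the sliding
    window [B_lower + k*T, B_upper + k*T] of Eq. (6); fall back to a random
    frame if the window contains no frame."""
    lo, hi = b_lower + k * step, b_upper + k * step
    candidates = [j for j, m in enumerate(metrics) if lo <= m <= hi]
    return random.choice(candidates) if candidates else random.randrange(len(metrics))

# Example: apply the schedule once every K iterations (all values are placeholders).
K = 10
metrics = [0.02, 0.15, 0.40, 0.07, 0.33]          # per-frame mouth-opening metric
for it in range(50):
    if it % K == 0:
        j = incremental_sample(metrics, it // K, 0.0, 0.1, 0.01)
    else:
        j = random.randrange(len(metrics))        # uniform sampling otherwise
```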
3.3 Face-Mouth Decomposition
Although the Grid-based Motion Fields can predict a point-wise deformation at arbitrary positions thanks to the continuous and dense 3D space representation, this representation still encounters a granularity problem caused by the motion inconsistency between the face and the inside of the mouth. Since the inside of the mouth is spatially very close to the lips but does not always move with them, their motions interfere with each other in a single interpolation-based motion field. This can further degrade the reconstruction quality of the static structure as well, as shown in Fig. 4.
To tackle this problem, we propose decomposing these two regions in 3D space and building two individual branches with separate optimization. For each training video frame, we first use an off-the-shelf face parsing model to obtain a semantic mask of the inside-mouth region in 2D space. Then, we take the masked image of the inside mouth and the remaining surface region (containing the face, hair, and other head parts) to train two separate deformable Gaussian fields as the two branches of our framework.
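As a rough illustration of this decomposition step, the per-frame masking could look like the following; the parsing label id for the inner mouth is hypothetical and depends on the face-parsing model actually used, and in practice the background is also masked out of the face-branch target.

```python
import numpy as np

INNER_MOUTH_LABEL = 11   # hypothetical label id from the face-parsing model

def split_targets(frame, parsing):
    """Build the two branch targets from one RGB frame (H, W, 3) and its
    semantic parsing map (H, W): the inside-mouth region for the mouth branch
    and the remaining head surface for the face branch."""
    mouth_mask = (parsing == INNER_MOUTH_LABEL)[..., None]
    mouth_target = np.where(mouth_mask, frame, 0)   # inside-mouth branch target
    face_target = np.where(mouth_mask, 0, frame)    # face branch target (rest of head)
    return face_target, mouth_target
```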
Figure 4: (a) The lips and the inside of the mouth, especially the teeth, are hard to divide correctly with a single motion field. (b) This further affects the learning of the mouth structure and speaking motions, resulting in poor quality. Our Face-Mouth Decomposition successfully addresses this problem and renders high-fidelity results.
Face Branch. The face branch serves as the main part that fits the appearance and motion of the talking head, including all facial motions except those of the inside mouth. In this branch, we adopt a region attention mechanism [24] in the Grid-based Motion Fields to facilitate the learning of the conditioned deformation driven by the audio feature 𝒂 and the upper-face expression feature 𝒆. To fully decouple these two conditions, the upper-face expression feature 𝒆 is composed of 7 action units [11] that are explicitly unrelated to the mouth. The deformation 𝛿𝑖F for the 𝑖-th primitive in the face branch is predicted by:
$$\delta_i^{F} = \mathrm{MLP}\left(\mathcal{H}^{F}(\mu_i) \oplus \boldsymbol{a}_{r,i} \oplus \boldsymbol{e}_{r,i}\right), \tag{7}$$
where 𝒂𝑟,𝑖=𝑉𝒂,𝑖⊙𝒂 and 𝒆𝑟,𝑖=𝑉𝒆,𝑖⊙𝒆 denote the region-aware feature at position 𝜇𝑖 in the region attention mechanism, calculated by the attention vectors 𝑉𝒂,𝑖 and 𝑉𝒆,𝑖 with the Hadamard product ⊙.
During the optimization, we apply the incremental sampling strategy to the lips action and eye blinking. Specifically, we measure the lip-opening degree by the height of the mouth area according to the detected facial landmarks, and use AU45 [11] to describe the degree of eye closure. Then, we gradually move the sliding window to guide the face branch to learn the deformations of the lips from closed to open and of the eyes from open to closed.
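A toy version of the lip-opening metric used for this schedule could be computed from 2D landmarks as below, assuming a 68-point layout where indices 62 and 66 are the mid inner-lip points; eye closure would instead be taken from the AU45 intensity of an action-unit detector.

```python
import numpy as np

def mouth_open_degree(landmarks):
    """Vertical inner-lip gap normalized by the face height, usable as the
    motion metric m_j for the lips in the incremental sampling schedule.
    landmarks: (68, 2) array of 2D facial landmarks for one frame."""
    face_height = landmarks[:, 1].max() - landmarks[:, 1].min()
    gap = landmarks[66, 1] - landmarks[62, 1]        # lower minus upper inner lip (y grows downward)
    return float(gap / max(face_height, 1e-6))
```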
Inside Mouth Branch. The inside mouth branch represents the audio-driven dynamic inside-mouth region in 3D space. Considering that the inside of the mouth moves in a much simpler manner and is driven only by audio, we use a lightweight deformable Gaussian field to build this branch. In particular, we only predict the translation Δ𝜇𝑖 conditioned on the audio feature 𝒂 for the 𝑖-th primitive:
$$\delta_i^{M} = \{\Delta\mu_i^{M}\} = \mathrm{MLP}\left(\mathcal{H}^{M}(\mu_i) \oplus \boldsymbol{a}\right). \tag{8}$$
To obtain a better reconstruction of the teeth, we apply an incremental sampling strategy that smooths the learning of the overlap between teeth and lips using the quantitative metric AU25 [11].
Rendering. The final talking head image is fused from the two rendered face and inside-mouth images. Based on the physical structure, we assume the rendering results from the Inside Mouth Branch lie behind those from the Face Branch. Therefore, the talking head color 𝒞head of pixel 𝐱𝑝 can be rendered by:
$$\mathcal{C}_{\mathrm{head}}(\mathbf{x}_p) = \mathcal{C}_{\mathrm{face}}(\mathbf{x}_p) \times \mathcal{A}_{\mathrm{face}}(\mathbf{x}_p) + \mathcal{C}_{\mathrm{mouth}}(\mathbf{x}_p) \times \left(1 - \mathcal{A}_{\mathrm{face}}(\mathbf{x}_p)\right), \tag{9}$$
where 𝒞face and 𝒜face denote the predicted face color and opacity from the face branch, and 𝒞mouth is the color predicted by the inside mouth branch.
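The fusion of Eq. (9) amounts to alpha-compositing the face-branch render over the inside-mouth render; a minimal per-frame sketch in NumPy is:

```python
import numpy as np

def fuse_head(c_face, a_face, c_mouth):
    """Eq. (9): the inside-mouth render is assumed to lie behind the face
    render, so the face opacity acts as the blending weight.
    c_face, c_mouth: (H, W, 3) rendered colors; a_face: (H, W) face opacity."""
    a = a_face[..., None]
    return c_face * a + c_mouth * (1.0 - a)
```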
3.4 Training Details
We keep the basic 3DGS optimization strategies to train our framework. The full process can be divided into three stages: the first two are applied individually to the two branches, and the last is for fusion.
Static Initialization. At the beginning of the training, we first conduct an initialization via the vanilla 3DGS for the Persistent Gaussian Fields to get a coarse head structure. Following 3DGS, we use a pixel-wise L1 loss and a D-SSIM term to measure the error between the image ℐ^𝐶 rendered by parameters 𝜃𝐶 and the masked ground-truth image ℐmask for each branch:
$$\mathcal{L}_{C} = \mathcal{L}_{1}(\hat{\mathcal{I}}^{C}, \mathcal{I}_{\mathrm{mask}}) + \lambda\, \mathcal{L}_{\mathrm{D\text{-}SSIM}}(\hat{\mathcal{I}}^{C}, \mathcal{I}_{\mathrm{mask}}). \tag{10}$$
Motion Learning. After the initialization, we add the motion fields into training via their predicted deformation 𝛿. In practice, we take the deformed parameters 𝜃𝐷 from Equation 5 as the input of the 3DGS rasterizer to render the output image ℐ^𝐷. The loss function is:
$$\mathcal{L}_{D} = \mathcal{L}_{1}(\hat{\mathcal{I}}^{D}, \mathcal{I}_{\mathrm{mask}}) + \lambda\, \mathcal{L}_{\mathrm{D\text{-}SSIM}}(\hat{\mathcal{I}}^{D}, \mathcal{I}_{\mathrm{mask}}). \tag{11}$$
Fine-tuning. Finally, a color fine-tuning stage is conducted to better fuse the face and inside-mouth branches. We calculate the reconstruction loss between the fused image ℐ^head rendered by Equation 9 and the ground-truth video frame ℐ with pixel-wise L1, D-SSIM, and LPIPS terms:
$$\mathcal{L}_{F} = \mathcal{L}_{1}(\hat{\mathcal{I}}_{\mathrm{head}}, \mathcal{I}) + \lambda\, \mathcal{L}_{\mathrm{D\text{-}SSIM}}(\hat{\mathcal{I}}_{\mathrm{head}}, \mathcal{I}) + \gamma\, \mathcal{L}_{\mathrm{LPIPS}}(\hat{\mathcal{I}}_{\mathrm{head}}, \mathcal{I}). \tag{12}$$
At this stage, we only update the color parameter 𝑓∈𝜃𝐶 and stop the densification strategy of 3DGS for stability.
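A sketch of these objectives (Eqs. 10-12) is given below. It assumes callables `d_ssim` and `lpips_fn` that return the D-SSIM and LPIPS terms (the latter is available, e.g., from the `lpips` package), with λ and γ set as in Sec. 4.1.

```python
import torch
import torch.nn.functional as F

def recon_loss(pred, gt, d_ssim, lam=0.2):
    """Shared form of Eqs. (10) and (11): pixel-wise L1 plus a D-SSIM term.
    pred, gt: (1, 3, H, W) images in [0, 1]."""
    return F.l1_loss(pred, gt) + lam * d_ssim(pred, gt)

def finetune_loss(pred, gt, d_ssim, lpips_fn, lam=0.2, gamma=0.5):
    """Eq. (12): adds an LPIPS term for the fused head image."""
    return recon_loss(pred, gt, d_ssim, lam) + gamma * lpips_fn(pred, gt).mean()
```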
4 Experiment
4.1 Experimental Settings
Dataset. We collect four high-definition speaking video clips from previously released public video sets [15, 24, 52] to build the video datasets for our experiments, including three male portraits, "Macron", "Lieu", and "Obama", and one female portrait, "May". The video clips have an average length of about 6500 frames at 25 FPS with a centered portrait; three of them ("May", "Macron", and "Lieu") are cropped and resized to 512×512 and one ("Obama") to 450×450.
Comparison Baselines. In the experiments, we mainly compare our method with the most related NeRF-based methods: AD-NeRF [15], DFRF [40], RAD-NeRF [43], GeneFace [52], and ER-NeRF [24], which render talking heads via person-specific radiance fields trained on speech videos. Additionally, we also take state-of-the-art 2D generative models (Wav2Lip [37], IP-LAP [58], and DINet [57]), which do not need person-specific training, and person-specific methods (SynObama [42], NVP [44], and LSP [29]) as baselines.
Implementation Details. Our method is implemented in PyTorch. For a specific portrait, we first train the face and inside-mouth branches in parallel for 50,000 iterations and then jointly fine-tune them for 10,000 iterations. The Adam [21] and AdamW [28] optimizers are used for training. In the loss functions, 𝜆 and 𝛾 are set to 0.2 and 0.5. All experiments are performed on RTX 3080 Ti GPUs. The overall training process takes about 0.5 hours. A pre-trained DeepSpeech model [16] is used as the basic audio feature extractor.