Sicheng Xu*, Guojun Chen*, Jiaolong Yang, Yizhong Zhang, Yu Deng, Steve Lin, Baining Guo Microsoft Research Asia
*Equal contributions. Corresponding author: jiaoyan@microsoft.com
TL;DR: Single portrait photo to realistic and expressive 3D head avatar, animatable with speech audio to create 3D free-viewpoint talking face videos with precise lip-audio sync, lifelike facial behavior, and natural head movements, in real time.

Abstract

We propose VASA-3D, an audio-driven, single-shot 3D head avatar generator. This research tackles two major challenges: capturing the subtle expression details present in real human faces, and reconstructing an intricate 3D head avatar from a single portrait image. To accurately model expression details, VASA-3D leverages the motion latent of VASA-1, a method that yields exceptional realism and vividness in 2D talking heads. A critical element of our work is translating this motion latent to 3D, which is accomplished by devising a 3D head model conditioned on the motion latent. We customize this model to a single image through an optimization framework that employs numerous video frames of the reference head synthesized from the input image. The optimization uses training losses that are robust to artifacts and the limited pose coverage in the generated training data. Our experiments show that VASA-3D produces realistic 3D talking heads that prior art cannot achieve, and it supports the online generation of 512×512 free-viewpoint videos at up to 75 FPS, facilitating more immersive engagements with lifelike 3D avatars.
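To make the optimization concrete, below is a minimal sketch of the personalization loop. The networks and data are stand-ins (randomly generated here), not the VASA-3D implementation; it only illustrates the overall structure: synthesized frames of the reference head, each paired with a motion latent and a camera pose, supervise a motion-latent-conditioned 3D head model through a robust photometric loss (a Huber loss here, standing in for the paper's robust training losses).

# A minimal sketch of the single-image personalization described above.
# All modules below are illustrative stand-ins, not the authors' code.
import torch
import torch.nn as nn

class MotionConditioned3DHead(nn.Module):
    """Stand-in for the 3D head model conditioned on a motion latent:
    maps (motion latent, camera pose) to a rendered RGB frame."""
    def __init__(self, latent_dim=256, image_size=64):
        super().__init__()
        self.image_size = image_size
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + 6, 512), nn.ReLU(),
            nn.Linear(512, 3 * image_size * image_size),
        )

    def forward(self, motion_latent, camera_pose):
        x = torch.cat([motion_latent, camera_pose], dim=-1)
        img = self.decoder(x).view(-1, 3, self.image_size, self.image_size)
        return torch.sigmoid(img)

def robust_photometric_loss(pred, target, delta=0.1):
    # Huber loss: grows linearly for large errors, so outlier pixels
    # (e.g. artifacts in the synthesized training frames) dominate less
    # than under a plain L2 loss.
    abs_err = (pred - target).abs()
    quad = torch.clamp(abs_err, max=delta)
    return (0.5 * quad ** 2 + delta * (abs_err - quad)).mean()

# Hypothetical training set: frames of the reference head synthesized by a
# 2D generator from the single input portrait, each paired with its motion
# latent and an estimated camera pose.
num_frames, latent_dim = 32, 256
frames = torch.rand(num_frames, 3, 64, 64)
motion_latents = torch.randn(num_frames, latent_dim)
camera_poses = torch.randn(num_frames, 6)

model = MotionConditioned3DHead(latent_dim)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(200):  # fit the avatar to the synthesized frames
    pred = model(motion_latents, camera_poses)
    loss = robust_photometric_loss(pred, frames)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()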

(Note: all portrait images on this page are virtual, non-existing identities generated by StyleGAN2 or DALL·E-3. This research explores visual affective skill generation for virtual, interactive characters and does not impersonate any person in the real world. This is only a research demonstration. Please see the “Responsible AI considerations” section for more information including the measures we have taken and the positive applications we are exploring.)

Real-time Demo

Our method generates 3D talking head frames of 512×512 resolution at 75 FPS with a preceding latency of only 65 ms, evaluated on a desktop PC with a single NVIDIA RTX 4090 GPU. The speed test was conducted with a naive implementation, without further optimization.
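For context, the sketch below shows one generic way to measure first-frame latency and steady-state throughput of a frame generator on a GPU. It is an illustrative benchmark harness only, not the script behind the numbers above.

# An illustrative benchmarking harness: measures first-frame latency in
# milliseconds and steady-state throughput in frames per second.
import time
import torch
import torch.nn as nn

@torch.no_grad()
def benchmark(model, example_input, warmup=10, iters=200):
    for _ in range(warmup):                  # warm-up: allocator, autotuning
        model(example_input)
    if torch.cuda.is_available():
        torch.cuda.synchronize()             # drain queued GPU work
    t0 = time.perf_counter()
    model(example_input)                     # time to the first frame
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    latency_ms = (time.perf_counter() - t0) * 1e3
    t0 = time.perf_counter()
    for _ in range(iters):                   # steady-state throughput
        model(example_input)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    fps = iters / (time.perf_counter() - t0)
    return latency_ms, fps

# Stand-in model producing 512x512 RGB frames.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Conv2d(3, 3, kernel_size=3, padding=1).to(device)
frame = torch.rand(1, 3, 512, 512, device=device)
latency_ms, fps = benchmark(model, frame)
print(f"first-frame latency: {latency_ms:.1f} ms, throughput: {fps:.1f} FPS")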

Expressive 3D dynamic heads

Our method transforms a single portrait image into a lifelike 3D talking head that is synchronized with any speech audio input. Our model ensures multi-view consistency, supports real-time audio-driven animation and free-viewpoint rendering, and captures and conveys dynamic expression details with a degree of realism that markedly exceeds current state-of-the-art techniques.

Artistic Portraits

Our method can also effectively handle artistic images and produce convincing 3D talking heads.

Controllability

Inheriting the capabilities of VASA-1, VASA-3D can take additional control signals besides an audio clip. Here we present results with emotion offset control, where the generated 3D talking heads closely adhere to different emotion offsets and exhibit emotive talking styles.
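Conceptually, such a control signal can be applied as an offset in the motion latent space: the audio-to-motion generator produces a motion latent per frame, and the emotion offset shifts that latent before rendering. The sketch below illustrates this idea; the class, method names, and dimensions are hypothetical stand-ins for illustration, since the actual model and API are not released.

# A conceptual sketch of emotion offset control. All interfaces here
# (AvatarStub, motion_generator, render) are hypothetical stand-ins.
import torch
import torch.nn as nn

class AvatarStub(nn.Module):
    """Tiny stand-in for a personalized 3D avatar."""
    def __init__(self, audio_dim=80, latent_dim=256, image_size=16):
        super().__init__()
        self.image_size = image_size
        self.motion_generator = nn.Linear(audio_dim, latent_dim)  # audio -> motion latent
        self.renderer = nn.Linear(latent_dim + 6, 3 * image_size * image_size)

    def render(self, motion_latent, camera_pose):
        x = torch.cat([motion_latent, camera_pose], dim=-1)
        img = torch.sigmoid(self.renderer(x))
        return img.view(3, self.image_size, self.image_size)

def animate(avatar, audio_features, emotion_offset, camera_pose):
    """Drive the avatar with per-frame audio features plus an emotion offset."""
    frames = []
    for t in range(audio_features.shape[0]):
        # The offset shifts the expression style of every generated frame.
        latent = avatar.motion_generator(audio_features[t]) + emotion_offset
        frames.append(avatar.render(latent, camera_pose))
    return torch.stack(frames)  # (T, 3, H, W)

avatar = AvatarStub()
audio = torch.randn(25, 80)            # one second of per-frame audio features at 25 FPS
happy_offset = 0.5 * torch.randn(256)  # a learned emotion offset would go here
video = animate(avatar, audio, happy_offset, camera_pose=torch.randn(6))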



Risks and responsible AI considerations

Our research aims to support positive applications of virtual AI avatars and is not intended for creating misleading or deceptive content. Responsible AI is foundational to VASA-3D and was considered at every stage of its development. First, it is important to note that this is a research-only demonstration, and we are not making the models or APIs publicly available at this time. Second, the research and development of VASA-3D is based on synthetic videos generated from synthetic photos of non-existing subjects. Finally, we are using VASA-3D internally to explore further Responsible AI and AI detection efforts.

While recognizing the potential for misuse, it is important to acknowledge the substantial positive impact our technique could eventually have. We are currently examining potential benefits, such as its application in an AI coworker, which can make AI intelligence more accessible to knowledge workers, and an AI tutor, which can engage students in a more dynamic and effective manner. These applications highlight the significance of this research and of related investigations. We are committed to developing AI responsibly, with the goal of advancing human well-being.

BibTeX

@inproceedings{vasa3d,
  title={VASA-3D: Lifelike Audio-Driven Gaussian Head Avatars from a Single Image},
  author={Xu, Sicheng and Chen, Guojun and Yang, Jiaolong and Zhang, Yizhong and Deng, Yu and Lin, Steve and Guo, Baining},
  booktitle={Advances in Neural Information Processing Systems},
  year={2025}
}