Sicheng Xu^*, Guojun Chen^*, Yu-Xiao Guo^*, Jiaolong Yang^*‡,
Chong Li, Zhenyu Zang, Yizhong Zhang, Xin Tong, Baining Guo
Microsoft Research Asia
^*Equal Contributions ^‡Corresponding Author: jiaoyan@microsoft.com

TL;DR: single portrait photo + speech audio = hyper-realistic talking face video with precise lip-audio sync, lifelike facial behavior, and naturalistic head movements, generated in real time.

Abstract

We introduce VASA, a framework for generating lifelike talking faces of virtual characters with appealing visual affective skills (VAS), given a single static image and a speech audio clip. Our premier model, VASA-1, is capable of not only producing lip movements that are exquisitely synchronized with the audio, but also capturing a large spectrum of facial nuances and natural head motions that contribute to the perception of authenticity and liveliness. The core innovations include a holistic facial dynamics and head movement generation model that works in a face latent space, and the development of such an expressive and disentangled face latent space using videos. Through extensive experiments, including evaluation on a set of new metrics, we show that our method significantly outperforms previous methods across a range of dimensions. Our method not only delivers high video quality with realistic facial and head dynamics but also supports the online generation of 512x512 videos at up to 40 FPS with negligible starting latency. It paves the way for real-time engagements with lifelike avatars that emulate human conversational behaviors.

(Note: all portrait images on this page are virtual, non-existing identities generated by StyleGAN2 or DALL·E-3 (except for Mona Lisa). We are exploring visual affective skill generation for virtual, interactive characters, NOT impersonating any person in the real world. This is only a research demonstration and there's no product or API release plan. See also the bottom of this page for more of our Responsible AI considerations.)

Realism and liveliness

Our method is capable of not only producing precise lip-audio synchronization, but also generating a large spectrum of expressive facial nuances and natural head motions. It can handle arbitrary-length audio and stably output seamless talking face videos.

Examples with one-minute-long audio input

More short examples with diverse audio inputs

Controllability of generation

Our diffusion model accepts optional conditioning signals, such as main eye gaze direction, head distance, and emotion offsets.

Generation results under different main gaze directions (forward-facing, leftwards, rightwards, and upwards, respectively)

Generation results under different head distance scales

Generation results under different emotion offsets (neutral, happiness, anger, and surprise, respectively)

Out-of-distribution generalization

Our method can handle photo and audio inputs that are outside the training distribution. For example, it can handle artistic photos, singing audio, and non-English speech. These types of data were not present in the training set.

Power of disentanglement

Our latent representation disentangles appearance, 3D head pose, and facial dynamics, which enables separate attribute control and editing of the generated content.

Same input photo with different motion sequences (left two cases), and same motion sequence with different photos (right three cases)

Pose and expression editing (raw generation result, pose-only result, expression-only result, and expression with spinning pose)
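As a rough illustration of what separate attribute control can look like, the sketch below treats each frame as three independent parts and edits one part at a time. Everything here is hypothetical: the `FrameLatent` class, its field shapes, and the helper functions are assumptions made for illustration and are not the VASA-1 implementation, whose details are not released.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class FrameLatent:
    """Hypothetical per-frame latent with three disentangled parts."""
    appearance: Tuple[float, ...]          # identity/appearance code (from the photo)
    head_pose: Tuple[float, float, float]  # e.g. yaw, pitch, roll
    dynamics: Tuple[float, ...]            # facial dynamics (expression, lips, etc.)


def swap_appearance(frames: List[FrameLatent],
                    new_appearance: Tuple[float, ...]) -> List[FrameLatent]:
    """Same motion sequence, different photo: replace only the appearance part."""
    return [replace(f, appearance=new_appearance) for f in frames]


def pose_only(frames: List[FrameLatent],
              neutral_dynamics: Tuple[float, ...]) -> List[FrameLatent]:
    """Keep the generated head pose, freeze facial dynamics to a neutral code."""
    return [replace(f, dynamics=neutral_dynamics) for f in frames]


def expression_only(frames: List[FrameLatent],
                    fixed_pose: Tuple[float, float, float]) -> List[FrameLatent]:
    """Keep the generated facial dynamics, freeze the head pose."""
    return [replace(f, head_pose=fixed_pose) for f in frames]


# Toy two-frame "generated" sequence with made-up values.
generated = [
    FrameLatent((0.1, 0.2), (5.0, 0.0, 0.0), (0.3, 0.4)),
    FrameLatent((0.1, 0.2), (8.0, -2.0, 1.0), (0.6, 0.1)),
]

other_identity = swap_appearance(generated, (0.9, 0.8))
head_motion_only = pose_only(generated, (0.0, 0.0))
face_motion_only = expression_only(generated, (0.0, 0.0, 0.0))
```

Because each part is stored separately, every edit touches exactly one field and leaves the other two unchanged, which is the property the disentangled representation is meant to provide.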

Real-time efficiency

Our method generates video frames of 512x512 size at 45fps in the offline batch processing mode, and can support up to 40fps in the online streaming mode with a preceding latency of only 170ms, evaluated on a desktop PC with a single NVIDIA RTX 4090 GPU.
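To put these figures in per-frame terms, the short calculation below converts the stated frame rates into a time budget per frame and expresses the 170ms starting latency as a number of frames. The function and variable names are illustrative only; the inputs are the numbers quoted above, and the one-minute duration is taken from the example audio length mentioned earlier on this page.

```python
def frame_budget_ms(fps: float) -> float:
    """Average time available to produce one frame at a given frame rate."""
    return 1000.0 / fps


def latency_in_frames(latency_ms: float, fps: float) -> float:
    """How many frame intervals a given latency spans at a given frame rate."""
    return latency_ms * fps / 1000.0


def frames_for_audio(duration_s: float, fps: float) -> int:
    """Number of video frames needed to cover an audio clip."""
    return round(duration_s * fps)


ONLINE_FPS = 40.0         # online streaming mode (stated upper bound)
OFFLINE_FPS = 45.0        # offline batch processing mode
START_LATENCY_MS = 170.0  # stated preceding latency

online_budget = frame_budget_ms(ONLINE_FPS)     # 25 ms per frame
offline_budget = frame_budget_ms(OFFLINE_FPS)   # about 22.2 ms per frame
latency_frames = latency_in_frames(START_LATENCY_MS, ONLINE_FPS)  # about 6.8 frames
one_minute_frames = frames_for_audio(60.0, ONLINE_FPS)            # 2400 frames

print(f"online budget:  {online_budget:.1f} ms/frame")
print(f"offline budget: {offline_budget:.1f} ms/frame")
print(f"start latency:  {latency_frames:.1f} frames at {ONLINE_FPS:.0f} fps")
print(f"one-minute clip: {one_minute_frames} frames")
```

In other words, at 40fps the whole pipeline has about 25ms on average to produce each 512x512 frame, and the starting latency corresponds to roughly seven frames of video.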

A real-time demo

Risks and responsible AI considerations

Our research focuses on generating visual affective skills for virtual AI avatars, aiming for positive applications. It is not intended to create content that is used to mislead or deceive. However, like other related content generation techniques, it could still potentially be misused for impersonating humans. We oppose any behavior that creates misleading or harmful content about real persons, and are interested in applying our technique to advance forgery detection. Currently, the videos generated by this method still contain identifiable artifacts, and the numerical analysis shows that there is still a gap to close before they achieve the authenticity of real videos.

While acknowledging the possibility of misuse, it's imperative to recognize the substantial positive potential of our technique. The benefits – such as enhancing educational equity, improving accessibility for individuals with communication challenges, offering companionship or therapeutic support to those in need, among many others – underscore the importance of our research and other related explorations. We are dedicated to developing AI responsibly, with the goal of advancing human well-being.

Given such context, we have no plans to release an online demo, API, product, additional implementation details, or any related offerings until we are certain that the technology will be used responsibly and in accordance with proper regulations.