One-Shot Talking Face Generation from Single-Speaker Audio-Visual Correlation Learning

Published:2022-06-28 Issue:3 Volume:36 Page:2531-2539
ISSN:2374-3468
Container-title:Proceedings of the AAAI Conference on Artificial Intelligence
Short-container-title:AAAI

Author:

Wang Suzhen, Li Lincheng, Ding Yu, Yu Xin

Abstract

Audio-driven one-shot talking face generation methods are usually trained on videos of many different speakers. However, the videos they produce often suffer from unnatural mouth shapes and out-of-sync lips, because these methods struggle to learn a consistent speech style from different speakers. We observe that it is much easier to learn a consistent speech style from a specific speaker, which leads to authentic mouth movements. Hence, we propose a novel one-shot talking face generation framework that explores consistent correlations between audio and visual motions from a specific speaker and then transfers audio-driven motion fields to a reference image. Specifically, we develop an Audio-Visual Correlation Transformer (AVCT) that infers talking motions, represented by keypoint-based dense motion fields, from input audio. In particular, considering that audio may come from different identities in deployment, we use phonemes to represent audio signals. In this manner, our AVCT can inherently generalize to audio spoken by other identities. Moreover, as face keypoints are used to represent speakers, AVCT is agnostic to the appearance of the training speaker, and thus allows us to readily manipulate face images of different identities. Considering that different face shapes lead to different motions, a motion field transfer module is used to reduce the audio-driven dense motion field gap between the training identity and the one-shot reference. Once we obtain the dense motion field of the reference image, we employ an image renderer to generate its talking face video from an audio clip. Thanks to the learned consistent speaking style, our method generates authentic mouth shapes and vivid movements. Extensive experiments demonstrate that our synthesized videos outperform the state of the art in terms of visual quality and lip-sync.
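
The abstract refers to keypoint-based dense motion fields without giving a formula. As a rough illustration of the general idea only, the sketch below spreads a few sparse keypoint displacements over a pixel grid using Gaussian weights. This is a toy under stated assumptions, not the AVCT formulation: the function name `dense_motion_field`, the Gaussian blending rule, and the `sigma` parameter are all hypothetical choices made for this example.

```python
import math


def dense_motion_field(keypoints, displacements, height, width, sigma=0.1):
    """Blend sparse keypoint displacements into a dense height x width field.

    keypoints and displacements are lists of (x, y) pairs in normalized
    [0, 1] image coordinates. Each pixel receives a Gaussian-weighted average
    of the keypoint displacements. Hypothetical toy, not the paper's method.
    """
    field = []
    for row in range(height):
        y = (row + 0.5) / height  # pixel-center y coordinate
        field_row = []
        for col in range(width):
            x = (col + 0.5) / width  # pixel-center x coordinate
            # Weight of each keypoint falls off with squared distance.
            weights = [
                math.exp(-((x - kx) ** 2 + (y - ky) ** 2) / (2 * sigma ** 2))
                for kx, ky in keypoints
            ]
            total = sum(weights)
            if total == 0.0:
                # All weights underflowed: treat the pixel as static.
                field_row.append((0.0, 0.0))
                continue
            dx = sum(w * d[0] for w, d in zip(weights, displacements)) / total
            dy = sum(w * d[1] for w, d in zip(weights, displacements)) / total
            field_row.append((dx, dy))
        field.append(field_row)
    return field


# Two keypoints moving toward each other along the x axis.
example_field = dense_motion_field(
    keypoints=[(0.25, 0.5), (0.75, 0.5)],
    displacements=[(1.0, 0.0), (-1.0, 0.0)],
    height=4,
    width=4,
)
```

In this toy, pixels near the left keypoint inherit its rightward motion and pixels near the right keypoint inherit its leftward motion, which is the sense in which a handful of keypoints can drive motion for a whole image.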

Publisher

Association for the Advancement of Artificial Intelligence (AAAI)

Subject

General Medicine

Cited by 36 articles.

1. Deep Learning for Visual Speech Analysis: A Survey;IEEE Transactions on Pattern Analysis and Machine Intelligence;2024-09

2. MILG: Realistic lip-sync video generation with audio-modulated image inpainting;Visual Informatics;2024-09

3. Multi-Modal Driven Pose-Controllable Talking Head Generation;ACM Transactions on Multimedia Computing, Communications, and Applications;2024-08-10

4. OSM-Net: One-to-Many One-Shot Talking Head Generation With Spontaneous Head Motions;IEEE Transactions on Circuits and Systems for Video Technology;2024-08

5. Talking Face Generation via Face Mesh - Controllability without Reference Videos;2024 IEEE Conference on Artificial Intelligence (CAI);2024-06-25