Talking Face Generation by Adversarially Disentangled Audio-Visual Representation

Published: 2019-07-17 Volume: 33 Pages: 9299-9306
ISSN: 2374-3468
Container-title: Proceedings of the AAAI Conference on Artificial Intelligence
Short-container-title: AAAI

Authors:

Zhou Hang, Liu Yu, Liu Ziwei, Luo Ping, Wang Xiaogang

Abstract

Talking face generation aims to synthesize a sequence of face images that corresponds to a clip of speech. This is a challenging task because face appearance variation and the semantics of speech are coupled together in the subtle movements of the talking face regions. Existing works either construct a face appearance model for specific subjects or model the transformation between lip motion and speech. In this work, we integrate both aspects and enable arbitrary-subject talking face generation by learning a disentangled audio-visual representation. We find that a talking face sequence is a composition of subject-related information and speech-related information. These two spaces are then explicitly disentangled through a novel associative-and-adversarial training process. An advantage of this disentangled representation is that either audio or video can serve as the input for generation. Extensive experiments show that the proposed approach generates realistic talking face sequences on arbitrary subjects with much clearer lip motion patterns than previous work. We also demonstrate that the learned audio-visual representation is extremely useful for the tasks of automatic lip reading and audio-video retrieval.

Publisher

Association for the Advancement of Artificial Intelligence (AAAI)

Subject

General Medicine

Cited by 177 articles.

1. Role of human physiology and facial biomechanics towards building robust deepfake detectors: A comprehensive survey and analysis;Computer Science Review;2024-11

2. Deep Learning for Visual Speech Analysis: A Survey;IEEE Transactions on Pattern Analysis and Machine Intelligence;2024-09

3. Generating dynamic lip-syncing using target audio in a multimedia environment;Natural Language Processing Journal;2024-09

4. A Survey of Cross-Modal Visual Content Generation;IEEE Transactions on Circuits and Systems for Video Technology;2024-08

5. OSM-Net: One-to-Many One-Shot Talking Head Generation With Spontaneous Head Motions;IEEE Transactions on Circuits and Systems for Video Technology;2024-08