Realistic Speech-Driven Facial Animation with GANs

Published:2019-10-13 Issue:5 Volume:128 Page:1398-1413
ISSN:0920-5691
Container-title:International Journal of Computer Vision
language:en
Short-container-title:Int J Comput Vis

Author:

Vougioukas Konstantinos, Petridis Stavros, Pantic Maja

Abstract

Speech-driven facial animation is the process that automatically synthesizes talking characters based on speech signals. The majority of work in this domain creates a mapping from audio features to visual features. This approach often requires post-processing using computer graphics techniques to produce realistic, albeit subject-dependent, results. We present an end-to-end system that generates videos of a talking head, using only a still image of a person and an audio clip containing speech, without relying on handcrafted intermediate features. Our method generates videos which have (a) lip movements that are in sync with the audio and (b) natural facial expressions such as blinks and eyebrow movements. Our temporal GAN uses three discriminators focused on achieving detailed frames, audio-visual synchronization, and realistic expressions. We quantify the contribution of each component in our model using an ablation study and we provide insights into the latent representation of the model. The generated videos are evaluated based on sharpness, reconstruction quality, lip-reading accuracy, and synchronization, as well as their ability to generate natural blinks.

Funder

Imperial College London

Publisher

Springer Science and Business Media LLC

Subject

Artificial Intelligence, Computer Vision and Pattern Recognition, Software

Link

http://link.springer.com/content/pdf/10.1007/s11263-019-01251-8.pdf

References: 46 articles.

1. Amos, B., Ludwiczuk, B., & Satyanarayanan, M. (2016). OpenFace: A general-purpose face recognition library with mobile applications. Technical Report, 118.

2. Arjovsky, M., & Bottou, L. (2017). Towards principled methods for training generative adversarial networks. In ICLR.

3. Assael, Y. M., Shillingford, B., Whiteson, S., & de Freitas, N. (2016). LipNet: End-to-end sentence-level lipreading. arXiv preprint arXiv:1611.01599.

4. Bentivoglio, A. R., Bressman, S. B., Cassetta, E., Carretta, D., Tonali, P., & Albanese, A. (1997). Analysis of blink rate patterns in normal subjects. Movement Disorders, 12(6), 1028–1034.

5. Bregler, C., Covell, M., & Slaney, M. (1997). Video rewrite. In Proceedings of the 24th annual conference on computer graphics and interactive techniques (pp. 353–360).

Cited by 178 articles.

1. Deep Learning for Visual Speech Analysis: A Survey;IEEE Transactions on Pattern Analysis and Machine Intelligence;2024-09

2. 3D facial modeling, animation, and rendering for digital humans: A survey;Neurocomputing;2024-09

3. Cospeech body motion generation using a transformer;Applied Intelligence;2024-08-26

4. A survey on deep learning based reenactment methods for deepfake applications;IET Image Processing;2024-08-19

5. DialogueNeRF: towards realistic avatar face-to-face conversation video generation;Visual Intelligence;2024-08-07