3D facial animation driven by speech-video dual-modal signals

Published:2024-05-23 Issue:5 Volume:10 Page:5951-5964
ISSN:2199-4536
Container-title:Complex & Intelligent Systems
language:en
Short-container-title:Complex Intell. Syst.

Author:

Ji Xuejie, Liao Zhouzhou, Dong Lanfang, Tang Yingchao, Li Guoming, Mao Meng

Abstract

In recent years, the applications of digital humans have become increasingly widespread. One of the most challenging core technologies is the automated generation of highly realistic 3D facial animation that combines facial movements and speech. Single-modal 3D facial animation driven by speech typically ignores the fact that speech is only weakly correlated with upper facial movements and head posture. In contrast, the video-driven approach can solve the posture problem while obtaining natural expressions. However, mapping 2D facial information to 3D facial information may lead to information loss, which makes the lip synchronization of video-driven methods worse than that of speech-driven methods trained on 4D facial data. Therefore, this paper proposes a dual-modal generation method that uses both speech and video information to generate more natural and vivid 3D facial animation. Specifically, the lip movements related to speech are generated from speech and video information together, while speech-uncorrelated postures and expressions are generated solely from video information. The speech-driven module extracts speech features, and its output lip animation serves as the foundation for the facial animation. The expression and pose module extracts temporal visual features for regressing expression and head posture parameters. Speech and video features are fused to obtain the chin posture parameters related to lip movements, and these parameters are used to fine-tune the lip animation generated from the speech-driven module. The paper introduces multiple consistency losses to enhance the network’s capability to generate expressions and postures. Experiments conducted on the LRS3, TCD-TIMIT and MEAD datasets show that the proposed method achieves better performance on evaluation metrics such as CER, WER, VER and VWER than the current state-of-the-art methods.
In addition, a perceptual user study shows that in over 77% and 70% of cases, participants judged the proposed method to be more realistic than the comparison algorithms EMOCA and SPECTRE, respectively. In terms of lip synchronization, the proposed method was preferred in over 79% and 66% of cases, respectively. Both evaluations demonstrate the effectiveness of the proposed method.
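
The abstract reports CER and WER without defining them. As a minimal sketch, assuming the standard definitions (edit distance between a reference transcript and a predicted transcript, divided by the reference length, counted over characters for CER and over words for WER), they could be computed as below. The function names are illustrative and are not taken from the paper, and how the predicted transcripts are obtained from the generated animation is not described in this record. VER and VWER are presumably the analogous rates over viseme sequences, which the same `error_rate` helper would cover if given viseme tokens.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions and deletions."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence, hyp: Sequence) -> float:
    """Edit distance normalised by the reference length (assumed definition)."""
    if len(ref) == 0:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def wer(ref_text: str, hyp_text: str) -> float:
    """Word error rate over whitespace-separated tokens."""
    return error_rate(ref_text.split(), hyp_text.split())


def cer(ref_text: str, hyp_text: str) -> float:
    """Character error rate over raw characters, spaces included."""
    return error_rate(list(ref_text), list(hyp_text))
```

Because the rate is normalised by the reference length only, it can exceed 1.0 when the hypothesis contains many insertions, so lower is better but the value is not bounded above by 100%.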

Funder

The National Key Research and Development Program of China

Publisher

Springer Science and Business Media LLC

Link

https://link.springer.com/content/pdf/10.1007/s40747-024-01481-5.pdf

References (43 articles)

1. Abdelaziz AH, Zeiler S, Kolossa D (2015) Learning dynamic stream weights for coupled-HMM-based audio-visual speech recognition. IEEE/ACM Trans Audio, Speech, Lang Process 23(5):863–876

2. Afouras T, Chung JS, Zisserman A (2018) LRS3-TED: a large-scale dataset for visual speech recognition. arXiv preprint arXiv:1809.00496

3. Barros JMD, Golyanik V, Varanasi K, Stricker D (2019) Face it!: a pipeline for real-time performance-driven facial animation. In: 2019 IEEE International Conference on Image Processing (ICIP). IEEE. pp 2209–2213

4. Brand M (1999) Voice puppetry. In: Proceedings of the 26th annual conference on Computer graphics and interactive techniques, pp 21–28

5. Bulat A, Tzimiropoulos G (2017) How far are we from solving the 2D & 3D face alignment problem? (and a dataset of 230,000 3D facial landmarks). In: Proceedings of the IEEE international conference on computer vision, pp 1021–1030