MusicFace: Music-driven expressive singing face synthesis-Reference-Cited by-同舟云学术

MusicFace: Music-driven expressive singing face synthesis

Published:2023-11-30 Issue:1 Volume:10 Page:119-136
ISSN:2096-0433
Container-title:Computational Visual Media
language:en
Short-container-title:Comp. Visual Media

Author:

Liu Pengfei,Deng Wenjin,Li Hengda,Wang Jintai,Zheng Yinglin,Ding Yiwei,Guo Xiaohu,Zeng Ming

Abstract

AbstractIt remains an interesting and challenging problem to synthesize a vivid and realistic singing face driven by music. In this paper, we present a method for this task with natural motions for the lips, facial expression, head pose, and eyes. Due to the coupling of mixed information for the human voice and backing music in common music audio signals, we design a decouple-and-fuse strategy to tackle the challenge. We first decompose the input music audio into a human voice stream and a backing music stream. Due to the implicit and complicated correlation between the two-stream input signals and the dynamics of the facial expressions, head motions, and eye states, we model their relationship with an attention scheme, where the effects of the two streams are fused seamlessly. Furthermore, to improve the expressivenes of the generated results, we decompose head movement generation in terms of speed and direction, and decompose eye state generation into short-term blinking and long-term eye closing, modeling them separately. We have also built a novel dataset, SingingFace, to support training and evaluation of models for this task, including future work on this topic. Extensive experiments and a user study show that our proposed method is capable of synthesizing vivid singing faces, qualitatively and quantitatively better than the prior state-of-the-art.

Publisher

Springer Science and Business Media LLC

Subject

Artificial Intelligence,Computer Graphics and Computer-Aided Design,Computer Vision and Pattern Recognition

Link

https://link.springer.com/content/pdf/10.1007/s41095-023-0343-7.pdf

Reference71 articles.

1. Cudeiro, D.; Bolkart, T.; Laidlaw, C.; Ranjan, A.; Black, M. J. Capture, learning, and synthesis of 3D speaking styles. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10093–10103, 2019.

2. Suwajanakorn, S.; Seitz, S. M.; Kemelmacher-Shlizerman, I. Synthesizing Obama: Learning lip sync from audio. ACM Transactions on Graphics Vol. 36, No. 4, Article No. 95, 2017.

3. Chen, L. L.; Cui, G. F.; Liu, C. L.; Li, Z.; Kou, Z. Y.; Xu, Y.; Xu, C. L. Talking-head generation with rhythmic head motion. In: Computer Vision–ECCV 2020. Lecture Notes in Computer Science, Vol. 12354. Vedaldi, A.; Bischof, H.; Brox, T.; Frahm, J. M. Eds. Springer Cham, 35–51, 2020.

4. Yi, R.; Ye, Z. P.; Zhang, J. Y.; Bao, H. J.; Liu, Y. J. Audio-driven talking face video generation with learning-based personalized head pose. arXiv preprint arXiv:2002.10137, 2020.

5. Zhang, C. X.; Zhao, Y. F.; Huang, Y. F.; Zeng, M.; Ni, S. F.; Budagavi, M.; Guo, X. H. FACIAL: Synthesizing dynamic talking face with implicit attribute learning. arXiv preprint arXiv:2108.07938, 2021.

Cited by 2 articles. 订阅此论文施引文献订阅此论文施引文献，注册后可以免费订阅5篇论文的施引文献，订阅后可以查看论文全部施引文献

1. Application of a 3D Talking Head as Part of Telecommunication AR, VR, MR System: Systematic Review;Electronics;2023-11-26

2. Review of Time Series Classification Techniques and Methods;2023 International Seminar on Application for Technology of Information and Communication (iSemantic);2023-09-16