MusicFace: Music-driven expressive singing face synthesis
-
Published:2023-11-30
Issue:1
Volume:10
Page:119-136
-
ISSN:2096-0433
-
Container-title:Computational Visual Media
-
language:en
-
Short-container-title:Comp. Visual Media
Author:
Liu Pengfei, Deng Wenjin, Li Hengda, Wang Jintai, Zheng Yinglin, Ding Yiwei, Guo Xiaohu, Zeng Ming
Abstract
It remains an interesting and challenging problem to synthesize a vivid and realistic singing face driven by music. In this paper, we present a method for this task with natural motions for the lips, facial expression, head pose, and eyes. Because common music audio signals mix the human voice with backing music, we design a decouple-and-fuse strategy to tackle the challenge. We first decompose the input music audio into a human voice stream and a backing music stream. Because the correlation between the two-stream input signals and the dynamics of the facial expressions, head motions, and eye states is implicit and complicated, we model their relationship with an attention scheme, where the effects of the two streams are fused seamlessly. Furthermore, to improve the expressiveness of the generated results, we decompose head movement generation in terms of speed and direction, and decompose eye state generation into short-term blinking and long-term eye closing, modeling them separately. We have also built a novel dataset, SingingFace, to support training and evaluation of models for this task, including future work on this topic. Extensive experiments and a user study show that our proposed method is capable of synthesizing vivid singing faces, qualitatively and quantitatively better than the prior state-of-the-art.
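The idea of describing head movement by speed and direction can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not define either quantity, so the code below assumes that speed is the Euclidean magnitude of the frame-to-frame pose change and direction is the corresponding unit vector. The function names `decompose_motion` and `recompose_motion` and the three-component pose layout are hypothetical.

```python
import math

def decompose_motion(poses):
    """Split frame-to-frame pose changes into speed and direction.

    Assumption (not from the paper): speed is the Euclidean norm of the
    pose difference between consecutive frames, and direction is that
    difference normalized to unit length.
    """
    speeds, directions = [], []
    for prev, curr in zip(poses, poses[1:]):
        delta = [c - p for p, c in zip(prev, curr)]
        speed = math.sqrt(sum(d * d for d in delta))
        if speed > 0.0:
            direction = [d / speed for d in delta]
        else:
            # No movement between frames: direction is undefined, use zeros.
            direction = [0.0] * len(delta)
        speeds.append(speed)
        directions.append(direction)
    return speeds, directions

def recompose_motion(start, speeds, directions):
    """Rebuild the pose sequence from a start pose, speeds, and directions."""
    poses = [list(start)]
    for speed, direction in zip(speeds, directions):
        poses.append([p + speed * d for p, d in zip(poses[-1], direction)])
    return poses

# Hypothetical pose sequence (e.g. three rotation components per frame).
example_poses = [[0.0, 0.0, 0.0], [3.0, 4.0, 0.0], [3.0, 4.0, 0.0]]
example_speeds, example_directions = decompose_motion(example_poses)
rebuilt_poses = recompose_motion(example_poses[0], example_speeds, example_directions)
```

Under these assumptions the two factors can be predicted by separate models and multiplied back together, which is one plausible reading of "modeling them separately".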
Publisher
Springer Science and Business Media LLC
Subject
Artificial Intelligence, Computer Graphics and Computer-Aided Design, Computer Vision and Pattern Recognition
Cited by
3 articles.