Editable Co-Speech Gesture Synthesis Enhanced with Individual Representative Gestures
Published: 2024-08-21
Issue: 16
Volume: 13
Page: 3315
ISSN: 2079-9292
Container-title: Electronics
Language: en
Short-container-title: Electronics
Author:
Bao Yihua 1, Weng Dongdong 1, Gao Nan 2
Affiliation:
1. Beijing Engineering Research Center of Mixed Reality and Advanced Display, Beijing Institute of Technology, No. 5 Yard, Zhong Guan Cun South Street, Haidian District, Beijing 100081, China
2. Institute of Automation, Chinese Academy of Sciences, No. 95 Zhongguancun East Road, Haidian District, Beijing 100190, China
Abstract
Co-speech gesture synthesis is a challenging task due to the complexity and uncertainty of the relationship between gestures and speech. Gestures that accompany speech (i.e., co-speech gestures) are an essential part of natural and efficient embodied human communication, as they work in tandem with speech to convey information more effectively. Although data-driven approaches have improved gesture synthesis, existing deep learning-based methods rely on deterministic modeling, which can average out the predicted gestures. They also offer little control over gesture generation, such as user editing of the generated results. In this paper, we propose an editable gesture synthesis method based on a learned pose script, which disentangles gestures into individual representative gestures and rhythmic gestures to produce high-quality, diverse, and realistic poses. Specifically, we first detect the time of occurrence of gestures in video sequences and transform them into pose scripts. Regression models are then built to predict the pose scripts. Next, the learned pose scripts are used for gesture synthesis, while rhythmic gestures are modeled with a variational auto-encoder and a one-dimensional convolutional network. Moreover, we introduce a large-scale Chinese co-speech gesture synthesis dataset with multimodal annotations for training and evaluation, which will be made publicly available to facilitate future research. The proposed method allows the generated results to be re-edited by changing the pose scripts, enabling applications such as interactive digital humans. Experimental results show that the method generates higher-quality, more diverse, and more realistic gestures than existing methods.
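To make the rhythmic-gesture branch described above more concrete, the sketch below shows a minimal variational auto-encoder built from one-dimensional convolutions over a pose sequence. It is an illustrative assumption only: the class name PoseVAE, the layer sizes, pose_dim, latent_dim, and the loss weighting are all hypothetical and do not reflect the authors' actual implementation.

# Illustrative sketch only: a minimal 1D-convolutional VAE over pose sequences,
# loosely following the "variational auto-encoder + 1D convolutional network"
# description in the abstract. All names and hyperparameters are assumptions.
import torch
import torch.nn as nn


class PoseVAE(nn.Module):
    def __init__(self, pose_dim: int = 48, latent_dim: int = 32):
        super().__init__()
        # Encoder: 1D convolutions over the time axis of a pose sequence
        # shaped (batch, pose_dim, frames).
        self.encoder = nn.Sequential(
            nn.Conv1d(pose_dim, 128, kernel_size=5, stride=2, padding=2),
            nn.LeakyReLU(0.2),
            nn.Conv1d(128, 128, kernel_size=5, stride=2, padding=2),
            nn.LeakyReLU(0.2),
        )
        self.to_mu = nn.Conv1d(128, latent_dim, kernel_size=1)
        self.to_logvar = nn.Conv1d(128, latent_dim, kernel_size=1)
        # Decoder: transposed 1D convolutions back to the pose dimension.
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(latent_dim, 128, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.ConvTranspose1d(128, pose_dim, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, poses: torch.Tensor):
        h = self.encoder(poses)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample a latent sequence from N(mu, sigma^2).
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.decoder(z), mu, logvar


# Usage: reconstruct a batch of 64-frame pose sequences and compute a VAE loss.
model = PoseVAE()
poses = torch.randn(8, 48, 64)           # (batch, pose_dim, frames)
recon, mu, logvar = model(poses)
rec_loss = nn.functional.mse_loss(recon, poses)
kl_loss = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
loss = rec_loss + 0.1 * kl_loss          # KL weight is an arbitrary choice here

In a full system of the kind the abstract describes, such a rhythmic-gesture model would be combined with the predicted pose scripts, which specify when individual representative gestures occur and can be edited by the user to change the synthesized result.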
Funder
the National Key R&D Program of China; the 2022 major science and technology project “Yuelu·Multimodal Graph-Text-Sound-Semantic Gesture Big Model Research and Demonstration Application” in Changsha