Affiliation:
1. College of Computer Science and Electronic Engineering, Hunan University, Yuelushan, Changsha 410082, China
2. School of Computer Science, Hubei University of Technology, Nanli Road, Wuhan 430068, China
Abstract
Realistic co-speech gestures are important for anthropomorphizing embodied conversational agents (ECAs), since nonverbal behavior greatly improves the expressiveness of their speech. However, existing approaches that generate co-speech gestures with sufficient detail (including fingers) in 3D scenarios are rare. Moreover, they rarely address abnormal gestures, temporal–spatial coherence, and the diversity of gesture sequences in a comprehensive way. To handle abnormal gestures, we propose an angle conversion method that removes body-part lengths from the original in-the-wild video dataset by converting the coordinates of human upper-body key points into relative deflection angles and pitch angles. We also propose HARP, an encoder–decoder neural network built on CNNs and LSTMs that maps MFCC features of speech audio to these angles. The predicted angles can then be rendered as the corresponding co-speech gestures. Compared with other state-of-the-art approaches, the co-speech gestures generated by HARP are shown to be nearly as good as those of a real person, exhibiting strong temporal–spatial coherence, diversity, persuasiveness, and credibility. Our approach offers finer control over co-speech gestures than most existing works by handling all key points of the human upper body. It is also more feasible for industrial application, since HARP can adapt to any human upper-body model. All related code and evidence videos of HARP can be accessed at https://github.com/drrobincroft/HARP .
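As an illustration of the angle conversion described in the abstract, the following Python sketch shows one plausible way to map an upper-body bone (a parent key point and its child key point) to a deflection angle and a pitch angle, and back again for rendering. The joint names, coordinate convention, and helper functions are assumptions made for illustration, not the paper's exact formulation.

```python
# Hypothetical sketch of the angle-conversion step: convert a bone
# (parent joint -> child joint) into a deflection (azimuth) angle and a
# pitch (elevation) angle, discarding the bone length; then re-attach an
# assumed bone length to recover a 3D key point for rendering.
import numpy as np

def keypoints_to_angles(parent_xyz: np.ndarray, child_xyz: np.ndarray) -> tuple[float, float]:
    """Return (deflection, pitch) in radians for the bone parent -> child."""
    dx, dy, dz = child_xyz - parent_xyz
    deflection = np.arctan2(dy, dx)            # rotation within the horizontal plane
    pitch = np.arctan2(dz, np.hypot(dx, dy))   # elevation above the horizontal plane
    return float(deflection), float(pitch)

def angles_to_keypoint(parent_xyz: np.ndarray, deflection: float,
                       pitch: float, bone_length: float) -> np.ndarray:
    """Place the child key point given the angles and a chosen bone length."""
    direction = np.array([np.cos(pitch) * np.cos(deflection),
                          np.cos(pitch) * np.sin(deflection),
                          np.sin(pitch)])
    return parent_xyz + bone_length * direction

# Example: shoulder -> elbow bone of an arbitrary skeleton.
shoulder = np.array([0.0, 0.0, 1.4])
elbow = np.array([0.25, 0.05, 1.15])
defl, pit = keypoints_to_angles(shoulder, elbow)
reconstructed = angles_to_keypoint(shoulder, defl, pit, np.linalg.norm(elbow - shoulder))
assert np.allclose(reconstructed, elbow)
```

Because the angles are length-free, the same predicted sequence can in principle be re-rendered on any upper-body model by supplying that model's own bone lengths, which is consistent with the adaptivity claim in the abstract.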
Funder
National Natural Science Foundation of China
Medical Science and Technology Project of Zhejiang Province
Science and Technology Project of Nantong City
Publisher
World Scientific Pub Co Pte Ltd