Improving diversity of speech‐driven gesture generation with memory networks as dynamic dictionaries

Published: 2024-04-22
ISSN: 2468-2322
Container-title: CAAI Transactions on Intelligence Technology
Language: en
Short-container-title: CAAI Trans on Intel Tech

Author:

Zhao Zeyu¹², Gao Nan¹, Zeng Zhi³, Zhang Guixuan¹³, Liu Jie¹³, Zhang Shuwu³

Affiliation:

1. Institute of Automation, Chinese Academy of Sciences, Beijing, China

2. School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China

3. Beijing University of Posts and Telecommunications, Beijing, China

Abstract

Generating co‐speech gestures for interactive digital humans remains challenging because of the indeterministic nature of the problem. The authors observe that gestures generated from speech audio or text by existing neural methods often contain less movement shift than expected, which can be viewed as slow or dull. Thus, a new generative model coupled with memory networks as dynamic dictionaries for speech‐driven gesture generation with improved diversity is proposed. More specifically, the dictionary network dynamically stores connections between text and pose features in a list of key‐value pairs as the memory for the pose generation network to look up; the pose generation network then merges the matching pose features and input audio features for generating the final pose sequences. To make the improvements more accurately measurable, a new objective evaluation metric for gesture diversity that can remove the influence of low‐quality motions is also proposed and tested. Quantitative and qualitative experiments demonstrate that the proposed architecture succeeds in generating gestures with improved diversity.
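
The key‐value lookup described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class `DynamicDictionary`, the functions `softmax`, `dot`, and `merge`, the dot-product similarity, the softmax weighting, the concatenation used for merging, and all feature values are assumptions chosen for illustration only. The abstract does not specify how entries are matched or how features are merged.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


class DynamicDictionary:
    """Hypothetical key-value memory.

    Keys stand in for text features, values for pose features.
    The matching rule (dot product + softmax) is an assumption.
    """

    def __init__(self):
        self.keys = []
        self.values = []

    def write(self, text_feature, pose_feature):
        # Store one text-to-pose connection as a key-value pair.
        self.keys.append(list(text_feature))
        self.values.append(list(pose_feature))

    def read(self, query):
        # Return a similarity-weighted average of the stored pose features.
        if not self.keys:
            raise ValueError("memory is empty")
        weights = softmax([dot(query, k) for k in self.keys])
        dim = len(self.values[0])
        return [
            sum(w * v[i] for w, v in zip(weights, self.values))
            for i in range(dim)
        ]


def merge(pose_feature, audio_feature):
    # Concatenation is a stand-in for the unspecified merge step.
    return list(pose_feature) + list(audio_feature)


# Toy values, invented for this example.
memory = DynamicDictionary()
memory.write([1.0, 0.0], [0.2, 0.8, 0.0])
memory.write([0.0, 1.0], [0.9, 0.1, 0.5])

retrieved = memory.read([5.0, 0.0])           # query close to the first key
decoder_input = merge(retrieved, [0.3, 0.7])  # pose features + audio features
```

In this sketch the query is much closer to the first key, so the retrieved pose feature is dominated by the first stored value; a pose generation network would then consume `decoder_input`.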

Publisher

Institution of Engineering and Technology (IET)
