Dancing with the sound in edge computing environments

Published:2021-10-14
ISSN:1022-0038
Container-title:Wireless Networks
language:en
Short-container-title:Wireless Netw

Author:

Hao Wangli, Han Meng, Li Shancang, Li Fuzhong

Abstract

Conventional motion prediction methods have achieved promising performance. However, the motion sequences predicted in most of the literature are short, and the rhythm of the generated pose sequence has rarely been explored. To pursue high-quality, rhythmic, long-term pose sequence prediction, this paper explores a novel "dancing with the sound" task, which is appealing and challenging in the computer vision field. To tackle this problem, a novel model is proposed that takes sound as an indicator input and outputs a dancing pose sequence. Specifically, the model is based on the variational autoencoder (VAE) framework and encodes the continuity and rhythm of the sound information into the hidden space to generate a coherent, diverse, rhythmic, and long-term pose video. Extensive experiments validated the effectiveness of audio cues in the generation of dancing pose sequences. In addition, a novel dataset for audiovisual multimodal sequence generation has been released to promote the development of this field.

Funder

Intelligent Information Processing Shanxi Provincial Key Laboratory Open Project Fund

Shanxi Province Higher Education Innovation Project of China

Shanxi Key Research and Development Program

Shanxi Agricultural University Academic Recovery Research Project

Publisher

Springer Science and Business Media LLC

Subject

Electrical and Electronic Engineering,Computer Networks and Communications,Information Systems

Link

https://link.springer.com/content/pdf/10.1007/s11276-021-02810-z.pdf

Reference: 46 articles.

1. Chen, L., Srivastava, S., Duan, Z., & Xu, C. (2017). Deep cross-modal audio-visual generation. In Proceedings of the on thematic workshops of ACM multimedia 2017 (pp. 349–357). ACM.

2. Brand, M. (1999). Voice puppetry. In Proceedings of the 26th annual conference on computer graphics and interactive techniques (pp. 21–28). ACM Press/Addison-Wesley Publishing Co.

3. Bregler, C., Covell, M., & Slaney, M. (1997). Video rewrite: Driving visual speech with audio. Siggraph, 97, 353–360.

4. Suwajanakorn, S., Seitz, S. M., & Kemelmacher-Shlizerman, I. (2017). Synthesizing Obama: Learning lip sync from audio. ACM Transactions on Graphics (TOG), 36(4), 95.

5. Taylor, S., Kim, T., Yue, Y., Mahler, M., Krahe, J., Rodriguez, A. G., et al. (2017). A deep learning approach for generalized speech animation. ACM Transactions on Graphics (TOG), 36(4), 93.