Listen, Denoise, Action! Audio-Driven Motion Synthesis with Diffusion Models

Published: 2023-07-26 Issue: 4 Volume: 42 Page: 1-20
ISSN: 0730-0301
Container-title: ACM Transactions on Graphics
Language: en
Short-container-title: ACM Trans. Graph.

Author:

Alexanderson Simon¹², Nagy Rajmund¹, Beskow Jonas¹, Henter Gustav Eje¹²

Affiliation:

1. KTH Royal Institute of Technology, Stockholm, Sweden

2. Motorica AB, Stockholm, Sweden

Abstract

Diffusion models have experienced a surge of interest as highly expressive yet efficiently trainable probabilistic models. We show that these models are an excellent fit for synthesising human motion that co-occurs with audio, e.g., dancing and co-speech gesticulation, since motion is complex and highly ambiguous given audio, calling for a probabilistic description. Specifically, we adapt the DiffWave architecture to model 3D pose sequences, putting Conformers in place of dilated convolutions for improved modelling power. We also demonstrate control over motion style, using classifier-free guidance to adjust the strength of the stylistic expression. Experiments on gesture and dance generation confirm that the proposed method achieves top-of-the-line motion quality, with distinctive styles whose expression can be made more or less pronounced. We also synthesise path-driven locomotion using the same model architecture. Finally, we generalise the guidance procedure to obtain product-of-expert ensembles of diffusion models and demonstrate how these may be used for, e.g., style interpolation, a contribution we believe is of independent interest.

Funder

Knut och Alice Wallenbergs Stiftelse

Digital Futures

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Graphics and Computer-Aided Design

Link

https://dl.acm.org/doi/pdf/10.1145/3592458
