A multimodal approach for modeling engagement in conversation

Published: 2023-03-02 Volume: 5
ISSN:2624-9898
Container-title:Frontiers in Computer Science
Short-container-title:Front. Comput. Sci.

Author:

Pellet-Rostaing Arthur, Bertrand Roxane, Boudin Auriane, Rauzy Stéphane, Blache Philippe

Abstract

Recently, engagement has emerged as a key variable explaining the success of conversation. From the perspective of human-machine interaction, automatic assessment of engagement is crucial for better understanding the dynamics of an interaction and for designing socially aware robots. This paper presents a predictive model of the level of engagement in conversations. In particular, it shows the value of using a rich multimodal set of features, which outperforms existing models in this domain. In terms of methodology, the study is based on two audio-visual corpora of naturalistic face-to-face interactions. These resources have been enriched with various annotations of verbal and nonverbal behaviors, such as smiles, head nods, and feedback. In addition, we manually annotated gesture intensity. Based on a review of previous work in psychology and human-machine interaction, we propose a new definition of engagement that is adequate for describing this phenomenon in both natural and mediated environments. This definition has been implemented in our annotation scheme. In our work, engagement is studied at the turn level, known to be crucial for the organization of conversation. Even though there is still no consensus on the precise definition of a turn, we have developed a turn detection tool. A multimodal characterization of engagement is performed using a multi-level classification of turns. We claim that a set of multimodal cues, involving prosodic, mimo-gestural and morpho-syntactic information, is relevant for characterizing the level of engagement of speakers in conversation. Our results significantly outperform the baseline and reach state-of-the-art level (0.76 weighted F-score). The most contributing modalities are identified by testing the performance of a two-layer perceptron trained on unimodal feature sets and on combinations of two to four modalities. These results support our claim about multimodality: combining features related to the speech fundamental frequency and energy with mimo-gestural features leads to the best performance.

Publisher

Frontiers Media SA

Subject

Computer Science Applications, Computer Vision and Pattern Recognition, Human-Computer Interaction, Computer Science (miscellaneous)

Cited by 2 articles.

1. How is your feedback perceived? An experimental study of anticipated and delayed conversational feedback;JASA Express Letters;2024-07-01

2. DCTM: Dilated Convolutional Transformer Model for Multimodal Engagement Estimation in Conversation;Proceedings of the 31st ACM International Conference on Multimedia;2023-10-26