Learning Relationships between Text, Audio, and Video via Deep Canonical Correlation for Multimodal Language Analysis

Published: 2020-04-03 Issue: 05 Volume: 34 Page: 8992-8999
ISSN: 2374-3468
Container-title: Proceedings of the AAAI Conference on Artificial Intelligence
Short-container-title: AAAI

Authors:

Sun Zhongkai, Sarma Prathusha, Sethares William, Liang Yingyu

Abstract

Multimodal language analysis often considers relationships between features based on text and those based on acoustical and visual properties. Text features typically outperform non-text features in sentiment analysis or emotion recognition tasks, in part because the text features are derived from advanced language models or word embeddings trained on massive data sources, while audio and video features are human-engineered and comparatively underdeveloped. Given that the text, audio, and video describe the same utterance in different ways, we hypothesize that multimodal sentiment analysis and emotion recognition can be improved by learning (hidden) correlations between features extracted from the outer product of text and audio (we call this text-based audio) and analogous text-based video. This paper proposes a novel model, the Interaction Canonical Correlation Network (ICCN), to learn such multimodal embeddings. ICCN learns correlations between all three modes via deep canonical correlation analysis (DCCA), and the proposed embeddings are then tested on several benchmark datasets and compared against other state-of-the-art multimodal embedding algorithms. Empirical results and ablation studies confirm the effectiveness of ICCN in capturing useful information from all three views.
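
The two ingredients named in the abstract, the outer product of text features with audio or video features and canonical correlation between the resulting views, can be illustrated with a minimal NumPy sketch. This is not the authors' ICCN implementation: it uses plain linear CCA where the paper uses deep CCA, the inputs are random placeholders, and the feature dimensions, the regularization constant `reg`, and the function names `outer_product_features`, `inv_sqrt`, and `canonical_correlations` are all assumptions made for illustration.

```python
import numpy as np


def outer_product_features(text, other):
    """Flattened per-utterance outer product: (n, d_t), (n, d_o) -> (n, d_t * d_o)."""
    return np.einsum("ni,nj->nij", text, other).reshape(text.shape[0], -1)


def inv_sqrt(mat):
    """Inverse square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(mat)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T


def canonical_correlations(x, y, reg=1e-3):
    """Canonical correlations between views x (n, p) and y (n, q) via linear CCA.

    `reg` is a hypothetical ridge term that keeps the covariances invertible.
    """
    n = x.shape[0]
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    sxx = x.T @ x / (n - 1) + reg * np.eye(x.shape[1])
    syy = y.T @ y / (n - 1) + reg * np.eye(y.shape[1])
    sxy = x.T @ y / (n - 1)
    # Singular values of the whitened cross-covariance are the canonical correlations.
    t = inv_sqrt(sxx) @ sxy @ inv_sqrt(syy)
    return np.linalg.svd(t, compute_uv=False)


# Random placeholder features for 500 utterances (dimensions are illustrative only).
rng = np.random.default_rng(0)
n = 500
text = rng.normal(size=(n, 4))
audio = rng.normal(size=(n, 3))
video = rng.normal(size=(n, 3))

text_audio = outer_product_features(text, audio)  # "text-based audio" view
text_video = outer_product_features(text, video)  # "text-based video" view
corrs = canonical_correlations(text_audio, text_video)
```

Here `corrs` holds one canonical correlation per dimension of the smaller view, sorted in descending order. In a deep CCA setting, each view would first pass through a learned nonlinear network, and the sum of these correlations would typically serve as the training objective.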

Publisher

Association for the Advancement of Artificial Intelligence (AAAI)

Subject

General Medicine

Cited by 146 articles.

1. Triple disentangled representation learning for multimodal affective analysis; Information Fusion; 2025-02

2. Frame-level nonverbal feature enhancement based sentiment analysis; Expert Systems with Applications; 2024-12

3. Extracting method for fine-grained emotional features in videos; Knowledge-Based Systems; 2024-10

4. Hierarchical denoising representation disentanglement and dual-channel cross-modal-context interaction for multimodal sentiment analysis; Expert Systems with Applications; 2024-10

5. MIT-FRNet: Modality-invariant temporal representation learning-based feature reconstruction network for missing modalities; Expert Systems with Applications; 2024-09