Progressive Learning of a Multimodal Classifier Accounting for Different Modality Combinations
Author:
Vijay John 1, Yasutomo Kawanishi 1
Affiliation:
1. Guardian Robot Project, RIKEN, Seika-cho, Kyoto 619-0288, Japan
Abstract
In classification tasks such as face recognition and emotion recognition, multimodal information is used to achieve accurate classification. Once a multimodal classification model has been trained with a given set of modalities, it estimates the class label using the entire modality set, and the trained classifier is typically not formulated to perform classification for arbitrary subsets of those modalities. The model would therefore be more useful and portable if it could operate on any subset of modalities; we refer to this as the multimodal portability problem. Moreover, the classification accuracy of a multimodal model degrades when one or more modalities are missing; we term this the missing modality problem. This article proposes a novel deep learning model, termed KModNet, and a novel learning strategy, termed progressive learning, to simultaneously address the missing modality and multimodal portability problems. KModNet, formulated with the transformer architecture, contains multiple branches corresponding to the different k-combinations of the modality set S. It is trained using a multi-step progressive learning framework, in which the k-th step uses a k-modal model to train all branches up to the k-th combination branch. To address the missing modality problem, modalities in the training data are randomly ablated. The proposed learning framework is formulated and validated on two multimodal classification problems, audio-video-thermal person classification and audio-video emotion classification, using the Speaking Faces, RAVDESS, and SAVEE datasets. The results demonstrate that the progressive learning framework improves the robustness of multimodal classification under missing-modality conditions while remaining portable to different modality subsets.
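The abstract outlines two mechanisms: classification branches indexed by the k-combinations of the modality set, trained stage by stage (progressive learning), and random ablation of modalities in the training data to handle missing modalities. The following is a minimal sketch of those two ideas in PyTorch, using toy linear encoders and naive mean fusion in place of the paper's transformer-based KModNet; all names (KCombinationClassifier, ablate_modalities, progressive_training_loss) and hyperparameters are illustrative assumptions, not the authors' implementation.

# A minimal, self-contained sketch (PyTorch) of the two mechanisms summarized above:
# (1) one classification branch per k-combination of the modality set, and
# (2) random ablation of modalities during training.
# All names and hyperparameters here are illustrative assumptions, not the paper's code.
import itertools
import random

import torch
import torch.nn as nn


class KCombinationClassifier(nn.Module):
    """Toy multi-branch classifier with one head per non-empty modality combination."""

    def __init__(self, modalities, feat_dim=64, num_classes=7):
        super().__init__()
        self.modalities = list(modalities)
        # Tiny linear encoders stand in for the paper's transformer-based encoders.
        self.encoders = nn.ModuleDict(
            {m: nn.Linear(feat_dim, feat_dim) for m in self.modalities}
        )
        # One head per k-combination, keyed by a canonical (sorted) name.
        self.heads = nn.ModuleDict()
        for k in range(1, len(self.modalities) + 1):
            for combo in itertools.combinations(self.modalities, k):
                self.heads["_".join(sorted(combo))] = nn.Linear(feat_dim, num_classes)

    def forward(self, inputs, combo):
        # `inputs` maps modality name -> feature tensor; `combo` selects the branch.
        feats = [self.encoders[m](inputs[m]) for m in combo]
        fused = torch.stack(feats, dim=0).mean(dim=0)  # naive mean fusion for the sketch
        return self.heads["_".join(sorted(combo))](fused)


def ablate_modalities(batch, keep_prob=0.7):
    """Randomly drop modalities from a training batch, always keeping at least one."""
    kept = {m: x for m, x in batch.items() if random.random() < keep_prob}
    if not kept:
        m = random.choice(list(batch))
        kept[m] = batch[m]
    return kept


def progressive_training_loss(model, batch, labels, k, loss_fn):
    """Stage-k loss: sum over every combination branch of size <= k whose
    modalities survived the random ablation."""
    present = ablate_modalities(batch)
    total = 0.0
    for size in range(1, k + 1):
        for combo in itertools.combinations(sorted(present), size):
            total = total + loss_fn(model(present, combo), labels)
    return total


if __name__ == "__main__":
    mods = ["audio", "video", "thermal"]
    model = KCombinationClassifier(mods)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    batch = {m: torch.randn(8, 64) for m in mods}      # dummy per-modality features
    labels = torch.randint(0, 7, (8,))
    for k in range(1, len(mods) + 1):                  # progressive stages k = 1, 2, 3
        for _ in range(5):                             # a few toy iterations per stage
            opt.zero_grad()
            loss = progressive_training_loss(model, batch, labels, k, loss_fn)
            loss.backward()
            opt.step()

At inference time, the same model can be queried with whichever modality subset happens to be available by selecting the matching branch, which is the sense in which the classifier is portable to different modality combinations.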
Subject
Electrical and Electronic Engineering; Biochemistry; Instrumentation; Atomic and Molecular Physics, and Optics; Analytical Chemistry