Abstract
Typical machine learning classification benchmarks often ignore the full structure of the input data present in real-world classification problems. Here we aim to represent such additional information as "hints" for classification. We show that under a specific, realistic conditional independence assumption, the hint information can be incorporated by late fusion. In two image classification experiments, with hints taking the form of text metadata, we demonstrate the feasibility and performance of the fusion scheme: we fuse the output of pre-trained image classifiers with the output of pre-trained text models. We show that calibration of the pre-trained models is crucial for the performance of the fused model. We compare the late fusion scheme with a mid-level fusion scheme based on support vector machines and find that the two methods tend to perform quite similarly, although the late fusion scheme incurs only negligible computational cost.
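A minimal sketch of the late fusion rule described above, assuming calibrated per-class posteriors from each modality and a known class prior; under class-conditional independence of the two modalities, p(y | image, text) is proportional to p(y | image) p(y | text) / p(y). The function and variable names below are illustrative, not taken from the paper.

import numpy as np

def fuse_posteriors(p_image, p_text, prior):
    # Late fusion under class-conditional independence of the modalities:
    #   p(y | image, text) ∝ p(y | image) * p(y | text) / p(y)
    fused = p_image * p_text / prior
    return fused / fused.sum()  # renormalize to a proper distribution

# Example: three classes, uniform prior
p_image = np.array([0.7, 0.2, 0.1])   # calibrated image-classifier posterior
p_text  = np.array([0.5, 0.4, 0.1])   # calibrated text-model posterior
prior   = np.array([1/3, 1/3, 1/3])
print(fuse_posteriors(p_image, p_text, prior))

Because the rule only multiplies and renormalizes already-computed posteriors, its computational cost is negligible compared with training a joint (mid-level) fusion model; its validity, however, hinges on the calibration of the per-modality models.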
Funder
Danish Pioneer Centre for AI
Innovationsfonden
Publisher
Public Library of Science (PLoS)