1. Towards improving speech emotion recognition for in-vehicle agents: Preliminary results of incorporating sentiment analysis by using early and late fusion methods;Li,2018
2. Deep learning-based late fusion of multimodal information for emotion classification of music video;Pandeya;Multimedia Tools Appl.,2021
3. UNITER: Universal image-text representation learning;Chen,2020
4. Seeing out of the box: End-to-end pre-training for vision-language representation learning;Huang,2021
5. ViLT: Vision-and-language transformer without convolution or region supervision;Kim,2021