Authors:
Long Zhihui, Deng Huan, Yang Zhenguo, Liu Wenyin
Publisher:
Springer Nature Singapore
References: 29 articles.
1. Cheng, M., et al.: Vista: vision and scene text aggregation for cross-modal retrieval. In: CoRR abs/2203.16778 (2022)
2. Degottex, G., Kane, J., Drugman, T., Raitio, T., Scherer, S.: COVAREP - a collaborative voice analysis repository for speech technologies. In: ICASSP, pp. 960–964 (2014)
3. Deng, H., Kang, P., Yang, Z., Hao, T., Li, Q., Liu, W.: Dense fusion network with multimodal residual for sentiment classification. In: ICME, pp. 1–6 (2021)
4. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL-HLT, pp. 4171–4186 (2019)
5. Guo, M., et al.: Attention mechanisms in computer vision: a survey. Comput. Vis. Media 8(3), 331–368 (2022)