Abstract
As an indispensable part of cross-media analysis, comprehending heterogeneous data poses challenges in visual question answering (VQA), visual captioning, and cross-modality retrieval, and bridging the semantic gap between the two modalities remains difficult. In this article, to address the cross-modality retrieval problem, we propose a cross-modal learning model with joint correlative calculation learning. First, an auto-encoder embeds the visual features by minimizing the feature-reconstruction error, and a multi-layer perceptron (MLP) models the textual feature embedding. We then design a joint loss function that optimizes both the intra- and inter-correlations of image-sentence pairs, i.e., the reconstruction loss of the visual features, the relevant similarity loss of paired samples, and the triplet relation loss between positive and negative examples. The joint loss is optimized over a batch score matrix, and all mutually mismatched pairs are used as negatives to enhance performance. Experiments on retrieval tasks demonstrate the effectiveness of the proposed method, which achieves performance comparable to the state of the art on three benchmarks: Flickr8k, Flickr30k, and MS-COCO.
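The sketch below illustrates how such a joint objective could be assembled: an auto-encoder for visual features, an MLP for textual features, and a loss that combines reconstruction error, matched-pair similarity, and a triplet term computed from a batch score matrix in which every mismatched pair serves as a negative. This is a minimal illustration under assumed dimensions and weighting factors (vis_dim, txt_dim, embed_dim, margin, alpha, beta); the class and function names are hypothetical and not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalSketch(nn.Module):
    """Hypothetical sketch: auto-encoder embeds visual features, MLP embeds text."""
    def __init__(self, vis_dim=4096, txt_dim=300, embed_dim=512):
        super().__init__()
        # Auto-encoder: encoder produces the visual embedding, decoder
        # enables the feature-reconstruction loss.
        self.vis_encoder = nn.Sequential(nn.Linear(vis_dim, embed_dim), nn.Tanh())
        self.vis_decoder = nn.Linear(embed_dim, vis_dim)
        # MLP maps textual features into the same embedding space.
        self.txt_mlp = nn.Sequential(nn.Linear(txt_dim, embed_dim), nn.Tanh())

    def forward(self, v, t):
        v_emb = self.vis_encoder(v)
        v_rec = self.vis_decoder(v_emb)
        t_emb = self.txt_mlp(t)
        return v_emb, v_rec, t_emb

def joint_loss(v, v_emb, v_rec, t_emb, margin=0.2, alpha=1.0, beta=1.0):
    """Assumed joint objective: reconstruction + paired similarity + triplet
    ranking over a batch score matrix (all mismatched pairs act as negatives)."""
    # 1) Reconstruction loss of visual features (intra-modal correlation).
    rec_loss = F.mse_loss(v_rec, v)
    # 2) Batch score matrix of cosine similarities; diagonal = matched pairs.
    v_n = F.normalize(v_emb, dim=1)
    t_n = F.normalize(t_emb, dim=1)
    scores = v_n @ t_n.t()                       # shape (B, B)
    pos = scores.diag().unsqueeze(1)             # matched image-sentence scores
    # Relevant similarity loss of paired samples: push matched pairs together.
    sim_loss = (1.0 - pos).mean()
    # 3) Triplet relation loss: every mismatched pair in the batch is a negative.
    cost_t = (margin + scores - pos).clamp(min=0)       # image vs. wrong sentence
    cost_v = (margin + scores - pos.t()).clamp(min=0)   # sentence vs. wrong image
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    trip_loss = (cost_t.masked_fill(mask, 0).mean()
                 + cost_v.masked_fill(mask, 0).mean())
    return rec_loss + alpha * sim_loss + beta * trip_loss
```

The batch score matrix makes the "all mutually mismatched pairs" idea concrete: its diagonal holds the matched image-sentence scores, and every off-diagonal entry contributes a negative term to the triplet loss in both retrieval directions.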
Funder
National Natural Science Foundation of China
National Key Research and Development Program of China
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Networks and Communications, Hardware and Architecture
Cited by
17 articles.
1. Pseudo Content Hallucination for Unpaired Image Captioning;Proceedings of the 2024 International Conference on Multimedia Retrieval;2024-05-30
2. Semantic-based Selection, Synthesis, and Supervision for Few-shot Learning;Proceedings of the 31st ACM International Conference on Multimedia;2023-10-26
3. Cross-modality Multiple Relations Learning for Knowledge-based Visual Question Answering;ACM Transactions on Multimedia Computing, Communications, and Applications;2023-10-23
4. Transformer-Based Visual Grounding with Cross-Modality Interaction;ACM Transactions on Multimedia Computing, Communications, and Applications;2023-05-30
5. NumCap: A Number-controlled Multi-caption Image Captioning Network;ACM Transactions on Multimedia Computing, Communications, and Applications;2023-02-27