Impact of Binary-Valued Representation on the Performance of Cross-Modal Retrieval System
Published: 2022-12-01
Volume: 7, Issue: 6, Pages: 964-981
ISSN: 2455-7749
Container-title: International Journal of Mathematical, Engineering and Management Sciences
Short-container-title: Int. j. math. eng. manag. sci.
Language: en
Authors:
Bhatt Nikita (1), Ganatra Amit (2), Bhatt Nirav (3), Prajapati Purvi (3), Rahevar Mrugendra (1), Parmar Martin (1)
Affiliations:
1. U & P U. Patel Department of Computer Engineering, CSPIT, CHARUSAT, Gujarat, India.
2. Devang Patel Institute of Advance Technology and Research, CHARUSAT, Gujarat, India.
3. Smt. Kundanben Dinsha Patel Department of Information Technology, CSPIT, CHARUSAT, Gujarat, India.
Abstract
The tremendous proliferation of multi-modal data and the flexible needs of users have drawn attention to the field of Cross-Modal Retrieval (CMR), which can perform image-sketch matching, text-image matching, audio-video matching, and near infrared-visual image matching. Such retrieval is useful in many applications, such as criminal investigation, recommendation systems, and person re-identification. The real challenge in CMR is to preserve semantic similarities between the various modalities of data. To preserve these similarities, existing deep learning-based approaches use pairwise labels and generate binary-valued representations, which provide fast retrieval with low storage requirements; however, the relative similarity between heterogeneous data is ignored. The objective of this work is therefore to reduce the modality gap by preserving relative semantic similarities among modalities. A model named "Deep Cross-Modal Retrieval (DCMR)" is proposed, which takes triplet labels as input and generates binary-valued representations. The triplet labels place semantically similar data points near each other and dissimilar points far apart in the vector space. Extensive experiments are performed and the results are compared with deep learning-based approaches, showing that DCMR improves mean average precision (mAP) by 2% to 3% for Image→Text retrieval and by 2% to 5% for Text→Image retrieval on the MSCOCO, XMedia, and NUS-WIDE datasets. Thus, binary-valued representations generated from triplet labels preserve relative semantic similarities better than those generated from pairwise labels.
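The core mechanism summarized above is a triplet ranking objective over learned hash codes: an anchor from one modality is pulled toward a semantically matching item from the other modality and pushed at least a margin away from a mismatched one, and the continuous codes are binarized for retrieval. The sketch below is a minimal, hypothetical PyTorch illustration of that general idea, not the authors' DCMR implementation; the network architecture, feature dimensions, code length, and margin value are all assumptions made for the example.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalHashNet(nn.Module):
    # Maps pre-extracted image and text features into a shared K-bit code space.
    # tanh is a continuous relaxation of {-1, +1}; sign() binarizes at test time.
    def __init__(self, img_dim=4096, txt_dim=300, code_bits=64):
        super().__init__()
        self.img_net = nn.Sequential(nn.Linear(img_dim, 512), nn.ReLU(),
                                     nn.Linear(512, code_bits))
        self.txt_net = nn.Sequential(nn.Linear(txt_dim, 512), nn.ReLU(),
                                     nn.Linear(512, code_bits))

    def forward(self, img_feat, txt_feat):
        return torch.tanh(self.img_net(img_feat)), torch.tanh(self.txt_net(txt_feat))

def triplet_hash_loss(anchor, positive, negative, margin=1.0):
    # Pulls the matching cross-modal pair together and pushes the mismatched
    # pair at least `margin` farther away than the matching one.
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()

# Toy usage: image anchors with matching (positive) and non-matching (negative) texts.
model = CrossModalHashNet()
img = torch.randn(8, 4096)
txt_pos, txt_neg = torch.randn(8, 300), torch.randn(8, 300)
img_code, pos_code = model(img, txt_pos)
_, neg_code = model(img, txt_neg)
loss = triplet_hash_loss(img_code, pos_code, neg_code)
loss.backward()

# At retrieval time, binarize: sign(code) lies in {-1, +1}^K.
binary_code = torch.sign(img_code.detach())

Binarized codes are compared by Hamming distance, which is what gives hashing-based retrieval its speed and storage advantage over real-valued embeddings.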
Publisher
Ram Arti Publishers
Subject
General Engineering; General Business, Management and Accounting; General Mathematics; General Computer Science
Cited by
1 article.