Fine-Grained Visual Textual Alignment for Cross-Modal Retrieval Using Transformer Encoders

Published: 2021-11-30 Issue: 4 Volume: 17 Page: 1-23
ISSN: 1551-6857
Container-title: ACM Transactions on Multimedia Computing, Communications, and Applications
Language: en
Short-container-title: ACM Trans. Multimedia Comput. Commun. Appl.

Author:

Messina Nicola¹, Amato Giuseppe¹, Esuli Andrea¹, Falchi Fabrizio¹, Gennaro Claudio¹, Marchand-Maillet Stéphane²

Affiliation:

1. ISTI-CNR, Pisa, Italy

2. VIPER Group–University of Geneva, Geneva, Switzerland

Abstract

Despite the evolution of deep-learning-based visual-textual processing systems, precise multi-modal matching remains a challenging task. In this work, we tackle the task of cross-modal retrieval through image-sentence matching based on word-region alignments, using supervision only at the global image-sentence level. Specifically, we present a novel approach called Transformer Encoder Reasoning and Alignment Network (TERAN). TERAN enforces a fine-grained match between the underlying components of images and sentences (i.e., image regions and words, respectively) to preserve the informative richness of both modalities. TERAN obtains state-of-the-art results on the image retrieval task on both MS-COCO and Flickr30k datasets. Moreover, on MS-COCO, it also outperforms current approaches on the sentence retrieval task. Focusing on scalable cross-modal information retrieval, TERAN is designed to keep the visual and textual data pipelines well separated. Cross-attention links invalidate any chance to separately extract visual and textual features needed for the online search and the offline indexing steps in large-scale retrieval systems. In this respect, TERAN merges the information from the two domains only during the final alignment phase, immediately before the loss computation. We argue that the fine-grained alignments produced by TERAN pave the way toward the research for effective and efficient methods for large-scale cross-modal information retrieval. We compare the effectiveness of our approach against relevant state-of-the-art methods. On the MS-COCO 1K test set, we obtain an improvement of 5.7% and 3.5% respectively on the image and the sentence retrieval tasks on the Recall@1 metric. The code used for the experiments is publicly available on GitHub at https://github.com/mesnico/TERAN .
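The idea of scoring an image-sentence pair from word-region alignments, and of measuring retrieval quality with Recall@1, can be illustrated with a minimal sketch. It is not the TERAN implementation: the abstract does not state how word-region similarities are pooled, so the max-over-regions, sum-over-words pooling used here is an assumption, as are the function names, the toy 2-D feature vectors, and the one-to-one query/candidate ground truth. In keeping with the separated pipelines described above, region and word features are treated as already extracted independently and are combined only in the final scoring step.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def alignment_score(regions, words):
    """Global image-sentence score built from word-region similarities.

    Assumed pooling (not stated in the abstract): each word is matched to
    its most similar region, and the per-word maxima are summed.
    """
    return sum(max(cosine(r, w) for r in regions) for w in words)


def recall_at_k(score_matrix, k=1):
    """Fraction of queries whose ground-truth candidate ranks in the top k.

    score_matrix[q][c] is the score of query q against candidate c.
    Toy assumption: the correct candidate for query q has index q.
    """
    hits = 0
    for q, row in enumerate(score_matrix):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if q in ranked[:k]:
            hits += 1
    return hits / len(score_matrix)


# Hypothetical, hand-made features: each image is a list of region vectors
# and each sentence a list of word vectors, produced by separate pipelines.
images = [
    [[1.0, 0.0], [0.0, 1.0]],   # image 0: regions along both axes
    [[1.0, 0.0], [1.0, 0.1]],   # image 1: regions mostly along the first axis
]
sentences = [
    [[0.9, 0.1], [0.1, 0.9]],   # sentence 0: words spread over both axes
    [[1.0, 0.1], [1.0, 0.05]],  # sentence 1: words mostly along the first axis
]

# Image retrieval: every sentence (query) is scored against every image.
scores = [[alignment_score(img, sent) for img in images] for sent in sentences]
r_at_1 = recall_at_k(scores, k=1)
```

Because the two modalities meet only inside `alignment_score`, the region features of a collection could be computed and stored ahead of time, leaving only the query sentence to be encoded at search time.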

Funder

Intelligenza Artificiale per il Monitoraggio Visuale dei Siti Culturali

AI4Media

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Networks and Communications, Hardware and Architecture

Link

https://dl.acm.org/doi/pdf/10.1145/3451390

Reference: 65 articles.

1. Picture it in your mind: Generating high level visual representations from textual descriptions;Carrara Fabio;Information Retrieval Journal,2018

2. Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2019. Uniter: Learning universal image-text representations. arXiv:1909.11740.

Cited by 80 articles.

1. Coding self-representative and label-relaxed hashing for cross-modal retrieval;Pattern Recognition Letters;2024-09

2. Realizing Efficient On-Device Language-based Image Retrieval;ACM Transactions on Multimedia Computing, Communications, and Applications;2024-08-16

3. TPTE: Text-guided Patch Token Exploitation for Unsupervised Fine-Grained Representation Learning;ACM Transactions on Multimedia Computing, Communications, and Applications;2024-08-09

4. Hypergraph clustering based multi-label cross-modal retrieval;Journal of Visual Communication and Image Representation;2024-08

5. MH-DETR: Video Moment and Highlight Detection with Cross-modal Transformer;2024 International Joint Conference on Neural Networks (IJCNN);2024-06-30