Zero-shot Cross-modal Retrieval by Assembling AutoEncoder and Generative Adversarial Network

Published:2021-03-31 Issue:1s Volume:17 Page:1-17
ISSN:1551-6857
Container-title:ACM Transactions on Multimedia Computing, Communications, and Applications
language:en
Short-container-title:ACM Trans. Multimedia Comput. Commun. Appl.

Author:

Xu Xing¹, Tian Jialin¹, Lin Kaiyi¹, Lu Huimin², Shao Jie³, Shen Heng Tao³

Affiliation:

1. University of Electronic Science and Technology of China, Chengdu, China

2. Kyushu Institute of Technology, Kitakyushu, Japan

3. University of Electronic Science and Technology of China, China and Sichuan Artificial Intelligence Research Institute, Yibin, China

Abstract

Conventional cross-modal retrieval models mainly assume that the training set and the testing set cover the same classes. This assumption limits their extensibility to zero-shot cross-modal retrieval (ZS-CMR), where the testing set consists of unseen classes that are disjoint from the seen classes in the training set. The ZS-CMR task is more challenging because of the heterogeneous distributions of different modalities and the semantic inconsistency between seen and unseen classes. A few recently proposed approaches are inspired by zero-shot learning (ZSL): they estimate the distribution underlying multimodal data with generative models and transfer knowledge from seen classes to unseen classes by leveraging class embeddings. However, directly borrowing the idea from ZSL is not fully suited to the retrieval task, since the core of the retrieval task is learning the common space. To address these issues, we propose a novel approach named Assembling AutoEncoder and Generative Adversarial Network (AAEGAN), which combines the strengths of the AutoEncoder (AE) and the Generative Adversarial Network (GAN) to jointly incorporate common latent space learning, knowledge transfer, and feature synthesis for ZS-CMR. In addition, instead of using class embeddings as the common space, AAEGAN maps all multimodal data into a learned latent space, with distribution alignment via three coupled AEs. We empirically show remarkable improvement on the ZS-CMR task and establish state-of-the-art or competitive performance on four image-text retrieval datasets.
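
The zero-shot setting described in the abstract can be illustrated with a small sketch. It is not the AAEGAN model: the encoders, the adversarial training, and the three coupled AEs are not reproduced here. It only shows, under stated assumptions, how a class-disjoint split is formed and how retrieval works once items from both modalities already sit in a common latent space. The function names, the toy vectors, the use of cosine similarity, and the average-precision metric are all assumptions made for illustration; the abstract does not specify them.

```python
import math

def split_zero_shot(labels, unseen_classes):
    """Split item indices so that test classes never appear in training."""
    unseen = set(unseen_classes)
    train_idx = [i for i, y in enumerate(labels) if y not in unseen]
    test_idx = [i for i, y in enumerate(labels) if y in unseen]
    return train_idx, test_idx

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_gallery(query, gallery):
    """Gallery indices sorted by decreasing similarity to the query."""
    return sorted(range(len(gallery)), key=lambda j: -cosine(query, gallery[j]))

def average_precision(ranked_labels, query_label):
    """Average precision of one ranked list for one query label."""
    hits, total = 0, 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

# Toy data (hypothetical): class labels, with "bird" and "fish" held out as unseen.
labels = ["cat", "dog", "bird", "fish", "bird"]
train_idx, test_idx = split_zero_shot(labels, ["bird", "fish"])

# Hypothetical latent codes for three unseen-class text items,
# and one unseen-class image query, all in the same 2-D common space.
text_codes = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.3]]
text_labels = ["bird", "fish", "bird"]
image_query, image_label = [1.0, 0.0], "bird"

order = rank_gallery(image_query, text_codes)
ap = average_precision([text_labels[j] for j in order], image_label)
```

In this toy setup the query is an image code and the gallery holds text codes; swapping the roles gives text-to-image retrieval with the same functions.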

Funder

National Natural Science Foundation of China

Fundamental Research Funds for the Central Universities

Sichuan Science and Technology Program, China

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Networks and Communications, Hardware and Architecture

Link

https://dl.acm.org/doi/pdf/10.1145/3424341

Reference: 62 articles.

1. Martin Arjovsky and Léon Bottou. 2017. Towards principled methods for training generative adversarial networks. arXiv:1701.04862. Retrieved from https://arxiv.org/abs/1701.04862.

2. Martin Arjovsky, Soumith Chintala, and Léon Bottou. 2017. Wasserstein GAN. arXiv:1701.07875. Retrieved from https://arxiv.org/abs/1701.07875.

Cited by 25 articles.

1. Degradation-removed multiscale fusion for low-light salient object detection;Pattern Recognition;2024-11

2. Salient object detection in low-light RGB-T scene via spatial-frequency cues mining;Neural Networks;2024-10

3. Lightweight object detection in low light: Pixel-wise depth refinement and TensorRT optimization;Results in Engineering;2024-09

4. NSDIE: Noise Suppressing Dark Image Enhancement Using Multiscale Retinex and Low-Rank Minimization;ACM Transactions on Multimedia Computing, Communications, and Applications;2024-03-08

5. LL-WSOD: Weakly supervised object detection in low-light;Journal of Visual Communication and Image Representation;2024-02