Hypercube Pooling for Visual Semantic Embedding

Published: 2024-08-23
ISSN:1551-6857
Container-title:ACM Transactions on Multimedia Computing, Communications, and Applications
language:en
Short-container-title:ACM Trans. Multimedia Comput. Commun. Appl.

Author:

Wang Hongbin¹, Tang Rui¹, Li Fan¹

Affiliation:

1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, China and Yunnan Provincial Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, China

Abstract

Visual Semantic Embedding (VSE) is a primary model for cross-modal retrieval, and the global feature aggregator is a crucial component of the VSE model. In recent research, the General Pooling Operator (GPO) aggregator, which aggregates by weighting features reconstructed from the local feature set, has helped related models achieve good retrieval performance. However, the reason for its effectiveness remains to be explored. To put aggregator design on a more rational footing, we analyze this reason from the perspective of feature space. For each data sample, the local feature set forms a hypercube containing abundant information about the sample, and the feature learned by GPO measures this hypercube, thereby representing the sample. The geometric structure of the hypercube implies that the set of all points within it is a convex set, so a feature learned by weighted aggregation is an interior point of the hypercube. However, using an interior point to measure the hypercube causes problems in feature representation and model optimization, and the weight computation reduces retrieval efficiency. For example, the features of a related pair may lie far apart, while those of an unrelated pair may lie close together. To measure the hypercube more clearly and alleviate these problems, we propose the Hypercube Pooling (HCP) aggregator. Specifically, HCP concatenates the Max Pooling and Min Pooling features to form the global feature. This aggregation method has multiple advantages, e.g., the learned global feature represents all hyperplanes of the hypercube, which contain critical information and the hypercube's geometric structure. Moreover, HCP normalizes the two features before concatenating them and halves the usual margin in the loss function, to avoid the gradient loss caused by differences in feature values and by the doubled dimensionality. Experimental results on the Flickr30K and MSCOCO datasets show that the HCP model achieves excellent performance with high efficiency, confirming the correctness of the spatial analysis and the effectiveness of the HCP aggregator.
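The aggregation described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `hypercube_pool` and `weighted_pool` are hypothetical, and L2 normalization is an assumption, since the abstract says only that normalization is applied before concatenation. The halved margin in the loss function is not shown.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale a vector to (approximately) unit L2 norm.

    Assumption: the abstract does not name the normalization; L2 is used
    here only for illustration.
    """
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def hypercube_pool(local_feats):
    """Hypothetical sketch of HCP-style aggregation.

    local_feats: array of shape (n_local, dim), one row per local feature.
    Returns a vector of shape (2 * dim,): the normalized Max Pooling feature
    followed by the normalized Min Pooling feature.
    """
    max_feat = local_feats.max(axis=0)  # per-dimension upper bound
    min_feat = local_feats.min(axis=0)  # per-dimension lower bound
    # Normalize each half before concatenating, as the abstract describes.
    return np.concatenate([l2_normalize(max_feat), l2_normalize(min_feat)])


def weighted_pool(local_feats, weights):
    """Convex combination of the local features (nonnegative weights).

    The result always lies inside the axis-aligned box spanned by the
    per-dimension minima and maxima, i.e. it is an interior point.
    """
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return weights @ local_feats


# Three local features in a 2-dimensional feature space.
feats = np.array([[1.0, -2.0], [3.0, 0.5], [-1.0, 4.0]])
global_feat = hypercube_pool(feats)               # shape (4,)
interior = weighted_pool(feats, [0.2, 0.3, 0.5])  # shape (2,)
```

Here `global_feat` has twice the dimensionality of a local feature, which is the doubling the abstract refers to, while `interior` stays within the per-dimension bounds that `hypercube_pool` records.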

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.1145/3689637
