Listen, Look, and Find the One

Published:2020-05-31 Issue:2 Volume:16 Page:1-20
ISSN:1551-6857
Container-title:ACM Transactions on Multimedia Computing, Communications, and Applications
language:en
Short-container-title:ACM Trans. Multimedia Comput. Commun. Appl.

Author:

Wang Xiao¹, Liu Wu², Chen Jun³, Wang Xiaobo², Yan Chenggang⁴, Mei Tao²

Affiliation:

1. NERCMS, School of Computer Science, Wuhan University

2. AI Research of JD.com

3. NERCMS, School of Computer Science, Wuhan University

4. Hangzhou Dianzi University

Abstract

Person search with one portrait, which attempts to search the targets in arbitrary scenes using one portrait image at a time, is an essential yet unexplored problem in the multimedia field. Existing approaches, which predominantly depend on the visual information of persons, cannot solve problems when there are variations in the person’s appearance caused by complex environments and changes in pose, makeup, and clothing. In contrast to existing methods, in this article, we propose an associative multimodality index for person search with face, body, and voice information. In the offline stage, an associative network is proposed to learn the relationships among face, body, and voice information. It can adaptively estimate the weights of each embedding to construct an appropriate representation. The multimodality index can be built by using these representations, which exploit the face and voice as long-term keys and the body appearance as a short-term connection. In the online stage, through the multimodality association in the index, we can retrieve all targets depending only on the facial features of the query portrait. Furthermore, to evaluate our multimodality search framework and facilitate related research, we construct the Cast Search in Movies with Voice (CSM-V) dataset, a large-scale benchmark that contains 127K annotated voices corresponding to tracklets from 192 movies. According to extensive experiments on the CSM-V dataset, the proposed multimodality person search framework outperforms the state-of-the-art methods.

Funder

Fundamental Research Funds for the Central Universities

National Natural Science Foundation of China

National Key R&D Program of China

Hubei Province Technological Innovation Major Project

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Networks and Communications, Hardware and Architecture

Link

https://dl.acm.org/doi/pdf/10.1145/3380549

References: 50 articles.