Affiliation:
1. School of Information and Computer Engineering, Northeast Forestry University, Harbin, China
2. School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Shenzhen, China
Abstract
Image‐to‐video (I2V) person re‐identification (Re‐ID) is a cross‐modality pedestrian retrieval task whose crux is reducing the large modality discrepancy between images and videos. To this end, this paper proposes predicting the subsequent video frames from a single image, so that I2V person Re‐ID can be transformed into video‐to‐video (V2V) Re‐ID. Because predicting video frames from a single image is an ill‐posed problem, this paper proposes two strategies to improve the quality of the predicted videos. First, a pose‐guided video prediction pipeline is proposed. The given single image and the pedestrian pose are encoded by an image encoder and a pose encoder, respectively; the image feature and pose feature are then concatenated as the input of the video decoder. The authors minimize the difference between the predicted video and the true video and, simultaneously, the difference between the predicted pose and the true pose. Second, a conditional adversarial training strategy is employed to generate high‐quality video frames. Specifically, the discriminator takes the source image as a condition and distinguishes whether the input frames are real or fake subsequent frames of the source image. Experimental results demonstrate that pose‐guided adversarial video prediction effectively improves the accuracy of I2V Re‐ID.
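The training objectives named in the abstract can be illustrated with a minimal sketch. The abstract does not specify the distance measure, the adversarial loss form, or any loss weights, so everything concrete below is an assumption made for illustration: a mean absolute (L1) difference, a standard non-saturating GAN loss, and the hypothetical names `generator_loss`, `discriminator_loss`, `w_video`, `w_pose`, and `w_adv`. The discriminator outputs are passed in as plain probabilities standing for D(source image, frames).

```python
import math

EPS = 1e-12  # guards log(0); an implementation detail of this sketch


def l1(a, b):
    """Mean absolute difference between two equal-length flat sequences."""
    assert len(a) == len(b) and len(a) > 0
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def generator_loss(pred_video, true_video, pred_pose, true_pose,
                   d_fake_prob, w_video=1.0, w_pose=1.0, w_adv=1.0):
    """Combine the three terms described in the abstract.

    d_fake_prob is the discriminator's probability that the predicted
    frames are real subsequent frames of the source image (the condition).
    The L1 distance, the log loss, and the weights are assumptions.
    """
    video_term = l1(pred_video, true_video)      # predicted vs. true video
    pose_term = l1(pred_pose, true_pose)         # predicted vs. true pose
    adv_term = -math.log(max(d_fake_prob, EPS))  # try to fool the discriminator
    return w_video * video_term + w_pose * pose_term + w_adv * adv_term


def discriminator_loss(d_real_prob, d_fake_prob):
    """Conditional discriminator objective (assumed binary cross-entropy).

    Both probabilities are conditioned on the same source image:
    d_real_prob for the true subsequent frames, d_fake_prob for predicted ones.
    """
    real_term = -math.log(max(d_real_prob, EPS))
    fake_term = -math.log(max(1.0 - d_fake_prob, EPS))
    return real_term + fake_term


# Usage: flattened toy "video" and "pose" values, purely illustrative.
g = generator_loss([1.0, 0.0], [0.0, 0.0], [0.0], [0.0], d_fake_prob=1.0)
d = discriminator_loss(d_real_prob=0.5, d_fake_prob=0.5)
```

In this sketch the generator loss falls to zero only when the predicted video and pose match the ground truth and the discriminator is fully fooled, while an undecided discriminator (probability 0.5 for both inputs) incurs a loss of 2 ln 2.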
Funder
National Natural Science Foundation of China
Publisher
Institution of Engineering and Technology (IET)
Subject
Electrical and Electronic Engineering, Computer Vision and Pattern Recognition, Signal Processing, Software