Pairwise CNN-Transformer Features for Human–Object Interaction Detection
Author:
Quan Hutuo 1,2, Lai Huicheng 1,2, Gao Guxue 1,2, Ma Jun 1,2, Li Junkai 1,2, Chen Dongji 1,2
Affiliations:
1. College of Computer Science and Technology, Xinjiang University, Urumqi 830017, China
2. Xinjiang Key Laboratory of Signal Detection and Processing, Xinjiang University, Urumqi 830017, China
Abstract
Human–object interaction (HOI) detection aims to localize and recognize the relationships between humans and objects, helping computers understand high-level semantics. In HOI detection, two-stage and one-stage methods have distinct advantages and disadvantages. Two-stage methods can obtain high-quality human–object pair features from an object detector but lack contextual information. One-stage transformer-based methods can model global features well but cannot benefit from an object detector. An ideal model should combine the advantages of both. We therefore propose the Pairwise Convolutional neural network (CNN)-Transformer (PCT), a simple and effective two-stage method that both fully utilizes the object detector and retains rich contextual information. Specifically, we obtain pairwise CNN features from the CNN backbone and fuse them with pairwise transformer features to enhance the pairwise representations. The enhanced representations outperform either CNN or transformer features used alone, and the global features of the transformer provide valuable contextual cues. We also fairly compare the performance of pairwise CNN and pairwise transformer features in HOI detection; the experimental results show that the previously neglected CNN features still retain a significant edge. Compared to state-of-the-art methods, our model achieves competitive results on the HICO-DET and V-COCO datasets.
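The fusion step described in the abstract can be illustrated with a minimal PyTorch sketch. The module names, feature dimensions, and fusion operator (linear projection followed by elementwise sum) below are illustrative assumptions, not the authors' exact PCT design; only the overall idea of combining pairwise CNN features with pairwise transformer features for interaction classification follows the abstract.

```python
# Minimal sketch of pairwise CNN-transformer feature fusion for HOI detection.
# All module names, dimensions, and the fusion operator are assumptions.
import torch
import torch.nn as nn

class PairwiseFusionHead(nn.Module):
    """Fuse pairwise CNN features with pairwise transformer features."""
    def __init__(self, cnn_dim=2048, tr_dim=256, hidden=256, num_verbs=117):
        super().__init__()
        self.cnn_proj = nn.Linear(cnn_dim, hidden)    # project ROI-pooled CNN pair features
        self.tr_proj = nn.Linear(tr_dim, hidden)      # project transformer pair features
        self.verb_cls = nn.Linear(hidden, num_verbs)  # interaction (verb) classifier

    def forward(self, cnn_pair_feats, tr_pair_feats):
        # cnn_pair_feats: (num_pairs, cnn_dim), e.g. pooled human/object ROI features
        # tr_pair_feats:  (num_pairs, tr_dim), e.g. decoder embeddings per pair
        fused = torch.relu(self.cnn_proj(cnn_pair_feats) + self.tr_proj(tr_pair_feats))
        return self.verb_cls(fused)  # per-pair interaction logits

# Usage with random tensors standing in for detector outputs:
head = PairwiseFusionHead()
logits = head(torch.randn(8, 2048), torch.randn(8, 256))
print(logits.shape)  # torch.Size([8, 117]); HICO-DET defines 117 verb classes
```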
Funder
National Natural Science Foundation of China
Cited by 2 articles.