Affiliation:
1. Faculty of Electrical Engineering and Computer Science, Ningbo University, Ningbo 315211, China
Abstract
Gaze estimation, which seeks to reveal where a person is looking, provides a crucial clue for understanding human intentions and behaviors. Recently, the Vision Transformer has achieved promising results in gaze estimation. However, dividing facial images into patches compromises the integrity of the image structure, which limits inference performance. To tackle this challenge, we present Gaze-Swin, an end-to-end gaze estimation model built on a dual-branch CNN-Transformer architecture. In Gaze-Swin, we adopt the Swin Transformer as the backbone network due to its effectiveness in handling long-range dependencies and extracting global features. Additionally, we incorporate a convolutional neural network as an auxiliary branch to capture local facial features and intricate texture details. To further enhance robustness and address overfitting in gaze estimation, we replace the original self-attention in the Transformer branch with DropKey-assisted attention (DA-Attention). In particular, DA-Attention treats the keys in each Transformer block as dropout units and employs a decaying dropout rate schedule to preserve crucial gaze representations in deeper layers. Comprehensive experiments on three benchmark datasets demonstrate the superior performance of our method in comparison to the state of the art.
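To make the DA-Attention idea concrete, the following is a minimal PyTorch-style sketch of key-level dropout applied before the softmax, together with a linearly decaying drop rate over layer depth. The class name `DropKeyAttention`, the helper `decayed_rates`, and the linear decay schedule are illustrative assumptions based on the abstract, not the authors' released code.

```python
import torch
import torch.nn as nn


class DropKeyAttention(nn.Module):
    """Multi-head self-attention with key-level dropout (DropKey-style).

    Instead of dropping attention weights after the softmax, a Bernoulli mask
    is applied to the attention logits before the softmax, which randomly
    hides keys from every query during training (illustrative sketch).
    """

    def __init__(self, dim, num_heads=8, drop_key_rate=0.1):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.drop_key_rate = drop_key_rate

    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each: (B, heads, N, head_dim)

        attn = (q @ k.transpose(-2, -1)) * self.scale  # raw attention logits
        if self.training and self.drop_key_rate > 0:
            # DropKey: mask logits (keys) before softmax rather than weights after it
            mask = torch.bernoulli(torch.full_like(attn, self.drop_key_rate))
            attn = attn.masked_fill(mask.bool(), float("-inf"))

        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


def decayed_rates(base_rate=0.1, num_layers=4):
    """Linearly decaying drop rates so deeper layers keep more keys (assumed schedule)."""
    return [base_rate * (1 - i / max(num_layers - 1, 1)) for i in range(num_layers)]
```

A stack of Transformer blocks would then instantiate one `DropKeyAttention` per layer with the corresponding rate from `decayed_rates`, so regularization is strongest in shallow layers and weakest where the gaze representation is refined.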
Funder
Natural Science Foundation of Zhejiang Province