Gaze Estimation Based on Convolutional Structure and Sliding Window-Based Attention Mechanism
Author:
Li Yujie 1,2, Chen Jiahui 1, Ma Jiaxin 1, Wang Xiwen 1, Zhang Wei 1,2
Affiliation:
1. School of Artificial Intelligence, Guilin University of Electronic Technology, Guilin 541004, China
2. Guangxi Colleges and Universities Key Laboratory of AI Algorithm Engineering, Guilin 541004, China
Abstract
The direction of human gaze is an important indicator of human behavior, reflecting the level of attention and cognitive state toward visual stimuli in the environment. Convolutional neural networks have achieved good performance in gaze estimation tasks, but their limited global modeling capability makes it difficult to further improve prediction accuracy. In recent years, Transformer models have been introduced for gaze estimation and have achieved state-of-the-art performance. However, their slicing-and-mapping mechanism for processing local image patches can compromise local spatial information, and the single downsampling rate and fixed-size tokens are not well suited to the multiscale feature learning required in gaze estimation. To overcome these limitations, this study introduces the Swin Transformer for gaze estimation and designs two network architectures: a pure Swin Transformer gaze estimation model (SwinT-GE) and a hybrid model that combines convolutional structures with SwinT-GE (Res-Swin-GE). SwinT-GE uses the tiny version of the Swin Transformer for gaze estimation; Res-Swin-GE replaces the slicing-and-mapping mechanism of SwinT-GE with convolutional structures. Experimental results demonstrate that Res-Swin-GE significantly outperforms SwinT-GE, exhibiting strong competitiveness on the MPIIFaceGaze dataset and achieving a 7.5% performance improvement over existing state-of-the-art methods on the EyeDiap dataset.
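The abstract describes the two architectures only at a high level. Below is a minimal sketch of the SwinT-GE idea (a Swin-Tiny backbone with a two-dimensional gaze regression head), assuming a PyTorch/timm implementation; the model name swin_tiny_patch4_window7_224, the (pitch, yaw) output parameterization, and the L1 loss are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of SwinT-GE: Swin-Tiny backbone + 2D gaze regression head.
# Assumes PyTorch and timm; head and loss choices are illustrative, not from the paper.
import torch
import timm

# Swin-Tiny with its classification head replaced by a 2-unit regression head
# predicting gaze angles (e.g., pitch and yaw in radians).
model = timm.create_model("swin_tiny_patch4_window7_224", pretrained=False, num_classes=2)

images = torch.randn(8, 3, 224, 224)   # a batch of face crops
gaze_gt = torch.zeros(8, 2)             # ground-truth (pitch, yaw) angles

pred = model(images)                    # (8, 2) predicted gaze angles
loss = torch.nn.functional.l1_loss(pred, gaze_gt)
loss.backward()
```

In the Res-Swin-GE variant described above, the Swin Transformer's patch slicing-and-mapping stage would instead be fed by a convolutional stem so that local spatial information is preserved before windowed attention; the exact stem design follows the paper and is not reproduced here.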
Funder
Guangxi Science and Technology Major Project; Guangxi Natural Science Foundation; Key Laboratory of Cognitive Radio and Information Processing, Ministry of Education
Subject
Electrical and Electronic Engineering; Biochemistry; Instrumentation; Atomic and Molecular Physics, and Optics; Analytical Chemistry