Affiliation:
1. Faculty of Artificial Intelligence in Education, Central China Normal University, Wuhan 430079, China
Abstract
Facial expression recognition has broad application prospects in many domains. Because facial expressions are complex and highly variable, their recognition remains a challenging research topic. This paper proposes a Vision Transformer expression recognition method based on hybrid local attention (HLA-ViT). The network adopts a dual-stream structure: one stream extracts hybrid local features while the other extracts global contextual features, and together they form a global–local fusion attention. The hybrid local attention module is designed to improve the network's robustness to face occlusion and head-pose variations. A convolutional neural network is combined with the hybrid local attention module to obtain feature maps that highlight locally salient information, while the ViT captures robust features from the global context of the visual token sequence. Finally, a decision-level fusion mechanism combines the expression features carrying locally salient information, adding complementary cues that improve recognition performance and robustness against interference factors such as occlusion and head-pose changes in natural scenes. Extensive experiments demonstrate that our HLA-ViT achieves excellent performance, with 90.45% accuracy on RAF-DB, 90.13% on FERPlus, and 65.07% on AffectNet.
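The abstract describes a dual-stream architecture with decision-level fusion. The following is a minimal structural sketch of that idea in PyTorch, assuming a small CNN backbone, a simple channel/spatial gating module standing in for the hybrid local attention, and a plain Transformer encoder standing in for the ViT; all class and variable names (HybridLocalAttention, HLAViTSketch, etc.) are hypothetical illustrations, not the authors' released code.

```python
# Sketch of a dual-stream local/global model with decision-level fusion.
# Assumed hyperparameters (dim, depth, patch size) are illustrative only.
import torch
import torch.nn as nn

class HybridLocalAttention(nn.Module):
    """Hypothetical local attention: re-weights CNN feature maps channel- and spatial-wise."""
    def __init__(self, channels):
        super().__init__()
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.spatial_gate = nn.Sequential(nn.Conv2d(channels, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, x):
        x = x * self.channel_gate(x)     # emphasize informative channels
        return x * self.spatial_gate(x)  # emphasize salient facial regions

class HLAViTSketch(nn.Module):
    """Local stream: CNN + hybrid local attention. Global stream: Transformer encoder
    over image patches. The two streams are fused at decision level (averaged logits)."""
    def __init__(self, num_classes=7, dim=256):
        super().__init__()
        # Local stream: small CNN backbone followed by the attention module.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.ReLU())
        self.hla = HybridLocalAttention(dim)
        self.local_head = nn.Linear(dim, num_classes)
        # Global stream: patch embedding + standard Transformer encoder (ViT stand-in).
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.vit = nn.TransformerEncoder(encoder_layer, num_layers=4)
        self.global_head = nn.Linear(dim, num_classes)

    def forward(self, img):                                         # img: (B, 3, 224, 224)
        local = self.hla(self.cnn(img)).mean(dim=(2, 3))            # (B, dim)
        tokens = self.patch_embed(img).flatten(2).transpose(1, 2)   # (B, N, dim)
        global_feat = self.vit(tokens).mean(dim=1)                  # (B, dim)
        # Decision-level fusion: combine the two streams' class predictions.
        return 0.5 * (self.local_head(local) + self.global_head(global_feat))

logits = HLAViTSketch()(torch.randn(2, 3, 224, 224))  # -> shape (2, 7)
```

The decision-level fusion is shown here as a simple average of the two streams' logits; a learned weighting or concatenation-plus-classifier would fit the same dual-stream layout.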
Funder
General Project for Education of National Social Science Fund