ViTCN: Hybrid Vision Transformer with Temporal Convolution for Multi-Emotion Recognition

Published:2024-03-27 Issue:1 Volume:17
ISSN:1875-6883
Container-title:International Journal of Computational Intelligence Systems
language:en
Short-container-title:Int J Comput Intell Syst

Author:

Zakieldin Kamal, Khattab Radwa, Ibrahim Ehab, Arafat Esraa, Ahmed Nehal, Hemayed Elsayed

Abstract

In Talentino, HR-Solution analyzes candidates’ profiles and conducts interviews. Artificial intelligence is used to analyze the video interviews and recognize the candidate’s expressions during the interview. This paper introduces ViTCN, a combination of Vision Transformer (ViT) and Temporal Convolution Network (TCN), as a novel architecture for detecting and interpreting human emotions and expressions. Human expression recognition contributes widely to the development of human-computer interaction, and a machine’s understanding of human emotions in the real world will contribute considerably to life in the future. Earlier emotion recognition identified emotions from a single frame (image-based) without considering the sequence of frames. The proposed architecture uses a series of frames to identify the true emotional expression across the combined sequence over time. The study demonstrates the potential of this method as a viable option for identifying facial expressions during interviews, which could inform hiring decisions. For situations with limited computational resources, the proposed architecture offers a powerful solution for interpreting human facial expressions with a single model and a single GPU. The proposed architecture was validated on the widely used controlled data sets CK+, MMI, and the challenging DAiSEE data set, as well as on the challenging in-the-wild data sets DFEW and AFFWild2. The experimental results demonstrated that the proposed method outperforms existing methods on DFEW, AFFWild2, MMI, and DAiSEE. It exceeded the accuracy of other top-performing solutions by 4.29% on DFEW, 14.41% on AFFWild2, and 7.74% on MMI. It also achieved comparable results on the CK+ data set.

Funder

TIEC center, ITIDA

Publisher

Springer Science and Business Media LLC

Link

https://link.springer.com/content/pdf/10.1007/s44196-024-00436-5.pdf

Cited by 2 articles.

1. VT-3DCapsNet: Visual tempos 3D-Capsule network for video-based facial expression recognition;PLOS ONE;2024-08-23

2. Emotion recognition to support personalized therapy in the elderly: an exploratory study based on CNNs;Research on Biomedical Engineering;2024-07-01