Affiliation:
1. University of South Carolina, USA
2. University of Washington Bothell, USA
Abstract
Recognizing facial actions is challenging, especially when they are accompanied by speech. Instead of employing information solely from the visual channel, this work exploits information from both the visual and audio channels to recognize speech-related facial action units (AUs). Two feature-level fusion methods are proposed. The first is based on handcrafted visual features; the second uses visual features learned by a deep convolutional neural network (CNN). In both methods, features are extracted independently from the visual and audio channels and then aligned to handle the difference in time scales and the time shift between the two signals. The temporally aligned features are integrated via feature-level fusion for AU recognition. Experimental results on a new audiovisual AU-coded dataset demonstrate that both fusion methods outperform their visual-only counterparts in recognizing speech-related AUs. The improvement is even more pronounced when the facial images are occluded, since occlusion does not affect the audio channel.
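The article does not include an implementation, but the alignment-and-fusion step described above can be illustrated with a minimal sketch: per-window audio features (e.g., MFCCs) are interpolated to the video frame rate, optionally shifted to compensate for audio-visual lag, and concatenated with per-frame visual features. All function names, the use of linear interpolation, and the fixed-shift compensation are illustrative assumptions, not the authors' method.

```python
import numpy as np

def align_to_video_frames(audio_feats, n_video_frames):
    """Resample per-window audio features to the video frame rate by
    linear interpolation along time, so each video frame gets one
    corresponding audio feature vector."""
    n_audio, dim = audio_feats.shape
    src_t = np.linspace(0.0, 1.0, n_audio)
    dst_t = np.linspace(0.0, 1.0, n_video_frames)
    aligned = np.empty((n_video_frames, dim))
    for d in range(dim):
        aligned[:, d] = np.interp(dst_t, src_t, audio_feats[:, d])
    return aligned

def fuse_features(visual_feats, audio_feats, audio_shift_frames=0):
    """Feature-level fusion: align the audio features to the visual
    frames, shift them by a fixed offset to compensate for any
    audio-visual time shift, then concatenate per frame."""
    n_frames = visual_feats.shape[0]
    aligned_audio = align_to_video_frames(audio_feats, n_frames)
    if audio_shift_frames:
        aligned_audio = np.roll(aligned_audio, audio_shift_frames, axis=0)
    return np.concatenate([visual_feats, aligned_audio], axis=1)

# Toy usage: 90 video frames of 128-D visual features, 300 audio windows of 13 MFCCs.
visual = np.random.randn(90, 128)
audio = np.random.randn(300, 13)
fused = fuse_features(visual, audio, audio_shift_frames=2)
print(fused.shape)  # (90, 141): one fused vector per frame, fed to an AU classifier
```

The fused per-frame vectors would then be passed to any frame-level AU classifier; the choice of classifier and of the shift value is left open here, as the abstract does not specify them.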