Abstract
The use of machine learning in automatic speaker identification and localization systems has recently seen significant advances. However, this progress comes at the cost of more complex models, heavier computation, more microphone arrays, and more training data. In this work, we therefore propose a new end-to-end identification and localization model based on a simple fully connected deep neural network (FC-DNN) and just two input microphones. By exploiting a new data augmentation approach, this model can jointly or separately localize and identify an active speaker with high accuracy in single- and multi-speaker scenarios. To this end, we propose a novel Mel Frequency Cepstral Coefficients (MFCC) based feature called Shuffled MFCC (SHMFCC) and its variant, Difference Shuffled MFCC (DSHMFCC). To test our approach, we analyzed the performance of the proposed identification and localization model with the new features under different noise and reverberation conditions for single- and multi-speaker scenarios. The results show that our approach achieves high accuracy in these scenarios, outperforms the baseline and conventional methods, and remains robust even with small training sets.
Publisher
Springer Science and Business Media LLC
Subject
Computer Vision and Pattern Recognition, Linguistics and Language, Human-Computer Interaction, Language and Linguistics, Software
Cited by
7 articles.