End-to-end Multi-modal Low-resourced Speech Keywords Recognition Using Sequential Conv2D Nets

Authors:

Pooja Gambhir¹, Amita Dev¹, Poonam Bansal¹, Deepak Kumar Sharma¹

Affiliation:

1. Department of Information Technology, Indira Gandhi Delhi Technical University for Women, Kashmere Gate, New Delhi, India

Abstract

Advanced neural networks are widely used to automatically recognize multi-modal conversational speech, with significant improvements in accuracy. In particular, Convolutional Neural Networks (CNNs) have recently achieved state-of-the-art performance in Automatic Speech Recognition (ASR), most notably for English; the Hindi language, however, has not been explored or examined as thoroughly in ASR systems. This article presents a three-layered two-dimensional Sequential Convolutional neural architecture. The Sequential Conv2D model is an end-to-end system that can simultaneously exploit the spectral and temporal structure of the speech signal. The network was trained and tested on different cepstral features: Mel-frequency cepstral features, Gammatone Filter Cepstral Coefficients, Bark-Frequency Cepstral Coefficients, and spectrogram features of the speech signal. The experiments were performed on two low-resourced speech command datasets: Hindi, with 27,145 speech keywords developed by TIFR, and 23,664 one-second utterances from the Google TensorFlow and AIY English Speech Commands dataset. The results show that the convolutional layers trained on spectrograms perform best for English speech, reaching 91.60% accuracy compared to that achieved with the other cepstral feature sets. For Hindi audio words, however, the model achieved an accuracy of 69.65%, with Bark-frequency cepstral coefficient features outperforming spectrogram features.
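The abstract describes stacking three Conv2D layers over a 2-D time-frequency input so that one network jointly captures spectral and temporal structure. The paper's exact kernel sizes, channel counts, and pooling scheme are not given here, so the following is only a minimal sketch of the idea in plain NumPy: a toy "spectrogram" (frequency bands × frames) passed through three stacked 3×3 convolution + ReLU stages. All dimensions are illustrative assumptions, not the authors' configuration.

```python
import numpy as np

def conv2d(x, k):
    """Valid 2-D cross-correlation of a single-channel input with one kernel."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

# Toy spectrogram: 40 frequency bands x 98 frames
# (roughly a 1-second utterance at a 10 ms hop, as in the speech-command data)
rng = np.random.default_rng(0)
spec = rng.standard_normal((40, 98))

# Three stacked conv + ReLU stages, mirroring the three-layer Sequential Conv2D idea:
# each 3x3 kernel slides over BOTH axes, so frequency (spectral) and time (temporal)
# patterns are learned jointly rather than from a flattened 1-D feature vector.
x = spec
for _ in range(3):
    kernel = rng.standard_normal((3, 3)) * 0.1  # random stand-in for learned weights
    x = np.maximum(conv2d(x, kernel), 0.0)      # ReLU non-linearity

print(x.shape)  # (34, 92): each valid 3x3 conv trims 2 from both axes
```

In a real implementation these stages would be learned layers (e.g. a Keras `Sequential` model of `Conv2D` layers) followed by pooling and a softmax over the keyword classes; the sketch only shows why a 2-D convolution sees spectral and temporal context at once.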

Publisher

Association for Computing Machinery (ACM)

Subject

General Computer Science

References (44 articles; first 5 shown):

1. P. Gambhir. 2019. Review of Chatbot design and trends. In Proceedings of the Conference on Artificial Intelligence and Speech Technology.

2. M. Chellapriyadharshini, A. Toffy, and V. Ramasubramanian. 2018. Semi-supervised and active-learning scenarios: Efficient acoustic model refinement for a low resource Indian language. arXiv:1810.06635.

3. M. Shamsfard. 2019. Challenges and opportunities in processing low resource languages: A study on Persian. In International Conference Language Technologies for All (LT4All).

4. Acoustic Modeling in Speech Recognition: A Systematic Review

5. Poonam Bansal et al. 2015. The State-of-the-art of feature extraction techniques: An overview. In Proceedings of the Computer Society of India (CSI’15), Speech and Language Processing for Human-Machine Communications, Advances in Intelligent Systems and Computing. Springer, 195–207.

Cited by 1 article.
