Affiliation:
1. Department of Electrical Engineering, Pohang University of Science and Technology, Pohang 37673, Republic of Korea
Abstract
The noise robustness of voice activity detection (VAD), the task of identifying the human speech portions of a continuous audio signal, is important for downstream applications such as keyword spotting and automatic speech recognition. Although various aspects of VAD have recently been studied, a proper training strategy for VAD has not received sufficient attention. This paper therefore proposes, for the first time, a training strategy for VAD based on supervised contrastive learning, used in conjunction with audio-specific data augmentation methods. The proposed supervised contrastive learning-based VAD (SCLVAD) method is trained on two common speech datasets and evaluated on a third. The experimental results show that SCLVAD is particularly effective at improving VAD performance in noisy environments. In clean environments, data augmentation improves VAD accuracy by 8.0 to 8.6%, but supervised contrastive learning yields no additional improvement. In noisy environments, by contrast, SCLVAD improves VAD accuracy by 2.9% and 4.6% for "speech with noise" and "speech with music", respectively, with only a negligible increase in processing overhead during training.
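The paper itself provides no code in this excerpt, but the loss it builds on is the standard supervised contrastive (SupCon) formulation of Khosla et al. (2020), applied here to per-segment embeddings with binary speech/non-speech labels. The sketch below is a minimal pure-Python illustration of that loss; the function name, the toy 2-D embeddings, and the temperature value are illustrative assumptions, not the authors' implementation.

```python
import math

def supcon_loss(embeddings, labels, tau=0.1):
    """Supervised contrastive loss over a batch of L2-normalized embeddings.

    For VAD, ``labels`` would mark each audio segment as speech (1) or
    non-speech (0); segments sharing a label are pulled together, the rest
    pushed apart. ``tau`` is the temperature (0.1 here is illustrative).
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    n = len(embeddings)
    total, anchors = 0.0, 0
    for i in range(n):
        # Positives: other samples in the batch with the same label.
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # an anchor with no positives contributes nothing
        # Denominator runs over every other sample in the batch.
        denom = sum(math.exp(dot(embeddings[i], embeddings[a]) / tau)
                    for a in range(n) if a != i)
        # Average the log-softmax term over all positives of this anchor.
        loss_i = -sum(math.log(math.exp(dot(embeddings[i], embeddings[p]) / tau)
                               / denom)
                      for p in positives) / len(positives)
        total += loss_i
        anchors += 1
    return total / anchors
```

With two well-separated clusters (e.g. speech embeddings near `[1, 0]` and non-speech near `[0, 1]`), correctly assigned labels give a much lower loss than shuffled labels, which is the property the training strategy exploits.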
Subject
Electrical and Electronic Engineering, Computer Networks and Communications, Hardware and Architecture, Signal Processing, Control and Systems Engineering
References (23 articles)
1. Sohn, J. A statistical model-based voice activity detection. IEEE Signal Process. Lett., 1999.
2. Souden, M. Gaussian model-based multichannel speech presence probability. IEEE Trans. Audio Speech Lang. Process., 2010.
3. Hebbar, R., Somandepalli, K., and Narayanan, S. (2019, January 12–17). Robust Speech Activity Detection in Movie Audio: Data Resources and Experimental Evaluation. Proceedings of the ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK.
4. Jia, F., Majumdar, S., and Ginsburg, B. (2021, January 6–11). MarbleNet: Deep 1D Time-Channel Separable Convolutional Neural Network for Voice Activity Detection. Proceedings of the ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada.
5. Taseska, M. ResectNet: An efficient architecture for voice activity detection on mobile devices. Proc. Interspeech, 2022.
Cited by 1 article.