Abstract
This paper introduces WhisperSeg, utilizing the Whisper Transformer pre-trained for Automatic Speech Recognition (ASR) for human and animal Voice Activity Detection (VAD). Contrary to traditional methods that detect human voice or animal vocalizations from a short audio frame and rely on careful threshold selection, WhisperSeg processes entire spectrograms of long audio and generates plain text representations of onset, offset, and type of voice activity. Processing a longer audio context with a larger network greatly improves detection accuracy from few labeled examples. We further demonstrate a positive transfer of detection performance to new animal species, making our approach viable in the data-scarce multi-species setting.
Publisher
Cold Spring Harbor Laboratory