
Positive Transfer of the Whisper Speech Transformer to Human and Animal Voice Activity Detection

Published: 2023-10-02

Author:

Gu Nianlong, Lee Kanghwi, Basha Maris, Ram Sumit Kumar, You Guanghao, Hahnloser Richard H. R.

Abstract

This paper introduces WhisperSeg, utilizing the Whisper Transformer pre-trained for Automatic Speech Recognition (ASR) for human and animal Voice Activity Detection (VAD). Contrary to traditional methods that detect human voice or animal vocalizations from a short audio frame and rely on careful threshold selection, WhisperSeg processes entire spectrograms of long audio and generates plain text representations of onset, offset, and type of voice activity. Processing a longer audio context with a larger network greatly improves detection accuracy from few labeled examples. We further demonstrate a positive transfer of detection performance to new animal species, making our approach viable in the data-scarce multi-species setting.

Publisher

Cold Spring Harbor Laboratory


Cited by 1 article.

1. AVN: A Deep Learning Approach for the Analysis of Birdsong; 2024-05-10