Affiliations:
1. School of Electronics and Information Engineering, Liaoning University of Technology, Jinzhou 121001, China
2. School of Electronics and Information Engineering, Shenyang University of Technology, Shenyang 110000, China
Abstract
The prediction of pause fillers plays a crucial role in enhancing the naturalness of synthesized speech. In recent years, neural networks including LSTM, BERT, and XLNet have been used in pause-filler prediction modules, but these methods have achieved relatively low prediction accuracy. This paper applies the RoBERTa model to Chinese pause-filler prediction and presents a novel approach to training it that improves prediction accuracy. The approach categorizes text from different speakers into four style groups based on the frequency and position of Chinese pause fillers. The RoBERTa model is then trained on these four groups of data, which cover different filler styles, so that the synthesized speech sounds more natural. The Chinese pause-filler prediction module is evaluated on systems such as Parallel Tacotron2, FastPitch, and Deep Voice3, achieving a 26.7% improvement in word-level prediction accuracy over the BERT model and a 14% improvement in position-level prediction accuracy. These gains significantly improve the naturalness of the generated speech.
Funder
Liaoning Provincial Education Department Fund
Subject
Fluid Flow and Transfer Processes, Computer Science Applications, Process Chemistry and Technology, General Engineering, Instrumentation, General Materials Science