Voice Synthesis Improvement by Machine Learning of Natural Prosody

Author:

Kane Joseph (1, 2), Johnstone Michael N. (1, 2), Szewczyk Patryk (1, 2)

Affiliation:

1. Cyber Security Cooperative Research Centre, Edith Cowan University, 270 Joondalup Drive, Joondalup, WA 6027, Australia

2. Security Research Institute, Edith Cowan University, Joondalup, WA 6027, Australia

Abstract

Since the advent of modern computing, researchers have striven to make the human–computer interface (HCI) as seamless as possible. Progress has been made on various fronts, e.g., the desktop metaphor (interface design) and natural language processing (input). One area receiving attention recently is voice activation and its corollary, computer-generated speech. Despite decades of research and development, most computer-generated voices remain easily identifiable as non-human. Prosody in speech has two primary components, intonation and rhythm, both often lacking in computer-generated voices. This research aims to enhance computer-generated text-to-speech algorithms by incorporating melodic and prosodic elements of human speech. This study explores a novel approach that uses machine learning, specifically a long short-term memory (LSTM) neural network, to add paralinguistic elements to a recorded or generated voice. The aim is to increase the realism of computer-generated text-to-speech algorithms, to enhance electronic reading applications, and to improve artificial voices for those who need assistance to speak. A computer that can also convey meaning through a spoken announcement will improve human-to-computer interactions. Applications of such an algorithm may include improving high-definition audio codecs for telephony, renewing old recordings, and lowering barriers to the utilization of computing. This research deployed a prototype modular platform for digital speech improvement, analyzing and generalizing algorithms into a modular system and using laboratory experiments to optimize combinations and performance in edge cases. The results were encouraging, with the LSTM-based encoder able to produce realistic speech. Further work will involve optimizing the algorithm and comparing its performance against other approaches.

Funder

Edith Cowan University

Cyber Security Research Centre Limited

Australian Government’s Cooperative Research Centres Programme

Publisher

MDPI AG


