Accent modification for speech recognition of non-native speakers using neural style transfer

Author:

Radzikowski KacperORCID,Wang Le,Yoshie Osamu,Nowak Robert

Abstract

AbstractNowadays automatic speech recognition (ASR) systems can achieve higher and higher accuracy rates depending on the methodology applied and datasets used. The rate decreases significantly when the ASR system is being used with a non-native speaker of the language to be recognized. The main reason for this is specific pronunciation and accent features related to the mother tongue of that speaker, which influence the pronunciation. At the same time, an extremely limited volume of labeled non-native speech datasets makes it difficult to train, from the ground up, sufficiently accurate ASR systems for non-native speakers.In this research, we address the problem and its influence on the accuracy of ASR systems, using the style transfer methodology. We designed a pipeline for modifying the speech of a non-native speaker so that it more closely resembles the native speech. This paper covers experiments for accent modification using different setups and different approaches, including neural style transfer and autoencoder. The experiments were conducted on English language pronounced by Japanese speakers (UME-ERJ dataset). The results show that there is a significant relative improvement in terms of the speech recognition accuracy. Our methodology reduces the necessity of training new algorithms for non-native speech (thus overcoming the obstacle related to the data scarcity) and can be used as a wrapper for any existing ASR system. The modification can be performed in real time, before a sample is passed into the speech recognition system itself.

Publisher

Springer Science and Business Media LLC

Subject

Electrical and Electronic Engineering,Acoustics and Ultrasonics

Reference27 articles.

1. W. Xiong, L. Wu, F. Alleva, J. Droppo, X. Huang, A. Stolcke, The Microsoft 2017 Conversational Speech Recognition System (2017). https://arxiv.org/abs/1708.06073.

2. T. T. Ping, Automatic speech recognition for non-native speakers. PhD thesis, Université Joseph-Fourier - Grenoble (2008).

3. A. Metallinou, J. Cheng, in Fifteenth Annual Conference of the International Speech Communication Association. Using deep neural networks to improve proficiency assessment for children English language learners, (2014).

4. T. Drugman, T. Dutoit, in INTERSPEECH 2009, 10th Annual Conference of the International Speech Communication Association, Brighton, United Kingdom, September 6-10, 2009. Glottal closure and opening instant detection from speech signals (ISCA, 2009), pp. 2891–2894. http://www.isca-speech.org/archive/interspeech_2009/i09_2891.html.

5. K. Livescu, J. Glass, in 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100), vol. 3. Lexical modeling of non-native speech for automatic speech recognition, (2000), pp. 1683–1686. https://doi.org/10.1109/ICASSP.2000.862074.

Cited by 25 articles. 订阅此论文施引文献 订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献

同舟云学术

1.学者识别学者识别

2.学术分析学术分析

3.人才评估人才评估

"同舟云学术"是以全球学者为主线,采集、加工和组织学术论文而形成的新型学术文献查询和分析系统,可以对全球学者进行文献检索和人才价值评估。用户可以通过关注某些学科领域的顶尖人物而持续追踪该领域的学科进展和研究前沿。经过近期的数据扩容,当前同舟云学术共收录了国内外主流学术期刊6万余种,收集的期刊论文及会议论文总量共计约1.5亿篇,并以每天添加12000余篇中外论文的速度递增。我们也可以为用户提供个性化、定制化的学者数据。欢迎来电咨询!咨询电话:010-8811{复制后删除}0370

www.globalauthorid.com

TOP

Copyright © 2019-2024 北京同舟云网络信息技术有限公司
京公网安备11010802033243号  京ICP备18003416号-3