Recognition of target domain Japanese speech using language model replacement

Author:

Mori Daiki, Ohta Kengo, Nishimura Ryota, Ogawa Atsunori, Kitaoka Norihide

Abstract

End-to-end (E2E) automatic speech recognition (ASR) models, which are built from deep learning models, can perform ASR using a single neural network. Such models must be trained on large amounts of data; however, collecting speech data that matches the target speech domain can be difficult, so mismatched speech data is often used instead, resulting in lower performance. Compared with speech data, in-domain text data is much easier to obtain, which is why traditional ASR systems combine separately trained language models with HMM-based acoustic models. In an E2E ASR model, however, language information is difficult to separate out because the model learns acoustic and language information in an integrated manner, making it very difficult to build E2E ASR models for specialized target domains that achieve sufficient recognition performance at a reasonable cost. In this paper, we propose a method for replacing the language information within a pre-trained E2E ASR model in order to adapt it to a target domain. We delete the “implicit” language information contained in the ASR model by subtracting, in the logarithmic domain, a source-domain language model trained on the transcriptions of the ASR model’s training data, and we then integrate a target-domain language model through addition in the logarithmic domain. This subtraction and addition, which together replace the language model, are derived from Bayes’ theorem. In our experiments, we first used two datasets from the Corpus of Spontaneous Japanese (CSJ) to evaluate the effectiveness of our method. We then evaluated the method using the Japanese Newspaper Article Speech (JNAS) and CSJ corpora, which contain read speech and spontaneous speech, respectively, to test how well it bridges the gap between these two language domains. Our results show that the proposed language model replacement method achieved better ASR performance than both non-adapted (baseline) ASR models and ASR models adapted using the conventional Shallow Fusion method.
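The abstract describes the replacement only in words; the following is a minimal sketch, in LaTeX, of the log-domain score combination it implies. The decoding weights $\lambda_{\mathrm{src}}$ and $\lambda_{\mathrm{tgt}}$ are our own illustrative notation, not taken from the paper. By Bayes’ theorem, the posterior of the E2E model factors as $P_{\mathrm{e2e}}(W \mid X) \propto P(X \mid W)\, P_{\mathrm{src}}(W)$, where $P_{\mathrm{src}}(W)$ is the implicit language model learned from the training transcriptions; subtracting an explicit estimate of it and adding a target-domain LM yields the adapted decoding score:

\[
\hat{W} \;=\; \operatorname*{arg\,max}_{W} \Bigl[ \log P_{\mathrm{e2e}}(W \mid X) \;-\; \lambda_{\mathrm{src}} \log P_{\mathrm{src}}(W) \;+\; \lambda_{\mathrm{tgt}} \log P_{\mathrm{tgt}}(W) \Bigr].
\]

Shallow Fusion, the baseline mentioned in the abstract, corresponds to keeping only the addition term $+\,\lambda_{\mathrm{tgt}} \log P_{\mathrm{tgt}}(W)$; the subtraction of the implicit source-domain LM is what distinguishes the proposed replacement.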

Funder

JSPS

Publisher

Springer Science and Business Media LLC

