AN EFFICIENT SPEECH GENERATIVE MODEL BASED ON DETERMINISTIC/STOCHASTIC SEPARATION OF SPECTRAL ENVELOPES

Author:

Taha M.1,Azarov E. S.1,Likhachov D. S.1,Petrovsky A. A.1

Affiliation:

1. Belarusian State University of Informatics and Radioelectronics

Abstract

The paper presents a speech generative model that provides an efficient way of generating speech waveform from its amplitude spectral envelopes. The model is based on hybrid speech representation that includes deterministic (harmonic) and stochastic (noise) components. The main idea behind the approach originates from the fact that speech signal has a determined spectral structure that is statistically bound with deterministic/stochastic energy distribution in the spectrum. The performance of the model is evaluated using an experimental low-bitrate wide-band speech coder. The quality of reconstructed speech is evaluated using objective and subjective methods. Two objective quality characteristics were calculated: Modified Bark Spectral Distortion (MBSD) and Perceptual Evaluation of Speech Quality (PESQ). Narrow-band and wide-band versions of the proposed solution were compared with MELP (Mixed Excitation Linear Prediction) speech coder and AMR (Adaptive Multi-Rate) speech coder, respectively. The speech base of two female and two male speakers were used for testing. The performed tests show that overall performance of the proposed approach is speaker-dependent and it is better for male voices. Supposedly, this difference indicates the influence of pitch highness on separation accuracy. In that way, using the proposed approach in experimental speech compression system provides decent MBSD values and comparable PESQ values with AMR speech coder at 6,6 kbit/s. Additional subjective listening testsdemonstrate that the implemented coding system retains phonetic content and speaker’s identity. It proves consistency of the proposed approach.

Publisher

Belarusian State University of Informatics and Radioelectronics

Reference8 articles.

1. A. van den Oord, Dieleman S., Zen H., Simonyan K., Vinyals O., Graves A., Kalchbrenner N., Senior A., Kavukcuoglu K. WaveNet: A generative model for raw audio, arXiv:1609.03499, 2016.

2. Shen J., Pang R., Weiss R. J., Schuster M., Jaitly N., Yang Z., Chen Z., Zhang Y., Wang Y., Skerrv-Ryan R. “Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions,” in Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018:4779-4783.

3. Arik S., Diamos G., Gibiansky A., Miller J., Peng K., Ping W., Raiman J., and Zhou Y. Deep voice 2: Multi-speaker neural text-to-speech. arXiv:1705.08947, 2017.

4. Valin J.-V., Skoglund J. LPCNet: Improving neural speech synthesis through linear prediction, arXiv:1810.11846

5. Griffin D., Lim J. A new model-based speech analysis/synthesis system. In Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1985;10:513-516.

Cited by 2 articles. 订阅此论文施引文献 订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献

同舟云学术

1.学者识别学者识别

2.学术分析学术分析

3.人才评估人才评估

"同舟云学术"是以全球学者为主线,采集、加工和组织学术论文而形成的新型学术文献查询和分析系统,可以对全球学者进行文献检索和人才价值评估。用户可以通过关注某些学科领域的顶尖人物而持续追踪该领域的学科进展和研究前沿。经过近期的数据扩容,当前同舟云学术共收录了国内外主流学术期刊6万余种,收集的期刊论文及会议论文总量共计约1.5亿篇,并以每天添加12000余篇中外论文的速度递增。我们也可以为用户提供个性化、定制化的学者数据。欢迎来电咨询!咨询电话:010-8811{复制后删除}0370

www.globalauthorid.com

TOP

Copyright © 2019-2024 北京同舟云网络信息技术有限公司
京公网安备11010802033243号  京ICP备18003416号-3