Speech Inpainting Based on Multi-Layer Long Short-Term Memory Networks

Authors:

Shi Haohan 1, Shi Xiyu 1, Dogan Safak 1

Affiliation:

1. Institute for Digital Technologies, Loughborough University London, Queen Elizabeth Olympic Park, Here East, London E20 3BS, UK

Abstract

Audio inpainting plays an important role in addressing incomplete, damaged, or missing audio signals, contributing to improved quality of service and overall user experience in multimedia communications over the Internet and mobile networks. This paper presents an innovative solution for speech inpainting using Long Short-Term Memory (LSTM) networks, i.e., a restoration task in which the missing parts of a speech signal are recovered from the preceding information in the time domain. The lost or corrupted segments of speech signals are also referred to as gaps. We regard the speech inpainting task as a time-series prediction problem in this research work. To address this problem, we designed multi-layer LSTM networks and trained them on different speech datasets. Our study aims to investigate the inpainting performance of the proposed models on different datasets and with varying numbers of LSTM layers, and to explore the effect of multi-layer LSTM networks on the prediction of speech samples in terms of perceived audio quality. The inpainted speech quality is evaluated through the Mean Opinion Score (MOS) and a frequency analysis of the spectrogram. Our proposed multi-layer LSTM models are able to restore gaps of up to 1 s with high perceptual audio quality using features captured from the time domain only. Specifically, for gap lengths under 500 ms, the MOS can reach up to 3~4, and for gap lengths between 500 ms and 1 s, the MOS can reach up to 2~3. In the time domain, the proposed models can proficiently restore the envelope and trend of lost speech signals. In the frequency domain, the proposed models can restore spectrogram blocks with higher similarity to the original signals at frequencies below 2.0 kHz and comparatively lower similarity at frequencies in the range of 2.0 kHz~8.0 kHz.
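The abstract frames inpainting as autoregressive time-series prediction: a stacked LSTM is conditioned on the intact samples before a gap and then fed its own predictions back to generate the missing samples. The sketch below illustrates that mechanism in plain NumPy. It is not the authors' model: the layer count, hidden size, linear readout, and the use of untrained random weights are all assumptions made for illustration only (a real system would train the weights on speech data, as the paper describes).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LSTMLayer:
    """One LSTM layer processing a single input vector per time step."""
    def __init__(self, input_size, hidden_size, rng):
        s = 1.0 / np.sqrt(hidden_size)
        # Gate weights stacked as [input; forget; cell; output] gates.
        self.W = rng.uniform(-s, s, (4 * hidden_size, input_size))
        self.U = rng.uniform(-s, s, (4 * hidden_size, hidden_size))
        self.b = np.zeros(4 * hidden_size)
        self.hidden_size = hidden_size

    def step(self, x, h, c):
        z = self.W @ x + self.U @ h + self.b
        H = self.hidden_size
        i = sigmoid(z[0:H])          # input gate
        f = sigmoid(z[H:2 * H])      # forget gate
        g = np.tanh(z[2 * H:3 * H])  # candidate cell state
        o = sigmoid(z[3 * H:4 * H])  # output gate
        c_new = f * c + i * g
        h_new = o * np.tanh(c_new)
        return h_new, c_new

class MultiLayerLSTMPredictor:
    """Stacked LSTM mapping past samples to the next sample (untrained sketch)."""
    def __init__(self, num_layers=3, hidden_size=16, seed=0):
        rng = np.random.default_rng(seed)
        sizes = [1] + [hidden_size] * num_layers
        self.layers = [LSTMLayer(sizes[k], hidden_size, rng)
                       for k in range(num_layers)]
        # Hypothetical linear readout from the top layer to one sample.
        self.w_out = rng.uniform(-0.1, 0.1, hidden_size)
        self.hidden_size = hidden_size

    def inpaint(self, context, gap_len):
        """Warm up on the known context, then predict gap_len samples autoregressively."""
        h = [np.zeros(self.hidden_size) for _ in self.layers]
        c = [np.zeros(self.hidden_size) for _ in self.layers]

        def step(sample):
            x = np.array([sample])
            for k, layer in enumerate(self.layers):
                h[k], c[k] = layer.step(x, h[k], c[k])
                x = h[k]
            return float(self.w_out @ x)

        pred = 0.0
        for s in context:            # condition on intact samples before the gap
            pred = step(s)
        filled = []
        for _ in range(gap_len):     # feed each prediction back as the next input
            filled.append(pred)
            pred = step(pred)
        return np.array(filled)

# Example: fill a 20-sample gap in a sine-like stand-in for a speech excerpt.
model = MultiLayerLSTMPredictor(num_layers=3, hidden_size=16, seed=42)
t = np.arange(200)
signal = 0.5 * np.sin(2 * np.pi * t / 40.0)
gap = model.inpaint(signal[:100], gap_len=20)
```

With trained weights, the same loop yields the time-domain gap restoration the paper evaluates; the autoregressive feedback is also why prediction quality degrades as gap length grows toward 1 s.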

Funder

Loughborough University

China Scholarship Council

Publisher

MDPI AG


Cited by 2 articles.

1. The Method of Restoring Lost Information from Sensors Based on Auto-Associative Neural Networks;Applied System Innovation;2024-06-20

2. An Application of Image Generation AI in Industry and its Efficiency;Journal of The Japan Institute of Electronics Packaging;2024-05-01
