Speech-In-Noise Comprehension is Improved When Viewing a Deep-Neural-Network-Generated Talking Face

Authors:

Tong Shan 1,2,3, Casper E. Wenner 4, Chenliang Xu 5, Zhiyao Duan 4, Ross K. Maddox 1,2,3,6

Affiliation:

1. Department of Biomedical Engineering, University of Rochester, Rochester, NY, USA

2. Del Monte Institute for Neuroscience, University of Rochester, Rochester, NY, USA

3. Center for Visual Science, University of Rochester, Rochester, NY, USA

4. Department of Electrical and Computer Engineering, University of Rochester, Rochester, NY, USA

5. Department of Computer Science, University of Rochester, Rochester, NY, USA

6. Department of Neuroscience, University of Rochester, Rochester, NY, USA

Abstract

Listening in a noisy environment is challenging, but many previous studies have demonstrated that comprehension of speech can be substantially improved by looking at the talker's face. We recently developed a deep neural network (DNN)-based system that generates movies of a talking face from speech audio and a single face image. In this study, we aimed to quantify the benefits that such a system can bring to speech comprehension, especially in noise. The target speech audio was masked at signal-to-noise ratios of −9, −6, −3, and 0 dB and was presented to subjects in three audio-visual (AV) stimulus conditions: (1) synthesized AV: audio with the synthesized talking face movie; (2) natural AV: audio with the original movie from the corpus; and (3) audio-only: audio with a static image of the talker. Subjects were asked to type the sentences they heard in each trial, and keyword recognition accuracy was quantified for each condition. Overall, performance in the synthesized AV condition fell approximately halfway between the other two conditions, showing a marked improvement over the audio-only control but still falling short of the natural AV condition. Every subject showed some benefit from the synthetic AV stimulus. The results of this study support the idea that a DNN-based model that generates a talking face from speech audio can meaningfully enhance comprehension in noisy environments, and has the potential to be used as a visual hearing aid.
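The masking procedure described above (presenting target speech at a fixed signal-to-noise ratio) can be sketched as follows. This is an illustrative example only, assuming the common approach of scaling the masker's power relative to the target; the abstract does not specify the study's exact stimulus-generation pipeline, and the function name `mix_at_snr` is hypothetical.

```python
import numpy as np

def mix_at_snr(target, masker, snr_db):
    """Scale the masker so the mixture has the requested SNR, then sum.

    Illustrative sketch; assumes SNR is defined from mean-square power,
    i.e. snr_db = 10 * log10(P_target / P_masker_scaled).
    """
    p_target = np.mean(target ** 2)   # target power (mean square)
    p_masker = np.mean(masker ** 2)   # masker power before scaling
    # Gain that makes the scaled masker power equal p_target / 10^(snr_db/10)
    gain = np.sqrt(p_target / (p_masker * 10 ** (snr_db / 10)))
    return target + gain * masker

# Example: generate mixtures at the four SNRs used in the study
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)   # stand-in for a speech waveform
noise = rng.standard_normal(16000)    # stand-in for the masking noise
mixtures = {snr: mix_at_snr(speech, noise, snr) for snr in (-9, -6, -3, 0)}
```

Negative SNRs mean the masker carries more power than the target, which is why the −9 dB condition is the hardest listening condition in the experiment.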

Funder

Augmented / Virtual Reality pilot grant

National Institute on Deafness and Other Communication Disorders

Publisher

SAGE Publications

Subject

Speech and Hearing, Otorhinolaryngology
