Abstract
Clearly articulated speech, relative to plain-style speech, has been shown to improve intelligibility. We examine whether visible speech cues in video alone can be systematically modified to enhance clear-speech visual features and improve intelligibility. We extract clear-speech visual features of English words varying in vowels, produced by multiple male and female talkers. Using a frame-by-frame, image-warping-based video generation method with a controllable parameter (displacement factor), we apply the extracted clear-speech visual features to plain-speech videos to synthesize clear-speech videos. We evaluate the generated videos using a robust, state-of-the-art AI lip reader as well as human intelligibility testing. The contributions of this study are: (1) we successfully extract visual cues relevant to video modification across speech styles and achieve enhanced intelligibility for the AI lip reader; (2) this work suggests that universal, talker-independent clear-speech features may be used to modify any talker’s visual speech style; (3) we introduce the “displacement factor” as a way of systematically scaling the magnitude of displacement modifications between speech styles; and (4) the generated videos are high-definition, making them ideal candidates for human-centric intelligibility and perceptual-training studies.
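As an illustration of the displacement-factor idea summarized above, the sketch below interpolates facial-landmark positions between plain and clear speech styles and warps a frame accordingly. This is a minimal, hypothetical reconstruction, not the authors' released code: the function names, the use of scikit-image's piecewise-affine warp, and the example value of the displacement factor are all assumptions for illustration.

```python
# Hypothetical sketch of displacement-factor scaling and frame warping.
import numpy as np
from skimage.transform import PiecewiseAffineTransform, warp

def displaced_landmarks(plain_pts, clear_pts, alpha):
    """Move plain-speech landmarks toward clear-speech positions.

    plain_pts, clear_pts: (n_points, 2) landmark coordinates for a
    time-aligned pair of frames. alpha is the displacement factor:
    0.0 leaves the frame unchanged, 1.0 applies the full clear-speech
    displacement, and intermediate values interpolate between styles.
    """
    return plain_pts + alpha * (clear_pts - plain_pts)

def warp_frame(frame, src_pts, dst_pts):
    """Piecewise-affine warp that moves src_pts to dst_pts in `frame`."""
    tform = PiecewiseAffineTransform()
    # warp() expects a map from output coordinates to input coordinates,
    # so estimate the transform from destination points back to source.
    tform.estimate(dst_pts, src_pts)
    return warp(frame, tform)

# Per-frame usage (landmarks would come from a face tracker):
# new_pts = displaced_landmarks(plain_pts, clear_pts, alpha=0.75)
# modified_frame = warp_frame(frame, plain_pts, new_pts)
```

Applied frame by frame along a plain-speech video, this kind of warp pushes the visible articulation toward the clear-speech style, with alpha controlling how strongly the modification is applied.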
Funder
Social Sciences and Humanities Research Council of Canada
Simon Fraser University
Publisher
Springer Science and Business Media LLC
Subject
Computer Vision and Pattern Recognition, Linguistics and Language, Human-Computer Interaction, Language and Linguistics, Software