Reconsidering Read and Spontaneous Speech: Causal Perspectives on the Generation of Training Data for Automatic Speech Recognition

Published:2023-02-19 Issue:2 Volume:14 Page:137
ISSN:2078-2489
Container-title:Information
language:en
Short-container-title:Information

Author:

Gabler Philipp¹, Geiger Bernhard C.¹, Schuppler Barbara², Kern Roman¹,³

Affiliation:

1. Area of Knowledge Discovery, Know-Center GmbH, 8010 Graz, Austria

2. Signal Processing and Speech Communication Laboratory, Graz University of Technology, 8010 Graz, Austria

3. Institute of Interactive Systems and Data Science, Graz University of Technology, 8010 Graz, Austria

Abstract

Superficially, read and spontaneous speech—the two main kinds of training data for automatic speech recognition—appear as complementary, but are equal: pairs of texts and acoustic signals. Yet, spontaneous speech is typically harder for recognition. This is usually explained by different kinds of variation and noise, but there is a more fundamental deviation at play: for read speech, the audio signal is produced by recitation of the given text, whereas in spontaneous speech, the text is transcribed from a given signal. In this review, we embrace this difference by presenting a first introduction of causal reasoning into automatic speech recognition, and describing causality as a tool to study speaking styles and training data. After breaking down the data generation processes of read and spontaneous speech and analysing the domain from a causal perspective, we highlight how data generation by annotation must affect the interpretation of inference and performance. Our work discusses how various results from the causality literature regarding the impact of the direction of data generation mechanisms on learning and prediction apply to speech data. Finally, we argue how a causal perspective can support the understanding of models in speech processing regarding their behaviour, capabilities, and limitations.

Funder

FWF Austrian Science Fund

Publisher

MDPI AG

Subject

Information Systems

Link

https://www.mdpi.com/2078-2489/14/2/137/pdf

Reference: 111 articles.

1. Pierce (1969). Whither Speech Recognition? J. Acoust. Soc. Am.

2. Roe (1993). Whither Speech Recognition: The next 25 Years. IEEE Commun. Mag.

3. Hannun, A. (2021). The History of Speech Recognition to the Year 2030. arXiv.

4. Chen, F., and Jokinen, K. (2010). Speech Technology: Theory and Applications, Springer.

5. Galitsky, B. (2019). Developing Enterprise Chatbots: Learning Linguistic Structures, Springer International Publishing.

Cited by 3 articles.

1. Automatic Speech Recognition of Conversational Speech in Individuals With Disordered Speech;Journal of Speech, Language, and Hearing Research;2024-07-04

2. Evaluating OpenAI's Whisper ASR: Performance analysis across diverse accents and speaker traits;JASA Express Letters;2024-02-01

3. Accents in Speech Recognition through the Lens of a World Englishes Evaluation Set;Research in Language;2023-12-28