Affiliation:
1. Sean Parker Institute for the Voice, Weill Cornell Medical College, New York, New York, U.S.A.
2. Department of Information Science, Cornell University, Ithaca, New York, U.S.A.
Abstract
Objective: To evaluate the performance of commercial automatic speech recognition (ASR) systems on d/Deaf and hard-of-hearing (d/Dhh) speech.
Methods: A corpus containing 850 audio files of d/Dhh and normal-hearing (NH) speech from the University of Memphis Speech Perception Assessment Laboratory was tested on four speech-to-text application programming interfaces (APIs): Amazon Web Services, Microsoft Azure, Google Chirp, and OpenAI Whisper. We quantified the word error rate (WER) of API transcriptions for 24 d/Dhh and nine NH participants and performed subgroup analyses by speech intelligibility classification (SIC), hearing loss (HL) onset, and primary communication mode.
Results: Mean WER averaged across APIs was 10 times higher for the d/Dhh group (52.6%) than for the NH group (5.0%). APIs performed significantly worse for the "low" and "medium" SIC groups (85.9% and 46.6% WER, respectively) than for the "high" SIC group (9.5% WER, comparable to the NH group). APIs performed significantly worse for speakers with prelingual HL than for those with postlingual HL (80.5% vs. 37.1% WER). APIs also performed significantly worse for speakers communicating primarily in sign language (70.2% WER) than for speakers using both oral and sign language communication (51.5% WER) or oral communication only (19.7% WER).
Conclusion: Commercial ASR systems underperform for d/Dhh individuals, especially those with "low" or "medium" SIC, prelingual onset of HL, and sign language as their primary communication mode. This contrasts with Big Tech companies' promises of accessibility and indicates the need for ASR systems ethically trained on heterogeneous d/Dhh speech data.
Level of Evidence: 3. Laryngoscope, 2024.
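For readers unfamiliar with the metric, WER is the word-level edit distance between a reference transcript and an ASR hypothesis, normalized by the number of reference words. The abstract does not specify the authors' implementation, so the following is a minimal illustrative sketch of the standard dynamic-programming computation; all names and the sample strings are hypothetical.

    # Minimal sketch of word error rate (WER): word-level Levenshtein
    # distance divided by reference length. Illustrative only; not the
    # authors' actual pipeline.

    def wer(reference: str, hypothesis: str) -> float:
        ref = reference.split()
        hyp = hypothesis.split()
        # d[i][j] = edit distance between the first i reference words
        # and the first j hypothesis words
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i  # all deletions
        for j in range(len(hyp) + 1):
            d[0][j] = j  # all insertions
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(
                    d[i - 1][j] + 1,         # deletion
                    d[i][j - 1] + 1,         # insertion
                    d[i - 1][j - 1] + cost,  # substitution or match
                )
        return d[len(ref)][len(hyp)] / len(ref)

    # One substitution over five reference words -> WER = 0.2 (20%)
    print(wer("the quick brown fox jumps", "the quick brown dog jumps"))

Under this definition, the reported group means (e.g., 52.6% for d/Dhh vs. 5.0% for NH speakers) correspond to roughly one transcription error for every two spoken words versus one in twenty.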