Identifying bias in models that detect vocal fold paralysis from audio recordings using explainable machine learning and clinician ratings

Published: 2024-05-30
Volume: 3
Issue: 5
Page: e0000516
ISSN: 2767-3170
Container-title: PLOS Digital Health
Language: en
Short-container-title: PLOS Digit Health

Author:

Low Daniel M., Rao Vishwanatha, Randolph Gregory, Song Phillip C., Ghosh Satrajit S.

Abstract

Detecting voice disorders from voice recordings could allow for frequent, remote, and low-cost screening before costly clinical visits and a more invasive laryngoscopy examination. Our goals were to detect unilateral vocal fold paralysis (UVFP) from voice recordings using machine learning, to identify which acoustic variables were important for prediction in order to increase trust, and to determine model performance relative to clinician performance. Patients with UVFP confirmed through endoscopic examination (N = 77) and controls with normal voices matched for age and sex (N = 77) were included. Voice samples were elicited by reading the Rainbow Passage and sustaining phonation of the vowel "a". Four machine learning models of differing complexity were used. SHapley Additive exPlanations (SHAP) was used to identify important features. The highest median bootstrapped ROC AUC score was 0.87, which exceeded clinicians' performance (range: 0.74–0.81) based on the recordings. Recording durations differed between UVFP recordings and controls because of how the data were originally processed for storage, and we show that this difference can be used to classify the two groups. Counterintuitively, many UVFP recordings had higher intensity than controls, even though UVFP patients tend to have weaker voices, revealing a dataset-specific bias that we mitigate in an additional analysis. We demonstrate that recording biases in audio duration and intensity created dataset-specific differences between patients and controls, which models used to improve classification. Clinicians' ratings provide further evidence that patients were over-projecting their voices and being recorded at a higher signal amplitude than controls. Interestingly, after matching audio duration and removing variables associated with intensity to mitigate the biases, the models achieved similarly high performance.
We provide a set of recommendations to avoid bias when building and evaluating machine learning models for screening in laryngology.
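
The kind of check described in the abstract, testing whether a nuisance variable such as recording duration can separate patients from controls, can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `roc_auc` and `median_bootstrap_auc`, the bootstrap settings, and the synthetic durations (including which group is longer) are all assumptions made for illustration. It computes a median bootstrapped ROC AUC using only the Python standard library.

```python
import random
import statistics


def roc_auc(labels, scores):
    """ROC AUC as the probability that a random positive outscores a
    random negative, with ties counted as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def median_bootstrap_auc(labels, scores, n_boot=200, seed=0):
    """Median ROC AUC over bootstrap resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    aucs = []
    while len(aucs) < n_boot:
        sample = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in sample]
        if len(set(ys)) < 2:
            continue  # a resample with a single class has no defined AUC
        aucs.append(roc_auc(ys, [scores[i] for i in sample]))
    return statistics.median(aucs)


# Synthetic durations in seconds (hypothetical values, NOT the study's data).
# The direction of the difference is arbitrary and chosen only for the demo.
rng = random.Random(42)
patient_durations = [rng.gauss(30.0, 3.0) for _ in range(77)]
control_durations = [rng.gauss(25.0, 3.0) for _ in range(77)]

labels = [1] * 77 + [0] * 77
durations = patient_durations + control_durations

# Duration itself serves as the "score": an AUC far from 0.5 means the
# recording length carries group information a model could exploit.
print("median bootstrapped AUC from duration:",
      round(median_bootstrap_auc(labels, durations), 3))
```

An AUC near 0.5 for such a variable would suggest it carries little group information, whereas a value far from 0.5 would flag a potential recording bias worth mitigating, for example by matching durations across groups before training.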

Funder

National Institute on Deafness and Other Communication Disorders

RallyPoint Fellowship

Amelia Peabody Charitable Fund

Gift to McGovern Institute for Brain Research at MIT

National Institute of Biomedical Imaging and Bioengineering

NIH Office of the Director

Publisher

Public Library of Science (PLoS)

Cited by 1 article.

1. New developments in the application of artificial intelligence to laryngology;Current Opinion in Otolaryngology & Head & Neck Surgery;2024-07-25