A Deep Learning Approach for Quantifying Vocal Fold Dynamics During Connected Speech Using Laryngeal High-Speed Videoendoscopy

Published: 2022-06-08 Issue: 6 Volume: 65 Page: 2098-2113
ISSN: 1092-4388
Container-title: Journal of Speech, Language, and Hearing Research
Language: en
Short-container-title: J Speech Lang Hear Res

Author:

Yousef Ahmed M.¹, Deliyski Dimitar D.¹, Zacharias Stephanie R. C.²³, de Alarcon Alessandro⁴⁵, Orlikoff Robert F.⁶, Naghibolhosseini Maryam¹

Affiliation:

1. Department of Communicative Sciences and Disorders, Michigan State University, East Lansing

2. Head and Neck Regenerative Medicine Program, Mayo Clinic, Scottsdale, AZ

3. Department of Otolaryngology—Head and Neck Surgery, Mayo Clinic, Phoenix, AZ

4. Division of Pediatric Otolaryngology, Cincinnati Children's Hospital Medical Center, OH

5. Department of Otolaryngology—Head and Neck Surgery, University of Cincinnati, OH

6. College of Allied Health Sciences, East Carolina University, Greenville, NC

Abstract

Purpose: Voice disorders are best assessed by examining vocal fold dynamics in connected speech. This can be achieved using flexible laryngeal high-speed videoendoscopy (HSV), which enables us to study vocal fold mechanics with high temporal details. Analysis of vocal fold vibration using HSV requires accurate segmentation of the vocal fold edges. This article presents an automated deep-learning scheme to segment the glottal area in HSV from which the glottal edges are derived during connected speech.

Method: Using a custom-built HSV system, data were obtained from a vocally healthy participant reciting the “Rainbow Passage.” A deep neural network was designed for glottal area segmentation in the HSV data. A recently introduced hybrid approach by the authors was utilized as an automated labeling tool to train the network on a set of HSV frames, where the glottis region was automatically annotated during vocal fold vibrations. The network was then tested against manually segmented frames using different metrics, intersection over union (IoU), and Boundary F1 (BF) score, and its performance was assessed on various phonatory events on the HSV sequence.

Results: The designed network was successfully trained using the hybrid approach, without the need for manual labeling, and tested on the manually labeled data. The performance metrics showed a mean IoU of 0.82 and a mean BF score of 0.96. In addition, the evaluation assessment of the network's performance demonstrated an accurate segmentation of the glottal edges/area even during complex nonstationary phonatory events and when vocal folds were not vibrating, thus overcoming the limitations of the previous hybrid approach that could only be applied to the vibrating vocal folds.

Conclusions: The introduced automated scheme guarantees accurate glottis representation in challenging color HSV data with lower image quality and excessive laryngeal maneuvers during all instances of connected speech. This facilitates the future development of HSV-based measures to assess the running vibratory characteristics of the vocal folds in speakers with and without voice disorder.

Supplemental Material:

https://doi.org/10.23641/asha.19798864
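
The abstract reports two segmentation metrics, intersection over union (IoU) and Boundary F1 (BF) score. The sketch below illustrates how such metrics are commonly computed for a pair of binary masks. It is not the authors' implementation: the function names, the 4-neighbour boundary definition, the pixel tolerance `tol`, and the handling of empty masks are all assumptions made for illustration.

```python
from math import hypot

def iou(pred, truth):
    """Intersection over union of two same-sized binary masks (rows of 0/1)."""
    inter = union = 0
    for row_p, row_t in zip(pred, truth):
        for p, t in zip(row_p, row_t):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks (no glottal area in either) agree.
    return 1.0 if union == 0 else inter / union

def boundary(mask):
    """Foreground pixels with a 4-neighbour that is background or off-image."""
    h, w = len(mask), len(mask[0])
    points = []
    for r in range(h):
        for c in range(w):
            if not mask[r][c]:
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if not (0 <= rr < h and 0 <= cc < w) or not mask[rr][cc]:
                    points.append((r, c))
                    break
    return points

def bf_score(pred, truth, tol=1.0):
    """Boundary F1: harmonic mean of boundary precision and recall,
    counting a boundary pixel as matched if the other boundary has a
    pixel within `tol` pixels (Euclidean distance)."""
    bp, bt = boundary(pred), boundary(truth)
    if not bp and not bt:
        return 1.0  # assumed convention for two empty boundaries
    if not bp or not bt:
        return 0.0

    def matched_fraction(src, dst):
        hits = sum(
            1 for (r, c) in src
            if any(hypot(r - r2, c - c2) <= tol for (r2, c2) in dst)
        )
        return hits / len(src)

    precision = matched_fraction(bp, bt)
    recall = matched_fraction(bt, bp)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical 5x5 masks: a 3x3 region and the same region shifted one pixel.
truth = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
pred = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0],
]

print(iou(pred, truth))       # area overlap
print(bf_score(pred, truth))  # boundary agreement within 1 pixel
```

In this toy case a one-pixel shift halves the area overlap while every boundary pixel still lies within the tolerance, which shows why the two metrics can diverge on the same segmentation.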

Publisher

American Speech-Language-Hearing Association

Subject

Speech and Hearing, Linguistics and Language, Language and Linguistics

Link

http://pubs.asha.org/doi/pdf/10.1044/2022_JSLHR-21-00540

Reference: 79 articles.

1. Aronson, A. E., & Bless, D. (2011). Clinical voice disorders. Thieme.

2. Videostroboscopic evaluation of the larynx;Bless D. M.;Ear, Nose & Throat Journal,1987

3. Brown, C., Naghibolhosseini, M., Zacharias, S. R., & Deliyski, D. D. (2019). Investigation of high-speed videoendoscopy during connected speech in norm and neurogenic voice disorder. Michigan Speech-Language-Hearing Association (MSHA) Annual Conference, East Lansing, MI, United States.

4. Csurka, G., Larlus, D., Perronnin, F., & Meylan, F. (2013). What is a good evaluation measure for semantic segmentation? In T. Burghardt, D. Damen, W. Mayol-Cuevas, & M. Mirmehdi (Eds.), Proceedings of the British Machine Vision Conference (Vol. 27, No. 2013, pp. 32.1-32.11). BMVA Press. https://doi.org/10.5244/C.27.32

5. Endoscope Motion Compensation for Laryngeal High-Speed Videoendoscopy

Cited by 16 articles.

1. Supraglottic Laryngeal Maneuvers in Adductor Laryngeal Dystonia During Connected Speech;Journal of Voice;2024-08

2. New developments in the application of artificial intelligence to laryngology;Current Opinion in Otolaryngology & Head & Neck Surgery;2024-07-25

3. Sociodemographic reporting in videomics research: a review of practices in otolaryngology - head and neck surgery;European Archives of Oto-Rhino-Laryngology;2024-05-05

4. How reliable is assessment of true vocal cord-arytenoid unit mobility in patients affected by laryngeal cancer? a multi-institutional study on 366 patients from the ARYFIX collaborative group;Oral Oncology;2024-05

5. Artificial intelligence in otolaryngology;Big Data in Otolaryngology;2024