Improving Text-Independent Forced Alignment to Support Speech-Language Pathologists with Phonetic Transcription

Published: 2023-12-06 Issue: 24 Volume: 23 Page: 9650
ISSN: 1424-8220
Container-title: Sensors
Language: en
Short-container-title: Sensors

Author:

Li Ying¹, Wohlan Bryce Johannas¹, Pham Duc-Son¹, Chan Kit Yan¹, Ward Roslyn², Hennessey Neville², Tan Tele¹

Affiliation:

1. School of EECMS, Curtin University, Bentley, WA 6102, Australia

2. School of Allied Health, Curtin University, Bentley, WA 6102, Australia

Abstract

Problem: Phonetic transcription is crucial in diagnosing speech sound disorders (SSDs) but is susceptible to transcriber experience and perceptual bias. Current forced alignment (FA) tools, which annotate audio files to determine spoken content and its placement, often require manual transcription, limiting their effectiveness. Method: We introduce a novel, text-independent forced alignment model that autonomously recognises individual phonemes and their boundaries, addressing these limitations. Our approach uses a pre-trained wav2vec 2.0 model to segment speech into tokens and recognise them automatically. To accurately identify phoneme boundaries, we utilise an unsupervised segmentation tool, UnsupSeg. Segments are labelled by nearest-neighbour classification against the wav2vec 2.0 labels taken before connectionist temporal classification (CTC) collapse, with each segment assigned the class label of maximum overlap. Additional post-processing, including overfitting cleaning and voice activity detection, is applied to refine the segmentation. Results: We benchmarked our model against existing methods using the TIMIT dataset for normal speakers and, for the first time, evaluated its performance on the TORGO dataset containing SSD speakers. Our model demonstrated competitive performance, achieving a harmonic mean score of 76.88% on TIMIT and 70.31% on TORGO. Implications: This research presents a significant advancement in the assessment and diagnosis of SSDs, offering a more objective and less biased approach than traditional methods. Our model’s effectiveness, particularly with SSD speakers, opens new avenues for research and clinical application in speech pathology.
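The maximum-overlap labelling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that segment boundaries (for example, from an unsupervised segmenter) and per-frame labels taken before CTC collapse are already available. The function name `label_segments`, the blank symbol `"_"`, the 20 ms frame hop, and the toy data are all hypothetical choices made for illustration.

```python
from collections import defaultdict

FRAME_MS = 20  # assumed frame hop in milliseconds (illustrative value)
BLANK = "_"    # hypothetical symbol standing in for the CTC blank token


def label_segments(boundaries_ms, frame_labels, frame_ms=FRAME_MS, blank=BLANK):
    """Give each [start, end) segment the non-blank frame label that overlaps it most."""
    labelled = []
    for start, end in zip(boundaries_ms[:-1], boundaries_ms[1:]):
        overlap = defaultdict(int)  # label -> total overlap with this segment, in ms
        for i, label in enumerate(frame_labels):
            frame_start, frame_end = i * frame_ms, (i + 1) * frame_ms
            shared = min(end, frame_end) - max(start, frame_start)
            if shared > 0 and label != blank:
                overlap[label] += shared
        # Segments covered only by blank frames stay unlabelled (None).
        best = max(overlap, key=overlap.get) if overlap else None
        labelled.append((start, end, best))
    return labelled


# Toy input: ten uncollapsed frame labels and three segments (boundaries in ms).
frames = ["_", "h", "h", "_", "ə", "ə", "ə", "l", "l", "_"]
bounds = [0, 70, 140, 200]
result = label_segments(bounds, frames)
```

Working on the uncollapsed frame sequence is what makes this possible: each frame keeps its position in time, so overlap with a segment can be measured, whereas the collapsed CTC output keeps only the order of the symbols.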

Funder

PROMPT Institute Research

WA Near Miss Award

Department of Health WA and administered through the Future Health Research and Innovation (FHRI) Fund

Publisher

MDPI AG

Subject

Electrical and Electronic Engineering; Biochemistry; Instrumentation; Atomic and Molecular Physics, and Optics; Analytical Chemistry

Link

https://www.mdpi.com/1424-8220/23/24/9650/pdf
