Analytical Review of Methods for Automatic Analysis of Extra-Linguistic Units in Spontaneous Speech

Published: 2024-01-11 Issue: 1 Volume: 23 Page: 5-38
ISSN: 2713-3206
Container-title: Informatics and Automation
Short-container-title: IA

Author:

Povolotskaia Anastasiia, Karpov Alexey

Abstract

The accuracy of automatic recognition of spontaneous speech remains well below that of systems recognizing prepared speech, because spontaneous speech is neither as smooth nor as free of disruptions as prepared speech. Spontaneous speech varies from speaker to speaker: the quality of phoneme pronunciation, the presence of pauses, speech disruptions, and extralinguistic items (laughing, coughing, sneezing, chuckling when expressing irritation, etc.) interrupt the fluency of verbal speech. Extralinguistic items nevertheless often carry important paralinguistic information, so automatic spontaneous speech recognition systems need not only to detect such phenomena and distinguish them from the verbal components of speech but also to classify them. This review analyzes works on the automatic detection and analysis of extralinguistic items in spontaneous speech. It covers both methods and approaches for recognizing extralinguistic items in a speech stream and works on the multiclass classification of extralinguistic units recorded in isolation. The most popular methods for analyzing extralinguistic units are neural networks, such as deep neural networks and transformer-based networks. The review gives the basic concepts related to the term "extralinguistic items", proposes an original systematization of extralinguistic items in the Russian language, describes corpora and databases of spoken audio in Russian and in other languages, and lists datasets of extralinguistic items recorded in isolation.
Recognition accuracy for extralinguistic items improves under the following conditions: pre-processing the audio signals increased classification accuracy for extralinguistic items recorded in isolation; taking context into account (analyzing several frames of the speech signal) and applying smoothing filters to the time series after feature-vector extraction increased accuracy in frame-by-frame analysis of spontaneous speech.
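
The two frame-level ideas mentioned above, context over several frames and smoothing of a time series, can be illustrated with a minimal sketch. It is not taken from the reviewed works: the function names `stack_context` and `median_smooth`, the window sizes, and the choice of a median filter are all assumptions made for illustration, since the abstract does not say which filters or context widths the surveyed systems use.

```python
from statistics import median


def stack_context(frames, left=2, right=2):
    """Concatenate each frame with its neighbours (hypothetical helper).

    `frames` is a list of feature vectors (lists of numbers). Positions that
    fall outside the signal are filled by repeating the nearest edge frame.
    """
    n = len(frames)
    stacked = []
    for t in range(n):
        window = []
        for k in range(t - left, t + right + 1):
            idx = min(max(k, 0), n - 1)  # clamp index to the valid range
            window.extend(frames[idx])
        stacked.append(window)
    return stacked


def median_smooth(series, width=5):
    """Median-filter a per-frame series (assumed filter choice).

    `width` must be odd; edges are handled by clamping indices, so the
    output has the same length as the input.
    """
    if width % 2 == 0:
        raise ValueError("width must be odd")
    half = width // 2
    n = len(series)
    return [
        median(series[min(max(k, 0), n - 1)] for k in range(t - half, t + half + 1))
        for t in range(n)
    ]


# Toy per-frame scores for one class (made-up numbers): an isolated spike at
# frame 2, then a sustained event over frames 5 to 7.
scores = [0.1, 0.1, 0.9, 0.1, 0.1, 0.8, 0.9, 0.8, 0.1]
print(median_smooth(scores, width=3))
# [0.1, 0.1, 0.1, 0.1, 0.1, 0.8, 0.8, 0.8, 0.1]

# Each one-dimensional frame becomes a three-dimensional vector with context.
print(stack_context([[1], [2], [3]], left=1, right=1))
# [[1, 1, 2], [1, 2, 3], [2, 3, 3]]
```

In this toy example the single-frame spike is suppressed while the three-frame event survives, which is the kind of effect smoothing is meant to have in frame-by-frame analysis. A real system would tune the context and filter widths to the frame rate and the typical duration of the events.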

Publisher

SPIIRAS
