Abstract
The accuracy of automatic recognition systems for spontaneous speech remains far below that of systems for prepared speech. The reason is that spontaneous speech is not as smooth and failure-free as prepared speech. Spontaneous speech varies from speaker to speaker: the quality of phoneme pronunciation, the presence of pauses, speech disruptions, and extralinguistic items (laughing, coughing, sneezing, chuckling when expressing emotions such as irritation, etc.) interrupt the fluency of verbal speech. However, extralinguistic items very often carry important paralinguistic information, so it is crucial for automatic spontaneous speech recognition systems not only to identify such phenomena and distinguish them from the verbal components of speech but also to classify them. This review analyzes works on the automatic detection and analysis of extralinguistic items in spontaneous speech. It considers both individual methods and approaches to recognizing extralinguistic items in a speech stream and works on the multiclass classification of extralinguistic units recorded in isolation. The most widely used methods for analyzing extralinguistic units are neural networks, such as deep neural networks and transformer-based models. The basic concepts related to the term "extralinguistic items" are given, an original systematization of extralinguistic items in the Russian language is proposed, corpora and databases of spoken speech in Russian and in other languages are described, and data sets of extralinguistic items recorded in isolation are also listed.
Recognition accuracy for extralinguistic items improves under two conditions of working with the speech signal. First, pre-processing the audio signals increases the accuracy of classifying separately recorded extralinguistic items. Second, taking context into account (analyzing several frames of the speech signal) and applying filters to smooth the time series after feature-vector extraction increases the accuracy of frame-by-frame analysis of spontaneous speech.
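The two frame-level ideas mentioned above, context stacking and time-series smoothing, can be illustrated with a minimal sketch. It is not taken from any of the reviewed works: the function names `stack_context` and `median_smooth`, the choice of a median filter, the window sizes, the 0.5 threshold, and the toy score sequence are all assumptions made for illustration only.

```python
from statistics import median


def stack_context(frames, left=2, right=2):
    """Concatenate each frame's feature vector with its neighbours.

    Frames beyond the start or end of the signal are replaced by the
    nearest valid frame (edge repetition).
    """
    n = len(frames)
    stacked = []
    for t in range(n):
        window = []
        for k in range(t - left, t + right + 1):
            idx = min(max(k, 0), n - 1)  # clamp index to the valid range
            window.extend(frames[idx])
        stacked.append(window)
    return stacked


def median_smooth(scores, width=5):
    """Median-filter a per-frame score sequence (width must be odd).

    Windows are truncated at the edges of the sequence.
    """
    if width % 2 == 0:
        raise ValueError("width must be odd")
    half = width // 2
    n = len(scores)
    return [
        median(scores[max(0, t - half): min(n, t + half + 1)])
        for t in range(n)
    ]


# Hypothetical per-frame scores for an extralinguistic class (e.g. laughter).
scores = [0.1, 0.2, 0.9, 0.1, 0.2, 0.8, 0.9, 0.85, 0.1, 0.9, 0.8]
smoothed = median_smooth(scores, width=3)

# Threshold the smoothed scores to obtain frame-level decisions.
labels = [int(s >= 0.5) for s in smoothed]
```

In this toy sequence the isolated spike at the third frame and the isolated dip at the ninth frame are both removed by the filter, so the thresholded labels form one contiguous detected segment instead of several fragments.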