A Linear Memory CTC-Based Algorithm for Text-to-Voice Alignment of Very Long Audio Recordings

Published: 2023-01-31 Issue: 3 Volume: 13 Page: 1854
ISSN: 2076-3417
Container-title: Applied Sciences
Language: en
Short-container-title: Applied Sciences

Author:

Doras Guillaume¹, Teytaut Yann¹, Roebel Axel¹

Affiliation:

1. Analysis/Synthesis Team - STMS UMR 9912, IRCAM, Sorbonne University, CNRS, French Ministry of Culture, 1 Place Igor Stravinsky, 75004 Paris, France

Abstract

Synchronisation of a voice recording with the corresponding text is a common task in speech and music processing, and is used in many practical applications (automatic subtitling, audio indexing, etc.). A common approach derives a mid-level feature from the audio and aligns it to the text by maximizing a similarity measure via Dynamic Time Warping (DTW). Recently, a Connectionist Temporal Classification (CTC) approach was proposed that directly emits character probabilities and uses those to find the optimal text-to-voice alignment. While this method yields promising results, the memory complexity of the optimal alignment search remains quadratic in the input lengths, limiting its application to relatively short recordings. In this work, we describe how recent improvements to the textbook DTW algorithm can be adapted to the CTC context to achieve linear memory complexity. We then detail our overall solution and demonstrate that it can align text to several hours of audio with a mean alignment error of 50 ms for speech, and 120 ms for singing voice, which corresponds to a median alignment error that is below 50 ms for both voice types. Finally, we evaluate its robustness to transcription errors and different languages.
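
To make the memory issue in the abstract concrete, the following is a minimal sketch of the textbook quadratic-memory CTC Viterbi alignment, i.e. the baseline the abstract describes as limiting, not the linear-memory algorithm proposed in the paper. The function name `ctc_forced_align`, the `blank=0` convention, and the toy probabilities are assumptions made for illustration only.

```python
import math

def ctc_forced_align(log_probs, target, blank=0):
    """Return the best frame-level label path for `target` (hypothetical helper).

    log_probs: list of T frames, each a list of per-label log-probabilities.
    target:    list of label ids (without blanks).
    """
    T = len(log_probs)
    # Extended sequence: blank, l1, blank, l2, ..., blank (length S = 2L + 1).
    ext = [blank]
    for lab in target:
        ext += [lab, blank]
    S = len(ext)
    NEG = -math.inf
    # Two T x S tables: this is the quadratic memory cost.
    score = [[NEG] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    score[0][0] = log_probs[0][ext[0]]
    if S > 1:
        score[0][1] = log_probs[0][ext[1]]
    for t in range(1, T):
        for s in range(S):
            # Allowed moves: stay, advance by one, or skip the blank
            # between two different labels.
            cands = [(score[t - 1][s], s)]
            if s >= 1:
                cands.append((score[t - 1][s - 1], s - 1))
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append((score[t - 1][s - 2], s - 2))
            best, prev = max(cands)
            if best > NEG:
                score[t][s] = best + log_probs[t][ext[s]]
                back[t][s] = prev
    # A valid path ends on the final blank or on the final label.
    s = S - 1
    if S > 1 and score[T - 1][S - 2] > score[T - 1][S - 1]:
        s = S - 2
    if score[T - 1][s] == NEG:
        raise ValueError("target cannot be aligned to this many frames")
    # Backtracking needs the full table of back-pointers.
    path = [blank] * T
    for t in range(T - 1, -1, -1):
        path[t] = ext[s]
        s = back[t][s]
    return path

# Toy example: 4 frames, 3 labels (0 = blank), target "1 2".
probs = [[0.1, 0.8, 0.1], [0.1, 0.8, 0.1], [0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]
frames = [[math.log(p) for p in row] for row in probs]
print(ctc_forced_align(frames, [1, 2]))  # [1, 1, 0, 2]
```

The best score alone could be computed with two rows of the table, but recovering the path requires the back-pointers for every frame. For hours of audio and a long transcript, that T x S table is what becomes prohibitive, which is the problem the abstract says the linear-memory adaptation addresses.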

Funder

French National Research Agency (Agence Nationale de la Recherche—ANR) as part of the ARS project

Publisher

MDPI AG

Subject

Fluid Flow and Transfer Processes, Computer Science Applications, Process Chemistry and Technology, General Engineering, Instrumentation, General Materials Science

Link

https://www.mdpi.com/2076-3417/13/3/1854/pdf

Cited by 2 articles.

1. Study on an English Speaking Practice System based on Automatic Speech Recognition Technology;Journal of Education and Educational Research;2023-06-26

2. Rediscovering Automatic Detection of Stuttering and Its Subclasses through Machine Learning—The Impact of Changing Deep Model Architecture and Amount of Data in the Training Set;Applied Sciences;2023-05-18