Authors:
Pinto Dennis, Arnau José-María, Riera Marc, Cruz Josep-Llorenç, González Antonio
Abstract
With mobile and embedded devices becoming more integrated in our daily lives, the focus is increasingly shifting toward human-friendly interfaces, making automatic speech recognition (ASR) a central player as the ideal means of interaction with machines. ASR is essential for many cognitive computing applications, such as speech-based assistants, dictation systems and real-time language translation. Consequently, interest in speech technology has grown in recent years, with more systems being proposed and higher accuracy levels being achieved, in some cases even surpassing human accuracy. However, highly accurate ASR systems are computationally expensive, requiring on the order of billions of arithmetic operations to decode each second of audio, which conflicts with a growing interest in deploying ASR on edge devices. On these devices, efficient hardware acceleration is key to achieving acceptable performance. In this paper, we propose a technique to improve the energy efficiency and performance of ASR systems, focusing on low-power hardware for edge devices. We focus on optimizing the DNN-based acoustic model evaluation, which we have observed to be the main bottleneck in popular ASR systems, by leveraging run-time information from the beam search. By doing so, we reduce the energy and execution time of the acoustic model evaluation by 25.6% and 25.9%, respectively, with negligible accuracy loss.
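The core idea of the abstract, computing only the part of the acoustic model's output that the beam search actually consults, can be illustrated with a minimal sketch. This is not the paper's implementation: the toy output layer, the senone indexing, and the `pruned_scores` helper are all assumptions made for illustration; the real system operates on a full ASR pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy DNN acoustic model: we only model the final output layer,
# which maps hidden activations to per-senone scores.
N_HIDDEN, N_SENONES = 64, 512
W_out = rng.standard_normal((N_HIDDEN, N_SENONES)) * 0.1
b_out = rng.standard_normal(N_SENONES) * 0.1


def full_scores(h):
    """Baseline: score every senone for this frame."""
    return h @ W_out + b_out


def pruned_scores(h, active_senones):
    """Sketch of the optimization: the beam search only needs the scores
    of senones referenced by currently active hypotheses, so compute
    just those output columns and skip the rest."""
    cols = sorted(active_senones)
    return dict(zip(cols, h @ W_out[:, cols] + b_out[cols]))


h = rng.standard_normal(N_HIDDEN)   # hidden activations for one audio frame
active = {3, 17, 100}               # senones requested by the beam (assumed)

partial = pruned_scores(h, active)
reference = full_scores(h)
# The pruned evaluation agrees with the full one on the active senones,
# while touching only 3 of the 512 output columns.
assert all(np.isclose(partial[s], reference[s]) for s in active)
```

In this sketch the savings come from shrinking the output-layer matrix multiply from 512 columns to 3; the paper's reported 25.6% energy and 25.9% execution-time reductions apply to the full acoustic-model evaluation on their hardware, not to this toy example.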
Funder
CoCoUnit ERC Advanced Grant of the EU’s Horizon 2020
Spanish MICINN Ministry
Spanish State Research Agency
Catalan Agency for University and Research
ICREA Academia
Universitat Politècnica de Catalunya
Publisher
Springer Science and Business Media LLC
References (81 articles)
1. Alharbi S, Alrazgan M, Alrashed A et al (2021) Automatic speech recognition: systematic literature review. IEEE Access 9:131858–131876
2. Amazon (2014) Alexa. https://en.wikipedia.org/wiki/Amazon_Alexa [Online; accessed 22-Mar-2024]
3. Amodei D, Ananthanarayanan S, Anubhai R et al (2016) Deep Speech 2: end-to-end speech recognition in English and Mandarin. In: International Conference on Machine Learning, pp 173–182
4. Apple (2011) Siri. https://en.wikipedia.org/wiki/Siri [Online; accessed 22-Mar-2024]
5. Baevski A, Zhou Y, Mohamed A et al (2020) wav2vec 2.0: a framework for self-supervised learning of speech representations. Adv Neural Inf Process Syst 33:12449–12460