Dynamic Acoustic Unit Augmentation with BPE-Dropout for Low-Resource End-to-End Speech Recognition

Published: 2021-04-28 Issue: 9 Volume: 21 Page: 3063
ISSN: 1424-8220
Container-title: Sensors
Language: en
Short-container-title: Sensors

Author:

Laptev Aleksandr, Andrusenko Andrei, Podluzhny Ivan, Mitrofanov Anton, Medennikov Ivan, Matveev Yuri

Abstract

With the rapid development of speech assistants, adapting server-intended automatic speech recognition (ASR) solutions to a direct device has become crucial. For on-device speech recognition tasks, researchers and industry prefer end-to-end ASR systems as they can be made resource-efficient while maintaining a higher quality compared to hybrid systems. However, building end-to-end models requires a significant amount of speech data. Personalization, which is mainly handling out-of-vocabulary (OOV) words, is another challenging task associated with speech assistants. In this work, we consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate, embodied in Babel Turkish and Babel Georgian tasks. We propose a method of dynamic acoustic unit augmentation based on the Byte Pair Encoding with dropout (BPE-dropout) technique. The method non-deterministically tokenizes utterances to extend the token’s contexts and to regularize their distribution for the model’s recognition of unseen words. It also reduces the need for optimal subword vocabulary size search. The technique provides a steady improvement in regular and personalized (OOV-oriented) speech recognition tasks (at least 6% relative word error rate (WER) and 25% relative F-score) at no additional computational cost. Owing to the BPE-dropout use, our monolingual Turkish Conformer has achieved a competitive result with 22.2% character error rate (CER) and 38.9% WER, which is close to the best published multilingual system.
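The non-deterministic tokenization mentioned in the abstract can be illustrated with a minimal Python sketch of the general BPE-dropout idea: at every step, each candidate merge is skipped with some probability, so the same word can be split into different subword units on different passes. This is not the authors' implementation. The function name `bpe_dropout_encode`, the toy merge table, and the dropout values are assumptions chosen for illustration only.

```python
import random


def bpe_dropout_encode(word, merges, dropout=0.1, rng=None):
    """Segment `word` using a ranked BPE merge table, skipping each
    candidate merge with probability `dropout` (hypothetical helper)."""
    rng = rng or random.Random()
    # Earlier entries in the merge table have higher priority (lower rank).
    ranks = {pair: i for i, pair in enumerate(merges)}
    tokens = list(word)  # start from single characters
    while len(tokens) > 1:
        # Adjacent pairs that are in the merge table and survive dropout.
        candidates = [
            (ranks[(a, b)], i)
            for i, (a, b) in enumerate(zip(tokens, tokens[1:]))
            if (a, b) in ranks and rng.random() >= dropout
        ]
        if not candidates:
            break  # nothing left to merge on this pass
        _, i = min(candidates)  # best-ranked merge, leftmost on ties
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
    return tokens


# Toy merge table (an assumption, not taken from the paper).
merges = [("l", "o"), ("lo", "w"), ("e", "r")]

print(bpe_dropout_encode("lower", merges, dropout=0.0))  # plain BPE
print(bpe_dropout_encode("lower", merges, dropout=1.0))  # characters only
print(bpe_dropout_encode("lower", merges, dropout=0.5, rng=random.Random(0)))
```

With `dropout=0.0` the sketch reduces to deterministic BPE, and with `dropout=1.0` it falls back to single characters. Intermediate values produce varying segmentations of the same word, which is the source of the augmentation effect the abstract describes.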

Funder

ITMO University

Publisher

MDPI AG

Subject

Electrical and Electronic Engineering; Biochemistry; Instrumentation; Atomic and Molecular Physics, and Optics; Analytical Chemistry

Link

https://www.mdpi.com/1424-8220/21/9/3063/pdf

References: 54 articles.

1. Voice Report: From Answers to Action: Customer Adoption of Voice Technology and Digital Assistants; Microsoft. https://about.ads.microsoft.com/en-us/insights/2019-voice-report

2. User Experience with Smart Voice Assistants: The Accent Perspective

3. A Streaming On-Device End-To-End Model Surpassing Server-Side Conventional Model Quality and Latency

4. Conv-Transformer Transducer: Low Latency, Low Frame Rate, Streamable End-to-End Speech Recognition

5. Efficient Voice Trigger Detection for Low Resource Hardware

Cited by 8 articles.

1. Deep Models for Low-Resourced Speech Recognition: Livvi-Karelian Case;Mathematics;2023-09-05

2. Adapting Off-the-Shelf Speech Recognition Systems for Novel Words;Information;2023-03-13

3. Deep Learning Framework for Controlling Work Sequence in Collaborative Human–Robot Assembly Processes;Sensors;2023-01-03

4. Audio Augmentation for Non-Native Children’s Speech Recognition through Discriminative Learning;Entropy;2022-10-19

5. Vocabulary Expansion for the Sub-word WFST-Based Automatic Speech Recognition System;Proceedings of the Future Technologies Conference (FTC) 2022, Volume 3;2022-10-14