Server-side rescoring of spoken entity-centric knowledge queries for virtual assistants

Published:2024-06 Issue:2 Volume:27 Page:367-375
ISSN:1381-2416
Container-title:International Journal of Speech Technology
language:en
Short-container-title:Int J Speech Technol

Author:

Zhang Youyuan, Gondala Sashank, Fraga-Silva Thiago, Van Gysel Christophe

Abstract

On-device virtual assistants (VAs) powered by automatic speech recognition (ASR) require effective knowledge integration for the challenging task of recognizing entity-rich queries. In this paper, we conduct an empirical study of modeling strategies for server-side rescoring of spoken information-domain queries using several categories of language models (LMs): N-gram word LMs and sub-word neural LMs. We investigate the combination of on-device and server-side signals, and demonstrate significant relative word error rate improvements of 23%-35% on various entity-centric query subpopulations by integrating server-side LMs, compared to performing ASR on-device only. We also compare LMs trained on domain data against a generative pre-trained Transformer (GPT) model, a GPT-3 variant offered by OpenAI, used as a baseline. Furthermore, we show that model fusion of multiple server-side LMs trained from scratch most effectively combines the complementary strengths of each model and integrates knowledge learned from domain-specific data into a VA ASR system.
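The abstract does not give the scoring formula, so the following is only a minimal sketch of what rescoring an N-best list with fused server-side LMs could look like. It assumes a log-linear combination of the on-device score with weighted LM log-probabilities. The function names, the toy unigram scorers, the weights, and the example hypotheses are all hypothetical and are not taken from the paper.

```python
def make_unigram_lm(logprobs, unk=-10.0):
    """Return a toy scorer: sum of per-word log-probs (stand-in for a real LM)."""
    def score(text):
        return sum(logprobs.get(word, unk) for word in text.split())
    return score


def rescore_nbest(nbest, server_lms, weights, on_device_weight=1.0):
    """Re-rank (hypothesis, on_device_log_score) pairs.

    Assumed fusion rule (not from the paper): the final score is the
    weighted on-device score plus a weighted sum of server-side LM scores.
    """
    rescored = []
    for text, device_score in nbest:
        total = on_device_weight * device_score
        for lm, weight in zip(server_lms, weights):
            total += weight * lm(text)
        rescored.append((text, total))
    # Highest combined log-score first.
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Two hypothetical server-side LMs that both know the entity "sade".
lm_a = make_unigram_lm({"who": -1.0, "is": -1.0, "sade": -3.0, "shar": -6.0, "day": -2.0})
lm_b = make_unigram_lm({"who": -1.2, "is": -0.8, "sade": -2.5, "shar": -7.0, "day": -2.5})

# The on-device recognizer slightly prefers the wrong transcription.
nbest = [("who is shar day", -3.0), ("who is sade", -3.5)]

ranked = rescore_nbest(nbest, [lm_a, lm_b], weights=[0.5, 0.5])
print(ranked[0][0])
```

In this toy setup the on-device score alone ranks the misrecognized hypothesis first, while the fused server-side scores move the entity-bearing hypothesis to the top. In practice the weights would be tuned on held-out data.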

Publisher

Springer Science and Business Media LLC

Link

https://link.springer.com/content/pdf/10.1007/s10772-024-10102-y.pdf

References: 38 articles.

1. Achanta, S., Antony, A., Golipour, L., Li, J., Raitio, T., Rasipuram, R., Rossi, F., Shi, J., Upadhyay, J., Winarsky, D., & Zhang, H. (2021). On-device neural speech synthesis. In ASRU (pp. 1155–1161). IEEE.

2. Bengio, Y., Ducharme, R., & Vincent, P. (2000). A neural probabilistic language model. NeurIPS.

3. Brown, T., Mann, B., Ryder, N., ... Amodei, D. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877–1901.

4. Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., & Amodei, D. (2017). Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30.

5. Devlin, J., Chang, M. W., Lee, K., et al. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL (pp. 4171–4186).