Query-by-Example Spoken Term Detection for Zero-Resource Languages Using Heuristic Search

Published: 2023-07-15
ISSN: 2375-4699
Container-title: ACM Transactions on Asian and Low-Resource Language Information Processing
Language: en
Short-container-title: ACM Trans. Asian Low-Resour. Lang. Inf. Process.

Author:

P Sudhakar¹, K Sreenivasa Rao, Mitra Pabitra²

Affiliation:

1. Advanced Technology Development Centre, Indian Institute of Technology, India

2. Computer Science and Engineering, Indian Institute of Technology, India

Abstract

Query-by-Example spoken content retrieval is a challenging task when a large volume of spoken content accumulates in repositories without annotation. In the absence of annotation, retrieval relies on capturing the similarities between the query and spoken terms directly from their acoustic feature representations. Dynamic Time Warping (DTW)-centric techniques identify the optimal alignment between the acoustic feature representations and capture the similarities between query and spoken terms. Although feasible, DTW-centric techniques produce many false alarms because of the variabilities in natural speech, which degrades performance. The proposed approach addresses these variability challenges in two stages. In the first stage, a speaker-independent acoustic feature representation is obtained from deep convolutional neural networks, reducing speaker variability. In the second stage, the similarities between the query and spoken term are captured using a heuristic search method. The proposed approach was compared with other state-of-the-art methods on the Microsoft Low-Resource Language speech corpus. Across languages, it achieved a 3% improvement in hit ratio and a 32% reduction in false alarm ratio.
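
For readers unfamiliar with the DTW baseline mentioned in the abstract, the following is a minimal sketch of classic whole-sequence DTW between two sequences of feature vectors. It is an illustration only, not the authors' method: the function name `dtw_distance`, the Euclidean frame distance, and the toy one-dimensional "features" are all assumptions, and practical spoken term detection systems usually use a subsequence variant that lets the query match anywhere inside a longer utterance.

```python
import math


def dtw_distance(query, reference):
    """Accumulated cost of the best monotonic alignment between two
    sequences of feature vectors (classic DTW, hypothetical helper)."""
    n, m = len(query), len(reference)
    # cost[i][j] = best accumulated cost aligning query[:i] with reference[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Frame-level distance (Euclidean is an assumption here)
            d = math.dist(query[i - 1], reference[j - 1])
            # Extend the cheapest of the three allowed predecessor cells
            cost[i][j] = d + min(cost[i - 1][j],      # query frame repeated
                                 cost[i][j - 1],      # reference frame repeated
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


# Toy one-dimensional "features": the second sequence is a time-stretched
# copy of the first, so the alignment cost is zero.
stretched = dtw_distance([[0.0], [1.0], [2.0]], [[0.0], [1.0], [1.0], [2.0]])
different = dtw_distance([[0.0], [0.0]], [[1.0]])
```

A low accumulated cost suggests the two sequences are acoustically similar. Because frame distances are computed on raw features, speaker and channel differences inflate the cost, which is the source of the false alarms the abstract describes.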

Publisher

Association for Computing Machinery (ACM)

Subject

General Computer Science

Link

https://dl.acm.org/doi/pdf/10.1145/3609505

Reference: 38 articles.

1. Ju-chieh Chou, Cheng-chieh Yeh, Hung-yi Lee, and Lin-shan Lee. 2018. Multi-target voice conversion without parallel data by adversarially learning disentangled audio representations. https://arxiv.org/abs/1804.02812

2. Model-based unsupervised spoken term detection with spoken queries; Lee Chan; IEEE Transactions on Audio, Speech, and Language Processing, 2013

3. Jia Cui, Brian Kingsbury, Bhuvana Ramabhadran, Abhinav Sethy, Kartik Audhkhasi, Xiaodong Cui, Ellen Kislal, Lidia Mangu, Markus Nussbaum-Thom, and Michael Picheny. 2015. Multilingual representations for low resource speech recognition and keyword search. In 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). IEEE, 259–266.

4. Jonathan G. Fiscus, Jerome Ajot, John S. Garofolo, and George Doddington. 2007. Results of the 2006 spoken term detection evaluation. In Proc. SIGIR, Vol. 7. 51–57.

5. Vikram Gupta, Jitendra Ajmera, Arun Kumar, and Ashish Verma. 2011. A language independent approach to audio search. In Twelfth Annual Conference of the International Speech Communication Association.