Child-adult speech diarization in naturalistic conditions of preschool classrooms using room-independent ResNet model and automatic speech recognition-based re-segmentation

Author:

Kothalkar Prasanna V.1,Hansen John H. L.1ORCID,Irvin Dwight2,Buzhardt Jay3

Affiliation:

1. Center for Robust Speech Systems (CRSS), Erik Jonsson School of Engineering and Computer Science, University of Texas at Dallas 1 , Richardson, Texas 75080, USA

2. Anita Zucker Center for Excellence in Early Childhood Studies, University of Florida, Gainesville 2 , Florida 32611, USA

3. Juniper Garden's Children's Project (JGCP), University of Kansas 3 , Kansas 66101, USA

Abstract

Speech and language development are early indicators of overall analytical and learning ability in children. The preschool classroom is a rich language environment for monitoring and ensuring growth in young children by measuring their vocal interactions with teachers and classmates. Early childhood researchers are naturally interested in analyzing naturalistic vs controlled lab recordings to measure both quality and quantity of such interactions. Unfortunately, present-day speech technologies are not capable of addressing the wide dynamic scenario of early childhood classroom settings. Due to the diversity of acoustic events/conditions in such daylong audio streams, automated speaker diarization technology would need to be advanced to address this challenging domain for segmenting audio as well as information extraction. This study investigates alternate deep learning-based lightweight, knowledge-distilled, diarization solutions for segmenting classroom interactions of 3–5 years old children with teachers. In this context, the focus on speech-type diarization which classifies speech segments as being either from adults or children partitioned across multiple classrooms. Our lightest CNN model achieves a best F1-score of ∼76.0% on data from two classrooms, based on dev and test sets of each classroom. It is utilized with automatic speech recognition-based re-segmentation modules to perform child-adult diarization. Additionally, F1-scores are obtained for individual segments with corresponding speaker tags (e.g., adult vs child), which provide knowledge for educators on child engagement through naturalistic communications. The study demonstrates the prospects of addressing educational assessment needs through communication audio stream analysis, while maintaining both security and privacy of all children and adults. The resulting child communication metrics have been used for broad-based feedback for teachers with the help of visualizations.

Funder

National Science Foundation

University of Texas at Dallas

Publisher

Acoustical Society of America (ASA)

Reference64 articles.

1. Lightweight CNN for robust voice activity detection,2020

2. wav2vec 2.0: A framework for self-supervised learning of speech representations;Adv. Neural Inf. Process. Syst.,2020

3. Accent recognition using i-vector, Gaussian mean supervector and Gaussian posterior probability supervector for spontaneous telephone speech,2013

4. Audio scene classification with discriminatively-trained segment-level features,2019

5. Voice activity detection based on time-delay neural networks,2019

同舟云学术

1.学者识别学者识别

2.学术分析学术分析

3.人才评估人才评估

"同舟云学术"是以全球学者为主线,采集、加工和组织学术论文而形成的新型学术文献查询和分析系统,可以对全球学者进行文献检索和人才价值评估。用户可以通过关注某些学科领域的顶尖人物而持续追踪该领域的学科进展和研究前沿。经过近期的数据扩容,当前同舟云学术共收录了国内外主流学术期刊6万余种,收集的期刊论文及会议论文总量共计约1.5亿篇,并以每天添加12000余篇中外论文的速度递增。我们也可以为用户提供个性化、定制化的学者数据。欢迎来电咨询!咨询电话:010-8811{复制后删除}0370

www.globalauthorid.com

TOP

Copyright © 2019-2024 北京同舟云网络信息技术有限公司
京公网安备11010802033243号  京ICP备18003416号-3