Affiliations:
1. Rochester Institute of Technology
2. University of Rochester
3. Carnegie Mellon University
Abstract
Real-time captioning enables deaf and hard of hearing (DHH) people to follow classroom lectures and other aural speech by converting it into visual text with less than a five-second delay. Keeping the delay short allows end-users to follow and participate in conversations. This article focuses on the fundamental problem that makes real-time captioning difficult: sequential keyboard typing is much slower than speaking. We first surveyed the audio characteristics of 240 one-hour-long captioned lectures on YouTube, such as the speed and duration of speaking bursts. We then analyzed how these characteristics impact caption generation and readability, considering specifically our human-powered collaborative captioning approach. We note that most of these characteristics are also present in more general domains. For our caption comparison evaluation, we transcribed a classroom lecture in real time using all three captioning approaches: collaborative captioning, automatic speech recognition (ASR), and professional captioning. We recruited 48 participants (24 DHH) to watch these classroom transcripts in an eye-tracking laboratory, with the captions presented in a randomized, balanced order. We show that both hearing and DHH participants preferred and followed collaborative captions better than those generated by ASR or professionals, owing to the more consistent flow of the resulting captions. These results show the potential to reliably capture speech even during sudden bursts of speed, as well as to generate “enhanced” captions, unlike other human-powered captioning approaches.
Funder
National Science Foundation
Rochester Institute of Technology
National Technical Institute for the Deaf
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Science Applications, Human-Computer Interaction
Cited by
45 articles.