Dialocalization

Published:2010-11 Issue:4 Volume:6 Page:1-18
ISSN:1551-6857
Container-title:ACM Transactions on Multimedia Computing, Communications, and Applications
language:en
Short-container-title:ACM Trans. Multimedia Comput. Commun. Appl.

Author:

Friedland Gerald¹, Yeo Chuohao², Hung Hayley³

Affiliation:

1. International Computer Science Institute, Berkeley, CA

2. Institute for Infocomm Research, Singapore

3. IDIAP Research Institute, Martigny, Switzerland

Abstract

The following article presents a novel audio-visual approach for unsupervised speaker localization in both time and space and systematically analyzes its unique properties. Using recordings from a single, low-resolution room overview camera and a single far-field microphone, a state-of-the-art audio-only speaker diarization system (speaker localization in time) is extended so that both acoustic and visual models are estimated as part of a joint unsupervised optimization problem. The speaker diarization system first automatically determines the speech regions and estimates “who spoke when,” then, in a second step, the visual models are used to infer the location of the speakers in the video. We call this process “dialocalization.” The experiments were performed on real-world meetings using 4.5 hours of the publicly available AMI meeting corpus. The proposed system is able to exploit audio-visual integration to not only improve the accuracy of a state-of-the-art (audio-only) speaker diarization system, but also add visual speaker localization at little incremental engineering and computation cost. The combined algorithm has different properties, such as increased robustness, that cannot be observed in algorithms based on single modalities. The article describes the algorithm, presents benchmarking results, explains its properties, and systematically discusses the contributions of each modality.

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Networks and Communications, Hardware and Architecture

Link

https://dl.acm.org/doi/pdf/10.1145/1865106.1865111

Reference: 46 articles.

1. Recognizing Visual Focus of Attention From Head Pose in Natural Meetings

2. Lucas-Kanade 20 Years On: A Unifying Framework

Cited by 8 articles.

1. Who Said That?: Audio-Visual Speaker Diarisation of Real-World Meetings;Interspeech 2019;2019-09-15

2. Monaural Sound Localization Based on Reflective Structure and Homomorphic Deconvolution;Sensors;2017-09-23

3. Near-Field Sound Localization Based on the Small Profile Monaural Structure;Sensors;2015-11-13

4. Monaural Sound Localization Based on Structure-Induced Acoustic Resonance;Sensors;2015-02-06

5. Scalable multimedia content analysis on parallel platforms using python;ACM Transactions on Multimedia Computing, Communications, and Applications;2014-02