Multi-Speaker Tracking from an Audio-Visual Sensing Device

Abstract

Compact multi-sensor platforms are portable and thus desirable for robotics and personal-assistance tasks. However, compared to physically distributed sensors, the small size of these platforms makes person tracking more difficult. To address this challenge, we propose a novel 3-D audio-visual people tracker that exploits visual observations (object detections) to guide the acoustic processing by constraining the acoustic likelihood to the horizontal plane defined by the predicted height of a speaker. This solution allows the tracker to estimate, with a small microphone array, the distance of a sound source. Moreover, we apply a color-based visual likelihood on the image plane to compensate for misdetections. Finally, we use a 3-D particle filter and greedy data association to combine visual observations with color-based and acoustic likelihoods to track the positions of multiple simultaneous speakers. We compare the proposed multimodal 3-D tracker against two state-of-the-art methods on the AV16.3 dataset and on a newly collected dataset with co-located sensors, which we make available to the research community. Experimental results show that our multimodal approach outperforms the other methods both in 3-D and on the image plane.
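
The abstract mentions greedy data association but does not give its details. As a rough illustration only, the following Python sketch pairs predicted 3-D track positions with detections in order of increasing distance. The Euclidean cost, the gating threshold `max_dist`, and the function name `greedy_associate` are assumptions made for this example and are not taken from the paper.

```python
import math


def greedy_associate(tracks, detections, max_dist=1.0):
    """Greedily pair track positions with detections, closest pairs first.

    tracks, detections: lists of (x, y, z) tuples (hypothetical inputs).
    max_dist: assumed gating threshold; farther pairs are never matched.
    Returns a dict mapping track index -> detection index.
    """
    # Enumerate every candidate pair with its Euclidean distance.
    pairs = [
        (math.dist(t, d), i, j)
        for i, t in enumerate(tracks)
        for j, d in enumerate(detections)
    ]
    pairs.sort()  # ascending distance

    assigned = {}
    used_detections = set()
    for dist, i, j in pairs:
        if dist > max_dist:
            break  # all remaining pairs are even farther apart
        if i in assigned or j in used_detections:
            continue  # track or detection already matched
        assigned[i] = j
        used_detections.add(j)
    return assigned


# Two predicted speaker positions and three detections (made-up values).
tracks = [(0.0, 0.0, 1.7), (2.0, 1.0, 1.6)]
detections = [(2.1, 1.0, 1.6), (0.1, -0.1, 1.7), (5.0, 5.0, 1.5)]
matches = greedy_associate(tracks, detections)
```

Here each track is matched to its nearest detection, and the third detection stays unassigned because it lies beyond the gate. A greedy scheme of this kind is simple and fast, but unlike an optimal assignment it does not minimize the total cost over all pairs.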

Qian, X., Brutti, A., Lanz, O., Omologo, M., & Cavallaro, A. (2019). Multi-Speaker Tracking from an Audio-Visual Sensing Device. IEEE Transactions on Multimedia, 21(10), 2576–2588. https://doi.org/10.1109/TMM.2019.2902489
