Abstract
Compact multi-sensor platforms are portable and thus desirable for robotics and personal-assistance tasks. However, compared to physically distributed sensors, the compact size of these platforms makes person tracking more difficult. To address this challenge, we propose a novel 3-D audio-visual people tracker that exploits visual observations (object detections) to guide the acoustic processing by constraining the acoustic likelihood on the horizontal plane defined by the predicted height of a speaker. This solution allows the tracker to estimate, with a small microphone array, the distance of a sound source. Moreover, we apply a color-based visual likelihood on the image plane to compensate for misdetections. Finally, we use a 3-D particle filter and greedy data association to combine visual observations with color-based and acoustic likelihoods to track the positions of multiple simultaneous speakers. We compare the proposed multimodal 3-D tracker against two state-of-the-art methods on the AV16.3 dataset and on a newly collected dataset with co-located sensors, which we make available to the research community. Experimental results show that our multimodal approach outperforms the other methods both in 3-D and on the image plane.
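The fusion idea described in the abstract can be illustrated with a minimal sketch of one bootstrap particle filter step for a single speaker. This is not the authors' implementation. The random-walk motion model, the Gaussian likelihood shapes, all standard deviations, and the names `pf_step`, `estimate`, `visual_xy`, and `acoustic_xy` are assumptions made for illustration. The abstract's color-based likelihood and greedy data association are omitted.

```python
import math
import random


def gaussian(x, mu, sigma):
    """Unnormalised Gaussian likelihood of x around mu."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2)


def pf_step(particles, visual_xy, acoustic_xy, height, rng,
            motion_std=0.05, vis_std=0.2, ac_std=0.4):
    """One predict/weight/resample step over 3-D particles (x, y, z) in metres.

    visual_xy and acoustic_xy are hypothetical 2-D position cues; either may
    be None when that modality gives no observation. height is the predicted
    speaker height that defines the plane used for the acoustic cue.
    """
    # Predict: random-walk motion model (an assumption, not from the paper).
    moved = [tuple(c + rng.gauss(0.0, motion_std) for c in p) for p in particles]

    weights = []
    for x, y, z in moved:
        w = 1.0
        if visual_xy is not None:
            w *= gaussian(x, visual_xy[0], vis_std) * gaussian(y, visual_xy[1], vis_std)
        if acoustic_xy is not None:
            # The acoustic cue is evaluated on the plane z = height; penalising
            # particles away from that plane is an illustrative assumption.
            w *= gaussian(x, acoustic_xy[0], ac_std) * gaussian(y, acoustic_xy[1], ac_std)
            w *= gaussian(z, height, ac_std)
        weights.append(w)

    if sum(weights) == 0.0:
        return moved  # no usable evidence: keep the predicted particles

    # Resample in proportion to the fused weights.
    return rng.choices(moved, weights=weights, k=len(moved))


def estimate(particles):
    """Mean of the particle set as the position estimate."""
    n = len(particles)
    return tuple(sum(p[i] for p in particles) / n for i in range(3))
```

Multiplying the per-modality likelihoods treats the cues as conditionally independent given the speaker position, which is a common simplification. When one modality is missing, its factor is simply skipped, so the other cue still drives the update.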
Cite
Qian, X., Brutti, A., Lanz, O., Omologo, M., & Cavallaro, A. (2019). Multi-Speaker Tracking from an Audio-Visual Sensing Device. IEEE Transactions on Multimedia, 21(10), 2576–2588. https://doi.org/10.1109/TMM.2019.2902489