An audio-visual particle filter for speaker tracking on the CLEAR'06 evaluation dataset

Abstract

We present an approach for tracking a lecturer during the course of his speech. We use features from multiple cameras and microphones and process them in a joint particle filter framework. The filter performs sampled projections of 3D location hypotheses and scores them using features from both audio and video. On the video side, the features are based on foreground segmentation, multi-view face detection and upper-body detection. On the audio side, the time delays of arrival between pairs of microphones are estimated with a generalized cross correlation function. In the CLEAR'06 evaluation, the system achieved a multiple object tracking accuracy (MOTA) of 71% for video-only, 55% for audio-only and 90% for combined audio-visual tracking.
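The audio cue described above rests on time-delay-of-arrival (TDOA) estimates obtained from a generalized cross correlation. The abstract gives no implementation details, so the sketch below is only an illustrative Python version using the common phase-transform (PHAT) weighting; the function name `gcc_phat`, the sampling rate and the `max_tau` search bound are assumptions for the example, not details from the paper.

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None, interp=1):
    """Estimate the time delay of arrival between two microphone signals
    with the generalized cross correlation, PHAT-weighted (GCC-PHAT).
    Returns the delay of `sig` relative to `ref` in seconds."""
    n = sig.shape[0] + ref.shape[0]
    # Cross-power spectrum of the two signals
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    r = SIG * np.conj(REF)
    # PHAT weighting: discard magnitude, keep only phase information
    r /= np.abs(r) + 1e-15
    cc = np.fft.irfft(r, n=interp * n)
    max_shift = interp * n // 2
    if max_tau is not None:
        # Restrict the search to physically plausible delays
        max_shift = min(int(interp * fs * max_tau), max_shift)
    # Re-order so that zero lag sits in the middle of the window
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / float(interp * fs)

# Toy check: one channel is the other delayed by 80 samples (5 ms at 16 kHz)
fs = 16000
rng = np.random.default_rng(0)
ref = rng.standard_normal(fs)
sig = np.concatenate((np.zeros(80), ref[:-80]))
print(gcc_phat(sig, ref, fs, max_tau=0.01))   # prints roughly 0.005
```

In the multi-microphone setup described in the paper, such pairwise delay estimates would be computed for many microphone pairs and used, together with the video features, to score the plausibility of each projected 3D particle; the exact scoring function is not specified in the abstract.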

Citation

Nickel, K., Gehrig, T., Ekenel, H. K., McDonough, J., & Stiefelhagen, R. (2007). An audio-visual particle filter for speaker tracking on the CLEAR’06 evaluation dataset. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 4122 LNCS, pp. 69–80). Springer Verlag. https://doi.org/10.1007/978-3-540-69568-4_4
