Audio-visual clustering for 3D speaker localization

Abstract

We address the problem of localizing individual speakers in a scene where several people are engaged in conversation. We use a human-like configuration of sensors (binaural and binocular) to gather both auditory and visual observations. We show that the localization problem can be recast as the task of clustering the audio-visual observations into coherent groups. We propose a probabilistic generative model that captures the relations between the audio and visual observations. The model maps the data to a representation of the common 3D scene space via a pair of Gaussian mixture models. Inference is performed by a variant of the expectation-maximization (EM) algorithm, which provides cooperative estimates of both the activity (speaking or not) and the 3D position of each speaker. © 2008 Springer-Verlag Berlin Heidelberg.
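To make the abstract's idea concrete: a pair of Gaussian mixtures over the visual and auditory observation spaces can be tied to a common set of 3D speaker positions, and EM can alternate between assigning observations to speakers and re-estimating those positions from the pooled evidence. The sketch below is only an illustration of that scheme, not the paper's model: it assumes isotropic noise and *linear* maps A and B from a 3D position into each observation space (the paper uses the actual binocular and binaural sensor geometry), so that each observation model is p(f_m) = Σ_n π_n N(f_m; A s_n, σ_f² I) and p(g_k) = Σ_n λ_n N(g_k; B s_n, σ_g² I), and the M-step for each position is a closed-form weighted least-squares solve.

```python
import numpy as np

def _resp(X, mu, weights, var):
    """Gaussian responsibilities under isotropic covariance var * I."""
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # (obs, speakers)
    logp = np.log(weights)[None, :] - 0.5 * d2 / var
    logp -= logp.max(axis=1, keepdims=True)                 # numerical stability
    p = np.exp(logp)
    return p / p.sum(axis=1, keepdims=True)

def em_av_clustering(F, G, A, B, n_speakers, n_iter=50, seed=0):
    """EM sketch: cluster audio-visual observations around 3D speaker positions.

    F: (M, df) visual observations; G: (K, dg) auditory observations.
    A (df, 3) and B (dg, 3) are assumed linear sensor maps (an assumption of
    this sketch; the paper's mappings follow the sensor geometry).
    """
    rng = np.random.default_rng(seed)
    S = rng.normal(size=(n_speakers, 3))           # 3D speaker positions
    w_f = np.full(n_speakers, 1.0 / n_speakers)    # visual mixing weights
    w_g = np.full(n_speakers, 1.0 / n_speakers)    # auditory mixing weights
    var_f = var_g = 1.0                            # isotropic noise variances

    for _ in range(n_iter):
        # E-step: responsibility of each speaker for each observation.
        rf = _resp(F, S @ A.T, w_f, var_f)         # (M, N)
        rg = _resp(G, S @ B.T, w_g, var_g)         # (K, N)

        # M-step: each s_n solves a weighted least-squares problem pooling
        # the visual and auditory evidence assigned to speaker n -- this is
        # where the two modalities cooperate.
        for n in range(n_speakers):
            H = (rf[:, n].sum() / var_f) * (A.T @ A) \
              + (rg[:, n].sum() / var_g) * (B.T @ B)
            b = A.T @ (rf[:, n] @ F) / var_f + B.T @ (rg[:, n] @ G) / var_g
            S[n] = np.linalg.solve(H, b)
        w_f, w_g = rf.mean(axis=0), rg.mean(axis=0)
        var_f = (rf * ((F[:, None, :] - (S @ A.T)[None]) ** 2).sum(-1)).sum() \
              / (rf.sum() * F.shape[1])
        var_g = (rg * ((G[:, None, :] - (S @ B.T)[None]) ** 2).sum(-1)).sum() \
              / (rg.sum() * G.shape[1])

    # Illustrative proxy for speaking activity: the total auditory
    # responsibility a speaker attracts (not the paper's exact criterion).
    activity = rg.sum(axis=0)
    return S, activity
```

A toy run under the same assumptions, with two simulated speakers observed through random linear sensors:

```python
rng = np.random.default_rng(1)
A, B = rng.normal(size=(4, 3)), rng.normal(size=(2, 3))
true_S = np.array([[1.0, 0.0, 2.0], [-1.0, 0.5, 3.0]])
F = np.vstack([s @ A.T + 0.1 * rng.normal(size=(60, 4)) for s in true_S])
G = np.vstack([s @ B.T + 0.1 * rng.normal(size=(40, 2)) for s in true_S])
S_hat, activity = em_av_clustering(F, G, A, B, n_speakers=2)
```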

Citation (APA)

Khalidov, V., Forbes, F., Hansard, M., Arnaud, E., & Horaud, R. (2008). Audio-visual clustering for 3D speaker localization. In Lecture Notes in Computer Science (Vol. 5237, pp. 86–97). Springer-Verlag. https://doi.org/10.1007/978-3-540-85853-9_8
