We address the problem of localizing individual speakers in a scene containing several people engaged in conversation. We use a human-like configuration of sensors (binaural and binocular) to gather both auditory and visual observations. We show that the localization problem can be recast as the task of clustering the audio-visual observations into coherent groups. We propose a probabilistic generative model that captures the relations between audio and visual observations. This model maps the data to a representation of the common 3D scene-space via a pair of Gaussian mixture models. Inference is performed by a version of the Expectation-Maximization algorithm, which provides cooperative estimates of both the activity (speaking or not) and the 3D position of each speaker. © 2008 Springer-Verlag Berlin Heidelberg.
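The clustering machinery behind the abstract can be illustrated with a minimal sketch: Expectation-Maximization fitted to a spherical Gaussian mixture over 3D points. This is not the paper's coupled audio-visual model (which ties two mixtures, one per modality, to shared 3D speaker positions); it is only the single-mixture EM core that such a model builds on. All names (`em_gmm`, `n_iter`) are illustrative assumptions.

```python
import numpy as np

def em_gmm(X, K, n_iter=50, seed=0):
    """Fit a K-component spherical Gaussian mixture to X via EM.

    A minimal single-modality illustration; the paper couples two such
    mixtures (audio and visual) through common 3D speaker positions.
    """
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Initialise means from random data points, equal weights, unit variances.
    mu = X[rng.choice(N, K, replace=False)]
    var = np.ones(K)
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: responsibilities r[n, k] ∝ pi_k * N(x_n | mu_k, var_k * I),
        # computed in log-space for numerical stability.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        log_r = np.log(pi) - 0.5 * D * np.log(2 * np.pi * var) - d2 / (2 * var)
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances from responsibilities.
        Nk = r.sum(axis=0)
        pi = Nk / N
        mu = (r.T @ X) / Nk[:, None]
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        var = (r * d2).sum(axis=0) / (D * Nk)
    return pi, mu, var, r
```

In the full audio-visual model, the responsibilities additionally combine evidence from both sensor streams, so a cluster's posterior reflects whether a person is both visible and speaking.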
CITATION STYLE
Khalidov, V., Forbes, F., Hansard, M., Arnaud, E., & Horaud, R. (2008). Audio-visual clustering for 3D speaker localization. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 5237 LNCS, pp. 86–97). Springer Verlag. https://doi.org/10.1007/978-3-540-85853-9_8