We address the problem of localizing individual speakers in a scene containing several people engaged in conversation. We use a human-like configuration of sensors (binaural and binocular) to gather both auditory and visual observations. We show that the localization problem can be recast as the task of clustering the audio-visual observations into coherent groups. We propose a probabilistic generative model that captures the relations between audio and visual observations. This model maps the data to a representation of the common 3D scene-space via a pair of Gaussian mixture models. Inference is performed by a version of the Expectation-Maximization algorithm, which provides cooperative estimates of both the activity (speaking or not) and the 3D position of each speaker. © 2008 Springer-Verlag Berlin Heidelberg.
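The clustering machinery behind the abstract can be illustrated with a minimal sketch: Expectation-Maximization fitted to a spherical Gaussian mixture over 3D points. This is not the paper's coupled audio-visual model (which ties two mixtures, one per modality, to shared 3D speaker positions); it is only the single-mixture EM core that such a model builds on. All names (`em_gmm`, `n_iter`) are illustrative assumptions.

```python
import numpy as np

def em_gmm(X, K, n_iter=50, seed=0):
    """Fit a K-component spherical Gaussian mixture to X via EM.

    A minimal single-modality illustration; the paper couples two such
    mixtures (audio and visual) through common 3D speaker positions.
    """
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Initialise means from random data points, equal weights, unit variances.
    mu = X[rng.choice(N, K, replace=False)]
    var = np.ones(K)
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: responsibilities r[n, k] ∝ pi_k * N(x_n | mu_k, var_k * I),
        # computed in log-space for numerical stability.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        log_r = np.log(pi) - 0.5 * D * np.log(2 * np.pi * var) - d2 / (2 * var)
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances from responsibilities.
        Nk = r.sum(axis=0)
        pi = Nk / N
        mu = (r.T @ X) / Nk[:, None]
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        var = (r * d2).sum(axis=0) / (D * Nk)
    return pi, mu, var, r
```

In the full audio-visual model, the responsibilities additionally combine evidence from both sensor streams, so a cluster's posterior reflects whether a person is both visible and speaking.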
CITATION STYLE
Khalidov, V., Forbes, F., Hansard, M., Arnaud, E., & Horaud, R. (2008). Audio-visual clustering for 3D speaker localization. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 5237 LNCS, pp. 86–97). Springer Verlag. https://doi.org/10.1007/978-3-540-85853-9_8