Audio-visual speech-turn detection and tracking


Abstract

Speaker diarization is an important component of multiparty dialog systems, as it assigns speech-signal segments to individual participants. Diarization may well be viewed as the problem of detecting and tracking speech turns. It is proposed to address this problem by modeling the spatial coincidence of visual and auditory observations and by combining this coincidence model with a dynamic Bayesian formulation that tracks the identity of the active speaker. Speech-turn tracking is formulated as a latent-variable temporal graphical model, and an exact inference algorithm is proposed. We describe in detail an audio-visual discriminative observation model as well as a state-transition model. We also describe an implementation of a full system composed of multi-person visual tracking, sound-source localization, and the proposed online diarization technique. Finally, we show that the proposed method yields promising results on two challenging scenarios that were carefully recorded and annotated.
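The abstract describes tracking the identity of the active speaker as exact inference in a latent-variable temporal graphical model, driven by an audio-visual coincidence observation model. A minimal sketch of that idea, under illustrative assumptions (the Gaussian coincidence score, the sticky transition matrix, and all positions below are hypothetical stand-ins, not the paper's actual models), is HMM-style forward filtering over a discrete speaker state:

```python
import numpy as np

# Hypothetical sketch: online speech-turn tracking as forward filtering over
# a discrete latent variable (which person speaks, or nobody). The observation
# likelihood stands in for an audio-visual coincidence model: a sound-source
# localization (SSL) estimate is scored against each tracked person's visual
# position. All numeric values here are illustrative assumptions.

def coincidence_likelihood(sound_xy, person_xy, sigma=0.3):
    """Gaussian-shaped score for a sound at sound_xy coinciding with a person."""
    d2 = np.sum((np.asarray(sound_xy) - np.asarray(person_xy)) ** 2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def forward_step(belief, transition, likelihoods):
    """One exact-inference step: predict with the transition model,
    weight by per-state observation likelihoods, renormalize."""
    predicted = transition.T @ belief
    posterior = predicted * likelihoods
    return posterior / posterior.sum()

# Two tracked persons + a "nobody speaks" state; sticky self-transitions
# encode that speech turns persist over consecutive frames.
n_states = 3  # [person 0 speaks, person 1 speaks, nobody speaks]
transition = np.full((n_states, n_states), 0.05)
np.fill_diagonal(transition, 0.9)           # rows sum to 1.0
belief = np.full(n_states, 1.0 / n_states)  # uniform prior

persons = [(0.0, 0.0), (2.0, 0.0)]                        # visual-tracker outputs (assumed)
sound_estimates = [(0.1, 0.0), (0.05, -0.1), (1.9, 0.1)]  # SSL outputs (assumed)

decisions = []
for sound in sound_estimates:
    # Likelihood vector: one coincidence score per person, plus a small
    # floor likelihood for the silence state.
    lik = np.array([coincidence_likelihood(sound, p) for p in persons] + [0.1])
    belief = forward_step(belief, transition, lik)
    decisions.append(int(np.argmax(belief)))
    print("active speaker:", decisions[-1], "belief:", np.round(belief, 3))
```

Because the filter is recursive and each step costs only a small matrix-vector product, this kind of inference runs online, one frame at a time, which matches the abstract's emphasis on an online diarization technique. In this toy run the belief first locks onto person 0 (sounds near their position) and then switches to person 1 when the sound source moves.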

Citation (APA)

Gebru, I. D., Ba, S., Evangelidis, G., & Horaud, R. (2015). Audio-visual speech-turn detection and tracking. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 9237, pp. 143–151). Springer Verlag. https://doi.org/10.1007/978-3-319-22482-4_17
