Multimodal speaker diarization utilizing face clustering information


This article is free to access.

Abstract

Multimodal speaker diarization answers the question "who spoke when" by using both audio and visual information. Diarization consists of two steps: first, segmentation of the audio stream and detection of the speech segments; then, clustering of the speech segments to group them by speaker. This task has mainly been studied on audiovisual data from meetings, news broadcasts, or talk shows. In this paper, we use visual information, namely face clustering, to aid speaker clustering. We tested the proposed method on three full-length movies, a scenario much more difficult than those used so far, since there is no certainty that speech segments and on-screen appearances of actors will always overlap. The results show that visual information can improve speaker clustering accuracy and hence the diarization process.
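The fusion step described above can be illustrated with a toy sketch: speech segments, represented as embedding vectors, are assigned to speaker identities anchored by face-cluster centroids. This is a minimal illustration only, not the paper's actual method; the function names, the 2-D embeddings, and the nearest-centroid rule are all assumptions for the example.

```python
import numpy as np

def assign_segments_to_faces(segment_embeddings, face_centroids):
    """Toy multimodal fusion: label each speech segment with the index of
    the nearest face-cluster centroid (hypothetical representation; real
    systems use richer audio features and fusion strategies)."""
    labels = []
    for emb in segment_embeddings:
        dists = np.linalg.norm(face_centroids - emb, axis=1)
        labels.append(int(np.argmin(dists)))
    return labels

# Two hypothetical speakers, each represented by a face-cluster centroid.
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
# Three speech segments embedded in the same (illustrative) space.
segments = np.array([[0.2, -0.1], [4.8, 5.1], [0.1, 0.3]])
print(assign_segments_to_faces(segments, centroids))  # [0, 1, 0]
```

In a movie setting, as the abstract notes, a segment's speaker may not appear on screen at all, so a real system must handle segments far from every face centroid rather than force an assignment.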

Citation (APA)

Kapsouras, I., Tefas, A., Nikolaidis, N., & Pitas, I. (2015). Multimodal speaker diarization utilizing face clustering information. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 9218, pp. 547–554). Springer Verlag. https://doi.org/10.1007/978-3-319-21963-9_50
