This paper reviews definitions of audio-visual synchrony and examines their empirical behaviour on test sets up to 200 times larger than those used by other authors. The results give new insights into the practical utility of existing synchrony definitions and justify the application of audio-visual synchrony techniques to the problem of active speaker localisation in broadcast video. Performance is evaluated using a test set of twelve clips of alternating speakers from the multiple-speaker CUAVE corpus. Accuracy of 76% is obtained for the task of identifying the active member of a speaker pair at different points in time, comparable to the performance of two purely video image-based schemes. Accuracy of 65% is obtained on the more challenging task of locating a point within a 100 x 100 pixel square centred on the active speaker's mouth without prior face detection; the performance upper bound if perfect face detection were available is 69%. This result is significantly better than that of two purely video image-based schemes. © Springer-Verlag Berlin Heidelberg 2003.
CITATION STYLE
Nock, H. J., Iyengar, G., & Neti, C. (2003). Speaker localisation using audio-visual synchrony: An empirical study. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2728, 488–499. https://doi.org/10.1007/3-540-45113-7_48