Multi-level fusion of audio and visual features for speaker identification

Abstract

This paper explores the fusion of audio and visual evidence through a multi-level hybrid fusion architecture based on a dynamic Bayesian network (DBN), which combines model-level and decision-level fusion to achieve higher performance. In model-level fusion, a new audio-visual correlative model (AVCM) based on the DBN is proposed, which describes both the inter-correlations and the loose timing synchronicity between the audio and video streams. Experiments on the CMU database and our own database both demonstrate that the methods improve the accuracy of audio-visual bimodal speaker identification at all acoustic signal-to-noise ratios (SNRs) from 0 dB to 30 dB under varying acoustic conditions. © Springer-Verlag Berlin Heidelberg 2005.
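The abstract does not reproduce the AVCM equations, but the decision-level half of the architecture can be illustrated with a minimal sketch, assuming the common approach of combining per-speaker log-likelihoods from the audio and visual classifiers with an SNR-dependent weight. All function names, scores, and the weight value below are hypothetical and are not taken from the paper.

```python
import numpy as np

def fuse_decision_level(audio_loglik, visual_loglik, alpha):
    """Weighted sum of per-speaker log-likelihoods from each modality.

    alpha in [0, 1] weights the audio stream; a lower alpha discounts
    the audio evidence under noisy acoustic conditions (low SNR).
    """
    return alpha * audio_loglik + (1.0 - alpha) * visual_loglik

# Hypothetical scores for three enrolled speakers (higher = better match).
audio_scores = np.array([-41.2, -38.7, -45.0])   # degraded by acoustic noise
visual_scores = np.array([-12.1, -15.4, -13.3])  # unaffected by acoustic SNR

# At low SNR, trust the visual stream more (alpha chosen illustratively).
fused = fuse_decision_level(audio_scores, visual_scores, alpha=0.3)
print("identified speaker:", int(np.argmax(fused)))
```

In this sketch the identified speaker is the argmax of the fused scores; the paper's contribution at the model level, by contrast, is to couple the two streams inside a single DBN so their correlation and loose synchrony are modeled jointly rather than merged only after classification.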

Citation (APA)

Wu, Z., Cai, L., & Meng, H. (2006). Multi-level fusion of audio and visual features for speaker identification. In Lecture Notes in Computer Science (Vol. 3832, pp. 493–499). Springer. https://doi.org/10.1007/11608288_66
