Dynamic Bayesian networks for audio-visual speech recognition

Ara V. Nefian; Luhong Liang; Xiaobo Pi; Xiaoxing Liu; Kevin Murphy

Journal Article (Open Access)

Eurasip Journal on Applied Signal Processing (2002) 2002(11) 1274-1288

DOI: 10.1155/S1110865702206083

Abstract

The use of visual features in audio-visual speech recognition (AVSR) is justified both by the speech generation mechanism, which is essentially bimodal in audio and visual representation, and by the need for features that are invariant to acoustic noise perturbation. As a result, current AVSR systems demonstrate significant accuracy improvements in environments affected by acoustic noise. In this paper, we describe the use of two statistical models for audio-visual integration, the coupled HMM (CHMM) and the factorial HMM (FHMM), and compare the performance of these models with that of the existing models used in speaker-dependent audio-visual isolated word recognition. The statistical properties of both the CHMM and the FHMM make it possible to model the state asynchrony of the audio and visual observation sequences while preserving their natural correlation over time. In our experiments, the CHMM performs best overall, outperforming all the existing models and the FHMM.
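
As an illustration of the coupling idea mentioned in the abstract, the following is a minimal sketch of a scaled forward pass for a two-chain coupled HMM. It is not the authors' implementation. It assumes the common CHMM formulation in which each chain's next state depends on the previous states of both chains, and it uses discrete emissions purely to keep the example short. The function name `chmm_forward_loglik` and all parameter names are hypothetical.

```python
import math


def chmm_forward_loglik(pi_a, pi_v, A_a, A_v, B_a, B_v, obs_a, obs_v):
    """Log-likelihood of paired observation sequences under a two-chain coupled HMM.

    Assumed (illustrative) parameterization:
      pi_a[i], pi_v[j] : initial state probabilities of the audio and visual chains
      A_a[i][j][k]     : P(audio state k at t | audio state i, visual state j at t-1)
      A_v[i][j][l]     : P(visual state l at t | audio state i, visual state j at t-1)
      B_a[k][o]        : P(audio symbol o | audio state k)
      B_v[l][o]        : P(visual symbol o | visual state l)
    """
    n_a, n_v = len(pi_a), len(pi_v)

    # alpha[i][j] is proportional to P(observations so far, audio state i, visual state j).
    alpha = [[pi_a[i] * B_a[i][obs_a[0]] * pi_v[j] * B_v[j][obs_v[0]]
              for j in range(n_v)] for i in range(n_a)]

    loglik = 0.0
    for t in range(len(obs_a)):
        if t > 0:
            new = [[0.0] * n_v for _ in range(n_a)]
            for k in range(n_a):
                for l in range(n_v):
                    total = 0.0
                    for i in range(n_a):
                        for j in range(n_v):
                            # Coupling: both transitions condition on the joint previous
                            # state (i, j), so the chains may sit in different states
                            # (asynchrony) while still influencing each other.
                            total += alpha[i][j] * A_a[i][j][k] * A_v[i][j][l]
                    new[k][l] = total * B_a[k][obs_a[t]] * B_v[l][obs_v[t]]
            alpha = new

        # Rescale at every step to avoid underflow; the logs of the scales sum to
        # the log-likelihood of the whole sequence pair.
        scale = sum(sum(row) for row in alpha)
        loglik += math.log(scale)
        alpha = [[x / scale for x in row] for row in alpha]

    return loglik
```

In an isolated word recognizer of the kind the abstract describes, one such model would typically be trained per word and the word with the highest log-likelihood selected. The emission densities in a practical system would be continuous rather than the discrete tables assumed here.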

Citation (APA)

Nefian, A. V., Liang, L., Pi, X., Liu, X., & Murphy, K. (2002). Dynamic Bayesian networks for audio-visual speech recognition. Eurasip Journal on Applied Signal Processing, 2002(11), 1274–1288. https://doi.org/10.1155/S1110865702206083