In this paper an evaluation of visual speech features is performed specifically for the tasks of speech and speaker recognition. Unlike acoustic speech processing, we demonstrate that the features employed for effective speech and speaker recognition are quite different from one another in the visual modality. Area-based features (i.e. raw pixels) rather than contour features (i.e. an atomized parametric representation of the mouth, e.g. outer and inner labial contour, tongue, teeth, etc.) are investigated due to their robustness and stability. For the task of speech reading we demonstrate empirically that a large proportion of word-unit class distinction stems from the temporal rather than the static nature of the visual speech signal. Conversely, for the task of speaker recognition static representations suffice for effective performance, although modelling the temporal nature of the signal does improve performance. Additionally, we hypothesize that traditional hidden Markov model (HMM) classifiers may, due to their assumptions of intra-state observation independence and stationarity, not be the best paradigm for modelling visual speech for the purposes of speech recognition. Results and discussion are presented on the M2VTS database for the tasks of isolated-digit speech recognition and text-dependent speaker recognition. © Springer-Verlag 2003.
CITATION STYLE
Lucey, S. (2003). An evaluation of visual speech features for the tasks of speech and speaker recognition. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2688, 260–267. https://doi.org/10.1007/3-540-44887-x_31