Automatic speech recognizers currently perform poorly in the presence of noise. Humans, on the other hand, often compensate for noise degradation by extracting speech information from alternative sources and then integrating this information with the acoustic signal. Visual signals from the speaker’s face are one source of supplemental speech information. We demonstrate that multiple sources of speech information can be integrated at a sub-symbolic level to improve vowel recognition. Feedforward and recurrent neural networks are trained to estimate the acoustic characteristics of the vocal tract from images of the speaker’s mouth. These estimates are then combined with the noise-degraded acoustic information, effectively increasing the signal-to-noise ratio and improving the recognition of these noise-degraded signals. Alternative symbolic strategies, such as direct categorization of the visual signals into vowels, are also presented. The performance of these neural networks compared favorably with human performance and with other pattern-matching and estimation techniques. © 1990, IEEE
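The sub-symbolic integration described above can be sketched in outline: a small feedforward network maps mouth-image features to an estimate of the acoustic spectral envelope, and that estimate is fused with the noisy acoustic spectrum by a weighted average. This is a minimal illustration, not the paper's implementation; the layer sizes, the untrained random weights, the synthetic data, and the simple SNR-dependent weighting rule are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 25-value mouth-image feature vector,
# 32-bin spectral envelope, 8 hidden units.
N_PIXELS, N_BINS, N_HIDDEN = 25, 32, 8

def init_net(n_in, n_hidden, n_out, rng):
    """One-hidden-layer feedforward network (random weights; training omitted)."""
    return {
        "W1": rng.normal(scale=0.1, size=(n_hidden, n_in)),
        "W2": rng.normal(scale=0.1, size=(n_out, n_hidden)),
    }

def forward(net, x):
    """Sigmoid hidden layer, linear output: a visually derived
    estimate of the spectral envelope."""
    h = 1.0 / (1.0 + np.exp(-net["W1"] @ x))
    return net["W2"] @ h

def fuse(acoustic_spectrum, visual_estimate, snr_weight):
    """Weighted average of the noisy acoustic spectrum and the visual
    estimate; a larger snr_weight trusts the acoustics more (high SNR)."""
    return snr_weight * acoustic_spectrum + (1.0 - snr_weight) * visual_estimate

# Synthetic example: a clean vowel spectrum, a noise-degraded version,
# and a random stand-in for the mouth-image features.
clean = np.abs(np.sin(np.linspace(0.0, np.pi, N_BINS)))
noisy = clean + rng.normal(scale=0.5, size=N_BINS)
mouth = rng.random(N_PIXELS)

net = init_net(N_PIXELS, N_HIDDEN, N_BINS, rng)
visual = forward(net, mouth)
fused = fuse(noisy, visual, snr_weight=0.5)
```

With a trained network, `fused` would sit closer to the clean spectrum than `noisy` does, which is the sense in which the visual channel "effectively increases the signal-to-noise ratio" before vowel classification.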
Yuhas, B. P., Goldstein, M. H., Sejnowski, T. J., & Jenkins, R. E. (1990). Neural Network Models of Sensory Integration for Improved Vowel Recognition. Proceedings of the IEEE, 78(10), 1658–1668. https://doi.org/10.1109/5.58349