Speaker source localization using audio-visual data and array processing based speech enhancement for in-vehicle environments


Abstract

Interactive systems for in-vehicle applications have as their central goal improving driver safety while allowing drivers effective control of vehicle functions, access to on-board or remote information, and safe hands-free human communications. Human-computer interaction for in-vehicle systems requires effective audio capture, tracking of who is speaking, environmental noise suppression, and robust processing for applications such as route navigation, hands-free mobile communications, and human-to-human communications for hearing-impaired subjects. In this chapter, we discuss two interactive speech processing frameworks for in-vehicle systems, with driver safety as the motivating application. First, we consider integrated audio-visual processing for detecting the primary speech of a driver using a route navigation system. Integrating both visual and audio content allows us to reject unintended speech before it is submitted for speech recognition within the route dialog system. Second, we consider a multi-channel scheme that combines fixed and adaptive array processing (CFA-BF) with spectral-constrained iterative Auto-LSP and auditory-masked GMMSE-AMT-ERB processing for speech enhancement. The combined scheme takes advantage of the strengths of array processing methods in noisy environments, as well as the speed and efficiency of single-channel methods. We evaluate the audio-visual localization scheme for route navigation dialogs and show improved speech recognition accuracy of up to 40% using the CIAIR in-vehicle data corpus from Nagoya, Japan. For the combined array processing and speech enhancement methods, we demonstrate consistent noise suppression and improved voice communication quality on a subset of the TIMIT corpus with four real noise sources, with an overall average 26 dB increase in segmental SNR (SegSNR) over the original degraded audio.
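The segmental SNR (SegSNR) figure quoted above is a standard frame-level quality measure. As a minimal sketch (not the authors' implementation), assuming a clean reference signal is available and using a common choice of frame length and dB clamping range:

```python
import numpy as np

def segmental_snr(clean, processed, frame_len=256, eps=1e-10):
    """Segmental SNR in dB: the per-frame SNR between a clean reference
    and a processed signal, averaged over frames. Per-frame values are
    clamped to [-10, 35] dB, a common convention; frame_len is an
    illustrative choice, not taken from the chapter."""
    n_frames = min(len(clean), len(processed)) // frame_len
    snrs = []
    for i in range(n_frames):
        s = clean[i * frame_len:(i + 1) * frame_len]
        p = processed[i * frame_len:(i + 1) * frame_len]
        noise = s - p  # residual distortion in this frame
        snr = 10.0 * np.log10((np.sum(s ** 2) + eps) / (np.sum(noise ** 2) + eps))
        snrs.append(float(np.clip(snr, -10.0, 35.0)))
    return float(np.mean(snrs))
```

An improvement such as the reported 26 dB average gain would correspond to `segmental_snr(clean, enhanced) - segmental_snr(clean, degraded)` averaged over the test corpus.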

Citation (APA)

Zhang, X., Hansen, J. H. L., Takeda, K., Maeno, T., & Arehart, K. (2007). Speaker source localization using audio-visual data and array processing based speech enhancement for in-vehicle environments. In Advances for In-Vehicle and Mobile Systems: Challenges for International Standards (pp. 123–140). Springer Science and Business Media, LLC. https://doi.org/10.1007/978-0-387-45976-9_11
