Speech recognizer based maximum likelihood beamforming

Abstract

The signal-to-noise ratio (SNR) of speech signals can be considerably enhanced by recording them simultaneously through arrays of microphones and combining the recordings appropriately. The manner in which microphone array recordings should be combined to obtain the best results has been the subject of much research over the years.

The simplest array processing method is delay-and-sum beamforming (Johnson and Dudgeon, 1993). Sounds from any source must travel different distances to the different microphones, so the recordings are delayed with respect to one another. In delay-and-sum beamforming, the recordings are aligned to cancel out the relative delays of signals from the desired source, and then averaged. Interfering noises from sources that are not coincident with the desired source remain misaligned and are attenuated by the averaging. It can be shown that if the noise signals corrupting the microphone channels are uncorrelated with each other and with the target speech signal, delay-and-sum beamforming yields a 3 dB increase in the SNR of the output signal for every doubling of the number of microphones in the array. The term "beamforming" derives from the fact that such processing can be shown to selectively pick up signals from a narrow beam of locations around the desired source, attenuating signals from other locations. The narrower the beam, the better the ability of the array to select the desired source. The beamwidth and directivity of the delay-and-sum beamformer can be improved by increasing the number of microphones in the array and by appropriate geometric arrangement of the microphones.

Far more effective than delay-and-sum beamforming is filter-and-sum beamforming, in which the signal recorded by each microphone is passed through an associated filter before the signals are combined. The spatial characteristics of the beamformer can be controlled by modifying the parameters of the microphone filters. The design of filter-and-sum beamformers usually involves estimating the array filter parameters such that the signal from the desired source is maximally enhanced. Unfortunately, this desired signal cannot be known a priori, and the actual design process optimizes alternative criteria that are expected to relate to the enhancement achieved on the desired signal. Sidelobe cancellation techniques design the array filters to attenuate signal energy arriving from directions other than that of the desired source (Griffiths and Jim, 1982). Noise suppression methods design the array to suppress a known or estimated noise (Nordholm et al., 1999). Least squares methods attempt to maximize the SNR of the array output using estimates of the power spectrum of the desired speech signal (e.g. Aichner et al., 2003). Thus, effective beamforming requires a characterization of either the noise or the desired signal. Both beamformer structures are sketched in code below.
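
To make the delay-and-sum idea concrete, here is a minimal numpy sketch (not from the chapter). It assumes the per-channel delays of the desired source are already known as whole samples; a real array would estimate them, for example by cross-correlation, and would interpolate fractional delays.

```python
import numpy as np

def delay_and_sum(channels, delays):
    """Delay-and-sum beamformer.

    channels : list of 1-D arrays, one recording per microphone
    delays   : per-channel delays (in samples) of the desired source;
               advancing each channel by its delay aligns the source
    """
    n = min(len(x) - d for x, d in zip(channels, delays))
    aligned = np.stack([x[d:d + n] for x, d in zip(channels, delays)])
    # Averaging keeps the now-aligned target at full strength while
    # uncorrelated noise partially cancels: roughly 3 dB SNR gain per
    # doubling of the number of microphones.
    return aligned.mean(axis=0)
```

With uncorrelated noise on each channel, averaging M channels reduces the noise power by a factor of M while preserving the aligned signal, which is where the 3 dB per doubling figure comes from.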
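
Filter-and-sum replaces the simple shift with a per-channel FIR filter. A sketch in the same vein follows; the tap values are the free parameters that the design criteria above, or the likelihood criterion described next, would estimate.

```python
import numpy as np

def filter_and_sum(channels, filters):
    """Filter-and-sum beamformer: filter each channel with its own FIR
    filter, then sum.  Delay-and-sum is the special case in which each
    filter is a delayed unit impulse scaled by 1/num_channels.

    channels : list of 1-D arrays, one recording per microphone
    filters  : list of 1-D FIR tap arrays, one per channel
    """
    n = min(len(x) for x in channels)
    out = np.zeros(n)
    for x, h in zip(channels, filters):
        out += np.convolve(x[:n], h)[:n]  # truncate to a common length
    return out
```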

Speech recognition systems are repositories of detailed information about the speech signal. They contain statistical characterizations of the spectral measurements for the sounds in a language (usually modelled as hidden Markov models (HMMs)), phonotactic rules for how sounds can follow one another (usually represented as phonetic dictionaries that map words in the language to sequences of phonemes), and statistical or rule-based descriptions of valid word sequences (usually in the form of grammars or N-gram language models). Together these form a complete statistical characterization of every speech signal that represents a valid sentence in the language. Conversely, any valid speech signal can be expected to conform to the statistical characterizations stored in the recognizer.

The beamforming algorithms presented in this chapter are founded on this observation. These algorithms optimize beamformer parameters such that the signal output by the array maximally conforms to the statistical models stored in an HMM-based speech recognizer (Seltzer, 2003). Specifically, they optimize the filter parameters of a filter-and-sum array to maximize the likelihood attributed to its output by a speech recognizer.

Two kinds of beamforming algorithms are presented. The first kind aims to separate out and enhance a speech signal from a mixture of speech and non-speech signals. Since the interfering signals are not speech and do not conform to the models in the speech recognizer, the filter parameters can be optimized using the recognizer directly. The second kind attempts to separate signals from multiple speakers who are speaking simultaneously. This is achieved by beamforming separately for each of the speakers: the desired signal for each beamformer is the speech from one of the speakers, while the rest of the speakers are treated as interference. Here an additional complication arises because the interfering signals are also speech and may also conform to the models in the recognizer. To account for the multiple conformant signals, the beamforming algorithm uses factorial hidden Markov models (FHMMs), derived by compounding the statistical models stored in the recognizer, to simultaneously model the desired and interfering speech signals. The microphone array filter parameters are estimated such that the likelihood of the array output, as measured by the constituent components of the factorial HMM that represent the desired speaker, is maximized.

We note that an HMM-based speech recognition system has two distinct statistical components: acoustic models, which represent statistical constraints on the acoustic manifestation of the speech signal, and a language model, which represents linguistic constraints on spoken utterances. For signal enhancement we present two algorithms: one that utilizes only the statistical acoustic constraints and needs deterministic linguistic constraints (Seltzer and Raj, 2001), and a second that utilizes both statistical acoustic and linguistic constraints (Seltzer et al., 2002). For speaker separation we present an algorithm that utilizes only statistical acoustic constraints and requires deterministic language constraints (Reyes et al., 2003). The development of speaker separation algorithms that utilize both statistical acoustic and linguistic constraints from the recognizer is left for future work. © 2005 Springer Science + Business Media, Inc.
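
The chapter's central criterion, maximizing the likelihood the recognizer assigns to the array output, can be sketched as an optimization over the filter taps. Everything below is a simplified illustration: recognizer_log_likelihood is a hypothetical stand-in for scoring features of the beamformed signal against the recognizer's HMMs, and a generic derivative-free optimizer stands in for the published algorithm's alternation between HMM state alignment and gradient-based filter updates.

```python
import numpy as np
from scipy.optimize import minimize

def recognizer_log_likelihood(signal):
    """Hypothetical stand-in: log-likelihood of the signal's features
    under the recognizer's HMMs for the known or hypothesized
    transcription.  A real system would extract cepstral features and
    score them against the aligned HMM states."""
    raise NotImplementedError

def ml_beamform(channels, taps_per_channel=20, max_iter=200):
    num_ch = len(channels)
    # Start from a delay-and-sum-like configuration: a unit impulse
    # per channel, scaled so the channels are averaged.
    h0 = np.zeros((num_ch, taps_per_channel))
    h0[:, 0] = 1.0 / num_ch

    def neg_log_lik(flat):
        filters = flat.reshape(num_ch, taps_per_channel)
        y = filter_and_sum(channels, list(filters))  # sketch above
        return -recognizer_log_likelihood(y)

    res = minimize(neg_log_lik, h0.ravel(), method="Nelder-Mead",
                   options={"maxiter": max_iter})
    return res.x.reshape(num_ch, taps_per_channel)
```

In the enhancement algorithm that assumes deterministic linguistic constraints, the transcription is known and the likelihood is computed against a fixed model sequence; the variant with statistical linguistic constraints lets the language model supply that sequence probabilistically.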
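
For speaker separation, the factorial HMM compounds the per-speaker models: a composite state is a pair of states, one per speaker, and since the two chains evolve independently, the composite transition matrix is the Kronecker product of the individual ones. Below is a minimal sketch of that compounding, our own simplification; the emission side, i.e. how the two speakers' output densities combine to model the mixed observation, is omitted here because it depends on the feature domain.

```python
import numpy as np

def compound_transitions(A1, A2):
    """Transition matrix of a two-chain factorial HMM.

    A1 : (n1, n1) transition matrix of speaker 1's HMM
    A2 : (n2, n2) transition matrix of speaker 2's HMM

    Composite state (i, j) maps to index i * n2 + j; independence of
    the chains gives P((i,j) -> (k,l)) = A1[i,k] * A2[j,l], which is
    exactly the Kronecker product of the two matrices.
    """
    return np.kron(A1, A2)
```

The filters of the beamformer aimed at a given speaker are then estimated to maximize the likelihood contributed by that speaker's constituent chain of the FHMM.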

Cite

Raj, B., Seltzer, M., & Reyes-Gomez, M. J. (2005). Speech recognizer based maximum likelihood beamforming. In Speech Separation by Humans and Machines (pp. 65–82). Springer US. https://doi.org/10.1007/0-387-22794-6_6
