Masking the feature information in multi-stream speech-analogue displays


Abstract

Separation of speech signals by humans is one of the auditory-neural processes we all count on to occur automatically and efficiently. Fortunately, this is mostly the case, as long as the listener has no more than a mild hearing loss and is relatively young. The phenomenon has been termed "the cocktail-party effect" (CPE) by Cherry (1953). Deficient speech separation, a CPE deficit, is observed in people with moderate-to-severe sensorineural hearing loss regardless of age (Souza and Turner 1994), but also in elderly individuals regardless of their peripheral hearing sensitivity and regardless of whether or not the target and the (babble) masker speech sources are spatially separated (Gelfand et al. 1988; Divenyi and Haupt 1997). During the last 25 years, much effort has been devoted to investigating the CPE: its characteristics, its component and underlying processes, its failures in certain classes of listeners (especially the elderly), and computational models that emulate or surpass the human biological system that allows it to take place. To date, however, we have not reached the point at which the processes responsible for the CPE and the causes of its dysfunctions are fully understood, or at which computational means of separating simultaneous speech signals are reliable enough for a machine to take over when humans fail, or when no human listener is present.

Previous work on the perceptual separation of simultaneous speech sources in our laboratory has aimed at understanding the mechanisms that play a role in the CPE in the young and in its decline in the elderly. Much of this work has focused on identifying the significant dimensions of the phenomenon, such as the perceptual segregation of sources ("streams," following Bregman's (1990) terminology) based on differences in spatial location, fundamental-frequency pitch, formant frequency, and/or syllabic (or subsyllabic) rhythm. For these studies, we used simultaneous pairs of brief signals (typically shorter than 500 ms) that retained only the most basic characteristics of speech: f0, a single formant, and a temporal envelope pattern. One surprising finding of this research was that spatial separation of two streams provided only a moderate advantage, even for experienced young listeners, in a stream segregation task (Divenyi 2001). This finding places a heavy emphasis on speech separation performed without the benefit of spatial cues, i.e., in a single spatial channel, as when listening to an amalgam of voices through a single loudspeaker. To perform this task successfully, the listener must rely on other cues, e.g., differences in pitch, spectral pattern, and temporal pattern between the concurrent speech streams.

But how do simultaneous streams of speech interfere with one another? For a single target speaker's speech embedded in the babble of six to twelve other speakers, one is tempted to attribute the interference to masking. However, since interference also occurs in the presence of a single unwanted source that leaves the target audible at least part of the time and in part of the spectrum, traditional, or energetic, masking has been shown to account for no more than part of this interference.
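
The following is a minimal sketch, in Python (an illustration added here, not part of the chapter), of the kind of speech-analogue signal described above: an f0 source passed through a single formant resonance and gated by a slow temporal envelope. All parameter values (sample rate, f0, formant frequency and bandwidth, envelope rate) are assumptions chosen for the example, not the values used in the experiments.

import numpy as np
from scipy.signal import lfilter

fs = 16000                          # sample rate in Hz (illustrative)
dur = 0.4                           # brief signal, shorter than 500 ms as in the studies
t = np.arange(int(fs * dur)) / fs

f0 = 120.0                          # fundamental frequency in Hz (assumed)
n_harm = int((fs / 2) // f0)        # keep all harmonics below the Nyquist frequency
source = sum(np.cos(2 * np.pi * k * f0 * t) for k in range(1, n_harm))

formant_hz, bw_hz = 1000.0, 80.0    # single formant centre frequency and bandwidth (assumed)
r = np.exp(-np.pi * bw_hz / fs)     # pole radius of a two-pole resonator
theta = 2 * np.pi * formant_hz / fs
sig = lfilter([1.0 - r], [1.0, -2.0 * r * np.cos(theta), r * r], source)

am_rate = 4.0                       # syllabic-rate envelope, within the slow AM range
env = 0.5 * (1.0 - np.cos(2 * np.pi * am_rate * t))   # raised-cosine AM envelope
sig = sig * env
sig = sig / np.max(np.abs(sig))     # normalise amplitude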
As Brungart describes in Chapter 17 of the present book, much of the masking in these situations is informational: the target may be above its energetic masked threshold, yet the information it carries is fully or partially blocked by the presence of similar, though not identical, information in the interferer. One reason informational masking exists in speech, in addition to energetic masking, is that energetic masking, according to its most widely accepted definition, implies stationary signals and maskers. Speech, on the other hand, is quintessentially dynamic, characterized by Plomp as a signal "slowly varying in amplitude and frequency" (Plomp 1983). Thus, to uncover the nature of speech-by-speech interference, it is necessary to look at its dynamic features. By features we mean spectro-temporal acoustic patterns that may or may not coincide with phonetic or phonological features, and by dynamic we mean that these patterns undergo slow (2- to 20-Hz) amplitude modulation (AM) and/or frequency modulation (FM). In investigating the properties of this interference, we first decided to focus on a single feature in the target stream in the presence of interference by the same feature in another stream, the distractor. The specific question we asked was: how resistant are we to interference when identifying a pattern of slow AM or FM fluctuations? Cast in psychophysically tractable terms, the question is: in a target stream containing a pattern of a given feature, what degree of informational masking is produced by a simultaneously present distractor stream that contains a random sequence of the same feature? In the listening experiments described in the following paragraphs, two such features were investigated: a syllabic-rate rhythmic pattern (AM) and a formant excursion pattern (FM). © 2005 Springer Science + Business Media, Inc.
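
The single-channel paradigm posed in that question can likewise be sketched in code. In this hedged example (again an illustration under assumed parameters, not the chapter's procedure), a target stream carries a fixed syllabic-rate on/off rhythm (slow AM), a distractor stream on a different carrier carries a random sequence of the same feature, and the two are summed into a single channel, as when heard over one loudspeaker. Carrier frequencies, slot duration, and target-to-masker ratio are assumptions.

import numpy as np

fs, dur = 16000, 1.0
t = np.arange(int(fs * dur)) / fs
rng = np.random.default_rng(0)      # seeded so the random distractor rhythm is reproducible

def am_stream(carrier_hz, gate_pattern, gate_ms=125.0):
    # Tonal carrier gated by a binary on/off rhythm: one element of gate_pattern
    # per 125-ms slot, i.e. an 8-per-second slot rate within the slow (2-20 Hz) AM range.
    n_gate = int(fs * gate_ms / 1000.0)
    env = np.repeat(np.asarray(gate_pattern, dtype=float), n_gate)
    env = np.pad(env, (0, max(0, t.size - env.size)))[: t.size]
    return env * np.sin(2.0 * np.pi * carrier_hz * t)

target_pattern = [1, 0, 1, 1, 0, 1, 0, 0]            # fixed rhythm the listener must identify
distractor_pattern = rng.integers(0, 2, size=8)      # random sequence of the same feature

target = am_stream(1000.0, target_pattern)           # carrier frequencies are assumptions;
distractor = am_stream(2500.0, distractor_pattern)   # spectral separation limits energetic masking

tmr_db = 0.0                                         # target-to-masker ratio in dB (assumed)
mix = target + (10.0 ** (-tmr_db / 20.0)) * distractor
mix = mix / np.max(np.abs(mix))                      # single-channel presentation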

Citation (APA)

Divenyi, P. L. (2005). Masking the feature information in multi-stream speech-analogue displays. In Speech Separation by Humans and Machines (pp. 269–281). Springer US. https://doi.org/10.1007/0-387-22794-6_18
