Spatial release from energetic and informational masking in a selective speech identification task.
- PubMed: 18537388
Abstract
A masker can reduce target intelligibility both by interfering with the target's peripheral representation ("energetic masking") and/or by causing more central interference ("informational masking"). Intelligibility generally improves with increasing spatial separation between two sources, an effect known as spatial release from masking (SRM). Here, SRM was measured using two concurrent sine-vocoded talkers. Target and masker were each composed of eight different narrowbands of speech (with little spectral overlap). The broadband target-to-masker energy ratio (TMR) was varied, and response errors were used to assess the relative importance of energetic and informational masking. Performance improved with increasing TMR. SRM occurred at all TMRs; however, the pattern of errors suggests that spatial separation affected performance differently, depending on the dominant type of masking. Detailed error analysis suggests that informational masking occurred due to failures in either across-time linkage of target segments (streaming) or top-down selection of the target. Specifically, differences in the spatial cues in target and masker improved streaming and target selection. In contrast, level differences helped listeners select the target, but had little influence on streaming. These results demonstrate that at least two mechanisms (differentially affected by spatial and level cues) influence informational masking.
Antje Ihlefeld and Barbara Shinn-Cunningham b)
Auditory Neuroscience Laboratory, Boston University Hearing Research Center, 677 Beacon St., Boston, Massachusetts 02215, USA
(Received 10 November 2006; revised 12 March 2008; accepted 13 March 2008)
J. Acoust. Soc. Am. 123 (6), June 2008. © 2008 Acoustical Society of America. DOI: 10.1121/1.2904826
PACS numbers: 43.66.Dc, 43.66.Pn, 43.66.Qp [RLF]
Pages: 4369–4379
a) Portions of this work were presented at the 2005 Mid-Winter meeting of the Association for Research in Otolaryngology.
b) Author to whom correspondence should be addressed. Electronic mail: shinn@cns.bu.edu
I. INTRODUCTION
When listening selectively for target speech in a background of competing talkers, at least two forms of masking can interfere with performance: energetic and informational masking (see, e.g., Freyman et al., 1999; Brungart et al., 2001; Arbogast et al., 2002; Brungart et al., 2005; Kidd et al., 2005). Spatially separating the target from concurrent maskers can improve performance, an effect known as spatial release from masking (SRM; e.g., see Hirsh, 1948; Cherry, 1953; Arbogast et al., 2002). While traditional binaural models can account for spatial release from energetic masking (e.g., Zurek, 1993), the mechanisms underlying spatial release from informational masking are poorly understood.
Two stimulus characteristics are often said to contribute to informational masking: (1) similarity between target and masker with respect to perceptual (e.g., Freyman et al., 1999; Darwin and Hukin, 2000; Brungart, 2001a) or linguistic attributes (Hawley et al., 2004; Van Engen and Bradlow, 2007), and (2) uncertainty about either target or masker (e.g., Lutfi, 1993; Kidd et al., 2005a; Freyman et al., 2007). Past work suggests that at least some of these effects of informational masking are due to failures in segregation and/or attention (e.g., Brungart et al., 2005; Edmonds and Culling, 2006). However, there is no clear consensus on how these two processes contribute to spatial release from informational masking.
Previous studies comparing energetic and informational masking indicate that analysis of response errors can dissociate effects caused by energetic and informational masking. For instance, in selective speech identification tasks with interference from informational masking, subjects often report words from the masker rather than the target messages (Brungart, 2001b; Kidd et al., 2005a; Wightman and Kistler, 2005). In contrast, for selective-listening tasks that are dominated by energetic masking, errors are more randomly distributed. The current study attempts to tease apart the mechanisms underlying informational masking through a more detailed analysis of response patterns than has been undertaken in previous studies. The analyses are driven by the hypothesis that similarity between target and masker can interfere with (1) extracting local time-frequency segments from the acoustic mixture, (2) connecting segments from the same source across time (streaming), and/or (3) selecting the correct target segments or stream even if they are properly segmented and streamed.
We explored how spatial separation between target and masker influences the pattern of responses, and how these patterns are affected by the level difference between target and masker. Listeners were asked to report a phrase from a variable-level target message that was presented simultaneously with a fixed-level masker message. The locations of target and masker were simulated over headphones to be either co-located or spatially separated by 90°. In addition, [...] level between target and masker could provide listeners with a cue to select the target and/or better link target keywords across time into a coherent target stream.
Analysis of response errors revealed systematic changes with spatial configuration and target level in the likelihoods of reporting all target keywords, all keywords from the masker, or a mixture of keywords from target and masker. The pattern of results suggests that the relative contributions of energetic and informational masking change systematically with the target-to-masker broadband energy ratio (TMR). At near-zero-dB TMRs, when informational masking occurs, space and intensity cues may help listeners track keywords across time to form a proper stream, as well as enable listeners to select the proper keywords or streams out of the mixture.
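The three response categories named above (all target keywords, all masker keywords, or a mixture) can be illustrated with a small classification routine. This is only a sketch: the function name `classify_response`, the `(color, number)` tuple layout, and the extra "other" label for responses containing a keyword from neither talker are assumptions made for illustration, not the scoring procedure used in the study.

```python
from collections import Counter

def classify_response(response, target, masker):
    """Label a (color, number) response by the source of each reported keyword.

    Hypothetical helper: the category names are illustrative assumptions.
    """
    sources = []
    for reported, t_word, m_word in zip(response, target, masker):
        if reported == t_word:
            sources.append("target")
        elif reported == m_word:
            sources.append("masker")
        else:
            sources.append("neither")
    if all(s == "target" for s in sources):
        return "all_target"
    if all(s == "masker" for s in sources):
        return "all_masker"
    if "neither" not in sources:
        return "mix"      # one keyword from each talker
    return "other"        # at least one keyword from neither talker

# Made-up example data, not results from the paper.
target = ("blue", 3)
masker = ("red", 5)
responses = [("blue", 3), ("red", 5), ("blue", 5), ("red", 3), ("green", 3)]
counts = Counter(classify_response(r, target, masker) for r in responses)
```

Because target and masker keywords always differ within a trial, each reported keyword matches at most one talker, so the categories do not overlap.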
II. METHODS
A. Subjects
Four subjects (ages 21–24) were paid for their participation in the experiments. All subjects were native speakers of American English and had normal hearing, confirmed by an audiometric screening. All subjects gave written informed consent (as approved by the Boston University Charles River Campus Institutional Review Board) before participating in the study.
B. Stimuli
Raw speech stimuli were taken from the Coordinate Response Measure corpus (CRM; see Bolia et al., 2000), which consists of highly predictable sentences of the form “Ready [call sign], go to [color] [number] now.” The call sign was one of the set “Baron,” “Eagle,” “Tiger,” and “Arrow”; the color was one of the set “white,” “red,” “blue,” “green”; and the number was one of the digits between one and eight, excluding the number seven (as it is the only two-syllable digit and is therefore relatively easy to identify). For each session, one of the four call signs was randomly selected as the target call sign.
In each trial, two different sentences were used as sources. The call signs, numbers, and colors in the two utterances were randomly chosen, but constrained to differ from each other in each trial (with one sentence always containing the target call sign). In order to minimize differences between competing messages, talker 0 was used for both sentences.
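The trial constraints described above can be sketched in a few lines. The names `draw_trial` and `sentence` are hypothetical, and the sampling scheme (drawing two distinct items per keyword category) is one assumed way to satisfy the stated constraint that call signs, colors, and numbers differ between the two sentences; the original selection code is not given in the text.

```python
import random

CALL_SIGNS = ["Baron", "Eagle", "Tiger", "Arrow"]
COLORS = ["white", "red", "blue", "green"]
NUMBERS = [1, 2, 3, 4, 5, 6, 8]  # one to eight, excluding seven

def draw_trial(target_call_sign, rng):
    """Draw a target and a masker sentence that differ in every keyword."""
    masker_call_sign = rng.choice(
        [c for c in CALL_SIGNS if c != target_call_sign]
    )
    # Sampling two items without replacement guarantees they differ.
    target_color, masker_color = rng.sample(COLORS, 2)
    target_number, masker_number = rng.sample(NUMBERS, 2)
    target = (target_call_sign, target_color, target_number)
    masker = (masker_call_sign, masker_color, masker_number)
    return target, masker

def sentence(call_sign, color, number):
    """Render the keywords in the CRM sentence frame."""
    return f"Ready {call_sign}, go to {color} {number} now."

rng = random.Random(0)  # fixed seed only to make the sketch repeatable
target, masker = draw_trial("Baron", rng)
```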
Each speech signal was processed to produce intelligible, spectrally sparse speech signals (e.g., see Shannon et al., 1995; Dorman et al., 1997; Arbogast et al., 2002; Brungart et al., 2005). All processing was implemented in MATLAB 6.5 [see Fig. 1(a) for a diagram of the processing scheme]. Each target and masker source signal was bandpass filtered into 16 fixed frequency bands of 1/3 octave width, with center frequencies spaced evenly on a logarithmic scale between 175 Hz and 5.6 kHz (every one-third octave). The envelope of each band was extracted using the Hilbert transform. Subsequently, each envelope was multiplied by a pure tone carrier at the center frequency of that band. Figure 1(b) shows an example stimulus spectrum with all 16 bands (consecutive bands are shown in alternating shades). In contrast to many previous experiments using amplitude-modulated sine-wave carrier speech (e.g., Arbogast et al., 2002; Gallun et al., 2005; Brungart et al., 2005; Kidd et al., 2005b), the frequency bands of the current stimuli were not equalized to have similar spectral amplitudes, so that the high-frequency bands have less energy than the low-frequency bands.
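The band layout described above follows directly from the stated spacing: 16 center frequencies, one every one-third octave, starting at 175 Hz. The short sketch below computes them. The band edges are an assumption (one-sixth octave on either side of each center, which gives a 1/3-octave width); the actual filter design is not specified in the text, and the original processing was done in MATLAB rather than Python.

```python
N_BANDS = 16
F_LOW_HZ = 175.0

# Center frequencies: one band every one-third octave above 175 Hz.
centers = [F_LOW_HZ * 2 ** (k / 3) for k in range(N_BANDS)]

# Assumed band edges: one-sixth octave below and above each center.
edges = [(fc * 2 ** (-1 / 6), fc * 2 ** (1 / 6)) for fc in centers]

lower_eight = centers[:8]   # spans roughly 175 to 882 Hz
upper_eight = centers[8:]   # spans roughly 1.1 to 5.6 kHz

# The top band sits five octaves above the bottom one: 175 * 2**5 = 5600 Hz.
highest = centers[-1]
```

With these edges, the upper edge of each band coincides with the lower edge of the next, so the 16 bands tile the range without gaps.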
On each individual trial, eight of the 16 bands were chosen randomly while ensuring that four of these bands were selected from the lower eight bands (175–882 Hz) and four were selected from the upper eight bands (1.1–5.6 kHz). This resulted in a set of [8!/(4!4!)]², or 4900, different possible spectral combinations. The eight bands were then summed to create the raw waveform for one source. The remaining eight bands were used to construct the other source using otherwise identical processing. As a result, the two raw sources had identical statistics over the course of the experiment, but differed within a trial in their timbre, call sign, and keywords (and, on most trials, level).
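The band assignment and the count of 4900 combinations can be checked with a brief sketch. Band indices 0 to 15 and the function name `allocate_bands` are illustrative assumptions; only the constraint itself (four bands from the lower eight, four from the upper eight, remainder to the other source) comes from the text.

```python
import math
import random

LOW_BANDS = list(range(0, 8))    # lower eight bands
HIGH_BANDS = list(range(8, 16))  # upper eight bands

def allocate_bands(rng):
    """Give one source 4 low and 4 high bands; the other source gets the rest."""
    source_1 = sorted(rng.sample(LOW_BANDS, 4) + rng.sample(HIGH_BANDS, 4))
    source_2 = sorted(set(range(16)) - set(source_1))
    return source_1, source_2

# Number of distinct allocations: C(8, 4) choices in each half, 70 * 70.
n_combinations = math.comb(8, 4) ** 2

rng = random.Random(0)  # fixed seed only to make the sketch repeatable
source_1, source_2 = allocate_bands(rng)
```

Since the complement of a valid selection also contains four low and four high bands, both sources are drawn from the same set of possible spectra, consistent with the statement that they had identical statistics over the experiment.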
The raw source waveforms were scaled to have the same fixed, broadband root-mean-square (RMS) energy prior to spatial processing (described below). When target and masker were set to the same level of broadband RMS energy, the within-band energy ratio of one utterance to another was on the order of 20 dB at all frequencies [cf. Fig. 1(b)]. In fact, a model of the auditory periphery shows that for these
[Figure 1. Panel (a): processing flow chart (bandpass filtering, Hilbert envelope, multiplication by a carrier, and summation for each source). Panel (b): amplitude (dB) versus frequency (Hz, 175 to 5600) for Source 1 and Source 2.]
FIG. 1. (a) Flow chart showing how the stimuli were generated. (b) An example of the resulting interleaved spectra of two competing spectrally sparse messages, one in gray and one in black.
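The broadband RMS equalization and the target-to-masker ratio described in the stimulus section can be sketched as follows. The helper names (`rms`, `scale_to_rms`, `apply_tmr`, `level_ratio_db`), the reference RMS of 1.0, and the toy sinusoids are all assumptions for illustration; the sketch only shows the standard relation that a level change of TMR dB corresponds to an amplitude gain of 10^(TMR/20), applied here to the target while the masker stays at a fixed level.

```python
import math

def rms(x):
    """Root-mean-square value of a sequence of samples."""
    return math.sqrt(sum(v * v for v in x) / len(x))

def scale_to_rms(x, target_rms):
    """Rescale a waveform so that its broadband RMS equals target_rms."""
    gain = target_rms / rms(x)
    return [gain * v for v in x]

def apply_tmr(target, masker, tmr_db, reference_rms=1.0):
    """Fix the masker at reference_rms and set the target tmr_db relative to it."""
    masker_out = scale_to_rms(masker, reference_rms)
    target_out = scale_to_rms(target, reference_rms * 10 ** (tmr_db / 20))
    return target_out, masker_out

def level_ratio_db(a, b):
    """Broadband level of a relative to b, in dB."""
    return 20 * math.log10(rms(a) / rms(b))

# Toy waveforms standing in for the summed-band sources (not real stimuli).
target = [math.sin(2 * math.pi * 5 * n / 100) for n in range(100)]
masker = [0.3 * math.sin(2 * math.pi * 9 * n / 100) for n in range(100)]
target_scaled, masker_scaled = apply_tmr(target, masker, tmr_db=-10.0)
```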


