Recurrent timing nets for F0-based speaker separation


Abstract

Arguably, the most important barrier to widespread use of automatic speech recognition systems in real-life situations is their present inability to separate the speech of individual speakers from other sound sources: other speakers, acoustic clutter, and background noise. We believe that critical examination of biological auditory systems, with a focus on "reverse engineering" them, can lead to the discovery of new functional principles of information representation and processing that can subsequently be applied to the design of artificial speech recognition systems.

From our experience in attempting to understand the essentials of how the auditory system works as an information-processing device, we believe there are three major areas where speech recognizers could profit from incorporating processing strategies inspired by auditory systems:

1) use of temporally coded front-end representations that are precise, robust, and transparent, and that encode the fine temporal (phase) structure of periodicities below 4 kHz;

2) use of early scene-analysis mechanisms that form distinct auditory objects by means of common onset/offset/temporal contiguity and common harmonic structure (F0, voice pitch); and

3) use of central phonetic analyzers designed to operate on multiscale, temporally coded, autocorrelation-like front-end representations as they present themselves after initial object-formation/scene-analysis processing.

This paper addresses the first two areas, with emphasis on possible neural mechanisms (neural timing nets) that could exploit phase-locked fine timing information to separate harmonic sounds on the basis of differences in their fundamental frequencies (harmonicity). Psychoacoustical evidence suggests that the auditory system employs highly effective low-level, bottom-up representational and scene-analysis strategies to separate individual sound sources. Neurophysiological evidence suggests that the auditory system uses interspike interval information to represent sound in the early stages of auditory processing. Interval-based temporal codes are known to provide high-quality, precise, and robust representations of stimulus periodicities and spectra over large dynamic ranges and in adverse sonic environments.

We have recently proposed neural timing networks that operate on temporally coded inputs to carry out spike pattern analyses entirely in the time domain. These complement connectionist and time-delay architectures, which produce "spatial", atemporal patterns of element activations as their outputs. In effect, neural timing architectures provide neural network implementations of analog signal processing operations (e.g., cross-correlation, autocorrelation, convolution, cross-spectral product). The ubiquity of neural (tapped) delay lines in the brain may mean that many signal processing operations are more easily and flexibly implemented neurally using time-domain strategies rather than frequency-domain and/or discrete feature-detection strategies. We have found that simple recurrent timing nets can be devised that operate on the temporal fine structure of inputs to build up and separate periodic signals with different fundamental periods (Cariani, 2001a). Simple recurrent nets consist of arrays of coincidence detectors fed by common input lines and by conduction delay loops with different recurrence times.
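As a concrete illustration (not from the chapter itself: the function name recurrent_timing_net, the sign-agreement facilitation rule, and the gain parameter are assumptions of this sketch), a minimal Python version of such an array of coincidence units with delay loops might look like this:

```python
import numpy as np

def recurrent_timing_net(x, max_delay, gain=0.9):
    """Illustrative sketch of a simple recurrent timing net.

    Channel tau (1..max_delay, in samples) is a coincidence unit fed by
    the common input line and by its own conduction delay loop with
    recurrence time tau. The facilitation rule is an assumed, stable
    variant (not necessarily the chapter's exact rule): where the input
    agrees in sign with the signal re-emerging from the loop, the
    circulating signal reinforces the input, so waveforms whose period
    matches tau build up in that channel.
    """
    n = len(x)
    loops = [np.zeros(tau) for tau in range(1, max_delay + 1)]
    out = np.zeros((max_delay, n))
    for t in range(n):
        for k, loop in enumerate(loops):
            tau = k + 1
            circulating = loop[t % tau]    # value written tau steps ago
            if circulating * x[t] > 0.0:   # coincidence of input and loop
                new = x[t] + gain * circulating
            else:                          # no coincidence: pass input through
                new = x[t]
            loop[t % tau] = new            # feed the result back into the loop
            out[k, t] = new
    return out
```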
A processing rule that facilitates correlations between input and loop signals amplifies periodic patterns and segregates those with different periods, thereby allowing constituent waveforms to be recovered. The processing is akin to a dense array of adaptive-prediction comb filters. Based on time codes and temporal processing, timing nets constitute a new, general strategy for scene analysis in neural networks. Rather than using features to label, segregate, and bind channels, the nets build up correlational invariances; they thereby provide a possible means by which the fine temporal structure of voiced speech might be exploited for speaker separation and enhancement.

© 2005 Springer Science + Business Media, Inc.
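To see the comb-filter analogy in action with the recurrent_timing_net sketch above (the sample rate, fundamentals, and channel read-out below are illustrative assumptions), one can mix two harmonic complexes with different F0s and read out the channels whose recurrence times match the two fundamental periods:

```python
import numpy as np

fs = 16000                                  # assumed sample rate (Hz)
t = np.arange(0, 0.1, 1.0 / fs)             # 100 ms of signal

def harmonic_complex(f0, n_harmonics=5):
    """Harmonic complex tone: a crude stand-in for one voiced 'speaker'."""
    return sum(np.sin(2 * np.pi * h * f0 * t) for h in range(1, n_harmonics + 1))

# mixture of two "speakers" with different fundamental frequencies
mixture = harmonic_complex(100.0) + harmonic_complex(160.0)

out = recurrent_timing_net(mixture, max_delay=200)  # loops up to 12.5 ms

# a channel whose recurrence time equals a fundamental period (fs / F0
# samples) accumulates that constituent, much as a comb filter whose
# tooth spacing matches that F0 would; other channels do not build up
tau_a = round(fs / 100.0)                   # 160-sample loop -> 100 Hz voice
tau_b = round(fs / 160.0)                   # 100-sample loop -> 160 Hz voice
recovered_a = out[tau_a - 1]                # channel index = tau - 1
recovered_b = out[tau_b - 1]
```

In this toy setup, each selected channel carries its matching complex amplified several-fold relative to the competing voice, which is the sense in which the array behaves like a dense bank of adaptive comb filters; the chapter's actual processing rule and read-out may of course differ.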

Citation (APA)

Cariani, P. (2005). Recurrent timing nets for F0-based speaker separation. In Speech Separation by Humans and Machines (pp. 31–53). Springer US. https://doi.org/10.1007/0-387-22794-6_4
