Localization based stereo speech source separation using probabilistic time-frequency masking and deep neural networks

45Citations
Citations of this article
39Readers
Mendeley users who have this article in their library.

This article is free to access.

Abstract

Time-frequency (T-F) masking is an effective method for stereo speech source separation. However, reliable estimation of the T-F mask from sound mixtures is a challenging task, especially when room reverberations are present in the mixtures. In this paper, we propose a new stereo speech separation system where deep neural networks are used to generate soft T-F mask for separation. More specifically, the deep neural network, which is composed of two sparse autoencoders and a softmax regression, is used to estimate the orientations of the dominant source at each T-F unit, based on low-level features, such as mixing vector (MV), interaural level, and phase difference (IPD/ILD). The dataset for training the networks was generated by the convolution of binaural room impulse responses (RIRs) and clean speech signals positioned in different angles with respect to the sensors. With the training dataset, we use unsupervised learning to extract high-level features from low-level features and use supervised learning to find the nonlinear functions between high-level features and the orientations of dominant source. By using the trained networks, the probability that each T-F unit belongs to different sources (target and interferers) can be estimated based on the localization cues which is further used to generate the soft mask for source separation. Experiments based on real binaural RIRs and TIMIT dataset are provided to show the performance of the proposed system for reverberant speech mixtures, as compared with a model-based T-F masking technique proposed recently.

Cite

CITATION STYLE

APA

Yu, Y., Wang, W., & Han, P. (2016). Localization based stereo speech source separation using probabilistic time-frequency masking and deep neural networks. Eurasip Journal on Audio, Speech, and Music Processing, 2016(1). https://doi.org/10.1186/s13636-016-0085-x

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free