DNN based mask estimation for supervised speech separation


Abstract

This chapter introduces deep neural network (DNN) based mask estimation for supervised speech separation. Following an approach that originated in computational auditory scene analysis (CASA), we treat speech separation as a mask estimation problem. Given a time-frequency (T-F) representation of noisy speech, the ideal binary mask (IBM) or ideal ratio mask (IRM) is defined to differentiate speech-dominant T-F units from noise-dominant ones. Mask estimation is then formulated as a supervised learning problem: learning a mapping from acoustic features extracted from noisy speech to an ideal mask. Three main aspects of supervised learning — learning machines, training targets, and features — are discussed in separate sections. Subsequently, we describe several representative supervised algorithms, mainly for monaural speech separation. For supervised separation, generalization to unseen conditions is a critical issue, and the generalization capability of supervised speech separation is also discussed.
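As a brief illustration of the two training targets named above, the sketch below computes an IBM and an IRM from premixed speech and noise magnitude spectrograms, using the standard definitions from the literature (local SNR thresholding for the IBM; a square-root energy ratio for the IRM). The function and variable names are illustrative, not taken from the chapter.

```python
import numpy as np

def ideal_masks(speech_mag, noise_mag, lc_db=0.0):
    """Compute the IBM and IRM from magnitude spectrograms
    (arrays of shape [freq, time]) of premixed speech and noise.

    IBM: 1 where the local SNR exceeds the criterion lc_db, else 0.
    IRM: sqrt of speech energy over total energy in each T-F unit.
    """
    eps = 1e-12  # guard against log/division by zero
    snr_db = 10.0 * np.log10((speech_mag**2 + eps) / (noise_mag**2 + eps))
    ibm = (snr_db > lc_db).astype(np.float32)
    irm = np.sqrt(speech_mag**2 / (speech_mag**2 + noise_mag**2 + eps))
    return ibm, irm
```

In a supervised setup, masks computed this way from the training mixtures serve as regression or classification targets for the DNN, which at test time predicts a mask from features of the noisy speech alone.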

Citation (APA)

Chen, J., & Wang, D. L. (2018). DNN based mask estimation for supervised speech separation. In Signals and Communication Technology (pp. 207–235). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-319-73031-8_9
