This chapter introduces deep neural network (DNN) based mask estimation for supervised speech separation. Originating in computational auditory scene analysis (CASA), this approach treats speech separation as a mask estimation problem. Given a time-frequency (T-F) representation of noisy speech, the ideal binary mask (IBM) or ideal ratio mask (IRM) is defined to distinguish speech-dominant T-F units from noise-dominant ones. Mask estimation is then formulated as a supervised learning problem: learning a mapping from acoustic features extracted from noisy speech to an ideal mask. Three main aspects of supervised learning, namely learning machines, training targets, and features, are discussed in separate sections. Subsequently, several representative supervised algorithms are described, mainly for monaural speech separation. For supervised separation, generalization to unseen conditions is a critical issue; the generalization capability of supervised speech separation is also discussed.
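To make the two training targets concrete, the following sketch computes an IBM and an IRM from the premixed speech and noise magnitudes of a T-F representation. The local SNR criterion `lc_db` and the compression exponent `beta` are illustrative defaults (0 dB and 0.5 are common choices in the literature), not parameters fixed by this chapter.

```python
import numpy as np

def ideal_masks(speech_mag, noise_mag, lc_db=0.0, beta=0.5):
    """Compute the IBM and IRM from premixed speech and noise magnitudes.

    speech_mag, noise_mag: arrays of T-F magnitudes with the same shape.
    lc_db: local SNR criterion in dB for the binary mask (assumed 0 dB here).
    beta: compression exponent for the ratio mask (assumed 0.5 here).
    """
    eps = 1e-12  # avoid division by zero and log of zero
    speech_pow = speech_mag ** 2
    noise_pow = noise_mag ** 2
    # Local SNR in dB at each T-F unit
    snr_db = 10.0 * np.log10((speech_pow + eps) / (noise_pow + eps))
    # IBM: 1 where speech dominates (SNR above the criterion), else 0
    ibm = (snr_db > lc_db).astype(np.float32)
    # IRM: compressed ratio of speech energy to total energy, in [0, 1]
    irm = (speech_pow / (speech_pow + noise_pow + eps)) ** beta
    return ibm, irm

# Example: one speech-dominant and one noise-dominant T-F unit
s = np.array([[1.0, 0.1]])
n = np.array([[0.1, 1.0]])
ibm, irm = ideal_masks(s, n)
```

In training, a DNN would be fed features of the noisy mixture and regress toward `irm` (or classify toward `ibm`) at each T-F unit; at test time the estimated mask is applied to the mixture's T-F representation to resynthesize the separated speech.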
Chen, J., & Wang, D. L. (2018). DNN based mask estimation for supervised speech separation. In Signals and Communication Technology (pp. 207–235). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-319-73031-8_9