Effect of spectrogram resolution on deep-neural-network-based speech enhancement

Abstract

In recent single-channel speech enhancement, deep neural networks (DNNs) have played an important role in achieving high performance. One standard use of a DNN is to construct a mask-generating function for time-frequency (T-F) masking. To apply a mask in the T-F domain, the short-time Fourier transform (STFT) is usually used because it is well understood and invertible. While the mask-generating regression function has been studied for a long time, there has been less research on the T-F transform from the viewpoint of speech enhancement. Since the performance of speech enhancement depends on both the T-F mask estimator and the T-F transform, investigating the T-F transform should be beneficial for designing a better enhancement system. In this paper, as a step toward an optimal T-F transform for speech enhancement, we experimentally investigated the effect of STFT parameter settings on a DNN-based mask estimator. We conducted experiments using three types of DNN architectures with three types of loss functions, and the results suggested that U-Net is robust to the parameter setting, whereas the fully connected and BLSTM networks are not.
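
To make "spectrogram resolution" concrete, the following minimal sketch computes the T-F grid implied by a given STFT setting. It is an illustration only, not the authors' code: the function name `stft_resolution`, the 16 kHz sampling rate, the 50% overlap, and the window lengths are all assumptions chosen for the example, and the frame count assumes no padding at the signal edges.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StftResolution:
    window_ms: float       # duration of one analysis window
    freq_bins: int         # number of one-sided frequency bins
    bin_spacing_hz: float  # distance between adjacent frequency bins
    frame_step_ms: float   # time shift between adjacent frames
    num_frames: int        # frames that fit without padding


def stft_resolution(sample_rate, window_len, hop_len, num_samples):
    """Describe the T-F grid of a one-sided STFT (hypothetical helper)."""
    if window_len <= 0 or hop_len <= 0:
        raise ValueError("window_len and hop_len must be positive")
    if num_samples < window_len:
        raise ValueError("signal is shorter than one window")
    return StftResolution(
        window_ms=1000.0 * window_len / sample_rate,
        freq_bins=window_len // 2 + 1,
        bin_spacing_hz=sample_rate / window_len,
        frame_step_ms=1000.0 * hop_len / sample_rate,
        num_frames=1 + (num_samples - window_len) // hop_len,
    )


if __name__ == "__main__":
    # Assumed settings: 1 s of audio at 16 kHz, 50% overlap.
    for window_len in (128, 512, 2048):
        r = stft_resolution(16000, window_len, window_len // 2, 16000)
        print(f"window={window_len:5d} samples: "
              f"{r.freq_bins} bins x {r.num_frames} frames, "
              f"{r.bin_spacing_hz:.2f} Hz per bin, "
              f"{r.frame_step_ms:.1f} ms per frame")
```

A longer window gives finer frequency spacing but coarser time steps, so the mask estimator sees a differently shaped input for each setting. This trade-off is the parameter choice whose effect on the DNN is examined.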

Cite

Takeuchi, D., Yatabe, K., Koizumi, Y., Oikawa, Y., & Harada, N. (2020). Effect of spectrogram resolution on deep-neural-network-based speech enhancement. Acoustical Science and Technology, 41(5), 769–775. https://doi.org/10.1250/ast.41.769
