In recent single-channel speech enhancement, deep neural networks (DNNs) have played an important role in achieving high performance. One standard use of a DNN is to construct a mask-generating function for time-frequency (T-F) masking. To apply a mask in the T-F domain, the short-time Fourier transform (STFT) is usually utilized because of its well-understood and invertible nature. While the mask-generating regression function has been studied for a long time, there is less research on the T-F transform from the viewpoint of speech enhancement. Since the performance of speech enhancement depends on both the T-F mask estimator and the T-F transform, investigating the T-F transform should be beneficial for designing a better enhancement system. In this paper, as a step toward an optimal T-F transform in terms of speech enhancement, we experimentally investigated the effect of the parameter settings of the STFT on a DNN-based mask estimator. We conducted experiments using three types of DNN architectures with three types of loss functions, and the results suggested that U-Net is robust to the parameter settings, whereas fully connected and BLSTM networks are not.
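To make the T-F masking pipeline described above concrete, the following is a minimal sketch of STFT-domain masking using SciPy. The window length and hop size correspond to the STFT parameter settings studied in the paper; the `mask_estimator` argument is a hypothetical placeholder standing in for the DNN, which is not specified here.

```python
import numpy as np
from scipy.signal import stft, istft

def enhance(noisy, fs, win_len=512, hop=256, mask_estimator=None):
    """Apply T-F masking in the STFT domain.

    win_len and hop are the STFT parameters whose effect on the DNN-based
    mask estimator is investigated in the paper; mask_estimator is a
    hypothetical stand-in for the DNN mask-generating function.
    """
    # Analysis: complex spectrogram of the noisy signal.
    _, _, X = stft(noisy, fs=fs, nperseg=win_len, noverlap=win_len - hop)

    # Mask estimation: a DNN would map the noisy magnitude to a mask in [0, 1].
    # An all-ones mask is used as a placeholder when no estimator is given.
    mag = np.abs(X)
    mask = mask_estimator(mag) if mask_estimator is not None else np.ones_like(mag)

    # Masking and synthesis: apply the mask and invert back to the time domain.
    _, enhanced = istft(mask * X, fs=fs, nperseg=win_len, noverlap=win_len - hop)
    return enhanced

# Usage example: a 1-second noisy signal at 16 kHz (random stand-in data).
fs = 16000
noisy = np.random.randn(fs)
out = enhance(noisy, fs, win_len=512, hop=256)
```

Changing `win_len` and `hop` alters the spectrogram resolution seen by the mask estimator, which is exactly the design choice whose effect is evaluated experimentally in the paper.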
Takeuchi, D., Yatabe, K., Koizumi, Y., Oikawa, Y., & Harada, N. (2020). Effect of spectrogram resolution on deep-neural-network-based speech enhancement. Acoustical Science and Technology, 41(5), 769–775. https://doi.org/10.1250/ast.41.769