Recent deep learning-based speech enhancement models have made extensive use of attention mechanisms, which have proven effective in achieving state-of-the-art performance. This paper proposes a transformer attention network based sub-convolutional U-Net (TANSCUNet) for speech enhancement. Instead of adopting conventional RNNs or temporal convolutional networks for sequence modeling, we employ a novel transformer-based attention network between the sub-convolutional U-Net encoder and decoder for better feature learning. More specifically, the attention network is composed of several adaptive time-frequency attention modules and an adaptive hierarchical attention module, which aim to capture long-term time-frequency dependencies and to further aggregate hierarchical contextual information. Additionally, the sub-convolutional encoder-decoder uses different kernel sizes to extract multi-scale local and contextual features from the noisy speech. Experimental results show that the proposed model outperforms several state-of-the-art methods.
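To make the idea of attending along both the time and the frequency axis more concrete, the following is a minimal NumPy sketch, not the TANSCUNet implementation. It assumes a feature map of shape (T, F, C) (frames, frequency bins, channels), uses plain scaled dot-product self-attention without learned query/key/value projections, and fuses the two branches with a simple residual sum. The function names `self_attention` and `time_frequency_attention` are hypothetical, and the adaptive weighting and hierarchical aggregation described in the abstract are not modeled here.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x):
    """Scaled dot-product self-attention over the second-to-last axis.

    x has shape (..., L, C). Queries, keys and values are all x itself
    (an assumption made for brevity; a real model would learn projections).
    """
    scale = np.sqrt(x.shape[-1])
    scores = x @ np.swapaxes(x, -1, -2) / scale  # (..., L, L)
    weights = softmax(scores, axis=-1)           # each row sums to 1
    return weights @ x                           # (..., L, C)


def time_frequency_attention(feat):
    """Attend along time and along frequency, then fuse by residual sum.

    feat has shape (T, F, C). The residual-sum fusion is an assumption,
    not the adaptive fusion used in the paper.
    """
    # Time branch: for every frequency bin, attend across the T frames.
    time_out = self_attention(feat.transpose(1, 0, 2)).transpose(1, 0, 2)
    # Frequency branch: for every frame, attend across the F bins.
    freq_out = self_attention(feat)
    return feat + time_out + freq_out
```

Calling `time_frequency_attention` on an array of shape (T, F, C) returns an array of the same shape, so a block like this could sit between an encoder and a decoder without changing the feature dimensions.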
CITATION STYLE
Yecchuri, S., & Vanambathina, S. D. (2024). Sub-convolutional U-Net with transformer attention network for end-to-end single-channel speech enhancement. EURASIP Journal on Audio, Speech, and Music Processing, 2024(1). https://doi.org/10.1186/s13636-024-00331-z