Abstract
Recently, the Transformer has shown potential to exploit long-range sequence dependencies in speech through self-attention. It has been introduced in single-channel speech enhancement to improve the accuracy of speech estimation from a noisy mixture. However, the amount of information represented across attention heads is often huge, which leads to increased computational complexity. To address this issue, axial attention has been proposed, which splits a 2-D attention into two 1-D attentions. In this paper, we develop a new method for speech enhancement that leverages axial attention: we generate time and frequency sub-attention maps by calculating the attention map along the time axis and the frequency axis. Unlike conventional axial attention, the proposed method provides two parallel multi-head attentions, one for the time axis and one for the frequency axis. Moreover, we propose frequency-band aware attention, consisting of high frequency-band attention (HFA) and low frequency-band attention (LFA), which facilitates the exploitation of information related to speech and noise in different frequency bands of the noisy mixture. To re-use high-resolution feature maps from the encoder, we design a U-shaped Transformer, which helps recover information lost in the high-level representations and further improves the speech estimation accuracy. Extensive experiments on four public datasets demonstrate the efficacy of the proposed method.
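The idea of splitting a 2-D attention into two parallel 1-D attentions can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`attention_1d`, `axial_attention`, `band_attention`), the single-head attention with identity projections, the summation of the two branches, and the single split index between the low and high bands are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_1d(x):
    """Single-head scaled dot-product self-attention over the
    second-to-last axis of x, shape (..., n, d).
    Assumption: queries, keys and values are x itself (no learned projections).
    Returns the output (..., n, d) and the attention map (..., n, n)."""
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ x, weights

def axial_attention(x):
    """x has shape (T, F, C): frames, frequency bins, channels.
    Runs time-axis and frequency-axis attention in parallel and sums them
    (the summation is an assumption of this sketch)."""
    # Time axis: for each frequency bin, attend across the T frames.
    out_t, map_t = attention_1d(np.swapaxes(x, 0, 1))   # (F, T, C), (F, T, T)
    out_t = np.swapaxes(out_t, 0, 1)                    # back to (T, F, C)
    # Frequency axis: for each frame, attend across the F bins.
    out_f, map_f = attention_1d(x)                      # (T, F, C), (T, F, F)
    return out_t + out_f, map_t, map_f

def band_attention(x, split):
    """Hypothetical band-wise variant: frequency attention is computed
    separately for bins below `split` (low band) and from `split` up (high band)."""
    low, _ = attention_1d(x[:, :split])
    high, _ = attention_1d(x[:, split:])
    return np.concatenate([low, high], axis=1)

def full_cost(T, F):
    # Number of attention scores for one 2-D attention over all T*F positions.
    return (T * F) ** 2

def axial_cost(T, F):
    # Number of attention scores for the two 1-D attentions.
    return F * T * T + T * F * F
```

Under these assumptions, an input of 6 frames and 8 frequency bins needs `full_cost(6, 8) = 2304` attention scores for a single 2-D attention, against `axial_cost(6, 8) = 672` for the two axial attentions, which is the source of the complexity reduction the abstract refers to.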
Li, Y., Sun, Y., Wang, W., & Naqvi, S. M. (2023). U-Shaped Transformer with Frequency-Band Aware Attention for Speech Enhancement. IEEE/ACM Transactions on Audio Speech and Language Processing, 31, 1511–1521. https://doi.org/10.1109/TASLP.2023.3265839