Low Complexity Speech Enhancement Network Based on Frame-Level Swin Transformer


Abstract

In recent years, the Transformer has shown strong performance in speech enhancement by applying multi-head self-attention to capture long-term dependencies effectively. However, its computational cost grows quadratically with the size of the input speech spectrogram, which makes it expensive for practical use. In this paper, we propose a low-complexity hierarchical frame-level Swin Transformer network (FLSTN) for speech enhancement. FLSTN takes several consecutive frames as a local window and restricts self-attention within it, reducing the complexity to linear in the spectrogram size. A shifted-window mechanism enables information exchange between adjacent windows, so that window-based local attention effectively approximates global attention. The hierarchical structure allows FLSTN to learn speech features at different scales. Moreover, we design a band merging layer and a band expanding layer to decrease and increase the spatial resolution of feature maps, respectively. We tested FLSTN on both 16 kHz wide-band speech and 48 kHz full-band speech. Experimental results demonstrate that FLSTN handles speech of different bandwidths well. With very few multiply–accumulate operations (MACs), FLSTN not only has a significant advantage in computational complexity but also achieves objective speech quality metrics comparable to current state-of-the-art (SOTA) models.
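The core idea the abstract describes, restricting self-attention to windows of consecutive frames and cyclically shifting the windows between layers so information still flows across window boundaries, can be sketched as follows. This is a minimal single-head NumPy illustration of the general Swin-style mechanism, not the authors' implementation; the function name, shapes, and the absence of projections, masking, and multiple heads are all simplifying assumptions.

```python
import numpy as np

def window_attention(x, win, shift=0):
    """Self-attention restricted to local windows of `win` frames.

    x     : (T, D) array of T frames with D features each.
    win   : window length in frames; attention cost per window is
            O(win^2), so total cost is linear in T, not O(T^2).
    shift : cyclically shift frames before windowing (Swin-style),
            so adjacent windows exchange information across layers.
    Hypothetical sketch -- no projections, masks, or multiple heads.
    """
    T, D = x.shape
    assert T % win == 0, "pad T to a multiple of the window size"
    if shift:
        x = np.roll(x, -shift, axis=0)
    out = np.empty_like(x)
    for s in range(0, T, win):                # one attention per window
        w = x[s:s + win]                      # (win, D)
        scores = w @ w.T / np.sqrt(D)         # (win, win) similarity
        a = np.exp(scores - scores.max(axis=-1, keepdims=True))
        a /= a.sum(axis=-1, keepdims=True)    # softmax over the window
        out[s:s + win] = a @ w                # weighted sum of frames
    if shift:
        out = np.roll(out, shift, axis=0)     # undo the cyclic shift
    return out
```

With `shift=0`, each window's output depends only on its own frames; alternating layers with `shift=win // 2` is what lets the stacked local windows behave like (approximate) global attention.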

Citation (APA)
Jiang, W., Sun, C., Chen, F., Leng, Y., Guo, Q., Sun, J., & Peng, J. (2023). Low Complexity Speech Enhancement Network Based on Frame-Level Swin Transformer. Electronics (Switzerland), 12(6). https://doi.org/10.3390/electronics12061330
