LipSound2: Self-Supervised Pre-Training for Lip-to-Speech Reconstruction and Lip Reading

Abstract

The aim of this work is to investigate the impact of cross-modal self-supervised pre-training for speech reconstruction (video-to-audio) by leveraging the natural co-occurrence of audio and visual streams in videos. We propose LipSound2, an encoder-decoder architecture with a location-aware attention mechanism that maps face image sequences directly to mel-scale spectrograms, without requiring any human annotations. The LipSound2 model is first pre-trained on ∼2400 h of multilingual (e.g., English and German) audio-visual data (VoxCeleb2). To verify the generalizability of the proposed method, we then fine-tune the pre-trained model on domain-specific datasets (GRID and TCD-TIMIT) for English speech reconstruction and achieve significant improvements in speech quality and intelligibility over previous approaches, in both speaker-dependent and speaker-independent settings. In addition to English, we perform Chinese speech reconstruction on the Chinese Mandarin Lip Reading (CMLR) dataset to verify the model's transferability. Finally, we build a cascaded lip-reading (video-to-text) system by fine-tuning a pre-trained speech recognition model on the generated audio, and achieve state-of-the-art performance on both English and Chinese benchmark datasets.
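
The video-to-spectrogram mapping described in the abstract can be sketched in a few lines of PyTorch. The sketch below is a minimal, illustrative reading of that description, not the authors' implementation: all module names, layer sizes, and the choice of GRUs are assumptions. It pairs a recurrent encoder over pre-extracted face-frame features with an autoregressive decoder that uses location-aware (content + location) attention, in the style of Chorowski et al., to emit one mel frame per step.

import torch
import torch.nn as nn
import torch.nn.functional as F


class LocationAwareAttention(nn.Module):
    """Content + location attention: the previous step's attention
    weights are convolved and added into the alignment score."""

    def __init__(self, enc_dim, dec_dim, attn_dim, conv_channels=32, kernel=31):
        super().__init__()
        self.query = nn.Linear(dec_dim, attn_dim, bias=False)
        self.memory = nn.Linear(enc_dim, attn_dim, bias=False)
        self.location_conv = nn.Conv1d(1, conv_channels, kernel,
                                       padding=kernel // 2, bias=False)
        self.location_proj = nn.Linear(conv_channels, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, dec_state, enc_out, prev_attn):
        # prev_attn: (B, T_enc) attention weights from the previous step
        loc = self.location_conv(prev_attn.unsqueeze(1)).transpose(1, 2)
        score = self.v(torch.tanh(
            self.query(dec_state).unsqueeze(1)
            + self.memory(enc_out)
            + self.location_proj(loc))).squeeze(-1)
        attn = F.softmax(score, dim=-1)                      # (B, T_enc)
        context = torch.bmm(attn.unsqueeze(1), enc_out).squeeze(1)
        return context, attn


class LipToMel(nn.Module):
    """Encoder-decoder mapping face-frame features to mel spectrograms.
    Dimensions are illustrative assumptions, not the paper's config."""

    def __init__(self, frame_feat=512, enc_dim=256, dec_dim=256, n_mels=80):
        super().__init__()
        self.encoder = nn.GRU(frame_feat, enc_dim // 2,
                              batch_first=True, bidirectional=True)
        self.attention = LocationAwareAttention(enc_dim, dec_dim, 128)
        self.decoder_cell = nn.GRUCell(n_mels + enc_dim, dec_dim)
        self.mel_proj = nn.Linear(dec_dim + enc_dim, n_mels)
        self.n_mels = n_mels

    def forward(self, frames, n_steps):
        # frames: (B, T_video, frame_feat) pre-extracted visual features
        enc_out, _ = self.encoder(frames)
        B, T_enc, _ = enc_out.shape
        state = enc_out.new_zeros(B, self.decoder_cell.hidden_size)
        attn = enc_out.new_zeros(B, T_enc)
        mel = enc_out.new_zeros(B, self.n_mels)
        mels = []
        for _ in range(n_steps):  # teacher forcing omitted for brevity
            context, attn = self.attention(state, enc_out, attn)
            state = self.decoder_cell(torch.cat([mel, context], -1), state)
            mel = self.mel_proj(torch.cat([state, context], -1))
            mels.append(mel)
        return torch.stack(mels, dim=1)  # (B, n_steps, n_mels)


model = LipToMel()
video_feats = torch.randn(2, 75, 512)   # e.g., 3 s of video at 25 fps
pred_mels = model(video_feats, n_steps=300)
print(pred_mels.shape)                   # torch.Size([2, 300, 80])

In the paper's pipeline, the pre-training target is the mel spectrogram of the video's own audio track, which is what makes the procedure self-supervised. A vocoder (spectrogram to waveform) and, for the cascaded lip-reading system, a pre-trained speech recognition model would sit downstream of a module like this one.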

Citation (APA)

Qu, L., Weber, C., & Wermter, S. (2024). LipSound2: Self-Supervised Pre-Training for Lip-to-Speech Reconstruction and Lip Reading. IEEE Transactions on Neural Networks and Learning Systems, 35(2), 2772–2782. https://doi.org/10.1109/TNNLS.2022.3191677
