Benefits of pre-trained mono- and cross-lingual speech representations for spoken language understanding of Dutch dysarthric speech

Abstract

With the rise of deep learning, spoken language understanding (SLU) for command-and-control applications, such as voice-controlled virtual assistants, can offer reliable hands-free operation to physically disabled individuals. However, due to data scarcity, processing dysarthric speech remains a challenge. Pre-training (part of) the SLU model with supervised automatic speech recognition (ASR) targets or with self-supervised learning (SSL) may help to overcome the lack of data, but no research has shown which pre-training strategy performs better for SLU on dysarthric speech, or to what extent the SLU task benefits from knowledge transfer from pre-training on dysarthric acoustic tasks. This work compares mono- and cross-lingual pre-training methodologies (supervised and unsupervised) and quantitatively investigates the benefits of pre-training for SLU tasks on Dutch dysarthric speech. The designed SLU systems consist of a pre-trained speech representation encoder and an SLU decoder that maps the encoded features to intents. Four types of pre-trained encoders are evaluated: a mono-lingual time-delay neural network (TDNN) acoustic model, a mono-lingual transformer acoustic model, a cross-lingual transformer acoustic model (Whisper), and a cross-lingual SSL Wav2Vec2.0 model (XLSR-53). These are combined with three types of SLU decoders: non-negative matrix factorization (NMF), capsule networks, and long short-term memory (LSTM) networks. The four pre-trained encoders are first evaluated acoustically on Dutch dysarthric home-automation data using word error rate (WER) to investigate the correlation between the dysarthric acoustic task (ASR) and the semantic task (SLU). By introducing the intelligibility score (IS) as a metric of impairment severity, the paper further quantitatively analyzes dysarthria-severity-dependent models for SLU tasks.
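The encoder-decoder design described in the abstract (a frozen pre-trained speech representation encoder followed by an SLU decoder that maps features to intents) can be sketched as below. This is a minimal illustrative sketch, not the authors' implementation: the feature dimension, hidden size, number of intents, and the final-hidden-state pooling are assumptions, and a random tensor stands in for the output of a frozen encoder such as Wav2Vec2.0/XLSR-53.

```python
import torch
import torch.nn as nn

class SLUIntentDecoder(nn.Module):
    """LSTM-based SLU decoder: maps a sequence of encoded speech
    features to intent logits. Dimensions are illustrative assumptions,
    not the paper's configuration."""

    def __init__(self, feature_dim=768, hidden_dim=128, num_intents=10):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_intents)

    def forward(self, features):
        # features: (batch, time, feature_dim), e.g. frame-level
        # representations from a frozen pre-trained encoder
        _, (h_n, _) = self.lstm(features)
        # concatenate the final forward and backward hidden states
        pooled = torch.cat([h_n[0], h_n[1]], dim=-1)
        return self.classifier(pooled)

# Stand-in for frozen-encoder output: 2 utterances, 50 frames each
features = torch.randn(2, 50, 768)
decoder = SLUIntentDecoder()
logits = decoder(features)
print(logits.shape)  # torch.Size([2, 10])
```

In practice the decoder would be trained on (utterance, intent) pairs with a cross-entropy loss while the pre-trained encoder stays frozen or is fine-tuned, which is the general pattern the abstract's pre-training comparison evaluates.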

Cite

CITATION STYLE

APA

Wang, P., & Van hamme, H. (2023). Benefits of pre-trained mono- and cross-lingual speech representations for spoken language understanding of Dutch dysarthric speech. EURASIP Journal on Audio, Speech, and Music Processing, 2023(1). https://doi.org/10.1186/s13636-023-00280-z
