Exploring Native and Non-Native English Child Speech Recognition With Whisper

This article is free to access.

Abstract

Modern end-to-end Automatic Speech Recognition (ASR) systems struggle to recognise children's speech. This challenge is due to the high acoustic variability in children's voices and the scarcity of child speech training data, particularly for accented or low-resource languages. This study focuses on improving the performance of ASR on native and non-native English child speech using publicly available datasets. We evaluate how the large-scale Whisper models (trained with a large amount of adult speech data) perform with child speech. In addition, we perform fine-tuning experiments using different child speech datasets to investigate the performance of Whisper ASR on non-native English-speaking children's speech. Our findings indicate relative Word Error Rate (WER) improvements ranging from 29% to 89% over previous benchmarks on the same datasets. Notably, these gains were achieved by fine-tuning with only a 10% sample of unseen non-native datasets. These results demonstrate the potential of Whisper for improving ASR in a low-resource scenario for non-native child speech.
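As a brief illustration of the WER metric behind the reported numbers, the following is a minimal word-level edit-distance sketch, together with the relative-improvement formula implied by the "29% to 89%" figures. The function names are illustrative; the paper itself would use a standard ASR evaluation toolkit.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution cost
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # match / substitution
    return dp[len(ref)][len(hyp)] / len(ref)


def relative_wer_improvement(baseline: float, finetuned: float) -> float:
    """Relative reduction in WER, as reported against prior benchmarks."""
    return (baseline - finetuned) / baseline
```

For example, one substituted word out of four gives a WER of 0.25, and halving a baseline WER corresponds to a 50% relative improvement.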

Citation (APA)

Jain, R., Barcovschi, A., Yiwere, M. Y., Corcoran, P., & Cucu, H. (2024). Exploring Native and Non-Native English Child Speech Recognition With Whisper. IEEE Access, 12, 41601–41610. https://doi.org/10.1109/ACCESS.2024.3378738
