Deep group residual convolutional CTC networks for speech recognition

Abstract

End-to-end deep neural networks have been widely used in the literature to model 2D correlations in the audio signal. Both Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks have shown improvements across a wide variety of speech recognition tasks. In particular, CNNs effectively exploit temporal and spectral local correlations to gain translation invariance. However, the CNNs used in existing work treat each channel's feature maps as independent of one another, which may not fully utilize and combine information about the input features. Moreover, most CNNs in the literature use shallow architectures that may not be deep enough to capture all the information in the human speech signal. In this paper, we propose a novel neural network, denoted GRCNN-CTC, which integrates group residual convolutional blocks and recurrent layers paired with the Connectionist Temporal Classification (CTC) loss. Experimental results show that our proposed GRCNN-CTC achieves 1.11% Word Error Rate (WER) and 0.48% Character Error Rate (CER) improvements on a subset of the LibriSpeech dataset compared to the baseline automatic speech recognition (ASR) system. In addition, our model greatly reduces computational overhead and converges faster, making it possible to scale up to deeper architectures.
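One way to see where the reduction in computational overhead comes from is through the weight count of a grouped convolution: splitting the channels into g groups divides the number of convolutional weights by g. The sketch below is illustrative only (the function name and layer sizes are assumptions, not taken from the paper) and simply counts the parameters of a 2D convolution with and without grouping.

```python
def conv2d_param_count(c_in: int, c_out: int, kernel: int, groups: int = 1) -> int:
    """Parameter count of a 2D convolution with square kernel and bias.

    Each of the `groups` groups maps c_in/groups input channels to
    c_out/groups output channels, so weights = c_in * c_out * k^2 / groups.
    """
    assert c_in % groups == 0 and c_out % groups == 0
    weights = (c_in // groups) * (c_out // groups) * kernel * kernel * groups
    bias = c_out
    return weights + bias

# Hypothetical layer sizes, chosen only to illustrate the g-fold saving:
standard = conv2d_param_count(64, 64, 3, groups=1)  # 64*64*9 + 64 = 36928
grouped = conv2d_param_count(64, 64, 3, groups=4)   # 64*64*9/4 + 64 = 9280
print(standard, grouped)
```

With 4 groups the weight count drops to a quarter of the standard convolution, which is consistent with the abstract's claim of reduced overhead enabling deeper stacks of blocks.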

Citation (APA)

Wang, K., Guan, D., & Li, B. (2018). Deep group residual convolutional CTC networks for speech recognition. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 11323 LNAI, pp. 318–328). Springer Verlag. https://doi.org/10.1007/978-3-030-05090-0_27
