End-to-end Mandarin speech recognition combining CNN and BLSTM


Abstract

Because conventional Automatic Speech Recognition (ASR) systems consist of many modules and require diverse domain expertise, they are hard to build and train. Recent research shows that end-to-end ASR systems can significantly simplify the speech recognition pipeline and achieve performance competitive with conventional systems. However, most end-to-end ASR systems are neither reproducible nor comparable because they rely on specific language models and in-house training databases that are not freely available; this is especially common in Mandarin speech recognition. In this paper, we propose a CNN+BLSTM+CTC end-to-end Mandarin ASR system. It uses a Convolutional Neural Network (CNN) to learn local speech features, a Bidirectional Long Short-Term Memory (BLSTM) network to capture past and future contextual information, and Connectionist Temporal Classification (CTC) for decoding. Our model is trained entirely on AISHELL-1, the largest open-source Mandarin speech corpus to date, using neither in-house databases nor external language models. Experiments show that our CNN+BLSTM+CTC model achieves a word error rate (WER) of 19.2%, outperforming the existing best result. Because all the corpora we used are freely available, our model is reproducible and comparable, providing a new baseline for further Mandarin ASR research.
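For illustration, below is a minimal PyTorch sketch of a CNN+BLSTM+CTC acoustic model of the kind the abstract describes. It is not the authors' exact architecture: the layer sizes, kernel shapes, the 80-dimensional filterbank input, and the 4,233-symbol output vocabulary are all illustrative assumptions.

# Minimal CNN+BLSTM+CTC sketch (assumed hyperparameters, not the paper's).
import torch
import torch.nn as nn

class CNNBLSTMCTC(nn.Module):
    def __init__(self, n_mels=80, vocab_size=4233, hidden=256):
        super().__init__()
        # CNN front end: learns local time-frequency patterns from the
        # spectrogram, treated as a 1-channel image.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        # After two stride-2 convolutions the frequency axis is n_mels // 4.
        cnn_out_dim = 32 * (n_mels // 4)
        # BLSTM: models past and future context along the time axis.
        self.blstm = nn.LSTM(cnn_out_dim, hidden, num_layers=3,
                             batch_first=True, bidirectional=True)
        # Linear projection to output symbols plus the CTC blank.
        self.fc = nn.Linear(2 * hidden, vocab_size + 1)

    def forward(self, feats):
        # feats: (batch, time, n_mels) log-mel filterbank features
        x = feats.unsqueeze(1)                           # (B, 1, T, F)
        x = self.cnn(x)                                  # (B, C, T', F')
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)   # (B, T', C*F')
        x, _ = self.blstm(x)                             # (B, T', 2*hidden)
        return self.fc(x).log_softmax(-1)                # log-probs for CTC

# Training minimizes the CTC loss over character targets; greedy or
# beam-search CTC decoding then produces the final transcription.
model = CNNBLSTMCTC()
ctc_loss = nn.CTCLoss(blank=model.fc.out_features - 1, zero_infinity=True)
feats = torch.randn(2, 200, 80)                   # dummy feature batch
log_probs = model(feats)                          # (2, T'=50, vocab+1)
targets = torch.randint(0, 4233, (2, 20))         # dummy label sequences
input_lens = torch.full((2,), log_probs.size(1), dtype=torch.long)
target_lens = torch.full((2,), 20, dtype=torch.long)
loss = ctc_loss(log_probs.transpose(0, 1), targets, input_lens, target_lens)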

References

Librispeech: An ASR corpus based on public domain audio books

Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks

AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline



Citation (APA)

Wang, D., Wang, X., & Lv, S. (2019). End-to-end Mandarin speech recognition combining CNN and BLSTM. Symmetry, 11(5). https://doi.org/10.3390/sym11050644

