In Automatic Speech Recognition (ASR), the acoustic model (AM) is commonly realized as a Deep Neural Network (DNN). The DNN learns posterior probabilities in a supervised fashion from input features and ground-truth labels. Current approaches combine a DNN with a Hidden Markov Model (HMM) in a hybrid system, which has achieved strong results in recent years. Comparable approaches using a discrete variant, i.e. a Discrete Hidden Markov Model (DHMM), have been largely disregarded in the recent past. Our approach revisits the idea of a discrete system, more precisely the so-called Deep Neural Network Quantizer (DNNQ), and demonstrates how a DNNQ is created and trained. We introduce a novel approach to train a DNNQ in a supervised fashion with an arbitrary output layer size, even though suitable target values are not available. The proposed method provides a mapping function that exploits fixed ground-truth labels. Consequently, we are able to apply frame-based cross-entropy (CE) training. Our experiments demonstrate that the DNNQ reduces the Word Error Rate (WER) by 17.6% on monophones and by 2.2% on triphones compared to a continuous HMM-Gaussian Mixture Model (GMM) system.
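To make the training idea concrete, the following is a minimal sketch of frame-based cross-entropy training against mapped targets. It assumes a simple many-to-one label mapping purely for illustration; the paper's actual mapping function, class counts, and output layer size are not specified here, so `map_labels`, the 40-class monophone inventory, and the output size of 256 are all hypothetical.

```python
import numpy as np

def map_labels(labels, num_classes, output_size):
    # Hypothetical mapping from fixed ground-truth labels to an
    # arbitrary-size output layer (NOT the paper's mapping function):
    # each of the num_classes labels is projected onto output_size units.
    return (labels * output_size) // num_classes

def frame_ce_loss(logits, targets):
    # Frame-based cross entropy: mean negative log-probability of the
    # target unit per frame, using a numerically stable log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
labels = rng.integers(0, 40, size=8)      # 8 frames, 40 monophone classes (assumed)
targets = map_labels(labels, 40, 256)     # arbitrary output layer size (assumed)
logits = rng.standard_normal((8, 256))    # stand-in for DNNQ output activations
loss = frame_ce_loss(logits, targets)
```

With such a mapping in place, each frame obtains a valid target index in the enlarged output layer, so standard per-frame CE training applies unchanged.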
Citation:
Watzel, T., Li, L., Kürzinger, L., & Rigoll, G. (2019). Deep neural network quantizers outperforming continuous speech recognition systems. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 11658 LNAI, pp. 530–539). Springer Verlag. https://doi.org/10.1007/978-3-030-26061-3_54