Phone recognition with hierarchical convolutional deep maxout networks

László Tóth

Journal article (open access)

Eurasip Journal on Audio, Speech, and Music Processing (2015) 2015(1)

DOI: 10.1186/s13636-015-0068-3

Citations: 68

Readers: 56

Abstract

Deep convolutional neural networks (CNNs) have recently been shown to outperform fully connected deep neural networks (DNNs) on both low-resource and large-scale speech tasks. Experiments indicate that convolutional networks can attain a 10–15% relative improvement in word error rate over fully connected deep networks on large vocabulary recognition tasks. Here, we explore some refinements to CNNs that have not been pursued by other authors. First, the CNN papers published so far used sigmoid or rectified linear (ReLU) neurons. We experiment with the recently proposed maxout activation function, which has been shown to outperform the rectifier activation function in fully connected DNNs. We show that the pooling operation of CNNs and the maxout function are closely related, so the two techniques can be readily combined to build convolutional maxout networks. Second, we propose to turn the CNN into a hierarchical model. The origins of this approach go back to the era of shallow nets, when the idea of stacking two networks on top of each other was relatively well known. We extend this method by fusing the two networks into one joint deep model with many hidden layers and a special structure. We show that with the hierarchical modelling approach, we can reduce the error rate of the network on an expanded context of input. In experiments on the Texas Instruments/Massachusetts Institute of Technology (TIMIT) phone recognition task, we find that a CNN built from maxout units yields a relative phone error rate reduction of about 4.3% over ReLU CNNs. Applying the hierarchical modelling scheme to this CNN results in a further relative phone error rate reduction of 5.5%. Using dropout training, the lowest error rate we get on TIMIT is 16.5%, which is currently the best result.
Besides experimenting on TIMIT, we also evaluate our best models on a low-resource large vocabulary task, and we find that all the proposed modelling improvements give consistently better results for this larger database as well.
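
The relationship between pooling and maxout mentioned in the abstract can be illustrated with a small sketch. This is not the author's implementation; it is a minimal pure-Python example under the assumption that activations are stored as a filters-by-positions grid. The function names, the toy values in `acts`, and the choice of non-overlapping windows are all hypothetical and chosen only for illustration. Max pooling takes the maximum of one filter over neighbouring positions, while maxout takes the maximum over a group of filters at one position, so both are the same max reduction applied along different axes.

```python
from typing import List

# Rows are filters, columns are positions (for example, frequency bands).
# Values stand in for linear pre-activations; they are made up for illustration.
Matrix = List[List[float]]


def max_pool_positions(acts: Matrix, pool_size: int) -> Matrix:
    """Max pooling: for each filter, keep the maximum over
    non-overlapping windows of adjacent positions."""
    return [
        [max(row[i:i + pool_size]) for i in range(0, len(row), pool_size)]
        for row in acts
    ]


def maxout_filters(acts: Matrix, group_size: int) -> Matrix:
    """Maxout: at each position, keep the maximum over
    non-overlapping groups of filters."""
    n_positions = len(acts[0])
    return [
        [max(row[p] for row in acts[g:g + group_size]) for p in range(n_positions)]
        for g in range(0, len(acts), group_size)
    ]


def transpose(m: Matrix) -> Matrix:
    """Swap the filter axis and the position axis."""
    return [list(col) for col in zip(*m)]


# Hypothetical toy input: 4 filters evaluated at 4 positions.
acts = [
    [0.2, -1.0, 0.5, 0.1],
    [0.7, 0.3, -0.4, 0.9],
    [-0.3, 0.8, 0.0, 0.4],
    [0.1, 0.2, 0.6, -0.5],
]

pooled = max_pool_positions(acts, 2)   # shape: 4 filters x 2 pooled positions
maxed = maxout_filters(acts, 2)        # shape: 2 maxout units x 4 positions

# Maxout over filters equals max pooling applied along the other axis.
same = maxed == transpose(max_pool_positions(transpose(acts), 2))
```

In this sketch the final comparison holds by construction, which is one way to read the claim that the two operations combine readily: a convolutional maxout layer could apply the max over both neighbouring positions and a group of filters at once.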

Citation (APA)

Tóth, L. (2015). Phone recognition with hierarchical convolutional deep maxout networks. Eurasip Journal on Audio, Speech, and Music Processing, 2015(1). https://doi.org/10.1186/s13636-015-0068-3
