Articulatory gesture rich representation learning of phonological units in low resource settings

Abstract

Recent literature presents evidence that both linguistic (phonemic) and non-linguistic (speaker identity, emotional content) information resides on a lower-dimensional manifold embedded in high-dimensional spectral features such as MFCC and PLP. Linguistic or phonetic units of speech can be decomposed into a finite inventory of articulatory gestures shared across phonemes according to their manner of articulation. We aim to discover a subspace that is rich in the gestural information of speech and captures the invariance of similar gestures. In this paper, we investigate which unsupervised techniques are best suited to learning such a subspace. The main contribution of the paper is an approach that learns a gesture-rich representation of speech automatically from data in a completely unsupervised manner. The study compares the representations obtained from a convolutional autoencoder (ConvAE) with those from standard unsupervised dimensionality-reduction techniques, namely Principal Component Analysis (PCA) and the manifold-learning methods Locally Linear Embedding (LLE), Isomap, and Laplacian Eigenmaps, using phoneme classification as the evaluation task. Representations that best separate different gestures are suitable for discovering subword units under low- or zero-resource speech conditions. We further evaluate the representations using the Zero Resource Speech Challenge's ABX discriminability measure. Results indicate that the representations obtained from the ConvAE and Isomap outperform baseline MFCC features in both phoneme classification and the ABX measure, and induce separation between sounds composed of different sets of gestures. We then cluster the representations with a Dirichlet Process Gaussian Mixture Model (DPGMM) to learn the cluster distribution of the data automatically, and show that these clusters correspond to groups with a similar manner of articulation. The DPGMM distribution is used as a prior to obtain correspondence terms for robust ConvAE training.
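As an illustrative sketch of the comparison described above, the unsupervised baselines (PCA, LLE, Isomap, Laplacian Eigenmaps) can be probed with a simple phoneme classifier along the following lines. The paper does not specify its implementation; scikit-learn, the random placeholder data, the embedding dimension, and the k-NN probe are all assumptions made here for illustration.

```python
# Minimal sketch (assumed tooling: scikit-learn; not the paper's code).
# Compares unsupervised embeddings of MFCC frames via a phoneme-classification probe.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap, LocallyLinearEmbedding, SpectralEmbedding
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# X: (n_frames, n_mfcc) spectral features; y: frame-level phoneme labels.
# Both are placeholders -- random data stands in for a real corpus.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 39))          # e.g. 13 MFCCs + deltas + delta-deltas
y = rng.integers(0, 10, size=2000)       # hypothetical phoneme labels

reducers = {
    "PCA": PCA(n_components=10),
    "Isomap": Isomap(n_neighbors=10, n_components=10),
    "LLE": LocallyLinearEmbedding(n_neighbors=10, n_components=10),
    "Laplacian Eigenmaps": SpectralEmbedding(n_components=10, n_neighbors=10),
}

for name, reducer in reducers.items():
    Z = reducer.fit_transform(X)         # unsupervised embedding of the frames
    # Probe the embedding with a simple k-NN phoneme classifier.
    acc = cross_val_score(KNeighborsClassifier(5), Z, y, cv=3).mean()
    print(f"{name}: phoneme-classification accuracy = {acc:.3f}")
```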

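The DPGMM clustering step can likewise be approximated with scikit-learn's variational BayesianGaussianMixture using a Dirichlet-process weight prior, which lets the effective number of clusters be inferred from the data. This is a hedged stand-in under that assumption, not the paper's implementation; the representations Z and all hyperparameters are placeholders.

```python
# Minimal DPGMM clustering sketch (assumption: scikit-learn's variational
# BayesianGaussianMixture approximates the Dirichlet Process GMM in the paper).
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
Z = rng.normal(size=(2000, 10))          # placeholder learned representations

dpgmm = BayesianGaussianMixture(
    n_components=50,                     # upper bound; unused components shrink away
    weight_concentration_prior_type="dirichlet_process",
    covariance_type="full",
    max_iter=500,
)
labels = dpgmm.fit_predict(Z)
# Effective number of clusters: components that actually claim data. These cluster
# assignments could then serve as the prior for selecting correspondence terms.
print("active clusters:", len(np.unique(labels)))
```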
Cite

APA

Srivastava, B. M. L., & Shrivastava, M. (2016). Articulatory gesture rich representation learning of phonological units in low resource settings. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 9918 LNCS, pp. 80–95). Springer Verlag. https://doi.org/10.1007/978-3-319-45925-7_7
