By virtue of its superiority in handling sequence data and its effectiveness in preserving long-distance information, the recurrent neural network language model (RNNLM) has prevailed in a range of tasks in recent years. However, a large quantity of data is required for language modelling to achieve good performance, which makes modeling low-resource languages difficult. To address this issue, Tibetan, one of the minority languages, is taken as a case study, and its radicals (the components of Tibetan characters) are explored for constructing a language model. Motivated by the inherent structure of Tibetan, a novel construction of Tibetan character embeddings is introduced into the RNNLM. The fusion of individual radical embeddings is performed in three ways: uniform weights (TRU), different weights (TRD) and radical combination (TRC). This structure, especially when combined with the radicals, extends the capability to capture long-term context dependencies and alleviates the low-resource problem to some extent. The experimental results suggest that the proposed structure outperforms the standard RNNLM, yielding 7.4%, 12.7% and 13.5% relative perplexity reductions with TRU, TRD and TRC respectively.
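The three fusion schemes named in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the number of radicals, the embedding dimension, and the weights and projection matrix are all hypothetical placeholders (in the paper they would be learned during training).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: one Tibetan character composed of 3 radicals,
# each radical mapped to a d-dimensional embedding (values illustrative).
d = 4
radical_embs = rng.normal(size=(3, d))  # one row per radical

# TRU: uniform-weight fusion -- the plain mean of the radical embeddings.
char_emb_tru = radical_embs.mean(axis=0)

# TRD: different-weight fusion -- one weight per radical (fixed here
# for illustration; learned in practice), normalized to sum to 1.
w = np.array([0.5, 0.3, 0.2])
char_emb_trd = (w[:, None] * radical_embs).sum(axis=0)

# TRC: radical combination -- concatenate the radical embeddings and
# project back to d dimensions with a (hypothetical) learned matrix W.
W = rng.normal(size=(3 * d, d))
char_emb_trc = radical_embs.reshape(-1) @ W

# Each scheme yields a single d-dimensional character embedding,
# which would then feed the RNNLM input layer in place of (or alongside)
# a standard character embedding.
print(char_emb_tru.shape, char_emb_trd.shape, char_emb_trc.shape)
```

Under this reading, TRU and TRD keep the character embedding the same size as a radical embedding, while TRC lets the model learn how radical information combines before the projection.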
CITATION
Shen, T., Wang, L., Chen, X., Khysru, K., & Dang, J. (2017). Exploiting the Tibetan Radicals in Recurrent Neural Network for Low-Resource Language Models. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 10635 LNCS, pp. 266–275). Springer Verlag. https://doi.org/10.1007/978-3-319-70096-0_28