In this paper, we address the challenge of creating accurate and robust part-of-speech taggers for low-resource languages. We propose a method that leverages existing parallel data between the target language and a large set of resource-rich languages, without ancillary resources such as tag dictionaries. Crucially, we use canonical correlation analysis (CCA) to induce latent word representations that incorporate cross-genre distributional cues, as well as projected tags from a full array of resource-rich languages. We develop a probability-based confidence model to identify words with highly likely tag projections and use these words to train a multi-class SVM using the CCA features. Our method yields an average accuracy of 85% for languages with almost no resources, outperforming a state-of-the-art partially observed CRF model.
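The confidence-filtering step can be illustrated with a minimal sketch. It is not the authors' model: it assumes confidence is simply the share of source languages that agree on a word's most frequent projected tag. The function name `confident_tags`, the `threshold` and `min_votes` parameters, and the toy words and tags are all hypothetical.

```python
from collections import Counter


def confident_tags(projections, threshold=0.8, min_votes=3):
    """Select words whose projected tags agree strongly across source languages.

    projections: dict mapping a target-language word to the list of tags
    projected onto it from aligned resource-rich languages.
    Confidence here is the vote share of the most common tag (an assumption,
    not the paper's probability model).
    """
    selected = {}
    for word, tags in projections.items():
        if len(tags) < min_votes:
            # Too few projections to trust the estimate.
            continue
        tag, count = Counter(tags).most_common(1)[0]
        confidence = count / len(tags)
        if confidence >= threshold:
            selected[word] = (tag, confidence)
    return selected


# Toy projected tags (hypothetical data).
projections = {
    "casa": ["NOUN", "NOUN", "NOUN", "NOUN", "VERB"],
    "corre": ["VERB", "NOUN", "VERB", "ADJ"],
    "de": ["ADP", "ADP"],
}
seeds = confident_tags(projections)
```

Words that pass the filter would then serve as labeled training examples for the classifier, with the CCA-derived word representations as features.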
Kim, Y. B., Snyder, B., & Sarikaya, R. (2015). Part-of-speech taggers for low-resource languages using CCA features. In Conference Proceedings - EMNLP 2015: Conference on Empirical Methods in Natural Language Processing (pp. 1292–1302). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/d15-1150