Analysis and Development of Urdu POS Tagged Corpus

Abstract

In this paper, two Urdu corpora (of 110K and 120K words), tagged with different POS tagsets, are used to train the TnT and Tree taggers. An error analysis of both taggers identifies frequent tagging confusions. Based on this analysis and on the syntactic structure of Urdu, a more refined tagset is derived. The existing tagged corpora are retagged with the new tagset to produce a single corpus of 230K words, and the TnT tagger is retrained. The results show that tagging accuracy improves to 94.2% for the individual corpora and to 91% for the merged corpus. Implications of these results are discussed.
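The kind of error analysis the abstract describes, finding which tags a tagger most often confuses, can be illustrated with a minimal sketch. It is not the authors' procedure: the function name `confusion_pairs`, the tag labels, and the toy sequences are all assumptions made for illustration, and the sketch only compares a gold tag sequence against a predicted one.

```python
from collections import Counter


def confusion_pairs(gold, predicted):
    """Return (accuracy, Counter of (gold_tag, predicted_tag) mismatches)."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    # Count every position where the predicted tag differs from the gold tag.
    errors = Counter((g, p) for g, p in zip(gold, predicted) if g != p)
    accuracy = 1 - sum(errors.values()) / len(gold)
    return accuracy, errors


# Hypothetical tags for illustration only; not the paper's tagset or data.
gold = ["NN", "VB", "ADJ", "NN", "PN", "NN"]
pred = ["NN", "VB", "NN", "NN", "NN", "NN"]

accuracy, errors = confusion_pairs(gold, pred)
for (gold_tag, pred_tag), count in errors.most_common():
    print(f"{gold_tag} tagged as {pred_tag}: {count}")
print(f"accuracy: {accuracy:.3f}")
```

Sorting the mismatches by frequency surfaces the tag pairs that account for most errors, which is the sort of evidence that could motivate splitting or merging tags in a revised tagset.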

Cite

Muaz, A., Ali, A., & Hussain, S. (2009). Analysis and Development of Urdu POS Tagged Corpus. In Proceedings of the 7th Workshop on Asian Language Resources, ALR 2009 - in conjunction with the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing (pp. 24–31). Association for Computational Linguistics (ACL). https://doi.org/10.3115/1690299.1690303
