Smoothing a lexicon-based POS tagger for Arabic and Hebrew

Saib Mansour; Khalil Sima'An; Yoad Winter

Proceedings of the Annual Meeting of the Association for Computational Linguistics (2007) 97-103

DOI: 10.3115/1654576.1654593

Abstract

We propose an enhanced Part-of-Speech (POS) tagger for Semitic languages that treats Modern Standard Arabic (henceforth Arabic) and Modern Hebrew (henceforth Hebrew) using the same probabilistic model and architectural setting. We start by porting an existing Hidden Markov Model POS tagger for Hebrew to Arabic, replacing the Hebrew morphological analyzer with Buckwalter's (2002) morphological analyzer for Arabic. This gives state-of-the-art accuracy (96.12%), comparable to Habash and Rambow's (2005) analyzer-based POS tagger on the same Arabic datasets. However, further improvement of such analyzer-based tagging methods is hindered by the incomplete coverage of standard morphological analyzers (Bar Haim et al., 2005). To overcome this coverage problem, we supplement the output of Buckwalter's analyzer with synthetically constructed analyses proposed by a model that uses character information (Diab et al., 2004), in a way that is similar to Nakagawa's (2004) system for Chinese and Japanese. A version of this extended model that (unlike Nakagawa's) also incorporates synthetically constructed analyses for known words achieves 96.28% accuracy on the standard Arabic test set.
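The general idea of analyzer-restricted HMM tagging can be illustrated with a minimal sketch. This is not the authors' implementation: the bigram model, the probability tables, the floor value, and the toy English tokens are all assumptions made for illustration, and the fallback for words missing from the lexicon (allow every tag) is only a crude stand-in for the character-based synthetic analyses described in the abstract.

```python
import math

# Toy model, entirely hypothetical: not taken from the paper.
TAGS = ["DET", "NOUN", "VERB"]
TRANS = {  # P(tag | previous tag)
    ("<s>", "DET"): 0.6, ("<s>", "NOUN"): 0.3, ("<s>", "VERB"): 0.1,
    ("DET", "DET"): 0.05, ("DET", "NOUN"): 0.9, ("DET", "VERB"): 0.05,
    ("NOUN", "DET"): 0.1, ("NOUN", "NOUN"): 0.3, ("NOUN", "VERB"): 0.6,
    ("VERB", "DET"): 0.5, ("VERB", "NOUN"): 0.4, ("VERB", "VERB"): 0.1,
}
EMIT = {  # P(word | tag)
    ("DET", "the"): 0.9, ("NOUN", "book"): 0.4, ("VERB", "book"): 0.1,
}
# Tags a (hypothetical) morphological analyzer proposes for each known word.
LEXICON = {"the": {"DET"}, "book": {"NOUN", "VERB"}}


def viterbi(words, tags, trans, emit, lexicon, start="<s>", floor=1e-6):
    """Most probable tag sequence under a bigram HMM.

    Known words may only take tags listed in the lexicon. Words missing
    from the lexicon may take any tag (assumed fallback).
    """
    if not words:
        return []

    def logp(table, key):
        # Unseen events get a small floor probability instead of zero.
        return math.log(table.get(key, floor))

    # columns[i][t] = (best log-score of a path ending in t at i, previous tag)
    columns = []
    for i, word in enumerate(words):
        column = {}
        for tag in lexicon.get(word, tags):
            e = logp(emit, (tag, word))
            if i == 0:
                column[tag] = (logp(trans, (start, tag)) + e, None)
            else:
                prev, score = max(
                    ((p, s + logp(trans, (p, tag)))
                     for p, (s, _) in columns[-1].items()),
                    key=lambda pair: pair[1],
                )
                column[tag] = (score + e, prev)
        columns.append(column)

    # Follow back-pointers from the best final tag.
    tag = max(columns[-1], key=lambda t: columns[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = columns[i][tag][1]
        path.append(tag)
    return path[::-1]
```

With these toy tables, `viterbi(["the", "book", "zzz"], TAGS, TRANS, EMIT, LEXICON)` returns `["DET", "NOUN", "VERB"]`: the unknown token `zzz` has no lexicon entry, so every tag is considered and the transition probabilities decide. Restricting known words to analyzer-proposed tags keeps the search small, but it also means a missing analysis can never be recovered, which is the coverage problem the abstract describes.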

Cite

Mansour, S., Sima’An, K., & Winter, Y. (2007). Smoothing a lexicon-based POS tagger for Arabic and Hebrew. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (pp. 97–103). Association for Computational Linguistics (ACL). https://doi.org/10.3115/1654576.1654593
