Abstract
Tokenization is a necessary and non-trivial step in natural language processing. In the case of Arabic, where a single word can comprise up to four independent tokens, morphological knowledge needs to be incorporated into the tokenizer. In this paper we describe a rule-based tokenizer that handles tokenization as a full-fledged process with a preprocessing stage (whitespace normalizer) and a post-processing stage (token filter). We also show how it handles multiword expressions and how ambiguity is resolved.
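To make the three-stage pipeline the abstract describes concrete, here is a minimal Python sketch: a whitespace normalizer, a clitic-splitting step, and a token filter. The function names, the tiny clitic lists, and the length-based splitting heuristic are illustrative assumptions only; the paper's tokenizer relies on full morphological knowledge to decide when a splittable prefix or suffix is actually a clitic, which this sketch does not model.

```python
import re

# Hypothetical, highly simplified clitic inventories (assumption, not the
# paper's lexicon): one layer of proclitics and one layer of enclitics.
PROCLITICS = ("\u0648", "\u0628", "\u0644")        # wa- "and", bi- "by/with", li- "to/for"
ENCLITICS = ("\u0647\u0645", "\u0647\u0627", "\u0647")  # -hum, -ha, -hu pronoun suffixes

def normalize_whitespace(text: str) -> str:
    """Preprocessing stage: collapse runs of spaces, tabs, and newlines."""
    return re.sub(r"\s+", " ", text).strip()

def split_word(word: str) -> list:
    """Naively peel at most one proclitic and one enclitic off a word.
    A real system consults morphological analysis before splitting,
    since the same letters may belong to the stem."""
    tokens = []
    for p in PROCLITICS:
        # Length check is a crude stand-in for morphological validation.
        if word.startswith(p) and len(word) > len(p) + 2:
            tokens.append(p)
            word = word[len(p):]
            break
    suffix = None
    for e in ENCLITICS:
        if word.endswith(e) and len(word) > len(e) + 2:
            suffix = e
            word = word[:-len(e)]
            break
    tokens.append(word)
    if suffix:
        tokens.append(suffix)
    return tokens

def token_filter(tokens: list) -> list:
    """Post-processing stage: discard empty tokens left over from splitting."""
    return [t for t in tokens if t]

def tokenize(text: str) -> list:
    """Full pipeline: normalize, split on spaces, split clitics, filter."""
    out = []
    for word in normalize_whitespace(text).split(" "):
        out.extend(split_word(word))
    return token_filter(out)
```

Under this sketch, a single orthographic word such as wa+bi+stem+hum would surface as separate tokens, mirroring the abstract's observation that one Arabic word can comprise up to four independent tokens.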
Citation
Attia, M. A. (2007). Arabic tokenization system. In Proceedings of the 2007 Workshop on Computational Approaches to Semitic Languages: Common Issues and Resources (pp. 65–72). Association for Computational Linguistics. https://doi.org/10.3115/1654576.1654588