Arabic tokenization system

Abstract

Tokenization is a necessary and non-trivial step in natural language processing. In the case of Arabic, where a single word can comprise up to four independent tokens, morphological knowledge needs to be incorporated into the tokenizer. In this paper we describe a rule-based tokenizer that handles tokenization as a full-rounded process, with a pre-processing stage (white-space normalizer) and a post-processing stage (token filter). We also show how it handles multiword expressions and how ambiguity is resolved.
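The pipeline the abstract describes (white-space normalization, morphologically informed clitic splitting, token filtering) can be sketched as follows. This is an illustrative toy, not the paper's implementation: the clitic list and the single-entry lexicon are hypothetical stand-ins for the rich rule set and morphological knowledge the paper actually uses.

```python
import re

# Hypothetical proclitic inventory (conjunctions, prepositions, definite
# article); the paper's actual rules are far more detailed.
PROCLITICS = ["و", "ف", "ب", "ل", "ك", "ال"]

def normalize_whitespace(text):
    """Pre-processing stage: collapse runs of white space."""
    return re.sub(r"\s+", " ", text).strip()

def split_clitics(word, lexicon):
    """Recursively strip proclitics, but only when the remainder is a
    known stem -- a crude stand-in for using morphological knowledge
    to resolve segmentation ambiguity. Returns None if no analysis."""
    if word in lexicon:
        return [word]
    for p in PROCLITICS:
        if word.startswith(p):
            rest = split_clitics(word[len(p):], lexicon)
            if rest is not None:
                return [p] + rest
    return None

def tokenize(text, lexicon):
    """Full pipeline: normalize, split clitics, filter empty tokens
    (post-processing). Unanalyzable words pass through whole."""
    tokens = []
    for word in normalize_whitespace(text).split(" "):
        analysis = split_clitics(word, lexicon)
        tokens.extend(analysis if analysis is not None else [word])
    return [t for t in tokens if t]

lexicon = {"كتاب"}  # toy lexicon: "book"
# "wa-bi-al-kitab" ("and with the book") yields four tokens,
# matching the abstract's point about words comprising up to four tokens
print(tokenize("  وبالكتاب ", lexicon))
```

Note how the lexicon check keeps a bare stem such as كتاب intact even though it happens to begin with the clitic letter ك; without that morphological constraint, a purely string-based splitter would over-segment.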

Citation (APA)

Attia, M. A. (2007). Arabic tokenization system. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (pp. 65–72). Association for Computational Linguistics (ACL). https://doi.org/10.3115/1654576.1654588
