Abstract
Tokenization is a necessary and non-trivial step in natural language processing. In the case of Arabic, where a single word can comprise up to four independent tokens, morphological knowledge needs to be incorporated into the tokenizer. In this paper we describe a rule-based tokenizer that handles tokenization as a full-fledged process with a preprocessing stage (whitespace normalizer) and a post-processing stage (token filter). We also show how it handles multiword expressions and how ambiguity is resolved.
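To make the three-stage pipeline the abstract describes concrete, here is a minimal Python sketch: a whitespace normalizer, a clitic-splitting step, and a token filter. The function names, the tiny clitic lists, and the length-based splitting heuristic are illustrative assumptions only; the paper's tokenizer relies on full morphological knowledge to decide when a splittable prefix or suffix is actually a clitic, which this sketch does not model.

```python
import re

# Hypothetical, highly simplified clitic inventories (assumption, not the
# paper's lexicon): one layer of proclitics and one layer of enclitics.
PROCLITICS = ("\u0648", "\u0628", "\u0644")        # wa- "and", bi- "by/with", li- "to/for"
ENCLITICS = ("\u0647\u0645", "\u0647\u0627", "\u0647")  # -hum, -ha, -hu pronoun suffixes

def normalize_whitespace(text: str) -> str:
    """Preprocessing stage: collapse runs of spaces, tabs, and newlines."""
    return re.sub(r"\s+", " ", text).strip()

def split_word(word: str) -> list:
    """Naively peel at most one proclitic and one enclitic off a word.
    A real system consults morphological analysis before splitting,
    since the same letters may belong to the stem."""
    tokens = []
    for p in PROCLITICS:
        # Length check is a crude stand-in for morphological validation.
        if word.startswith(p) and len(word) > len(p) + 2:
            tokens.append(p)
            word = word[len(p):]
            break
    suffix = None
    for e in ENCLITICS:
        if word.endswith(e) and len(word) > len(e) + 2:
            suffix = e
            word = word[:-len(e)]
            break
    tokens.append(word)
    if suffix:
        tokens.append(suffix)
    return tokens

def token_filter(tokens: list) -> list:
    """Post-processing stage: discard empty tokens left over from splitting."""
    return [t for t in tokens if t]

def tokenize(text: str) -> list:
    """Full pipeline: normalize, split on spaces, split clitics, filter."""
    out = []
    for word in normalize_whitespace(text).split(" "):
        out.extend(split_word(word))
    return token_filter(out)
```

Under this sketch, a single orthographic word such as wa+bi+stem+hum would surface as separate tokens, mirroring the abstract's observation that one Arabic word can comprise up to four independent tokens.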
Citation
Attia, M. A. (2007). Arabic tokenization system. In Proceedings of the 2007 Workshop on Computational Approaches to Semitic Languages: Common Issues and Resources (pp. 65–72). Association for Computational Linguistics. https://doi.org/10.3115/1654576.1654588