Most current Arabic morphological analyzers use complex rules to handle the idiosyncrasies of certain Arabic word classes and special cases. The question that arises is whether it is feasible to design a pattern-oriented morphological analyzer that streamlines the process and avoids the use of complex rules. To answer this question, a detailed study was conducted using a small representative Arabic corpus. The study revealed that most of the words in the language can be generated using a limited number of patterns, morphemes and particles. Inflected and derived words can be generated through combinations of roots and patterns. The total number of roots is around 10,000, while the total number of morphological patterns is below 1,000. The total number of particles is around 325. Around 70% of the words in the experimental corpus are templatic (based on morphological patterns). Although the number of identified patterns reached 943, only a small subset of these is active. For example, the top 12 patterns in the identified list accounted for more than 50% of the generated templatic words. Similarly, although the total number of roots is around 10,000, the number of active roots is 3,461. Particles and similar morphemes account for around 30% of the text in the experimental corpus. These features greatly simplify the development of NLP applications such as spelling correctors, normalizers, lemmatizers and higher-level applications.
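The idea that words arise from combinations of roots and patterns can be illustrated with a minimal sketch. It is not the analyzer described in the paper: the Latin transliteration, the digit-slot pattern notation (`1`, `2`, `3` marking root consonant positions), the function names, and the tiny root and pattern lists are all assumptions made here for illustration.

```python
# Illustrative sketch only: notation, names, and data are assumptions,
# not the pattern inventory or method used in the study.

def apply_pattern(root: str, pattern: str) -> str:
    """Generate a word by filling digit slots in `pattern` with root consonants."""
    return "".join(
        root[int(ch) - 1] if ch.isdigit() else ch
        for ch in pattern
    )

def match_pattern(word: str, pattern: str):
    """Return the candidate root if `word` fits `pattern`, otherwise None."""
    if len(word) != len(pattern):
        return None
    slots = {}
    for w, p in zip(word, pattern):
        if p.isdigit():
            slots[int(p)] = w      # a slot accepts any character
        elif w != p:
            return None            # fixed pattern letters must match exactly
    return "".join(slots[i] for i in sorted(slots))

def generate(roots, patterns):
    """Cross every root with every pattern."""
    return {(r, p): apply_pattern(r, p) for r in roots for p in patterns}

# Hypothetical toy data: two triliteral roots and three common patterns.
ROOTS = ["ktb", "drs"]
PATTERNS = ["1a2a3a", "1aa2i3", "ma12uu3"]

if __name__ == "__main__":
    for (root, pattern), word in generate(ROOTS, PATTERNS).items():
        print(root, pattern, word)
    # Analysis direction: recover a candidate root from a surface form.
    print(match_pattern("maktuub", "ma12uu3"))  # ktb
```

Two roots and three patterns already yield six forms (for example kataba, kaatib, maktuub), which hints at why a few hundred patterns can cover much of a corpus. Matching by pattern alone can overgenerate, so a practical analyzer would presumably check each candidate against a list of known roots.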
CITATION STYLE
El-Affendi, M. A. (2018). The Generative Power of Arabic Morphology and Implications: A case for pattern orientation in Arabic corpus annotation and a proposed pattern ontology. In Advances in Intelligent Systems and Computing (Vol. 753, pp. 36–45). Springer Verlag. https://doi.org/10.1007/978-3-319-78753-4_4