Design and Development of Media-Corpus of the Kazakh Language

Abstract

The aim of this work was the design and development of a media-corpus of the Kazakh language. The media-corpus is hosted by the al-Farabi Kazakh National University and serves linguists as an empirical basis for research on contemporary written Kazakh. The information system for the media-corpus was built on a component software architecture. To automate the collection, storage and analysis of media texts in the Kazakh language, four components of the information system were designed and developed. The text files are saved in XML format. At the analysis stage, the system performs text normalization, stop-word removal, metadata annotation and morphological analysis. The morphological analyzer takes plain text as input and outputs the text in XML format, which is convenient for further processing because it is easily converted to JSON. The XML format is defined using XML Schema Definition (XSD). Defining the format with XSD makes it possible to convert the data into other formats, which simplifies data exchange between systems. For cases of incomplete morphological markup or homonymy, a special interface for manual markup was developed.
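The XML-to-JSON conversion mentioned in the abstract might look roughly like the following Python sketch. It is an illustration only: the abstract does not describe the corpus schema, so the element and attribute names (`text`, `sentence`, `word`, `form`, `lemma`, `pos`), the sample content, and the helper functions `element_to_dict` and `xml_to_json` are all assumptions.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical analyzer output. The element and attribute names are
# assumptions made for illustration, not the corpus's actual XSD.
SAMPLE_XML = """\
<text source="example-news-site" date="2017-01-01">
  <sentence id="1">
    <word form="кітаптар" lemma="кітап" pos="NOUN"/>
    <word form="қалада" lemma="қала" pos="NOUN"/>
  </sentence>
</text>
"""


def element_to_dict(element):
    """Recursively turn an XML element into a JSON-serializable dict."""
    node = {"tag": element.tag}
    if element.attrib:
        node["attributes"] = dict(element.attrib)
    text = (element.text or "").strip()
    if text:
        node["text"] = text
    children = [element_to_dict(child) for child in element]
    if children:
        node["children"] = children
    return node


def xml_to_json(xml_string):
    """Parse an XML string and return an equivalent JSON string."""
    root = ET.fromstring(xml_string)
    # ensure_ascii=False keeps Cyrillic characters readable in the output.
    return json.dumps(element_to_dict(root), ensure_ascii=False, indent=2)


if __name__ == "__main__":
    print(xml_to_json(SAMPLE_XML))
```

A generic tag/attributes/children mapping like this one needs no knowledge of the schema; a converter driven by the real XSD could instead produce a flatter, type-aware JSON layout.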

Cite

Mansurova, M., Madiyeva, G., Aubakirov, S., Yermekov, Z., & Alimzhanov, Y. (2017). Design and Development of Media-Corpus of the Kazakh Language. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 10449 LNAI, pp. 509–518). Springer Verlag. https://doi.org/10.1007/978-3-319-67077-5_49
