Design and Development of Media-Corpus of the Kazakh Language

Madina Mansurova; Gulmira Madiyeva; Sanzhar Aubakirov; Zhantemir Yermekov; Yermek Alimzhanov

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2017) 10449 LNAI 509-518

DOI: 10.1007/978-3-319-67077-5_49

Abstract

The aim of this work was the design and development of a media-corpus of the Kazakh language. The media-corpus is hosted by the al-Farabi Kazakh National University and serves linguists as an empirical basis for research on contemporary written Kazakh. The information system for the media-corpus was built on a component-based software architecture. To automate the collection, storage and analysis of media texts in the Kazakh language, four components of the information system were designed and developed. The text files are saved in XML format. At the analysis stage, the following tasks are performed: text normalization, stop-word removal, addition of metadata and morphological analysis. The morphological analyzer takes plain text as input and outputs the text in XML format, which is convenient for further processing because it is easily converted to JSON. The XML format is defined using XML Schema Definition (XSD). XSD makes it possible to convert the data into any other format, which simplifies data exchange between systems. For cases of incomplete morphological markup and homonymy, a special interface for manual markup was developed.
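The analysis steps named in the abstract (normalization, stop-word removal, adding metadata, XML output that converts easily to JSON) might be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the stop-word list, the element names (`document`, `metadata`, `tokens`, `token`) and the function names are hypothetical, the actual XSD is not given in the abstract, and morphological analysis is left out entirely.

```python
import json
import re
import xml.etree.ElementTree as ET

# Hypothetical stop-word list; the list used by the media-corpus is not published here.
STOP_WORDS = {"және", "мен", "бұл"}


def normalize(text):
    """Collapse runs of whitespace and lowercase the text."""
    return re.sub(r"\s+", " ", text).strip().lower()


def build_document(text, metadata):
    """Normalize, drop stop words, and wrap the result in an XML element.

    The element layout is an assumption made for this sketch.
    """
    tokens = [t for t in re.findall(r"\w+", normalize(text)) if t not in STOP_WORDS]
    doc = ET.Element("document")
    meta = ET.SubElement(doc, "metadata")
    for key, value in metadata.items():
        ET.SubElement(meta, key).text = value
    body = ET.SubElement(doc, "tokens")
    for token in tokens:
        ET.SubElement(body, "token").text = token
    return doc


def to_json_ready(doc):
    """Convert the XML element built above into a plain dict for JSON output."""
    return {
        "metadata": {child.tag: child.text for child in doc.find("metadata")},
        "tokens": [tok.text for tok in doc.find("tokens")],
    }


if __name__ == "__main__":
    doc = build_document(
        "Бұл  Қазақстан және әлем жаңалықтары",
        {"source": "example", "date": "2017-01-01"},
    )
    print(ET.tostring(doc, encoding="unicode"))
    print(json.dumps(to_json_ready(doc), ensure_ascii=False))
```

Because the XML layout is fixed, the conversion to JSON reduces to a direct mapping of elements to dictionary keys, which is the kind of convenience the abstract attributes to a schema-defined format.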

Citation (APA)

Mansurova, M., Madiyeva, G., Aubakirov, S., Yermekov, Z., & Alimzhanov, Y. (2017). Design and Development of Media-Corpus of the Kazakh Language. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 10449 LNAI, pp. 509–518). Springer Verlag. https://doi.org/10.1007/978-3-319-67077-5_49
