The aim of this work was design and development of a media-corpus of the Kazakh language. The media-corpus is hosted by the al-Farabi Kazakh National University and serves linguists as an empirical basis for research on contemporary written Kazakh. The information system for media-corpus was built on the basis of component software architecture. To make the processes of collection, storage and analysis of media-texts in the Kazakh language automatic, four components of the information system were designed and developed. The text files are saved in XML format. At the stage of analysis such tasks as text normalization, removing stop words, adding metadata and morphological analysis are performed. The morphological analyzer receives an input of a plain text, and at the output gives the text in XML format, which is further convenient to work with as it is easily converted to JSON format. The XML format is defined using XML Schema Definition (XSD). XSD allows to convert data into any other format, which simplifies the data exchange between the systems. For the case of incomplete morphological markup and the presence of homonymy, a special interface to perform manual markup is developed.
CITATION STYLE
Mansurova, M., Madiyeva, G., Aubakirov, S., Yermekov, Z., & Alimzhanov, Y. (2017). Design and Development of Media-Corpus of the Kazakh Language. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 10449 LNAI, pp. 509–518). Springer Verlag. https://doi.org/10.1007/978-3-319-67077-5_49
Mendeley helps you to discover research relevant for your work.