Abstract
Language identification in text mining is the process of detecting the natural language in which a document, or part of one, is written. It aims to reproduce with computer algorithms a human's ability to recognize languages. This study proposes a new language identification approach that uses the angle information between the UTF-8 values of the characters in a text. The proposed angle pattern method is a statistical approach used to extract features from text. It has two distance parameters, R and L, which specify how far to the right and to the left of the reference point the neighborhood extends. The approach was tested on four datasets: two created by the authors and two publicly available on the Internet. The features obtained with the angle pattern method were classified with several machine learning methods: Random Forest, Support Vector Machine, Linear Discriminant Analysis, Naive Bayes, and K-nearest neighbor. The language identification performance on the four datasets was 96.81%, 99.39%, 93.31%, and 98.60%, respectively. These results indicate that the proposed angle pattern method provides important distinguishing information for language identification.
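The abstract does not give the exact angle formula, so the following Python sketch is only one plausible reading of the idea, not the authors' implementation. It assumes that each character is treated as a point (position, code value), that the angle is measured at the reference character between the neighbors L positions to the left and R positions to the right, and that the angles are collected into a normalized histogram used as the feature vector. The function name `angle_pattern_histogram`, the `bins` parameter, and the use of Unicode code points (`ord`) in place of raw UTF-8 byte values are all assumptions made for illustration.

```python
import math


def angle_pattern_histogram(text, left=1, right=1, bins=180):
    """Hypothetical angle-pattern feature extractor (illustrative only).

    left / right stand in for the L and R distance parameters named in
    the abstract; bins is an assumed histogram resolution over 0..180 degrees.
    """
    # Assumption: Unicode code points stand in for the "UTF-8 values".
    codes = [ord(ch) for ch in text]
    hist = [0] * bins

    # Only positions that have both neighbors available are used.
    for i in range(left, len(codes) - right):
        # Vectors from the reference point to its left and right neighbors.
        ax, ay = -left, codes[i - left] - codes[i]
        bx, by = right, codes[i + right] - codes[i]

        dot = ax * bx + ay * by
        cross = ax * by - ay * bx

        # Unsigned angle between the two vectors, in degrees (0..180).
        angle = math.degrees(math.atan2(abs(cross), dot))

        # Map the angle to a bin; exactly 180 degrees goes in the last bin.
        idx = min(int(angle * bins / 180.0), bins - 1)
        hist[idx] += 1

    total = sum(hist)
    # Normalize so texts of different lengths are comparable.
    return [h / total for h in hist] if total else hist
```

Under these assumptions, a feature vector such as `angle_pattern_histogram(document, left=2, right=1)` could be passed to any of the classifiers listed in the abstract. Normalizing the histogram is a design choice made here so that document length does not dominate the features.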
Citation
Noyan, T., Kuncan, F., Tekin, R., & Kaya, Y. (2022). A new content-free approach to identification of document language: Angle patterns. Journal of the Faculty of Engineering and Architecture of Gazi University, 37(3), 1277–1292. https://doi.org/10.17341/gazimmfd.844700