A new content-free approach to identification of document language: Angle patterns

Tuba Noyan; Fatma Kuncan; Ramazan Tekin; Yilmaz Kaya

Journal Article, Open Access

Journal of the Faculty of Engineering and Architecture of Gazi University (2022) 37(3) 1277-1292

DOI: 10.17341/gazimmfd.844700

Abstract

Language identification in text mining is the process of detecting the natural language in which a document, or part of one, is written. It aims to reproduce with computer algorithms the human ability to recognize languages. This study proposes a new language identification approach that uses the angle information between the UTF-8 values of the characters in a text. The proposed angle pattern method, a statistical approach, is used to extract features from texts. It has two distance parameters, R and L, which specify how far to the right and to the left of the reference point the neighborhood extends. The approach was tested on four datasets: two created by the authors and two publicly available on the Internet. The features obtained with the angle pattern method were classified with different machine learning methods such as Random Forest, Support Vector Machine, Linear Discriminant Analysis, Naive Bayes and K-Nearest Neighbor. Language identification performance on the four datasets was 96.81%, 99.39%, 93.31% and 98.60%, respectively. These results indicate that the proposed angle pattern method provides important distinguishing information for language identification.
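The abstract does not spell out how the angles are computed, so the following is only a rough sketch of one possible reading, not the authors' implementation. It assumes that each character is treated as a point (position, value), that the angle is measured at the reference character between the lines to its neighbors L positions to the left and R positions to the right, and that the normalized histogram of those angles serves as the feature vector. The function name `angle_pattern_histogram`, the `bins` parameter, and the use of Unicode code points as a stand-in for "UTF-8 values" are all assumptions made for illustration.

```python
import math


def angle_pattern_histogram(text, L=1, R=1, bins=18):
    """Hypothetical sketch: normalized histogram of angles measured at each
    reference character. L and R are the left/right neighbor distances."""
    if L < 1 or R < 1 or bins < 1:
        raise ValueError("L, R and bins must be positive integers")

    # Assumption: the Unicode code point stands in for the "UTF-8 value".
    values = [ord(ch) for ch in text]
    hist = [0] * bins

    # Only positions that have both a left and a right neighbor are used.
    for i in range(L, len(values) - R):
        # Vectors from the reference point (i, values[i]) to its neighbors.
        lx, ly = -L, values[i - L] - values[i]
        rx, ry = R, values[i + R] - values[i]

        cos_a = (lx * rx + ly * ry) / (math.hypot(lx, ly) * math.hypot(rx, ry))
        cos_a = max(-1.0, min(1.0, cos_a))  # guard against rounding error
        angle = math.degrees(math.acos(cos_a))  # angle in [0, 180] degrees

        # Map the angle to a bin; 180 degrees falls into the last bin.
        idx = min(int(angle / 180.0 * bins), bins - 1)
        hist[idx] += 1

    total = sum(hist)
    return [h / total for h in hist] if total else [0.0] * bins


features = angle_pattern_histogram("Language identification", L=1, R=1)
```

Under these assumptions, every text yields a fixed-length vector regardless of its length, which could then be passed to any of the classifiers named in the abstract. Different (L, R) pairs would give different histograms, and how the paper combines or selects them is not stated here.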

Citation (APA)

Noyan, T., Kuncan, F., Tekin, R., & Kaya, Y. (2022). A new content-free approach to identification of document language: Angle patterns. Journal of the Faculty of Engineering and Architecture of Gazi University, 37(3), 1277–1292. https://doi.org/10.17341/gazimmfd.844700
