Comparison of character n-grams and lexical features on author, gender, and language variety identification on the same Spanish news corpus

Miguel A. Sanchez-Perez; Ilia Markov; Helena Gómez-Adorno; Grigori Sidorov

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2017), Vol. 10456 LNCS, pp. 145–151

DOI: 10.1007/978-3-319-65813-1_15

Abstract

We compare the performance of character n-gram features (n = 3–8) and lexical features (unigrams and bigrams of words), as well as their combinations, on the tasks of authorship attribution, author profiling, and discriminating between similar languages. We developed a single multi-labeled corpus for the three aforementioned tasks, composed of news articles in different varieties of Spanish. We used the same machine-learning algorithm, Liblinear SVM, in order to find out which features are more predictive and for which task. Our experiments show that higher-order character n-grams (n = 5–8) outperform lower-order character n-grams, and the combination of all word and character n-grams of different orders (n = 1–2 for words and n = 3–8 for characters) usually outperforms smaller subsets of such features. We also evaluate the performance of character n-grams, lexical features, and their combinations when reducing all named entities to a single symbol “NE” to avoid topic-dependent features.
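To make the feature sets concrete, the following is a minimal Python sketch of how character n-grams (n = 3–8), word unigrams and bigrams, and the "NE" masking step might be computed. It is not the authors' implementation: the function names, the regular-expression tokenizer, the raw-count weighting, and the assumption that named entities arrive as a ready-made list of strings are all hypothetical choices made for illustration, since the abstract does not describe these details.

```python
import re
from collections import Counter


def char_ngrams(text, n_min=3, n_max=8):
    """Count every character n-gram with n_min <= n <= n_max."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            # The "c:" prefix keeps character features apart from word features.
            counts["c:" + text[i:i + n]] += 1
    return counts


def word_ngrams(text, n_min=1, n_max=2):
    """Count word n-grams (unigrams and bigrams by default)."""
    # Assumption: a simple \w+ tokenizer; the paper's tokenizer is not specified.
    tokens = re.findall(r"\w+", text)
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts["w:" + " ".join(tokens[i:i + n])] += 1
    return counts


def mask_entities(text, entities):
    """Replace each given entity string with the single symbol 'NE'.

    Assumption: the caller supplies the entity list (e.g. from an NER tool).
    """
    # Longest entities first, so multi-word names are replaced as a whole.
    for ent in sorted(entities, key=len, reverse=True):
        text = re.sub(r"\b" + re.escape(ent) + r"\b", "NE", text)
    return text


def combined_features(text):
    """Merge character and word n-gram counts into one feature dictionary."""
    feats = char_ngrams(text)
    feats.update(word_ngrams(text))
    return feats


masked = mask_entities("Madrid y Nueva York", ["Nueva York", "Madrid"])
print(masked)  # NE y NE
print(len(combined_features(masked)))
```

Feature dictionaries of this kind could then be vectorized and passed to a linear SVM such as the Liblinear classifier named in the abstract; that step is omitted here to keep the sketch dependency-free.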

Citation (APA)

Sanchez-Perez, M. A., Markov, I., Gómez-Adorno, H., & Sidorov, G. (2017). Comparison of character n-grams and lexical features on author, gender, and language variety identification on the same Spanish news corpus. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 10456 LNCS, pp. 145–151). Springer Verlag. https://doi.org/10.1007/978-3-319-65813-1_15
