Complex Word Identification using character n-grams

Abstract

This paper investigates the use of character n-gram frequencies for identifying complex words in English, German and Spanish texts. The approach is based on the assumption that complex words are likely to contain different character sequences than simple words. A multinomial Naive Bayes classifier was used with n-grams of different lengths as features, and the best results were obtained with the combination of 2-grams and 4-grams. This variant was submitted to the Complex Word Identification Shared Task 2018 for all texts and achieved F-scores between 70% and 83%. The system ranked in the middle range for all English tracks, third of fourteen submissions for German, and tenth of seventeen submissions for Spanish. The method is less suitable for the cross-language task, achieving only 59% on the French text.
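The idea described in the abstract can be illustrated with a minimal sketch: a multinomial Naive Bayes classifier over character 2-gram and 4-gram counts, written with the Python standard library only. This is not the author's implementation. The class name `NgramNaiveBayes`, the add-one smoothing, the lowercasing, the handling of unseen n-grams, and the toy word lists are all assumptions made for illustration; the abstract does not specify any of them, and the toy words are not taken from the shared task data.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(word, sizes=(2, 4)):
    """Return all character n-grams of the given sizes from a lowercased word."""
    w = word.lower()
    return [w[i:i + n] for n in sizes for i in range(len(w) - n + 1)]


class NgramNaiveBayes:
    """Hypothetical multinomial Naive Bayes over character n-gram counts."""

    def __init__(self, sizes=(2, 4), alpha=1.0):
        self.sizes = sizes
        self.alpha = alpha  # additive smoothing constant (assumed, not from the paper)

    def fit(self, words, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        for word, label in zip(words, labels):
            self.feature_counts[label].update(char_ngrams(word, self.sizes))
        self.vocab = {g for counts in self.feature_counts.values() for g in counts}
        return self

    def log_scores(self, word):
        """Return the unnormalized log posterior of each class for one word."""
        total = sum(self.class_counts.values())
        grams = [g for g in char_ngrams(word, self.sizes) if g in self.vocab]
        scores = {}
        for label, n_label in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_label / total)  # class prior
            for g in grams:  # n-grams never seen in training are skipped
                score += math.log((counts[g] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, word):
        scores = self.log_scores(word)
        return max(scores, key=scores.get)


# Invented toy data, for illustration only.
TOY_SIMPLE = ["cat", "dog", "house", "run", "book", "tree"]
TOY_COMPLEX = ["photosynthesis", "bureaucracy", "idiosyncratic",
               "ephemeral", "synchronize", "hypothesis"]

model = NgramNaiveBayes().fit(
    TOY_SIMPLE + TOY_COMPLEX,
    ["simple"] * len(TOY_SIMPLE) + ["complex"] * len(TOY_COMPLEX),
)

for w in ["dog", "synthesis"]:
    print(w, model.predict(w))
```

On this toy data, a word is assigned to whichever class has seen more of its character sequences during training, which mirrors the assumption stated in the abstract. A real system would be trained on the annotated shared task data and evaluated with the F-score.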

Citation

Popović, M. (2018). Complex Word Identification using character n-grams. In Proceedings of the 13th Workshop on Innovative Use of NLP for Building Educational Applications, BEA 2018 at the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018 (pp. 341–348). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/w18-0541
