Experiments in Sentence Language Identification with Groups of Similar Languages

Ben King; Dragomir Radev; Steven Abney

Conference Proceedings

1st Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects, VarDial 2014 at the 25th International Conference on Computational Linguistics: System Demonstrations, COLING 2014 - Proceedings (2014) 146-154

DOI: 10.3115/v1/w14-5317

Abstract

Language identification is a simple problem that becomes much more difficult when its usual assumptions are broken. In this paper we consider the task of classifying short segments of text in closely related languages for the Discriminating Similar Languages shared task, which is divided into six subtasks: (A) Bosnian, Croatian, and Serbian; (B) Indonesian and Malay; (C) Czech and Slovak; (D) Brazilian and European Portuguese; (E) Argentinian and Peninsular Spanish; and (F) American and British English. We consider a number of methods to boost classification performance, such as feature selection and data filtering, but we ultimately find that a simple naïve Bayes classifier using character and word n-gram features is a strong baseline that is difficult to improve on, achieving an average accuracy of 0.8746 across the six tasks.
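To make the baseline named in the abstract concrete, the following is a minimal sketch of a multinomial naïve Bayes classifier over character and word n-gram features. It is not the authors' implementation: the n-gram orders, the add-one smoothing, the helper names (`extract_features`, `NaiveBayes`), and the four toy training sentences are all assumptions chosen for illustration, and the toy data is far too small to say anything about real accuracy.

```python
import math
from collections import Counter, defaultdict


def extract_features(text, char_orders=(1, 2, 3), word_orders=(1, 2)):
    """Return character and word n-gram features for one sentence.

    The n-gram orders are illustrative assumptions, not values from the paper.
    """
    text = text.lower()
    padded = f" {text} "  # pad so word-boundary characters become features
    feats = []
    for n in char_orders:
        feats += ["c:" + padded[i:i + n] for i in range(len(padded) - n + 1)]
    words = text.split()
    for n in word_orders:
        feats += ["w:" + " ".join(words[i:i + n])
                  for i in range(len(words) - n + 1)]
    return feats


class NaiveBayes:
    """Multinomial naive Bayes with additive (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                      # smoothing constant (assumed)
        self.label_counts = Counter()           # sentences seen per label
        self.feat_counts = defaultdict(Counter) # feature counts per label
        self.vocab = set()                      # all features seen in training

    def fit(self, sentences, labels):
        for sentence, label in zip(sentences, labels):
            feats = extract_features(sentence)
            self.label_counts[label] += 1
            self.feat_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, sentence):
        total = sum(self.label_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n_sentences in self.label_counts.items():
            counts = self.feat_counts[label]
            denom = sum(counts.values()) + self.alpha * vocab_size
            score = math.log(n_sentences / total)  # log prior
            for feat in extract_features(sentence):
                if feat in self.vocab:  # ignore features never seen in training
                    score += math.log((counts[feat] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy data for subtask (F)-style input; invented for this sketch only.
train_sentences = [
    "The color of the theater program was gray",
    "The colour of the theatre programme was grey",
    "I realized the neighbor organized the center",
    "I realised the neighbour organised the centre",
]
train_labels = ["en-US", "en-GB", "en-US", "en-GB"]

model = NaiveBayes().fit(train_sentences, train_labels)
print(model.predict("the color of the center"))
print(model.predict("the colour of the centre"))
```

Scoring in log space avoids underflow when many n-gram probabilities are multiplied, and prefixing features with `c:` and `w:` keeps character and word n-grams from colliding in the shared vocabulary.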

Cite

King, B., Radev, D., & Abney, S. (2014). Experiments in Sentence Language Identification with Groups of Similar Languages. In 1st Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects, VarDial 2014 at the 25th International Conference on Computational Linguistics: System Demonstrations, COLING 2014 - Proceedings (pp. 146–154). Association for Computational Linguistics (ACL). https://doi.org/10.3115/v1/w14-5317
