Abstract
This paper outlines a supervised approach to language identification in code-switched data. The task is framed as sequence labeling: the label of each token is predicted by a classifier based on Conditional Random Fields, trained on a range of features extracted both from the training data and from BabelNet and Babelfy. The method was tested on the development dataset provided by the organizers of the shared task on language identification in code-switched data, obtaining tweet-level monolingual, code-switched and weighted F1-scores of 94%, 85% and 91%, respectively, with a token-level accuracy of 95.8%. When evaluated on the unseen test data, the system achieved tweet-level monolingual, code-switched and weighted F1-scores of 90%, 85% and 87.4%, respectively, and a token-level accuracy of 95.7%.
Cite
Sikdar, U. K., & Gambäck, B. (2016). Language Identification in Code-Switched Text Using Conditional Random Fields and Babelnet. In EMNLP 2016 - 2nd Workshop on Computational Approaches to Code Switching, CS 2016 - Proceedings of the Workshop (pp. 127–131). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/w16-5817