EmpiriST 2015: A Shared Task on the Automatic Linguistic Annotation of Computer-Mediated Communication and Web Corpora

18Citations
Citations of this article
73Readers
Mendeley users who have this article in their library.

Abstract

This paper describes the goals, design and results of a shared task on the automatic linguistic annotation of German language data from genres of computer-mediated communication (CMC), social media interactions and Web corpora. The two subtasks of tokenization and part-of-speech tagging were performed on two data sets: (i) a genuine CMC data set with samples from several CMC genres, and (ii) a Web corpora data set of CC-licensed Web pages which represents the type of data found in large corpora crawled from the Web. The teams participating in the shared task achieved a substantial improvement over current off-the-shelf tools for German. The best tokenizer reached an F1score of 99.57% (vs. 98.95% off-the-shelf baseline), while the best tagger reached an accuracy of 90.44% (vs. 84.86% baseline). The gold standard (more than 20,000 tokens of training and test data) is freely available online together with detailed annotation guidelines.

Cite

CITATION STYLE

APA

Beißwenger, M., Bartsch, S., Evert, S., & Würzner, K. M. (2016). EmpiriST 2015: A Shared Task on the Automatic Linguistic Annotation of Computer-Mediated Communication and Web Corpora. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (pp. 44–56). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/w16-2606

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free