EmpiriST 2015: A Shared Task on the Automatic Linguistic Annotation of Computer-Mediated Communication and Web Corpora

Michael Beißwenger; Sabine Bartsch; Stefan Evert; Kay Michael Würzner

Conference Proceedings, Open Access

Proceedings of the Annual Meeting of the Association for Computational Linguistics (2016) 44-56

DOI: 10.18653/v1/w16-2606

Abstract

This paper describes the goals, design and results of a shared task on the automatic linguistic annotation of German language data from genres of computer-mediated communication (CMC), social media interactions and Web corpora. The two subtasks of tokenization and part-of-speech tagging were performed on two data sets: (i) a genuine CMC data set with samples from several CMC genres, and (ii) a Web corpora data set of CC-licensed Web pages which represents the type of data found in large corpora crawled from the Web. The teams participating in the shared task achieved a substantial improvement over current off-the-shelf tools for German. The best tokenizer reached an F1 score of 99.57% (vs. 98.95% off-the-shelf baseline), while the best tagger reached an accuracy of 90.44% (vs. 84.86% baseline). The gold standard (more than 20,000 tokens of training and test data) is freely available online together with detailed annotation guidelines.
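For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how a tokenization F1 score and a tagging accuracy are commonly computed. It is not the shared task's evaluation script, which the abstract does not describe. The function names, the span-matching scheme (comparing token boundaries as character offsets with whitespace ignored), and the example sentence and tags are all assumptions made for illustration.

```python
def token_spans(tokens):
    """Map a token sequence to a set of (start, end) character offsets.

    Offsets are counted over the concatenated tokens, so whitespace
    in the original text plays no role (an assumption of this sketch).
    """
    spans, pos = set(), 0
    for tok in tokens:
        spans.add((pos, pos + len(tok)))
        pos += len(tok)
    return spans


def tokenization_f1(gold_tokens, pred_tokens):
    """F1 over tokens whose boundaries match the gold standard exactly."""
    gold, pred = token_spans(gold_tokens), token_spans(pred_tokens)
    true_pos = len(gold & pred)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


def tagging_accuracy(gold_tags, pred_tags):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold_tags) != len(pred_tags):
        raise ValueError("gold and predicted tag sequences differ in length")
    if not gold_tags:
        return 0.0
    correct = sum(g == p for g, p in zip(gold_tags, pred_tags))
    return correct / len(gold_tags)


# Hypothetical example: a tokenizer that splits an emoticon into pieces.
gold = ["Das", "ist", "gut", ":-)"]
pred = ["Das", "ist", "gut", ":", "-", ")"]
f1 = tokenization_f1(gold, pred)  # precision 3/6, recall 3/4

# Illustrative tag sequences (one of four tags is wrong).
acc = tagging_accuracy(["PDS", "VAFIN", "ADJD", "EMOASC"],
                       ["PDS", "VAFIN", "ADJD", "XY"])
```

In this toy case the split emoticon costs both precision and recall, which is the kind of error that CMC data tends to provoke in tokenizers built for edited text.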

Cite

Beißwenger, M., Bartsch, S., Evert, S., & Würzner, K. M. (2016). EmpiriST 2015: A Shared Task on the Automatic Linguistic Annotation of Computer-Mediated Communication and Web Corpora. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (pp. 44–56). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/w16-2606
