Abstract
This paper describes the goals, design and results of a shared task on the automatic linguistic annotation of German language data from genres of computer-mediated communication (CMC), social media interactions and Web corpora. The two subtasks of tokenization and part-of-speech tagging were performed on two data sets: (i) a genuine CMC data set with samples from several CMC genres, and (ii) a Web corpora data set of CC-licensed Web pages which represents the type of data found in large corpora crawled from the Web. The teams participating in the shared task achieved a substantial improvement over current off-the-shelf tools for German. The best tokenizer reached an F1 score of 99.57% (vs. 98.95% off-the-shelf baseline), while the best tagger reached an accuracy of 90.44% (vs. 84.86% baseline). The gold standard (more than 20,000 tokens of training and test data) is freely available online together with detailed annotation guidelines.
Cite
Beißwenger, M., Bartsch, S., Evert, S., & Würzner, K.-M. (2016). EmpiriST 2015: A Shared Task on the Automatic Linguistic Annotation of Computer-Mediated Communication and Web Corpora. In Proceedings of the 10th Web as Corpus Workshop (pp. 44–56). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/w16-2606