Following the works of Carletta (1996) and Artstein and Poesio (2008), there is an increasing consensus within the field that in order to properly gauge the reliability of an annotation effort, chance-corrected measures of inter-annotator agreement should be used. With this in mind, it is striking that virtually all evaluations of syntactic annotation efforts use uncorrected parser evaluation metrics such as bracket F1 (for phrase structure) and accuracy scores (for dependencies). In this work we present a chance-corrected metric based on Krippendorff's α, adapted to the structure of syntactic annotations and applicable both to phrase structure and dependency annotation without any modifications. To evaluate our metric we first present a number of synthetic experiments to better control the sources of noise and gauge the metric's responses, before finally contrasting the behaviour of our chance-corrected metric with that of uncorrected parser evaluation metrics on real corpora. © 2014 Association for Computational Linguistics.
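For orientation, Krippendorff's α is defined as α = 1 − D_o / D_e, where D_o is the observed disagreement within items and D_e the disagreement expected by chance. The sketch below is a minimal, generic illustration of that formula with a pluggable distance function; it assumes every annotator labels every item, and the names (krippendorff_alpha, distance) are illustrative. It is not the paper's syntax-specific distance, which would be plugged in via the distance argument.

    from itertools import combinations

    def krippendorff_alpha(items, distance):
        """Generic Krippendorff's alpha for fully annotated items.

        items[i] is the list of all annotators' annotations for item i.
        distance(a, b) is a symmetric dissimilarity with distance(a, a) == 0.
        Assumes every annotator annotates every item (a simplification).
        """
        # Observed disagreement: mean pairwise distance within each item.
        within = [distance(a, b)
                  for labels in items
                  for a, b in combinations(labels, 2)]
        d_o = sum(within) / len(within)

        # Expected disagreement: mean pairwise distance over all
        # annotations pooled together, ignoring item boundaries.
        pooled = [lab for labels in items for lab in labels]
        between = [distance(a, b) for a, b in combinations(pooled, 2)]
        d_e = sum(between) / len(between)

        # d_e == 0 means every annotation is identical; alpha is undefined then.
        return 1.0 - d_o / d_e

    # Toy usage with a nominal 0/1 distance; a syntax-aware metric would
    # supply a tree- or dependency-based distance instead.
    data = [["NOUN", "NOUN"], ["VERB", "NOUN"], ["ADJ", "ADJ"]]
    nominal = lambda a, b: 0.0 if a == b else 1.0
    print(krippendorff_alpha(data, nominal))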
CITATION STYLE
Skjærholt, A. (2014). A chance-corrected measure of inter-annotator agreement for syntax. In 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014 - Proceedings of the Conference (Vol. 1, pp. 934–944). Association for Computational Linguistics (ACL). https://doi.org/10.3115/v1/p14-1088