Do dependency parsing metrics correlate with human judgments?

Barbara Plank; Héctor Martínez Alonso; Željko Agić; Danijela Merkler; Anders Søgaard

Conference Proceedings (Open Access)

CoNLL 2015 - 19th Conference on Computational Natural Language Learning, Proceedings (2015) 315-320

DOI: 10.18653/v1/k15-1033

15 Citations

74 Readers

Abstract

Using automatic measures such as labeled and unlabeled attachment scores is common practice in dependency parser evaluation. In this paper, we examine whether these measures correlate with human judgments of overall parse quality. We ask linguists with experience in dependency annotation to judge system outputs. We measure the correlation between their judgments and a range of parse evaluation metrics across five languages. The human-metric correlation is lower for dependency parsing than for other NLP tasks. Also, inter-annotator agreement is sometimes higher than the agreement between judgments and metrics, indicating that the standard metrics fail to capture certain aspects of parse quality, such as the relevance of root attachment or the relative importance of the different parts of speech.
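
For readers unfamiliar with the two attachment scores named in the abstract, the following is a minimal sketch of how they are commonly computed for a single sentence. Unlabeled attachment score (UAS) is the fraction of tokens whose predicted head matches the gold head; labeled attachment score (LAS) additionally requires the dependency label to match. The function name `attachment_scores`, the `(head, label)` token representation, and the toy sentence are assumptions made for illustration and are not taken from the paper, which may differ in details such as punctuation handling.

```python
from typing import List, Tuple

# Hypothetical representation: each token is (head_index, dependency_label),
# with head index 0 standing for the artificial root.
Token = Tuple[int, str]


def attachment_scores(gold: List[Token], pred: List[Token]) -> Tuple[float, float]:
    """Return (UAS, LAS) for one sentence."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted parses must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sentence")
    # UAS counts tokens with the correct head only.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS counts tokens with both the correct head and the correct label.
    both_hits = sum(g == p for g, p in zip(gold, pred))
    n = len(gold)
    return head_hits / n, both_hits / n


# Toy four-token example (invented data, not from the paper).
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "nmod")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod"), (2, "nmod")]
uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.2f} LAS={las:.2f}")
```

Both scores weight every token equally, which is the kind of property the abstract points to when it says the standard metrics miss the relevance of root attachment or the relative importance of different parts of speech.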

Citation (APA)

Plank, B., Alonso, H. M., Agić, Ž., Merkler, D., & Søgaard, A. (2015). Do dependency parsing metrics correlate with human judgments? In CoNLL 2015 - 19th Conference on Computational Natural Language Learning, Proceedings (pp. 315–320). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/k15-1033
