Better than their reputation? On the reliability of relevance assessments with students

Philipp Schaer

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2012) 7488 LNCS 124-135

DOI: 10.1007/978-3-642-33247-0_14

11 Citations

27 Readers

Abstract

During the last three years we conducted several information retrieval evaluation series with more than 180 LIS students, who made relevance assessments on the outcomes of three specific retrieval services. In this study we focus not on the retrieval performance of our system but on the relevance assessments and the inter-assessor reliability. To quantify the agreement we apply Fleiss' Kappa and Krippendorff's Alpha. Comparing these two statistical measures, the average Kappa value was 0.37 and the average Alpha value 0.15. We use the two agreement measures to drop assessments that are too unreliable from our data set. When computing the differences between the unfiltered and the filtered data set we see a root mean square error between 0.02 and 0.12. We see this as a clear indicator that disagreement affects the reliability of retrieval evaluations. We suggest either not working with unfiltered results or clearly documenting the disagreement rates. © 2012 Springer-Verlag.
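The abstract names Fleiss' Kappa as one agreement measure and a root mean square error as the comparison between unfiltered and filtered results. As a rough illustration of those two quantities, the following is a minimal sketch using their textbook definitions. It is not the author's code: the function names `fleiss_kappa` and `rmse` are hypothetical, the toy table is invented and not taken from the study, and Krippendorff's Alpha is left out.

```python
import math


def fleiss_kappa(counts):
    """Fleiss' kappa for a table of per-item category counts.

    counts[i][j] is the number of assessors who put item i into category j.
    Every item must be judged by the same number of assessors (at least 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of assessors")

    n_categories = len(counts[0])
    # Share of all judgments that fall into each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Per-item agreement: fraction of assessor pairs that chose the same category.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    if p_expected == 1.0:
        # Only one category was ever used, so kappa is undefined.
        return float("nan")
    return (p_observed - p_expected) / (1.0 - p_expected)


def rmse(a, b):
    """Root mean square error between two equally long score lists."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


# Invented toy data (assumption): 4 documents, 2 assessors each,
# categories are [relevant, not relevant].
toy_counts = [[2, 0], [0, 2], [1, 1], [1, 1]]
print(fleiss_kappa(toy_counts))

# Invented per-topic scores before and after dropping unreliable assessments.
print(rmse([0.50, 0.40, 0.30], [0.52, 0.35, 0.30]))
```

In this sketch, a kappa of 1 means the assessors agreed on every document, and values near 0 mean agreement no better than chance. How the study turned such scores into a filtering threshold is not stated in the abstract.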

Cite

Schaer, P. (2012). Better than their reputation? On the reliability of relevance assessments with students. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 7488 LNCS, pp. 124–135). https://doi.org/10.1007/978-3-642-33247-0_14
