Perspectives on Large Language Models for Relevance Judgment


Abstract

When asked, large language models (LLMs) like ChatGPT claim that they can assist with relevance judgments, but it is not clear whether automated judgments can reliably be used in evaluations of retrieval systems. In this perspectives paper, we discuss possible ways for LLMs to support relevance judgments, along with the concerns and issues that arise. We devise a human-machine collaboration spectrum that allows different relevance judgment strategies to be categorized based on how much humans rely on machines. For the extreme point of "fully automated judgments", we further include a pilot experiment on whether LLM-based relevance judgments correlate with judgments from trained human assessors. We conclude the paper by providing opposing perspectives for and against the use of LLMs for automatic relevance judgments, as well as a compromise perspective, informed by our analyses of the literature, our preliminary experimental evidence, and our experience as IR researchers.

Citation (APA)

Faggioli, G., Dietz, L., Clarke, C. L. A., Demartini, G., Hagen, M., Hauff, C., … Wachsmuth, H. (2023). Perspectives on Large Language Models for Relevance Judgment. In ICTIR 2023 - Proceedings of the 2023 ACM SIGIR International Conference on the Theory of Information Retrieval (pp. 39–50). Association for Computing Machinery, Inc. https://doi.org/10.1145/3578337.3605136
