A Test Collection of Synthetic Documents for Training Rankers: ChatGPT vs. Human Experts

Abstract

We investigate the usefulness of generative large language models (LLMs) in generating training data for cross-encoder re-rankers in a novel direction: generating synthetic documents instead of synthetic queries. We introduce a new dataset, ChatGPT-RetrievalQA, and compare the effectiveness of strong models fine-tuned on both LLM-generated and human-generated data. We build ChatGPT-RetrievalQA based on an existing dataset, the Human ChatGPT Comparison Corpus (HC3), consisting of multiple public question collections featuring both human- and ChatGPT-generated responses. We fine-tune a range of cross-encoder re-rankers on either human-generated or ChatGPT-generated data. Our evaluation on MS MARCO DEV, TREC DL'19, and TREC DL'20 demonstrates that cross-encoder re-ranking models trained on LLM-generated responses are significantly more effective for out-of-domain re-ranking than those trained on human responses. For in-domain re-ranking, however, the human-trained re-rankers outperform the LLM-trained re-rankers. Our findings suggest that generative LLMs have high potential in generating training data for neural retrieval models and can be used to augment training data, especially in domains with less labeled data. ChatGPT-RetrievalQA presents various opportunities for analyzing and improving rankers with both human- and LLM-generated data. Our data, code, and model checkpoints are publicly available.

Cite

Askari, A., Aliannejadi, M., Kanoulas, E., & Verberne, S. (2023). A Test Collection of Synthetic Documents for Training Rankers: ChatGPT vs. Human Experts. In International Conference on Information and Knowledge Management, Proceedings (pp. 5311–5315). Association for Computing Machinery. https://doi.org/10.1145/3583780.3615111
