Expand, Highlight, Generate: RL-driven Document Generation for Passage Reranking

Abstract

Generating synthetic training data for ranking models with large language models (LLMs) has gained attention recently. Prior studies use LLMs to build pseudo query-document pairs by generating synthetic queries from documents in a corpus. In this paper, we propose a new perspective on data augmentation: generating synthetic documents from queries. To achieve this, we propose DocGen, a three-step pipeline that uses the few-shot capabilities of LLMs. The DocGen pipeline performs synthetic document generation by (i) expanding and (ii) highlighting the original query, and then (iii) generating a synthetic document that is likely to be relevant to the query. To further improve the relevance between generated synthetic documents and their corresponding queries, we propose DocGen-RL, which treats the estimated relevance of the document as a reward and uses reinforcement learning (RL) to optimize the DocGen pipeline. Extensive experiments demonstrate that DocGen and DocGen-RL significantly outperform existing state-of-the-art data augmentation methods, such as InPars, indicating that our new perspective of generating documents leverages the capacity of LLMs to generate synthetic data more effectively. We release the code, generated data, and model checkpoints to foster research in this area.
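
The three steps named in the abstract can be pictured as a chain of LLM calls, with an estimated relevance score playing the role of the reward. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the `LLM` callable type, the prompt wording, the choice to highlight the expanded query, and the helpers `echo_llm` and `token_overlap` are all hypothetical stand-ins, and no RL optimization is performed here.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interface: any function that maps a prompt to a completion.
LLM = Callable[[str], str]


@dataclass
class SyntheticPair:
    query: str
    expanded_query: str
    highlighted_query: str
    document: str


def docgen(query: str, llm: LLM) -> SyntheticPair:
    """Expand, highlight, generate. Prompt wording is illustrative only."""
    expanded = llm(f"Expand this query with clarifying context:\n{query}")
    # Assumption: the highlighting step operates on the expanded query.
    highlighted = llm(f"Mark the most important terms in this query:\n{expanded}")
    document = llm(f"Write a passage that is relevant to this query:\n{highlighted}")
    return SyntheticPair(query, expanded, highlighted, document)


def reward(pair: SyntheticPair, scorer: Callable[[str, str], float]) -> float:
    """Estimated relevance of the generated document to the original query."""
    return scorer(pair.query, pair.document)


def echo_llm(prompt: str) -> str:
    """Stand-in for a real model: echoes the last prompt line with a tag."""
    return prompt.splitlines()[-1] + " [llm]"


def token_overlap(query: str, document: str) -> float:
    """Toy relevance estimate: fraction of query tokens found in the document."""
    query_tokens = set(query.lower().split())
    if not query_tokens:
        return 0.0
    document_tokens = set(document.lower().split())
    return len(query_tokens & document_tokens) / len(query_tokens)


pair = docgen("what is bm25", echo_llm)
score = reward(pair, token_overlap)
```

In a real setting, `echo_llm` would be replaced by a few-shot prompted model and `token_overlap` by a learned relevance estimator; an RL procedure would then use the returned score as its reward signal.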

Citation (APA)

Askari, A., Aliannejadi, M., Meng, C., Kanoulas, E., & Verberne, S. (2023). Expand, Highlight, Generate: RL-driven Document Generation for Passage Reranking. In EMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings (pp. 10087–10099). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.emnlp-main.623
