Domain-matched Pre-training Tasks for Dense Retrieval

20 citations · 77 Mendeley readers

Abstract

Pre-training on larger datasets with ever-increasing model size is now a proven recipe for improved performance across almost all NLP tasks. A notable exception is information retrieval, where additional pre-training has so far failed to produce convincing results. We show that, with the right pre-training setup, this barrier can be overcome. We demonstrate this by pre-training large bi-encoder models on 1) a recently released set of 65 million synthetically generated questions, and 2) 200 million post-comment pairs from a pre-existing dataset of Reddit conversations. We evaluate on a set of information retrieval and dialogue retrieval benchmarks, showing substantial improvements over supervised baselines.
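The pre-training recipe the abstract describes follows the standard dense-retrieval pattern: two encoders map a question (or Reddit post) and a passage (or comment) to fixed-size vectors, relevance is scored by their dot product, and training is contrastive so that paired items score higher than unpaired ones. The sketch below illustrates one such training step with in-batch negatives; the encoder checkpoint, [CLS] pooling, and sequence length are illustrative assumptions, not the authors' exact configuration.

# Minimal sketch of bi-encoder pre-training with in-batch negatives.
# Assumptions: BERT-base towers, [CLS] pooling, dot-product scoring;
# the paper's actual models, pooling, and data pipeline may differ.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-uncased"  # placeholder; the paper uses large models
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
q_encoder = AutoModel.from_pretrained(MODEL_NAME)  # question/post tower
p_encoder = AutoModel.from_pretrained(MODEL_NAME)  # passage/comment tower

def encode(encoder, texts):
    """Encode a batch of texts into fixed-size vectors via [CLS] pooling."""
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=128, return_tensors="pt")
    out = encoder(**batch)
    return out.last_hidden_state[:, 0]  # [CLS] token embedding, shape (B, H)

def in_batch_negative_loss(questions, passages):
    """Contrastive loss: each question's positive passage shares its batch
    index; every other passage in the batch acts as a negative."""
    q = encode(q_encoder, questions)   # (B, H)
    p = encode(p_encoder, passages)    # (B, H)
    scores = q @ p.T                   # (B, B) dot-product similarity matrix
    labels = torch.arange(q.size(0))   # diagonal entries are the positives
    return F.cross_entropy(scores, labels)

# Example pre-training step on one (synthetic question, passage) batch:
loss = in_batch_negative_loss(
    ["who wrote hamlet?", "what is the capital of france?"],
    ["Hamlet is a tragedy written by William Shakespeare ...",
     "Paris is the capital and largest city of France ..."],
)
loss.backward()

Each question's paired passage sits on the diagonal of the score matrix, so the remaining B-1 passages in the batch serve as free negatives; this is what makes contrastive pre-training on hundreds of millions of pairs tractable.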

Citation (APA)

Oguz, B., Lakhotia, K., Gupta, A., Lewis, P., Karpukhin, V., Piktus, A., … Mehdad, Y. (2022). Domain-matched Pre-training Tasks for Dense Retrieval. In Findings of the Association for Computational Linguistics: NAACL 2022 (pp. 1524–1534). Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.findings-naacl.114
