Hyperlink-induced Pre-training for Passage Retrieval in Open-domain Question Answering

Jiawei Zhou; Xiaoguang Li; Lifeng Shang; Lan Luo; Ke Zhan; Enrui Hu; Xinyu Zhang; Hao Jiang; Zhao Cao; Fan Yu; Xin Jiang; Qun Liu; Lei Chen

Conference Proceedings, Open Access

Proceedings of the Annual Meeting of the Association for Computational Linguistics (2022), Vol. 1, pp. 7135-7146

DOI: 10.18653/v1/2022.acl-long.493

Abstract

To alleviate data scarcity in training question answering systems, recent work proposes additional intermediate pre-training for dense passage retrieval (DPR). However, a large discrepancy remains between the upstream signals these methods provide and the downstream question-passage relevance, which limits the improvement. To bridge this gap, we propose HyperLink-induced Pre-training (HLP), a method that pre-trains the dense retriever with the text relevance induced by hyperlink-based topology within Web documents. We demonstrate that the hyperlink-based structures of dual-link and co-mention provide effective relevance signals for large-scale pre-training that better facilitate downstream passage retrieval. We investigate the effectiveness of our approach across a wide range of open-domain QA datasets under zero-shot, few-shot, multi-hop, and out-of-domain scenarios. The experiments show that HLP outperforms BM25 by up to 7 points, and other pre-training methods by more than 10 points, in top-20 retrieval accuracy under the zero-shot scenario. Furthermore, HLP significantly outperforms other pre-training methods under the other scenarios.
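
The abstract names two hyperlink structures, dual-link and co-mention, without defining them. The sketch below is one plausible reading, not the authors' procedure: it assumes that a dual-link pair is two passages from different documents that each hyperlink to the other's document, and that a co-mention pair is two such passages that both hyperlink to a common third document. The function name `mine_pairs`, the input layout (`links`, `doc_of`), and the toy data are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def mine_pairs(links, doc_of):
    """Collect candidate relevance pairs from hyperlink structure.

    links:  passage id -> set of document ids the passage hyperlinks to
    doc_of: passage id -> id of the document that contains the passage

    Assumed definitions (not taken from the abstract):
      dual-link  = each passage links to the other passage's document
      co-mention = both passages link to a shared third document
    """
    dual_link, co_mention = [], []
    for a, b in combinations(sorted(links), 2):
        if doc_of[a] == doc_of[b]:
            continue  # only pair passages from different documents
        if doc_of[b] in links[a] and doc_of[a] in links[b]:
            dual_link.append((a, b))
        elif (links[a] & links[b]) - {doc_of[a], doc_of[b]}:
            co_mention.append((a, b))
    return dual_link, co_mention


# Toy hyperlink data (hypothetical).
links = {
    "p1": {"D2", "D3"},
    "p2": {"D1"},
    "p3": {"D4"},
    "p4": {"D4", "D1"},
}
doc_of = {"p1": "D1", "p2": "D2", "p3": "D3", "p4": "D5"}

dual, co = mine_pairs(links, doc_of)
print(dual)  # [('p1', 'p2')]
print(co)    # [('p2', 'p4'), ('p3', 'p4')]
```

The pairwise loop is quadratic in the number of passages, so it only suits small illustrations; mining at Web scale would need an inverted index from linked document to passages, and the resulting pairs would still have to be turned into query-passage training examples for the retriever.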

Cite

Zhou, J., Li, X., Shang, L., Luo, L., Zhan, K., Hu, E., … Chen, L. (2022). Hyperlink-induced Pre-training for Passage Retrieval in Open-domain Question Answering. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (Vol. 1, pp. 7135–7146). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2022.acl-long.493