LinkBERT: Pretraining Language Models with Document Links

Abstract

Language model (LM) pretraining captures various knowledge from text corpora, helping downstream NLP tasks. However, existing methods such as BERT model a single document and do not capture dependencies or knowledge that span across documents. In this work, we propose LinkBERT, an effective LM pretraining method that incorporates document links, such as hyperlinks. Given a pretraining corpus, we view it as a graph of documents and create LM inputs by placing linked documents in the same context. We then train the LM with two joint self-supervised objectives: masked language modeling and our newly proposed task, document relation prediction. We study LinkBERT in two domains: the general domain (pretrained on Wikipedia with hyperlinks) and the biomedical domain (pretrained on PubMed with citation links). LinkBERT outperforms BERT on various downstream tasks in both domains. It is especially effective for multi-hop reasoning and few-shot QA (+5% absolute improvement on HotpotQA and TriviaQA), and our biomedical LinkBERT sets a new state of the art on various BioNLP tasks (+7% on BioASQ and USMLE). We release the pretrained models, LinkBERT and BioLinkBERT, as well as code and data.
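
To make the pretraining setup described above concrete, below is a minimal, hypothetical Python sketch of how linked documents could be paired into LM inputs with a document-relation label. The function name build_drp_example, the docs/links data layout, and the example texts are illustrative assumptions based on the abstract, not the authors' released implementation.

import random

# Hypothetical sketch of LinkBERT-style input construction (not the authors' released code).
# Each document is a list of text segments; links[doc_id] lists the ids of hyperlinked documents.

DRP_LABELS = {"contiguous": 0, "linked": 1, "random": 2}


def build_drp_example(doc_id, seg_idx, docs, links, rng=random):
    """Pair an anchor segment with a second segment plus a document-relation label.

    The second segment is drawn from (a) the same document (contiguous),
    (b) a hyperlinked document, or (c) an unrelated random document.
    """
    anchor = docs[doc_id][seg_idx]
    choice = rng.choice(["contiguous", "linked", "random"])

    if choice == "contiguous" and seg_idx + 1 < len(docs[doc_id]):
        second = docs[doc_id][seg_idx + 1]
    elif choice == "linked" and links.get(doc_id):
        second = rng.choice(docs[rng.choice(links[doc_id])])
    else:
        choice = "random"
        other_ids = [d for d in docs if d != doc_id and d not in links.get(doc_id, [])]
        second = rng.choice(docs[rng.choice(other_ids)])

    # Downstream, the pair would be tokenized as [CLS] anchor [SEP] second [SEP]
    # and trained jointly with masked language modeling and document relation prediction.
    return {"text_a": anchor, "text_b": second, "drp_label": DRP_LABELS[choice]}


if __name__ == "__main__":
    docs = {
        "A": ["Tidal forces arise from gravity.", "The Moon drives most ocean tides."],
        "B": ["The Moon is Earth's only natural satellite."],
        "C": ["Basalt is a fine-grained volcanic rock."],
    }
    links = {"A": ["B"]}  # document A hyperlinks to document B
    print(build_drp_example("A", 0, docs, links))

The sketch covers only the example-construction step; the paper's actual pretraining then optimizes the two self-supervised objectives over such pairs.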

Cite (APA)

Yasunaga, M., Leskovec, J., & Liang, P. (2022). LinkBERT: Pretraining Language Models with Document Links. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (Vol. 1, pp. 8003–8016). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2022.acl-long.551
