Language model (LM) pretraining captures knowledge of various kinds from text corpora, helping downstream NLP tasks. However, existing methods such as BERT model a single document and fail to capture dependencies and knowledge that span documents. In this work, we propose LinkBERT, an effective LM pretraining method that incorporates document links, such as hyperlinks. Given a pretraining corpus, we view it as a graph of documents and create LM inputs by placing linked documents in the same context. We then train the LM with two joint self-supervised tasks: masked language modeling and our newly proposed task, document relation prediction. We study LinkBERT in two domains: the general domain (pretrained on Wikipedia with hyperlinks) and the biomedical domain (pretrained on PubMed with citation links). LinkBERT outperforms BERT on various downstream tasks in both domains. It is especially effective for multi-hop reasoning and few-shot QA (+5% absolute improvement on HotpotQA and TriviaQA), and our biomedical LinkBERT sets a new state of the art on various BioNLP tasks (+7% on BioASQ and USMLE). We release our pretrained models, LinkBERT and BioLinkBERT, as well as code and data.
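To make the input-construction idea concrete, below is a minimal sketch (not the authors' released implementation) of how LinkBERT-style pretraining instances could be assembled. It assumes a hypothetical toy corpus dictionary where each document lists the ids of the documents it links to; the helper `make_instance`, the `corpus` contents, and the 3-way relation labels (contiguous / random / linked) are illustrative choices consistent with the abstract's description of placing linked documents in the same context and predicting the document relation.

```python
# Sketch of LinkBERT-style input construction (illustrative only).
import random

# Hypothetical toy corpus: doc id -> (list of text segments, list of linked doc ids)
corpus = {
    "A": (["a1 tokens ...", "a2 tokens ..."], ["B"]),
    "B": (["b1 tokens ...", "b2 tokens ..."], []),
    "C": (["c1 tokens ...", "c2 tokens ..."], ["A"]),
}

# Document relation prediction label set assumed here: how the second segment
# relates to the anchor segment.
REL_LABELS = {"contiguous": 0, "random": 1, "linked": 2}

def make_instance(doc_id, seg_idx, rng):
    """Pair an anchor segment with a second segment and a document-relation label."""
    segments, links = corpus[doc_id]
    anchor = segments[seg_idx]
    choice = rng.choice(list(REL_LABELS))
    if choice == "contiguous" and seg_idx + 1 < len(segments):
        second = segments[seg_idx + 1]          # next segment of the same document
    elif choice == "linked" and links:
        linked_doc = rng.choice(links)          # segment from a hyperlinked document
        second = rng.choice(corpus[linked_doc][0])
    else:
        choice = "random"                       # fall back to a random other document
        other = rng.choice([d for d in corpus if d != doc_id])
        second = rng.choice(corpus[other][0])
    # Both segments are placed in one context window; MLM masking would be
    # applied on top of this text during pretraining.
    text = f"[CLS] {anchor} [SEP] {second} [SEP]"
    return text, REL_LABELS[choice]

rng = random.Random(0)
for _ in range(3):
    text, label = make_instance("A", 0, rng)
    print(label, text)
```

In this sketch the same batch mixes contiguous, random, and linked pairs, so the relation-prediction head has to use cross-document signal rather than surface position alone; the actual sampling ratios and segment lengths used by LinkBERT are not specified in the abstract.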
Citation:
Yasunaga, M., Leskovec, J., & Liang, P. (2022). LinkBERT: Pretraining Language Models with Document Links. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (Vol. 1, pp. 8003–8016). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2022.acl-long.551