Language model (LM) pretraining captures knowledge of various kinds from text corpora, helping downstream NLP tasks. However, existing methods such as BERT model a single document and fail to capture dependencies and knowledge that span documents. In this work, we propose LinkBERT, an effective LM pretraining method that incorporates document links, such as hyperlinks. Given a pretraining corpus, we view it as a graph of documents and create LM inputs by placing linked documents in the same context. We then train the LM with two joint self-supervised tasks: masked language modeling and our newly proposed task, document relation prediction. We study LinkBERT in two domains: the general domain (pretrained on Wikipedia with hyperlinks) and the biomedical domain (pretrained on PubMed with citation links). LinkBERT outperforms BERT on various downstream tasks in both domains. It is especially effective for multi-hop reasoning and few-shot QA (+5% absolute improvement on HotpotQA and TriviaQA), and our biomedical LinkBERT sets a new state of the art on various BioNLP tasks (+7% on BioASQ and USMLE). We release our pretrained models, LinkBERT and BioLinkBERT, as well as code and data.
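To make the input-construction idea concrete, below is a minimal sketch (not the authors' released implementation) of how LinkBERT-style pretraining instances could be assembled. It assumes a hypothetical toy corpus dictionary where each document lists the ids of the documents it links to; the helper `make_instance`, the `corpus` contents, and the 3-way relation labels (contiguous / random / linked) are illustrative choices consistent with the abstract's description of placing linked documents in the same context and predicting the document relation.

```python
# Sketch of LinkBERT-style input construction (illustrative only).
import random

# Hypothetical toy corpus: doc id -> (list of text segments, list of linked doc ids)
corpus = {
    "A": (["a1 tokens ...", "a2 tokens ..."], ["B"]),
    "B": (["b1 tokens ...", "b2 tokens ..."], []),
    "C": (["c1 tokens ...", "c2 tokens ..."], ["A"]),
}

# Document relation prediction label set assumed here: how the second segment
# relates to the anchor segment.
REL_LABELS = {"contiguous": 0, "random": 1, "linked": 2}

def make_instance(doc_id, seg_idx, rng):
    """Pair an anchor segment with a second segment and a document-relation label."""
    segments, links = corpus[doc_id]
    anchor = segments[seg_idx]
    choice = rng.choice(list(REL_LABELS))
    if choice == "contiguous" and seg_idx + 1 < len(segments):
        second = segments[seg_idx + 1]          # next segment of the same document
    elif choice == "linked" and links:
        linked_doc = rng.choice(links)          # segment from a hyperlinked document
        second = rng.choice(corpus[linked_doc][0])
    else:
        choice = "random"                       # fall back to a random other document
        other = rng.choice([d for d in corpus if d != doc_id])
        second = rng.choice(corpus[other][0])
    # Both segments are placed in one context window; MLM masking would be
    # applied on top of this text during pretraining.
    text = f"[CLS] {anchor} [SEP] {second} [SEP]"
    return text, REL_LABELS[choice]

rng = random.Random(0)
for _ in range(3):
    text, label = make_instance("A", 0, rng)
    print(label, text)
```

In this sketch the same batch mixes contiguous, random, and linked pairs, so the relation-prediction head has to use cross-document signal rather than surface position alone; the actual sampling ratios and segment lengths used by LinkBERT are not specified in the abstract.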
Citation:
Yasunaga, M., Leskovec, J., & Liang, P. (2022). LinkBERT: Pretraining Language Models with Document Links. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (Vol. 1, pp. 8003–8016). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2022.acl-long.551