CITEWORTH: Cite-Worthiness Detection for Improved Scientific Document Understanding

Dustin Wright; Isabelle Augenstein

Conference Proceedings

Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021 (2021) 1796-1807

DOI: 10.18653/v1/2021.findings-acl.157

Abstract

Scientific document understanding is challenging as the data is highly domain-specific and diverse. However, datasets for tasks with scientific text require expensive manual annotation and tend to be small and limited to only one or a few fields. At the same time, scientific documents contain many potential training signals, such as citations, which can be used to build large labelled datasets. Given this, we present an in-depth study of cite-worthiness detection in English, where a sentence is labelled for whether or not it cites an external source. To accomplish this, we introduce CITEWORTH, a large, contextualized, rigorously cleaned labelled dataset for cite-worthiness detection built from a massive corpus of extracted plain-text scientific documents. We show that CITEWORTH is high-quality, challenging, and suitable for studying problems such as domain adaptation. Our best-performing cite-worthiness detection model is a paragraph-level contextualized sentence labelling model based on Longformer, exhibiting a 5-point F1 improvement over SciBERT, which considers only individual sentences. Finally, we demonstrate that language model fine-tuning with cite-worthiness as a secondary task leads to improved performance on downstream scientific document understanding tasks.
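To make the labelling idea concrete, the following is a minimal sketch of how citation markers in plain text could be turned into cite-worthiness labels: a sentence is marked positive if it contains a citation marker, and the marker is stripped so the text alone does not give the label away. This is an illustration only, not the authors' cleaning pipeline; the two regular expressions, the function name `label_sentence`, and the example sentences (including the cited names) are all assumptions made up for this sketch.

```python
import re

# Hypothetical patterns covering two common citation styles.
# Real corpora contain many more variants than these.
CITATION_PATTERNS = [
    # Numeric markers such as [3], [1, 2] or [4-6]
    re.compile(r"\s*\[\d+(?:\s*[,\-]\s*\d+)*\]"),
    # Author-year markers such as (Smith, 2020) or (Smith et al., 2020; Doe and Roe, 2019)
    re.compile(
        r"\s*\((?:[A-Z][A-Za-z\-]+(?: et al\.| and [A-Z][A-Za-z\-]+)?,? \d{4}[a-z]?(?:; )?)+\)"
    ),
]


def label_sentence(sentence):
    """Return (text_without_citation_markers, is_cite_worthy)."""
    cleaned = sentence
    found = False
    for pattern in CITATION_PATTERNS:
        # subn returns the new string and the number of replacements made
        cleaned, count = pattern.subn("", cleaned)
        found = found or count > 0
    return cleaned.strip(), found


if __name__ == "__main__":
    # Invented example sentences; the cited names are placeholders.
    examples = [
        "Transformers are widely used [1, 2].",
        "Prior work reports similar gains (Smith and Jones, 2020).",
        "We evaluate on three datasets.",
    ]
    for text in examples:
        print(label_sentence(text))
```

A pattern-based labeller like this is noisy, which is presumably why the abstract stresses that the dataset is rigorously cleaned; the sketch only shows where the training signal comes from.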

Cite (APA)

Wright, D., & Augenstein, I. (2021). CITEWORTH: Cite-Worthiness Detection for Improved Scientific Document Understanding. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021 (pp. 1796–1807). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2021.findings-acl.157
