Quotebank: A Corpus of Quotations from a Decade of News

Timoté Vaucher; Andreas Spitz; Michele Catasta; Robert West

Conference Proceedings (Open Access)

WSDM 2021 - Proceedings of the 14th ACM International Conference on Web Search and Data Mining (2021) 328-336

DOI: 10.1145/3437963.3441760

Abstract

We present Quotebank, an open corpus of 178 million quotations attributed to the speakers who uttered them, extracted from 162 million English news articles published between 2008 and 2020. In order to produce this Web-scale corpus, while at the same time benefiting from the performance of modern neural models, we introduce Quobert, a minimally supervised framework for extracting and attributing quotations from massive corpora. Quobert avoids the necessity of manually labeled input and instead exploits the redundancy of the corpus by bootstrapping from a single seed pattern to extract training data for fine-tuning a BERT-based model. Quobert is language- and corpus-agnostic and correctly attributes 86.9% of quotations in our experiments. Quotebank and Quobert are publicly available at https://doi.org/10.5281/zenodo.4277311.
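
The bootstrapping idea in the abstract can be illustrated with a minimal Python sketch. It is not Quobert's code: the seed pattern (`"<quotation>," said <Speaker>`), the function names, and the toy matching rule are all assumptions made for illustration, and the BERT fine-tuning step is omitted. The sketch only shows how one high-precision pattern plus redundancy across articles could yield labeled examples without manual annotation.

```python
import re

# Hypothetical seed pattern (an assumption, not taken from the paper):
#   "<quotation>," said <Capitalized Multi-Word Name>
SEED = re.compile(
    r'"(?P<quote>[^"]+?),?"\s+said\s+'
    r'(?P<speaker>[A-Z][a-z]+(?: [A-Z][a-z]+)+)'
)


def extract_seed_pairs(articles):
    """Return (quotation, speaker) pairs matched by the seed pattern."""
    pairs = []
    for text in articles:
        for match in SEED.finditer(text):
            pairs.append((match.group("quote").strip(), match.group("speaker")))
    return pairs


def bootstrap_training_examples(articles, pairs):
    """Label articles that repeat a known quotation in a different context.

    An article becomes a training example when it contains a quotation
    already attributed via the seed pattern, mentions the same speaker,
    and does not itself express that quotation through the seed pattern.
    """
    known = dict(pairs)  # quotation -> speaker
    examples = []
    for text in articles:
        seed_quotes_here = {m.group("quote").strip() for m in SEED.finditer(text)}
        for quote, speaker in known.items():
            if quote in text and speaker in text and quote not in seed_quotes_here:
                examples.append((text, quote, speaker))
    return examples


if __name__ == "__main__":
    toy_articles = [
        '"We will rebuild," said Jane Doe at the rally.',
        'Jane Doe told reporters: "We will rebuild" before leaving.',
    ]
    seed_pairs = extract_seed_pairs(toy_articles)
    for example in bootstrap_training_examples(toy_articles, seed_pairs):
        print(example)
```

In this toy setup the first article supplies the attribution through the seed pattern, and the second article, which repeats the quotation in a different phrasing, becomes a labeled example. A real pipeline would then use such examples to fine-tune a model that attributes quotations in arbitrary contexts.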

Cite (APA)

Vaucher, T., Spitz, A., Catasta, M., & West, R. (2021). Quotebank: A Corpus of Quotations from a Decade of News. In WSDM 2021 - Proceedings of the 14th ACM International Conference on Web Search and Data Mining (pp. 328–336). Association for Computing Machinery, Inc. https://doi.org/10.1145/3437963.3441760
