Beyond Metadata: What Paper Authors Say About Corpora They Use

Nikolay Kolyada; Martin Potthast; Benno Stein

Conference Proceedings, Open Access

Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021 (2021) 5085-5090

DOI: 10.18653/v1/2021.findings-acl.451

Abstract

The growing ecosystem of data sharing in science has brought dataset search into focus. To make data sharing and reuse more feasible, new retrieval tools and services are being developed. Currently, dataset retrieval relies almost exclusively on metadata provided by the publishers. To extend this knowledge source, our work studies the task of “dataset review mining” in scientific publications. For the field of Natural Language Processing, we collect metadata about datasets from established resources such as the ELRA and LDC catalogs, and then extract review statements about the datasets from ACL Anthology Corpus publications, compiling the Webis-Dataset-Reviews-21 corpus. By analyzing the reviews, we identify different categories of what paper authors write about data. To the best of our knowledge, this is the first analysis of this kind in the field of Natural Language Processing, although similar analyses have been carried out in the social and medical sciences. Our corpus and the underlying code are shared alongside this paper.

Cite (APA)

Kolyada, N., Potthast, M., & Stein, B. (2021). Beyond Metadata: What Paper Authors Say About Corpora They Use. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021 (pp. 5085–5090). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2021.findings-acl.451
