Beyond Metadata: What Paper Authors Say About Corpora They Use

Abstract

The growing ecosystem of data sharing in science has brought dataset search into focus. To make data sharing and reuse more feasible, new retrieval tools and services are being developed. Currently, dataset retrieval relies almost exclusively on metadata provided by the publishers. To extend this knowledge source, our work studies the task of “dataset review mining” in scientific publications. For the field of Natural Language Processing, we collect metadata about datasets from established resources such as the ELRA and LDC catalogs, and then extract review statements about the datasets from ACL Anthology Corpus publications, compiling the Webis-Dataset-Reviews-21 corpus. By analyzing the reviews, we identify different categories of what paper authors write about data. To the best of our knowledge, this is the first analysis of this kind in the field of Natural Language Processing, although similar analyses have been carried out in the social and medical sciences. Our corpus and the underlying code are shared alongside this paper.

Cite

Kolyada, N., Potthast, M., & Stein, B. (2021). Beyond Metadata: What Paper Authors Say About Corpora They Use. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021 (pp. 5085–5090). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2021.findings-acl.451
