The growing ecosystem of data sharing in science has brought dataset search into focus. To make data sharing and reuse more feasible, new retrieval tools and services are being developed. Currently, dataset retrieval relies almost exclusively on metadata provided by the publishers. To extend this knowledge source, our work studies the task of “dataset review mining” in scientific publications. For the field of Natural Language Processing, we collect metadata about datasets from established resources such as the ELRA and LDC catalogs, and then extract review statements about the datasets from ACL Anthology Corpus publications, compiling the Webis-Dataset-Reviews-21 corpus. By analyzing the reviews, we identify different categories of what paper authors write about data. To the best of our knowledge, this is the first analysis of this kind in the field of Natural Language Processing, although similar analyses have been carried out in the social and medical sciences. Our corpus and the underlying code are shared alongside this paper.
Kolyada, N., Potthast, M., & Stein, B. (2021). Beyond Metadata: What Paper Authors Say About Corpora They Use. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021 (pp. 5085–5090). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2021.findings-acl.451