Identifying Incorrect Labels in the CoNLL-2003 Corpus

Citations: 21
Readers: 61 (Mendeley)

Abstract

The CoNLL-2003 corpus for English-language named entity recognition (NER) is one of the most influential corpora for NER model research. A large number of publications, including many landmark works, have used this corpus as a source of ground truth for NER tasks. In this paper, we examine this corpus and identify over 1300 incorrect labels (out of 35089 in the corpus). In particular, the number of incorrect labels in the test fold is comparable to the number of errors that state-of-the-art models make when running inference over this corpus. We describe the process by which we identified these incorrect labels, using novel variants of techniques from semi-supervised learning. We also summarize the types of errors that we found, and we revisit several recent results in NER in light of the corrected data. Finally, we show experimentally that our corrections to the corpus have a positive impact on three state-of-the-art models.
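The abstract does not spell out the label-screening procedure, but a common starting point for this kind of audit is disagreement-based flagging: train several models and surface tokens where the models' majority prediction contradicts the gold label. The sketch below illustrates that generic heuristic (the function name and the voting threshold are illustrative choices, not the paper's actual method):

```python
from collections import Counter

def flag_suspect_labels(gold_labels, model_predictions, min_votes=2):
    """Flag token indices where the gold label disagrees with the
    majority prediction of an ensemble of models.

    This is a generic label-error screening heuristic, shown for
    illustration; the paper describes its own semi-supervised variants.
    """
    suspects = []
    for i, gold in enumerate(gold_labels):
        # Tally each model's prediction for token i.
        votes = Counter(preds[i] for preds in model_predictions)
        top_label, top_count = votes.most_common(1)[0]
        # Report a token only when enough models agree on a
        # label that differs from the gold annotation.
        if top_label != gold and top_count >= min_votes:
            suspects.append((i, gold, top_label))
    return suspects

# Toy example: three models all tag token 2 as I-ORG, but the
# gold annotation says O, so that token is flagged for review.
gold = ["B-PER", "O", "O", "B-LOC"]
preds = [
    ["B-PER", "O", "I-ORG", "B-LOC"],
    ["B-PER", "O", "I-ORG", "B-LOC"],
    ["B-PER", "O", "I-ORG", "O"],
]
print(flag_suspect_labels(gold, preds))  # → [(2, 'O', 'I-ORG')]
```

In practice such candidates are then reviewed by hand, since ensemble disagreement flags both annotation errors and genuinely hard examples.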

Citation (APA)

Reiss, F., Xu, H., Cutler, B., Muthuraman, K., & Eichenberger, Z. (2020). Identifying Incorrect Labels in the CoNLL-2003 Corpus. In CoNLL 2020 - 24th Conference on Computational Natural Language Learning, Proceedings of the Conference (pp. 215–226). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2020.conll-1.16
