Automatic identification of spurious instances (those with potentially wrong labels in datasets) can improve the quality of existing language resources, especially when annotations are obtained through crowdsourcing or automatically generated based on coded rankings. In this paper, we present an effective approach inspired by queueing theory and psychology of learning to automatically identify spurious instances in datasets. Our approach discriminates instances based on their "difficulty to learn," determined by a downstream learner. Our method can be applied to any dataset assuming the existence of a neural network model for the target task of the dataset. Our best approach outperforms competing state-of-the-art baselines and has a MAP of 0.85 and 0.22 in identifying spurious instances in synthetic and carefully crowdsourced real-world datasets, respectively.
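The idea of ranking instances by "difficulty to learn" and scoring that ranking can be illustrated with a minimal sketch. It is not the queue-based method of the paper: as a simplifying assumption, difficulty is approximated here by the mean per-epoch training loss of each instance, and the names `rank_by_difficulty` and `average_precision`, as well as the toy loss values, are hypothetical. Average precision is the per-ranking quantity that MAP averages over several rankings.

```python
from statistics import mean


def rank_by_difficulty(loss_history):
    """Return instance indices sorted from hardest to easiest.

    loss_history[i] holds the per-epoch losses recorded for instance i.
    Assumption: mean loss is used as a stand-in for "difficulty to learn".
    """
    scores = [mean(losses) for losses in loss_history]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def average_precision(ranking, spurious):
    """Average precision of a ranking against the set of spurious indices."""
    hits = 0
    precision_sum = 0.0
    for rank, idx in enumerate(ranking, start=1):
        if idx in spurious:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    return precision_sum / len(spurious) if spurious else 0.0


# Toy data (made up for illustration): losses over three epochs.
loss_history = [
    [0.9, 0.4, 0.1],    # instance 0: learned quickly
    [1.2, 1.1, 1.0],    # instance 1: stays hard
    [0.8, 0.3, 0.1],    # instance 2: learned quickly
    [1.0, 0.9, 0.95],   # instance 3: stays hard
    [1.1, 0.6, 0.2],    # instance 4: learned eventually
]
spurious = {1, 3}  # indices assumed to carry wrong labels

ranking = rank_by_difficulty(loss_history)
ap = average_precision(ranking, spurious)
print(ranking, ap)
```

In this toy setting the two instances whose loss never drops are ranked first, so the ranking recovers both assumed-spurious instances before any clean one.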
CITATION STYLE
Amiri, H., Miller, T. A., & Savova, G. (2018). Spotting spurious data with neural networks. In NAACL HLT 2018 - 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference (Vol. 1, pp. 2006–2016). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/n18-1182