Towards a Better Understanding of Noise in Natural Language Processing

Abstract

In this paper, we propose a definition and taxonomy of various types of non-standard textual content, generally referred to as “noise”, in Natural Language Processing (NLP). While data pre-processing is undoubtedly important in NLP, especially when dealing with user-generated content, a broader understanding of the different sources of noise and how to deal with them has been largely neglected. We provide a comprehensive list of potential sources of noise, categorise and describe them, and show the impact of a subset of standard pre-processing strategies on different tasks. Our main goal is to raise awareness of non-standard content, which should not always be considered “noise”, and of the need for careful, task-dependent pre-processing. Such pre-processing is an alternative to the blanket, all-encompassing solutions that researchers generally apply through “standard” pre-processing pipelines. The intention is for this categorisation to serve as a point of reference to support NLP researchers in devising strategies to clean, normalise or embrace non-standard content.

Cite

Sharou, K. A., Li, Z., & Specia, L. (2021). Towards a Better Understanding of Noise in Natural Language Processing. In International Conference Recent Advances in Natural Language Processing, RANLP (pp. 53–62). Incoma Ltd. https://doi.org/10.26615/978-954-452-072-4_007
