Towards a Better Understanding of Noise in Natural Language Processing

Khetam Al Sharou; Zhenhao Li; Lucia Specia

Conference Proceedings, Open Access

International Conference Recent Advances in Natural Language Processing, RANLP (2021) 53-62

DOI: 10.26615/978-954-452-072-4_007

47 Citations

104 Readers

Abstract

In this paper, we propose a definition and taxonomy of various types of non-standard textual content - generally referred to as “noise” - in Natural Language Processing (NLP). While data pre-processing is undoubtedly important in NLP, especially when dealing with user-generated content, a broader understanding of the different sources of noise and how to deal with them has been largely neglected. We provide a comprehensive list of potential sources of noise, categorise and describe them, and show the impact of a subset of standard pre-processing strategies on different tasks. Our main goal is to raise awareness of non-standard content - which should not always be considered “noise” - and of the need for careful, task-dependent pre-processing. This is an alternative to the blanket, all-encompassing solutions generally applied by researchers through “standard” pre-processing pipelines. The intention is for this categorisation to serve as a point of reference to support NLP researchers in devising strategies to clean, normalise or embrace non-standard content.

Citation (APA)

Al Sharou, K., Li, Z., & Specia, L. (2021). Towards a Better Understanding of Noise in Natural Language Processing. In International Conference Recent Advances in Natural Language Processing, RANLP (pp. 53–62). Incoma Ltd. https://doi.org/10.26615/978-954-452-072-4_007
