Pre-trained Data Augmentation for Text Classification

Abstract

Data augmentation is a widely adopted method for improving model performance in image classification tasks. Although it is still not as ubiquitous in the Natural Language Processing (NLP) community, some methods have already been proposed to increase the amount of training data using simple text transformations or text generation through language models. However, recent text classification tasks must deal with domains characterized by small amounts of text and informal writing, e.g., Online Social Networks content, which reduces the capabilities of current methods. To address these challenges, we propose the PRE-trained Data AugmenTOR (PREDATOR) method, which takes advantage of pre-trained language models, low computational resource consumption, and model compression. Our data augmentation method is composed of two modules: the Generator, which synthesizes new samples using a lightweight model, and the Filter, which selects only the high-quality ones. Experiments comparing Bidirectional Encoder Representations from Transformers (BERT), Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM) and Multinomial Naive Bayes (NB) on three datasets showed an effective improvement in accuracy. We obtained a 28.5% accuracy improvement with LSTM in the best scenario and an average improvement of 8% across all scenarios. PREDATOR was able to augment real-world social media datasets as well as datasets from other domains, outperforming recent text augmentation techniques.
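
The two-module structure described in the abstract (generate candidates, then keep only those that pass a quality check) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify how the Generator or the Filter work internally, so `toy_generate` (word deletion) and `toy_score` (keyword matching) are hypothetical stand-ins, and the names `augment`, `threshold`, and `KEYWORDS` are assumptions made for illustration only. In the actual method the Generator would be a lightweight pre-trained language model.

```python
from typing import Callable, Dict, List, Set, Tuple

Sample = Tuple[str, str]  # (text, label)


def augment(
    data: List[Sample],
    generate: Callable[[str, str], List[str]],
    score: Callable[[str, str], float],
    threshold: float = 0.5,
) -> List[Sample]:
    """Return the original samples plus synthetic ones that pass the filter."""
    kept: List[Sample] = list(data)
    for text, label in data:
        # Generator step: propose candidate texts for this labeled sample.
        for candidate in generate(text, label):
            # Filter step: keep a candidate only if its quality score is high enough.
            if score(candidate, label) >= threshold:
                kept.append((candidate, label))
    return kept


# Hypothetical stand-in for the Generator: delete one word at a time.
def toy_generate(text: str, label: str) -> List[str]:
    words = text.split()
    if len(words) < 2:
        return []
    return [" ".join(words[:i] + words[i + 1:]) for i in range(len(words))]


# Hypothetical stand-in for the Filter: a candidate is "high quality"
# if it still contains a keyword associated with its label.
KEYWORDS: Dict[str, Set[str]] = {"pos": {"great", "love"}, "neg": {"awful", "hate"}}


def toy_score(candidate: str, label: str) -> float:
    tokens = set(candidate.lower().split())
    return 1.0 if tokens & KEYWORDS[label] else 0.0


data = [("love this phone", "pos")]
augmented = augment(data, toy_generate, toy_score)
for text, label in augmented:
    print(label, "|", text)
```

Passing the generator and the scorer as functions keeps the pipeline independent of any particular model, so either stand-in could be swapped for a real language model or classifier without changing `augment`.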

Cite

Queiroz Abonizio, H., & Barbon Junior, S. (2020). Pre-trained Data Augmentation for Text Classification. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 12319 LNAI, pp. 551–565). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-030-61377-8_38
