Simple Baseline Machine Learning Text Classifiers for Small Datasets

10Citations
Citations of this article
35Readers
Mendeley users who have this article in their library.

This article is free to access.

Abstract

Text classification is important to better understand online media. A major problem for creating accurate text classifiers using machine learning is small training sets due to the cost of annotating them. On this basis, we investigated how SVM and NBSVM text classifiers should be designed to achieve high accuracy and how the training sets should be sized to efficiently use annotation labor. We used a four-way repeated-measures full-factorial design of 32 design factor combinations. For each design factor combination 22 training set sizes were examined. These training sets were subsets of seven public text datasets. We study the statistical variance of accuracy estimates by randomly drawing new training sets, resulting in accuracy estimates for 98,560 different experimental runs. Our major contribution is a set of empirically evaluated guidelines for creating online media text classifiers using small training sets. We recommend uni- and bi-gram features as text representation, btc term weighting and a linear-kernel NBSVM. Our results suggest that high classification accuracy can be achieved using a manually annotated dataset of only 300 examples.

Cite

CITATION STYLE

APA

Riekert, M., Riekert, M., & Klein, A. (2021). Simple Baseline Machine Learning Text Classifiers for Small Datasets. SN Computer Science, 2(3). https://doi.org/10.1007/s42979-021-00480-4

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free