Simple Baseline Machine Learning Text Classifiers for Small Datasets

Martin Riekert; Matthias Riekert; Achim Klein

Journal Article, Open Access

SN Computer Science (2021) 2(3)

DOI: 10.1007/s42979-021-00480-4

Abstract

Text classification is important for better understanding online media. A major obstacle to building accurate machine learning text classifiers is that training sets are small, because annotating them is costly. We therefore investigated how SVM and NBSVM text classifiers should be designed to achieve high accuracy, and how training sets should be sized to use annotation labor efficiently. We used a four-way repeated-measures full-factorial design with 32 design factor combinations. For each design factor combination, 22 training set sizes were examined. These training sets were subsets of seven public text datasets. We studied the statistical variance of the accuracy estimates by randomly drawing new training sets, resulting in accuracy estimates for 98,560 different experimental runs. Our major contribution is a set of empirically evaluated guidelines for creating online media text classifiers from small training sets. We recommend uni- and bi-gram features as the text representation, btc term weighting, and a linear-kernel NBSVM. Our results suggest that high classification accuracy can be achieved with a manually annotated dataset of only 300 examples.
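To make the recommended representation more concrete, the following is a minimal sketch of the feature step commonly associated with NBSVM: uni- and bi-gram presence features scaled by a naive Bayes log-count ratio. It is not the authors' implementation. The function names, the smoothing value `alpha=1.0`, the whitespace tokenizer, and the four toy documents are assumptions made for illustration. Plain binary presence features stand in for the btc term weighting named in the abstract, which is not reproduced here, and the final linear-kernel SVM that would be trained on these features is omitted.

```python
import math
from collections import Counter


def ngrams(text):
    """Return the set of uni- and bi-grams present in a whitespace-tokenized text."""
    tokens = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return set(tokens) | set(bigrams)


def log_count_ratio(docs, labels, alpha=1.0):
    """Compute a smoothed log-count ratio per n-gram (labels: 1 = positive, 0 = negative).

    alpha is an assumed smoothing constant, not a value taken from the paper.
    """
    pos, neg = Counter(), Counter()
    for doc, label in zip(docs, labels):
        # Binary presence: each n-gram counts at most once per document.
        (pos if label == 1 else neg).update(ngrams(doc))
    vocab = sorted(set(pos) | set(neg))
    p_norm = sum(pos[f] + alpha for f in vocab)
    q_norm = sum(neg[f] + alpha for f in vocab)
    return {
        f: math.log(((pos[f] + alpha) / p_norm) / ((neg[f] + alpha) / q_norm))
        for f in vocab
    }


def nb_features(text, ratios):
    """Scale the presence indicator of each known n-gram by its log-count ratio."""
    return {f: ratios[f] for f in ngrams(text) if f in ratios}


# Toy data invented for this sketch; it does not come from the paper's datasets.
docs = ["great movie", "great fun", "bad movie", "bad plot"]
labels = [1, 1, 0, 0]

ratios = log_count_ratio(docs, labels)
features = nb_features("great plot", ratios)
```

In this sketch, n-grams seen mostly in positive documents receive a positive ratio, n-grams seen mostly in negative documents receive a negative one, and n-grams that were never seen in training (here the bi-gram "great plot") are dropped. A linear classifier trained on such scaled features would complete the NBSVM-style pipeline.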

Citation (APA)

Riekert, M., Riekert, M., & Klein, A. (2021). Simple Baseline Machine Learning Text Classifiers for Small Datasets. SN Computer Science, 2(3). https://doi.org/10.1007/s42979-021-00480-4
