Recently, statistical topic modeling approaches have been widely applied to supervised document classification. However, there has been little research on these approaches under label noise, which is widespread in real-world applications. For example, many large-scale datasets are collected from websites or annotated by human workers of varying quality, and therefore contain mislabeled items. In this paper, we propose two robust topic models for document classification: Smoothed Labeled LDA (SL-LDA) and Adaptive Labeled LDA (AL-LDA). SL-LDA is an extension of Labeled LDA (L-LDA), a classical supervised topic model; it overcomes L-LDA's tendency to overfit noisy labels through Dirichlet smoothing. AL-LDA is an iterative optimization framework built on SL-LDA. At each iteration, we update the Dirichlet prior, which incorporates the observed labels, with a concise algorithm based on the maximum-entropy and minimum-cross-entropy principles. This method avoids explicitly identifying noisy labels, a common difficulty in label-noise cleaning algorithms. Quantitative experiments in noisy completely at random (NCAR) and Multiple Noisy Sources (MNS) settings demonstrate that our models perform well under noisy labels. In particular, the proposed AL-LDA has significant advantages over state-of-the-art topic modeling approaches under massive label noise.
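To illustrate the smoothing idea described above, the following is a minimal sketch (not the authors' code) contrasting the label-restricted document-topic Dirichlet prior of L-LDA with a smoothed variant in the spirit of SL-LDA. The function names and the parameters alpha and eps are illustrative assumptions, not values from the paper.

```python
import numpy as np

def llda_prior(labels, num_topics, alpha=1.0):
    """L-LDA restricts a document's topic prior to its observed labels:
    topics outside the label set get zero prior mass, so a noisy label
    is fitted as if it were correct."""
    prior = np.zeros(num_topics)
    prior[list(labels)] = alpha
    return prior

def smoothed_llda_prior(labels, num_topics, alpha=1.0, eps=0.1):
    """SL-LDA-style Dirichlet smoothing (as described in the abstract):
    every topic keeps a small amount of prior mass, so a document is not
    forced onto a possibly mislabeled topic."""
    prior = np.full(num_topics, eps)
    prior[list(labels)] += alpha
    return prior

# Example: a document observed with label {2} out of 5 topics.
print(llda_prior({2}, 5))           # [0. 0. 1. 0. 0.]
print(smoothed_llda_prior({2}, 5))  # [0.1 0.1 1.1 0.1 0.1]
```

Under this kind of smoothing, AL-LDA can then be viewed as repeatedly adjusting the prior mass across topics between inference passes, rather than deleting or relabeling suspect documents outright.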
CITATION STYLE
Wang, W., Guo, B., Shen, Y., Yang, H., Chen, Y., & Suo, X. (2021). Robust supervised topic models under label noise. Machine Learning, 110(5), 907–931. https://doi.org/10.1007/s10994-021-05967-y