Text classification model enhanced by unlabeled data for LaTeX formula


Abstract

Generic language models pretrained on large, unspecific corpora are currently the foundation of NLP. Labeled data are scarce for most model training because of the cost of manual annotation, especially in domains rich in proper nouns and specialized terminology, such as mathematics and biology, which limits the accuracy and robustness of model predictions. However, directly applying a generic language model to a specific domain does not work well. This paper introduces a BERT-based text classification model enhanced by unlabeled data (UL-BERT) for the LaTeX formula domain. A two-stage pretraining model based on BERT (TP-BERT) is pretrained on unlabeled data from the LaTeX formula domain. A double-prediction pseudo-labeling (DPP) method is introduced to obtain high-confidence pseudo-labels for unlabeled data through self-training. Moreover, a multi-round teacher-student training approach is proposed to train UL-BERT with few labeled data and a larger amount of pseudo-labeled unlabeled data. Experiments on classification in the LaTeX formula domain show that UL-BERT significantly improves classification accuracy, raising the F1 score by up to 2.76%, while requiring fewer resources for model training. We conclude that the method may be applicable to other specific domains with abundant unlabeled data and limited labeled data.
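As a rough illustration of the self-training idea summarized above, the following Python sketch (using the Hugging Face transformers library) shows confidence-thresholded pseudo-labeling of unlabeled LaTeX formulas by a teacher model. The checkpoint name, label count, and confidence threshold are assumptions made for illustration only; they are not taken from the paper, and the paper's specific double-prediction (DPP) criterion is not reproduced here.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed base checkpoint, label count, and confidence cutoff (not from the paper).
MODEL_NAME = "bert-base-uncased"
NUM_LABELS = 5
CONF_THRESHOLD = 0.95

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
teacher = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=NUM_LABELS)
teacher.eval()

def pseudo_label(texts, model, threshold=CONF_THRESHOLD):
    """Return (text, label) pairs whose predicted probability exceeds the threshold."""
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        probs = torch.softmax(model(**enc).logits, dim=-1)
    confidence, labels = probs.max(dim=-1)
    return [(t, int(l)) for t, l, c in zip(texts, labels, confidence) if c >= threshold]

# In a multi-round teacher-student loop, the accepted pseudo-labeled pairs would be
# merged with the small labeled set to fine-tune a student model, which then serves
# as the teacher for the next round.
unlabeled = [r"\int_0^1 x^2 \, dx", r"A = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}"]
pseudo_labeled = pseudo_label(unlabeled, teacher)

This is only a minimal sketch of generic confidence-based pseudo-labeling; the paper's TP-BERT pretraining and DPP selection would replace the plain base model and the single-pass threshold used here.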

Citation (APA)

Cheng, H., Yu, R., Tang, Y., Fang, Y., & Cheng, T. (2021). Text classification model enhanced by unlabeled data for LaTeX formula. Applied Sciences (Switzerland), 11(22). https://doi.org/10.3390/app112210536
