Despite the huge amount of digitized information now available, the main obstacle to using machine learning in text mining is the lack of labeled data, mainly because labeling requires considerable user effort that is not always feasible. In this paper, with the aim of reducing the workload required to manually process the contents of large volumes of documents, we present a methodology based on probabilistic inference and active learning to label documents in Spanish using a semi-supervised approach. First, a vector representation of the documents is generated; then, an interactive learning process that applies both automatic and manual labeling is proposed. To evaluate the accuracy of the predictions and the efficiency of the methodology, different configurations of the automatic and manual labeling processes have been studied. The proposed methodology reduces the need for a large corpus of manually labeled texts by introducing a self-labeling process during training. We show that both labeling approaches can be combined while maintaining accuracy and reducing user intervention.
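The interplay between automatic and manual labeling described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the document vectors have already been computed, and it substitutes a simple nearest-centroid model with softmax probabilities for the paper's probabilistic inference. The names `label_corpus`, `class_probabilities`, `oracle`, and `threshold` are hypothetical and chosen only for illustration.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def class_probabilities(x, centroids):
    """Softmax over negative squared distances from x to each class centroid.

    This is an assumed stand-in for the paper's probabilistic model.
    """
    scores = {
        label: -sum((a - b) ** 2 for a, b in zip(x, mu))
        for label, mu in centroids.items()
    }
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


def label_corpus(vectors, seed_labels, oracle, threshold=0.9):
    """Label every document, asking the oracle only when the model is unsure.

    vectors:     list of document vectors (assumed precomputed)
    seed_labels: dict mapping document index -> label, at least one per class
    oracle:      callable(index) -> label, standing in for the human annotator
    threshold:   minimum probability required to accept an automatic label
    Returns (labels, number_of_manual_queries).
    """
    labels = dict(seed_labels)
    manual_queries = 0
    while len(labels) < len(vectors):
        # Refit one centroid per class from everything labeled so far.
        by_class = {}
        for index, label in labels.items():
            by_class.setdefault(label, []).append(vectors[index])
        centroids = {label: centroid(vs) for label, vs in by_class.items()}

        # Predict a label and a confidence for each unlabeled document.
        predictions = {}
        for index, x in enumerate(vectors):
            if index not in labels:
                probs = class_probabilities(x, centroids)
                best = max(probs, key=probs.get)
                predictions[index] = (best, probs[best])

        confident = {
            i: label for i, (label, p) in predictions.items() if p >= threshold
        }
        if confident:
            # Self-labeling: accept all confident predictions at once.
            labels.update(confident)
        else:
            # Active learning: query the least confident document manually.
            query = min(predictions, key=lambda i: predictions[i][1])
            labels[query] = oracle(query)
            manual_queries += 1
    return labels, manual_queries
```

In this sketch, raising `threshold` shifts work from self-labeling toward manual queries, which is one way to explore the trade-off between accuracy and user intervention that the abstract refers to.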
CITATION STYLE
Nimo-Járquez, D., Narvaez-Rios, M., Rivas, M., Yáñez, A., Bárcena-González, G., Guerrero-Lebrero, M. P., … Galindo, P. L. (2019). AL4LA: Active Learning for Text Labeling Based on Paragraph Vectors. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 11506 LNCS, pp. 679–687). Springer Verlag. https://doi.org/10.1007/978-3-030-20521-8_56