Long-standing automobile e-commerce websites in China have accumulated huge numbers of auto reviews, and extracting keyphrases from these reviews can help researchers and practitioners identify online users’ typical opinions and understand their underlying motivations. However, no relevant text corpus exists so far. In this paper, the authors propose a semi-unsupervised scheme for constructing a comprehensive auto-keyphrases corpus from reviews collected on Chinese automobile e-commerce websites. The scheme uses PositionRank, which performs very well at keyphrase extraction when labeled data are scarce. The iterative annotation process consists of three rounds of labeling and two rounds of correction. In each of the three rounds of unsupervised labeling, the model extracts the seven most important words as the keyphrases of the whole paragraph. Between labeling rounds, there are manual check, correction, re-check and arbitration stages, in which previous labeling errors are corrected and new vocabulary and rules are summarized to further improve the unsupervised model. For comparison, the paper also runs the experiments with two other unsupervised approaches, TF-IDF and TextRank. The experimental results show that PositionRank is a more efficient and effective method for keyphrase extraction. At the time of writing, the auto-keyphrases corpus contained 110,023 entries, and there is still much room for improvement in corpus volume and labeling quality.
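The unsupervised labeling step can be illustrated with a minimal word-level sketch of position-biased PageRank, the idea behind PositionRank. This is not the authors' implementation. The function name `position_rank`, the window size, the damping factor, and the iteration count are assumptions made for illustration. The input is assumed to be a review that has already been segmented into words, and the sketch omits the part-of-speech filtering and phrase formation that a full PositionRank pipeline would apply.

```python
from collections import defaultdict


def position_rank(tokens, top_k=7, window=3, alpha=0.85, iterations=100):
    """Return the top_k words ranked by a position-biased PageRank.

    Hypothetical helper for illustration; parameter defaults are assumptions.
    """
    if not tokens:
        return []

    # Undirected co-occurrence graph: link distinct words that appear
    # fewer than `window` tokens apart, counting repeated co-occurrences.
    weights = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window, len(tokens))):
            other = tokens[j]
            if other != word:
                weights[word][other] += 1.0
                weights[other][word] += 1.0

    # Position bias: every occurrence at 1-based position p adds 1/p,
    # so words that appear early and often get more teleport mass.
    bias = defaultdict(float)
    for position, word in enumerate(tokens, start=1):
        bias[word] += 1.0 / position
    total = sum(bias.values())
    bias = {w: b / total for w, b in bias.items()}

    vocab = list(bias)
    scores = {w: 1.0 / len(vocab) for w in vocab}
    out_weight = {w: sum(weights[w].values()) for w in vocab}

    # Power iteration of the biased PageRank update.
    for _ in range(iterations):
        new_scores = {}
        for w in vocab:
            incoming = sum(
                scores[v] * weights[v][w] / out_weight[v] for v in weights[w]
            )
            new_scores[w] = (1 - alpha) * bias[w] + alpha * incoming
        scores = new_scores

    return sorted(vocab, key=scores.get, reverse=True)[:top_k]


# English stand-ins for the words of one segmented Chinese review.
review = ["engine", "power", "strong", "fuel", "low", "engine",
          "noise", "small", "space", "large", "engine", "smooth"]
keywords = position_rank(review)
```

Setting `top_k=7` mirrors the seven words per paragraph described above. In the scheme described in the abstract, the vocabulary and rules summarized during manual correction would feed back into the model between rounds, which this sketch does not attempt to show.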
CITATION STYLE
Li, Y., Qian, C., Che, H., Wang, R., Wang, Z., & Zhang, J. (2019). On the Semi-unsupervised Construction of Auto-keyphrases Corpus from Large-Scale Chinese Automobile E-Commerce Reviews. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 11856 LNAI, pp. 452–464). Springer. https://doi.org/10.1007/978-3-030-32381-3_37