Enhancing Chinese Word Segmentation via Pseudo Labels for Practicability


Abstract

Pre-trained language models (e.g., BERT) substantially alleviate two long-standing challenges for Chinese word segmentation (CWS): segmentation ambiguity and out-of-vocabulary (OOV) words. However, these gains are usually reported on traditional benchmark datasets and fall short of an important goal of CWS: practicability, i.e., low complexity as a standalone task and high benefit to downstream tasks. To balance traditional evaluation with practicability, we propose a semi-supervised neural method based on pseudo labels. The method pairs a teacher model with a student model: the teacher distills knowledge from unlabeled data into the student, improving both in-domain and out-of-domain CWS. Experiments show that our method preserves the practicability of the lightweight student model while effectively improving segmentation performance. We also evaluate a range of heterogeneous neural CWS architectures on downstream Chinese NLP tasks. Further experiments demonstrate that our segmenter is a reliable and practical pre-processing step for downstream NLP tasks at minimal cost.
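To illustrate the pseudo-labelling idea described in the abstract, the sketch below shows how a trained teacher segmenter can tag unlabeled raw text with per-character BMES labels, producing a silver training set for a lightweight student model. This is a hypothetical, minimal illustration, not the authors' actual code: `toy_teacher`, `pseudo_label`, and the fixed-vocabulary matching are all stand-ins (a real teacher would be a BERT-based segmenter).

```python
def to_bmes(words):
    """Convert a segmented sentence (list of words) to per-character BMES tags."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")  # single-character word
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags

def pseudo_label(teacher_segment, unlabeled_sentences):
    """Run the teacher on raw sentences and emit (characters, BMES tags)
    pairs; the student is then trained on this silver data."""
    silver = []
    for sent in unlabeled_sentences:
        words = teacher_segment(sent)       # teacher's segmentation
        silver.append((list(sent), to_bmes(words)))
    return silver

def toy_teacher(sent):
    """Toy teacher: greedy match against a tiny fixed vocabulary
    (a stand-in for a pre-trained neural segmenter)."""
    vocab = {"中文", "分词"}
    words, i = [], 0
    while i < len(sent):
        if sent[i:i + 2] in vocab:
            words.append(sent[i:i + 2])
            i += 2
        else:
            words.append(sent[i])
            i += 1
    return words

silver = pseudo_label(toy_teacher, ["中文分词很难"])
```

In the paper's setting, the student learns from such pseudo labels over large unlabeled corpora, which is what transfers the teacher's knowledge to a cheaper model usable as a pre-processing step.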

Citation (APA)

Huang, K., Liu, J., Huang, D., Xiong, D., Liu, Z., & Su, J. (2021). Enhancing Chinese Word Segmentation via Pseudo Labels for Practicability. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021 (pp. 4369–4381). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2021.findings-acl.383
