Neural networks incorporating unlabeled and partially-labeled data for cross-domain Chinese word segmentation

Abstract

Most existing Chinese word segmentation (CWS) methods are supervised and therefore need large-scale annotated domain-specific datasets for training. In this paper, we address the problem of CWS for resource-poor domains that lack annotated data. We propose a novel neural network model that incorporates unlabeled and partially-labeled data. To make use of unlabeled data, we combine a bidirectional LSTM segmentation model with two character-level language models using a gate mechanism; these language models capture co-occurrence information. To make use of partially-labeled data, we modify the original cross-entropy loss function of the RNN. Experimental results demonstrate that the method performs well on CWS tasks in a series of domains.
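
The two ideas in the abstract can be illustrated with a minimal sketch. It is not the paper's formulation, which the abstract does not spell out. It assumes a four-tag scheme (B, M, E, S), a per-character softmax, a loss that sums probability over every tag still permitted at a position, and an element-wise sigmoid gate that mixes a segmenter vector with a language-model vector. The names `partial_label_loss`, `gated_mix`, and `allowed` are hypothetical.

```python
import math

TAGS = ("B", "M", "E", "S")  # assumed tag scheme, not stated in the abstract


def softmax(scores):
    """Numerically stable softmax over one character's tag scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def partial_label_loss(logits, allowed):
    """Cross-entropy generalized to partially-labeled characters (assumed form).

    logits:  one list of tag scores per character.
    allowed: one set of permitted tag indices per character. A fully
             labeled character has a single index; an unlabeled one
             has every index, so it contributes zero loss.
    """
    loss = 0.0
    for scores, tags in zip(logits, allowed):
        probs = softmax(scores)
        # Marginalize over every tag consistent with the partial annotation.
        loss -= math.log(sum(probs[t] for t in tags))
    return loss / len(logits)


def gated_mix(h_seg, h_lm, gate_logits):
    """Element-wise gate (assumed form): g * h_seg + (1 - g) * h_lm."""
    mixed = []
    for seg, lm, z in zip(h_seg, h_lm, gate_logits):
        g = 1.0 / (1.0 + math.exp(-z))  # sigmoid gate value in (0, 1)
        mixed.append(g * seg + (1.0 - g) * lm)
    return mixed


if __name__ == "__main__":
    logits = [[2.0, 0.1, 0.1, 0.5], [0.0, 0.0, 0.0, 0.0]]
    # First character is known to start a word (B or S); second is unlabeled.
    allowed = [{0, 3}, {0, 1, 2, 3}]
    print(partial_label_loss(logits, allowed))
    print(gated_mix([1.0, -1.0], [3.0, 1.0], [0.0, 0.0]))
```

With a single permitted tag per character, this loss reduces to ordinary cross-entropy, so fully labeled and partially-labeled sentences can share one training objective under these assumptions.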

Cite

Zhao, L., Zhang, Q., Wang, P., & Liu, X. (2018). Neural networks incorporating unlabeled and partially-labeled data for cross-domain Chinese word segmentation. In IJCAI International Joint Conference on Artificial Intelligence (Vol. 2018-July, pp. 4602–4608). International Joint Conferences on Artificial Intelligence. https://doi.org/10.24963/ijcai.2018/640
