Exploiting unlabeled text with different unsupervised segmentation criteria for Chinese word segmentation

  • Zhao H
  • Kit C

Abstract

This paper presents a novel approach to improving Chinese word segmentation (CWS) that uses unlabeled data, such as training and test data without annotation, to further enhance the state-of-the-art performance of supervised learning. Lexical information serves as the channel that carries information from unlabeled text into the supervised learning model. Four types of unsupervised segmentation criteria are used to extract word candidates and compute the corresponding word likelihoods. The output of these criteria is then integrated into the supervised learning model as features, strengthening learning for the matching subsequences. The effectiveness of the proposed method is verified on data sets from the latest international CWS evaluation. Our experimental results show that a character-based conditional random fields framework can effectively use such information from unlabeled data to improve performance beyond the best existing results.
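
The pipeline described in the abstract (score word candidates from unlabeled text, then expose those scores as features to a character-based tagger) can be illustrated with a minimal sketch. The abstract does not name the four criteria, so the example assumes accessor variety, a criterion known from the unsupervised segmentation literature, purely as a stand-in. The function names `accessor_variety` and `char_features`, the single boundary markers, and the "best covering candidate" feature scheme are all hypothetical simplifications, not the authors' feature templates.

```python
from collections import defaultdict


def accessor_variety(corpus, max_len=4):
    """Score every substring of up to max_len characters in unlabeled text.

    Assumed criterion: accessor variety, i.e. the smaller of the number of
    distinct left neighbours and distinct right neighbours of a substring.
    Simplification: sentence boundaries count as one marker symbol each.
    """
    left = defaultdict(set)
    right = defaultdict(set)
    for sent in corpus:
        n = len(sent)
        for i in range(n):
            for j in range(i + 1, min(i + max_len, n) + 1):
                s = sent[i:j]
                left[s].add(sent[i - 1] if i > 0 else "<s>")
                right[s].add(sent[j] if j < n else "</s>")
    return {s: min(len(left[s]), len(right[s])) for s in left}


def char_features(sent, scores, max_len=4):
    """Give each character the best score of any multi-character candidate
    covering it. A character-based tagger such as a CRF could take this
    value as one extra feature per position (hypothetical scheme)."""
    n = len(sent)
    feats = [0] * n
    for i in range(n):
        for j in range(i + 2, min(i + max_len, n) + 1):
            score = scores.get(sent[i:j], 0)
            for k in range(i, j):
                feats[k] = max(feats[k], score)
    return feats


# Toy unlabeled corpus: "bc" occurs in three different contexts.
corpus = ["abcd", "xbcy", "zbcw"]
scores = accessor_variety(corpus)
print(scores["bc"])                   # 3
print(char_features("abcd", scores))  # [1, 3, 3, 1]
```

In the toy corpus, the substring `bc` appears with three distinct neighbours on each side, so it scores higher than any other candidate, and the two characters it covers receive the strongest feature value. In a real setting the scores would typically be discretized before being passed to the supervised model.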

Cite

Zhao, H., & Kit, C. (2008). Exploiting unlabeled text with different unsupervised segmentation criteria for Chinese word segmentation. Research in Computing Science, 33, 93–104.
