Exploiting unlabeled text with different unsupervised segmentation criteria for Chinese word segmentation

H. Zhao; C. Kit

Journal Article

Research in Computing Science (2008) 33 93–104

Abstract

This paper presents a novel approach to improving Chinese word segmentation (CWS) that uses unlabeled data, such as training and test data without annotation, to further enhance the state-of-the-art performance of supervised learning. Lexical information serves as the channel that carries information from unlabeled text into the supervised learning model. Four types of unsupervised segmentation criteria are used for word candidate extraction and the corresponding word likelihood computation. The output of these criteria is integrated, as features, into the supervised learning model to strengthen learning for the matching subsequences. The effectiveness of the proposed method is verified on data sets from the latest international CWS evaluation. The experimental results show that a character-based conditional random fields framework can make effective use of such information from unlabeled data, improving on the best existing results.
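The pipeline the abstract describes (extract word candidates from unlabeled text, score them, expose the scores as per-character features) can be illustrated with a minimal sketch. The abstract does not name the four criteria, so the code below substitutes plain n-gram frequency as a stand-in score; the function names `ngram_counts`, `score`, and `char_features`, the `max_len` parameter, and the log2 binning are all assumptions made for illustration, not the authors' method.

```python
from collections import Counter
import math


def ngram_counts(corpus, max_len=4):
    """Count every character n-gram of length 2..max_len in unlabeled sentences."""
    counts = Counter()
    for sent in corpus:
        for n in range(2, max_len + 1):
            for i in range(len(sent) - n + 1):
                counts[sent[i:i + n]] += 1
    return counts


def score(count):
    """Stand-in word-likelihood score (assumption): floor of log2 frequency."""
    return int(math.log2(count)) if count > 0 else 0


def char_features(sent, counts, max_len=4):
    """For each character, the best score of any candidate n-gram covering it."""
    best = [0] * len(sent)
    for n in range(2, max_len + 1):
        for i in range(len(sent) - n + 1):
            s = score(counts.get(sent[i:i + n], 0))
            for j in range(i, i + n):
                best[j] = max(best[j], s)
    return best


# Unlabeled text: no segmentation annotation is needed to build the counts.
unlabeled = ["我们喜欢北京", "北京很大", "我去北京"]
counts = ngram_counts(unlabeled)
features = char_features("北京好", counts)
```

In a character-based tagger, each value in `features` would be attached to its character as one extra feature alongside the usual character n-gram features. How the paper actually encodes its criteria for the conditional random fields model is not stated in the abstract.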

Cite

Zhao, H., & Kit, C. (2008). Exploiting unlabeled text with different unsupervised segmentation criteria for Chinese word segmentation. Research in Computing Science, 33, 93–104.
