Multi-grained Chinese Word Segmentation with Weakly Labeled Data

Chen Gong; Zhenghua Li; Bowei Zou; Min Zhang

Conference Proceedings, Open Access

COLING 2020 - 28th International Conference on Computational Linguistics, Proceedings of the Conference (2020) 2026-2036

DOI: 10.18653/v1/2020.coling-main.183

Abstract

In contrast with traditional single-grained word segmentation (SWS), where a sentence corresponds to a single word sequence, multi-grained Chinese word segmentation (MWS) aims to segment a sentence into multiple word sequences so that all words of different granularities are preserved. Due to the lack of manually annotated MWS data, previous work trains and tunes MWS models only on automatically generated pseudo MWS data. In this work, we further take advantage of the rich word boundary information in existing SWS data and in naturally annotated data from dictionary example (DictEx) sentences to advance the state-of-the-art MWS model, based on the idea of weak supervision. In particular, we propose to accommodate two types of weakly labeled data for MWS, i.e., SWS data and DictEx data, by employing a simple yet competitive graph-based parser with local loss. In addition, for better evaluation, we manually annotate a high-quality MWS dataset according to our newly compiled annotation guideline; it consists of over 9,000 sentences from two types of text, i.e., canonical newswire (NEWS) and non-canonical web (BAIKE) data. Detailed evaluation shows that our proposed model with weakly labeled data significantly outperforms the state-of-the-art MWS model by 1.12 and 5.97 in F1 on NEWS and BAIKE data, respectively.
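To make the contrast between SWS and MWS concrete, the following minimal Python sketch collects the words from several single-grained segmentations of the same sentence into one set of character spans. It only illustrates the idea of preserving words of every granularity; it is not the authors' parser or data pipeline. The function names (`words_to_spans`, `merge_granularities`) and the example phrase are assumptions introduced here for illustration.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def words_to_spans(words: List[str]) -> List[Span]:
    """Convert one word sequence into character spans over the sentence."""
    spans = []
    start = 0
    for word in words:
        spans.append((start, start + len(word)))
        start += len(word)
    return spans


def merge_granularities(segmentations: List[List[str]]) -> List[Tuple[int, int, str]]:
    """Hypothetical helper: union the words of several segmentations.

    Every segmentation must cover the same sentence. The result lists each
    distinct word once, with coarser words before the finer words they contain.
    """
    sentences = {"".join(seg) for seg in segmentations}
    if len(sentences) != 1:
        raise ValueError("all segmentations must cover the same sentence")
    sentence = sentences.pop()

    spans = set()
    for seg in segmentations:
        spans.update(words_to_spans(seg))

    # Sort by start offset, then by decreasing length (coarse before fine).
    ordered = sorted(spans, key=lambda s: (s[0], -s[1]))
    return [(s, e, sentence[s:e]) for s, e in ordered]


# Illustrative input (an assumption, not taken from the paper's data):
coarse = ["全国各地"]      # one coarse-grained word
fine = ["全国", "各地"]    # two fine-grained words
result = merge_granularities([coarse, fine])
for start, end, word in result:
    print(start, end, word)
```

A single-grained segmenter would have to pick either the coarse or the fine sequence; the merged span set keeps both.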

Cite

Gong, C., Li, Z., Zou, B., & Zhang, M. (2020). Multi-grained Chinese Word Segmentation with Weakly Labeled Data. In COLING 2020 - 28th International Conference on Computational Linguistics, Proceedings of the Conference (pp. 2026–2036). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2020.coling-main.183
