Exploiting heterogeneous annotations for weibo word segmentation and POS tagging

Abstract

This paper describes our system designed for the NLPCC 2015 shared task on Chinese word segmentation (WS) and POS tagging for Weibo text. We treat WS and POS tagging as two separate tasks and use a cascaded approach. Our major focus is how to effectively exploit multiple heterogeneous datasets to boost the performance of statistical models. This work considers three heterogeneous datasets: Weibo (WB, 10K sentences), Penn Chinese Treebank 7.0 (CTB7, 50K sentences), and People's Daily (PD, 280K sentences). For WS, we adopt the recently proposed coupled sequence labeling to combine WB, CTB7, and PD, boosting the F1 score from 93.76% (baseline model trained on WB only) to 95.58% (+1.82%). For POS tagging, we adopt an ensemble approach that combines coupled sequence labeling with the guide-feature based method, since the three datasets follow three different annotation standards. First, we convert PD into the annotation style of CTB7 using coupled sequence labeling; the converted data is denoted by PDCTB. Then, we merge CTB7 and PDCTB to train a POS tagger, denoted by TagCTB7+PDCTB, which is in turn used to produce guide features on WB. Finally, the tagging F1 score is improved from 87.93% to 88.99% (+1.06%).
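The guide-feature idea mentioned in the abstract can be illustrated with a minimal sketch: a tagger trained on source-style data labels each target sentence, and its predictions are added as extra features for the target tagger. Everything below is an assumption made for illustration, not the authors' implementation: the function names `toy_source_tagger` and `guide_features`, the tiny lexicon, and the particular feature templates are all hypothetical.

```python
from typing import Callable, Dict, List


def toy_source_tagger(words: List[str]) -> List[str]:
    """Hypothetical stand-in for a tagger trained on CTB7-style data.

    A real system would use a trained statistical model; this lookup
    table exists only so the sketch is runnable.
    """
    lexicon = {"我": "PN", "爱": "VV", "北京": "NR"}
    return [lexicon.get(word, "NN") for word in words]


def guide_features(
    words: List[str],
    source_tagger: Callable[[List[str]], List[str]],
) -> List[Dict[str, str]]:
    """Build per-token features that include the source tagger's predictions.

    The feature templates here are illustrative assumptions.
    """
    guide_tags = source_tagger(words)
    features = []
    for i, (word, guide) in enumerate(zip(words, guide_tags)):
        prev_guide = guide_tags[i - 1] if i > 0 else "<BOS>"
        features.append(
            {
                "word": word,
                "guide": guide,  # tag predicted by the source-style tagger
                "word+guide": f"{word}|{guide}",
                "prev_guide+guide": f"{prev_guide}|{guide}",
            }
        )
    return features


feats = guide_features(["我", "爱", "北京", "烤鸭"], toy_source_tagger)
for token_feats in feats:
    print(token_feats)
```

The target tagger would then be trained on the target data with these feature dictionaries alongside its usual features, so that it can learn how far to trust the source-style predictions.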

Cite

Chao, J., Li, Z., Chen, W., & Zhang, M. (2015). Exploiting heterogeneous annotations for weibo word segmentation and POS tagging. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 9362, pp. 495–506). Springer Verlag. https://doi.org/10.1007/978-3-319-25207-0_46
