Exploiting heterogeneous annotations for weibo word segmentation and POS tagging

Jiayuan Chao; Zhenghua Li; Wenliang Chen; Min Zhang

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2015) 9362 495-506

DOI: 10.1007/978-3-319-25207-0_46

Abstract

This paper describes our system for the NLPCC 2015 shared task on Chinese word segmentation (WS) and POS tagging for Weibo text. We treat WS and POS tagging as two separate tasks and use a cascaded approach. Our major focus is how to effectively exploit multiple heterogeneous datasets to boost the performance of statistical models. This work considers three heterogeneous datasets: Weibo (WB, 10K sentences), Penn Chinese Treebank 7.0 (CTB7, 50K), and People's Daily (PD, 280K). For WS, we adopt the recently proposed coupled sequence labeling to combine WB, CTB7, and PD, boosting the F1 score from 93.76% (baseline model trained only on WB) to 95.58% (+1.82%). For POS tagging, since the three datasets follow three different annotation standards, we adopt an ensemble approach that combines coupled sequence labeling with the guide-feature based method. First, we convert PD into the annotation style of CTB7 using coupled sequence labeling; the converted data is denoted PDCTB. Then, we merge CTB7 and PDCTB to train a POS tagger, denoted TagCTB7+PDCTB, which is in turn used to produce guide features on WB. As a result, the tagging F1 score improves from 87.93% to 88.99% (+1.06%).
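The guide-feature step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that a source tagger (standing in for TagCTB7+PDCTB) has already assigned a CTB7-style tag to every word of a WB sentence, and that the target tagger consumes string-valued features. The function name `guide_features`, the feature templates, and the example tags are hypothetical and chosen only for illustration; the abstract does not specify the authors' actual feature set.

```python
from typing import List


def guide_features(words: List[str], guide_tags: List[str], i: int) -> List[str]:
    """Return feature strings for position i of a segmented sentence.

    guide_tags holds one tag per word, as predicted by a source tagger
    trained on differently annotated data (an assumption of this sketch).
    """
    if len(words) != len(guide_tags):
        raise ValueError("words and guide_tags must have the same length")

    word = words[i]
    prev_word = words[i - 1] if i > 0 else "<BOS>"
    next_word = words[i + 1] if i + 1 < len(words) else "<EOS>"
    guide = guide_tags[i]

    return [
        # Ordinary lexical features a baseline tagger might use.
        f"w0={word}",
        f"w-1={prev_word}",
        f"w+1={next_word}",
        # Guide features: the source tagger's prediction, alone and
        # conjoined with the current word.
        f"g0={guide}",
        f"w0|g0={word}|{guide}",
    ]


# Hypothetical segmented sentence with hypothetical CTB7-style guide tags.
words = ["我", "爱", "微博"]
guide_tags = ["PN", "VV", "NN"]
features = [guide_features(words, guide_tags, i) for i in range(len(words))]
```

In this sketch the target model never has to share a tag set with the source tagger: the source prediction enters only as an extra feature, which is what lets data with a different annotation standard inform the WB tagger.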

Citation (APA)

Chao, J., Li, Z., Chen, W., & Zhang, M. (2015). Exploiting heterogeneous annotations for weibo word segmentation and POS tagging. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 9362, pp. 495–506). Springer Verlag. https://doi.org/10.1007/978-3-319-25207-0_46
