TopWORDS-Seg: Simultaneous Text Segmentation and Word Discovery for Open-Domain Chinese Texts via Bayesian Inference

Changzai Pan; Maosong Sun; Ke Deng

Conference Proceedings, Open Access

Proceedings of the Annual Meeting of the Association for Computational Linguistics (2022), Vol. 1, pp. 158-169

DOI: 10.18653/v1/2022.acl-long.13

Abstract

Processing open-domain Chinese texts has been a critical bottleneck in computational linguistics for decades, partly because text segmentation and word discovery are often entangled with each other in this challenging scenario. No existing method can yet achieve effective text segmentation and word discovery simultaneously in the open-domain setting. This study fills this gap by proposing a novel method based on Bayesian inference, called TopWORDS-Seg, which offers robust performance and transparent interpretation when neither a training corpus nor a domain vocabulary is available. The advantages of TopWORDS-Seg are demonstrated by a series of experimental studies.

Citation (APA)

Pan, C., Sun, M., & Deng, K. (2022). TopWORDS-Seg: Simultaneous Text Segmentation and Word Discovery for Open-Domain Chinese Texts via Bayesian Inference. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (Vol. 1, pp. 158–169). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2022.acl-long.13
