A hybrid approach of text segmentation based on sensitive word concept for NLP


Abstract

Natural language processing tasks such as text checking and correction, machine translation, and information retrieval usually start from words. Identifying words in Indo-European languages is trivial. For various Asian languages such as Chinese, however, this problem, called text segmentation, has been and remains a bottleneck. Two main groups of approaches to Chinese segmentation exist: dictionary-based approaches and statistical approaches. Both have difficulty dealing with certain Chinese texts. To address these difficulties, we propose a hybrid approach to Chinese text segmentation based on the concept of sensitive words. Sensitive words are compound words whose syntactic category differs from those of their components. Depending on how a text is segmented, a sensitive word may play different roles, leading to significantly different syntactic structures. In this paper, we first explain the concept of sensitive words and their efficacy in text segmentation, then describe the hybrid approach, which combines a rule-based method and a probability-based method using the sensitive word concept. Our experimental results show that the proposed approach addresses the text segmentation problems effectively.
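To make the ambiguity concrete, the classic dictionary-based baseline mentioned in the abstract is forward maximum matching: at each position, greedily take the longest dictionary word. The sketch below is illustrative only, with a hypothetical toy dictionary, and is not the paper's actual algorithm; it shows how the presence or absence of a compound word in the dictionary changes the segmentation of the same string.

```python
# Illustrative sketch of forward maximum matching, a standard
# dictionary-based Chinese segmenter. The toy dictionary and the
# example sentence are hypothetical, not taken from the paper.
def forward_max_match(text, dictionary, max_len=5):
    """Greedily match the longest dictionary word at each position;
    unmatched characters fall back to single-character tokens."""
    tokens = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
    return tokens

# 发展中国家 ("developing country") is a compound; its components
# 发展 ("develop") and 中国 ("China") are also words. If the compound
# is in the dictionary, it is kept whole; if not, the greedy matcher
# splits the same string differently.
with_compound = {"发展", "中国", "国家", "发展中国家"}
without_compound = {"发展", "中国", "国家"}
print(forward_max_match("发展中国家", with_compound))     # ['发展中国家']
print(forward_max_match("发展中国家", without_compound))  # ['发展', '中国', '家']
```

This is exactly the kind of case where a segmentation decision changes the syntactic structure of the sentence, which motivates treating such compounds as sensitive words.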

APA

Ren, F. (2001). A hybrid approach of text segmentation based on sensitive word concept for NLP. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 2004, pp. 375–388). Springer Verlag. https://doi.org/10.1007/3-540-44686-9_37
