Taming wild phrases

Cornelis H.A. Koster; Mark Seutter

Journal Article

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2003) 2633 161-176

DOI: 10.1007/3-540-36618-0_12

19 Citations

9 Readers

Abstract

In this paper the suitability of different document representations for automatic document classification is compared, investigating a whole range of representations between bag-of-words and bag-of-phrases. We look at some of their statistical properties, and determine for each representation the optimal choice of classification parameters and the effect of Term Selection. Phrases are represented by an abstraction called Head/Modifier pairs. Rather than just throwing phrases and keywords together, we shall start with pure HM pairs and gradually add more keywords to the document representation. We use the classification on keywords as the baseline, which we compare with the contribution of the pure HM pairs to classification accuracy, and the incremental contributions from heads and modifiers. Finally, we measure the accuracy achieved with all words and all HM pairs combined, which turns out to be only marginally above the baseline. We conclude that even the most careful term selection cannot overcome the differences in Document Frequency between phrases and words, and propose the use of term clustering to make phrases more cooperative. © Springer-Verlag Berlin Heidelberg 2003.
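
The difference in Document Frequency between phrases and words that the abstract refers to can be illustrated with a minimal sketch. It is not the authors' method: the toy corpus, the hand-written (head, modifier) pairs, and the helper names `document_frequency` and `words_of` are all assumptions made for illustration, and a real system would obtain the pairs from a parser.

```python
from collections import Counter

# Hypothetical toy corpus: each document is a list of hand-written
# (head, modifier) pairs. These are illustrative, not data from the paper.
docs = [
    [("classification", "document"), ("selection", "term")],
    [("classification", "text"), ("selection", "term")],
    [("classification", "document"), ("clustering", "term")],
]


def document_frequency(term_lists):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for terms in term_lists:
        df.update(set(terms))  # set(): count each term once per document
    return df


def words_of(pairs):
    """Bag-of-words view: heads and modifiers as separate keywords."""
    return [word for pair in pairs for word in pair]


# Document Frequency of HM pairs versus that of their constituent words.
pair_df = document_frequency(docs)
word_df = document_frequency(words_of(d) for d in docs)

for pair, count in sorted(pair_df.items()):
    head, modifier = pair
    print(pair, count, "head DF:", word_df[head], "modifier DF:", word_df[modifier])
```

In this toy setting a pair can never occur in more documents than either of its constituent words, which is one way to read the abstract's remark that phrases and words differ in Document Frequency.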

Cite

Koster, C. H. A., & Seutter, M. (2003). Taming wild phrases. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2633, 161–176. https://doi.org/10.1007/3-540-36618-0_12
