Unsupervised feature selection for text data

Abstract

Feature selection for unsupervised tasks is particularly challenging, especially when dealing with text data. The growth of online documents and email communication creates a need for tools that can operate without supervision from the user. In this paper, we present novel feature selection techniques that address this need. A distributional similarity measure from information theory is applied to measure feature utility. This utility informs the search for both representative and diverse features in two complementary ways: CLUSTER partitions the entire feature space and then selects one feature to represent each cluster, while GREEDY grows the feature subset one greedily selected feature at a time. In particular, we found that GREEDY's local search is suited to learning smaller feature subsets, while CLUSTER is able to improve the global quality of larger feature sets. Experiments with four email data sets show significant improvement in retrieval accuracy with nearest-neighbour-based search methods compared to an existing frequency-based method. Importantly, both GREEDY and CLUSTER make significant progress towards the upper-bound performance set by a standard supervised feature selection method. © Springer-Verlag Berlin Heidelberg 2006.
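
The abstract does not specify implementation details, so the following is a minimal sketch of the two selection strategies under stated assumptions: Jensen-Shannon distance stands in for the distributional similarity measure, average-link hierarchical clustering for the CLUSTER partitioning step, and a max-min diversity rule for the GREEDY step. The function names (cluster_select, greedy_select) and all parameter choices are illustrative, not the authors' actual method.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.cluster.hierarchy import linkage, fcluster


def feature_distributions(X):
    """Column-normalise a term-document count matrix so each feature
    (term) is represented by its distribution over documents."""
    X = np.asarray(X, dtype=float)
    col_sums = X.sum(axis=0)
    col_sums[col_sums == 0] = 1.0          # avoid division by zero
    return X / col_sums


def pairwise_js(P):
    """Pairwise Jensen-Shannon distance between feature distributions
    (the columns of P). Assumed stand-in for the paper's measure."""
    n = P.shape[1]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = jensenshannon(P[:, i], P[:, j])
    return D


def cluster_select(X, k):
    """CLUSTER-style selection (sketch): partition all features into k
    groups by distributional similarity, then keep one representative
    (the medoid) per group."""
    P = feature_distributions(X)
    D = pairwise_js(P)
    condensed = D[np.triu_indices_from(D, k=1)]      # condensed form for scipy
    labels = fcluster(linkage(condensed, method="average"),
                      k, criterion="maxclust")
    selected = []
    for c in np.unique(labels):
        members = np.where(labels == c)[0]
        # medoid: member with the smallest total distance to its cluster
        medoid = members[np.argmin(D[np.ix_(members, members)].sum(axis=1))]
        selected.append(int(medoid))
    return sorted(selected)


def greedy_select(X, k):
    """GREEDY-style selection (sketch): grow the subset one feature at a
    time, always adding the feature most dissimilar to those already
    chosen, so the subset stays diverse."""
    P = feature_distributions(X)
    D = pairwise_js(P)
    # seed with the feature that is on average most dissimilar to all others
    selected = [int(np.argmax(D.mean(axis=1)))]
    while len(selected) < k:
        remaining = [i for i in range(D.shape[0]) if i not in selected]
        # max-min greedy step: farthest remaining feature from the subset
        best = max(remaining, key=lambda i: D[i, selected].min())
        selected.append(best)
    return sorted(selected)
```

With a term-document count matrix X (documents as rows, terms as columns), cluster_select(X, k) and greedy_select(X, k) each return the indices of k selected features; this mirrors the abstract's observation that the greedy, local strategy targets small subsets while the clustering strategy shapes the global structure of larger ones.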

Citation (APA)

Wiratunga, N., Lothian, R., & Massie, S. (2006). Unsupervised feature selection for text data. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 4106 LNAI, pp. 340–354). Springer Verlag. https://doi.org/10.1007/11805816_26
