Feature Selection and Feature Extraction for Text Categorization

  • Lewis D
  • 1

    Readers

    Mendeley users who have this article in their library.
  • N/A

    Citations

    Citations of this article.

Abstract

The effect of selecting varying numbers and kinds of features for use in predicting category membership was investigated on the Reuters and MUC-3 text categorization data sets. Good categorization performance was achieved using a statistical classifier and a proportional assignment strategy. The optimal feature set size for word-based indexing was found to be surprisingly low (10 to 15 features) despite the large training sets. The extraction of new text features by syntactic analysis and feature clustering was investigated on the Reuters data set. Syntactic indexing phrases, clusters of these phrases, and clusters of words were all found to provide less effective representations than individual words.

Get free article suggestions today

Mendeley saves you time finding and organizing research

Sign up here
Already have an account ?Sign in

Find this document

Authors

  • David D. Lewis

  • David D. Lewis

Cite this document

Choose a citation style from the tabs below

Save time finding and organizing research with Mendeley

Sign up for free