Collaborative Data Cleaning for Sentiment Classification with Noisy Training Corpus

by Xiaojun Wan
Advances in Knowledge Discovery and Data Mining (2011)

Abstract

A labeled review corpus is a valuable resource for sentiment classification of product reviews. Fortunately, a large number of product reviews are available on the Web, and each review is associated with a user-assigned tag indicating its polarity. Such tagged reviews can be downloaded and used as a training corpus for sentiment classification. However, users may assign polarity tags arbitrarily or inaccurately, so the automatically constructed corpus contains many noisy instances, which degrade classification performance. In this paper, we propose the co-cleaning and tri-cleaning algorithms to collaboratively clean the corpus and thus improve sentiment classification performance. The proposed algorithms use multiple classifiers to iteratively select and remove the instances most confidently identified as noisy. Experimental results verify the effectiveness of the proposed algorithms, with tri-cleaning being the most effective and promising.
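
The abstract does not spell out the algorithms, so the following is only a rough sketch of the general idea of iterative, multi-classifier noise removal, not the paper's actual co-cleaning procedure. Everything in it is an assumption made for illustration: the two feature views (word unigrams and character trigrams), the small Naive Bayes classifier, the agreement rule (both classifiers must confidently disagree with the tag), and the hypothetical names `co_clean`, `rounds`, `k`, and `threshold`.

```python
import math
from collections import Counter


def unigrams(text):
    # View 1 (assumed): lowercase whitespace-separated words.
    return text.lower().split()


def char_trigrams(text):
    # View 2 (assumed): overlapping character trigrams.
    s = text.lower()
    return [s[i:i + 3] for i in range(len(s) - 2)]


class NaiveBayes:
    """Tiny multinomial Naive Bayes with add-one smoothing."""

    def __init__(self, featurize):
        self.featurize = featurize

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.prior = Counter(labels)
        self.counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.counts[label].update(self.featurize(text))
        self.vocab = set().union(*self.counts.values())
        return self

    def proba(self, text):
        # Return a dict mapping each class to its posterior probability.
        feats = self.featurize(text)
        n = sum(self.prior.values())
        logp = {}
        for c in self.classes:
            total = sum(self.counts[c].values()) + len(self.vocab)
            lp = math.log(self.prior[c] / n)
            for f in feats:
                lp += math.log((self.counts[c][f] + 1) / total)
            logp[c] = lp
        m = max(logp.values())
        z = sum(math.exp(v - m) for v in logp.values())
        return {c: math.exp(v - m) / z for c, v in logp.items()}


def co_clean(texts, labels, rounds=3, k=1, threshold=0.5):
    """Return the indices of instances kept after iterative cleaning.

    Each round retrains both classifiers on the current corpus and drops
    up to k instances whose tag both classifiers confidently reject.
    """
    keep = list(range(len(texts)))
    for _ in range(rounds):
        cur_texts = [texts[i] for i in keep]
        cur_labels = [labels[i] for i in keep]
        if len(set(cur_labels)) < 2:
            break  # nothing to compare against with a single class
        clf_a = NaiveBayes(unigrams).fit(cur_texts, cur_labels)
        clf_b = NaiveBayes(char_trigrams).fit(cur_texts, cur_labels)
        candidates = []
        for i in keep:
            # Probability mass each classifier puts on labels other than the tag.
            against_a = 1.0 - clf_a.proba(texts[i])[labels[i]]
            against_b = 1.0 - clf_b.proba(texts[i])[labels[i]]
            score = min(against_a, against_b)  # both must disagree with the tag
            if score > threshold:
                candidates.append((score, i))
        if not candidates:
            break  # no instance is confidently noisy any more
        candidates.sort(reverse=True)
        drop = {i for _, i in candidates[:k]}
        keep = [i for i in keep if i not in drop]
    return keep
```

Calling `co_clean(texts, labels)` on parallel lists of review texts and polarity tags returns the indices of the reviews that survive cleaning; a final classifier would then be trained on that subset. A three-classifier variant in the spirit of tri-cleaning could follow the same loop with a third view, but how the paper actually combines the classifiers is not stated in the abstract.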
