Imbalanced Sentiment Classification

by Shoushan Li, Guodong Zhou, Zhongqing Wang, Sophia Yat Mei Lee, Rangyang Wang
Test (2011)

Abstract

Sentiment classification has undergone significant development in recent years. However, most existing studies assume the balance between negative and positive samples, which may not be true in reality. In this paper, we investigate imbalanced sentiment classification instead. In particular, a novel clustering-based stratified under-sampling framework and a centroid-directed smoothing strategy are proposed to address the imbalanced class and feature distribution problems respectively. Evaluation across different datasets shows the effectiveness of both the under-sampling framework and the smoothing strategy in handling the imbalanced problems in real sentiment classification applications.

Shoushan Li† Guodong Zhou† Zhongqing Wang† Sophia Yat Mei Lee‡ Rangyang Wang†
†Natural Language Processing Lab
Soochow University, Suzhou, China
{shoushan.li, wangzq870305, wangrongyang.nlp}@gmail.com, gdzhou@suda.edu.cn

‡ Language Centre
Hong Kong Baptist University, Hong Kong
sophiaym@gmail.com
Categories and Subject Descriptors
I.2.7 [Artificial Intelligence]: Natural Language Processing –
Text analysis
General Terms
Algorithms, Experimentation
Keywords
Opinion Mining, Sentiment Classification, Imbalanced
Classification

1. INTRODUCTION
Sentiment classification aims to predict the sentiment polarity of a
text [8], and it plays a critical role in many NLP applications.
However, most existing studies on sentiment classification
assume a balance between the numbers of positive and
negative samples, which may not hold in practice. Many
sentiment classification applications involve imbalanced
class distributions, in that one class has far more samples in the
training data than the other. We refer to this setting as imbalanced
sentiment classification, in which the class with more
samples is called the majority class and the class
with fewer samples is called the minority
class.
In fact, imbalanced classification has been proven challenging in
the machine learning research community [4]. Many approaches
have been proposed to deal with the imbalanced class
distribution problem, such as re-sampling [2], one-class
classification [3], and cost-sensitive learning [14].
Unfortunately, none of the above approaches can be readily
applied to imbalanced sentiment classification due to its specific
characteristics.
In imbalanced classification, the majority class normally contains
more kinds of occurring features than the minority class. For
simplicity, we refer to this phenomenon as imbalanced feature
distribution. This phenomenon becomes worse in imbalanced
sentiment classification since sentiment classification often
involves a small number of positive and negative samples. It
worsens further due to the sparseness of effective sentimental
features. On the one hand, sentiment classification faces the same
challenge of high feature dimension as text categorization. On
the other hand, the effective sentimental features in a sample are
rather rare, considering the infrequent
occurrence of sentimental words in text. For example, while the
feature dimension of a typical sentiment classifier may be up to
tens of thousands, there are only dozens of effective sentimental
features (e.g., sentimental words) in a sample.
The imbalanced feature distribution can cause severe
problems when training an imbalanced sentiment
classifier. Normally, the features that occur only in the
majority class and never in the minority class (called majority
unique features) can be a strong distinguishing clue for the
classifier. Nevertheless, since effective
sentimental features (e.g., sentimental words) are significantly
fewer than other features (e.g., words about facts)
in sentiment classification, most of the majority unique features
are non-sentimental and will contribute abnormally. As a result, if we use all the training
samples to train a classifier, the classifier will have a strong
tendency to wrongly predict a sample from the minority class as
belonging to the majority class. This indicates the necessity of dealing with
the imbalanced feature distribution problem in imbalanced
sentiment classification.
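
The notion of majority unique features can be made concrete with a small sketch. The tokenized reviews below are invented for illustration, and the function names are hypothetical; the only thing taken from the text is the definition (features that occur in the majority class but not in the minority class).

```python
def occurring_features(samples):
    """Return the set of features (tokens) occurring in any sample."""
    features = set()
    for tokens in samples:
        features.update(tokens)
    return features


def majority_unique_features(majority_samples, minority_samples):
    """Features that occur in the majority class but never in the minority class."""
    return occurring_features(majority_samples) - occurring_features(minority_samples)


# Toy tokenized reviews (invented for illustration only).
majority = [
    ["great", "battery", "screen"],
    ["good", "price", "delivery"],
    ["great", "keyboard", "price"],
]
minority = [["poor", "battery"], ["bad", "screen", "price"]]

unique = majority_unique_features(majority, minority)
print(sorted(unique))  # ['delivery', 'good', 'great', 'keyboard']
```

In this toy data, half of the majority unique features ("delivery", "keyboard") are factual rather than sentimental words, yet a classifier trained on all samples could still treat them as evidence for the majority class.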
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. To copy
otherwise, or republish, to post on servers or to redistribute to lists,
requires prior specific permission and/or a fee.
CIKM’11, October 24–28, 2011, Glasgow, Scotland, UK.
Copyright 2011 ACM 978-1-4503-0717-8/11/10…$10.00.

In this paper, we propose a clustering-based stratified under-sampling
framework to overcome the imbalanced class
distribution problem in imbalanced sentiment classification.
Under this framework, the samples in the majority class are first
grouped into several clusters, and then a suitable number of
samples are selected from each cluster to form the training
samples of the majority class. The intuition is that samples selected
with the stratified under-sampling framework should
be more representative than those selected at random.
Moreover, a centroid-directed smoothing strategy is proposed to
overcome the imbalanced feature distribution problem by
linearly interpolating a sample with the centroid of the cluster to
which the sample belongs. Since the centroid represents the
average feature distribution of all occurring features in the
cluster, our smoothing strategy can greatly increase the sample
robustness and reduce its feature sparseness.
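
A minimal sketch of the linear interpolation described above follows. This excerpt does not give the interpolation weight or the exact formula, so the weight `lam`, the form `(1 - lam) * x + lam * c`, and the toy vectors are all assumptions made for illustration.

```python
def centroid(vectors):
    """Component-wise mean of equal-length feature vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def smooth(sample, cluster_centroid, lam=0.5):
    """Linear interpolation of a sample with its cluster centroid.

    lam is a hypothetical interpolation weight in [0, 1]; lam=0 leaves the
    sample unchanged and lam=1 replaces it with the centroid.
    """
    return [(1.0 - lam) * x + lam * c for x, c in zip(sample, cluster_centroid)]


# Toy cluster of sparse bag-of-words vectors (invented values).
cluster = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 1.0],
]
c = centroid(cluster)                      # [0.5, 0.5, 0.0, 1.0]
smoothed = smooth(cluster[2], c, lam=0.5)
print(smoothed)                            # [0.25, 0.25, 0.0, 1.0]
```

The third sample starts with a single non-zero feature and ends with three, which is the sense in which interpolation with the centroid reduces feature sparseness.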
2. RELATED WORK
Early studies on sentiment classification mainly focus on
unsupervised learning methods, which build a sentiment
classifier without any labeled data. In such methods, the
relationship between two words (e.g., a seed word and any other
word) is usually first extracted from some knowledge resources,
such as WordNet and unlabeled data. Then, this relationship is
used to compute the semantic orientation of a word or even the
sentiment polarity of a text [12]. In general, the performance of
unsupervised learning methods is too low to meet the
requirements of real applications.
Compared to unsupervised learning, supervised learning methods
often perform much better due to the availability of labeled data,
and they have become more popular since the pioneering work on sentiment
classification by Pang et al. [8]. In particular, various kinds of
information have been explored to improve the bag-of-words
model [5][6][11]. Unfortunately, the performance of a
supervised learning method drops dramatically when it is adapted to
a new domain. This has aroused wide interest in research on
domain adaptation in sentiment classification [1].
Besides domain adaptation, the imbalanced class distribution
problem is another major obstacle to the wide
application of sentiment classification. To the best of our
knowledge, our work is the first study on imbalanced sentiment
classification.
3. CLUSTERING-BASED STRATIFIED
UNDER-SAMPLING FRAMEWORK
3.1 Overview
As described in the introduction, imbalanced feature
distribution in imbalanced sentiment classification is largely due
to the conflict between the high feature dimension problem (the
large number of possible features in sentiment classification) and
the feature sparseness problem (the infrequent occurrence of
sentimental words in a sample). This imbalance in the feature
distribution becomes even worse under an imbalanced class
distribution, since the number of occurring features in the
minority class would be much smaller than that in the majority
class.
To give a better understanding of the imbalanced feature
distribution phenomenon in imbalanced sentiment classification,
Table 1 gives the statistics over two typical domains on the
number of features occurring in the positive and negative
classes, denoted as $n_{+}$ and $n_{-}$ respectively, with the ratios
$n_{+}/n_{-}$ being around 2.


Table 1: Feature distributions on the number of occurring
features in the positive and negative classes across two
typical domains

Domain      $n_{+}$    $n_{-}$    $n$
Beauty       7,315      4,364     8,945
Computer    12,646      5,527    14,465
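
The "around 2" ratio can be checked directly from the table. The short calculation below assumes that the first two numeric columns are the positive-class and negative-class feature counts, in the order the text lists them; the third column is not used.

```python
# Counts copied from the table. Which column holds which class is an
# assumption (positive first, negative second, as the text lists them).
counts = {
    "Beauty": (7315, 4364),
    "Computer": (12646, 5527),
}
ratios = {domain: pos / neg for domain, (pos, neg) in counts.items()}
for domain, ratio in ratios.items():
    print(f"{domain}: {ratio:.2f}")  # Beauty: 1.68, Computer: 2.29
```

Under that assumption the two ratios are roughly 1.68 and 2.29, which brackets the value of 2 quoted in the text.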

3.2 Stratified Under-sampling
As a popular sampling method in statistics, stratified sampling
first groups the members of a population into a few relatively
homogeneous subgroups (i.e., strata) according to a certain
property and then selects samples from each stratum. It is
believed that stratified sampling is able to select better samples
to represent the distribution of the whole dataset. Previous work
justifies its effectiveness theoretically and empirically in both
general applications [7] and specific NLP applications such as
semantic relation extraction between named entities [9][10].
The basic motivation for using clustering-based stratified
sampling is to select some "representative" samples from the
majority class. In particular, the number of
"representative" samples selected equals the number of samples in the minority class.
Therefore, our sampling approach is basically a non-random
under-sampling approach. We adopt under-sampling
instead of over-sampling because of its better
performance. Please refer to Figure 2 in Section 6.2 for details.
Clustering groups the samples in the majority class into several
strata. Considering that the strata may be skewed, the number of
selected samples from each cluster is tuned according to the size
of each stratum. Given $N_{MA}$ samples in the majority class and
$N_{MI}$ samples in the minority class, the number of samples
selected from the i-th stratum $S_i$ should be
$N_i = \frac{N_{MI}}{N_{MA}} \cdot |S_i|$.
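
As a worked example of this proportional allocation, the sketch below computes a quota for each stratum. The text does not say how fractional quotas are rounded; largest-remainder rounding is an assumption chosen here so that the quotas add up to the minority class size, and the function name `allocate` is hypothetical.

```python
def allocate(stratum_sizes, n_minority):
    """Per-stratum quota N_i = (N_MI / N_MA) * |S_i|, rounded to integers.

    Rounding is not specified in the text; largest-remainder rounding is an
    assumption chosen here so that the quotas sum exactly to N_MI.
    """
    n_majority = sum(stratum_sizes)
    exact = [n_minority * size / n_majority for size in stratum_sizes]
    quotas = [int(q) for q in exact]  # floor, since all values are non-negative
    leftover = n_minority - sum(quotas)
    # Hand the remaining slots to the strata with the largest fractional parts.
    order = sorted(range(len(exact)), key=lambda i: exact[i] - quotas[i], reverse=True)
    for i in order[:leftover]:
        quotas[i] += 1
    return quotas


# Toy example: N_MA = 1000 majority samples in three strata, N_MI = 100.
print(allocate([500, 300, 200], 100))  # [50, 30, 20]
# Fractional case: exact quotas are 40.5, 31.5 and 18.0.
print(allocate([450, 350, 200], 90))
```

Each stratum contributes the same fraction of its members, so the selected majority samples keep the cluster proportions of the full majority class.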

Input: The training data and the number of strata being
clustered, denoted as K
Output: Balanced training data
Algorithm:
1) Cluster the samples in the majority class into K strata
using a clustering algorithm.
2) Calculate the number of samples to be sampled for
each stratum $S_i$, $i \in \{1, 2, \ldots, K\}$.
3) Perform intra-strata sampling in each stratum.
4) Combine the selected majority class samples from all
the strata to form the majority class training data
5) Merge the majority class training data and all minority
class data to obtain the balanced training dataset.
Figure 1: Clustering-based stratified under-sampling
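
The steps of Figure 1 can be sketched end to end as follows. The figure leaves the clustering algorithm and the intra-strata sampling method open, so the tiny k-means routine, the uniform random sampling within each stratum, the simple per-stratum rounding, and all function names below are assumptions for illustration rather than the setup used in the experiments.

```python
import random


def kmeans(vectors, k, iters=20, seed=0):
    """Tiny k-means; returns a cluster index per vector (stand-in clusterer)."""
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iters):
        for idx, v in enumerate(vectors):
            dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in centers]
            labels[idx] = dists.index(min(dists))
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the old center if a cluster goes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def stratified_undersample(majority, minority, k, seed=0):
    """Steps 1-5 of the figure: cluster, allocate, sample, combine, merge."""
    rng = random.Random(seed)
    labels = kmeans(majority, k, seed=seed)                     # step 1
    strata = [[v for v, lab in zip(majority, labels) if lab == j] for j in range(k)]
    ratio = len(minority) / len(majority)
    selected = []
    for stratum in strata:
        quota = min(len(stratum), round(ratio * len(stratum)))  # step 2
        selected.extend(rng.sample(stratum, quota))             # steps 3-4
    # Step 5: merge with all minority samples; labels 1/0 are arbitrary.
    return [(v, 1) for v in selected] + [(v, 0) for v in minority]


# Toy 2-D data (invented): 120 majority samples in three blobs, 30 minority.
rng = random.Random(1)
majority = [[rng.gauss(cx, 0.3), rng.gauss(cy, 0.3)]
            for cx, cy in [(0, 0), (5, 5), (0, 5)] for _ in range(40)]
minority = [[rng.gauss(8, 0.3), rng.gauss(0, 0.3)] for _ in range(30)]
balanced = stratified_undersample(majority, minority, k=3)
n_major = sum(1 for _, y in balanced if y == 1)
print(n_major, len(balanced) - n_major)
```

Because each quota is rounded independently, the number of selected majority samples can differ slightly from the minority class size; a remainder-aware allocation would make the two counts match exactly.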
