Sentiment zones approach to extracting a training corpus in unsupervised sentiment classification.
Proceedings of the Eurolan Doctoral Consortium (2007)
Abstract
One of the main problems of automatic classification by means of unsupervised machine learning is providing the system with the most adequate training data. In this paper we propose a sentiment zones approach for extracting a training subcorpus for sentiment classification. The proposed approach allowed us to increase the performance of unsupervised classifiers.
After all duplicate reviews were removed, the final version of the corpus comprises 29531 reviews, of which 23122 are positive (78%) and 6409 are negative (22%). The total number of different products in the corpus is 10631, the number of product categories is 255, and most of the reviewed products are either software products or consumer electronics. Unfortunately, some users misused the sentiment tagging facility on the website, and many reviews were tagged erroneously. However, the parts of the reviews were tagged much more accurately, so we used only the relevant (negative or positive) review parts as the items of the test corpus. For the tests described in this paper we used 10874 reviews, whose parts were extracted to make a balanced test corpus (5437 items of each sentiment) [3].
3 Description of the Approach
We argue that by using a subcorpus automatically extracted from a corpus of opinionated texts we can train standard supervised classifiers and achieve results comparable in terms of accuracy to the results of the same classifiers trained on a manually tagged training corpus. We also argue that such a process can be run iteratively to improve the results.
3.1 Sentiment Zone classifier
To extract a training subcorpus we need an unsupervised classifier. Here we propose a Sentiment Zone classifier, which is based on the idea of a sentiment zone. A sentiment zone is a sequence of words or phrases that have the same sentiment direction. In this study we use a sequence of characters between punctuation marks as a basic sentiment zone. In order to tag a zone as either positive or negative we do the following calculations.
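As an illustration of the basic zone definition above, the following minimal Python sketch splits a text into zones at punctuation marks. The function name `split_into_zones` and the punctuation set are assumptions made for this example; the paper does not list the exact marks it uses.

```python
import re

# Hypothetical punctuation set (ASCII and full-width marks); the paper does
# not specify which punctuation marks delimit a zone.
PUNCTUATION = r"[,.;:!?，。；：！？、]+"


def split_into_zones(text: str) -> list[str]:
    """Return the basic sentiment zones of a text.

    A basic zone is taken to be a run of characters between punctuation
    marks; empty fragments are dropped.
    """
    return [zone.strip() for zone in re.split(PUNCTUATION, text) if zone.strip()]
```

For example, a review part consisting of two clauses separated by a comma would yield two zones, each of which is then scored separately.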
Using the Sentiment dictionary (see section 2) and a maximum match algorithm, we extract items (words or phrases) from a zone. As the dictionary has two parts (positive and negative), we calculate two scores using the formula:
\[ Sc_{item} = \frac{L_d}{L_{phrase}} \cdot Sc_d \cdot N_d \]
where $L_d$ is the length of a dictionary item, $L_{phrase}$ is the length of a phrase in characters, $Sc_d$ is the sentiment score of a word (initially 1.0 for all the words in the dictionary), and $N_d$ is a negation check coefficient.
[3] The corpus will be made available upon publication of the paper.
The negation check is a very simple routine based on regex patterns that finds out whether the word is preceded by a negation within the limits of a phrase. If a negation is found, the score is multiplied by -1. Currently we use only the following frequent negations: bu, buhui, meiyou, baituo, mianqu, bimian.
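The item score and the negation check can be sketched in Python as follows. This is an illustrative reading of the formula, not the authors' implementation: the function names are hypothetical, lengths are measured with `len`, and the negations are matched in their romanised form as listed in the text, whereas a real system would match the corresponding Chinese characters.

```python
import re

# Romanised negations as listed in the text (assumption: a real system would
# match the Chinese characters rather than these transliterations).
NEGATIONS = ["bu", "buhui", "meiyou", "baituo", "mianqu", "bimian"]
NEGATION_RE = re.compile("|".join(re.escape(n) for n in NEGATIONS))


def negation_coefficient(phrase: str, item_start: int) -> float:
    """Return -1.0 if a negation precedes the item within the phrase, else 1.0."""
    return -1.0 if NEGATION_RE.search(phrase[:item_start]) else 1.0


def item_score(item: str, phrase: str, dictionary_score: float = 1.0) -> float:
    """Compute Sc_item = (L_d / L_phrase) * Sc_d * N_d for one dictionary item.

    Returns 0.0 if the item does not occur in the phrase.
    """
    start = phrase.find(item)
    if start < 0:
        return 0.0
    n_d = negation_coefficient(phrase, start)
    return len(item) / len(phrase) * dictionary_score * n_d
```

Note that a plain substring search over romanised text can match a negation inside an unrelated word; this simplification is acceptable only for illustrating how the three factors combine.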
The sentiment score of a sentiment zone is the sum of the sentiment scores of all the items found in it:
\[ Sc_{zone} = \sum Sc_{item(1 \ldots n)} \]
Thus we have two alternative sentiment scores for every zone: positive (the sum of the scores of all items found in the positive part of the dictionary) and negative (the sum of the scores for ‘negative’ words). Finally, the sentiment direction of a sentiment zone is found by comparing the two alternative scores:
\[ Sentiment_{zone} = \arg\max(Sc_i \mid Sc_j) \]
where $Sc_i$ is the sentiment score for one class and $Sc_j$ is the score for the other.
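The zone-level decision can be sketched as a comparison of two sums. In the Python sketch below, the function name and the treatment of ties (returning `None`) are assumptions; the paper does not say how a zone with equal scores is handled.

```python
from typing import Optional


def zone_direction(positive_item_scores: list[float],
                   negative_item_scores: list[float]) -> Optional[str]:
    """Pick the class with the larger zone score.

    Each zone score is the sum of the item scores found with the
    corresponding part of the dictionary. Ties yield None (an assumption).
    """
    positive_score = sum(positive_item_scores)
    negative_score = sum(negative_item_scores)
    if positive_score > negative_score:
        return "positive"
    if negative_score > positive_score:
        return "negative"
    return None
```

Because a negated item contributes a negative value, a zone containing only a negated positive word ends up with a positive score below zero and is therefore tagged as negative under this reading.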
Now we have a number of positive and negative zones in a text. To define the sentiment direction of the whole text, the classifier compares the numbers of alternative sentiment zones for each item:
\[ Sentiment_{text} = \sum Sc_{i(1 \ldots n)} - \sum Sc_{j(1 \ldots n)} \]
where $Sc_{(1 \ldots n)}$ are the sentiment zones from 1 to n found in a review. Each zone can be either positive ($i$) or negative ($j$). If $Sentiment_{text} > 0$, the whole text is classified as positive; if $Sentiment_{text} < 0$, it is classified as negative. If $Sentiment_{text} = 0$, the review is not classified.
In subsequent tests we also use the difference between the numbers of alternative zones as a threshold value. The difference is calculated in much the same way as the sentiment direction:
\[ Difference = \left| \sum Sc_{i(1 \ldots n)} - \sum Sc_{j(1 \ldots n)} \right| \]
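A minimal Python sketch of the text-level decision with the difference threshold is given below. It assumes that a review is kept when the absolute difference in zone counts is at least the threshold; the paper does not state whether the comparison is strict, so this, like the function name, is an assumption.

```python
from typing import Optional


def classify_text(zone_labels: list[str], threshold: int = 1) -> Optional[str]:
    """Classify a text from the labels of its sentiment zones.

    The sentiment is the number of positive zones minus the number of
    negative zones. The text is left unclassified (None) when the sentiment
    is zero or its absolute value is below the threshold.
    """
    positive = zone_labels.count("positive")
    negative = zone_labels.count("negative")
    sentiment = positive - negative
    if sentiment == 0 or abs(sentiment) < threshold:
        return None
    return "positive" if sentiment > 0 else "negative"
```

Raising the threshold keeps only reviews whose zones agree strongly, which trades the size of the extracted subcorpus for precision.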
As preliminary tests showed, the bigger the difference in the number of alternative zones, the more accurate the result. The results were evaluated by means of precision [4] and recall [5] (see table 1). The recall of positive reviews was always higher than that of negative ones, so to get a balanced subcorpus we had to reduce the number of positive reviews. The effective size [6] of the extracted subcorpora is indicated in the last column of table 1.
[4] Precision here is the proportion of correctly classified items in the extracted subcorpus.
[5] Recall here is the measure of the size of the extracted subcorpus compared to the corpus.
Difference  Precision  Recall  Subcorpus size
1           87.54      85.45   84.28
2           95.42      48.65   45.30
3           97.01      29.93   25.12
4           97.82      18.18   13.81
5           98.23      10.96   8.02
6           98.30      6.53    4.81
7           97.81      3.78    2.83
8           98.29      2.16    1.58
Table 1: Influence of the difference between the number of alternative zones on precision and the number of classified reviews.
3.2 Retraining
Once a training subcorpus can be extracted, it is possible to retrain the classifier. The subcorpus is used to adjust the scores of the dictionary items and to find new items (words or phrases) to be included in the dictionary. First of all, the subcorpus is pruned of phrases whose relative frequencies are very close in the different classes. We calculate the frequency difference using the following formula:
\[ difference = \frac{|F_i - F_j|}{(F_i + F_j)/2} \]
where $F_i$ is the relative frequency in one class and $F_j$ is the relative frequency in the other class. Then we look for chunks of text that have relatively high frequency. Although duplicate reviews were filtered out from the corpus, there are still a lot of ‘near duplicates’ (reviews with very little difference). To avoid being flooded with parts of such duplicates and to screen out low-frequency items, we require an item to occur more than 5 times in the corpus. For all items, both old and newly found, we compare the relative frequencies in both classes:
\[ \frac{F_i}{F_i + F_j} \]
Finally, the adjusted dictionary with new scores is ready for a new iteration of the classifier.
[6] The size is expressed in percent compared to the size of the corpus.
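The two retraining formulas of section 3.2 can be sketched in Python as follows. The function names are hypothetical, and the zero-denominator guards are an assumption added to keep the example well defined; the pruning cut-off applied to the frequency difference is not given in the paper and is therefore not modelled.

```python
def frequency_difference(f_i: float, f_j: float) -> float:
    """Return |F_i - F_j| / ((F_i + F_j) / 2), or 0.0 if both are zero.

    Values near 0 mean the phrase is about equally frequent in both classes;
    the maximum value 2 means it occurs in one class only.
    """
    mean = (f_i + f_j) / 2
    return abs(f_i - f_j) / mean if mean else 0.0


def class_share(f_i: float, f_j: float) -> float:
    """Return F_i / (F_i + F_j), or 0.0 if both are zero."""
    total = f_i + f_j
    return f_i / total if total else 0.0
```

For instance, an item with relative frequency 0.003 in one class and 0.001 in the other has a frequency difference of 1.0 and a share of 0.75 for the first class.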
4 Experiments
To test the approaches we designed two experiments. The purpose of the first one is to check whether we can get a better training subcorpus after every iteration. A ‘better training subcorpus’ means that at least one of its main characteristics (precision or recall) improved after an iteration while the other one remained almost the same. For example, if we get a higher precision at approximately the same recall, the result of the iteration is regarded as positive. The criterion for this improvement is the F-measure:
\[ F = \frac{2PR}{P + R} \]
where P is precision and R is recall.
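As a worked check of the F-measure, the short Python sketch below recomputes two of the values reported in the result tables from their precision and recall. The function name is an assumption, as is the zero guard.

```python
def f_measure(precision: float, recall: float) -> float:
    """Return the harmonic mean 2PR / (P + R), or 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Iteration 0 at threshold value 1: P = 87.54, R = 85.46
print(round(f_measure(87.54, 85.46), 2))  # 86.49
# Iteration 0 at threshold value 3: P = 97.01, R = 29.93
print(round(f_measure(97.01, 29.93), 2))  # 45.75
```

Because the harmonic mean is dominated by the smaller of the two values, the high-precision, low-recall subcorpus at threshold 3 scores far lower than the one at threshold 1.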
The second experiment tests whether the training subcorpora obtained by means of the technique tested in the first experiment can increase the accuracy of standard classifiers.
Using an approach that gives us control over the precision and the size of the resulting subcorpus, we want to test two alternative approaches to training subcorpus extraction: size-driven and precision-driven. The former means that we use a training subcorpus of the maximum possible size, while the latter means that we use a subcorpus of the maximum possible precision, provided its size is not too small. It is quite easy to find the threshold value for the size-driven approach: the smallest difference (1) gives us a subcorpus with 87% precision and almost 85% of the size of the original corpus. However, we have to use a rather arbitrary notion of a subcorpus for the precision-driven approach. We decided on a subcorpus with 97% precision and 25% of the size, which is extracted with a difference threshold of 3 (see table 1). As the proposed approach can be used iteratively, we use only these two values throughout all iterations. In both approaches we use the Sentiment dictionary items as features. We also want to know whether the modifications made to the dictionary throughout the iterations can contribute to the performance of the classifiers.
4.1 Experiment 1
In this experiment we ran the Sentiment Zone classifier three times for each difference value. After the first run (see table 1), we obtained two subcorpora (one for value 1 and the other for value 3), which were used for retraining the same classifier (as described in section 3.2).
4.1.1 Threshold value 1
The results obtained in the run of three iterations [7] for the threshold value 1 are as follows (Iteration 0 is the initial classification, see table 1):
Iteration  Precision  Recall  F-measure
0          87.54      85.46   86.49
1          90.29      94.29   92.25
2          90.67      94.49   92.54
3          90.42      94.74   92.53
Table 2: Results of three iterations (Recall).
From table 2 we can see that iterations 1 - 3 produced a better subcorpus. But as we need a balanced corpus for training, it is important to know whether the results are also better if we use the notion of corpus size rather than recall.
Iteration  Precision  Size   F-measure
0          87.54      84.28  85.88
1          87.94      88.64  89.46
2          86.23      90.47  90.57
3          88.68      90.91  90.66
Table 3: Results of three iterations (corpus size as a percentage of the size of the corpus).
We can see better results after all the iterations.
4.1.2 Threshold value 3
The results of the three iterations for the threshold value 3 are as follows (Iteration 0 is the initial classification, see table 1):
Iteration  Precision  Recall  F-measure
0          97.01      29.93   45.75
1          96.72      41.85   58.42
2          97.03      45.29   61.76
3          97.05      45.84   62.27
Table 4: Results of three iterations (Recall).
From table 4 we can see that all of the three iterations produced a better subcorpus. Results for a balanced subcorpus are:
Iteration  Precision  Size   F-measure
0          97.01      25.12  39.91
1          96.37      35.39  51.82
2          96.43      35.28  51.75
3          96.24      36.11  52.64
Table 5: Results of three iterations (Corpus size).
Judging from table 5 it is possible to conclude that the classifier produced better corpora after all iterations.
[7] More iterations did not improve accuracy.
4.2 Experiment 2
In this experiment we used the training corpora obtained in the first experiment to train three standard classifiers: Naive Bayes (NB), Naive Bayes multinomial (NBm) and a support vector machine (SVM) [8]. The baselines (BL) for this experiment are the results achieved by the classifiers after being trained using the sentiment dictionary as the training corpus. The ‘supervised results’ (SR) are the results obtained from the same set of classifiers trained on the bigger part (66%) of the test corpus. These figures are compared with the results of the above-named classifiers after they were trained with the subcorpora extracted after each of the four iterations (see tables 3 and 5).
4.2.1 Original sentiment dictionary items as attributes
In these tests we used the Sentiment dictionary
items as attributes for the classifiers. For threshold
value 1 the results are:
Classifier  BL     0      1      2      3      SR
NBm         77.79  81.64  82.11  82.32  82.03  81.85
NB          50.80  73.74  73.7   73.63  73.49  74.09
SVM         76.24  80.89  82.63  82.72  82.59  81.77
Table 6: Standard classifiers performance compared (accuracy in percent).
Table 6 shows that all three classifiers increased their performance after being trained on the extracted subcorpus. One of them (Naive Bayes) failed to achieve better results than those obtained under the supervised approach. For NBm and SVM the best results were achieved with the subcorpus obtained after the second iteration.
The same settings were used to run the subcorpora obtained after the iterations with threshold value 3:
Classifier  BL     0      1      2      3      SR
NBm         77.79  80.99  81.01  80.95  81.11  81.85
NB          50.80  74.2   73.11  72.73  72.73  74.09
SVM         76.24  80.77  81.65  81.89  81.79  81.77
Table 7: Standard classifiers performance compared (accuracy in percent).
The results presented in this table are slightly worse than those in table 6; all classifiers outperformed the baseline, and only NBm failed to produce better results than a supervised classifier.
[8] We used WEKA 3.4.10 (http://www.cs.waikato.ac.nz/~ml/weka)
4.2.2 Modified sentiment dictionary as attributes
In these tests we used the modified Sentiment dictionary items (as explained in section 3.2) as attributes for the classifiers. Thus, the classifiers were not only trained on the extracted subcorpora, but also used a different set of attributes for each run. For threshold value 1 the results are:
Classifier  BL     0      1      2      3      SR
NBm         77.79  88.18  88.32  88.54  88.13  81.85
NB          50.80  82.28  81.57  80.86  81.65  74.09
SVM         76.24  84.64  87.9   88.19  88.00  81.77
Table 8: Standard classifiers performance compared (accuracy in percent).
Table 8 shows that all three classifiers outperformed not only the baseline but also the supervised results. It is interesting to note that the accuracy was higher than that of the supervised classifier after every iteration.
The same settings were used to run the subcorpora obtained after the iterations with threshold value 3:
Classifier  BL     0      1      2      3      SR
NBm         77.79  84.55  85.68  85.41  85.41  81.85
NB          50.80  77.8   81.3   80.94  81.09  74.09
SVM         76.24  82.72  84.53  85.7   85.52  81.77
Table 9: Standard classifiers performance compared (accuracy in percent).
The results presented in this table are again worse than those for threshold value 1 (see table 8), but all of the classifiers outperformed the supervised approach, and, again, after all iterations.
5 Related Work
The approach presented in this paper is very close to the bootstrapping technique described by Yarowsky (1995): we use an automatically built Sentiment dictionary to extract a training subcorpus to train a classifier. The process can be run iteratively to increase precision and coverage.
Sentiment classification using supervised machine learning was studied by Pang et al. (2002). The authors showed that machine learning methods (Naive Bayes, maximum entropy classification, and support vector machines) with words as features do not perform as well on sentiment classification as on traditional topic-based categorization: the best accuracy achieved was 82.9%, using an SVM trained on unigram features. A later study (Pang and Lee, 2004) increased performance up to 87.2%, but the object of classification was an opinionated sentence, not a text (a review or its part). Pang et al. also showed that bigrams are not effective at capturing context in sentiment extraction, but that modelling the potentially important contextual effect of negation had some positive influence on performance. Following these findings, we also used a unigram-presence approach and implemented a simple negation check. The main difference is that the training corpus we used is automatically generated, which enables us to regard our approach as lying within an unsupervised machine-learning paradigm.
Das and Chen (2006) designed an algorithm comprising different supervised classifier algorithms coupled together by a voting scheme for extracting small investor sentiment from stock message boards. Among others, they use a classifier algorithm based on a count of words with positive and negative connotations. This makes their study close to ours, as we also use word counts for calculating sentiment scores. The difference is that the dictionary we use (the Sentiment Dictionary by Ku et al., see section 2) was generated automatically rather than manually crafted. The same (manual) approach to sentiment lexicon construction, combined with fuzzy logic, was used by Huettner and Subasic (2001).
Hu and Liu (2004), Kim and Hovy (2004), and Ku et al. (2006) create a sentiment dictionary by means of a set of seed words which is enlarged using other dictionaries or thesauri, and the automatically generated dictionaries are then used for classification. We use such a dictionary iteratively (1) to adjust it to the corpus and (2) to obtain a training subcorpus for machine learning classifiers.
Turney (2002) proposed an unsupervised learning
algorithm for classifying a review, in which the
sentiment direction of a phrase is calculated as the
pointwise mutual information (PMI) between the given
phrase and the word ‘excellent’ minus the PMI between
the given phrase and the word ‘poor’. The author
reports accuracies ranging from 64% to 84% across
different domains, with an average accuracy of 74% on
410 reviews. Although we do not use PMI for calculating
semantic orientation, we use the sentiment scores of
words to calculate the sentiment orientation of a
phrase in a very similar way: by comparing negative
and positive scores.
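The PMI difference described above can be sketched from raw co-occurrence counts. The function and parameter names are hypothetical, the counts in the usage note are invented for illustration, and the small smoothing constant added to the joint counts is an assumption of this sketch to avoid taking the logarithm of zero.

```python
import math


def pmi(joint, count_a, count_b, total):
    """Pointwise mutual information: log2( P(a, b) / (P(a) * P(b)) )."""
    return math.log2((joint / total) / ((count_a / total) * (count_b / total)))


def semantic_orientation(near_excellent, near_poor, n_phrase,
                         n_excellent, n_poor, total, smooth=0.01):
    """PMI(phrase, 'excellent') minus PMI(phrase, 'poor').

    near_excellent and near_poor are co-occurrence counts of the phrase
    with each reference word; smooth is added to both so that a zero
    count does not break the logarithm (an assumption of this sketch).
    """
    return (pmi(near_excellent + smooth, n_phrase, n_excellent, total)
            - pmi(near_poor + smooth, n_phrase, n_poor, total))
```

A phrase that co-occurs more often with ‘excellent’ than with ‘poor’, for instance `semantic_orientation(8, 2, 20, 100, 100, 10000)`, receives a positive orientation; swapping the two co-occurrence counts flips the sign.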
Aue and Gamon (2005), Eriksson (2006) and
Read (2005), amongst others, have noted the influence
of topic- and domain-dependency in sentiment
classification; the authors observed that a classifier
performs much better if trained on data from the
same domain as the testing data.
Liu et al. (2005) use the notions of an opinion segment
and of an opinion set of a feature, which are close to
the notion of a sentiment zone used in this paper. Both
concepts denote a chunk of syntactic units that is
characterized by some sentiment direction. The
difference is that the first two are one or more
sentences long, whereas we operate with phrases.
Another difference is that Liu et al. use these chunks
for analyzing and comparing opinions regarding a product
feature; thus they are ‘product feature-driven’: they
are located after a feature is found.
6 Conclusion and future work
The experimental data show that the proposed approach
helps to increase the performance of an unsupervised
sentiment classifier and to outperform the supervised
approach. It was also observed that in these tests a
larger training corpus was more important than higher
precision for achieving better accuracy. A major
improvement in accuracy was achieved by using a
modified set of attributes combined with an extracted
training subcorpus. The proposed approach may also be
domain-independent, as it uses a domain-independent
dictionary in the first iteration, but this needs to be
validated by testing on different corpora. Another
question to be answered is why the increase in
performance was not linear.
One of the problems to be solved is the very
unbalanced output of the sentiment zone classifier:
although the approach seems to be effective for both
negative and positive items, their performance differs
too much. Positive reviews increase precision while
keeping relatively high recall, whereas the increase in
precision for negative reviews is accompanied by a rapid
loss of recall. This phenomenon forces us to make the
corpus smaller in order to avoid the complications of
processing a skewed corpus. The imbalance may have been
caused by the classifier during the sentiment score
calculation, or it may be a distinct feature of negative
reviews in general: people tend to be less emotional and
more reasonable. This means that we have to seek ways of
improving the scoring technique, as well as try some
linguistic features that might contribute to more
accurate scores.
First of all, we would like to test the negation check
again, as at present this routine is implemented in a
rather simplistic way. Another feature to be tested
is sentiment intensifiers (words such as “very”,
“absolutely” and others). It may also be beneficial to
increase the score for subjectivity indicators (word
combinations like “I think” and some interjections).
A more sophisticated technique could probably be
applied to filtering and processing the training
subcorpora; so far we use quite a simple technique based
on comparison of relative scores. The positive
influence of the modified set of attributes on the
accuracy of classification prompts us to pay special
attention to this phenomenon and to investigate better
ways of extracting and modifying attributes. Moreover,
this domain (extraction of words and phrases) is of
special interest in the context of Chinese linguistics,
where the problem of defining the word is far from
being solved, which significantly affects Chinese NLP.
References
Anthony Aue and Michael Gamon. 2005. Customiz-
ing sentiment classifiers to new domains: a case study.
Proceedings of RANLP.
Sanjiv R. Das and Mike Y. Chen. 2006. Yahoo! for
Amazon: Sentiment extraction from small talk on the
web.
Brian Eriksson. 2006. Sentiment classifica-
tion of movie reviews using linguistic parsing.
http://www.cs.wisc.edu/∼apirak/cs/cs838/
eriksson final.pdf.
Minqing Hu and Bing Liu. 2004. Mining and summariz-
ing customer reviews. In SIGKDD 2004, pages 168–
177.
Soo-Min Kim and Eduard H. Hovy. 2004. Determin-
ing the sentiment of opinions. In Proceedings of
COLING-04, pages 1367–1373, Geneva, Switzerland,
August 23-27.
Lun-Wei Ku, Yu-Ting Liang, and Hsin-Hsi Chen. 2006.
Opinion extraction, summarization and tracking in
news and blog corpora. In Proceedings of AAAI-2006
Spring Symposium on Computational Approaches to
Analyzing Weblogs, volume AAAI Technical Report,
pages 100–107, March.
Bing Liu, Minqing Hu, and Junsheng Cheng. 2005.
Opinion observer: Analyzing and comparing opinions
on the web. In Proceedings of the 14th International
World Wide Web Conference, pages 342–351.
Bo Pang and Lillian Lee. 2004. A sentimental education:
Sentiment analysis using subjectivity summarization
based on minimum cuts. In Proceedings of the 42nd
Annual Meeting of the Association for Computational
Linguistics, pages 271–278, Barcelona, Spain.
Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan.
2002. Thumbs up? Sentiment classification using ma-
chine learning techniques. In Proceedings of the 2002
Conference on Empirical Methods in Natural Lan-
guage Processing, pages 79–86, University of Penn-
sylvania.
Jonathon Read. 2005. Using emoticons to reduce depen-
dency in machine learning techniques for sentiment
classification. In Proceedings of the Student Research
Workshop of ACL-05.
Pero Subasic and Alison Huettner. 2001. Affect analysis
of text using fuzzy semantic typing. IEEE Transactions
on Fuzzy Systems, 9(4):483–496, August.
Peter D. Turney. 2002. Thumbs up or thumbs down?
Semantic orientation applied to unsupervised classifi-
cation of reviews. In Proceedings of the 40th Annual
Meeting of the Association for Computational Linguis-
tics (ACL’02), pages 417–424, Philadelphia, Pennsyl-
vania.
David Yarowsky. 1995. Unsupervised word sense dis-
ambiguation rivaling supervised methods. In Proceed-
ings of the 33rd Annual Meeting of the Association
for Computational Linguistics, pages 189–196, Cam-
bridge, MA.


