Unsupervised and knowledge-poor approaches to sentiment analysis

by Taras Zagibalov
Word Journal Of The International Linguistic Association (2010)

Abstract

Sentiment analysis focuses upon automatic classification of a document's sentiment (and more generally extraction of opinion from text). Ways of expressing sentiment have been shown to be dependent on what a document is about (domain-dependency). This complicates supervised methods for sentiment analysis which rely on extensive use of training data or linguistic resources that are usually either domain-specific or generic. Both kinds of resources prevent classifiers from performing well across a range of domains, as this requires appropriate in-domain (domain-specific) data. This thesis presents a novel unsupervised, knowledge-poor approach to sentiment analysis aimed at creating a domain-independent and multilingual sentiment analysis system. The approach extracts domain-specific resources from documents that are to be processed, and uses them for sentiment analysis. This approach does not require any training corpora, large sets of rules or generic sentiment lexicons, which makes it domain- and language-independent but at the same time able to utilise domain- and language-specific information. The thesis describes and tests the approach, which is applied to different data, including customer reviews of various types of products, reviews of films and books, and news items; and to four languages: Chinese, English, Russian and Japanese. The approach is applied not only to binary sentiment classification, but also to three-way sentiment classification (positive, negative and neutral), subjectivity classification of documents and sentences, and to the extraction of opinion holders and opinion targets. Experimental results suggest that the approach is often a viable alternative to supervised systems, especially when applied to large document collections.


Available from Taras Zagibalov's profile on Mendeley.

Unsupervised and Knowledge-poor
Approaches to Sentiment Analysis
Taras Zagibalov
Submitted for the degree of Doctor of Philosophy
University of Sussex
September 2010
Declaration
I hereby declare that this thesis has not been and will not be, submitted in whole or in
part to another University for the award of any other degree.
Signature:.............................................
Taras Zagibalov
UNIVERSITY OF SUSSEX
TARAS ZAGIBALOV (DPHIL)
UNSUPERVISED AND KNOWLEDGE-POOR APPROACHES TO SENTIMENT ANALYSIS
SUMMARY
Sentiment analysis focuses upon automatic classification of a document's sentiment (and
more generally extraction of opinion from text). Ways of expressing sentiment have been
shown to be dependent on what a document is about (domain-dependency). This complicates
supervised methods for sentiment analysis which rely on extensive use of training
data or linguistic resources that are usually either domain-specific or generic. Both kinds
of resources prevent classifiers from performing well across a range of domains, as this
requires appropriate in-domain (domain-specific) data.
This thesis presents a novel unsupervised, knowledge-poor approach to sentiment analysis
aimed at creating a domain-independent and multilingual sentiment analysis system.
The approach extracts domain-specific resources from documents that are to be processed,
and uses them for sentiment analysis. This approach does not require any training corpora,
large sets of rules or generic sentiment lexicons, which makes it domain- and
language-independent but at the same time able to utilise domain- and language-specific
information.
The thesis describes and tests the approach, which is applied to different data, including
customer reviews of various types of products, reviews of films and books, and news items;
and to four languages: Chinese, English, Russian and Japanese. The approach is applied
not only to binary sentiment classification, but also to three-way sentiment classification
(positive, negative and neutral), subjectivity classification of documents and sentences,
and to the extraction of opinion holders and opinion targets. Experimental results suggest
that the approach is often a viable alternative to supervised systems, especially when
applied to large document collections.
Acknowledgements
I owe my deepest gratitude to my academic supervisor John Carroll for valuable advice
and friendly guidance, for encouragement and support. I am also grateful to Bill Keller,
my second supervisor, and David Weir, my Thesis committee member, for their guidance
and suggestions.
I am indebted to my colleagues for their support, especially to Jonathon Read, who
was always ready to help and advise me. I would like to deeply thank my friend Martine
Self and her family for their help and friendship.
I am grateful to the Ford Foundation Fellowship Program, which sponsored my research and
stay in the UK.
I owe a lot to my parents, Maria and Evgenij, for everything they have done for me,
for all their love and care.
This thesis would not have been possible without the love, support and patience of my
beloved wife Olesya. Thank you, my dear!
Contents
List of Tables
List of Figures
1 Introduction
1.1 Background
1.2 Research Overview
1.2.1 The Scientific Question
1.2.2 Hypotheses
1.2.3 Contributions of this Work
1.3 Outline of the Thesis
1.3.1 Approach and Methodology
1.3.2 Structure of the Thesis
2 Literature Review
2.1 Study of Affect
2.1.1 Private States
2.1.2 Categorical and Dimensional Paradigms
2.1.3 Affect Across Cultures
2.2 Sentiment Analysis
2.2.1 Tasks
2.2.2 Techniques
2.2.3 Features
2.2.4 Levels
2.2.5 Text Types and Domains
2.3 Resource Development
2.4 Challenges of Sentiment Analysis
2.4.1 Cross-Domain Approaches
2.4.2 Cross-Language Approaches
3 Features for Chinese Sentiment Classification
3.1 The 'Word' in Chinese Language Processing
3.1.1 Preliminary Word Segmentation of Chinese Texts
3.1.2 Preliminary Segmentation Experiment
3.2 Words and Characters as Features for Sentiment Classification
3.2.1 Basic Concepts
3.2.2 Experimental Data and Classification Algorithm
3.2.3 Evaluation Metrics and Statistical Significance Test
3.3 Experiments with Classification Units
3.3.1 Unigram-Based Classification
3.3.2 Zone-Based Classification
3.3.3 Sentence-Based Classification
3.3.4 Discussion
3.4 Sentiment Score Extensions
3.4.1 Negation Check
3.4.2 Length Ratio
3.4.3 Discussion
3.5 Summary
3.5.1 Accuracy
3.5.2 Coverage
3.5.3 Skew
3.5.4 Precision
3.5.5 Conclusion
4 Classifier Improvements and Extensions
4.1 Dictionary Adjustment
4.1.1 Adjustment to Corpus
4.1.2 Adjustment to Topic
4.1.3 Discussion
4.2 Vocabulary Extraction
4.2.1 Seed-Based Approach
4.2.2 Automatic Seed Word Selection
4.2.3 Iterative Approach
4.2.4 Discussion
4.3 Performance Improvements
4.3.1 Score Difference
4.3.2 Zone Difference
4.3.3 Using Supervised Techniques to Extend Unsupervised Classifier
4.3.4 Comparison of Supervised and Unsupervised Classifiers
4.4 Discussion
4.5 Conclusion
5 Multilingual Sentiment Classification
5.1 Data
5.1.1 Language-Specific Issues
5.1.2 Book Review Corpora
5.1.3 Issues that may Affect Automatic Processing
5.1.4 Movie Review Corpus
5.2 Supervised Classification Experiments
5.2.1 Lexical Unit Extraction
5.2.2 Experimental results
5.3 Unsupervised Classification Experiments
5.3.1 Seed-Based Classification
5.3.2 Classification Results
5.3.3 Score Difference
5.3.4 Zone Difference for Result Ranking
5.3.5 Combining with Supervised Machine Learning Techniques
5.4 Discussion
6 Multi-Aspect Sentiment Analysis
6.1 Three-Way Classification
6.1.1 Sentiment Classification
6.1.2 Subjectivity Classification
6.2 Sentence-Level Subjectivity and Sentiment Classification
6.2.1 Data
6.2.2 Classification Using an Existing Classifier
6.2.3 Discussion
6.2.4 Stand-Alone Subjectivity Classification
6.2.5 Evaluation Results
6.2.6 Discussion
6.3 Opinion Holder and Opinion Target Extraction
6.3.1 Overview of the Approach
6.3.2 Language-specific Adjustment
6.3.3 System Summary
6.3.4 Experiments
6.3.5 Discussion
6.4 Conclusion
7 Conclusion
7.1 Unsupervised Sentiment Classification
7.2 Other Tasks
7.3 Cross-domain Sentiment Classification
7.4 Multilingual Sentiment Classification
7.5 Future Work
7.6 Practical Implementation
Bibliography
List of Tables
3.1 Results of sentiment classification of product reviews from the web-site IT168, with and without segmentation
3.2 Results of unigram-based sentiment classification using different types of features
3.3 Results of sentiment classification with the characters present only in a single class
3.4 Results of zone-based sentiment classification
3.5 Results of sentence-based sentiment classification
3.6 Precision of the unigram, zone-based and sentence-based sentiment classifiers
3.7 Results of unigram-based sentiment classification with negation
3.8 Results of zone-based sentiment classification with negation
3.9 Results of sentence-based sentiment classification with negation
3.10 Results of unigram-based sentiment classification with length ratio
3.11 Results of zone-based sentiment classification with length ratio
3.12 Results of sentence-based sentiment classification with length ratio
3.13 Results of unigram-based sentiment classification with length ratio and negation check combined
3.14 Results of word-based sentiment with different features
4.1 List of top 10 words
4.2 Results of word-based sentiment classification before and after feature adjustment
4.3 Results of combined classifier sentiment classification before and after feature adjustment
4.4 Average of the results of five runs on a test corpus of the word classifier sentiment classification before and after feature adjustment
4.5 Product types and sizes of the test corpora.
List of Figures
4.1 Classification results with the seed list all with the score difference technique.
4.2 Classification results with the seed list all and with the zone difference technique.
4.3 Classification results with the seed list all and the zone distance technique (Topics).
4.4 Classification results with extracted seeds and the zone distance technique (Topics).
4.5 Information retrieval simulation results with the seed list all and the zone distance technique.
4.6 Information retrieval simulation results with extracted seeds and the zone distance technique.
5.1 Distribution of documents by the number of words contained.
5.2 Information retrieval simulation results with the zone distance technique.
5.3 Score difference results for the movie review corpus.
6.1 The distribution of Chinese customer reviews with respect to Sentiment Score and Sentiment Density.
6.2 The distribution of factual documents with respect to Sentiment Score and Sentiment Density.
6.3 The distribution of factual documents with respect to Sentiment Score and Sentiment Density with the NTU Sentiment Dictionary.
Chapter 1
Introduction
1.1 Background
This thesis is about the automated analysis of sentiment in written language. Sentiment
analysis is concerned not with the topic or factual content of a document, but rather with
the opinion expressed in it. Sentiment analysis has often been broken down into a set of
subtasks, including subjectivity classification, opinion classification (sentiment classification),
opinion holder and opinion target extraction, and feature-based opinion mining.
Opinion classification is usually framed as a two-way classification of positive and
negative sentiment, and has been applied at different levels: phrases, sentences, documents
and collections of documents. An opinion may have a holder (a person or a group that
expresses an opinion) and a target (an object which is being discussed or evaluated).
Feature-based opinion mining tries to find opinions about particular features of a product
or service (as opposed to an overall opinion about something).
Automatic classification of document sentiment (and more generally extraction of opinion
from text) has recently attracted much interest. One of the main reasons for this is
the importance of such information to companies, other organizations, and individuals.
Applications include marketing research tools that help a company see market or media
reaction towards their brands, products or services. Another type of application is search
engines that help potential purchasers make an informed choice of a product they want
to buy. Such search engines include a sentiment classification subsystem that may not
only present to a customer the overall sentiment about a product, but also select positive or
negative reviews to illustrate advantages and shortcomings of a product.
Automated sentiment analysis provides a range of possibilities for researchers in the
humanities whose studies involve analysis of large amounts of human-generated data. For
example, in media studies one might be interested to see if sentiments regarding the same
events are shared in mainstream media and in social media. Analysis of user-generated
content may be very helpful in political studies. For example, monitoring of political
debates in social media may help to estimate the prospects of political candidates in elections
or evaluate the effectiveness of political campaigns. The study of "the language of hatred"
contributes to efforts against political and religious extremism and intolerance. Many
aspects of social studies may benefit from automatic analysis of sentiments expressed by
people in ever-growing social networks. This approach offers unintrusive and fast access
to large amounts of data.
In a recent white paper addressing the role of sentiment analysis in organisations, Grimes
(2010) noted that "one axiom of full-circle sentiment analysis is ability to use all relevant
sentiment sources". This obviously includes resources in different languages, of different
genres and written in different styles. The most widely used approach to opinion and
subjectivity classification is based on supervised machine learning, in which a system
learns from human-annotated training data how to classify documents. However, a major
obstacle for automatic classification of sentiment and subjectivity is often a lack of training
data, which limits the applicability of approaches based on supervised machine learning.
With the rapid growth in the amount of textual data and the emergence of new domains of
knowledge it is virtually impossible to maintain corpora of annotated data that cover all
(or even most) areas of interest. The cost of manual annotation also adds to the problem.
Re-using the same corpus for training classifiers for new domains is also not effective:
several studies report decreased accuracy in cross-domain classification (Engstrom, 2004;
Read, 2005; Aue and Gamon, 2005). Indeed, a classifier trained in a film review domain
might consider the word unpredictable (e.g. unpredictable plot) to express a positive
characteristic. However, the same word in a car review might be a marker of negative
sentiment (e.g. unpredictable steering) (Turney, 2002). A similar problem has also been
observed in classification of documents created over different time periods (Read, 2005).
Some words were found to express a certain sentiment only for a definite period of time.
The word ice-axe, for example, was a strong indicator of positive sentiment because it was
frequently used in mostly positive reviews of a film that featured a particularly stirring
scene involving this tool.
Rule-based or dictionary-based classification has similar limitations, as it
also relies on a large set of manually created resources.
A major current challenge, therefore, is to be able to automatically extract sentiment
information from a variety of documents in different languages and from different domains.
Most existing solutions are based on adapting systems designed for one language (or
domain) to another. Obviously, there are differences between cultures, languages and
even within a language (consider the difference in the language used for evaluations of a
company's financial prospects in a business newspaper and reviews of a hard-rock festival in
a participant's blog). Such differences make adaptation problematic. Porting a sentiment
analysis system to new languages is even more difficult.
This thesis proposes an approach based on the idea of finding all data needed for
classification within the documents to be classified. Domain-specific data is often hard to
find, and generic resources, such as sentiment lexicons, often fail to include
all relevant markers of opinion. Even well-known and 'obvious' markers of sentiment may
demonstrate a sharp twist in their meaning in certain domains. For example, Ghose et al.
(2007) found that the word good is an indicator of negative sentiment in the domain of
eBay customer reviews: to describe something really good customers tend to use perfect
and excellent, reserving good for polite expression of negative appraisal (as in the package
is good (but might have been better)).
To overcome this problem the approach investigated in this thesis is to bootstrap
sentiment-related data from documents using a very limited number of seed lexical units.
This approach is used across domains, as well as across languages.
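The general idea of bootstrapping from seeds can be illustrated with a minimal sketch. It is not the algorithm developed later in the thesis: the whitespace tokenisation, the function names `classify` and `bootstrap`, and the thresholds `rounds` and `min_count` are all assumptions introduced here for illustration only.

```python
from collections import Counter


def classify(doc_tokens, pos_lex, neg_lex):
    """Return +1, -1 or 0 by counting lexicon hits (a deliberately crude rule)."""
    score = sum(t in pos_lex for t in doc_tokens) - sum(t in neg_lex for t in doc_tokens)
    return (score > 0) - (score < 0)


def bootstrap(docs, pos_seeds, neg_seeds, rounds=3, min_count=2):
    """Grow two lexicons from seed units using only the unlabelled documents.

    Hypothetical procedure: label each document with the current lexicons,
    then adopt units seen in at least `min_count` documents of one polarity
    and in none of the other.
    """
    pos_lex, neg_lex = set(pos_seeds), set(neg_seeds)
    tokenised = [doc.split() for doc in docs]  # assumption: whitespace tokens
    for _ in range(rounds):
        pos_counts, neg_counts = Counter(), Counter()
        for tokens in tokenised:
            label = classify(tokens, pos_lex, neg_lex)
            if label > 0:
                pos_counts.update(set(tokens))  # document frequency, not term frequency
            elif label < 0:
                neg_counts.update(set(tokens))
        new_pos = {t for t, n in pos_counts.items()
                   if n >= min_count and neg_counts[t] == 0} - neg_lex
        new_neg = {t for t, n in neg_counts.items()
                   if n >= min_count and pos_counts[t] == 0} - pos_lex
        if new_pos <= pos_lex and new_neg <= neg_lex:
            break  # nothing new was learned in this round
        pos_lex |= new_pos
        neg_lex |= new_neg
    return pos_lex, neg_lex
```

Called with a single positive and a single negative seed on a small collection of reviews, the function extends each lexicon with units that consistently co-occur with the seeds, which is the sense in which the resources are extracted from the documents to be processed.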
1.2 Research Overview
1.2.1 The Scientific Question
The main goal of the research presented in this thesis is to investigate the extent to which it
is possible to build an unsupervised domain-independent cross-lingual sentiment analysis
system. Such a system could be of great utility due to the ever-growing amount of all
kinds of unstructured information in different languages which often contain opinions and
evaluations.
1.2.2 Hypotheses
The research explores five main hypotheses:
• Hypothesis 1: Unsupervised systems can be developed for performing sentiment
analysis in different domains and in different languages that perform comparably
with supervised systems.
• Hypothesis 2: Unsupervised sentiment analysis may not require much domain- or
language-specific input. Such a system might require only a basic indication of what
positive and negative sentiments are, in the form of lexical 'seeds'.
• Hypothesis 3: A sentiment-related vocabulary automatically extracted from a corpus
can produce similar or better results compared to a specialised hand-built sentiment
vocabulary.
• Hypothesis 4: An automatically acquired training corpus in conjunction with machine
learning techniques can produce sentiment classification results similar or close
to a standard supervised approach.
• Hypothesis 5: A uniform notion of 'lexical unit' can be used across languages for
sentiment analysis tasks.
1.2.3 Contributions of this Work
This thesis presents a number of novel and significant contributions to research in
sentiment analysis:
1. An unsupervised knowledge-poor approach to domain-independent sentiment analysis
2. Use of the approach as a means of multilingual sentiment analysis
3. Sentiment zones (sequences of characters between punctuation marks) as units of
classification
4. Sentiment score (a score based on the relative frequencies of units in documents of
opposite sentiment) as a technique for sentiment classification
5. Score-difference technique for filtering out noise in sentiment classification. The
technique is based on calculating the difference between opposite sentiment scores
of an item.
6. Zone-difference technique for ranking sentiment classification. Zone-difference is a
difference of zones of opposite sentiment in a document.
7. An unsupervised opinion holder and opinion target extraction technique
8. A scale-based sentiment classification, as an alternative to a traditional binary
classification
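The notions of zone, score difference and zone difference listed above can be made concrete with a small sketch. It follows only the one-line definitions given in the list; the punctuation set, the whitespace-delimited units, the use of plain relative frequency as the score, and all function names are assumptions made here, and the exact formulas used in the thesis may differ.

```python
import re
from collections import Counter

# Assumed set of zone boundaries (Latin and CJK punctuation).
ZONE_BOUNDARY = re.compile(r"[.,;:!?()。，！？；：]+")


def split_zones(text):
    """Split text into zones: sequences of characters between punctuation marks."""
    return [z.strip() for z in ZONE_BOUNDARY.split(text) if z.strip()]


def relative_frequencies(docs):
    """Relative frequency of each unit in a set of documents of one sentiment."""
    counts = Counter(unit for doc in docs for unit in doc.split())
    total = sum(counts.values())
    return {unit: n / total for unit, n in counts.items()}


def score_difference(unit, pos_freq, neg_freq):
    """Difference between the opposite sentiment scores of a unit."""
    return pos_freq.get(unit, 0.0) - neg_freq.get(unit, 0.0)


def zone_polarity(zone, pos_freq, neg_freq, min_diff=0.0):
    """Sign of the summed score differences; small differences are treated as noise."""
    total = 0.0
    for unit in zone.split():
        diff = score_difference(unit, pos_freq, neg_freq)
        if abs(diff) > min_diff:
            total += diff
    return (total > 0) - (total < 0)


def zone_difference(text, pos_freq, neg_freq, min_diff=0.0):
    """Number of positive zones minus number of negative zones in a document."""
    polarities = [zone_polarity(z, pos_freq, neg_freq, min_diff) for z in split_zones(text)]
    return polarities.count(1) - polarities.count(-1)
```

Under these assumptions, documents with a large absolute zone difference would be ranked as the most confidently classified, while raising `min_diff` discards units whose frequencies in the two classes are nearly equal.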
Chapter 2
Literature Review
This chapter presents an overview of approaches to sentiment analysis and the various
research paradigms used. Section 2.1 describes research in 'affect' which sets the background
for sentiment analysis as part of NLP. The following section (2.2) describes different aspects
of sentiment analysis, covering its main tasks, as well as different types of features and
techniques used in this research field; the section also surveys domains where sentiment
analysis is used. Approaches to resource development are discussed in Section 2.3. Section
2.4 discusses the most significant outstanding challenges in sentiment analysis.
2.1 Study of Affect
This section discusses the theoretical background of sentiment analysis, touching on
relevant work in linguistics, psychology and ethnography as these areas provide important
foundations for cross-lingual sentiment analysis.
2.1.1 Private States
The linguistic concept of non-factual information expressed in a text is relatively young.
Quirk et al. (1985) introduced the linguistic term private state that denotes mental or
emotional states, hidden from objective observation. Banfield (1982) proposed a term
for the linguistic expression of private states: subjectivity. Thus subjectivity analysis is
aimed at identification of attributes of private states: the subject who expresses a private
state, the object about whom the state is expressed, the type of the attitude, the intensity
of the private state etc. In this sense, subjectivity analysis and sentiment analysis are often
used interchangeably. Pang and Lee (2008) give a different, narrower, NLP-specific
definition of subjectivity analysis as classifying a given text (a document or a sentence) into one
of two classes: objective (not expressing any private state) or subjective (expressing one
or more private states).
2.1.2 Categorical and Dimensional Paradigms
Most research in sentiment analysis is based on one of two basic approaches: categorical
and dimensional. The first approach puts all emotions into a finite number of categories
(e.g. anger, fear, sadness, surprise), while the other one delineates emotions according to
multiple dimensions rather than discrete categories.
The categorical approach is represented by the Cognitive Structure of Emotions (Ortony
et al., 1988) which provides a taxonomy of emotions based on the different conditions
that cause them. But since this approach is based on psychological contexts (for example,
relations between people) which usually are not represented in the text, it is quite difficult
to base any NLP study on it.
Another theory within the categorical paradigm that is derived from psychology is
Appraisal Theory. It claims that all emotions are the result of evaluations (appraisals) of
events that cause specific reactions in different people (Scherer and Schorr, 2001).
Appraisal Theory is applied to language by Systemic Functional Linguistics as a theory of
evaluation in text. Appraisal Theory analyses the way opinion is expressed in text and
provides taxonomies for systematic identification of expressions of opinions and emotions
in context. The taxonomies not only include words related to certain emotions or opinions
but also cover the way authors interact with other authors and their audience.
According to Appraisal Theory, appraisal consists of three subsystems that function
interactively: attitude, engagement and graduation. Attitude addresses one's feelings
(emotional reactions, judgements of people and appreciations of objects); Engagement is
concerned with the positioning of oneself with respect to the opinions of others and with
respect to one's own opinions; Graduation considers the ways a language increases or
decreases the attitude and engagement in a text. Since this theory describes linguistic
means of expression of emotions (lists of words that convey appraisal, for example) it can
immediately be applied to NLP studies (for example, Read and Carroll, 2009).
Another way of representing affect is to put it into a multi-dimensional semantic space.
For example, a two-factor structure of affect (described by Watson and Tellegen, 1985)
puts emotion in two dimensions: Pleasantness (from happy to sad) and Engagement (from
surprised to quiet).
Osgood et al. (1971) delineate emotions according to multiple dimensions: the two
primary dimensions in this account are along a 'good-bad' axis (the dimension of valence
or evaluation) and a 'strong-weak' axis (the dimension of activation or intensity).
The dimensional understanding of affect is very productive for NLP as a basis for
sentiment classification studies that also use (a very simplified) scale of sentiments ranging
from two-point (positive-negative) to multi-point classifications (the 'five-star' system of
Pang and Lee, 2005).
2.1.3 Affect Across Cultures
Since the research presented in this thesis addresses sentiment analysis in a multilingual
context, the cross-cultural aspects of affect are also very relevant. Important questions
include: Is sentiment universal? Is it expressed in comparable ways and can a unified
approach be adopted? Is such an approach potentially applicable to other languages not
tested in this research?
Ekman and Friesen (1971) found that particular facial behaviours are universally associated
with particular emotions regardless of ethnic or cultural background. The existence
of cross-cultural constants in emotional behaviour suggests that similar constants may be
found in language. This was studied by Osgood et al. (1975) in 20 different countries with
the help of about 80 anthropologists, psychologists and linguists. The study was done in
the paradigm of semantic space measurement (Osgood et al., 1971; Osgood, 1976). The
authors' general objective was to demonstrate that three affective dimensions of meaning
(Evaluation, Potency, and Activity, or E-P-A) are in fact pancultural. In particular, they
found that the two most common modes of affect qualification across the
world are GOOD and BIG (or some close synonym). They ranked the qualifiers found in
each ethno-linguistic community in terms of both frequency and diversity of usage (i.e.
productivity) and then correlated rankings in terms of translation equivalents, and found
sizable and significant relationships. Osgood et al. (1975) concluded that "Human beings,
no matter where they live or what language they speak, apparently abstract about the
same properties of things for making comparisons, and they order these different modes
of qualifying in roughly the same way in importance".
These findings suggest that a unified approach to sentiment analysis across multiple
languages is in principle well-founded, providing a solid basis for the work presented in
this thesis.
2.2 Sentiment Analysis
Sentiment analysis has been a popular research topic in recent years and has evolved
into a big and diverse research field. A number of approaches have been used to create
new research prototypes and applied sentiment analysis systems. This section surveys the
various tasks in sentiment analysis and the methods utilised to perform them.
2.2.1 Tasks
There are four main tasks that are tackled in present-day sentiment analysis research:
subjectivity analysis, sentiment classification, opinion summarisation, and opinion extraction
and mining.
Subjectivity Analysis
Subjectivity analysis, as indicated in Chapter 1, aims to distinguish subjective text (documents,
sentences) from factual text. Subjective texts are those that express private states,
which distinguishes them from objective (factual) texts, which express only objective
information, or facts.
Subjectivity analysis is a difficult task. The difficulty is mostly caused by the nature of
the private states that subjectivity analysis deals with. The subjective or objective nature of
a text is hardly ever stated explicitly (Wiebe, 1994), which complicates automatic processing
of information that contains private states. Another challenging aspect of subjectivity
analysis is that documents are almost never entirely objective or entirely subjective. Even
a single sentence may contain factual information and some subjective evaluation of it.
However, a number of studies demonstrate reasonable success in subjectivity analysis.
A widely used technique in NLP, supervised machine learning, is often applied to
subjectivity classification. Yu and Hatzivassiloglou (2003) describe document-level
classification of news items using a Naive Bayes classifier. Their research also investigated three
approaches to identifying subjective sentences. The first was based on a hypothesis that,
within a given topic, opinion sentences will be more similar to other opinion sentences than
to factual sentences. The second used a Naive Bayes classifier trained on documents that
were supposed to be subjective (e.g. editorials). The features included words, bigrams,
and trigrams, as well as the parts of speech in each sentence. Thirdly, the authors applied
an algorithm using multiple classifiers, each relying on a different subset of the features.
The study found that the Naive Bayes classifier proved to be the most effective tool for
sentiment classification, with multiple classifiers slightly increasing performance. Wilson et al.
[...]
on a three-way classification (positive, negative and neutral) was proposed by Koppel and
Schler (2006), who stressed the importance of the neutral class for sentiment classification.
Sentiment and Subjectivity. Pang and Lee (2004) propose a supervised machine-learning
method of determining polarity that applies text-categorization techniques to
subjective portions of a document only. These portions are extracted using minimum cuts
in graphs. The idea of minimum cuts is inspired by the observation that text spans occurring
near each other (within discourse boundaries) may share the same subjectivity status,
other things being equal (Wiebe, 1994). Pang and Lee found that subjectivity detection
can compress reviews into much shorter extracts that still retain polarity information at
a level comparable to that of the full review. These extracts can be used for polarity
classification, which improves accuracy (from 82% for full reviews to 86%), suggesting that
they are not only shorter, but also "cleaner" representations of document polarity.
The role of neutral (objective) text in sentiment classification was studied by Koppel
and Schler (2006). The authors showed that in learning polarity, neutral examples cannot
be ignored. Using only negative and positive training examples does not permit accurate
classification of neutral examples. Moreover, better distinction between positive and
negative examples can be achieved using neutral training examples. Properly combining
pairwise learned classifiers leads to extremely significant improvement in overall classification
accuracy. But the combination of the classifiers depends on the nature of the corpus,
more specifically on the nature of the neutral documents in the corpus: whether they are
truly neutral or in fact balanced (containing both sentiments).
Supervised Sentiment Classification. Sentiment can be expressed in numerous ways
and some studies have investigated what parts of the language are the most important
for detecting sentiments. For example, Alm et al. (2005) ran supervised machine learning
experiments on recognizing emotional passages and determining their valence (i.e.
positive versus negative) in a corpus of children's stories.
The authors used a very large set comprising 14 different kinds of features, including word
lists and syntactic, story-related and orthographic features, conjunctions, and content BOW
("bag-of-words"), some of which were found automatically, some manually.
Another type of feature was used by Whitelaw et al. (2005b). They used adjectival
appraisal groups as features for supervised sentiment classification of film reviews. The
appraisal groups, coherent groups of words that together express a particular attitude, are
part of a full appraisal expression as defined in Appraisal Theory (Martin and White, 2005).
The list of appraisal groups was produced semi-automatically, and manually modified to
filter out noise. In total, 1329 terms were produced from 400 seed terms.
Other studies have experimented not only with different features but also with various
machine learning classifiers (most notably Support Vector Machines, Naive Bayes,
and Maximum Entropy) and their combinations. Das and Chen (2007) tried a classifier
voting technique for extracting small investor sentiment (buy, sell or hold) from stock
message boards. Their approach was based on voting amongst five classifiers: a naive
classifier (simply counting words with positive or negative meaning), a vector distance classifier
(a standard vector-based approach), a discriminant-based classifier (counting discriminant
scores of each word), an adjective-adverb phrase classifier (counting only noun phrases with
adjectives or adverbs) and a Naive Bayes classifier. The features were a hand-picked collection
of finance domain words. In particular, they observed that the Naive Bayes classifier
performed quite well, producing fewer false positives.
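The voting idea can be illustrated with a minimal sketch. The three toy classifiers and the word lists below are hypothetical stand-ins rather than the five classifiers of Das and Chen (2007), and the rule of falling back to 'hold' when no label wins a strict majority is an assumption made purely for illustration.

```python
from collections import Counter

# Hypothetical toy lexicons; not the hand-picked finance lexicon of the original study.
BUY_WORDS = {"bullish", "buy", "upgrade", "strong"}
SELL_WORDS = {"bearish", "sell", "downgrade", "weak"}


def count_classifier(tokens):
    """Label by comparing raw counts of buy-words and sell-words."""
    buy = sum(t in BUY_WORDS for t in tokens)
    sell = sum(t in SELL_WORDS for t in tokens)
    if buy > sell:
        return "buy"
    if sell > buy:
        return "sell"
    return "hold"


def first_cue_classifier(tokens):
    """Label by the first sentiment-bearing word encountered."""
    for t in tokens:
        if t in BUY_WORDS:
            return "buy"
        if t in SELL_WORDS:
            return "sell"
    return "hold"


def last_cue_classifier(tokens):
    """Label by the last sentiment-bearing word encountered."""
    return first_cue_classifier(list(reversed(tokens)))


def vote(tokens, classifiers):
    """Return the label chosen by a strict majority of classifiers, else 'hold'."""
    ballots = Counter(clf(tokens) for clf in classifiers)
    label, n = ballots.most_common(1)[0]
    return label if n > len(classifiers) / 2 else "hold"


CLASSIFIERS = [count_classifier, first_cue_classifier, last_cue_classifier]
message = "analysts are bullish despite one weak quarter so buy".split()
print(vote(message, CLASSIFIERS))
```

Requiring a strict majority is one conservative way to trade recall for fewer false positives; other combination rules are equally possible.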
Sentiment Classification and Linguistics. A more linguistically driven approach was
investigated by Eriksson (2006), who explored a linguistic method that facilitates sentiment
analysis by using more information from a text than traditional methods based on machine
learning. Eriksson's Linguistic Tree Transformation Algorithm is designed to exploit
the syntactic dependencies between words in a sentence and to disambiguate word senses.
Another technique introduced by Eriksson is an objective sentence removal algorithm.
The approach specifically addresses two major problems in the area of sentiment analysis,
the non-local dependencies problem and the word-sense disambiguation problem. The
Linguistic Tree Transform Algorithm uses parsing to find all bigrams (mostly adjective-noun
phrases) relevant to the sentiment analysis task, while filtering out all irrelevant
ones. Then an Objective Sentence Removal Algorithm filters out all sentences that do not
contain topic words of interest (such as, for film reviews, the names of the films, directors
and screenwriters or some topic-related nouns). The algorithm is based on the assumption
that some prior knowledge in this domain is readily available for automatic processing.
These two algorithms produce a pruned version of the initial corpus containing only opinionated
sentences relevant to the topic (for example, plot descriptions are removed). 100%
accuracy is reported for the experiments with a frequency SVM model run on the data
produced by the two algorithms.
Linguistically-motivated features help improve existing state-of-the-art sentiment
classification results in a task of detecting implicit sentiment, a novel vision of sentiment
classification proposed by Greene and Resnik (2009). Obviously implicit sentiment cannot
be detected by traditional indicators, such as words. This enabled the authors to
investigate the syntactic "packaging" of ideas, studied previously by Greene (2007).
Opinion Summarisation
Opinion Summarisation aims to aggregate opinions on a given topic from multiple documents
(probably from different sources) rather than classifying individual documents.
Most approaches start with finding documents relevant to the topic and then classifying
retrieved documents according to their sentiment. The topic might be found automatically
from a set of documents (Hu and Liu, 2004; Chen et al., 2005; Feiguina and Lapalme,
2007) or given as a query (Eguchi and Lavrenko, 2006). The latter approach is close to
opinionated information retrieval as it ranks documents or sentences according to both
topic and sentiment relevance.
Some approaches use a variety of tools for opinion summarisation. In the domain
of film review summarisation, Zhuang et al. (2006) describe a multi-knowledge based
approach that uses WordNet, movie casts and labelled training data (1100 reviews), as
well as grammatical rules linking feature words and opinion words.
Ku et al. (2006b) present a comprehensive system that summarises web blogs on a
given topic (e.g. animal cloning). The summary is then presented as representative
sentences augmented by an opinionated curve showing the degree of support and non-support
along the time-line. The authors use a multi-level (word, sentence, document)
sentiment classification system for detecting opinion direction.
Opinion summarisation can be combined with other techniques to produce an all-round
practical application. Liu et al. (2005) describe a system called Opinion Observer which is
capable of semi-automatic sentiment extraction, sentiment summarisation and visualisation.
The system is able to compare sentiments about different products. The system is based
on supervised rule discovery from a hand-labelled training corpus.
Opinion Extraction and Mining
Opinion extraction and opinion mining (the two terms are commonly used interchangeably)
are concerned with extraction of certain aspects of opinion. One such aspect is the
opinion holder (a person or a group that expresses an opinion) and another is the opinion
target (something which is being discussed or evaluated). Feature-based opinion mining
aims to find opinions about particular features of a product or service (as opposed to an
overall opinion about something).
Opinion Holder Extraction. There are two main types of approach to opinion holder
extraction: one based on machine learning and the other using knowledge-based techniques.
An example of the first type is presented by Kim and Hovy (2006), who used
a machine learning technique for opinion holder extraction. As features for their Maximum
Entropy classifier they used selected structural features from a deep parse, based
on a frame representation of opinionated expressions. The frame was built around an
opinion word, with semantic relations between it and opinion holder and target derived
from semantic role labelling within the frames. Choi et al. (2005) consider opinion holder
extraction to be an information extraction task and use a combination of two techniques:
named entity recognition (by training Conditional Random Fields) and information extraction
(AutoSlog, a supervised extraction pattern learner). The former models source
identification as a sequence-tagging task; the latter learns extraction patterns.
Knowledge-based approaches utilise hand-built lexicons, parsing, heuristics and ontologies.
For example, Bloom et al. (2007) describe an opinion holder extraction approach
based on a hand-built lexicon, a combination of heuristic shallow parsing and dependency
parsing, and expectation-maximization word sense disambiguation; they match phrases in
the text with domain-dependent holder type taxonomies.
Kim et al. (2008) exploited a set of communication and appraisal verbs, SentiWordNet,
a named entity recognizer, and a syntactic parser for opinion holder extraction. In each
sentence they looked for the most opinionated word and then ascended the tree to its
first ancestor node with verbal part of speech, and looked for its subject (a noun phrase)
which was assumed to contain opinion holder candidates. If a subject was not found,
then 'author' was set as the opinion holder of the sentence. If a subject was found, then
from the NP chunk, any named entities or opinion holder candidates were extracted as
the opinion holder. If no named entity or opinion holder candidate was found, then the
holder again defaulted to the 'author' of the document. Regardless of the previous step, if
a sentence included quotation marks, then the speaker of the quote was extracted as the
opinion holder.
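The fallback rules described for Kim et al. (2008) can be sketched as a small decision procedure. This is only an illustration under stated assumptions: the `ParsedSentence` container is hypothetical, it stands in for the output of the parser and named entity recognizer (which are not implemented here), and entities are treated as single tokens for simplicity.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ParsedSentence:
    """Hypothetical container for the outputs of a parser and NE recognizer."""
    subject_np: Optional[List[str]] = None   # subject of the verb above the opinion word
    named_entities: List[str] = field(default_factory=list)
    holder_candidates: List[str] = field(default_factory=list)
    quote_speaker: Optional[str] = None      # set when the sentence contains a quotation


def opinion_holders(sent: ParsedSentence) -> List[str]:
    """Apply the fallback rules in the order described in the text."""
    # A quotation overrides the outcome of every other step.
    if sent.quote_speaker is not None:
        return [sent.quote_speaker]
    # No subject found: default to the author.
    if not sent.subject_np:
        return ["author"]
    # Keep named entities or holder candidates that occur inside the subject NP.
    known = set(sent.named_entities) | set(sent.holder_candidates)
    found = [tok for tok in sent.subject_np if tok in known]
    return found if found else ["author"]


example = ParsedSentence(subject_np=["the", "critics"], holder_candidates=["critics"])
print(opinion_holders(example))
```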
Kim and Hovy (2004) present a system that combines sentiment summarisation and
opinion mining: it finds people who expressed an opinion on a given topic as well as the
orientation of the opinion. The system operates in four steps. First it selects sentences that contain
both the topic phrase and holder candidates, found by means of BBN's named entity
tagger. Next, it delimits the holder-phrase region. Then the sentence sentiment classifier
calculates the polarity of all sentiment-bearing words individually. Finally, the system
[...]
on a corpus bootstrapped from a small manually-created corpus. Popescu and Etzioni
(2005) present a system and claim to be the first to report precision and recall on the tasks
of opinion phrase extraction and opinion phrase polarity determination in the context of
known product features and sentences. This system makes intensive use of the knowledge mining
tool KnowItAll, a Web information-extraction system (Etzioni et al., 2005), to extract
product features and opinions regarding them.
Zhang and Varadarajan (2006) identify a new task in opinion extraction: predicting
the utility (or reliability, usefulness, informativeness) of product reviews. Utility is
defined as a multi-aspect feature of customer reviews that combines subjectivity with deep
technical analysis of a product's features. The authors build regression models by incorporating
a diverse set of features including lexical similarity, part of speech tags and lexical
subjectivity clues.
Titov and McDonald (2008) present a novel framework for extracting the features of
objects from online user reviews. They build statistical models to induce multi-grain topics.
The models not only extract features, but also cluster them into coherent topics, e.g.,
waitress and bartender are part of the same topic, staff, for restaurants. This differentiates
the framework from much of the previous work, which extracts aspects through term frequency
analysis with minimal clustering.
Question Answering
Question answering (QA) is a well-established research topic in NLP. A new facet of it is
presented by opinion QA. Yu and Hatzivassiloglou (2003) study separating opinions from
fact, at both the document and sentence level, in the context of QA. Ku et al. (2007a)
define six opinion question types and use an information retrieval system to detect question
focus. The retrieved information is then processed to match the sentiment of the query.
2.2.2 Techniques
Research in sentiment analysis uses a number of techniques, such as supervised machine
learning, rule- and knowledge-based methods, and some others described below.
Supervised Machine Learning
Supervised machine learning is the most frequently used technique in sentiment classification.
To date, the majority of studies have used support vector machines (SVM) and Naive
Bayes (NB). A study of the effectiveness of machine learning techniques was carried out
[...]
languages with scarce resources using on-line dictionaries.
Riloff and Wiebe (2003) describe a semi-supervised technique that learns extraction
patterns from a training corpus produced by high-precision classifiers and then applies the
newly found patterns to find more subjective sentences. The classifiers use a manually
created set of features (words and n-grams) to produce two sets of sentences: objective
and subjective. The two sets are then used by a pattern learner to find patterns that are
mostly used in subjective sentences. The process of learning is based on application of
a large set of syntactic templates to the corpus and extracting all possible patterns that
match the templates. The frequencies of the patterns obtained for each of the classes of
sentences (objective and subjective) are compared and the most subjectivity-associated
patterns are used to enlarge the feature set of the classifiers. In a later study, Wiebe and
Riloff (2005) extend the system by applying machine learning techniques to the extracted
sentences to increase recall.
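The frequency comparison step can be sketched as follows. The sketch assumes that patterns have already been extracted as strings (template matching is not implemented), and the thresholds `min_freq` and `min_prob` and the pattern notation are illustrative assumptions, not values or formats taken from the paper.

```python
from collections import Counter


def select_subjective_patterns(subj_patterns, obj_patterns, min_freq=2, min_prob=0.8):
    """Keep patterns that are frequent and occur mostly in subjective sentences.

    subj_patterns / obj_patterns: lists of pattern strings, one entry per match.
    min_freq and min_prob are illustrative thresholds (assumptions).
    """
    subj = Counter(subj_patterns)
    obj = Counter(obj_patterns)
    selected = {}
    for pattern, s_count in subj.items():
        total = s_count + obj[pattern]
        prob = s_count / total  # estimate of P(subjective | pattern)
        if total >= min_freq and prob >= min_prob:
            selected[pattern] = prob
    return selected


# Made-up pattern matches from the two sentence sets.
subjective = ["<subj> complained", "<subj> complained", "<subj> was asked",
              "expressed <dobj>", "expressed <dobj>", "expressed <dobj>"]
objective = ["<subj> was asked", "<subj> was asked", "expressed <dobj>"]
print(select_subjective_patterns(subjective, objective))
```

Patterns that pass both thresholds would then be added to the classifiers' feature set for the next bootstrapping round.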
Reference Data. A different approach to unsupervised sentiment classification is described
by Ghose et al. (2007). The authors use an economic context to find out what
makes a review positive or negative. The approach is based on the observation that on-line
merchants on eBay with positive feedback can sell products for higher prices than
competitors with negative evaluations. This makes it possible to use techniques from econometrics
to identify the 'economic value of text' and assign a 'dollar value' to each text
snippet, measuring sentiment strength and polarity effectively and without the need for
any annotated resources.
An alternative approach was explored by Read (2009). To find a document's sentiment
orientation, Read compared the document with some prototypes (positive and negative
texts) using their constituents (words and phrases).
Linguistic Resources. Subasic and Huettner (2001) present an approach based on a fusion
of natural-language processing and fuzzy logic techniques for analysing affect content
in free text. The linguistic resource for the approach is a hand-crafted fuzzy affect lexicon,
from which other resources are generated: a fuzzy thesaurus and affect category groups. A
text is tagged with affect categories from the lexicon, and the affect categories' centralities
and intensities are combined using techniques from fuzzy logic to produce affect sets:
fuzzy sets that represent the affect quality of a document.
Zhuang et al. (2006) use WordNet, statistical analysis and movie knowledge for movie
review mining and summarisation.
[...]
convey negative sentiment, while adjective + noun is often used for expressing positive
sentiment. Wiebe et al. (2004) used collocations to identify fixed n-grams, for example:
worst-adj of-prep all-det. They also proposed a generalised version of collocations, where
certain classes of words are represented by a POS-tagged variable. For example, U-adj as-prep
represents a phrase that consists of a unique (occurring only once) adjective and the
preposition 'as'. This generalised collocation matches phrases like 'drastic as', 'perverse
as' and 'predatory as'.
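A minimal sketch of matching a generalised collocation such as U-adj as-prep is given below. The toy corpus, its tags and the function names are made up for the example; the only idea taken from the text is that a word occurring once in the corpus is replaced by the placeholder U before matching.

```python
from collections import Counter


def generalise(tagged_tokens, corpus_counts):
    """Replace each word that occurs once in the corpus by the placeholder 'U'."""
    out = []
    for word, tag in tagged_tokens:
        head = "U" if corpus_counts[word] == 1 else word
        out.append(f"{head}-{tag}")
    return out


def find_collocation(tagged_tokens, corpus_counts, pattern):
    """Return the word sequences whose generalised form equals the pattern."""
    gen = generalise(tagged_tokens, corpus_counts)
    n = len(pattern)
    return [
        " ".join(w for w, _ in tagged_tokens[i:i + n])
        for i in range(len(gen) - n + 1)
        if gen[i:i + n] == pattern
    ]


# Toy POS-tagged corpus (tags and text are made up for the example).
corpus = [("as", "prep"), ("drastic", "adj"), ("as", "prep"), ("it", "pron"),
          ("sounds", "verb"), ("it", "pron"), ("sounds", "verb"),
          ("perverse", "adj"), ("as", "prep")]
counts = Counter(w for w, _ in corpus)
print(find_collocation(corpus, counts, ["U-adj", "as-prep"]))
```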
Gamon (2004) analysed the effectiveness of linguistic features and found that part of
speech trigrams and an NP consisting of a pronoun followed by a punctuation character
were important for sentiment classification of customer reviews.
A broader context was used by Riloff et al. (2003). They created discourse features to
capture the density of sentiment indicators in the text surrounding a sentence. Pang and
Lee (2004) combined traditional bag-of-words features with inter-sentence level contextual
information in a minimum cut formulation.
Stylistic
Some studies have used stylistic attributes for sentiment analysis tasks. Wiebe et al.
(2004) used words that occurred only once (hapax legomena) to improve the accuracy of
subjectivity classification. They observed a significantly higher presence of unique words
in subjective texts compared to objective documents in a Wall Street Journal corpus and
noted that "Apparently, people are creative when they are being opinionated". Gamon
(2004) used the length of constituents (sentence, clauses, adverbial/adjectival phrases, and
noun phrases) for sentiment classification of feedback surveys. Abbasi et al. (2008) used
a wide array of English and Arabic stylistic attributes including lexical, structural, and
function word style markers and reported high accuracy in blog sentiment analysis.
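The hapax legomena observation suggests a very simple stylistic feature, sketched below: the share of a document's tokens that occur only once in the whole collection. The function name and the toy documents are assumptions made for illustration; this is not the feature set used by Wiebe et al. (2004).

```python
from collections import Counter


def hapax_ratio(documents):
    """Per document, the share of tokens that occur once in the whole collection."""
    counts = Counter(tok for doc in documents for tok in doc)
    hapaxes = {tok for tok, c in counts.items() if c == 1}
    return [sum(tok in hapaxes for tok in doc) / len(doc) if doc else 0.0
            for doc in documents]


# Made-up tokenised documents.
docs = [["the", "plot", "is", "gloriously", "bonkers"],
        ["the", "plot", "is", "the", "plot"]]
print(hapax_ratio(docs))
```

A higher ratio would, under the observation quoted above, be weak evidence that a document is subjective.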
Feature Selection
Gamon (2004) describes a series of experiments for determining an optimal set of features
for the supervised sentiment polarity classification task. He tested three kinds of features:
linguistic features, surface features and word n-grams. The first kind was obtained by
means of a tool that provided a phrase structure tree and a logical form for each string.
The second kind consisted of word n-grams, function word frequencies and POS n-grams.
Gamon observed that the presence of very abstract linguistic analysis features improves
the performance of the classifiers and concluded that affect and style are linked in a more
[...]
the most detailed level of annotation.
2.4 Challenges of Sentiment Analysis
The ways in which opinions are expressed vary between languages and also within a
single language (so-called "domain-dependency"). For example, the word horrible, in a
description of the plot of a horror film, does not necessarily bear any sentiment-related
meaning. However, this word is a reliable indicator of negative sentiment in most other
domains (e.g. horrible performance). Turney (2002) observes that "for example, the
adjective 'unpredictable', may have a negative orientation in an automotive review, in a
phrase such as 'unpredictable steering' but it could have a positive orientation in a movie
review, in a phrase such as 'unpredictable plot'". This problem is further complicated
by the ambiguity of word meaning in different contexts, which was studied by Wilson
et al. (2005), who give an example of the word trust:
(1) Philip Clapp, president of the National Environment Trust...
The word trust, which has positive prior polarity, has a neutral meaning in this context since
it is part of a named entity.
Domain-dependency decreases the performance of classifiers trained on, or using data
from, a different domain (Engstrom, 2004). Read (2005) also noted a temporal dependency,
where even in the same domain people use different means of expressing sentiment
over time. A major current challenge is how to automatically extract sentiment information
from documents in different languages and in different domains. Most existing approaches
are based on adapting systems designed for one language (or domain) to another.
Obviously, there are differences between cultures, languages and even within a language
(consider the difference between evaluations of company financial prospects in a business
newspaper and reviews of a hard-rock festival in a participant's blog). Such differences
make adaptation difficult.
2.4.1 Cross-Domain Approaches
Aue and Gamon (2005) try to overcome the problem of domain-dependency of sentiment
analysis by using labelled data from other domains. They investigate and compare
four approaches:
1. training on a mixture of labelled data from other domains where such data are
available;
2. training a classifier as above, but limiting the set of features to those observed in
the target domain;
3. using ensembles of classifiers from domains where there is available labelled data;
4. combining small amounts of labelled data with large amounts of unlabelled data in
the target domain. This approach does not use any out-of-domain data; instead,
it trains a generative Naive Bayes classifier with the Expectation Maximization algorithm.
The four approaches were tested on four different corpora: movie reviews, book reviews,
product support services and knowledge base web survey data. It was found that the
approaches that used some data from the target domain (approaches 3 and 4) performed
better than ones that used only out-of-domain training data (1 and 2). The best accuracy
was achieved by the last approach, which still requires (small) amounts of annotated
in-domain data.
Blitzer et al. (2007) describe another way of overcoming domain-dependency by means
of the adaptation of a classifier trained in one domain to another. The authors raise the
problems of accuracy loss and domain similarity. The main idea underlying the approach
is Structural Correspondence Learning (SCL) developed by the authors in previous papers.
Since the authors use Mutual Information for finding new 'pivot features' in unlabelled
domains, the full name of the approach is SCL-MI. The main intuition is that even when
key opinion words are completely distinct for each domain, if they have high correlation
with excellent and low correlation with awful in unlabelled data, then it is possible to
align them. The approach consists of three steps:
1. Using a labelled corpus from one domain and unlabelled corpora from both a new
domain and the old one, find pivot features which occur frequently in both domains.
2. SCL models the correlations between the pivot features and all other features by
training linear pivot predictors to predict occurrences of each pivot in the unlabelled
data from both domains (Ando and Zhang, 2005; Blitzer et al., 2006). This is based
on the calculation of correlation (MI) of pivot features (such as excellent) and non-pivot
features (like fast, dual-core).
3. For some domains the features found are not well-aligned (thus not good enough for
sentiment classification). To correct misalignment the authors manually label 50 top
[...]
national versions of the WordNet lexicon to identify subjective expressions.
Boiy and Moens (2008) performed a number of machine learning experiments in sentiment
analysis in Dutch, English and French. Although the experiments treated these
languages separately (no specific multi-lingual adaptation techniques were used), they
note language-specific particularities that affect sentiment analysis. The importance of
such language-specific features for multilingual processing is discussed by Bender (2009),
who argues that even approaches encoding little linguistic information can benefit from
language-specific specialisation.
Chapter 3
Features for Chinese Sentiment Classification¹
There are some distinctive characteristics of the Chinese language that are known to affect
language processing. This chapter presents an investigation of these in connection with
sentiment classification. Section 3.1 outlines problems with conceptualising Chinese text as
comprising a sequence of 'words'. In particular, the problem of automatically segmenting
text into words is discussed and tested in an experiment. The difficulty of splitting Chinese
text into words raises the issue of what kind of basic unit of processing to use in sentiment
analysis. Section 3.2 describes the kinds of units to be experimented on and the data for
the experiments as well as basic concepts, algorithms and evaluation metrics. Section
3.3 reports experiments in sentiment classification and discusses the results. Section 3.4
describes extensions to the techniques presented previously and discusses the results. All
the experimental results are summarised in Section 3.5.
3.1 The 'Word' in Chinese Language Processing
One of the central problems in Chinese NLP in general and in Chinese sentiment analysis
in particular is what the basic unit of processing should be. The problem is caused by
a distinctive feature of the Chinese language: the absence of orthographically marked
word boundaries, while it is widely assumed that a word is of extreme importance for
computational language processing. The absence of word delimiters cannot be solved
by simply using dictionary lookup (or any other method) to segment a text into words,
¹ The experiments and part of the discussion in this chapter were presented in a condensed form at the
Student Workshop at the 45th Meeting of the Association for Computational Linguistics and at the 2007
EUROLAN Doctoral Consortium (Zagibalov, 2007a,b).
[...]
Classifier             Accuracy (%)   Precision   Recall   F-Measure
NBm (Segmented)        83.59          0.84        0.84     0.84
NBm (Not segmented)    85.61          0.86        0.86     0.86
SVM (Segmented)        81.67          0.83        0.82     0.82
SVM (Not segmented)    85.50          0.86        0.86     0.86
Table 3.1: Results of sentiment classification of product reviews from the web-site IT168,
with and without segmentation
3.2 Words and Characters as Features for Sentiment Classification
In the absence of preliminary word segmentation, there are two possible types of feature
that could be used in Chinese sentiment classification: (vocabulary) words⁸ and characters.
This section reports experiments into these two types. The experiments evaluate various
techniques that can facilitate classification, including a simple negation check, as there is
no general agreement as to whether this feature is useful for sentiment classification. This
section also describes and tests an approach which divides the text into zones.
Processing based on words and on characters is tested separately and in combination.
The latter approach is inspired by results published by Nie et al. (2000), who found that
for Chinese processing (IR in particular) the most effective kinds of features were a combination
of dictionary look-up (using the longest-match algorithm) together with single-character
unigrams. Yuen et al. (2004) showed that Chinese characters constitute a distinct
sub-lexical unit which, though having a smaller number of distinct types, has greater
linguistic significance than words. Their experiments on sentiment classification of words
by means of characters proved to be effective, achieving a precision of 80.23% and a recall
of 85.03% with only 20 characters.
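The combination of longest-match dictionary look-up with single-character unigrams can be sketched as follows. This is a simplified illustration, not the procedure of Nie et al. (2000) or of this thesis: the function name, the `max_len` limit and the toy dictionary are assumptions.

```python
def longest_match_features(text, dictionary, max_len=4):
    """Combine longest-match dictionary items with single-character unigrams."""
    features = list(text)  # every character is a unigram feature
    i = 0
    while i < len(text):
        # Try the longest candidate first, down to two characters.
        for n in range(min(max_len, len(text) - i), 1, -1):
            if text[i:i + n] in dictionary:
                features.append(text[i:i + n])
                i += n
                break
        else:
            i += 1  # no dictionary item starts at this position
    return features


# Toy dictionary: 屏幕 (screen), 质量 (quality), 不好 (not good).
toy_dictionary = {"屏幕", "质量", "不好"}
print(longest_match_features("屏幕质量不好", toy_dictionary))
```

Because every character is always emitted as a unigram, text that the dictionary does not cover still contributes features.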
3.2.1 Basic Concepts
To introduce the approach I present some definitions of the concepts that are used in the
experiments.
⁸ The notion of word used is that of Vocabulary Word as defined by Li (2000), being the set of vocabulary
items listed in a dictionary.
Basic Units
A basic unit is the smallest linguistic unit used for processing. In this chapter I experiment
with two kinds of basic units: words and characters.
• Word. Noting the theoretical and practical difficulty of word segmentation in the
Chinese language, I use the notion of 'vocabulary word', which is any sequence of
characters that forms a vocabulary item in the NTU sentiment dictionary. To avoid
confusion, I will also use the term 'dictionary item' (DI) as a synonym of 'vocabulary
word'.
• Character. A character is any Chinese character (hieroglyph), excluding punctuation
marks and other symbols (stars, bullet points etc.).
Classification Units
A classification unit is a contiguous segment of a document and can be either of the basic
units or a larger unit, as indicated below.
• Unigram A unigram is a classification unit that consists of a single instance of a basic
unit.
• Zone A zone is a classification unit that includes one or more basic units and is usually
a sub-sentence unit. Zones are delimited by any non-character symbol (comma,
full stop, semicolon, quotation marks etc.). If a sentence does not have any delimiters
except for the final full stop, the whole sentence is a zone. The idea of using zones
for classification comes from the observations that sentiment classification benefits
from consideration of word context, but that sentences may contain two or more
opposite sentiments. Thus I decided to include a unit that is usually longer than a
word but smaller than a sentence.
• Sentence A sentence is a sequence of basic units that ends with a full stop, question
mark, exclamation mark or similar symbol that usually marks the end of a sentence.
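As an illustration of the unit definitions above, the following minimal Python sketch splits a text into sentences and zones. The set of sentence-final marks and the Unicode range used to recognise Chinese characters are assumptions made for this example, not the exact symbol inventory used in the experiments; the function names are hypothetical.

```python
import re

# Assumption: an illustrative set of sentence-final marks.
SENTENCE_END = "。！？!?"
# Assumption: "character" means a code point in the CJK Unified Ideographs block;
# everything else (punctuation, digits, Latin letters) acts as a zone delimiter.
CJK = r"\u4e00-\u9fff"


def split_sentences(text):
    """Split text after each sentence-final mark, keeping the mark."""
    parts = re.split(f"(?<=[{SENTENCE_END}])", text)
    return [p for p in parts if p.strip()]


def split_zones(text):
    """Return zones: maximal runs of Chinese characters between delimiters."""
    return re.findall(f"[{CJK}]+", text)


def unigrams(zone):
    """Character unigrams of a zone (one basic unit each)."""
    return list(zone)
```

For example, `split_zones("外观很好，但是电池不行。")` yields two zones, one on each side of the comma, so that opposite sentiments within one sentence can be scored separately.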
Frequency
The sentiment score (see below) is based on a basic unit's relative (normalised) frequency:
\[ F_a = \frac{N_a}{N} \tag{3.1} \]
where $N_a$ is the number of times $a$ occurred in a collection of documents and $N$ is the
total number of basic units (lexical units or characters, as appropriate) in the collection
of documents.
Sentiment Score
Each word (dictionary item) occurring in the positive side of the dictionary is assigned a
positive sentiment score of 1 and a negative sentiment score of 0, and vice versa for words in
the negative side.
• Word Score The unsupervised approach does not presuppose obtaining any data from
the test corpus. So initially all the words had a score of 1 for the class (sentiment) in
which they are present and 0 for the class in which they are not present.
• Character Scores The characters for the experiments are extracted from the NTU
sentiment dictionary. Most of the characters occur in both sides of the dictionary:
positive and negative. The score for a character $a$ with respect to sentiment $i$ (positive
or negative) is:
\[ S_{a_i} = \frac{F_i}{F_j} \tag{3.2} \]
where $F_i$ is the unit's frequency in a document collection of sentiment $i$, and $F_j$ is the
character's relative frequency in the opposite side of the dictionary.
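The frequency and character-score definitions (Equations 3.1 and 3.2) can be sketched as follows. This is a minimal illustration, assuming that both frequencies are computed over the characters of the dictionary entries on each side; the function names are hypothetical, and the handling of a zero denominator (an infinite score) is an assumption, since the text does not specify it.

```python
from collections import Counter


def relative_frequencies(units):
    """Eq. 3.1: F_a = N_a / N for every basic unit a in the sequence."""
    counts = Counter(units)
    total = sum(counts.values())
    return {unit: n / total for unit, n in counts.items()}


def character_scores(positive_entries, negative_entries):
    """Eq. 3.2: return {char: (positive score, negative score)}."""
    f_pos = relative_frequencies("".join(positive_entries))
    f_neg = relative_frequencies("".join(negative_entries))
    scores = {}
    for ch in set(f_pos) | set(f_neg):
        fp, fn = f_pos.get(ch, 0.0), f_neg.get(ch, 0.0)
        # Assumption: a character absent from the opposite side gets an
        # infinite ratio; the source does not define this case.
        s_pos = fp / fn if fn else float("inf")
        s_neg = fn / fp if fp else float("inf")
        scores[ch] = (s_pos, s_neg)
    return scores
```

With the toy entries `["好", "好用"]` (positive) and `["不好", "差"]` (negative), the character 好 is twice as frequent on the positive side as on the negative side, so its positive score is 2 and its negative score is 0.5.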
The experiments also test modified sentiment scores: scores with a low or zero
frequency `penalty' and presence-based binary scores. Apart from the sentiment
score as described above, the experiments test four score modifications:⁹
1. All characters were assigned the basic scores based on the relative frequency
calculations, but if $S_{a_i} < 1$, then $S'_{a_i} = S_{a_i} - 1$. The intuition is that if a
character is less frequent in one side of the dictionary than in the other, then
it should be `penalised' by being assigned a negative score.
2. If $S_{a_i} > 0$, then $S'_{a_i} = 1$. This score is based on the presence of a character in the
relevant side of the dictionary, regardless of its frequency.
3. If $S_{a_i} \geq 1$, then $S'_{a_i} = 1$, else $S'_{a_i} = 0$. This score is a binary version of the
basic score.
⁹ In the experiments the score modifications are represented by the numbers 1, 2, 3, 4.
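The three modifications listed above can be written as a small function. This is a sketch under stated assumptions: the comparison in modification 3 is taken to be "greater than or equal to" (the symbol is damaged in the extracted text), modification 2 leaves a zero score unchanged, and modification 4 is not implemented because its definition does not appear in this excerpt. The function name is hypothetical.

```python
def modify_score(score, modification):
    """Apply score modification 1, 2 or 3 to a basic character score."""
    if modification == 1:
        # Penalise characters that are rarer on this side of the dictionary.
        return score - 1 if score < 1 else score
    if modification == 2:
        # Presence-based score: any positive frequency counts as 1.
        # Assumption: a score of 0 (character absent) stays 0.
        return 1 if score > 0 else score
    if modification == 3:
        # Binary version of the basic score (assumed threshold: score >= 1).
        return 1 if score >= 1 else 0
    raise ValueError("only modifications 1-3 are defined in this sketch")
```

For instance, a character with basic score 0.5 receives -0.5 under modification 1, 1 under modification 2 and 0 under modification 3.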
Basic Unit Kinds       Unigram   Zone   Sentence
Chars                  0.68      0.69   0.69
Chars 1                0.66      0.68   0.67
Chars 2                0.52      0.52   0.52
Chars 3                0.68      0.72   0.70
Chars 4                0.70      0.71   0.71
Words                  0.87      0.88   0.88
Words and Chars        0.72      0.72   0.72
Words and Chars 1      0.69      0.70   0.70
Words and Chars 2      0.57      0.58   0.58
Words and Chars 3      0.74      0.76   0.75
Words and Chars 4      0.73      0.73   0.73
Table 3.6: Precision of the unigram, zone-based and sentence-based sentiment classifiers
Words and Characters Words and characters combined performed relatively well,
showing the best features of both: accuracy was never too bad, and coverage
was fairly good. In unigram-based classification, three out of five combinations (with the
basic score and modifications 3 and 4) performed significantly better (at the 99% level) than
the other kinds of basic units, with the highest accuracy of 0.73 (see Table 3.2). The
combination of characters and words was able to classify many more documents than the
word-based classifier (at least 86% against 77%). It is also worth noting that all
character-based classifiers benefited from combination with words and performed better in all the
tests.
Classification Units
Another task of the experiments was to explore the influence of the classification unit
on classification performance. I compared the performance of the classifiers based on
unigrams, zones and sentences.
Unigrams The highest accuracy achieved with unigram-based classification was 0.73
(characters combined with words), the average accuracy was 0.66 (0.67 if the lowest and
the highest results are excluded).
Zones The introduction of zones decreased performance significantly: the highest
accuracy was achieved by the word-based classifier (0.68) and the average accuracy was 0.61.
Sentences The results of sentence-based classification are very close to those of zone-based
classification: the average was 0.62 with the top result being 0.67.
The results obtained from the experiments indicate that the best classifier is one based
on the combination of words and characters. It is also possible to conclude that scoring
based on normalised frequency is better for Chinese sentiment classification than a binary
score. The presence-based binary score is not suitable for character-based classification,
but performs well with words. The results also suggest that for sentiment classification
a unigram-based approach is the best.
3.4 Sentiment Score Extensions
Although the preliminary experiments reported above produced some promising results,
the characteristics of sentiment, and language more generally, suggest some possible
extensions to the techniques which might lead to improved results. The extensions include
score calculation adjustments for negation, the degree of skew in the input data and basic
unit length. This section presents the results of the experiments carried out using the same
classifier as above (see Algorithm 1 and Algorithm 2) with the only difference being in the
score calculation.
3.4.1 Negation Check
Negation plays an important role in language. It is also important in evaluative language,
as good and not good express different sentiments in most contexts. Most researchers agree
that including information about negation improves sentiment classification accuracy, but
detecting and integrating this information may be a difficult task (see Section 2.2.2). In
this study the negation check is a very simple routine, based on regular expression patterns,
to find out whether a word or a character is preceded by a negation up to 2 characters earlier.
If a negation is found the score is multiplied by -1:
\[ S'_a = S_a \times (-1) \tag{3.3} \]
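A minimal sketch of such a negation check is given below. The negation characters are an assumed, illustrative set (the actual patterns used in the experiments are not listed here), and the function names are hypothetical.

```python
import re

# Assumption: an illustrative set of Chinese negation characters.
NEGATIONS = "不没"


def is_negated(text, start):
    """True if a negation occurs within the 2 characters preceding `start`."""
    window = text[max(0, start - 2):start]
    return re.search(f"[{NEGATIONS}]", window) is not None


def scored_occurrences(text, unit, score):
    """Return (position, score) pairs for each occurrence of `unit`,
    multiplying the score by -1 when the occurrence is negated (Eq. 3.3)."""
    return [
        (m.start(), -score if is_negated(text, m.start()) else score)
        for m in re.finditer(re.escape(unit), text)
    ]
```

Under these assumptions, 好 in 不太好 is treated as negated because the negation falls inside the two-character window, whereas 好 in 很好 keeps its original score.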
                       Accuracy
Basic Unit Kinds       Overall   Positive   Negative   Precision   Coverage
Chars                  0.66      0.73       0.58       0.75        0.88
Chars 1                0.67      0.81       0.53       0.76        0.88
Chars 2                0.48      0.02       0.93       0.51        0.93
Chars 3                0.66      0.55       0.78       0.76        0.87
Chars 4                0.67      0.67       0.68       0.76        0.88
Words                  0.72      0.71       0.72       0.90        0.79
Words and Chars        0.69      0.74       0.64       0.78        0.89
Words and Chars 1      0.69      0.81       0.57       0.78        0.89
Words and Chars 2      0.54      0.12       0.95       0.59        0.91
Words and Chars 3      0.71      0.60       0.81       0.80        0.88
Words and Chars 4      0.72      0.71       0.72       0.78        0.89
Table 3.8: Results of zone-based sentiment classification with negation
Zone-Based Classification
The zone-based classification results (see Table 3.8) show the same kind of improvement:
all of the classifiers improved their classification on the class on which they performed
worse in the previous experiments (see Table 3.4).
Sentence-Based Classification
Table 3.9 shows significant improvements in sentence-based classification compared to
classification without the negation check.
Overall, the experiments show that negation significantly improved the performance
of all the classifiers (except modification 2) by producing more balanced output. Another
notable difference introduced by the negation check is a significant improvement of the
word-based classifier using zones: in previous experiments this classifier did not show any
significant variation in performance between the various classification settings (see Tables
3.2, 3.4 and 3.5).
                       Accuracy
Basic Unit Kinds       Overall   Positive   Negative   Precision   Coverage
Chars                  0.67      0.77       0.57       0.73        0.92
Chars 1                0.67      0.83       0.51       0.73        0.92
Chars 2                0.47      0.03       0.92       0.51        0.93
Chars 3                0.65      0.52       0.77       0.73        0.88
Chars 4                0.69      0.69       0.68       0.75        0.92
Words                  0.69      0.69       0.69       0.89        0.78
Words and Chars        0.71      0.78       0.63       0.77        0.92
Words and Chars 1      0.70      0.83       0.56       0.75        0.92
Words and Chars 2      0.53      0.13       0.94       0.58        0.91
Words and Chars 3      0.70      0.59       0.81       0.78        0.90
Words and Chars 4      0.72      0.71       0.71       0.77        0.92
Table 3.9: Results of sentence-based sentiment classification with negation
3.4.2 Length Ratio
Unlike characters, words (dictionary items) have different lengths and can capture various
portions of context. For example, if a dictionary item covers most of a phrase, a classifier
can more reliably detect the phrase's sentiment. In the sentence ž(/&{ (It's really neither
fish nor fowl!) there are two matching dictionary items in the
sentiment dictionary: ž( (really) and &{ (neither fish nor fowl). The first item
is in the positive side of the dictionary and the second is in the negative. If a classifier
compares their scores (1 for positive and -1 for negative), then it will not be able to make
any decision, but if it were to compare their lengths (2 and 4) and combine this with their
scores (2 × 1 = 2 and 4 × -1 = -4), the whole sentence would be tagged negative.
A length-sensitive sentiment score can be defined as:
\[ \mathit{Score} = \frac{L_w^2}{L_{cu}} \tag{3.4} \]
where $L_w$ is the length of a word and $L_{cu}$ is the length of the relevant enclosing
classification unit. The numerator $L_w$ is squared to increase the influence of longer units.
Since all characters have length 1, there is no point in testing character-only classifiers
in conjunction with the length ratio.
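The effect of the length ratio can be checked with a short calculation. The sketch below applies Equation 3.4 to the two dictionary items from the example (lengths 2 and 4); the zone length of 7 and the way signed scores are summed are assumptions made for illustration, and the function names are hypothetical.

```python
def length_score(unit_length, zone_length, polarity):
    """Eq. 3.4 with a sign: polarity * L_w^2 / L_cu (polarity is +1 or -1)."""
    return polarity * unit_length ** 2 / zone_length


def zone_sentiment(matches, zone_length):
    """Sum length-sensitive scores of (length, polarity) matches in a zone.

    Assumption: the zone is tagged by the sign of the summed scores.
    """
    total = sum(length_score(length, zone_length, pol) for length, pol in matches)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "undecided"
```

With a positive item of length 2 and a negative item of length 4 in a zone of 7 characters, the scores are 4/7 and -16/7, so the longer negative item dominates and the zone is tagged negative, whereas plain scores of 1 and -1 would cancel out.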
Seeds on their own cannot produce a good classification due to their small number.
Section 4.3 describes a way to overcome this problem by applying an iterative approach.
This section also tests two techniques for increasing the precision of the iterative classifier:
filtering scores of found lexical units, to reduce the number of non-discriminative lexical
units, and using the difference between positive and negative zones to rank classification
results by their reliability. Further classification accuracy improvements are based on extending
the unsupervised classifier with supervised techniques: Naïve Bayes (multinomial) and
Support Vector Machine. The machine-learning extension is based on using classification
data produced by an unsupervised classifier to train supervised classifiers.
Section 4.5 summarises the experimental results described in this Chapter.
4.1 Dictionary Adjustment
A major disadvantage of a generic sentiment dictionary is that it does not take into
account domain-specific ways of expressing sentiments. Quite often the same word might
have opposite meanings in different contexts (e.g. `unpredictable plot' and `unpredictable
steering'). One possible solution is to assign domain-dependent sentiment scores to every
dictionary item. These scores would reflect how an item is connected with sentiment in a
particular domain. This section presents experiments on dictionary adjustment by means
of calculating domain-dependent sentiment scores. The scores can be obtained from a
previously tagged corpus, but such an approach would no longer be unsupervised. To
keep the system unsupervised I used a classifier described in the previous Chapter (Section
3.2.2) to extract a sentiment-classified subcorpus from a raw corpus. The most important
feature of such a subcorpus is precision (provided the recall is high enough) rather than
accuracy. As the experiments described in the previous chapter show, the highest precision
was achieved by a word-based classifier with the negation check and using zones as the
unit of classification. This classifier was used as the basis for the experiments described
in this Chapter.
4.1.1 Adjustment to Corpus
I used the classifier to extract a subcorpus by labelling documents in the raw corpus
according to the classification results. The extracted subcorpus, consisting of 6447 documents
of which 3178 are classified as positive and 3269 are classified as negative, was used as a
training corpus in subsequent experiments. The corpus built using this data did not have
a very high accuracy (0.72), but it was balanced, having a similar number of positive and
                     Accuracy   Precision   Recall   F-measure
Before adjustment    0.72       0.90        0.72     0.80
After adjustment     0.74       0.91        0.74     0.82
Table 4.2: Results of word-based sentiment classification before and after feature adjustment

                     Accuracy   Precision   Recall   F-measure
Before adjustment    0.79       0.79        0.79     0.79
After adjustment     0.83       0.83        0.83     0.83
Table 4.3: Results of combined classifier sentiment classification before and after feature
adjustment

                     Accuracy   Precision   Recall   F-measure
Before adjustment    0.72       0.90        0.72     0.80
After adjustment     0.74       0.91        0.74     0.81
Table 4.4: Average of the results of five runs on a test corpus of the word classifier
sentiment classification before and after feature adjustment
Corpus/product type                                             Number of Reviews
Mobile phones                                                   2317
Digital cameras                                                 1705
MP3 players                                                     779
Monitors                                                        683
Office equipment (copiers, multifunction devices, scanners)     611
Printers (laser, inkjet)                                        569
Computer peripherals (mice, keyboards, speakers)                457
Video cameras and lenses                                        361
Networking (routers, network cards)                             350
Computer parts (CD-drives, motherboards)                        308
Table 4.5: Product types and sizes of the test corpora.
Table 4.4 shows that words with adjusted scores perform slightly better (the improvement
is statistically significant) than without.
4.1.2 Adjustment to Topic
The corpus used in the previous experiments consisted of customer reviews of consumer
electronics of different kinds. This provides an opportunity to split the corpus into
different topic-based subcorpora (topics for short) and test the ability of the approach to
find topic-dependent scores for the items in the sentiment dictionary. The experiments
presented below used the same corpus as described in Section 3.1.2, but in order to
extract domain-specific scores, the corpus was split into 10 topics (see Table 4.5).
Five of the corpora combine smaller ones of 100–250 reviews each (as indicated in
parentheses in Table 4.5) in order to have reasonable amounts of data in each. Each
corpus has equal numbers of positive and negative reviews so that it is possible to derive
strong comparator accuracy figures by applying supervised classifiers³ (studying the effect
of skewed class distributions is out of the scope of this study).
Table 4.6 compares the results of two classifications. The left side of the table presents
the results of classification using the sentiment dictionary without any topic-specific
adjustment. The right side contains results of classification using the same dictionary but
with scores calculated on the basis of the extracted subset of documents. Although all
the results are significantly different (in terms of the paired t-test), there is only a slight
increase in recall at the expense of precision.

                        No Scores             Scores
Corpus                  P      R      F       P      R      F
Mobile phones           0.87   0.71   0.78    0.87   0.72   0.79
Digital cameras         0.88   0.63   0.74    0.87   0.64   0.74
MP3 players             0.90   0.71   0.79    0.89   0.72   0.80
Monitors                0.87   0.71   0.78    0.87   0.74   0.80
Office equipment        0.90   0.72   0.80    0.87   0.74   0.80
Printers                0.90   0.71   0.79    0.88   0.71   0.79
Computer peripherals    0.93   0.79   0.85    0.91   0.81   0.86
Video                   0.90   0.75   0.82    0.86   0.73   0.79
Networking              0.85   0.65   0.74    0.83   0.68   0.74
Computer parts          0.84   0.65   0.73    0.82   0.62   0.71
Macroaverage            0.88   0.70   0.78    0.87   0.71   0.78
Table 4.6: Classification results of different topics with the sentiment vocabulary with
(Scores) and without topic-adjusted scores (No Scores). P is precision, R is recall, F is
F-measure. The difference in the results for all corpora is statistically significant.
³ This corpus is publicly available at http://www.informatics.sussex.ac.uk/users/tz21/
4.1.3 Discussion
Calculating domain-specific scores for lexical items improved performance across the
corpus but only marginally altered the results of classification of the same corpus split into
separate topics. This may be due to the generic nature of the dictionary: it contains only
generic indicators of sentiment and is missing a lot of domain- and topic-specific ones.
Thus a larger corpus has a better chance of improving performance with this generic
sentiment dictionary, as its items occur more frequently than in a small corpus. But if the
same collection is split into topical corpora where the role of domain-relevant words is
more important (the smaller the collection, the more important every lexical unit becomes),
then a generic dictionary fails to improve even after being adjusted with domain-related
scores. Another important feature of a sentiment corpus is its topical coherence. The more
closely related (in terms of the topic) documents are, the more important topic-related
words may be and the smaller the improvement one can expect with a generic sentiment
dictionary. This explains why the generic dictionary performed better on a more generic
corpus compared to the smaller, more topic-oriented collections extracted from it.
4.2 Vocabulary Extraction
The experiments in the previous section suggest that a generic sentiment dictionary has
limited potential to improve performance even with domain-specific scores used for
adjustment of the dictionary item scores. If it is not possible to substantially increase
performance by adjusting an existing generic dictionary then the next possibility to explore
is creating domain-specific vocabularies.
4.2.1 Seed-Based Approach
Although the experiments described above suggest that classification results can potentially
be improved by adjusting the vocabulary to the domain, the inflexibility of the
precompiled vocabulary prevents it from full adjustment to a domain. Moreover, the
vocabulary-based approach prevents a system from being multilingual, as the very need
for a comprehensive dictionary inevitably makes the system language-dependent. Another
problem of the dictionary-based approach is that it is virtually impossible to include all
important domain-related words. One way to solve the problem may be to find
domain-related lexical units in a subcorpus which was extracted by an unsupervised classifier
and to calculate their sentiment scores for a given topic. This would pave the way to creating
a domain-specific vocabulary to be used for classification. But this technique requires
extraction of a subcorpus from a corpus to be classified so that words can be extracted
from it and scores calculated for them. Such a subcorpus is a product of classification
that needs some input data to start with. This input could be several lexical units (seeds)
used for initial classification and extraction of the subcorpus.
Seeds
The experiments below test a number of seeds, which were selected intuitively without
any special preliminary study of their potential effectiveness for the task of sentiment
classification. This approach is justified by the unsupervised paradigm of the research, as
any `learned' data would contradict it. Two types of seed word lists were investigated: six
one-word seed lists (see Table 4.7) and three multi-word seed lists consisting of the single
seeds in various combinations (see Table 4.8). All the seeds had their sentiment scores
set to 1 and the classifier was run with the seed lists taking the place of the sentiment
                            good                   allPOS                 all
Corpus                      P      R      F        P      R      F        P      R      F
Mobile phones               0.77   0.27   0.40     0.81   0.32   0.46     0.85   0.41   0.55
Digital cameras             0.76   0.19   0.30     0.80   0.24   0.37     0.86   0.35   0.50
MP3 players                 0.77   0.21   0.33     0.83   0.28   0.42     0.88   0.35   0.50
Monitors                    0.68   0.22   0.34     0.73   0.28   0.41     0.79   0.34   0.47
Office equipment            0.81   0.22   0.35     0.86   0.31   0.45     0.89   0.39   0.55
Printers                    0.76   0.20   0.31     0.80   0.27   0.40     0.86   0.33   0.48
Computer peripherals        0.71   0.24   0.36     0.75   0.30   0.43     0.79   0.35   0.48
Video cameras and lenses    0.75   0.19   0.31     0.82   0.29   0.43     0.87   0.36   0.51
Networking                  0.63   0.21   0.31     0.67   0.25   0.37     0.75   0.31   0.44
Computer parts              0.69   0.18   0.28     0.73   0.21   0.32     0.81   0.30   0.44
Macroaverage                0.73   0.21   0.33     0.78   0.28   0.41     0.84   0.35   0.49
Difference                  -0.02  -0.02  -0.02    -0.02  -0.01  -0.01    -0.01  -0.02  -0.02
Table 4.10: Classification results with the seed good, and seed lists allPOS and all. P
is precision, R is recall, F is F-measure. Difference shows the change in performance
compared with the corpus-wise classification (see Table 4.9). The differences in the results
for all seed lists are statistically significant.
Lexical Unit
As discussed in the previous chapter (Section 3.1.1), the concept of `word' is problematic
in Chinese NLP, and so the term `seed word' is not very accurate since it is not possible to
guarantee that extracted units will always form words in the normally understood sense.
Fortunately, the results of the experiments with different kinds of features (Section 3.5.1)
showed that high accuracy can be achieved by a combination of both words and characters,
which makes it possible not to use words as basic units. Instead, I use lexical units,
which could be any combination of characters constituting parts of words, words or even
phrases. This approach avoids the need for word segmentation, and can also capture some
grammatical and syntactic information, because lexical units can incorporate grammar
words and parts of grammatical constructions. Example (1) shows a combination of two
words that was extracted as one unit. This unit provides a context for each of its two
members and potentially is a better indicator of sentiment than either of them on their
own. The lexical unit in Example (2) consists of two function words, the first being a
grammar word with quite a complex meaning (mostly related to the sentence level) and
the second a modal verb. Separately these two words have no relation to sentiment, but combined
together they are often used to show that something can be easily done or improved, which
relates to sentiment. Example (3) comprises a combination of a negated modal verb with
the first part of a number of words with the meaning "setting up; switching to" (e.g. ¾n
– install, set up; ¾ – set to (some value); ¾:ê¨ – switch to an automatic mode).
Thus the unit is capable of representing a whole set of similar phrases that describe the
inability of a device or a piece of software to perform a certain action, which most probably
expresses negative sentiment. This unit also has the advantage of being more frequent than
any of the full forms. To avoid confusion in what follows I will use the term `lexical unit'
(LU) rather than `word'. In the context of these experiments the term `seed' means an LU
used as a seed.
(1) Â }
appearance good
the appearance is good
(2) 1 ïå
already can
OK; has become possible
(3)
ý ¾
not able set . . .
not able to set . . .
Lexical Unit Extraction To find lexical units that are candidates for being seeds, the
process starts by looking for the longest character sequences that occur in any two zones
across all documents in the corpus (using the Longest Common Substring algorithm).
Although the process is computationally quite expensive, it needs to be run only once.⁵ The
application of this approach to the corpus produced more than 121 thousand lexical units.
The list was filtered to exclude non-character symbols (digits, Latin characters and hyphens,
but other in-word symbols were preserved). To reduce the list, all lexical units that occurred
fewer than 10 times in the corpus were excluded. The final version of the lexical item list
comprised 5492 items.
⁵ If efficiency were to be an issue, the corpus could be represented as a suffix tree to facilitate faster
extraction of lexical units that reoccur.
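The extraction step can be sketched as follows: a standard dynamic-programming longest-common-substring routine is applied to every pair of zones, and candidates that are too rare are dropped. This is a minimal, quadratic illustration rather than the implementation used in the experiments; the function names are hypothetical, and counting occurrences with non-overlapping substring matches is an assumption.

```python
from itertools import combinations


def longest_common_substring(a, b):
    """Return the longest contiguous character sequence shared by a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def extract_lexical_units(zones, min_count=10):
    """Collect pairwise longest common substrings and keep frequent ones."""
    candidates = {longest_common_substring(a, b) for a, b in combinations(zones, 2)}
    candidates.discard("")
    # Assumption: frequency = number of non-overlapping occurrences in all zones.
    counts = {c: sum(zone.count(c) for zone in zones) for c in candidates}
    return {c: n for c, n in counts.items() if n >= min_count}
```

On the toy zones `["外观很好", "外观不错", "屏幕很好"]` with `min_count=2`, the candidates 外观 and 很好 are each found twice and kept. Comparing all zone pairs is expensive for a real corpus, which is why the footnote above points to a suffix tree as a faster alternative.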
Corpus                      Seeds
Monitors                    } (good), ¿ (convenient; cheap), p (clear), ô (straight),
                            ¹¿ (comfortable), á (fill, fulfil), ) (sharp), (comfortable), = (cool)
Video cameras and lenses    p (clear - of sound or image), ¹¿ (comfortable), ž( (practical),
                            ó (perfect), = (cool)
Mobile phones               } (good), / (support), ¿ (convenient; cheap), ¹¿ (comfortable),
                            p (clear - of sound or image), ³ (sufficient), }( (easy to use),
                            (comfortable), º' (user friendly), AE (smooth and easy), Z (distinct),
                            = (cool), }† (has become better), ( (durable), ¹¿„ (comfortable),
                            á„ (satisfied), ” (fit, suit), ¹¿† (has become comfortable),
                            ( (applicable), zK (handy), Ñf (science, scientific)
Digital cameras             } (good), ¿ (convenient; cheap), ¹¿ (comfortable),
                            p (clear - of sound or image), (special), = (cool), á (satisfied),
                            ( (durable), (comfortable), ó (perfect), ž (straight), 3š (stable),
                            ¹¿† (has become comfortable), ¢ (polite), æÆ (detailed)
Networking                  3š (stable)
Printers                    } (good)
MP3 players                 } (good), ¿ (convenient; cheap), ¹¿ (comfortable), ž( (practical),
                            uO (sensitive), (comfortable), = (cool), ¹¿† (has become comfortable)
Computer peripherals        } (good), ¿ (convenient; cheap), ¹¿ (comfortable), Æ (precise),
                            (comfortable), `ï (habitual), AE (smooth and easy), 3š (stable)
Computer parts              } (good), 3š (stable)
Office equipment            } (good), ¹¿ (comfortable), 3š (stable), ž( (practical)
Table 4.11: Seeds automatically identified for each corpus.
                            Only Positive          Pos & Neg              all Seed List
Corpus                      P      R      F        P      R      F        P      R      F
Mobile phones               0.86   0.51   0.64     0.89   0.57   0.70     0.85   0.41   0.55
Digital cameras             0.82   0.35   0.49     0.88   0.45   0.60     0.86   0.35   0.50
MP3 players                 0.83   0.34   0.48     0.87   0.41   0.55     0.88   0.35   0.50
Monitors                    0.74   0.43   0.55     0.80   0.48   0.60     0.79   0.34   0.47
Office equipment            0.86   0.34   0.49     0.90   0.43   0.58     0.89   0.39   0.55
Printers                    0.76   0.20   0.31     0.84   0.26   0.40     0.86   0.33   0.48
Computer peripherals        0.79   0.41   0.54     0.83   0.45   0.58     0.79   0.35   0.48
Video cameras and lenses    0.93   0.28   0.43     0.94   0.37   0.53     0.87   0.36   0.51
Networking                  0.92   0.18   0.30     0.93   0.27   0.42     0.75   0.31   0.44
Computer parts              0.76   0.28   0.41     0.82   0.37   0.51     0.81   0.30   0.44
Macroaverage                0.83   0.33   0.46     0.87   0.41   0.55     0.84   0.35   0.49
Table 4.14: Classification results with only positive extracted seeds (Only Positive), the
same seeds augmented with generic negative seeds (Pos & Neg) and the all seed list (all Seed
List). P is precision, R is recall, F is F-measure. For all corpora the differences between
the results are statistically significant except for those marked.
4.2.3 Iterative Approach
In the context of real-world applications, most of the results presented in the previous
experiments would probably be acceptable in terms of precision; however, they are very
low in recall, especially compared to the vocabulary-based classifier described earlier. This
means that the seeds on their own are not sufficient and the classifier needs more lexical
units with appropriately calculated scores to perform better.
One way of extracting more lexical units from the corpus is to run the classifier iteratively.
Each new iteration uses a subset consisting of classified documents from the corpus
for extracting new lexical units and calculating their scores. The newly found set of lexical
units with scores assigned is then used for creating a new set of classified documents that
form a new subset for the next iteration (see Algorithm 5).
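The loop described above can be outlined in a few lines. The sketch below is not Algorithm 5 itself: the classifier and the unit-extraction step are passed in as functions, the toy versions shown are hypothetical stand-ins that work on whitespace-separated tokens, and stopping when the labels no longer change (or after a fixed number of iterations) is an assumption standing in for the stopping criterion discussed in the text.

```python
def iterate(documents, seeds, classify, extract_units, max_iterations=10):
    """Alternate classification and lexical-unit extraction.

    classify(documents, units) -> {doc_index: label} for classified docs only.
    extract_units(documents, labels) -> {unit: score} from the classified subset.
    """
    units, labels = dict(seeds), {}
    for _ in range(max_iterations):
        new_labels = classify(documents, units)
        if new_labels == labels:  # assumed stopping rule: nothing changed
            break
        labels = new_labels
        units = extract_units(documents, labels)
    return labels, units


def toy_classify(documents, units):
    """Hypothetical classifier: sign of the summed unit scores."""
    labels = {}
    for i, doc in enumerate(documents):
        total = sum(units.get(token, 0) for token in doc.split())
        if total:
            labels[i] = "pos" if total > 0 else "neg"
    return labels


def toy_extract(documents, labels):
    """Hypothetical scoring: +1 per positive document, -1 per negative one."""
    units = {}
    for i, label in labels.items():
        for token in documents[i].split():
            units[token] = units.get(token, 0) + (1 if label == "pos" else -1)
    return units
```

Starting from the seeds `{"good": 1, "bad": -1}` on the toy documents `["good nice", "nice", "bad awful", "awful"]`, the first pass classifies only the two documents containing a seed; the units learned from them (`nice`, `awful`) then allow the remaining two documents to be classified in the second pass, which illustrates how iteration raises recall.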
Iteration Stopping Criterion
An iterative approach requires a way to control the number of iterations. I used a
goal-driven stopping criterion, which means that iterations should stop once the goal is achieved.
        Mobile phones                 Monitors
Iter    P      R      F      C        P      R      F      C
1       0.86   0.41   0.56   1209     0.79   0.34   0.48   386
2       0.87   0.80   0.83   189      0.83   0.76   0.79   57
3       0.86   0.80   0.83   157      0.85   0.80   0.82   34
4       0.85   0.80   0.82   156      0.83   0.79   0.81   33
5       0.85   0.79   0.82   158      0.83   0.80   0.81   29
6       0.85   0.79   0.81   163      0.83   0.79   0.81   29
7       0.84   0.79   0.81   157      0.83   0.80   0.81   31
8       0.84   0.78   0.81   162      0.83   0.80   0.82   30
Table 4.15: Results of sentiment classification of 10 iterations with seed list all applied to
two topics, Mobile phones and Monitors. Iter is the number of iterations, P is precision,
R is recall, F is F-measure; C is the number of documents that were NOT classified.
Classi cation Results: Over the whole Corpus
The next set of experiments tests the performance of the same set of seeds as presented
in Section 4.2.1 on the whole corpus but using the iterative technique. After a number
of iterations the classifier produced good results with positive seeds (see Table 4.16)
compared to the non-iterative classifier (Table 4.9). The most significant progress was made
in overall accuracy of classification, but the results are also less skewed. The best results
were for the seed group all. All the other positive seeds also performed quite well
regardless of how many seeds there were in the list. In contrast, all negative seeds performed
poorly, barely improving over the naïve baseline. The reason for this is a very unbalanced
classification: almost all documents get tagged as positive, which results in near-baseline
performance. The skew towards positive classification (which is not expected from the
negative seeds) is the result of the skew towards negative classifications during the first
iteration: the extracted subcorpus contains many more negative documents compared to
positive ones, which affects extraction of lexical units and score calculation for them. The
system extracts too many negative lexical units with very low scores (because there are
too many documents classified as negative) and several high-frequency supposedly positive
lexical units (with high scores as the number of positive documents is low). This leads
to a skew towards positive classification in subsequent iterations. This suggests that such
classifications should be avoided when the iteration control chooses the best iteration and
Seed list name    P      R      F      Acc    AccP   AccN   Iterations
good              0.79   0.72   0.75   0.72   0.77   0.68   9
very good         0.77   0.71   0.74   0.71   0.74   0.68   12
comfortable       0.78   0.72   0.75   0.72   0.73   0.71   5
bad               0.53   0.50   0.52   0.50   0.94   0.06   2
too bad           0.51   0.49   0.50   0.49   0.98   0.01   2
poor              0.54   0.50   0.52   0.50   0.93   0.07   2
allPOS            0.79   0.72   0.75   0.72   0.77   0.68   10
allNEG            0.55   0.51   0.53   0.51   0.93   0.09   2
all               0.85   0.78   0.81   0.78   0.81   0.75   3
Table 4.16: Results of sentiment classification after iterations. P is precision, R is recall, F
is F-measure; Acc is accuracy, AccP is accuracy of the positive class and AccN is accuracy
of the negative class.
that the iteration control should be extended with a skew-control rule.
Skew Control The motivation behind skew control is to prevent a classifier from
producing highly skewed classifications. To do so, the skew control needs some approximate
`idea' of what a balanced classification is. Such a `gold standard' can be provided by the
first (seed-only) iteration:
\[ G = \frac{\min(C_i, C_j)}{\max(C_i, C_j)} \tag{4.1} \]
where $G$ is the `gold standard' for the balance, and $C_i$ and $C_j$ are the numbers of classified
documents of each class (either positive or negative). During the iterative classification
procedure, if the classification skew deviates from $G$ then the iterations are stopped.
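The balance measure of Equation 4.1 and the skew check can be sketched as below. The `window` parameter corresponds to the tolerance discussed in the following paragraph; treating it as a multiplicative lower bound on the balance relative to $G$ is an assumption made for this illustration, and the function names are hypothetical.

```python
def balance(n_positive, n_negative):
    """Eq. 4.1: ratio of the smaller class size to the larger one."""
    larger = max(n_positive, n_negative)
    return min(n_positive, n_negative) / larger if larger else 0.0


def is_too_skewed(n_positive, n_negative, gold, window=0.5):
    """True if the balance falls below the tolerated fraction of `gold`.

    Assumption: with window=0.5 and gold=1.0, any balance below 0.5 is
    rejected, i.e. the smaller class must be at least half the larger one.
    """
    return balance(n_positive, n_negative) < gold * window
```

For example, if the seed-only iteration yields 100 positive and 100 negative documents, `gold` is 1.0; a later iteration producing 100 positive and 40 negative documents has a balance of 0.4 and would be rejected, while 100 against 50 would still be accepted.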
This means that the skew control uses the balance of the initial classification to compare
with all subsequent classifications. However, if the system uses the exact value of the `gold
standard' (which is likely not to be perfect), then good classifications which are slightly
different in balance will be regarded as skewed and thus ignored. For this reason the system
in fact does not use a strict comparison but instead uses a `window' of 50%. For example,
if the initial iteration classified 100 positive documents and 100 negative documents, then
the `gold standard' would be 1; an acceptable balance should be at least 0.5 (a smaller
class can be half of the size of the bigger one). So if the next classification finds 100
Seed list name   Top 10 words in positive list (English glosses)
good             (control is (easy)), (carefully made), (x optics), ((it) has),
                 (quality is rather good), (x optical zoom), (easy control), (5 inch),
                 (output), (rich in features)
very good        (supplied, provided), (control is (easy)), (carefully made), (optics),
                 (x optical zoom), ((it) has), (quality is rather good), (5 inch),
                 (easy control), DVD+
comfortable      (optics), (x optical zoom), (output), (supplied, provided),
                 ([extrem]ely outstanding), (extremely out[standing]), dpi,
                 (feel comfortable), (outstanding output), (carefully made)
bad              CRT, (these speakers), (during the game), ((of) monitor), (CRT),
                 (subwoofer), ((some) distortion), (in the game),
                 (geometric distortion), (satellite speakers)
too bad          (used), ((it) has), (colour reduction), (visual design), (design),
                 (supplied, provided), ((it) uses), (optical zoom), (rich in features),
                 (control is (easy))
poor             (used), ((it) has), (visual design), (design), (supplied, provided),
                 (optical zoom), (rich in features), (control is (easy)), ((it) uses),
                 (rich in features)
allPOS           (carefully made), (x optics), (x optical zoom),
                 (quality is rather good), (5 inch), (easy control), (output),
                 (good sound quality), (full of features), ((it) has)
allNEG           (x optics), (x optical zoom), (full of features), (feel comfortable),
                 (output), (extremely out[standing]), (control (of)), dpi,
                 ([extr]emely outstanding), (outstanding output)
all              (carefully made), (x optics), (x optical zoom), (supplied, provided),
                 (quality is rather good), (5 inch), (easy control), (full of features),
                 ((it) has), (output)

Table 4.17: Top 10 positive lexical units found on completion of iterations.
                            allPOS              Extracted
Corpus                      P     R     F       P     R     F
Mobile phones               0.82  0.76  0.79    0.86  0.80  0.83
Digital cameras             0.74  0.66  0.70    0.74  0.67  0.70
MP3 players                 0.76  0.71  0.74    0.75  0.70  0.72
Monitors                    0.81  0.77  0.79    0.81  0.78  0.79
Office equipment            0.79  0.71  0.75    0.80  0.73  0.76
Printers                    0.80  0.73  0.76    0.75  0.68  0.72
Computer peripherals        0.61  0.56  0.58    0.61  0.57  0.59
Video cameras and lenses    0.67  0.63  0.65    0.50  0.47  0.48
Networking                  0.68  0.25  0.37    0.81  0.72  0.76
Computer parts              0.55  0.51  0.53    0.50  0.46  0.48
Macroaverage                0.72  0.63  0.67    0.71  0.66  0.68

Table 4.19: Classification results with the allPOS seed list (allPOS) and with only positive
extracted seeds (Extracted). P is precision, R is recall, F is F-measure. Differences between
the two sets of results are statistically significant except for the corpora marked with .
formed better in terms of recall but precision was almost the same as that of the generic
seeds (see Table 4.19). In two topics (Computer parts and Video) the extracted seeds failed
to perform better than the naïve baseline, and the generic seeds failed to do so in the topics
Networking and Computer parts. The result of classification of the topic Networking
illustrates the importance of a seed's domain-relevance: only one extracted seed outperformed
three generic ones. However, in the topics Video and Computer parts the generic seeds
performed better. The performance of the extracted seeds was most probably compromised
by the small size of these two topic corpora (only 361 and 308 documents respectively, see
Table 4.5) and by the fact that the collections combined reviews of related but nevertheless
different items (video cameras and lenses; CD-drives and motherboards). But on a big topic
such as Mobile phones the extracted seeds performed much better, mostly due to a large number
of extracted seeds (21 lexical units, see Table 4.11).
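
As an illustration of how the figures in Table 4.19 relate to each other, the following Python sketch recomputes the F-measure of one row and the Macroaverage row for the allPOS columns. It rests on two assumptions that match the reported numbers but are not stated in the text: F is the balanced harmonic mean of precision and recall, and the macroaverage is the unweighted mean of the per-corpus values. The names `ALLPOS`, `f_measure` and `macroaverage` are hypothetical.

```python
from statistics import mean

# (precision, recall, F) per corpus, copied from the allPOS columns of Table 4.19.
ALLPOS = {
    "Mobile phones": (0.82, 0.76, 0.79),
    "Digital cameras": (0.74, 0.66, 0.70),
    "MP3 players": (0.76, 0.71, 0.74),
    "Monitors": (0.81, 0.77, 0.79),
    "Office equipment": (0.79, 0.71, 0.75),
    "Printers": (0.80, 0.73, 0.76),
    "Computer peripherals": (0.61, 0.56, 0.58),
    "Video cameras and lenses": (0.67, 0.63, 0.65),
    "Networking": (0.68, 0.25, 0.37),
    "Computer parts": (0.55, 0.51, 0.53),
}


def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (assumed): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macroaverage(rows: dict) -> tuple:
    """Unweighted mean of each column over all corpora, rounded to two decimals."""
    columns = zip(*rows.values())
    return tuple(round(mean(column), 2) for column in columns)


print(round(f_measure(0.82, 0.76), 2))  # 0.79, the Mobile phones F value
print(macroaverage(ALLPOS))             # (0.72, 0.63, 0.67), the Macroaverage row
```

Because the average is unweighted, a small corpus such as Networking, with a recall of 0.25, pulls the macroaveraged recall down as strongly as a large corpus would.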
Another comparable pair of seed lists are the all seed list and the extracted seeds
combined with generic negative seeds (the same as the ones in all). Negative seeds helped
both of the seed lists to increase performance, but the generic seeds gained more compared
to the extracted ones (see Table 4.20). Although slightly better in recall, the generic seeds
                            all                 ExtractedNeg
Corpus                      P     R     F       P     R     F
Mobile phones               0.85  0.80  0.82    0.89  0.83  0.86
Digital cameras             0.82  0.74  0.77    0.81  0.73  0.77
MP3 players                 0.81  0.75  0.78    0.79  0.73  0.76
Monitors                    0.83  0.80  0.81    0.83  0.80  0.81
Office equipment            0.81  0.75  0.78    0.83  0.76  0.80
Printers                    0.82  0.75  0.78    0.82  0.75  0.78
Computer peripherals        0.82  0.78  0.80    0.84  0.79  0.81
Video cameras and lenses    0.77  0.73  0.75    0.70  0.66  0.68
Networking                  0.75  0.31  0.44    0.83  0.72  0.77
Computer parts              0.67  0.63  0.65    0.67  0.63  0.65
Macroaverage                0.80  0.70  0.74    0.80  0.74  0.77

Table 4.20: Classification results with generic seeds (all) and extracted seeds combined
with generic negative seeds (ExtractedNeg). P is precision, R is recall, F is F-measure.
are similar in terms of precision. Again, as in the previous experiments, on a large
document collection (Mobile phones) the extracted seeds performed much better than the
generic ones. Both classifiers performed well (much higher than the naïve baseline) on all
of the topics, which confirms the importance of negative seeds.
4.2.4 Discussion
The experiments presented above showed that although features (vocabulary) adjusted
to the domain produce better sentiment classification, a vocabulary-based approach has
limited ability to adapt to a domain: it is not possible to foresee all possible
sentiment-bearing lexical units in all possible domains. An alternative approach, based on
using seeds for classification, proved to be effective when used with multiple iterations. All
seeds consisting of both positive and negative lexical units managed to bootstrap a better
vocabulary from the corpus than the extracted ones. The biggest disadvantage of the latter
is the absence of negative lexical units. However, when augmented with generic negative seeds,
the extracted seeds performed quite well in terms of recall, especially on large document
collections. Generally, iterations allow the bootstrapping of a domain-related sentiment
vocabulary which in some cases performs better than the generic sentiment vocabulary
